DeMISTifying Deidentification
of PHI in Free-formatted Text
March 2016
Approved for Public Release; Distribution Unlimited. Case Number 16-0670
2016 The MITRE Corporation. All rights reserved.
Introduction
Tool Rationale
MITRE Identification Scrubber Toolkit (MIST)
Use Case 1 – Deidentification
Hiding in Plain Sight
Use Case 2 – Identification of PHI in e-mail
Privacy Risk Identification and Management
2016 The MITRE Corporation. All rights reserved. For Internal MITRE Use.
We are a (not-for-profit) public interest
company, working with industry and academia
to advance and apply science, technology,
systems engineering, and strategy, enabling
government and the private sector to make
better decisions and implement (publicly
available) solutions to complex challenges of
national and global significance.
…including in the areas of natural language
processing, privacy, and cybersecurity.
Open Source
MITRE Identification Scrubber Toolkit (MIST)
Use Case #1: Research
Doctor's Notes: research involving
– Treasure trove of information
– How to disclose free-formatted text to external researchers
  • Take advantage of linguistics experts
  • Mitigate hurdles/risk of sharing PHI
– How to mine while respecting patient
Solution: De-identify Protected Health Information (PHI) Identifiers
PHI in Free-Formatted Text: De-Identification Challenge
• Start with known PHI object, locate PHI elements, and de-identify
HISTORY OF PRESENT ILLNESS: The patient is a 77-year-old-woman
with long standing hypertension who presented as a walk-in to me at the Sun Hill Medical Center on August 12th. Recently had been started q.o.d. on Clonidine since June 8th to taper off of the drug. Was told to
start Zestril 20 mg. q.d. again. Patient sent to Jones Cardiac Unit for direct admission for cardioversion and anticoagulation, with the Cardiologist, Dr. Pearson to follow.
• Sample PHI Object
• Doctors' notes
• Discharge summaries
Commercial Off-the-Shelf (COTS) de-identification tools
While some are standalone, many are components of larger, expensive
data and network management tools
For unstructured data, identification tends to rely on ‘brute force'
keywords and regular expressions
– Lists of names
– Hand-crafted patterns requiring skilled developers
There is no solution that is 100% foolproof, including manual de-identification.
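The 'brute force' approach described above can be sketched in a few lines of Python. This is only an illustration of keyword lists plus hand-crafted regular expressions, not the behavior of any particular COTS product; the name list, the patterns, and the `[LABEL]` placeholder format are assumptions made for the example.

```python
import re

# Hypothetical keyword list and hand-crafted patterns (illustration only).
KNOWN_SURNAMES = {"Pearson", "Jones"}
MONTHS = (r"(?:January|February|March|April|May|June|July|August|"
          r"September|October|November|December)")
PATTERNS = {
    "DATE": re.compile(rf"\b{MONTHS}\s+\d{{1,2}}(?:st|nd|rd|th)?\b"),
    "DOCTOR": re.compile(r"\bDr\.\s+[A-Z][a-z]+"),
}

SAMPLE = ("The patient presented as a walk-in to me at the Sun Hill Medical "
          "Center on August 12th. Recently had been started q.o.d. on "
          "Clonidine since June 8th. Patient sent to Jones Cardiac Unit, "
          "with the Cardiologist, Dr. Pearson to follow.")


def find_phi(text):
    """Return sorted (start, end, label) spans found by patterns and the name list."""
    spans = [(m.start(), m.end(), label)
             for label, pattern in PATTERNS.items()
             for m in pattern.finditer(text)]
    for name in KNOWN_SURNAMES:
        for m in re.finditer(rf"\b{re.escape(name)}\b", text):
            spans.append((m.start(), m.end(), "NAME"))
    return sorted(spans)


def scrub(text):
    """Replace each detected span with a [LABEL] placeholder, skipping overlaps."""
    out, pos = [], 0
    for start, end, label in find_phi(text):
        if start < pos:  # already covered by an earlier, longer match
            continue
        out.append(text[pos:start])
        out.append(f"[{label}]")
        pos = end
    out.append(text[pos:])
    return "".join(out)


print(scrub(SAMPLE))
```

In this sketch "Sun Hill Medical Center" passes through untouched because no pattern or list entry covers it, which is the kind of miss that makes hand-maintained lists and patterns expensive to keep complete.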
MIST: Training a De-identification System
[Diagram: iterative training loop. Recoverable labels: "redact or mark PHI by hand in initial ..."; "train model from initial ..."; "... documents using model"; "train (better) model from ..."]
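The loop in this diagram (mark a few documents by hand, train, let the model pre-tag the next documents, correct, retrain) can be sketched as follows. This is a toy under stated assumptions, not MIST's implementation: the "model" here is just a memorized set of tokens, where a real system would use a statistical tagger, and `reviewer`, `GOLD_PHI`, and the sample batches are hypothetical.

```python
from collections import Counter


def train(annotated_docs):
    """Toy 'model': tokens marked as PHI more often than not (assumption)."""
    phi, other = Counter(), Counter()
    for tokens, labels in annotated_docs:
        for token, is_phi in zip(tokens, labels):
            (phi if is_phi else other)[token] += 1
    return {token for token, n in phi.items() if n > other[token]}


def tag(model, tokens):
    """Propose a PHI / not-PHI label for every token."""
    return [token in model for token in tokens]


def train_iteratively(batches, correct_by_hand):
    """Pre-tag each batch with the current model, correct it, then retrain."""
    annotated, model = [], set()
    for tokens in batches:
        proposed = tag(model, tokens)                # machine pre-tags
        labels = correct_by_hand(tokens, proposed)   # human fixes the tags
        annotated.append((tokens, labels))
        model = train(annotated)                     # train a (better) model
    return model


# Simulated reviewer: a real one would only correct the proposed tags.
GOLD_PHI = {"Pearson", "Jones", "August", "12th"}


def reviewer(tokens, proposed):
    return [token in GOLD_PHI for token in tokens]


batches = [
    "Seen by Dr. Pearson on August 12th".split(),
    "Sent to Jones Cardiac Unit".split(),
]
model = train_iteratively(batches, reviewer)
print(tag(model, "Dr. Pearson to follow".split()))  # [False, True, False, False]
```

The point of the design is that the human effort per batch should shrink as the model's proposals improve, since only the errors need correcting.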
MIST: Training a De-identification System
HISTORY OF PRESENT ILLNESS: The patient is a 77-year-old-woman
with long standing hypertension who presented as a walk-in to me at the
YY] Recently had been
started q.o.d. on Clonidine since
Z] to taper off of the drug.
Was told to start Zestril 20 mg. q.d. again. The patient was sent to the
GG] for direct admission for cardioversion and
anticoagulation, with the Cardiologist, Dr.
] to follow.
Transform the
Study
– 8 hours of marking (training) data
– Narrative from patient records
– 95% accurate, measured by precision (false positives) and recall (false negatives)
– (Favorably) comparable to manual reviews
The MIST results
– Top score in the first i2b2 De-identification Challenge Evaluation
– Used to de-identify medical records by hospitals
– Has led to numerous collaborations on MITRE projects
– Rapidly portable – adaptable to other domains
– The MIST TALLAL approach works well for a large corpus
– Precision/recall tradeoff can be adjusted
How Good is Manual De-id by Humans?
Counts of overlooked PHI
– In 100 Family Practice notes
– Containing 1,093 PHI
Pairs of reviewers
Trios of reviewers
MIST (Single Model)
From: Is the Juice Worth the Squeeze? Costs and Benefits of Multiple Human Annotators for Clinical Text De-identification. D. S. Carrell; D. J. Cronkite; B. A. Malin; J. S. Aberdeen; L. Hirschman, submitted to Methods of Information in Medicine
DeMISTifying Deidentification On Steroids…
Hiding in Plain Sight (HIPS)* Research
HISTORY OF PRESENT ILLNESS: The patient is a 77-year-old-woman
with long standing hypertension who presented as a walk-in to me at the Sun Hill Medical Center on August 12th. Recently had been started q.o.d. on Clonidine since June 8th to taper off of the drug. Was told to
start Zestril 20 mg. q.d. again. Patient sent to Jones Cardiac Unit for direct admission for cardioversion and anticoagulation, with the Cardiologist, Dr. Pearson to follow.
Hypothesis: With 'good' resynthesis, it can be nearly impossible to detect 'leaked PHI' manually OR using data mining hackers.
Initial research results: Good resynthesis reduced the detection of PHI leaks by an additional 85% in the worst case.
… with larger sample sizes is necessary to validate.
*HIPS is a collaborative research effort among GroupHealth, Vanderbilt, and MITRE
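A minimal sketch of resynthesis, assuming the text has already been tagged with typed placeholders: each tag is replaced by a believable surrogate of the same type, so any real PHI the tagger missed sits among fake values that look the same. The tag names, the surrogate pools, and the `resynthesize` helper are hypothetical and are not taken from the HIPS research itself.

```python
import random
import re

# Hypothetical surrogate pools, one per tag type (illustration only).
SURROGATES = {
    "HOSPITAL": ["Lakeview General Hospital", "Riverside Clinic"],
    "DATE": ["March 3rd", "October 21st"],
    "DOCTOR": ["Dr. Alvarez", "Dr. Nguyen"],
}
TAG = re.compile(r"\[(HOSPITAL|DATE|DOCTOR)\]")


def resynthesize(tagged_text, seed=None):
    """Replace every [TYPE] tag with a randomly chosen surrogate of that type."""
    rng = random.Random(seed)
    return TAG.sub(lambda m: rng.choice(SURROGATES[m.group(1)]), tagged_text)


tagged = ("presented as a walk-in to me at the [HOSPITAL] on [DATE], "
          "with the Cardiologist, [DOCTOR] to follow.")
print(resynthesize(tagged, seed=0))
```

Passing a `seed` makes the output repeatable for testing; a production tool would more likely want consistent surrogates per patient or per document, which this sketch does not attempt.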
Use Case #2 - Identifying PHI in Free-Formatted Text
"Dr. Famous and his Laptop"(1) - Loss of control over
unencrypted laptop which contains e-mail that may have
protected health information (PHI)
– E-mail back-up is available
– Need to establish extent of PHI content in e-mail for Health Insurance Portability and Accountability Act (HIPAA)
Solution: Intensive manual review ($$$$$$)
Research Seedling: Is there a possibility MIST can
facilitate PHI discovery in e-mail?
1. Tagging and Modeling: Can MIST successfully ‘tag'
PHI identifiers in e-mail and be trained to model e-mail tagging?
2. If Step 1 is successful, identify prospective next
steps for re-purposing MIST as an identification tool for locating PHI in email
(1) Halamka, Dr. John D., Surviving the Cybersecurity Cold War: A CIO's Practical Guide for Risk Management, slide 1
Problem Scope – Compromised Assets
2012 Ponemon Institute
Study(2) on Provider Breaches:
– …% of healthcare organizations had at least one breach in the prior 2 years
– …% had more than 5 breaches in the prior 2 years
– …% of breaches caused by lost or stolen computing device
– …% of organizations permit employees and medical staff to use their own personal mobile devices to connect to their networks or enterprise systems
covered entity providers
Enterprise data-at-rest encryption solutions offer partial risk mitigation
(1) Department of Health and Human Services, 45 CFR Parts 160 and 164, Federal Register, Vol. 78, No. 17, Part II, January 25, 2013
(2) Ponemon Institute Research Report, Third Annual Benchmark Study on Patient Privacy and Data Security, Dec 2012
Identifying other types of sensitive data in compromised
– Sensitive personally identifiable information (PII) such as financial
information (as per Gramm-Leach-Bliley)
– PII defined in state consumer protection and/or breach notification
– Proprietary data, Sensitive but Unclassified (SBU), etc.
Loss of control over a 'device' with potentially multiple sources of sensitive unstructured data
– Risk assessment requires reconstructing the contents of the device
  • Backups
  • Information on application servers
– E-mail, SharePoint, Database, etc.
The Challenge of PHI Discovery
Protected Health Information is:
– Created or received by a covered entity AND
– Identifies an individual (or is identifiable) (i.e., contains PHI identifiers)
– Relates to the individual's past, present, or future physical or mental
health, the provision of health care to the individual, or payment for that health care
HIPAA de-identification is governed by a safe-harbor standard
– 17 fields plus a catch-all
PHI identifiers in e-mail with MIST
E-mail is different from the narrative, clinical note domain
– Structure, context, mix of text and control characters
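One way to see the structural difference is to separate an e-mail's headers from its body before any PHI tagging, since a sender address and a sentence in a clinical narrative call for different handling. The sketch below uses Python's standard `email` package; the sample message and the `split_email` helper are hypothetical, and nothing here describes how MIST itself processes e-mail.

```python
from email import message_from_string

# Hypothetical message, for illustration only.
RAW_EMAIL = "\n".join([
    "From: clinician@example.org",
    "To: colleague@example.org",
    "Subject: Follow-up",
    "",
    "Patient sent to Jones Cardiac Unit, Dr. Pearson to follow.",
])


def split_email(raw):
    """Return (headers, body) so structured fields and free text can be handled separately."""
    message = message_from_string(raw)
    headers = dict(message.items())
    if message.is_multipart():
        # Keep only the plain-text parts; attachments would need their own handling.
        body = "\n".join(
            part.get_payload()
            for part in message.walk()
            if part.get_content_type() == "text/plain"
        )
    else:
        body = message.get_payload()
    return headers, body


headers, body = split_email(RAW_EMAIL)
print(headers["Subject"])
print(body)
```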
MIST E-mail PHI Identification Results
Based on one run
– Precision: Fraction of identified terms that were relevant = 0.733
– Recall: Fraction of relevant terms that were identified = 0.738
– F-measure: 2 × ((precision × recall) / (precision + recall)) = 0.736
Additional modeling
– Precision: 0.899
– Recall: 0.870
– F-measure: 2 × ((precision × recall) / (precision + recall)) = 0.885
Given (many) constraints, MIST performance was (very)
encouraging for e-mail processing
– Opportunity for significant improvement with an enhanced tag set and
email-aware blocking software
Additional hours/dollars needed for product-ready email solution
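The F-measure formula on this slide can be checked directly. The short sketch below recomputes it from the rounded precision and recall values shown; it yields roughly 0.735 and 0.884, which agree with the reported 0.736 and 0.885 to within rounding of the inputs. The function name is an assumption made for the example.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall: 2 * (P * R) / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)


first_run = f_measure(0.733, 0.738)
after_additional_modeling = f_measure(0.899, 0.870)
print(f"{first_run:.3f} {after_additional_modeling:.3f}")  # 0.735 0.884
```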
Privacy Risk Management Automated Tool
Automated tools to support data-associated risk management
– PHI de-identification of doctor's notes
– Hiding in Plain Sight: resynthesis of believable fake data
– Tailorable (TALLAL) tool at the ready for assessing leakage of
{PHI, PII, proprietary information, sensitive information,…} in free-formatted text (e.g., email)
How about a tailorable automated tool that supports privacy risk management?
What is MITRE's Privacy Risk Identification and
Management Engine ((P)RIME*)?
RIME is a foundation for organizations to normalize and manage risk
– ‘Organization' defines RIME-hosted instance
– Web front-end with database backend
– RIME provides engines for:
  Risk Managers
  • Automated compliance document generation
  • Risk management (raw vs.
  PII Owners (Business Application, Program, System)
  • Dynamic questionnaires
  • Cursor sensitive, as-needed
  • Immediate risk feedback
* (P)RIME - Initially designed with Privacy as the risk focus
Risk Identification and Management Engine
User-facing artifacts
Context-dependent Dynamic Questionnaire
Organizationally-defined
Cursor-sensitive help
Risk Analysis
Dashboard Generation
Risk identification, priority
Risk Management
Privacy Impact Assessments (PIAs)
Privacy Threshold Assessments (PTAs)
Document Creation
(Compliance) document
"Data-Rich" Completed
[P]RIME ‘Big' Ideas
– Move away from Word, etc., documents as risk 'tool'
– Push privacy SME-ness to PII owners
– Covert, timely awareness by making privacy risk more explicit
– Separate data gathering from risk analysis and [compliance or other] document generation
– Empower risk managers with risk metrics and tools
– Eliminate redundancies (and wasted time), reduce
– Inject discipline and consistency
P[RIME] Demo
Functionality is real
Production Use
Ready for transfer to sponsor and to industry
– For optimal results, MITRE should assist with initial instantiation
Instantiation Examples
– Traditional PIA, breach, inventory support
– ‘Down-select' from a complete set of possible risk issues
– Automated requirements/testing support
MIST, HIPS, PRIME, and other Privacy or
Cybersecurity Tools
For more information, please contact Cathy Petrozzino at
Source: http://www2.mitre.org/work/health/himss/pdf/HIMSS16-FH27-Petrozzino.pdf