An Easy-to-Use Clinical Text De-Identification Tool For Clinical Scientists: NLM Scrubber
All content following this page was uploaded by Mehmet Kayaalp on 19 September 2017.
Abstract
The Health Insurance Portability and Accountability Act (HIPAA) requires that clinical documents be stripped of
personally identifying information prior to their secondary use for clinical research. We have been studying clinical
text de-identification for more than a decade and developing NLM Scrubber, a tool for every clinical scientist
who conducts retrospective research using clinical reports. Although we continuously improve it and add new
functionality, it remains very simple to install and use.
1. Introduction
The Privacy Rule of the Health Insurance Portability and Accountability Act (HIPAA) requires that clinical documents
be stripped of personally identifying information before they can be released to researchers and others; however,
manual clinical text de-identification is an arduous task. Furthermore, human annotators alone are usually not as
accurate as automatic clinical text de-identification systems. Even though no automatic de-identifier is perfect, such
systems can quickly produce de-identified text, which the data providers can then easily review and verify for
de-identification accuracy. If and when the de-identified text needs revisions, the necessary editing is usually minimal.
2. NLM Scrubber
The major downside of commercial de-identifiers is obviously their cost, which many clinical scientists may be unable
to afford. Most other de-identifiers found in the literature have been developed for research purposes only and are not
available. Besides being free of charge, the major advantage of the few freely available de-identifiers is that they
can be easily tested, evaluated and verified by independent third parties. The freely available de-identifiers can be
further divided into two categories depending on whether they require training data. De-identifiers that require training
data impose a significant burden on their users, demanding a large set of clinical documents annotated in compliance
with their prerequisite format.
NLM Scrubber is a freely available automatic clinical text de-identification tool with full support by its developers.
Furthermore, it does not impose any annotation requirement on clinical scientists; i.e., no text needs to be annotated to
run the application and produce de-identified clinical reports. Although we continuously add sophisticated functionalities
to NLM Scrubber, we strive to keep the user interface as simple as possible so that novice users can operate it easily.
NLM Scrubber is a product of several years of studies on clinical text de-identification [1, 2]. We recently rewrote NLM
Scrubber, converting it from a pure research product to a consumer product. The user needs to fill out a short form
stating mainly where the text files are located on the computer. At this point in time, it can accept only ASCII text
reports formatted with proper capitalization; i.e., it would not perform well on all-lowercase or all-uppercase text.
The system is available on three platforms: Windows, Linux, and Mac OS X. It can be downloaded from
http://scrubber.nlm.nih.gov.
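The input constraints stated above (ASCII text with proper capitalization) can be checked before a batch of reports is submitted. The following Python sketch is not part of NLM Scrubber and makes no assumptions about its interface; the function names, the `*.txt` file pattern, and the 20-letter threshold are hypothetical choices made only for illustration.

```python
from pathlib import Path


def check_report(text, min_letters=20):
    """Return a list of problems that would make a report a poor input.

    min_letters is a hypothetical threshold: very short texts are not
    judged on capitalization because the evidence is too thin.
    """
    problems = []
    if not text.isascii():
        problems.append("contains non-ASCII characters")
    letters = [c for c in text if c.isalpha()]
    if len(letters) >= min_letters:
        if all(c.isupper() for c in letters):
            problems.append("all uppercase")
        elif all(c.islower() for c in letters):
            problems.append("all lowercase")
    return problems


def check_directory(directory, pattern="*.txt"):
    """Map each flagged file name to its problems; clean files are omitted."""
    flagged = {}
    for path in sorted(Path(directory).glob(pattern)):
        # Decode as Latin-1 so that any byte sequence loads without error;
        # bytes above 0x7F then appear as non-ASCII characters.
        text = path.read_bytes().decode("latin-1")
        problems = check_report(text)
        if problems:
            flagged[path.name] = problems
    return flagged
```

Calling `check_directory("reports")` on a folder of text files would return a dictionary of the files that need attention, for example reports exported in all uppercase by a legacy system.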
Funding and Competing Interests
This work was supported by the Intramural Research Program of the National Institutes of Health, National Library
of Medicine. The first author receives royalties from University of Pittsburgh for his contribution to a de-identification
project. NLM's Ethics Office reviewed and approved his appointment.
References
1. Kayaalp M, Browne AC, Callaghan FM, Dodd ZA, Divita G, Ozturk S, et al. The Pattern of Name Tokens in
Narrative Clinical Text and a Comparison of Five Systems for Redacting them. J Am Med Inform Assn 2013.
2. Kayaalp M, Browne AC, Dodd ZA, Sagan P, McDonald CJ. De-identification of Address, Date, and Alphanumeric
Identifiers in Narrative Clinical Reports. Proceedings of the Annual American Medical Informatics Association
Fall Symposium 2014.