
MINING AND CLASSIFYING MEDICAL DOCUMENTS

The volume of data in these documents is extremely large, so an important task is to retrieve the relevant information in condensed form. This is done by demonstrating methods for text mining and document classification used in natural language processing and by presenting the development and deployment of an application for automated text classifier generation in Python using Scikit-Learn and Streamlit. The process turns text, which consists of strings, into mathematical models by converting words to numeric vectors. A common task is mapping sequences from one domain to sequences in another domain, handled by so-called sequence-to-sequence models. Alternatively, it is possible to ignore the sequential aspect of a language and model a text as a collection of words. This is the most frequent representation in text mining, which is the process of deriving high-quality information from text.

For this representation a dictionary of word counts is constructed, but raw counts alone are not discriminative. To address this problem, a weighting scheme called term frequency-inverse document frequency (tf-idf) was developed. tf-idf is a numerical statistic intended to reflect how important a word is to a document in a collection of documents. The tf-idf value increases proportionally to the number of times a word appears in the document and is modified by the number of documents in the corpus that contain the word, which adjusts for the fact that some words appear more frequently in general. tf-idf is one of the most popular term-weighting schemes today; 83% of text-based recommender systems in digital libraries use tf-idf. The term frequency tf(i, k) is usually the count of word i divided by the total number of words in document k, hence adjusting for document length. The inverse document frequency idf(i) is obtained from the inverse of the fraction of documents that contain word i, typically as idf(i) = log(N / n_i), where n_i is the number of documents containing the word. The tf-idf value is defined for each word i in each document k in a collection of N documents and is calculated by multiplying tf and idf: tf-idf(i, k) = tf(i, k) · idf(i). This way, a word i that occurs in almost every document has a low idf and therefore a low tf-idf value.
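
As a minimal sketch of such a pipeline (the toy documents, labels, and the choice of logistic regression are assumptions for illustration, not the application described above), the following Scikit-Learn snippet builds a tf-idf representation and trains a simple text classifier on it:

# Minimal sketch: tf-idf vectorization plus a simple text classifier in Scikit-Learn.
# The toy corpus, labels, and logistic-regression choice are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy "medical documents" and their (assumed) classes.
docs = [
    "patient reports chest pain and shortness of breath",
    "mri scan of the knee shows a torn meniscus",
    "ecg indicates possible myocardial infarction",
    "x-ray of the ankle reveals no fracture",
]
labels = ["cardiology", "radiology", "cardiology", "radiology"]

# TfidfVectorizer builds the dictionary, weights each word count by its
# inverse document frequency, and normalizes each document vector.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(docs, labels)

print(classifier.predict(["echocardiogram shows reduced heart function"]))

In an application like the one described above, a Streamlit front end could simply wrap the fit and predict calls behind file-upload and text-input widgets.
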
Electronic Medical Records and Machine Learning in Approaches to Drug Development

Electronic medical records (EMRs) are digital tools used in hospitals to record and track patient health care. EMRs could help discover phenotype-genotype associations, enhance clinical trial protocols, automate adverse drug event detection and prevention, and accelerate precision medicine research. With a manual approach to identifying and extracting high-value data, drug research on EMRs is not scalable, and it is extremely costly to employ domain experts for data extraction. The push for medical document digitization, in conjunction with recent developments in ML methods such as natural language processing (NLP), which allows machines to mimic human comprehension of written text, has allowed these research tasks to be outsourced to machines and has further facilitated drug research. However, digitization of the traditional paperwork was done on an ad-hoc basis, and many healthcare institutions regulate EMRs independently, creating a highly heterogeneous data set.
This heterogeneity makes data pre-processing for ML methods time consuming and financially costly if domain experts are required for the task. EMRs are influencing three key areas of biomedical research and drug discovery: phenotype-genotype associations, clinical trials, and pharmacovigilance.

A phenotype-genotype association is the correspondence between a person's genetic makeup (their genotype) and the observable characteristics or pathologies that are a product of their genetics interacting with the environment (their phenotype). In biomedical research, the inclusion of genomic data in EHRs shows potential for secondary use as raw data from which to draw medically meaningful results. When an EMR has adequate phenomic and genomic data on an individual, algorithms can translate the raw data in the EMR into phenotype data, which in turn can be associated with the genomic data. EMRs often contain a mixture of standardized codes and free text. To improve upon methods that only consider codes, machine learning tools, largely based on NLP, have been developed to collect more phenotypic data from sources beyond standardized codes, such as textual clinical notes, textual discharge summaries, and radiology reports. For example, NLP methods and a Convolutional Neural Network (CNN) have been used to create word embeddings from clinical notes and automate the clinical phenotyping of prostate cancer patients. One of the major problems is that EMRs generally suffer from difficulty in identifying and correcting missing or mistaken data. In many cases, ML methods require large datasets, and when EHRs are amalgamated from multiple sources, a high number of varying kinds of errors are carried over to the data set and therefore propagate through to the algorithms. Due to the high throughput of data in ML methods, there is a need for an automatic correction filter or a complete workaround for the missing data.

For clinical trials, NLP is able to reduce the amount of manual patient identification required. Once the number of patients eligible for a clinical trial is estimated, the next step is to carry out patient screening on each individual. There are three methods that can carry out these checks; Meystre et al. harnessed NLP to directly compare clinical trial screening accuracy between machine learning, rule-based, and cosine-similarity based methods and reported the highest accuracy.

Adverse drug event (ADE) detection is a vital step towards effective pharmacovigilance and the prevention of future incidents caused by potentially harmful ADEs. The electronic health records (EHRs) of hospital patients contain valuable information regarding ADEs and are therefore an important source for detecting ADE signals. However, EHR texts tend to be noisy, and applying off-the-shelf tools for EHR text preprocessing jeopardizes the subsequent ADE detection performance, which depends on well-tokenized text input. In one approach, a BiLSTM conditional random field (CRF) network was used for entity recognition and a BiLSTM-Attention network for entity relation extraction.
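
The following is an illustrative sketch only: a tiny BiLSTM token tagger in PyTorch for drug/ADE entity recognition. The toy sentences, BIO tag set, and hyperparameters are assumptions, and the CRF layer and the BiLSTM-Attention relation extraction mentioned above are omitted for brevity.

# Illustrative sketch of BiLSTM-based entity recognition for ADE detection.
# Toy data and labels are assumptions; the CRF layer is omitted for brevity.
import torch
import torch.nn as nn

# Tiny toy corpus: tokenized sentences with BIO tags for drugs and ADEs.
sentences = [
    (["patient", "developed", "rash", "after", "ibuprofen"],
     ["O", "O", "B-ADE", "O", "B-DRUG"]),
    (["nausea", "reported", "following", "metformin", "dose"],
     ["B-ADE", "O", "O", "B-DRUG", "O"]),
]
words = sorted({w for tokens, _ in sentences for w in tokens})
word2idx = {w: i + 1 for i, w in enumerate(words)}  # index 0 reserved for padding
tags = ["O", "B-ADE", "B-DRUG"]
tag2idx = {t: i for i, t in enumerate(tags)}

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=32, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_tags)

    def forward(self, x):
        h, _ = self.lstm(self.emb(x))
        return self.out(h)  # per-token tag scores: (batch, seq_len, num_tags)

model = BiLSTMTagger(len(word2idx) + 1, len(tags))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(30):  # tiny training loop, just to show the mechanics
    for tokens, labels in sentences:
        x = torch.tensor([[word2idx[w] for w in tokens]])
        y = torch.tensor([[tag2idx[t] for t in labels]])
        loss = loss_fn(model(x).view(-1, len(tags)), y.view(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Predict tags for the first toy sentence.
with torch.no_grad():
    x = torch.tensor([[word2idx[w] for w in sentences[0][0]]])
    print([tags[i] for i in model(x).argmax(-1)[0].tolist()])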

E-discovery

E-discovery refers to discovery in legal proceedings such as litigation and government investigations. Electronic information is usually accompanied by metadata that is not found in paper documents and that can play an important part as evidence. The preservation of metadata from electronic documents creates special challenges to prevent spoliation.

There are several stages in e-discovery:

Identification: Responsive documents are identified for further analysis and review.

Preservation: A duty to preserve begins upon the reasonable anticipation of litigation. During preservation, data identified as potentially relevant is placed in a legal hold. This ensures that the data cannot be destroyed. Care is taken to ensure the process is defensible, while the end goal is to reduce the possibility of data spoliation or destruction.

Collection: Data is transferred from the company to its legal counsel, who will determine the relevance and disposition of the data.

Processing: Files are prepared to be loaded into a document review platform. Often, this phase also involves the extraction of text and metadata from the native files; a minimal sketch of this step appears after this list.

Review: Documents are reviewed for responsiveness to discovery requests and for privilege. This includes the rapid identification of potentially relevant documents and the culling of documents according to various criteria.

Production: Documents are turned over to opposing counsel, based on agreed-upon specifications. Often this production is accompanied by a load file, which is used to load documents into a document review platform.
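
As a minimal sketch of the processing step described above (the directory name and the plain-text-only handling are assumptions; real platforms use format-specific extractors for PDFs, office documents, and email), the snippet below pulls text and basic metadata from files and emits a simple JSON load record:

# Minimal sketch of the processing step: extract text and basic metadata from
# native files so they can be loaded into a review platform. The directory name
# and the plain-text-only handling are illustrative assumptions.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def process_file(path: Path) -> dict:
    raw = path.read_bytes()
    stat = path.stat()
    return {
        "name": path.name,
        "size_bytes": stat.st_size,
        # Filesystem timestamps are part of the metadata that must be preserved.
        "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
        # A hash lets reviewers deduplicate files and verify they have not changed.
        "sha256": hashlib.sha256(raw).hexdigest(),
        # Real platforms use format-specific extractors (PDF, DOCX, email);
        # this sketch only handles plain text.
        "text": raw.decode("utf-8", errors="replace"),
    }

if __name__ == "__main__":
    records = [process_file(p) for p in Path("collected_documents").glob("*.txt")]
    # A simple JSON "load file" accompanying the processed documents.
    print(json.dumps(records, indent=2))
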
Clinical NLP

Clinical NLP is a specialization of NLP that allows computers to understand the meaning that lies behind a doctor's written analysis of a patient.

There are several requirements that you should expect any clinical NLP system to have:

Entity extraction: to surface relevant clinical concepts from unstructured data.

Contextualization: to decipher the doctor’s meaning when they mention a concept. For
example, when doctors deny a patient has a condition or talk about a patient’s history.

Knowledge graph: to understand how clinical concepts are interrelated, like the fact that both
fentanyl and hydrocodone are opiates.

Clinical NLP engines need to be able to understand the acronyms and jargon that are specific to medicine. They also need to be supplemented with a knowledge graph, because doctors rely on the knowledge of other doctors who are reading what they write to fill in information that they don't explicitly record. Different words and phrases can have exactly the same meaning in medicine; for example, dyspnea, SOB, breathless, and shortness of breath all mean the same thing. The context of what a doctor is writing about is also very important for a clinical NLP system to understand: up to 50% of the mentions of conditions and symptoms in doctors' writing are actually instances where they are ruling out that condition or symptom for a patient. A knowledge graph encodes entities, also called concepts, and their relationships to one another. All of these relationships create a web of data that can be used in computing applications to help them reason about medicine similarly to how a human might. Lexigram's Knowledge Graph, for example, powers its clinical NLP software and is also available directly via its APIs.
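
As an illustrative sketch only (the synonym lists, negation cues, and graph triples below are assumptions, not Lexigram's actual data or API), the snippet shows how synonym normalization, a naive negation check, and a tiny knowledge graph could support the behaviors just described:

# Toy sketch: synonym normalization, naive negation detection, and a tiny
# knowledge graph for clinical text. All terms and relations are illustrative.

# Map surface forms to a canonical clinical concept.
SYNONYMS = {
    "dyspnea": "shortness_of_breath",
    "sob": "shortness_of_breath",
    "breathless": "shortness_of_breath",
    "shortness of breath": "shortness_of_breath",
}

# Minimal knowledge graph as (subject, relation, object) triples.
KNOWLEDGE_GRAPH = [
    ("fentanyl", "is_a", "opiate"),
    ("hydrocodone", "is_a", "opiate"),
    ("shortness_of_breath", "symptom_of", "heart_failure"),
]

NEGATION_CUES = ("denies", "no evidence of", "ruled out")

def normalize(term):
    # Entity extraction step: map a surface form to its canonical concept.
    return SYNONYMS.get(term.lower(), term.lower())

def is_negated(sentence, term):
    # Contextualization step (very naive): a negation cue in the three words
    # immediately before the term is treated as ruling the concept out.
    words = sentence.lower().replace(".", "").split()
    term_words = term.lower().split()
    for i in range(len(words) - len(term_words) + 1):
        if words[i:i + len(term_words)] == term_words:
            window = " ".join(words[max(0, i - 3):i])
            return any(cue in window for cue in NEGATION_CUES)
    return False

def related(concept):
    # Knowledge graph step: look up what is known about a concept.
    return [(rel, obj) for subj, rel, obj in KNOWLEDGE_GRAPH if subj == concept]

note = "Patient denies dyspnea but reports taking hydrocodone."
for term in ("dyspnea", "hydrocodone"):
    concept = normalize(term)
    print(term, "->", concept,
          "| negated:", is_negated(note, term),
          "| graph:", related(concept))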
