PBL II (FinalPaper) - Group 73


Abstract

Automatic Text Summarization (ATS) allows us to extract useful information from a large
amount of data with the help of machine learning algorithms and deep neural networks, but
when we try to use it in the biomedical domain, things get complicated: these models fail to
provide satisfactory results and to preserve the contextual meaning of the data. So, in this
work, we have tried to come up with an alternative solution to this problem. Our problem
statement was: given a patient's medical history in the form of various clinical documents,
such as medical reports and drug descriptions, generate a medical summary of that patient.
We have explored various ways in which the spaCy library can help us achieve the required
results. spaCy is a popular NLP library in Python; it provides various pipelines, such as the
entity ruler pipeline and the part-of-speech pipeline, to name a few.

1. Introduction

a. What is a text summarizer?

A text summarizer is an application that, when provided with an enormous amount of data,
returns the most important pieces of information from it in a concise manner, focusing only
on the portions that carry relevant information while discarding the rest. This saves readers
a great deal of time.

This difficult process of extracting important information accurately and in a reasonable
amount of time is known as text summarization. Regardless of the field in which it is used,
it helps people save a lot of time when processing large chunks of data, be it in business,
medicine, or education.

As humans, it is much easier for us to scan a chunk of text, analyse it, extract the important
parts, or paraphrase it into a short summary. For machines this is a hard task: machines are
not very good with text, as they are designed to work with numbers. As a result, programming
a model that accurately summarizes text can be daunting. First, we need to pre-process the
text into a format that can be fed to our model; then we must train it on many corpora of a
particular domain. It must be noted that the accuracy of a particular model relies heavily on
the quantity and quality of the dataset used.

b. Why do we need a text summarizer?

In today's world the value of data has surpassed everything we know. Today it is as valuable
as oil was in the previous century, and we produce terabytes of data every day.

With so much data overload, it has become very hard for professionals to keep track of
relevant and important information. So, in order to speed up processing and research, we
require a machine learning model that can help us pick out the important bits from the text.

c. What is a good summary?

The important criteria on which we can evaluate a summary are:

- It covers all the important topics from the text.
- The information is presented in the most readable format.

The output summary must not contain irrelevant information, only the useful content, and
the amount of text should be minimal. The output must also be in a readable format.

2. Related Works

a. Previous Works

Research published by Stanford University suggests that the best way to summarize long
medical text is to first use an extractive text summarizer to pull the important sentences
out of the text, and then feed those extracted sentences to an abstractive text summarizer,
which returns the text in fluent English.

This methodology was suggested because of one particular limitation of the transformers
used for abstractive summarization: they are not able to handle a large amount of text at
once while preserving its contextual meaning.

This is one of the best-suited methodologies for summarizing the medical text found in
magazines and other publications, but for our problem, which is to generate a patient's
medical summary, this method still has some aspects that may not work properly.

b. Previous Algorithms and Models

Many algorithms and models have been developed to perform NLP operations on text, such
as the Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), and other
sequence-to-sequence models.

The RNN was a revolutionary neural network in the field of NLP, as its output did not
depend only on the current input but also on the previous states of the network. It did,
however, have an issue: poor memory. By the time it reached the end of a sentence, it
sometimes lost the context of the whole sentence and gave undesirable results. This issue
was later resolved by a new model built on top of the RNN, the LSTM neural network, which
solved the memory problem faced by the RNN.

All these models are great for particular tasks, but the one that stands out most for
training is the T5 transformer model. It is an encoder-decoder model that can be trained
to perform any NLP task on text from any domain. We can train the transformer on a large
corpus of any domain, and it will train itself without needing changes to its internal
structure.
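The extractive stage described above can be illustrated with a minimal frequency-based sentence scorer. This is a generic sketch of extractive summarization, not the Stanford method itself; the stop-word list and scoring rule are illustrative assumptions.

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=2):
    """Keep the n highest-scoring sentences, in their original order.

    A sentence's score is the summed corpus frequency of its words,
    so sentences that repeat the document's dominant terms rank first.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    # A tiny illustrative stop-word list; a real system would use a fuller one.
    stop = {"the", "a", "an", "is", "of", "to", "and", "in", "it", "was", "on"}
    freq = Counter(w for w in words if w not in stop)

    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return [s for s in sentences if s in top]
```

The sentences this stage keeps would then be handed to the abstractive model to be rewritten as fluent prose.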
3. Architecture

a. Diagram

b. Description

A good medical summary usually consists of two major components:

1) a log of all the medication that the patient has been on, and
2) a record of all the medical conditions.

These may include things like surgeries, allergies, and medical history, among other things.

So our idea is to first scan through all the medical text available for a particular patient
and identify all the important medical keywords in it, such as medical conditions, names of
the various drugs, medical diagnoses, etc. For this we can use Named Entity Recognition, or
NER. What NER does is identify the various entities in a text. Say we have the sentence
"Rahul is a good student". In this sentence Rahul is a person, so a good NER must be able
to recognize Rahul as a person.

In a similar fashion we can train an NER model to accurately identify the various medical
terms in the text. Once we have identified all the medical conditions, diagnoses, and drugs,
we can present them in list format to a doctor in need of a quick summary of a patient's
history.

c. Steps

Step 1: Source Text as Input

The first step is to gather the source text that is to be summarized. In our case, we take
a document that gives us a patient's medical history. The document is a text file which is
read in the code.

Step 2: Pre-process the data

The process of converting raw data into the desired format is known as pre-processing.
Parts of the data that do not hold the value we desire are removed. This enhances our
overall performance.

Step 3: Tokenize the text

Tokenizing the text literally means breaking it down into chunks, which are referred to as
tokens. We can understand tokens as the words that make up a sentence. Tokenization is a
crucial part of any NLP pipeline and a required step before proceeding further. It converts
unstructured text into a numerical structure that can be used in machine learning. We have
achieved tokenization using spaCy, which parses and understands large volumes of text.

Step 4: Pass the text through the NER pipeline

After the input is tokenized, we pass it through the NER pipeline, which has been trained
on various medical documents. Combining this with the entity ruler as a pipe, we can find
and label the data as medical tests, conditions, or medicines.

Step 5: Extraction

Extract the medical tests, conditions, and medicines into different variables.

Step 6: Assembly

We list all the drugs prescribed to the patient under different sections.

Step 7: Summarize the text

Display all the medical conditions that the patient has, and summarize the medical tests
and medicines in tabular format with the help of POS tagging and NER.

4. Methodology and Evaluation

a. Building our model

We have used spaCy to create our NER model. spaCy is a very popular NLP library in Python;
it is easy to use and provides many functionalities and pipelines, such as text
lemmatization, part-of-speech (POS) tagging, and much more. The one we focus on is the
entity ruler. We have trained our model on several medical documents from:
https://www.hcup-us.ahrq.gov/reports/statbriefs/sbtopic.jsp

We first take a plain NLP model and add an entity ruler to it. We then take a large amount
of text, annotated (preferably manually), and train our model on it. This kind of NER is
also called machine-learning-based NER: the model is presented with enough data for it to
understand the context in which a word can be used.

Once the model has been trained, it can be used directly on text to identify various
entities. Our model currently gives output as follows:

multiple myeloma MEDICAL CONDITION
dementia MEDICAL CONDITION
Delivery MEDICAL CONDITION
Congestive heart failure MEDICAL CONDITION
renal failure MEDICAL CONDITION
asthma MEDICAL CONDITION
As can be seen from the output above, our model is not yet entirely accurate. It has
currently attained an accuracy of only 0.36, which is quite low. But that does not mean the
methodology is flawed; it simply means that we need to train our model on more and better
datasets, which can improve the accuracy of the model manyfold.

5. Comparative Study

We have used spaCy for our current model, but there are numerous alternatives, NLTK and
Gensim to name a few. The one that stands out is Spark NLP. Spark NLP, like spaCy, is an
open-source engine used for large-scale data processing, but what makes it so special is
that in terms of NER it performs much better than spaCy. The parameters on which we have
compared the two are the time it takes to train each and the accuracy each is able to
achieve.

Parameters:

- Time to train
- Accuracy

In terms of accuracy, spaCy is outperformed by Spark NLP, which makes half the errors in
recognizing entities. Spark NLP also takes less time for training: it trains almost 80
times faster than a spaCy model. Spark NLP uses BERT (Bidirectional Encoder
Representations from Transformers) under the hood to achieve state-of-the-art results.

Hence, it can be concluded that in most cases Spark NLP outperforms a spaCy model. The
only problem to consider is that it is much harder to develop a dataset for Spark NLP
than for spaCy.

6. Future Works

In future we would like to improve the accuracy of our model with the help of datasets
that are greater in quantity and better in quality. Manually annotated text helps in
developing a better model; the only setback is that it requires an individual to go
through the whole text themselves and annotate all the entities in it. Also, the more
medical text we can train our model on, the better.

Second, we would like to devise an algorithm that can generate a table of all the tests
and their results, using NER to identify the names of the various tests performed and
part-of-speech (POS) tagging to find the relation between each test and its result, in
order to form key-value pairs.

We can also try to work with Spark NLP; it provides various features, such as extracting
text from a document with the help of computer vision. The only challenging task in
training a Spark NLP model will be creating a good dataset.

We can also train transformers on a large medical corpus to summarize sequences of text
that would lose their meaning if separated, such as doctors' remarks or descriptions of
procedures.

7. Conclusion

Summarizing medical text is a challenging task, as the summary must not lose any important
data in the process. So, to achieve the desired results, we can employ different NLP
methodologies, such as NER and POS tagging, with various neural network models working
under the hood. One of the major limitations anyone might face is the absence of a
standardized dataset, but that too can be dealt with using the various annotation tools
now available for making our own datasets manually.

8. Code Snippets
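A minimal sketch of the extraction and assembly steps (Steps 4 to 6 of Section 3): entities labelled by the entity ruler are grouped by label into the sections of the summary. The patterns, labels, and the sample history text are illustrative assumptions standing in for our trained model and real patient documents.

```python
import spacy

# Step 4: a small pipeline with an entity ruler; the patterns below are
# illustrative stand-ins for the trained medical patterns.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "MEDICAL_CONDITION", "pattern": [{"LOWER": "asthma"}]},
    {"label": "MEDICAL_CONDITION",
     "pattern": [{"LOWER": "renal"}, {"LOWER": "failure"}]},
    {"label": "MEDICINE", "pattern": [{"LOWER": "albuterol"}]},
])

def assemble_summary(text):
    """Steps 5 and 6: extract labelled entities and group them by section."""
    doc = nlp(text)
    sections = {}
    for ent in doc.ents:
        sections.setdefault(ent.label_, []).append(ent.text)
    return sections

# A hypothetical one-line patient history for demonstration.
history = "Patient has asthma and early renal failure; prescribed albuterol."
print(assemble_summary(history))
```

Each key of the returned dictionary corresponds to one section of the final summary presented to the doctor.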
9. References
1. Page 2, Innovative Document Summarization Techniques: Revolutionizing
Knowledge Understanding, 2014.
2. Moratanch, N. & Gopalan, Chitrakala. (2017). A survey on extractive text
summarization. 1-6. 10.1109/ICCCSP.2017.7944061.
3. Sentence Extraction Based Single Document Summarization by Jagadeesh J,
Prasad Pingali, Vasudeva Varma in Workshop on Document Summarization, 19th
and 20th March, 2005, IIIT Allahabad Report No: IIIT/TR/2008/97
4. Moratanch, N. & Gopalan, Chitrakala. (2016). A survey on abstractive text
summarization. 1-7. 10.1109/ICCPCT.2016.7530193.
5. Abstractive Multi-Document Summarization via Phrase Selection and Merging,
Department of Systems Engineering and Engineering Management, The Chinese
University of Hong Kong †Yahoo Labs, Sunnyvale, CA, USA
6. Leveraging BERT for Extractive Text Summarization on Lectures, Derek Miller,
Georgia Institute of Technology
7. Automated News Summarization Using Transformers, Anushka Gupta, Diksha Chugh,
Anjum, Rahul Katarya, Delhi Technological University, New Delhi, India 110042
8. Text Summarization in the Biomedical Domain, Department of Electrical and
Computer Engineering, Isfahan University of Technology, Isfahan 84156-83111,
Iran
9. Afzal M, Alam F, Malik KM, Malik GM, Clinical Context–Aware Biomedical Text
Summarization Using Deep Neural Network: Model Development and Validation
10. TEXT2TABLE: Medical Text Summarization System based on Named Entity
Recognition and Modality Identification; Eiji ARAMAKI, Yasuhide MIURA,
Masatsugu TONOIKE, Tomoko OHKUMA, Hiroshi MASHUICHI Kazuhiko OHE
11. Evaluating and Combining Named Entity Recognition Systems, Ridong Jiang, Rafael
E. Banchs, Haizhou Li
12. https://blog.floydhub.com/gentle-introduction-to-text-summarization-in-machine-
learning/
13. https://towardsdatascience.com/named-entity-recognition-ner-with-bert-in-spark
-nlp-874df20d1d77
