PBL II (FinalPaper) - Group 73
Automatic Text Summarization (ATS) allows us to extract useful information from a large amount of data with the help of machine learning algorithms and deep neural networks, but when we try to apply it to the biomedical domain things get complicated: general-purpose models fail to provide satisfactory results and do not preserve the contextual meaning of the data. So, in this work, we have tried to come up with an alternate solution to this problem. Our problem statement was: given a patient's medical history in the form of various clinical documents, such as medical reports and drug descriptions, generate a medical summary of that patient. We have explored various ways in which the spaCy library can help us achieve the required results. spaCy is a popular NLP library for Python; it provides various pipeline components, such as the entity ruler and the part-of-speech tagger, to name just a few.
This methodology was specially suggested because of one particular limitation of the transformers used for abstractive summarization: they are not able to handle a large amount of text at once while preserving its contextual meaning. Even so, this is one of the best suited methodologies for summarizing the kind of medical text that can be found in magazines and other similar sources.

All these models are great for particular tasks, but the one that stands out most for training is the T5 transformer model. It is an encoder-decoder model which can be trained to perform any NLP task on text related to any domain. We can train the transformer on a large corpus of any domain, and it will train itself without needing any change to its internal structure.
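The length limitation mentioned above is commonly worked around by splitting a long document into overlapping chunks that each fit within a transformer's input window and summarizing them separately. A minimal sketch of such chunking (the 512-token window and 50-token overlap are illustrative values, and tokens are approximated here by whitespace-separated words; a real system would measure the window with the model's own tokenizer):

```python
def chunk_text(text, max_tokens=512, overlap=50):
    """Split text into overlapping chunks that fit a fixed input window.

    Tokens are approximated by whitespace-separated words; real
    transformer tokenizers produce subword tokens, so in practice the
    window should be measured with the model's tokenizer instead.
    """
    words = text.split()
    chunks = []
    step = max_tokens - overlap  # advance by window size minus overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the last window already reached the end of the text
    return chunks

# Each chunk could then be summarized independently (e.g. with T5)
# and the partial summaries concatenated or summarized again.
```

The overlap keeps sentences that straddle a chunk boundary from losing their surrounding context entirely, which is the contextual-meaning problem noted above.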
3. Architecture

a. Diagram

b. Description

A good medical summary usually consists of two major components:

1) a log of all the medication that the patient has been on, and

2) a record of all the medical conditions.

These may include things like surgeries, allergies, and medical history, among other things.

So our idea is to first scan through all the medical text available for a particular patient and identify all the important medical keywords in it, such as medical conditions, names of the various drugs, medical diagnoses, etc. For this we can use Named Entity Recognition, or NER. What NER basically does is identify the various entities in a text. Say that we have the sentence "Rahul is a good student". In this sentence Rahul is a person, hence a good NER model must be able to recognize Rahul as a person. In a similar fashion we can train an NER model to accurately identify the various medical terms in a text. Once we have identified all the medical conditions, diagnoses, and drugs, we can present them in list format to a doctor in need of a quick summary of a patient's history.

c. Steps

Step 1: Source text as input

The first step is to gather the source text that is to be summarized. In our case, we take a document that gives us a patient's medical history. The document is in text file format and is read in the code.

Step 2: Pre-process the data

The process of converting raw data into the desired format is known as pre-processing. Parts of the data that do not hold the value we desire are removed, which enhances our overall performance.

Step 3: Tokenize the text

Tokenizing the text literally means breaking it down into chunks, which are referred to as tokens. We can understand tokens as the words that make up a sentence. Tokenization is a crucial part of any NLP pipeline and a necessary step before proceeding further: it converts unstructured text into a structured form that can be used in machine learning. We have achieved tokenization using spaCy, which parses and understands large volumes of text.

Step 4: Pass the text through the NER pipeline

After the input is tokenized, we pass it through the NER pipeline, which has been trained on various medical documents. Combining this with the entity ruler as a pipe, we can find and label the data as medical tests, conditions, or medicines.

4. Methodology and Evaluation

a. Building our model

We have used spaCy for creating our NER model. spaCy is a very popular NLP library for Python; it is easy to use and provides a lot of functionality and pipeline components, such as text lemmatization, part-of-speech (POS) tagging, and much more. The one we are focusing on is the entity ruler. We have trained our model on several medical documents from:
https://www.hcup-us.ahrq.gov/reports/statbriefs/sbtopic.jsp

We first take a plain NLP model and then add an entity ruler to it. We then take a large amount of annotated text (preferably annotated manually) and train our model on it. This kind of NER is also called machine-learning-based NER: the model is presented with enough data for it to understand the context in which a word can be used.
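As a concrete illustration of adding an entity ruler to a plain pipeline, here is a minimal sketch using spaCy's `entity_ruler` component. The labels and patterns below are invented for illustration; the actual model is trained on the annotated medical documents mentioned above:

```python
import spacy

# Start from a blank English pipeline and add an entity ruler to it.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Illustrative patterns only; a real model would use a much larger,
# manually annotated pattern set covering tests, conditions and drugs.
ruler.add_patterns([
    {"label": "CONDITION", "pattern": [{"LOWER": "diabetes"}]},
    {"label": "MEDICINE", "pattern": [{"LOWER": "metformin"}]},
    {"label": "TEST", "pattern": [{"LOWER": "hba1c"}]},
])

doc = nlp("Patient has diabetes; HbA1c was measured and metformin prescribed.")
# Each recognized span carries the medical label assigned by the ruler.
entities = [(ent.text, ent.label_) for ent in doc.ents]
```

Matching on the `LOWER` attribute makes the patterns case-insensitive, which matters for clinical text where drug and test names appear with inconsistent capitalization.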
6. Future Works

In the future we would like to improve the accuracy of our model with the help of datasets that are larger in quantity and better in quality. Manually annotated text helps in developing a better model; the only setback is that it requires an individual to go through the whole text themselves and annotate all the entities in it. Also, the more medical text we can train our model on, the better.

Second, we would like to devise an algorithm able to generate a table of all the tests and their results, using NER to identify the names of the various tests performed and part-of-speech (POS) tagging to find the relation between each test and its result, so as to form key-value pairs.

We can also try working with Spark NLP, which provides various features such as extracting text from a document with the help of computer vision. In terms of accuracy, spaCy is outperformed by Spark NLP, which makes half the errors in recognizing entities. Spark NLP also takes less time for training purposes: it trains almost 80 times faster than a spaCy model. Spark NLP uses BERT (Bidirectional Encoder Representations from Transformers) under the hood to achieve state-of-the-art results. Hence, it can be concluded that in most cases Spark NLP outperforms a spaCy model. The only challenging task in training a Spark NLP model will be creating a good dataset.

We can also train transformers on a large medical corpus to summarize sequences of text that would lose their meaning if separated, such as doctors' remarks or descriptions of procedures.

7. Conclusion

Summarizing medical text is a challenging task, as the summary must not lose any important data in the process. To achieve the desired results, we can employ different NLP methodologies such as NER and POS tagging, with various neural network models working under the hood. One of the major limitations anyone might face is the absence of a standardized dataset, but that too can be dealt with using the various annotation tools now available for building our own datasets manually.
8. Code Snippets
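The pipeline described in Section 3 can be sketched end to end as follows. The file contents, labels, and patterns here are illustrative assumptions; the real system reads the patient's text file and uses the trained NER model from Section 4:

```python
import spacy

def build_medical_ner():
    """Blank pipeline plus an entity ruler (patterns are illustrative)."""
    nlp = spacy.blank("en")
    ruler = nlp.add_pipe("entity_ruler")
    ruler.add_patterns([
        {"label": "MEDICINE", "pattern": [{"LOWER": "metformin"}]},
        {"label": "MEDICINE", "pattern": [{"LOWER": "aspirin"}]},
        {"label": "CONDITION", "pattern": [{"LOWER": "hypertension"}]},
        {"label": "CONDITION", "pattern": [{"LOWER": "diabetes"}]},
    ])
    return nlp

def summarize(history_text):
    """Steps 1-4: tokenize the source text, run NER, and group the
    entities into the two summary components described in Section 3
    (a medication log and a record of medical conditions)."""
    nlp = build_medical_ner()
    doc = nlp(history_text)
    summary = {"medications": [], "conditions": []}
    for ent in doc.ents:
        if ent.label_ == "MEDICINE":
            summary["medications"].append(ent.text)
        elif ent.label_ == "CONDITION":
            summary["conditions"].append(ent.text)
    return summary

# In the real system the text is read from the patient's document file.
summary = summarize(
    "History of hypertension and diabetes. Currently on metformin and aspirin."
)
```

The returned dictionary is the list-format summary described in the Architecture section, ready to be presented to a doctor needing a quick overview of a patient's history.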