Experimental Disease Prediction Research On Combining Natural Language Processing and Machine Learning
Abstract— Nowadays Artificial Intelligence (AI) technologies are applied widely in many different areas to assist knowledge gaining and decision-making tasks. In particular, health information systems can benefit greatly from AI. Symptom-based disease prediction research and products have recently become increasingly popular in the healthcare sector. Various researchers and organizations have turned their interest to using modern computational techniques to analyze and develop new approaches that can efficiently predict diseases with reasonable accuracy. In this paper, we propose a framework to evaluate the efficiency of applying both Machine Learning (ML) and Natural Language Processing (NLP) technologies in a disease prediction system. As an example, we scraped a disease-symptom dataset with NLP features from the website of the UK National Health Service (NHS), one of the most trusted health information sources in the UK. In addition, we examine our data in depth through symptom frequency, similarity and clustering analysis. As a result, we can see that the prediction can reach a very positive accuracy rate, but open issues still need to be addressed.
Keywords— Artificial Intelligence, Data analysis, Machine Learning, health, Natural Language Processing
future research path. The proposed approach can efficiently predict various diseases with 90% accuracy. The paper is organised as follows:
Section II discusses the related work. Section III introduces the information extraction and NLP process. Section IV explores the dataset features and information quality for ML. Section V introduces the prediction ML algorithms and experimental results. Section VI discusses the experimental results and further open issues. Finally, the conclusion is drawn in Section VII.
II. RELATED WORK
Support Vector Machines (SVM) [4] are a popular supervised learning algorithm that can be used for both classification and regression problems. The concept behind this algorithm is to find a hyperplane (decision surface) that separates the data points of different classes, with its position determined by the extremities of each class, i.e. the points closest to the boundary. Because it often achieves higher accuracy than other machine learning algorithms, SVM has become one of the popular machine learning algorithms used in the medical diagnosis field.
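As a brief illustration of the separating-hyperplane idea described above, a linear SVM can be written as follows. The notation is an assumption introduced here for illustration and does not appear in the original text: training pairs $(x_i, y_i)$ with labels $y_i \in \{-1, +1\}$, weight vector $w$ and bias $b$. This is the standard hard-margin form; the cited works may rely on soft-margin or kernel variants instead.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^{2}
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b) \ge 1,\qquad i = 1,\dots,n
\\[4pt]
\hat{y}(x) = \operatorname{sign}\!\left(w^{\top}x + b\right)
```

The training points for which the constraint holds with equality are the support vectors. The margin between the two classes is $2/\lVert w\rVert$, so minimising $\lVert w\rVert$ maximises the margin.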
Authorized licensed use limited to: REVA UNIVERSITY. Downloaded on January 31,2024 at 05:38:11 UTC from IEEE Xplore. Restrictions apply.
Fig. 1. Condition information example from NHS website
Figure 2 shows our data extraction process, which mainly includes three steps: website crawling to generate raw data of conditions and their symptoms (see Figure 3), data cleansing/merging, and natural language processing (NLP) for ML.
Fig. 2. Data extracting process
By studying the raw data, we found that there are many repeated condition names with different symptoms. Therefore, we need to clean up the raw data so that the NLP step can work efficiently. The clean-up tasks include deleting duplicated raw data and merging the symptoms grouped by condition name.
The NLP tasks are implemented in Python with the ‘nltk’ library. The process involves four major pre-processing methods: 1) capitalisation, 2) stop word removal, 3) tokenisation and 4) stemming. Figure 3 shows parts of the final results.
Capitalisation handling is an important step in text pre-processing, as text data often contains a mixture of capitalised and lowercase words. For instance, a computer program will consider “Nausea” and “nausea” as two completely different words even though they have the same meaning. A simple approach to solve this issue is to convert all the data into lowercase.
Stop word removal helps to get rid of words in the dataset that carry little meaning. For example, filler words like ‘of’ or ‘to’ carry little meaning on their own and are often not important. In addition, stop word removal reduces the dimensionality of the feature space, which is an advantage when using bag-of-words methods, e.g. term frequency-inverse document frequency (tf-idf) [13].
Tokenisation is a straightforward process that converts paragraphs into sentences or sentences into words. In the dataset, the symptoms for each condition are stored in paragraphs. Using the NLTK tokenisation method, the paragraphs are split into sentence tokens, making them easier to use for the clustering and classification algorithms.
Stemming, in simple words, is the process of removing a possible suffix of a word and converting the word into its root stem. This process helps normalise the data and improves the quality of the dictionary. To achieve this, the built-in Snowball stemmer was used.
IV. INFORMATION EVALUATION
From the data extraction process, we extracted 3691 records of raw data and, after cleaning, obtained 298 condition records with their related symptoms and NLP tokens. We carry out two evaluations on our disease-symptom dataset: finding the most frequent symptoms for conditions, and clustering the conditions based on semantic similarity.
To demonstrate the first evaluation, we use a word cloud (see Figure 4) to show the most frequently appearing symptoms extracted from the NLP word and sentence tokens. We can see that “loss of appetite”, “tiredness”, “a high temperature”, “feeling sick or vomiting”, “diarrhoea”, “dizziness”, “loss hair/weight” and “confusion” clearly appear as the most common symptoms. However, frequency analysis cannot explain the whole story, because symptoms are closely linked to the conditions. In other words, symptoms are distributed differently across different types of condition. Therefore, clustering analysis is required.
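The pre-processing steps and the symptom-frequency evaluation described in Sections III and IV can be sketched together as follows. This is a minimal illustration under stated assumptions, not the authors' code: NLTK is deliberately not imported so that the sketch stays self-contained, `STOP_WORDS` and `naive_stem` are small hypothetical stand-ins for NLTK's stop word list and the Snowball stemmer, the `records` are invented examples rather than scraped NHS data, and the count is over single-word tokens rather than the symptom phrases shown in the word cloud.

```python
import re
from collections import Counter

# Illustrative stand-in for a real stop word list (e.g. NLTK's English list).
STOP_WORDS = {"a", "an", "and", "of", "or", "the", "to", "in", "with"}

def naive_stem(word):
    """Very rough suffix stripper; a stand-in for a Snowball stemmer."""
    for suffix in ("ness", "ing"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    # Strip a plural "s", but leave words such as "loss" untouched.
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def preprocess(text):
    """Lowercase, tokenise into words, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

def symptom_frequency(records):
    """Count in how many conditions each pre-processed token occurs."""
    counts = Counter()
    for symptoms in records.values():
        counts.update(set(preprocess(symptoms)))
    return counts

# Hypothetical records; not taken from the scraped NHS dataset.
records = {
    "condition A": "Tiredness, loss of appetite and a high temperature",
    "condition B": "Dizziness and Tiredness",
    "condition C": "A high temperature with vomiting",
}
freq = symptom_frequency(records)
print(freq.most_common(3))
```

Counting each token once per condition keeps a long symptom paragraph from dominating the totals; counting every occurrence instead would only require dropping the `set(...)` call.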
algorithm is more suitable for this clustering task. We automatically tested different numbers of clusters to find the best k-means clustering result. The automated testing process is presented in Figure 5, and the k-means clustering result is shown in Figure 6 as a plot with 15 clusters. The largest cluster contains 32 conditions and the two smallest clusters contain 6 and 9 conditions. To make the quality of the clusters easier to understand, the two smallest clusters, cluster 8 and cluster 13, are discussed here (see Table I). Looking at cluster 8 (C8), we can see that most of the conditions are homologous (e.g. acute lymphoblastic leukaemia) but apply to different age groups, or vice versa. Cluster 13 (C13) is less obvious; however, we can see that the conditions are related to each other (e.g. Anxiety, Panic disorder, OCD and PTSD).
TABLE I. SMALLEST CLUSTERS
Example clusters    Disease names in the cluster
Cluster 8           1. Acute lymphoblastic leukaemia: Children
                    2. Acute lymphoblastic leukaemia: Teenagers and young adults
                    3. Acute myeloid leukaemia: Children
                    4. Acute myeloid leukaemia: Teenagers and young adults
                    5. Non-Hodgkin lymphoma: Children
                    6. Spleen problems and spleen removal
Cluster 13          1. Anxiety
                    2. Fibromyalgia
                    3. Hypoglycaemia (low blood sugar)
                    4. Migraine
                    5. Obsessive compulsive disorder (OCD)
                    6. Panic disorder
                    7. Post-traumatic stress disorder (PTSD)
                    8. Peripheral neuropathy
                    9. Menopause
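The automated search over cluster counts described in this section can be sketched as follows. The text does not state which criterion or initialisation was used, so everything here is an assumption made for illustration: the mean silhouette coefficient as the selection criterion, a deterministic farthest-first initialisation, plain Euclidean distance, and a toy set of 2-D `points` in place of the NLP token features of the 298 conditions.

```python
import math

def distance(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def farthest_first_centroids(points, k):
    """Deterministic initialisation: start from the first point, then
    repeatedly add the point farthest from its nearest chosen centroid."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(distance(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, max_iterations=100):
    """Lloyd's algorithm; returns one cluster label per point."""
    centroids = farthest_first_centroids(points, k)
    labels = []
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: distance(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[c])  # leave an empty cluster's centroid in place
        if updated == centroids:
            break
        centroids = updated
    return labels

def mean_silhouette(points, labels):
    """Mean silhouette coefficient; higher means tighter, better separated clusters."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    total = 0.0
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1 or len(clusters) < 2:
            continue  # singleton clusters contribute a silhouette of 0
        a = sum(distance(p, q) for q in own) / (len(own) - 1)
        b = min(sum(distance(p, q) for q in other) / len(other)
                for other_label, other in clusters.items() if other_label != label)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def best_k(points, candidates):
    """Try every candidate k and return the one with the highest mean silhouette."""
    return max(candidates, key=lambda k: mean_silhouette(points, kmeans(points, k)))

# Toy data: three well separated groups of four points each (hypothetical).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (0, 10), (0, 11), (1, 10), (1, 11)]
k = best_k(points, range(2, 6))
labels = kmeans(points, k)
print(k, labels)
```

On real token features the candidate range would need to be much wider (the section reports 15 clusters), and a library implementation would normally replace these hand-written loops.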
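The prediction step scores every class by multiplying the class prior by the likelihood of the input, as the explanation that follows describes, which matches a Naive Bayes style classifier. A minimal sketch of that scoring rule is given here. It is an illustration under assumptions rather than the authors' implementation: the add-one (Laplace) smoothing, the function names `train`, `score` and `predict`, and the tiny `examples` list with its condition names are all hypothetical.

```python
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (condition, [symptom tokens]) pairs."""
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for condition, tokens in examples:
        class_counts[condition] += 1
        token_counts[condition].update(tokens)
        vocabulary.update(tokens)
    return class_counts, token_counts, vocabulary

def score(tokens, condition, model):
    """Prior of the condition times the smoothed likelihood of each token."""
    class_counts, token_counts, vocabulary = model
    prior = class_counts[condition] / sum(class_counts.values())
    total = sum(token_counts[condition].values())
    likelihood = 1.0
    for token in tokens:
        # Add-one smoothing keeps an unseen token from zeroing the whole score.
        likelihood *= (token_counts[condition][token] + 1) / (total + len(vocabulary))
    return prior * likelihood

def predict(tokens, model):
    """Score every condition and return them sorted from most to least likely."""
    class_counts = model[0]
    scores = {c: score(tokens, c, model) for c in class_counts}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical training data; not taken from the NHS dataset.
examples = [
    ("flu", ["fever", "cough", "tiredness"]),
    ("flu", ["fever", "headache"]),
    ("migraine", ["headache", "dizziness"]),
]
model = train(examples)
ranking = predict(["fever", "cough"], model)
print(ranking[0][0])
```

For long symptom lists, summing log probabilities instead of multiplying raw probabilities avoids numerical underflow; the product form is kept here because it mirrors the description in the text.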
To explain: to find the probability score of a new input belonging to a class, we multiply the prior probability of that class by the likelihood of the input within that class.
For example, in the project’s context, the probability of a condition given the symptoms entered by a user can be calculated as:
$$P(\text{condition} \mid \text{symptoms}) \propto P(\text{condition}) \times P(\text{symptoms} \mid \text{condition})$$
The process is repeated for each class, and the resulting scores are used to determine the prediction results.
VI. DISCUSSIONS
To evaluate this project, three similar existing online commercial products were chosen: mayoclinic.org, medicinenet.com and healthdirect.gov.au. Figure 8 shows the interface of our implemented C-Predict system. A comparison of the overall functionality of each system was made to evaluate the accuracy and efficiency of the proposed method. Test cases for 20 different conditions with specific symptoms were obtained from nhs.uk. The final result shows that all other systems have a relative accuracy rate of around 80%, while our proposed framework reaches 93% as the highest. However, here we only talk about relative accuracy, and there are many issues around current technologies.
Fig. 8. The C-Predict system interface
First is the undetermined prediction problem. If only a couple of symptoms are provided, the prediction result will contain multiple suggestions with the same probability scores. For example, if the inputs are fever, dizziness and cough, then about 20 results will be displayed with very similar prediction scores. However, many of the predicted conditions have their own specific context. For instance, the highest score is Sinusitis, but its common major signs are thick nasal mucus, a plugged nose and pain in the face. Fever, dizziness and cough can be side effects of the major signs. Thus, comprehensive knowledge modelling techniques are required to enable specifying detailed information about symptoms and diseases, rather than just calculating the similarities of NLP token matching.
The second key issue is the black box prediction process, which means the algorithm cannot explain the results in a transparent way to tell why one disease has a higher score than others. This is also a general issue of current ML algorithms [15], where you can specify the computation algorithm but it is difficult to trace the reasoning behind a result. In contrast, finding out why is the most important task in healthcare research. For example, if a person has had a high temperature for more than 7 days, understanding what caused this condition is critical: a long period of high fever can be caused by an infection, but what causes the infection?
Finally, the learning and prediction framework is static and fixed to a certain dataset. In other words, the prediction model is hard to update unless the whole process is redone.
Therefore, there is a long way to go in this research area. As a short-term goal, future research should integrate semantic technologies such as semantic web data (also called knowledge graphs) with NLP tokens to build machine-understandable graph data, e.g. to indicate the position of a symptom, side effects of drugs, relations between different diseases and so on. In addition, new logic models such as causal analysis logic should be considered to enhance the causal reasoning capability. Transparent probability distribution algorithms should also be integrated into the framework.
VII. CONCLUSION
In this paper, we present an experimental evaluation framework for developing a symptom-based disease prediction system. We extracted information on 298 diseases from the NHS website, one of the UK's most trusted sources, and applied advanced NLP and ML algorithms in the framework. The relative accuracy can reach 93%. However, there are many remaining issues in this research area, namely undetermined predictions, black box reasoning and the difficulty of updating knowledge. Therefore, we need to guide our research in the directions of semantic enhancement, knowledge graphs and Open Linked Data with traceable causal analysis logic frameworks.
REFERENCES
[1] Kaggle.com, “Heart Disease UCI”, 2018. [Online]. Available at: https://www.kaggle.com/cdabakoglu/heart-disease-classifications-machine-learning/data. [Accessed: 23 July 2019].
[2] M. Karabatak and M. C. Ince, “An Expert System for Detection of Breast Cancer Based on Association Rules and Neural Network”, Expert Systems with Applications, 36(2), 2009, pp. 3465-3469.
[3] B. A. Thakkar, M. I. Hasan and M. A. Desai, “Healthcare decision support system for swine flu prediction using naïve bayes classifier”, in Proceedings of the 2010 International Conference on Advances in Recent Technologies in Communication and Computing (ARTCOM '10), IEEE Computer Society, Washington, DC, USA, 2010, pp. 101-105.
[4] W. H. Press, S. A. Teukolsky, W. T. Vetterling and B. P. Flannery, “Support Vector Machines”, Numerical Recipes: The Art of Scientific Computing (3rd ed.), New York: Cambridge University Press. ISBN 978-0-521-88068-8. Section 16.5. Archived from the original on 2011-08-11.
[5] W. Chen and D. A. Bushinsky, “Chronic kidney disease: KDIGO CKD-MBD guideline update: evolution in the face of uncertainty”, Nat Rev Nephrol, published online 21 Aug 2017. doi: 10.1038/nrneph.2017.118
[6] N. Almansour, H. Syed, N. Khayat, R. Altheeb, R. Juri, J. Alhiyafi, S. Alrashed and S. Olatunji, “Neural network and support vector machine for the prediction of chronic kidney disease: A comparative study”, Computers in Biology and Medicine, 109, 2019, pp. 101-111.
[7] I. Rish, “An empirical study of the naive Bayes classifier”, IJCAI Workshop on Empirical Methods in AI, 2001. [Online]. Available at: https://www.cc.gatech.edu/~isbell/reading/papers/Rish.pdf [Accessed: 27 July 2019].
[8] S. Pattekari and A. Parveen, “Prediction System for Heart Disease Using Naïve Bayes”, International Journal of Advanced Computer and Mathematical Sciences, 3(3), 2019, pp. 290-294. [Online]. Available at: https://pdfs.semanticscholar.org/d32e/e90a5de89093a4fc95f43e0409cb91414726.pdf [Accessed: 31 May 2019].
[9] Md. T. R. Laskar, Md. T. Hossain, A. R. M. Kamal and N. Rashid, “Automated Disease Prediction System (ADPS): A User Input-based Reliable Architecture for Disease Prediction”, International Journal of Computer Applications, 133(15), January 2016, pp. 24-29. Published by Foundation of Computer Science (FCS), NY, USA.
[10] D. Coomans and D. L. Massart, “Alternative k-nearest neighbour rules in supervised pattern recognition: Part 1. k-Nearest neighbour classification by using alternative voting rules”, Analytica Chimica Acta, 136, 1982, pp. 15-27. doi: 10.1016/S0003-2670(01)95359-0.
[11] J. Thomas and R. T. Princy, “Human heart disease prediction system using data mining techniques”, 2016 International Conference on Circuit, Power and Computing Technologies (ICCPCT), Nagercoil, 2016, pp. 1-5. doi: 10.1109/ICCPCT.2016.7530265
[12] S. Tayeb, M. Pirouz, J. Sun, K. Hall, A. Chang, J. Li, C. Song, A. Chauhan, M. Ferra, T. Sager, J. Zhan and S. Latifi, “Toward predicting medical conditions using k-nearest neighbors”, 2017 IEEE International Conference on Big Data (Big Data). [Online]. Available at: https://ieeexplore.ieee.org/document/8258395 [Accessed: 31 May 2019].
[13] A. Aizawa, “An information-theoretic perspective of tf-idf measures”, Information Processing and Management, 39(1), pp. 45-65. doi: 10.1016/S0306-4573(02)00021-3.
[14] D. M. Blei, A. Y. Ng and M. I. Jordan, “Latent Dirichlet Allocation”, J. Lafferty (ed.), Journal of Machine Learning Research, 3(4-5), January 2003, pp. 993-1022. doi: 10.1162/jmlr.2003.3.4-5.993.
[15] C. Zednik, “Solving the Black Box Problem: A Normative Framework for Explainable Artificial Intelligence”, 2019. [Online]. Available at: https://arxiv.org/ftp/arxiv/papers/1903/1903.04361.pdf [Accessed: 31 July 2019].