
2019 IEEE 7th International Conference on Computer Science and Network Technology (ICCSNT)

Experimental Disease Prediction Research on Combining Natural Language Processing and Machine Learning
Hong Qing Yu
School of Computer Science and Technology
University of Bedfordshire
Luton, United Kingdom
hongqing.yu@beds.ac.uk

Abstract— Nowadays, Artificial Intelligence (AI) technologies are widely applied in many different areas to assist knowledge acquisition and decision-making tasks. Health information systems in particular can benefit greatly from AI. Symptom-based disease prediction research and products have recently become increasingly popular in the healthcare sector. Various researchers and organizations have turned their interest to using modern computational techniques to analyze and develop new approaches that can efficiently predict diseases with reasonable accuracy. In this paper, we propose a framework to evaluate the efficiency of applying both Machine Learning (ML) and Natural Language Processing (NLP) technologies in a disease prediction system. As an example, we scraped a disease-symptom dataset with NLP features from one of the UK's most trusted websites, that of the National Health Service (NHS). In addition, we examine our data in depth through symptom frequency, similarity and clustering analysis. As a result, we can see that the prediction can achieve a very positive efficiency rate, but open issues still need to be addressed.

Keywords— Artificial Intelligence, Data analysis, Machine Learning, health, Natural Language Processing

I. INTRODUCTION

There are more and more medical and healthcare related products available as mobile and web applications nowadays. Many machine learning research works have started to use information captured from online sources, such as social media, forum communications and many other resources, to create AI-supported healthcare recommendation applications. The research results have been very encouraging, and these AI applications can provide helpful tips or even pre-diagnosis advice based on very simple datasets, e.g. condition and symptom relational datasets [1]. However, most of the work lacks a transparent framework for managing the data and applying ML algorithms to trusted and reliable data. Past research, such as predicting cancer using machine learning and clustering, predicting dermatological conditions using a naïve Bayes classifier [2] and predicting the occurrence of swine flu using naïve Bayes classification [3], seems to produce reasonably good results. However, most of this research is narrowed down to one or a few diseases or medical conditions that are very crucial in the healthcare field. Some investigations, however, proposed new methods that can predict more diseases based on symptoms, but their reliability and accuracy remain a significant concern. Therefore, the research in this paper aims to develop an evaluable and transparent disease prediction framework by combining the most advanced technologies in an experimental case study, to examine the quality and issues of the results in order to identify a future research path. The proposed approach can efficiently predict various diseases with 90% accuracy. The paper is organised as follows:

Section II discusses the related work. Section III introduces the information extracting and NLP process. Section IV explores the dataset features and information quality for ML. Section V introduces the prediction ML algorithms and experimental results. Section VI discusses the experimental results and further open issues. Finally, the conclusion is drawn in Section VII.

II. RELATED WORK

Support Vector Machines (SVM) [4] is a popular supervised learning algorithm that can be used for both classification and regression problems. The concept behind this algorithm is to find the extreme points (support vectors) in the dataset and use them to define a hyperplane (decision surface) that separates the data into different classes. Due to its higher accuracy when compared to other machine learning algorithms, SVM has become one of the popular machine learning algorithms used in the medical diagnosis field.

Chen et al. conducted a study using three multivariate models to find patients with Chronic Kidney Disease (CKD) [5]. The study used three ML algorithms, K-Nearest Neighbour (KNN), Support Vector Machine (SVM) and Soft Independent Modelling of Class Analogy (SIMCA), with the clinical dataset from the UCI Machine Learning Repository. It found that SVM had a 99% accuracy rate in predicting CKD, compared with 98.2% for KNN and 85.9% for SIMCA. It is also worth noting that SVM had a quicker processing time than the other two models. Almansour et al., on the other hand, suggest that when comparing ANN and SVM in predicting CKD, ANN performed better, with higher accuracy but a longer runtime [6]. However, the study finally concluded that there is no significant difference in the accuracy rate.

Based on the previous research, it is safe to conclude that SVM tends to produce a higher accuracy rate than other machine learning models. However, SVM has a few significant drawbacks: the higher accuracy rate comes at the cost of a long training period. Also, SVM in its basic form applies only to two-class classification problems, i.e. given the symptoms, SVM can only classify whether the given symptoms relate to the disease or not.

Naive Bayes [7] is also a popular ML algorithm primarily used for text classification. It uses probabilistic classifiers based on the principles of Bayes' Theorem. It works based on the probability distributions of variables in a dataset. While

978-1-7281-3299-0/19/$31.00 ©2019 IEEE 145 Dalian, China•Oct 19-20, 2019


Authorized licensed use limited to: REVA UNIVERSITY. Downloaded on January 31,2024 at 05:38:11 UTC from IEEE Xplore. Restrictions apply.
providing new attributes, the algorithm will predict the probability of the response variable data belonging to a particular class.

Pattekari & Parveen [8] conducted research to predict heart disease using machine learning and data mining techniques. The research aims to develop an intelligent support system for doctors that can predict heart disease from various input attributes using the Naïve Bayes algorithm. The work concluded by stating that the proposed approach is the most accurate solution for predicting patients with heart disease. However, the work did not seem to use any real-world datasets or use cases to support the statement, and no evidence of results can be found. Thus, the reliability of the work is in question.

An approach that can predict breast cancer using a Weighted Naïve Bayes (WNB) algorithm was proposed [2], along with the Wisconsin Breast Cancer dataset. To reach a higher accuracy rate, the author used a grid search algorithm to find an optimum weight value for each attribute in the dataset. Compared with the traditional Naïve Bayes approach, the proposed method performed better, with a higher accuracy of 98.54% as well as higher sensitivity and specificity rates, although it should be noted that the proposed method was computationally expensive and application dependent.

Rashid et al. in 2016 proposed an Automated Disease Prediction System (ADPS) using the Naïve Bayes algorithm [9]. Unlike other research focusing on one particular disease, the proposed study concentrates on general disease prediction. The approach uses Natural Language Processing (NLP) to process the user input and tag the keywords based on their attribute type, i.e. 1) Symptom, 2) Time, 3) Intensity, 4) Organ and 5) Duration, and stores them as a Relevant Attribute Array (RA). This helps the algorithm to compute efficiently with higher accuracy and to rank the list of possible diseases. The research concluded with 14.35% higher accuracy in comparison with other existing solutions. Also, only four conditions were tested in the system, and the dataset used in this approach is not available for other researchers to evaluate.

However, there is one major drawback in the approach of Rashid et al. The author states that if any word or term in the user input cannot be determined during the NLP process, then that particular word or phrase will be assumed to have no apparent significance and will be ignored. Though it is reasonable to adopt such an approach to filter noisy inputs and improve the prediction accuracy, there is a chance that it might ignore some critical factors because of misspellings in the input data. In that case, this would affect the overall accuracy of the prediction process.

K-Nearest Neighbour (KNN) [10] is a popular machine learning algorithm used to solve both classification and regression problems. The algorithm aims to find the nearest neighbours of a given unclassified input in a given feature space. This is achieved by measuring the distance between the input data and the already classified data using distance metrics such as Euclidean, Manhattan or weighted distances. A critical factor in this algorithm is determining the value of K. K is a user-defined constant that determines the number of neighbours ‘voting’ for a new input (the query point).

Various researchers in recent times have turned their interest to using the KNN algorithm for disease prediction problems. Princy & Thomas in 2016 proposed a heart disease prediction system using data mining techniques, focusing mainly on the KNN and Iterative Dichotomiser 3 (ID3) algorithms [11]. The research suggests that KNN produces consistently good results in disease prediction when compared to other machine learning algorithms. The study, however, did not provide any information about the dataset that was used. It concluded that this approach provided a prediction accuracy of 80.6%.

Tayeb et al. in 2017 researched predicting chronic kidney disease and heart disease using the KNN algorithm [12]. This research used datasets from the UCI Machine Learning Repository. The most crucial step of the approach is to choose the value of K for KNN. It is suggested that the value of K is critical in this type of approach, as it made a substantial change in the accuracy rate. The research concluded that the approach could successfully predict CKD and heart disease with a 90% accuracy rate, which is better than previous studies using the same dataset.

KNN, however, has some common drawbacks compared to other classification algorithms. Tayeb et al. suggest that the algorithm will be computationally expensive, as it needs to calculate the distance of each test instance to all training samples. Another drawback is that the accuracy rate will be primarily affected when using datasets with low proximity between different classes. Also, the accuracy rate is affected when using multidimensional datasets with noisy data.

Our work investigates and analyzes existing solutions and develops an efficient framework that can be used to predict general diseases or medical conditions based on the symptoms provided. The project uses NLP, K-Means and the Multinomial Naïve Bayes algorithm together to predict 298 different medical conditions. The framework has three major components:

1. Data extracting and NLP processing
2. Data evaluation and clustering
3. Prediction

We will discuss the details of these components in the next three sections.

III. INFORMATION EXTRACTING

For the data extracting and processing component, we first developed a “web-crawling” engine that can automatically extract disease information and symptom descriptions from the UK NHS website (Figure 1 shows the Hay fever information as an example).
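The extraction step of the crawling engine described in Section III can be illustrated with a small sketch. It uses only Python's standard-library `html.parser` and parses an inline sample page instead of fetching anything over HTTP. The markup in `SAMPLE_PAGE`, the sample symptom text and the class name `SymptomParser` are hypothetical and chosen for illustration; they are not the real NHS page structure or the engine used in this work.

```python
from html.parser import HTMLParser


class SymptomParser(HTMLParser):
    """Collects the condition name (<h1>) and symptom list items (<li>)."""

    def __init__(self):
        super().__init__()
        self.condition = ""
        self.symptoms = []
        self._tag = None  # tag whose text we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "li"):
            self._tag = tag

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._tag == "h1":
            self.condition = text
        elif self._tag == "li":
            self.symptoms.append(text)


# Hypothetical page markup, made up for this example.
SAMPLE_PAGE = """
<html><body>
<h1>Hay fever</h1>
<h2>Symptoms</h2>
<ul>
  <li>sneezing and coughing</li>
  <li>a runny or blocked nose</li>
  <li>itchy, red or watery eyes</li>
</ul>
</body></html>
"""

parser = SymptomParser()
parser.feed(SAMPLE_PAGE)
record = {"condition": parser.condition, "symptoms": parser.symptoms}
```

Each crawled page would yield one such raw record; a real crawler would also need to download the pages and handle markup that differs between conditions.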

Fig. 1. Condition information example from NHS website

Figure 2 shows our data extracting process, which mainly includes three steps: website crawling to generate raw data of conditions and their symptoms (see Figure 3), data cleansing/merging, and natural language processing (NLP) for ML.

Capitalization is an important step in text pre-processing, as text data often involves a mixture of capitalized and lowercase words. For instance, a computer program will consider “Nausea” and “nausea” as two completely different words even though they have the same meaning. A simple approach to solve this issue is to convert all the data into lowercase.

Stop word removal helps to get rid of the meaningless words in the dataset. For example, filler words like ‘of’ or ‘to’ hold no meaning and are often not important. In addition, stop word removal reduces the dimensional space, which is an advantage when using bag-of-words methods, e.g. term frequency-inverse document frequency [13].

Tokenisation is a very straightforward process that converts paragraphs into sentences or sentences into words. In the dataset, the symptoms for each condition are stored in paragraphs. Using the NLTK tokenisation method, the paragraphs are split into sentence tokens, making them easier to use for the clustering and classification algorithms.

Stemming, in simple words, is the process of removing the possible suffix of a word and converting it into its root stem. This process helps normalise the data and improves the quality of the dictionary. To achieve this, a built-in stemming algorithm named the Snowball stemmer was used.
Fig. 2. Data extracting process

By studying the raw data, we found that there are a lot of repeated condition names with different symptoms. Therefore, we need to clean up the raw data so that the NLP step can work efficiently. The clean-up tasks include deleting duplicated raw data and merging the symptoms grouped by condition name.

The NLP tasks are implemented in Python with the ‘nltk’ library. The process involves four major pre-processing methods: 1) capitalisation, 2) stop word removal, 3) tokenisation and 4) stemming. Figure 3 shows parts of the final results.

Fig. 3. Cleaned data with NLP semantic tokens

IV. INFORMATION EVALUATION

From the data extracting process, we extracted 3691 records of raw data and cleaned them up into 298 condition records with their related symptoms and NLP tokens. We carry out two evaluations on our disease-symptom dataset: finding the most frequent symptoms for conditions and clustering the conditions based on their semantic similarities.

To demonstrate the first evaluation, we use a WordCloud figure (see Figure 4) to show the most frequently appearing symptoms extracted from the NLP word and sentence tokens. We can see that “loss of appetite”, “tiredness”, “a high temperature”, “feeling sick or vomiting”, “diarrhoea”, “dizziness”, “loss hair/weight” and “confusion” clearly show up as the most common symptoms. However, frequency analysis cannot explain the whole story, because symptoms are closely linked to the conditions. In other words, symptoms are scattered according to different types of condition. Therefore, clustering analysis is required.

Fig. 4. Symptom Wordcloud

Based on the number of conditions (298), we evaluated two different clustering algorithms, k-means clustering and LDA (Latent Dirichlet Allocation) [14]. The evaluation result shows that the k-means clustering

algorithm is more suitable for this clustering task. We automatically tested different numbers of clusters to find the best k-means clustering result. The automated testing process is presented in Figure 5, and the k-means clustering result is presented in Figure 6 as a plot graph with 15 clusters. The largest cluster contains 32 conditions and the two smallest clusters contain 6 and 9 conditions. To make the quality of the clusters easier to understand, the two smallest clusters, cluster 8 and cluster 13, are discussed here (see Table I). Checking cluster 8 (C8), we can see that most of the conditions are homologous (e.g. Acute lymphoblastic leukaemia) but for different age groups, or vice versa. Cluster 13 (C13) is less obvious; however, we can see that the conditions are related to each other (e.g. Anxiety, Panic disorder, OCD and PTSD).

TABLE I. SMALLEST CLUSTERS

Example clusters    Disease names in the cluster
Cluster 8           1. Acute lymphoblastic leukaemia: Children
                    2. Acute lymphoblastic leukaemia: Teenagers and young adults
                    3. Acute myeloid leukaemia: Children
                    4. Acute myeloid leukaemia: Teenagers and young adults
                    5. Non-Hodgkin lymphoma: Children
                    6. Spleen problems and spleen removal
Cluster 13          1. Anxiety
                    2. Fibromyalgia
                    3. Hypoglycaemia (low blood sugar)
                    4. Migraine
                    5. Obsessive compulsive disorder (OCD)
                    6. Panic disorder
                    7. Post-traumatic stress disorder (PTSD)
                    8. Peripheral neuropathy
                    9. Menopause
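The automated search over cluster counts might look like the following sketch. It is a plain-Python k-means run for several values of k, recording the within-cluster sum of squared distances (inertia) for each. The 2-D `points` are hypothetical stand-ins for the NLP feature vectors, the evenly spaced initial centroids are a simplification, and using inertia as the comparison criterion is our assumption, since the selection rule is not stated in the text.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Return (clusters, inertia) for k clusters over a list of tuples."""
    # Deterministic start: k evenly spaced points serve as initial centroids.
    centroids = [points[i * len(points) // k] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: centroid = mean of members (unchanged if a cluster is empty).
        centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    inertia = sum(
        squared_distance(p, centroids[i])
        for i, members in enumerate(clusters)
        for p in members
    )
    return clusters, inertia


# Hypothetical 2-D feature vectors forming two obvious groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
inertia_by_k = {k: kmeans(points, k)[1] for k in range(1, 4)}
```

On this toy data the inertia drops sharply from k = 1 to k = 2 and only slightly afterwards, which is the kind of pattern an automated search can use to pick the number of clusters.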

V. THE PREDICTION METHODOLOGY

Fig. 7. Clustering plot graph

Figure 7 shows the disease prediction methodology for a given set of symptoms. We discussed the NLP and clustering parts in the previous sections. This section focuses on the last component, prediction, using the Multinomial Naïve Bayes (MNB) probabilistic classification method.

The first step is calculating the prior probability of each class in the training dataset. This can be achieved by using the formula below.

Explanation: in order to calculate the prior probability of each class in the training dataset, we take the total number of times the class occurred divided by the size of the training set.
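Written out from the explanation above, the prior can be expressed as follows. The symbols are our own notation: $N_c$ is the number of training records labelled with class $c$ and $N$ is the size of the training set.

```latex
\[
P(c) = \frac{N_c}{N}
\]
```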
Fig. 5. Clustering process

The second step is to find the likelihood of a symptom occurring in a particular class. This can be achieved by a simple calculation.

Explanation: in order to find the likelihood of a symptom occurring in a particular class, we take the total number of times the symptom occurred in that class, add 1 for Laplace smoothing, and divide by the sum of the total number of symptoms in that class and the size of the vocabulary of that class, i.e. the total number of distinct symptoms in that class.
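Following the explanation above term by term, the smoothed likelihood can be written as below. The notation is ours: $\mathrm{count}(s, c)$ is the number of times symptom $s$ occurs in class $c$, and $V_c$ is the set of distinct symptoms in class $c$. Note that many textbook formulations of multinomial Naïve Bayes use the vocabulary of the whole training set in the denominator instead of the per-class vocabulary described here.

```latex
\[
P(s \mid c) = \frac{\mathrm{count}(s, c) + 1}{\sum_{s' \in V_c} \mathrm{count}(s', c) + |V_c|}
\]
```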
Fig. 6. Clustering plot graph

After calculating the required parameters, it is now possible to determine the class of new unseen data. This can be achieved by the following technique.
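As an illustration of this scoring step, the sketch below puts the prior, the Laplace-smoothed likelihood and the class score together: the score of a class is its prior multiplied by the likelihood of each input symptom in that class. It is a minimal sketch and not the C-Predict implementation. The function names and the three training `records` are hypothetical, and the product is computed as a sum of logarithms to avoid numeric underflow, which is our choice rather than something stated in the text.

```python
import math
from collections import Counter, defaultdict


def train(records):
    """records: list of (condition, [symptom tokens]) pairs."""
    class_docs = Counter()  # how often each class occurs
    token_counts = defaultdict(Counter)  # per-class symptom counts
    for condition, tokens in records:
        class_docs[condition] += 1
        token_counts[condition].update(tokens)
    total = sum(class_docs.values())
    priors = {c: n / total for c, n in class_docs.items()}
    return priors, token_counts


def log_likelihood(token, counts):
    # Laplace smoothing (+1). The denominator is the total number of symptom
    # tokens in the class plus the number of distinct symptoms in that class.
    return math.log((counts[token] + 1) / (sum(counts.values()) + len(counts)))


def predict(tokens, priors, token_counts):
    """Return (condition, log score) pairs, best score first."""
    scores = {
        c: math.log(prior) + sum(log_likelihood(t, token_counts[c]) for t in tokens)
        for c, prior in priors.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical, already pre-processed training records.
records = [
    ("hay fever", ["sneez", "itchi", "eye", "runni", "nose"]),
    ("common cold", ["sore", "throat", "cough", "runni", "nose"]),
    ("flu", ["fever", "cough", "ach", "tired"]),
]
priors, token_counts = train(records)
ranking = predict(["sneez", "runni", "nose"], priors, token_counts)
```

The first entry of `ranking` is the predicted condition, and the remaining entries give the alternative conditions in descending order of score.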

Explanation: to find the probability score of the new input belonging to a class, we multiply the prior probability of that class by the likelihood of the input within that class.

For example, in the project’s context, the formula for calculating the probability of a user given symptom can be calculated like:

The process will be repeated with each class and their resulting scores will be used to determine prediction results.

VI. DISCUSSIONS

For evaluating this project, three similar existing online commercial products were chosen: mayoclinic.org, medicinenet.com and healthdirect.gov.au. Figure 8 shows the interface of our implemented C-Predict system. A comparison of the overall functionalities of each system was made, aiming to evaluate the accuracy and efficiency of the proposed method. Test cases of 20 different conditions with specific symptoms were obtained from nhs.uk. The final result shows that all the other systems have a relative accuracy rate of around 80%, while our proposed framework can reach 93% as the highest. However, here we only talk about relative accuracy, and there are many issues around current technologies.

Fig. 8. The C-Predict system interface

The first is the undetermined prediction problem. If only a couple of symptoms are provided, then the prediction result will contain multiple suggestions with the same probability scores. For example, if the inputs are fever, dizzy and cough, then about 20 results will be displayed with very similar prediction scores. However, many of the prediction results have their own specific context. For instance, the highest score is Sinusitis, but its common major signs are thick nasal mucus, a plugged nose and pain in the face. Fever, dizziness and cough can be side effects of the major signs. Thus, comprehensive knowledge modelling techniques are required to enable specifying detailed information about symptoms and diseases rather than just calculating the similarities of NLP token matching.

The second key issue is the black box prediction process, which means the algorithm cannot explain the results in a transparent way to tell why (the reason) one disease has a higher score than others. This is also a general issue of current ML algorithms [15], where you can specify the computation algorithm but it is difficult to have traceability for evaluating the reason behind it. In contrast, finding out why is the most important task in healthcare research. For example, if a person has had a high temperature for more than 7 days, understanding what caused this condition is critical, e.g. the reason for a long period of high fever can be an infection, but then what causes the infection?

Finally, the learning and prediction framework is static and fixed to a certain dataset. In other words, the prediction model is hard to update unless the whole process is redone.

Therefore, there is a long way to go in this research area. As a short-term goal, future research should integrate semantic technologies like semantic web data (also called knowledge graphs) with NLP tokens to structure machine-understandable graph data, e.g. to indicate the position of the symptom, side effects of using drugs, relations between different diseases and so on. In addition, new logic models like causal analysis logic should be considered to enhance the causal reasoning capability. Transparent probability distribution algorithms should also be injected into the framework.

VII. CONCLUSION

In this paper, we present an experimental evaluation research framework for developing a symptom-based disease prediction system. We extracted information on 298 diseases from the NHS website, one of the UK's most trusted, and applied advanced NLP and ML algorithms in the framework. The relative accuracy can reach 93%. However, there are many remaining issues around this research area, namely undetermined predictions, black box reasoning and the difficulty of updating knowledge. Therefore, we need to guide our research in the directions of semantic enhancement, knowledge graphs and Open Linked Data with traceable causal analysis logic frameworks.

REFERENCES
[1] Kaggle.com, “Heart Disease UCI”, 2018. [Online]. Available at: https://www.kaggle.com/cdabakoglu/heart-disease-classifications-machine-learning/data. [Accessed: 23- July- 2019].
[2] M. Karabatak and M. C. Ince, “An Expert System for Detection of Breast Cancer Based on Association Rules and Neural Network,” Expert Systems with Applications, 2009, 36(2), 3465-3469.
[3] B. A. Thakkar, M. I. Hasan, M. A. Desai, “Healthcare decision support system for swine flu prediction using naïve bayes classifier”, In Proceedings of the 2010 International Conference on Advances in Recent Technologies in Communication and Computing (ARTCOM '10). IEEE Computer Society, Washington, DC, USA, 2010, pp. 101-105.
[4] H. William, S. A. Teukolsky, A. Saul, T. W. Vetterling, B. P. Flannery, "Support Vector Machines". Numerical Recipes: The Art of Scientific Computing (3rd ed.). New York: Cambridge University Press. ISBN 978-0-521-88068-8. Section 16.5, Archived from the original on 2011-08-11.
[5] Chronic kidney disease: KDIGO CKD-MBD guideline update: evolution in the face of uncertainty. Wei Chen, David A. Bushinsky, Nat Rev Nephrol. 2017 Aug 21 Published online 2017 Aug 21. doi: 10.1038/nrneph.2017.118

[6] N. Almansour, H. Syed, N. Khayat, R. Altheeb, R. Juri, J. Alhiyafi, S.
Alrashed and S. Olatunji. “Neural network and support vector
machine for the prediction of chronic kidney disease: A comparative
study”, Computers in Biology and Medicine, 2019, 109, pp.101-111.
[7] I. Rish, “An empirical study of the naive Bayes classifier”. IJCAI
Workshop on Empirical Methods in AI, 2001. [Online]. Available at:
https://www.cc.gatech.edu/~isbell/reading/papers/Rish.pdf [Accessed:
27- July- 2019].
[8] S. Pattekari and A. Parveen, (2019). “Prediction System for Heart
Disease Using Naïve Bayes”. International Journal of Advanced
Computer and Mathematical Sciences, [online] 3(3), pp.290-294.
Available at: https://pdfs.semanticscholar.org/d32e/e90a5de89093a4fc95f43e0409cb91414726.pdf [Accessed 31 May 2019].
[9] Md. T. R. Laskar, Md. T. Hossain, A. R. M. Kamal and N. Rashid,
“Automated Disease Prediction System (ADPS): A User Input-based
Reliable Architecture for Disease Prediction”. International Journal
of Computer Applications 133(15):24-29, January 2016. Published by
Foundation of Computer Science (FCS), NY, USA
[10] D. Coomans, D. L. Massart. "Alternative k-nearest neighbour rules in
supervised pattern recognition: Part 1. k-Nearest neighbour
classification by using alternative voting rules". Analytica Chimica
Acta. 1982, 136: 15–27. doi:10.1016/S0003-2670(01)95359-0.
[11] J. Thomas and R. T. Princy, "Human heart disease prediction system
using data mining techniques," 2016 International Conference on
Circuit, Power and Computing Technologies (ICCPCT), Nagercoil,
2016, pp. 1-5. doi: 10.1109/ICCPCT.2016.7530265
[12] S. Tayeb, M. Pirouz, J. Sun, K. Hall, A. Chang, J. Li, C. Song, A.
Chauhan, M. Ferra, T. Sager, J. Zhan and S. Latifi, “Toward
predicting medical conditions using k-nearest neighbors”. 2017 IEEE
International Conference on Big Data (Big Data). [online] Available
at: https://ieeexplore.ieee.org/document/8258395 [Accessed 31 May
2019].
[13] A. Aizawa, "An information-theoretic perspective of tf–idf measures".
Information Processing and Management. 39 (1): 45–65. doi: 10.1016/S0306-4573(02)00021-3.
[14] Blei, David M.; Ng, Andrew Y.; Jordan, Michael I (January 2003).
Lafferty, John (ed.). "Latent Dirichlet Allocation". Journal of
Machine Learning Research. 3 (4–5): pp. 993–1022. doi: 10.1162/jmlr.2003.3.4-5.993.
[15] C. Zednik, “Solving the Black Box Problem: A Normative
Framework for Explainable Artificial Intelligence”, 2019. Available
at: https://arxiv.org/ftp/arxiv/papers/1903/1903.04361.pdf [Accessed
31 July 2019]

