Springer Parkinson
Janvi Malhotra¹, Khushal Thakur¹*, Divneet Singh Kapoor¹, Kiran Jot Singh¹ and Anshul Sharma¹
¹Kalpana Chawla Center for Research in Space Science & Technology, Chandigarh University, Mohali – 140413, Punjab, India
*Corresponding Author: khushal.ece@gmail.com
1 Introduction
are responsible for the disease. People who have family members with Parkinson’s disease are at elevated risk, as are people who have been exposed to certain pesticides or who have suffered prior head injuries. Parkinson’s disease typically occurs in people over the age of 60 [7, 9]. Males are affected more often, with a male-to-female ratio of about 3:2. Onset before the age of 50 is called early-onset Parkinson’s disease. Average life expectancy after diagnosis is reported to be between 7 and 15 years.
The cure for this disease is not known, but medications and therapies such as physiotherapy and speech therapy, especially in the preliminary stages, can significantly improve quality of life. Detecting Parkinson's disease in its preliminary stages can also reduce the estimated cost of pathology. One common early-stage symptom of Parkinson's disease is degradation of the voice [16], and the analysis of voice measurements is simple and non-invasive. Thus, voice measurements can be used for the diagnosis of Parkinson's disease. Data science is one approach to diagnosing Parkinson’s disease in its preliminary stages. Data science is the study of enormous amounts of data that uses systems, algorithms, and processes to extract meaningful and useful information from raw, structured, and unstructured data [4, 6]. Data is a precious asset for any organization; in data science, data is manipulated to derive new and meaningful information. In data science, knowledge is extracted from (typically large) datasets, and by applying that knowledge, several significant problems related to those datasets can be solved. Data science is playing a significant role in the healthcare industry: physicians use it to analyse their patients’ health and make weighty decisions, and it helps hospital management teams enhance care and reduce waiting times [25]. For this purpose, voice data has been collected via telemonitoring and tele-diagnosis systems, which are economical and easy to use. Furthermore, advances in technologies such as the Internet of Things, wireless sensor networks, and computer vision can be used to develop newer multi-domain solutions [10, 11, 19, 22–24].
2 Background
The first step in performing classification using machine learning algorithms is the collection of data. For this research, the data was downloaded from www.kaggle.com; it was originally collected for the UCI Machine Learning Repository. Some common features of the data are as follows. The dataset has a total of 756 instances and 754 features and contains data from both Parkinson’s disease patients and healthy people. It holds voice-based data from 188 Parkinson’s disease patients (107 men, 81 women) and 64 healthy individuals (23 men, 41 women). Three repetitions of sustained phonations were performed for each subject, which is how the 756 instances were formed.
The four models chosen for this problem statement were the Support Vector Machine (kernel=’linear’), Decision Tree Classifier, Random Forest Classifier, and Extra Trees Classifier. All these models are explained in detail below.
1. Support Vector Machine (SVM) [26]: The Support Vector Machine is an ML algorithm used in both classification and regression problem statements and is one of the most widely used machine learning algorithms. The SVM algorithm creates a line (decision boundary), as shown in Fig. 1, that segregates data points in N-dimensional space so that new data points can be assigned to a particular category. Extreme points, called support vectors, are considered while creating this decision boundary (hyperplane); these extreme cases give the Support Vector Machine model its name. If we have a dataset to be classified into two categories (red circles and green stars), there can be many candidate decision boundaries. The job of the SVM algorithm is to find the best decision boundary (the hyperplane) that separates the two classes most efficiently: the hyperplane is the decision boundary with the largest margin (the distance between the decision boundary and the closest points).
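The largest-margin idea can be sketched in a few lines. This is an illustrative example only, assuming scikit-learn is available; the one-dimensional toy data below is hypothetical and not from the Parkinson's dataset.

```python
# Illustrative sketch: a linear-kernel SVM on hypothetical toy data.
from sklearn.svm import SVC

# Two well-separated groups: class 0 near x = 0-2, class 1 near x = 10-12.
X = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear")  # hyperplane = boundary with the largest margin
clf.fit(X, y)

# The support vectors are the extreme points closest to the boundary.
print(clf.support_vectors_)
print(clf.predict([[1.5], [11.5]]))  # -> [0 1]
```

New points are assigned to whichever side of the learned hyperplane they fall on, which is exactly the segregation behaviour described above.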
2. Decision Tree Classifier (dt) [5]: The Decision Tree Classifier classifies data in a way that resembles human decision making. Its method is given in the flowchart shown in Fig. 2. To understand the logic behind the Decision Tree Classifier, we need to be familiar with some terms:
ROOT NODE (parent node): The starting point of the whole tree; the node from which the entire dataset is divided into sets.
LEAF NODE: The final nodes, which cannot be divided further.
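The root-node and leaf-node terms can be seen directly by printing a small fitted tree. A minimal sketch, assuming scikit-learn; the data and the feature name `voice_feature` are hypothetical.

```python
# Illustrative sketch: a small decision tree on hypothetical toy data.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[1], [2], [3], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# The first printed line is the ROOT NODE's split condition;
# the "class: ..." lines are the LEAF NODES.
print(export_text(tree, feature_names=["voice_feature"]))
print(tree.predict([[2.5], [10.5]]))  # -> [0 1]
```

For this perfectly separable data a single split suffices, so the root node's two children are already leaf nodes.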
3. Random Forest Classifier (rf) [18]: The Random Forest Classifier works on the principle of ensemble learning, in which multiple classifiers work together to predict the result and so improve the model's performance. In a Random Forest Classifier there are multiple decision trees working together, and the class with the maximum votes across all decision trees becomes the final output. The working of the Random Forest Classifier has two phases. In the first phase, the algorithm builds a decision tree on randomly selected points from the training data, repeated for N trees. In the second phase, every decision tree makes its prediction for new data, and the final classified category is the one with the maximum number of votes. It is explained in Fig. 3.
4. Extra Trees Classifier (et) [1]: The Extra Trees Classifier is also based on the Decision Tree Classifier and is conceptually remarkably similar to the Random Forest Classifier. In this algorithm, many decision trees (without pruning) are trained on the training data, and the final output is the majority of the predictions made by all decision trees individually. There are two differences between the Extra Trees Classifier and the Random Forest Classifier: the Extra Trees Classifier does not bootstrap observations, and its nodes are split at random cut-points rather than at the best split.
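The first of these two differences is visible directly in the estimators' defaults, assuming scikit-learn is the implementation used.

```python
# Sketch: the bootstrapping difference between the two ensembles,
# as exposed by scikit-learn's default parameters.
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

rf = RandomForestClassifier()
et = ExtraTreesClassifier()

print(rf.bootstrap)  # True  -> each tree is trained on a bootstrap sample
print(et.bootstrap)  # False -> each tree sees the full training set

# The second difference (random rather than best splits) lives in the
# tree-building procedure itself and has no single parameter to print.
```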
3 Methodology
The flow diagram for building a model for voice-feature-based detection of Parkinson's disease using machine learning algorithms, presented as a sequence of model-building steps, is given in Fig. 4.
Each step in the diagram is explained individually below.
A. Data Pre-Processing
This step is a combination of two processes: outlier removal and feature selection. Both are explained below:
Outlier Removal [8]: An outlier is something that differs markedly from the rest of the data. Most machine learning algorithms are affected when some attribute values are not in the same range as the others. Outliers are mostly the result of errors during data collection, whether in measurement or execution, and these extreme values lie far away from the other observations. Machine learning models can be misled by outliers, which cause various problems during training and eventually yield a less efficient model. There are many diverse ways to remove outliers. Some outliers were observed in the Parkinson's disease dataset, and an attempt was made to remove them, using the most important features for each specific machine learning algorithm as a base. After performing this step, the number of instances in the dataset decreases.
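One common removal rule (not necessarily the one used in this work) is the interquartile-range fence: values outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] are dropped. A minimal sketch with simple index-based quartiles and hypothetical feature values:

```python
# IQR-rule outlier removal (one of many possible techniques).
def iqr_filter(values):
    """Keep only values inside the 1.5 * IQR fences."""
    s = sorted(values)
    n = len(s)
    q1 = s[n // 4]          # simple index-based lower quartile
    q3 = s[(3 * n) // 4]    # simple index-based upper quartile
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if lo <= v <= hi]

feature = [4.1, 4.3, 4.0, 4.2, 4.4, 25.0]  # 25.0 is an obvious outlier
print(iqr_filter(feature))  # -> [4.1, 4.3, 4.0, 4.2, 4.4]
```

As the text notes, the number of instances decreases after this step, since the filtered list is shorter than the input.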
Feature Selection [17]: Data collected today is often high-dimensional and rich in information; it is quite common to encounter a data frame with hundreds of attributes. Feature selection is a technique for selecting the most prominent features from the n given features. Feature selection is important for several reasons:
• While training models, the time taken grows rapidly as the number of features increases.
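One way to perform this selection, used later in this work with the Extra Trees Classifier, is SelectFromModel, which keeps the features whose importances exceed a threshold. A minimal sketch assuming scikit-learn and NumPy; the synthetic data and parameter choices are illustrative only.

```python
# Sketch: tree-importance-based feature selection with SelectFromModel.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))               # 20 candidate features
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # only 2 features are informative

selector = SelectFromModel(ExtraTreesClassifier(n_estimators=50, random_state=0))
X_reduced = selector.fit_transform(X, y)     # keep above-threshold features
print(X.shape, "->", X_reduced.shape)
```

Training subsequent models on `X_reduced` rather than `X` addresses the training-time concern raised in the bullet above.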
C. Model Evaluation
To find the best of all our proposed models, model evaluation is necessary. Once all the models are trained and tested, the next step is evaluation to find the machine learning model best suited to the given problem. Different performance metrics are used to evaluate the most efficient model, namely accuracy, precision, F1-score, recall, and the AUC-ROC curve. Performance metrics judge whether the models are improving or not. To get a correct evaluation of a machine learning model, the metrics should be chosen very carefully.
Confusion Matrix: The problem we are solving is a classification problem, and one of the most preferred ways to evaluate a classification model is the confusion matrix, which summarizes the performance of machine learning models. The confusion matrix is the most intuitive performance metric because it evaluates the performance of a model with count values. The technique is used in both multiclass and binary classification. The confusion matrix for binary classification is explained below (Fig. 5). It is a two-dimensional table divided into four parts, each of which is explained below:
True Positive: This measures the ability of a machine learning algorithm to classify positive instances as positive. The true positive section counts instances that are predicted as Label 1 and actually belong to Label 1. It is expressed as the True Positive Rate (TPR), often called sensitivity, which is the proportion of correctly predicted positive samples to actual positive samples.
Sensitivity = TP / (TP + FN) (1)
True Negative: This measures the ability of a machine learning algorithm to classify negative instances as negative. The true negative section counts instances that are predicted as Label 0 and actually belong to Label 0. It is expressed as the True Negative Rate (TNR), often called specificity, which is the proportion of correctly predicted negative samples to actual negative samples.
Specificity = TN / (TN + FP) (2)
False Positive: This is a case in which the model makes a false prediction. This section counts instances that are predicted as Label 1 but actually belong to Label 0. It is expressed as the False Positive Rate (FPR), the proportion of negative cases predicted as positive to the actual negative cases.
FPR = FP / (TN + FP) (3)
False Negative: This is also a case in which the model makes a false prediction. This section counts instances that are predicted as Label 0 but actually belong to Label 1. It is expressed as the False Negative Rate (FNR), the proportion of positive cases predicted as negative to the actual positive cases.
FNR = FN / (TP + FN) (4)
Accuracy: Accuracy decides how accurately our ML model is working; it is all about correct predictions. Hence it is the proportion of correct predictions to total predictions.
Accuracy = (TP + TN) / (TP + TN + FP + FN) (5)
Precision: Precision considers the accuracy of the positively predicted class. It is the ratio of correctly predicted positive instances to total predicted positive instances.
Precision = TP / (TP + FP) (6)
Recall: Recall is another name for the sensitivity of the confusion matrix.
Recall = TP / (TP + FN) (7)
F1-Score: This is another performance metric, formed by the harmonic mean of recall and precision.
F1-score = 2·TP / (2·TP + FP + FN) (8)
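Equations (1)-(8) are simple arithmetic on the four confusion-matrix counts, so they can be checked directly. The counts below are hypothetical, not results from this study.

```python
# Metrics (1)-(8) computed from hypothetical confusion-matrix counts.
TP, TN, FP, FN = 90, 10, 4, 2

sensitivity = TP / (TP + FN)                   # Eq. (1), also called recall
specificity = TN / (TN + FP)                   # Eq. (2)
fpr         = FP / (TN + FP)                   # Eq. (3)
fnr         = FN / (TP + FN)                   # Eq. (4)
accuracy    = (TP + TN) / (TP + TN + FP + FN)  # Eq. (5)
precision   = TP / (TP + FP)                   # Eq. (6)
f1          = 2 * TP / (2 * TP + FP + FN)      # Eqs. (7)-(8)

print(round(accuracy, 4), round(precision, 4), round(f1, 4))
```

Note that sensitivity + FNR = 1 and specificity + FPR = 1, which is a quick sanity check on any reported confusion-matrix metrics.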
AUC-ROC Curve: The AUC (area under the curve) of the ROC (receiver operating characteristics) curve is one important way to evaluate the performance of a classifier, and one of the most important metrics for model evaluation. The ROC is a probability curve, and the AUC measures separability; a better machine learning model has a higher AUC. Its value lies between 0 and 1. The ROC curve plots the True Positive Rate (y-axis) against the False Positive Rate (x-axis). If, say, the AUC of a model comes out to be 0.8, there is an 80% chance that the model can distinguish the Label 1 class from the Label 0 class.
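The probabilistic reading of AUC can be computed directly: it equals the chance that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (ties counting half). A minimal sketch with hypothetical model scores:

```python
# AUC from its probabilistic definition: the fraction of
# (positive, negative) pairs ranked correctly by the model's scores.
def auc(pos_scores, neg_scores):
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0     # pair ranked correctly
            elif p == n:
                wins += 0.5     # tie counts half
    return wins / (len(pos_scores) * len(neg_scores))

pos = [0.9, 0.8, 0.6]   # model scores for Label-1 samples
neg = [0.7, 0.3, 0.1]   # model scores for Label-0 samples
print(auc(pos, neg))    # 8 of 9 pairs ranked correctly -> 0.888...
```

This pairwise count is exactly the "chance the model can distinguish the two classes" interpretation given above; production code would use a rank-based formula for efficiency, but the value is the same.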
4 Results
In our trained model, out of a total of 106 test subjects, 92 are Parkinson's disease positive and 14 are healthy, and the trained model predicts 99 as Parkinson’s positive and 7 as healthy. The confusion matrix of the Extra Trees Classifier after feature selection is given below (Fig. 7). Improvement is clearly visible after feature selection: without it there were 7 cases that were not Parkinson's disease positive but were predicted as affected, whereas now the number has reduced to six, and consequently one more true negative case is gained (now 98 cases are predicted positive, of which 96 are true positives, and 8 cases are predicted negative).
ROC curves for the Random Forest Classifier and the Extra Trees Classifier after applying feature selection are depicted in Fig. 8 and Fig. 9. The area under the curve for the Random Forest Classifier is 0.91, while for the Extra Trees Classifier it is 0.96. Both results are quite satisfactory, but on balance the Extra Trees Classifier is better than all the other machine learning models in our work. The performance metrics of the ExtraTreesClassifier combined with SelectFromModel are excellent, and even compared to some already proposed models this machine learning model performs significantly better. A comparative analysis of our proposed model and some existing models for Parkinson’s disease detection is given in Table 3. It can therefore be observed that the Extra Trees Classifier, when combined with the SelectFromModel feature selection technique, works very efficiently.
• The Extra Trees Classifier works efficiently for detecting Parkinson’s disease, and its efficiency increases further when combined with the SelectFromModel feature selection technique.
• It is strongly recommended to apply a feature selection technique in machine learning, because a large number of features can make the training process complex.
Though the results for our proposed machine learning model are quite satisfactory, there is always scope for improvement. In the future, the accuracy of the model could be increased by applying other techniques such as data balancing and cross-validation. Judging by its performance metrics, the model we propose is efficient on this dataset and can be relied upon for this problem statement.
References
https://doi.org/10.1063/1.4977376.
19. Sachdeva, P., Singh, K.J.: Automatic segmentation and area calculation of optic disc
in ophthalmic images. 2015 2nd Int. Conf. Recent Adv. Eng. Comput. Sci. RAECS
2015. (2016). https://doi.org/10.1109/RAECS.2015.7453356.
20. Sakar, B.E. et al.: Collection and analysis of a Parkinson speech dataset with multiple
types of sound recordings. IEEE J. Biomed. Health Informatics. 17, 4, 828–834 (2013).
https://doi.org/10.1109/JBHI.2013.2245674.
21. Sakar, C.O. et al.: A comparative analysis of speech signal processing algorithms for
Parkinson’s disease classification and the use of the tunable Q-factor wavelet
transform. Appl. Soft Comput. 74, 255–263 (2019).
https://doi.org/10.1016/J.ASOC.2018.10.022.
22. Sharma, A. et al.: Exploration of IoT Nodes Communication Using LoRaWAN in
Forest Environment. Comput. Mater. Contin. 71, 2, 6240–6256 (2022).
https://doi.org/10.32604/CMC.2022.024639.
23. Sharma, A., Agrawal, S.: Performance of Error Filters on Shares in Halftone Visual
Cryptography via Error Diffusion. Int. J. Comput. Appl. 45, 23–30 (2012).
24. Singh, K. et al.: Image retrieval for medical imaging using combined feature fuzzy
approach. 2014 Int. Conf. Devices, Circuits Commun. ICDCCom 2014 - Proc.
(2014). https://doi.org/10.1109/ICDCCOM.2014.7024725.
25. Subrahmanya, S.V.G. et al.: The role of data science in healthcare advancements:
applications, benefits, and future prospects. Ir. J. Med. Sci. 1–11 (2021).
https://doi.org/10.1007/S11845-021-02730-Z/FIGURES/5.
26. Zhang, Y.: Support vector machine classification algorithm and its application.
Commun. Comput. Inf. Sci. 308 CCIS, PART 2, 179–186 (2012).
https://doi.org/10.1007/978-3-642-34041-3_27.