Asthma Attack Prediction Models IEEE Conference

Machine Learning Models for Early Prediction of Asthma Attacks Based on Bio-signals and Environmental Triggers

Abstract—Asthma is a common respiratory disease affected by different bio-signals and environmental triggers. Early prediction of asthma attacks is crucial to saving a patient’s life. Several machine learning models have been designed to predict asthma attacks. However, few models have exploited both bio-signals and environmental triggers to build an asthma attack prediction model. Additionally, little attention has been devoted to feature selection algorithms and the variation of machine learning models. This study develops an asthma attack prediction model by testing different machine learning classifiers. The dataset includes two main parts: the bio-signals dataset, which was recorded daily from 21 volunteers for three months, and the environmental dataset, which is available online. The machine learning classifiers used are support vector machine, logistic regression, decision tree, random forest, and gradient boosting model. Each classifier was grid searched to find the best values for its primary hyper-parameters, and five-fold cross-validation was then used to train each model. Results show that the gradient boosting model outperforms the other classifiers when trained with a depth of 9 and a sub-sample of 0.5. The prediction of the testing set produces 97.2% accuracy and 97.1% recall.

Index Terms—Asthma attack, Bio-signals, Environmental, Prediction, Machine learning.

I. INTRODUCTION

Asthma is one of the most prevalent inflammatory diseases that can affect anyone at any age [1]. Many researchers have investigated the causes and risk factors of asthma and the likelihood of having an asthma attack. Bio-signals and environmental triggers were classified as risk triggers [2], [3]. The bio-signals are related to the patient’s health, including allergies, symptoms, and medical history. Any external stimulus, such as weather and air pollution, is considered an environmental trigger. The Air Quality Index (AQI), which is calculated using the concentrations of existing air contaminants, is generally used to assess the overall state of air pollution. Asthmatic people may experience health concerns if the AQI exceeds 51 [4].

Early prediction of asthma attacks is essential for enhancing the patient’s quality of life. Machine learning (ML) is a sub-field of artificial intelligence (AI) that uses algorithms to predict outcomes from a large amount of disease-related data [5]. Asthma attack prediction models based on machine learning approaches have been proposed in recent publications. Most of these studies [6]–[8] used bio-signals as triggers for predicting asthma attacks. Others [9]–[14] used weather and AQI to predict asthma attacks without considering the bio-signal triggers. Modern models should recognize individuals’ bio-signals and environmental triggers, which are required for a reliable asthma attack prediction model. We found that few models had used both bio-signals and environmental triggers to predict asthma attacks [15]–[17]. In addition, less consideration has been paid to feature selection algorithms and the variation of machine learning models.

To fill this gap, this study uses both bio-signals and environmental triggers to create an effective asthma attack prediction model. This study aims to find an optimum classifier for asthma attack prediction. Five different classifiers are compared regarding the accuracy and recall metrics. These classifiers are support vector machine (SVM), logistic regression (LR), decision tree (DT), random forest (RF), and gradient boosting model (GBM). In addition, the L1-based feature selection algorithm is applied with these techniques to provide more accurate models. This study also introduces a new asthma dataset that includes many bio-signal and environmental triggers.

The structure of this article is as follows. The related work is introduced in Section II. The study’s methodology is explained in Section III. The findings of all classifiers are introduced in Section IV. Section V introduces the conclusion and future research.

II. RELATED WORK

Recent studies utilized different triggers and classifiers to build a prediction model for asthma attacks. Finkelstein and Jeong [7], [8] developed models that predict asthma attacks using multiple bio-signal triggers without considering the environmental triggers. Their first study [7] compared different machine learning classifiers: SVM, naive Bayesian (NB), and adaptive Bayesian network (ABN). The outcomes demonstrated that the ABN classifier had greater specificity and sensitivity (100 %) than the other classifiers. Their second study [8] developed a prediction model for asthma exacerbation using the Classification and Regression Tree (CART) algorithm. The model predicts whether the asthma status for the upcoming day will be normal or abnormal. The final model achieved 80 % accuracy, 64 % sensitivity, and 97 % specificity.

Another study by Lee et al. [15] demonstrated how applying several predictors from various sources can improve the model’s performance. Thirty-seven attributes were used, including multiple bio-signals and environmental triggers. The utilized ML algorithms were Pattern-Based Class-Association Rule (PBCAR) and Pattern-Based Decision Tree (PBDT).
According to the model’s performance, asthma attacks may be predicted with an accuracy of 0.87. The experimental results demonstrate that PBDT, with an accuracy of 87 %, performs somewhat better than PBCAR, with an accuracy of 86 %.

Kaffash-Charandabi et al. [16] developed another prediction model for asthma attacks. They used a dataset that includes bio-signals data obtained from three participants in addition to the environmental triggers. The SVM algorithm was used, and the model’s accuracy was 93 %. However, it is uncertain how well the model would perform if tested on a larger population because it was only trained and tested on three people.

Khasha et al. [17] conducted another comparison between five different classifiers: RF, DT, SVM, LR, and GBM. The decision tree classifier outperformed all other classifiers. However, the classifiers’ performance was not reported precisely: the results were conveyed only through figures, without numerical values.

Hosseini et al. [18] proposed another model that utilizes only two bio-signal triggers with AQI. They used the RF classifier and 10-fold cross-validation to categorize the likelihood of experiencing an asthma attack into low, medium, or high risk. The suggested model achieved an accuracy of 80 %.

According to these studies, few asthma attack prediction models utilize both bio-signal and environmental triggers. Adding extra triggers as features significantly improves model performance, as seen in [15]. Also, none of the related works utilized any feature selection algorithm, which may provide better model performance. Table I provides a summary of the related work with their limitations. In this paper, multiple triggers are considered, and an additional feature selection stage is applied. We compare different machine learning classifiers to build an effective asthma attack prediction model, since different classifiers produce different results.

TABLE I
SUMMARY OF RELATED WORK MODELS WITH THEIR LIMITATIONS

| Reference | Utilized triggers | ML classifiers | Performance | Limitations |
|---|---|---|---|---|
| [7] | Bio-signals | NB, ABN, and SVM | ABN (best): sensitivity 100 %, specificity 100 % | Did not use environmental triggers. |
| [8] | Bio-signals | CART | Specificity 97 %, sensitivity 64 %, accuracy 80 % | Did not use environmental triggers and has low accuracy. |
| [15] | Bio-signals and environmental | PBDT and PBCAR | PBDT (best): accuracy 87 % | Accuracy is low and the recall was not specified. |
| [16] | Bio-signals and environmental | SVM | Accuracy 93 % | Participants were limited to three only. |
| [17] | Bio-signals and environmental | RF, DT, SVM, LR, and GBM | Not specified | The performance of the models was not specified in numerical values. |
| [18] | Bio-signals and environmental | RF | Accuracy 80 % | Few utilized triggers. |

III. RESEARCH METHODOLOGY

The research methodology includes three main stages. The first stage consists of generating and preparing an integrated dataset that includes both bio-signals and environmental triggers. The second stage uses the L1-based feature selection algorithm, which helps find the best feature set that contributes to the model. Then, the selected feature set is used to train the model through the five selected classifiers using grid search and 5-fold cross-validation to find the average accuracy and recall. The third stage is selecting the best-trained model and testing it to find the final prediction results. Fig. 1 shows the study methodology as a schematic flow chart. The following subsections explain each stage in detail.

Fig. 1. The proposed asthma attack prediction model.

A. Stage 1: Data generation and preparation

The dataset used in this study integrates both patients’ bio-signal data and environmental factors. The bio-signal data was collected through Google Forms. It involves 21 volunteers older than 12 years with different levels of asthma severity (mild, moderate, and severe). The dataset has two main parts: historical and daily records. The historical data consist of the patients’ diagnostic data for the previous four weeks, and the daily records include asthma symptoms that were recorded for the next three months (from 24 March to 30 June 2021). Most of the collected data is categorical and is based on a scale of five: never, rarely, sometimes, often, and always. The environmental data was collected online [19]. It includes weather variables and AQI. The two datasets were then integrated based on patients’ locations and the date.

The integrated dataset includes 655 entries with various bio-signal and environmental features, as shown in Table II. The target is whether the user had an asthma attack (class 1) or not (class 0). The data was cleaned by replacing the missing values with the means, and standard normalization was then used to generate the final dataset. The integrated dataset was imbalanced, since the ratio between class 0 and class 1 is 7:58. To solve this issue, the Synthetic Minority Over-sampling Technique using Support Vector Machine (SMOTE-SVM) [20] was applied to reach a balanced dataset with a ratio of 1:1.

B. Stage 2: Feature Selection and Classifiers Training

Feature selection reduces the dataset’s dimensionality by deleting irrelevant and unnecessary features from the original dataset [21]. The remaining features are fed into the prediction model as input. This pre-processing stage can improve the classification accuracy of the prediction model.

LASSO regularization (L1) [22] is one of the most common feature selection algorithms. It applies a penalty to the machine learning model’s parameters to prevent over-fitting. In a regularized linear model, the penalty is applied to the coefficients that multiply each feature. Unlike other forms of regularization, the Lasso (L1) technique can shrink some of the coefficients to zero. As a result, the model can be trained without the corresponding features.

This work applies the L1 algorithm to the integrated dataset to reduce the set of features. The resulting dataset is split into training and testing sets with a ratio of 80:20. Five classifiers were then trained on the data: SVM, LR, DT, RF, and GBM. Grid search was used to tune the hyper-parameters and thus increase the classifier performance. What follows is a quick overview of each classifier and its tuned hyper-parameters with the searched values.
• Decision Tree (DT) incrementally builds a classification model as a tree composed of a root, internal nodes, and leaf nodes. It classifies instances by sorting them from the tree’s root to a leaf node based on a given criterion [23]. The most important parameters tuned in DTs are the tree depth and the split criterion. We used different depth values (2, 4, 6, 8, 10, 12) and two main criteria in the grid search, gini and entropy.
• Random Forest (RF) is an ensemble machine learning approach that fits a number of decision tree classifiers on various dataset sub-samples. It uses averaging to raise prediction accuracy and reduce over-fitting [24]. Usually, each tree is built using a subset of the dataset based on
the sub-sample value. The parameters used in the grid search are max-samples and max-features. Max-samples determines the size of the subset given to each individual tree; the values used were 10, 100, and 1000. Max-features controls the features provided to each tree in the RF; the options used were sqrt and log2.
• Gradient Boosting Model (GBM) is an ensemble machine learning approach for classification and regression [25]. It uses a combination of weak prediction models, often DTs, to build a stronger prediction model. It updates the predictions using a loss function, such as mean squared error (MSE), so that the sum of the residuals is relatively close to zero (or minimal), and the projected values are sufficiently similar to the actual values. The GBM model was grid searched to find the best maximum depth, which indicates how deep each tree can be, and the best sub-sample, which sets the fraction of samples used to fit each base learner. The tested values of max depth and sub-sample are (3, 7, 9, 11) and (0.5, 0.7, 1), respectively.
• Logistic Regression (LR) is a method that predicts the target (dependent) variable by analyzing the relationship between multiple independent variables [26]. The output of this method is a probability representing the observation categories. We used two main parameters in the grid search to tune the LR model: the solver and C. The solvers are the algorithms that can be used in the optimization problem, and C is the inverse of the regularization strength. The solvers used were newton-cg, lbfgs, and liblinear, and the C values were (0.01, 0.1, 1, 10, 100).
• Support Vector Machine (SVM) works by deriving support vectors from a portion of the training data. The support vectors are the examples that are most likely to be found at the maximum margin hyperplane, which offers the greatest separation between the classes [27]. The support vectors are generated by using constrained quadratic optimization. Due to the maximum margin hyperplane’s relative stability, even in the high-dimensional space spanned by nonlinear transformations, support vector machines are advantageous in situations with nonlinear class borders and avoid the over-fitting issue. The parameters that play the most significant role in SVM are C and the kernel, which transforms the data into different forms. We performed a grid search over different C values (0.1, 1, 10, 100, 1000) and the main kernels (radial basis function (RBF), polynomial, sigmoid, and linear).

Each of these classifiers is trained with two datasets: the original data with all features and the L1-based selected features. The training process uses 5-fold cross-validation to get more information about their performance.

TABLE II
LIST OF DATASET FEATURES

| Category | Features |
|---|---|
| Patients’ diagnostic record | Family members with asthma, family members with allergies, controlling asthma during last four weeks, existence of chronic disease other than asthma, allergies, smoking status, normal PEFR, sleep disturbances during last four weeks, cough during last four weeks, chest tightness during last four weeks, use of asthma medication during last four weeks, physician visits during last four weeks, work ability during last four weeks, and number of asthma attacks during last four weeks. |
| Daily asthma symptoms | Cough, chest tightness, sleep disturbances, work ability, breath difficulty, self-smoking or surrounding smoke, medication usage, exposure to external triggers, exposure to internal triggers, exposure to bio-signals triggers, PEFR reading, and PEFR label. |
| Environmental dataset: weather | Temperature, relative humidity, wind speed. |
| Environmental dataset: air pollution | AQI |
| Target | Asthma attack |

C. Stage 3: Testing

After obtaining the best parameters for each prediction model, the models were tested using 5-fold cross-validation. The performance was measured using accuracy, recall, and area under the curve (AUC-ROC) metrics. Accuracy shows how often the model produces a correct result overall. Recall is the most crucial metric in any healthcare model, as it measures the model’s ability to identify actual positive cases. A recall close to 1.00 implies that the model identifies nearly all patients who have the disease. The area under the curve (AUC-ROC) is another essential metric, which summarizes the area under the receiver operating characteristic (ROC) curve as an integral or an approximation. An AUC close to 1.0 indicates a high level of separability between the classes. Accuracy and recall were the primary metrics used to compare all the classifiers. The best results indicate the best model.

IV. RESULTS AND DISCUSSION

This study was designed to build an accurate asthma attack prediction model using multiple bio-signals and environmental triggers.

The original dataset had 30 features: 26 bio-signals, three weather, and one air pollution. L1-based feature selection was used to find the features contributing to the model performance and eliminate unnecessary features. After L1-based feature selection, 22 features remain. The eliminated features are: family members with allergies, last four weeks physician visits, last four weeks work ability, existence of chronic disease other than asthma, exposure to external triggers, PEFR percentage, visiting physician, and wind speed.

Both the original dataset and the L1-based selected dataset were used to evaluate the performance of the DT, RF, GBM, LR, and SVM classification models. Each of these classifiers was grid searched over its most important parameters to find the values that best fit the model and improve its performance. The results of the grid search provide the optimized parameters, which were used to save the best models and then test them with the
test set. The model’s performance was measured by calculating the model’s accuracy and recall.

The grid search for the DT classifier using the original dataset provides the best recall with a depth of 10 and entropy as the split criterion, see Fig. 2(a). The recall of DT using the L1-based dataset increased slightly with the Gini criterion and a depth of 12, see Fig. 2(b). The mean accuracy of the five-fold cross-validation using the L1-based dataset is 88 %, and the mean recall is 89 %. The accuracy and recall on the test set are both 89 %.

The best parameters for RF are max-samples of 100 and max-features of log2, see Fig. 3(a). The L1-based dataset increased the model recall significantly using the same parameters, see Fig. 3(b). The mean accuracy of the five-fold cross-validation using the L1-based dataset is 91 %, and the recall is 92 %. Testing results show the model’s ability to predict unseen data with 92 % accuracy and 93 % recall.

The grid search of the LR classifier produced the best performance with a value of 10 for the C parameter, giving the same results for all solvers, see Fig. 4(a). The remaining CV results show a recall between 90 and 94 % using the original dataset. With the L1-based dataset, the cross-validation results show a recall between 92.4 and 93.5 %, which demonstrates an improvement in model performance, see Fig. 4(b). The mean accuracy of the five-fold cross-validation using the L1-based dataset is 94.3 %, and the recall is 93 %. Testing results reveal the model’s ability to predict unseen data with 94 % accuracy and 95 % recall.

The GBM exhibits the best results compared to the other classifiers. The optimized parameters for the models trained using the original dataset are 9 for depth and 0.5 for sub-sample, see Fig. 5(a). Using the same depth and sub-sample with the L1-based dataset increases the model recall, see Fig. 5(b). The mean accuracy of the five-fold cross-validation is 97 %, and the recall is 98 %. Testing results show the model’s ability to predict unseen data with 97 % accuracy and 98 % recall.

The grid search result shows that SVM works better with the polynomial kernel and C = 0.1 using the original dataset, see Fig. 6(a). The same parameters were used with the L1-based dataset, and the recall increased significantly, see Fig. 6(b). The mean accuracy and recall of the five-fold cross-validation using the L1-based feature set are both 91 %. Testing results show the model’s ability to predict unseen data with 91 % accuracy and 92 % recall. The performance of all classifiers using the original dataset is displayed in Table III, and the performance using the L1-based selected features is displayed in Table IV. We can see that the DT classifier has the lowest performance compared to the other classifiers, while GBM outperforms all other classifiers.

V. CONCLUSION

This study produces a new dataset including patients’ diagnostic data and recorded daily bio-signals. This dataset is coupled with the environmental data according to patients’ locations to build an integrated dataset. An asthma attack prediction model is developed by comparing five machine learning classifiers: support vector machine, logistic regression, decision tree, random forest, and gradient boosting model. The testing results demonstrate that the gradient boosting model achieves the best prediction with the highest recall and accuracy, 97.2% and 97.1%, respectively. In future research, we plan to use the created model to develop an application that helps predict personalized asthma attacks dynamically by obtaining the potential triggers in real time using different sensors. Such an application can alert users and help them survive sudden asthma attacks.

ACKNOWLEDGMENT

The authors would like to express their special gratitude to the volunteers and to the research center of the Ministry of Health in the Makkah region, who agreed to this research and helped us collect the dataset.

REFERENCES

[1] P. J. Barnes, K. F. Chung, and C. P. Page, “Inflammatory mediators of asthma: an update,” vol. 50, no. 4, pp. 515–596.
[2] Asthma triggers and management | AAAAI. [Online]. Available: https://www.aaaai.org/Tools-for-the-Public/Conditions-Library/Asthma/Asthma-Triggers-and-Management-TTR
[3] Asthma risk factors | American Lung Association. [Online]. Available: https://www.lung.org/lung-health-diseases/lung-disease-lookup/asthma/asthma-symptoms-causes-risk-factors/asthma-risk-factors
[4] AQI basics | AirNow.gov. [Online]. Available: https://www.airnow.gov/aqi/aqi-basics/
[5] Machine Learning in Radiation Oncology. [Online]. Available: https://link.springer.com/book/10.1007/978-3-319-18305-3
[6] H. Tibble, A. Tsanas, E. Horne, R. Horne, M. Mizani, C. R. Simpson, and A. Sheikh, “Predicting asthma attacks in primary care: protocol for developing a machine learning-based prediction model,” vol. 9, no. 7. [Online]. Available: https://bmjopen.bmj.com/content/9/7/e028375
[7] J. Finkelstein and I. C. Jeong, “Machine learning approaches to personalize early prediction of asthma exacerbations,” vol. 1387, no. 1, pp. 153–165.
[8] J. Finkelstein and I. C. Jeong, “Using CART for advanced prediction of asthma attacks based on telemonitoring data,” in 2016 IEEE 7th Annual Ubiquitous Computing, Electronics Mobile Communication Conference (UEMCON). IEEE, pp. 1–5.
[9] W. P. Jayawardene, A. H. Youssefagha, D. K. Lohrmann, and G. S. El Afandi, “Prediction of asthma exacerbations among children through integrating air pollution, upper atmosphere, and school health surveillances,” vol. 34, no. 1, pp. e1–8.
[10] M. Guarnieri and J. R. Balmes, “Outdoor air pollution and asthma,” vol. 383, no. 9928, p. 1581. [Online]. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4465283/
[11] P. Orellano, N. Quaranta, J. Reynoso, B. Balbi, and J. Vasquez, “Effect of outdoor air pollution on asthma exacerbations in children and adults: Systematic review and multilevel meta-analysis,” vol. 12, no. 3, p. e0174050.
[12] E. Alharbi and M. Abdullah, “Asthma attack prediction based on weather factors,” vol. 7, no. 1, pp. 408–419. [Online]. Available: http://pen.ius.edu.ba/index.php/pen/article/view/422
[13] P. L. Delamater, A. O. Finley, and S. Banerjee, “An analysis of asthma hospitalizations, air pollution, and weather conditions in Los Angeles County, California,” vol. 425, pp. 110–118. [Online]. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4451222/
[14] N. Mireku, Y. Wang, J. Ager, R. C. Reddy, and A. P. Baptist, “Changes in weather and the effects on pediatric asthma exacerbations,” vol. 103, no. 3, pp. 220–224.
[15] C.-H. Lee, J. C.-Y. Chen, and V. S. Tseng, “A novel data mining mechanism considering bio-signal and environmental data with applications on asthma monitoring,” vol. 101, no. 1, pp. 44–61. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0169260710001136
[16] N. Kaffash-Charandabi, A. A. Alesheikh, and M. Sharif, “A ubiquitous asthma monitoring framework based on ambient air pollutants and individuals’ contexts,” vol. 26, no. 8, pp. 7525–7539.
[17] R. Khasha, M. M. Sepehri, S. A. Mahdaviani, and T. Khatibi, “Mobile GIS-based monitoring asthma attacks based on environmental factors,” vol. 179, pp. 417–428. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S095965261830057X
[18] A. Hosseini, C. M. Buonocore, S. Hashemzadeh, H. Hojaiji, H. Kalantarian, C. Sideris, A. A. Bui, C. E. King, and M. Sarrafzadeh, “HIPAA compliant wireless sensing smartwatch application for the self-management of pediatric asthma,” vol. 2016, pp. 49–54. [Online]. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5772883/
[19] Real-time air quality index (AQI) & pollen report - air matters. [Online]. Available: https://air-quality.com/
[20] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” vol. 16, pp. 321–357. [Online]. Available: https://www.jair.org/index.php/jair/article/view/10302
[21] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer.
[22] N. Gauraha, “Introduction to the LASSO,” vol. 23, no. 4, pp. 439–464. [Online]. Available: https://doi.org/10.1007/s12045-018-0635-x
[23] L. Rokach and O. Maimon, “Decision trees,” in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds. Springer US, pp. 165–192. [Online]. Available: https://doi.org/10.1007/0-387-25465-X-9
[24] Y. Liu, Y. Wang, and J. Zhang, “New machine learning algorithm: Random forest,” in Information Computing and Applications, ser. Lecture Notes in Computer Science, B. Liu, M. Ma, and J. Chang, Eds. Springer, pp. 246–252.
[25] J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” vol. 29, no. 5, pp. 1189–1232. [Online]. Available: https://projecteuclid.org/journals/annals-of-statistics/volume-29/issue-5/Greedy-function-approximation-A-gradient-boosting-machine/10.1214/aos/1013203451.full
[26] F. C. Pampel, Logistic Regression: A Primer, 1st ed. SAGE Publications, Inc.
[27] D. L. Olson and D. Delen, Support Vector Machines. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, pp. 111–123. [Online]. Available: https://doi.org/10.1007/978-3-540-76917-0_7
Fig. 2. Grid search for the DT model and its hyper-parameters using (a) all features and (b) the L1-based feature set.

Fig. 3. Grid search for the RF model and its hyper-parameters using (a) all features and (b) the L1-based feature set.

Fig. 4. Grid search for the LR model and its hyper-parameters using (a) all features and (b) the L1-based feature set.
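
The L1-based feature sets named in these captions depend on the Lasso penalty driving some coefficients exactly to zero. The following is a minimal sketch of that mechanism, assuming the simplified case of orthonormal features, where the Lasso solution reduces to soft-thresholding of the unpenalized coefficients. The coefficient values, the penalty of 0.1, and the function names are hypothetical illustrations and are not taken from the paper, which does not state the implementation or penalty strength it used.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; values within [-penalty, penalty] become 0.0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def select_features(coefficients, penalty):
    """Return the shrunken coefficients and the names of features that stay non-zero."""
    shrunk = {name: soft_threshold(w, penalty) for name, w in coefficients.items()}
    kept = [name for name, w in shrunk.items() if w != 0.0]
    return shrunk, kept


# Hypothetical unpenalized coefficients, for illustration only (not from the paper).
example_coefficients = {
    "cough": 0.9,
    "chest_tightness": -0.6,
    "wind_speed": 0.05,
    "aqi": 0.4,
}

shrunk, kept = select_features(example_coefficients, penalty=0.1)
print(kept)
```

In this toy input only wind_speed has a magnitude below the penalty, so it is the only feature dropped; a larger penalty would remove more features.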

TABLE III
COMPARISON OF THE ML CLASSIFIERS USING ALL FEATURE SET (VALUES IN %)

| ML classifier | Cross-validation accuracy | Cross-validation recall | Cross-validation AUC-ROC | Test set accuracy | Test set recall | Test set AUC-ROC |
|---|---|---|---|---|---|---|
| Decision tree | 87 | 88 | 87 | 88 | 89 | 88 |
| Random forest | 88 | 87 | 88 | 90 | 91 | 90 |
| Gradient boosting model | 95 | 95 | 95 | 96 | 95 | 95 |
| Logistic regression | 92 | 93 | 92 | 93 | 93 | 93 |
| Support vector machine | 87 | 89 | 87 | 88 | 90 | 88 |
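
The accuracy, recall, and AUC-ROC columns follow the standard definitions of these metrics. The sketch below shows one way to compute them on a small toy example, assuming binary labels with class 1 as the asthma-attack class. The labels, scores, and the 0.5 decision threshold are hypothetical, and the AUC is computed with the pairwise-ranking formulation (the probability that a random positive is scored above a random negative), which equals the area under the ROC curve.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, true negatives, false positives, and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of all predictions that are correct."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def recall(y_true, y_pred):
    """Fraction of actual positives (class 1) that the model identifies."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison; ties count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted scores, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(accuracy(y_true, y_pred), recall(y_true, y_pred), roc_auc(y_true, scores))
```

The sketch omits guards for inputs that contain only one class, where recall and AUC are undefined.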
Fig. 5. Grid search for the GBM model and its hyper-parameters using (a) all features and (b) the L1-based feature set.

Fig. 6. Grid search for the SVM model and its hyper-parameters using (a) all features and (b) the L1-based feature set.
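
The grid searches shown in Figs. 2 to 6 evaluate every combination of the listed hyper-parameter values with 5-fold cross-validation. As a rough sketch of the search space, the snippet below enumerates those combinations and counts the model fits per classifier. The values are the ones listed in the text; the dictionary keys (such as max_depth and subsample) are assumed scikit-learn-style names chosen for illustration, since the paper does not name its library or exact argument names.

```python
from itertools import product

# Searched values as listed in the text; key names are assumed, scikit-learn-style labels.
PARAM_GRIDS = {
    "DT": {"max_depth": [2, 4, 6, 8, 10, 12], "criterion": ["gini", "entropy"]},
    "RF": {"max_samples": [10, 100, 1000], "max_features": ["sqrt", "log2"]},
    "GBM": {"max_depth": [3, 7, 9, 11], "subsample": [0.5, 0.7, 1]},
    "LR": {"solver": ["newton-cg", "lbfgs", "liblinear"], "C": [0.01, 0.1, 1, 10, 100]},
    "SVM": {"C": [0.1, 1, 10, 100, 1000], "kernel": ["rbf", "poly", "sigmoid", "linear"]},
}
N_FOLDS = 5  # five-fold cross-validation, as in the paper


def expand_grid(grid):
    """Return every combination of the grid's values as a list of parameter dicts."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


candidates = {model: expand_grid(grid) for model, grid in PARAM_GRIDS.items()}
fits = {model: len(combos) * N_FOLDS for model, combos in candidates.items()}

for model, n_fits in fits.items():
    print(model, len(candidates[model]), "combinations,", n_fits, "fits")
```

Under these assumptions the SVM grid is the largest, with 20 combinations, and the whole search amounts to 325 fits per feature set.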

TABLE IV
COMPARISON OF THE ML CLASSIFIERS USING L1-BASED SELECTED FEATURE SET (VALUES IN %)

| ML classifier | Cross-validation accuracy | Cross-validation recall | Cross-validation AUC-ROC | Test set accuracy | Test set recall | Test set AUC-ROC |
|---|---|---|---|---|---|---|
| Decision tree | 88 | 89 | 88 | 89 | 89 | 89 |
| Random forest | 91 | 92 | 91 | 92 | 93 | 92 |
| Gradient boosting model | 97 | 98 | 97 | 97 | 98 | 97 |
| Logistic regression | 93 | 93 | 93 | 94 | 95 | 94 |
| Support vector machine | 91 | 91 | 91 | 91 | 92 | 91 |
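
To make the comparison between the two feature sets concrete, the short sketch below computes the change in test-set recall for each classifier. The numbers are copied from the test-set recall columns of Tables III and IV; the variable names are illustrative only.

```python
# Test-set recall (%) copied from Table III (all features) and Table IV (L1-based features).
TEST_RECALL_ALL = {
    "Decision tree": 89,
    "Random forest": 91,
    "Gradient boosting model": 95,
    "Logistic regression": 93,
    "Support vector machine": 90,
}
TEST_RECALL_L1 = {
    "Decision tree": 89,
    "Random forest": 93,
    "Gradient boosting model": 98,
    "Logistic regression": 95,
    "Support vector machine": 92,
}

# Gain in percentage points from applying L1-based feature selection.
gains = {name: TEST_RECALL_L1[name] - TEST_RECALL_ALL[name] for name in TEST_RECALL_ALL}
best = max(TEST_RECALL_L1, key=TEST_RECALL_L1.get)

for name, gain in gains.items():
    print(f"{name}: {gain:+d} percentage points")
print("Highest test-set recall with L1-based features:", best)
```

By these figures, test-set recall rises by 2 to 3 percentage points for every classifier except the decision tree, which is unchanged, and the gradient boosting model remains the strongest.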
