
A Case Study of Using Machine Learning Techniques for COVID-19 Diagnosis

Marco Dinacci, Tianhua Chen, Mufti Mahmud, and Simon Parkinson

Abstract The Coronavirus disease (COVID-19) is a worldwide pandemic that has
led to millions of deaths and is affecting every corner of society. The industrial
and scientific communities are continuously working to curb the spread of the
pandemic, with efforts in numerous areas including disease detection and diagnosis,
virology, and vaccine and drug development. Artificial Intelligence (AI) and
machine learning techniques have been widely incorporated in COVID-19 related
research and development. With the aim of establishing a use case of machine
learning techniques for COVID-19 diagnosis, this paper applies the XGBoost machine
learning technique, while examining a number of hyperparameters and data
preprocessing techniques, to identify an accurate predictive model, followed by the
use of Shapley values to study the predictors that are most informative of the
diagnosis. Evaluated on a collection of anonymised patient data collected from the
standard Reverse Transcriptase Polymerase Chain Reaction (RT-PCR) and additional
laboratory test results, the best model obtained demonstrates high diagnostic
performance.

M. Dinacci · T. Chen (B) · S. Parkinson
Department of Computer Science, School of Computing and Engineering, University
of Huddersfield, Huddersfield, UK
e-mail: T.Chen@hud.ac.uk
M. Mahmud
Department of Computer Science, Nottingham Trent University, Nottingham, UK

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
T. Chen et al. (eds.), Artificial Intelligence in Healthcare, Brain Informatics and Health,
https://doi.org/10.1007/978-981-19-5272-2_10

1 Introduction

Coronavirus disease (COVID-19) is an infectious disease caused by the SARS-CoV-2
virus. As of 27 January 2022, the World Health Organisation has reported over 356
million confirmed cases and 5.6 million deaths across the globe [24]. The
restrictions imposed in response to the rising incidence have had consequences
ranging from school closures, devastated industries and millions of jobs lost, to
worsened public mental wellbeing and undermined progress on global poverty and
clean energy [3, 10, 15]. Even with the advent of vaccines and the gradual ending of
lockdowns, the social, economic and cultural effects of the pandemic will cast a
long shadow into the future.
The development of Artificial Intelligence (AI) techniques has been accelerated
as a result of recent advances in machine learning and data analytics [9], which has
led to numerous successful applications in various domains including healthcare
[7, 12, 19, 20]. In the context of COVID-19, AI has been widely used in disease
detection and diagnosis, virology and pathogenesis, drug and vaccine development,
and epidemic and transmission prediction [6].
In particular, the diagnosis of virus infection is a significant part of COVID-
19 research and practice. The current detection methods used for COVID-19 dis-
ease mainly include nucleic acid testing, serological diagnosis, chest X-ray and CT
image inspection [6]. Bearing high sensitivity and specificity, the real-time Reverse
Transcriptase Polymerase Chain Reaction (RT-PCR) is the current standard detection
technology in diagnosing the COVID-19 virus. Isothermal nucleic acid amplification
and blood testing methods are also commonly used for rapid screening of SARS-
CoV-2. Medical imaging inspection is another widely used clinical approach for
COVID-19 detection and diagnosis, which generally includes chest X-ray and lung
CT imaging.
The existing testing and detection methods in medical practice also underpin
recent research of utilising AI and machine learning techniques to develop more
robust and accurate computer-assisted techniques as a complementary solution to
medical analysis [18, 21]. While results such as blood tests, CT and X-ray scans,
respiratory sound and RT-PCR have been extensively applied, researchers have also
experimented with diagnosis based only on a questionnaire survey, without any
physiological analysis [25]. Despite numerous efforts in the wider AI scientific
community, a recent study [18] found methodological pitfalls and biases across the
large number of papers it analysed, suggesting that reported performance is often
overly optimistic.
In working towards demonstrating a use case of machine learning techniques for
the diagnosis of COVID-19, this paper aims to establish an effective model for its
automatic diagnosis. Utilising XGBoost, a powerful machine learning technique,
this paper examines a number of model hyperparameters and data preprocessing
techniques, followed by the use of Shapley values to identify the predictors that are
most informative of the diagnosis. Applied to a collection of anonymised patient
data from the SARS-CoV-2 RT-PCR and additional laboratory tests, the best model
obtained demonstrates high diagnostic performance and points out factors that might
be worth further clinical attention.
The remainder of this chapter is structured as follows. Section 2 reviews the related
work on machine learning for COVID-19 diagnosis in recent literature. Section 3
presents the experimental settings, results and discussions. Section 4 concludes the
chapter and points out potential future work.

2 Literature Review

Organised by the source material used for diagnosis, this section reviews popular
machine learning methods applied to COVID-19 prediction, which generally rely on
medical imaging, blood tests, or respiratory sound.
For research based on medical imaging, the two most common imaging techniques
are chest X-rays and chest Computed Tomography (CT). In [16], a deep learning
network based on the popular ResNet50 architecture was used to predict COVID-19.
The model produces features from a series of CT slices and combines them through a
max-pooling operation, which is then fed to a fully connected layer and a softmax
activation function to obtain a probability score for each diagnostic category, i.e.,
COVID-19, Community Acquired Pneumonia (CAP), non-pneumonia. Evaluated on
an independent testing set made of 10% of the original image files, the model is able
to achieve the area under the curve of the receiver operating characteristics (AUROC)
of 0.96 from a dataset consisting of 4356 chest CT examinations of 3322 patients.
It is worth noting, however, that the study has limitations. First, patients affected
by COVID-19 might show imaging characteristics similar to those of pneumonia caused
by other viruses, whereas CAP was the only type of pneumonia used for comparison.
The second limitation is the difficulty of interpreting the results produced by the
neural network, an issue common to most deep learning methods but particularly
significant in this area, where predictions may have a direct impact on human life.
On the other hand, because chest X-rays (CXR) are cheap and widespread, there is
considerable interest in the clinical community in using them to detect COVID-19. In
[22], an empirical evaluation of pre-training and transfer learning of standard CNN
models (including ResNet, COVID-Net and DenseNet) is conducted on six datasets,
among which is the COVID Radiographic images Data-set for AI (CORDA), created from
386 patients who were screened for COVID-19. Whilst promising, the study concluded
that the CXR data needed to determine whether CNNs can be effectively used as an aid
in the fight against the COVID-19 pandemic need to be scaled up by a factor of two
or more. This is consistent with findings from a recent survey, which concludes that
the shortage of large-scale datasets is the main challenge hindering the
implementation of AI-based imaging inspection [6].
Apart from medical imaging, the diagnosis of COVID-19 may be significantly
facilitated by routine blood tests, which provide numerous important indicators
that may correlate with COVID-19 infection [1]. For instance, the authors of [2]
opted for an interpretable model based on decision trees in order to obtain
more insights and concluded that parameters such as white blood cells (WBC),
C-reactive protein (CRP), neutrophils (NEU), lymphocytes (LYM), monocytes
(MONO), eosinophils (EOS), basophils (BAY), aspartate and alanine aminotrans-
ferase (AST and ALT, respectively), lactate dehydrogenase (LDH) and others have
shown high correlations in patients diagnosed with COVID-19. A similar study [4]
identified prognostic serum biomarkers in patients at greatest risk of mortality from
COVID-19, where a model was developed to predict whether a patient would expire
within 48 hours. A support vector machine (SVM) achieved 91% sensitivity and 91%
specificity on a held-out testing dataset, and Shapley additive explanations (SHAP)
were further used to capture the contribution of each feature to the model's output.
The results may, however, be limited by an unbalanced sample, with a relative
minority of positive mortalities, drawn from a single institution.
There are alternative uses of machine learning for COVID-19 diagnosis. For
instance, respiratory sounds such as the breath and cough of a potential patient
have been collected with the aim of diagnosing COVID-19 from audio. In a recent
study [13], the data was crowd-sourced by allowing individuals to download a phone
app that records a short audio clip of the user breathing and coughing. A CNN model
was then trained over 355 patients to detect symptomatic and asymptomatic COVID-19
cases from these recordings, with an AUROC of 0.846. Such innovation can help
address the significant issue of small data samples, though senior participants may
be under-represented because they are less familiar with such apps.
Although massive efforts have been spent on developing ML models to help fight
COVID-19, [18] concludes that all papers analyzed have methodological pitfalls and
biases, leading to overly optimistic performance. The main issues identified concern
reproducibility, adherence to best practices, and the lack of external validation.
Model performance is generally biased by small sample sizes: owing to the sensitive
nature of clinical data, there isn't yet a sufficiently large repository of
international COVID-19 data that can be used to train more complicated deep
models. In addition to the lack of data, another limitation is that numerous existing
studies rarely provide details on how the AI model predictions were interpreted and
tracked, which would have provided more insights to facilitate the understanding,
treatment and prevention of the disease.

3 Experimentation and Discussion

The following investigative experimentation aims to examine the performance of
machine learning for COVID-19 diagnosis, with application to an open-source
dataset that contains anonymised patient data from the Hospital Israelita Albert
Einstein in São Paulo, Brazil. The samples were collected to perform the SARS-CoV-2
RT-PCR and additional laboratory tests during a visit to the emergency room.
A high-level view of the steps taken during the experiment is illustrated in Fig. 1.
The loops in the diagram indicate that hill-climbing was done over a combination of
different data sampling approaches and hyperparameter configurations.

Fig. 1 Experiment flowchart

3.1 Data Preprocessing

The dataset is made of 111 features and 5644 entries. The vast majority of features
are derived from standard blood tests, such as the number of red blood cells,
platelets, leukocytes and lymphocytes, and the hematocrit, but also include the
presence of other viruses such as Influenza A and B and Rhinovirus, as well as other
coronaviruses such as Coronavirus229E and CoronavirusOC43. About 15% of the features,
such as urobilinogen, ketone bodies, esterase, and others, are obtained from urine
samples.
In terms of missing values, approximately 25% of the attributes have values for less
than 1% of the entries. Some attributes have a large percentage of invalid values (up
to 100%), so we removed any column where most values were encoded as “Not a Number”
(NaN).

Some columns that weren't relevant to the task were also removed. These are
'Patient admitted to regular ward', 'Patient admitted to semi-intensive unit' and
'Patient admitted to intensive care unit'.
Before using the dataset for training, all the values in Portuguese were converted
to English using Google Translate; for example, “Ausentes” was translated to
“absent”. Some of the Boolean features were represented with strings such as “true”
or “false” and others with 0s and 1s. We converted these features to a native
Boolean representation. String and object based features were encoded as integers,
so that each label takes a value between 0 and n_classes-1.
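The preprocessing steps described above can be sketched in plain Python. This is only an illustrative sketch under stated assumptions: the toy records, the column names and the 50% missing-value cut-off are hypothetical, and the chapter does not say which library or exact threshold the authors used.

```python
# Hypothetical toy records; column names and values are illustrative only.
MISSING = None


def drop_sparse_columns(rows, max_missing_ratio=0.5):
    """Drop columns in which more than max_missing_ratio of the values are missing."""
    keep = []
    for col in rows[0]:
        missing = sum(1 for r in rows if r[col] is MISSING)
        if missing / len(rows) <= max_missing_ratio:
            keep.append(col)
    return [{c: r[c] for c in keep} for r in rows]


def to_bool(value):
    """Map mixed encodings ("true"/"false" strings, 0/1) to native booleans."""
    if value is MISSING:
        return MISSING
    if isinstance(value, str):
        return value.strip().lower() == "true"
    return bool(value)


def label_encode(values):
    """Encode labels as integers in 0..n_classes-1 (sorted order), keeping missing values."""
    classes = sorted({v for v in values if v is not MISSING})
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] if v is not MISSING else MISSING for v in values], classes


rows = [
    {"urine_test": "absent", "flag": "true", "rare": None},
    {"urine_test": "present", "flag": 0, "rare": None},
    {"urine_test": "absent", "flag": 1, "rare": 3.2},
]
rows = drop_sparse_columns(rows)               # "rare" is missing in 2 of 3 rows
flags = [to_bool(r["flag"]) for r in rows]     # mixed encodings become booleans
codes, classes = label_encode([r["urine_test"] for r in rows])
```

Sorting the class names before assigning codes keeps the encoding deterministic across runs.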

3.2 Data Sampling

As shown in Fig. 2, the dataset is highly imbalanced since most of the patients
tested negative for COVID-19. It contains 5086 negative cases and 558 positive
ones. To achieve reliable results, the dataset was re-balanced through sampling,
including random oversampling of the minority class and random undersampling of
the majority class. We also tried simply removing the rows from the negative
cases which contained multiple null values. In oversampling, we randomly
duplicated examples from the minority class and added them to the training dataset.
With undersampling we did the opposite by randomly removing samples from the
majority class.
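The three re-balancing strategies can be illustrated with a short standard-library sketch. The toy rows, the helper names and the seed are assumptions made for illustration; they are not taken from the authors' code.

```python
import random


def random_oversample(majority, minority, rng):
    """Duplicate random minority examples until both classes have the same size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra


def random_undersample(majority, minority, rng):
    """Keep a random subset of the majority class as large as the minority class."""
    return rng.sample(majority, len(minority)), minority


def drop_sparse_negatives(majority, threshold):
    """Remove majority (negative) rows with at least `threshold` missing values."""
    return [row for row in majority if sum(v is None for v in row) < threshold]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
negatives = [[1.0, None, None], [0.5, 0.2, None], [0.1, 0.3, 0.9], [None, None, None]]
positives = [[0.7, 0.8, 0.6]]

over_neg, over_pos = random_oversample(negatives, positives, rng)
under_neg, under_pos = random_undersample(negatives, positives, rng)
dense_neg = drop_sparse_negatives(negatives, threshold=2)
```

Unlike the two random strategies, the null-removal variant shrinks the majority class by discarding its least informative rows, which may explain why it behaves differently in the experiments.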
After balancing, the dataset was split into train (70%), test (15%) and validation
(15%) datasets. The validation set was used to evaluate the model hyperparameters;
the test set was used to evaluate the predictive power of the model against data it
hadn't seen before. The results of the sampling experiments are reported in Table 1.
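A 70/15/15 split of this kind can be sketched as follows. The function name, the shuffling seed and the use of integer indices as stand-ins for rows are assumptions for illustration; the chapter does not state whether the authors stratified the split.

```python
import random


def train_test_validation_split(items, train=0.70, test=0.15, seed=0):
    """Shuffle a copy of the items and cut it into train, test and validation parts."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_test],
        shuffled[n_train + n_test:],   # the remainder is the validation set
    )


train_set, test_set, validation_set = train_test_validation_split(list(range(1000)))
```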

Fig. 2 Negative versus positive cases


Table 1 Sampling experiments

Sampling technique               Area under curve   F1 score
Random oversampling              0.6695             0.73
Random undersampling             0.5431             0.55
Nulls removal (threshold = 10)   0.8865             0.85
Nulls removal (threshold = 20)   0.9541             0.95
Nulls removal (threshold = 23)   0.9533             0.96
Nulls removal (threshold = 30)   0.9522             0.95

3.3 Model Selection

The prediction model was developed using XGBoost, which belongs to the family of
gradient boosting algorithms [8], because it is a very efficient and flexible
distributed method that has found numerous successful applications [23]. XGBoost can
be used for both regression and classification and produces a model composed of an
ensemble of decision trees.
XGBoost is particularly effective at dealing with imbalanced datasets since it does
not make any assumptions about the data distribution nor about the relationships
among features, and can be configured using the scale_pos_weight hyperparameter to
scale the gradient's weights for the positive (minority) class during training.
Changing the scale of the weights between the positive (minority) and negative
(majority) classes has the effect of over-correcting the errors made by the model on
the positive class, ultimately resulting in a better model.
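As a worked example, a commonly suggested starting value for scale_pos_weight is the ratio of negative to positive examples. The sketch below applies that heuristic to the class counts reported for this dataset; the chapter does not state which value the authors actually used, so the result is illustrative only.

```python
# Class counts reported for the dataset in this chapter.
n_negative = 5086
n_positive = 558

# Heuristic starting point: number of negatives divided by number of positives.
scale_pos_weight = n_negative / n_positive

# Assumed parameter dictionary for illustration; only the keys are XGBoost names.
params = {
    "objective": "binary:logistic",
    "scale_pos_weight": scale_pos_weight,
}
```

With these counts the ratio is roughly 9.1, meaning that errors on a positive case would weigh about nine times as much as errors on a negative one.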
The prediction scores of each individual tree are summed up to get the final score,
in the form:
\[ \hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F} \tag{1} \]

where K is the number of trees and f_k is a function in the functional space F, which
is the set of all possible CARTs (classification and regression trees). In XGBoost,
the objective function is:
\[ \text{obj} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(t)}\big) + \sum_{i=1}^{t} \Omega(f_i) \tag{2} \]

where the first term is the training loss and the second is the regularization
term, which helps the model to avoid overfitting.
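To make Eqs. (1) and (2) concrete, the following sketch evaluates both for a hand-built ensemble of two decision stumps. Everything here is a simplifying assumption: the stumps, the logistic loss and the regularization term (reduced to half the sum of squared leaf weights times a factor lam) are illustrative and are not XGBoost's internal implementation.

```python
import math


def make_stump(feature, threshold, left, right):
    """A depth-1 tree: returns `left` if x[feature] < threshold, otherwise `right`."""
    return lambda x: left if x[feature] < threshold else right


def predict_score(trees, x):
    """Eq. (1): the raw score is the sum of the individual tree outputs."""
    return sum(tree(x) for tree in trees)


def log_loss(y, score):
    """Logistic loss on a raw score, used here as one possible choice of l."""
    p = 1.0 / (1.0 + math.exp(-score))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def objective(trees, leaf_weights, X, y, lam=1.0):
    """Eq. (2): training loss plus one (simplified) regularization term per tree."""
    loss = sum(log_loss(yi, predict_score(trees, xi)) for xi, yi in zip(X, y))
    penalty = sum(0.5 * lam * sum(w * w for w in ws) for ws in leaf_weights)
    return loss + penalty


trees = [make_stump(0, 0.5, -1.0, 1.0), make_stump(1, 2.0, -0.5, 0.5)]
leaf_weights = [(-1.0, 1.0), (-0.5, 0.5)]
X = [(0.2, 1.0), (0.9, 3.0)]
y = [0, 1]

scores = [predict_score(trees, x) for x in X]
obj = objective(trees, leaf_weights, X, y)
```

Larger leaf weights fit the training labels more tightly but increase the penalty, which is the trade-off the regularization term controls.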
Tree boosting is fundamentally similar to Random Forests as both techniques
use tree ensembles as their models. A Random Forest classifier would also have
been a possible choice, but according to Chen [5], there is a high probability that a
bootstrap sample (the data points drawn from the training data on which a decision
tree is fitted) contains few or even none of the minority class, resulting in a tree
with poor performance for predicting the minority class. This makes it a poor choice
here, since the minority class consists of the positive COVID-19 cases, which is the
main class to predict.

3.4 Hyperparameters Optimization

XGBoost can be configured with a large number of hyperparameters. In order to
find a good combination, we used a grid search with cross-validation, which
evaluates all possible combinations of the given parameter values.
The grid search was customised with five parameters. We used a range of values
for the number of estimators (the decision trees), the maximum tree depth (6, 7 and
9) and the sub-sample ratio, which is the fraction of the training data used to train
the classifier at each iteration. This technique combines gradient boosting with
bootstrap averaging (bagging) and is described in [14]. The summary of grid search
hyperparameters is presented in Table 2. The search is integrated with k-fold
(k = 10 in our case) cross-validation, which splits the data into multiple groups to
reduce bias and variance.
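The size of this search can be illustrated by enumerating the grid and the folds directly. The value ranges are those listed in Table 2; the parameter names (n_estimators, subsample, max_depth) are assumed to correspond to the table rows, and the fold generator is a simplified, unshuffled stand-in for whatever cross-validation utility the authors used.

```python
from itertools import product

# Value ranges from Table 2; the key names are an assumption about the mapping.
param_grid = {
    "n_estimators": [100, 300, 500, 700],
    "subsample": [0.5, 0.7, 1.0],
    "max_depth": [6, 7, 9],
}


def expand_grid(grid):
    """Return every combination of the listed values as a parameter dictionary."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


candidates = expand_grid(param_grid)
folds = list(k_fold_indices(95, k=10))
total_fits = len(candidates) * len(folds)
```

With 36 candidate configurations and 10 folds, an exhaustive search of this grid trains 360 models, which is why the ranges are kept small.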

3.5 Evaluation

To evaluate the model we considered both precision and recall. Because a false
negative misdiagnoses a positive case, recall is potentially more significant than
precision: it can be more affordable to take more tests to find out whether one is
actually positive than to miss a positive case, which could potentially put more
people at risk.
On the other hand, patients who were incorrectly classified as having COVID-19
might have a different illness (in [16] the authors have highlighted the ambiguity
between predictions of COVID-19 and various types of different pneumonias), so
we can’t simply discard precision.

Table 2 Summary of grid search hyperparameters

Hyperparameter      Range of values        Best value
Estimators count    [100, 300, 500, 700]   100
Sub-sample ratios   [0.5, 0.7, 1.0]        0.5
Max tree depth      [6, 7, 9]              9
Cross-validation    10-fold                N/A
Metric              ROC AUC score          N/A

Table 3 Results from best model

Model     Precision   Recall   F1 score   ROC AUC
XGBoost   96.15%      94.94%   96.00%     95.33%

A good trade-off between recall and precision is the F1 score, which is the
harmonic mean of recall and precision, i.e.

\[ F_1 = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
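As a small worked example of the harmonic mean, the sketch below computes precision, recall and F1 from raw counts. The counts are hypothetical and are not taken from the experiments in this chapter.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and their harmonic mean (F1) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 90 true positives, 10 false positives, 5 false negatives.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=5)
```

Because it is a harmonic mean, F1 always lies between precision and recall and is pulled towards the smaller of the two.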

To understand how well the model can separate the two classes, we also computed the
Area Under the Receiver Operating Characteristic Curve (ROC AUC) from the prediction
scores, which measures the probability that the model ranks a random positive case
of COVID-19 higher than a random negative case.
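This ranking interpretation can be computed directly by comparing every positive score with every negative score. The labels and scores below are hypothetical, and the quadratic pairwise loop is chosen for clarity rather than efficiency.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive is ranked higher.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = positive) and prediction scores.
labels = [1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1]
auc = roc_auc(labels, scores)
```

In this toy example one positive (score 0.4) is ranked below one negative (score 0.7), so 11 of the 12 pairs are ordered correctly.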
We experimented with different thresholds, and removing negative-class rows which
contained at least 23 null values produced the model with the best F1 score and a
ROC AUC close to the highest observed, as reported in Table 1.
The best results are presented in Table 3: the model achieves a high F1 score, with
similar precision and recall. To better visualize how the model can separate the two
classes, the ROC curve is depicted in Fig. 3; it far outperforms the random guess
represented by the diagonal line.

Fig. 3 ROC curve



Fig. 4 Confusion matrix

In general, the results are very encouraging, as we have obtained an F1 score of 0.96
and an AUC of 95.33% measured on a held-out test set composed of 15% of the
original data. The imbalance in the dataset was addressed by removing the rows from
the negative cases which contained multiple null values. This simple approach
was more effective than oversampling from the minority class and undersampling
from the majority class.
To visually assess the quality of the classifier we plotted a confusion matrix and
inspected the results. As we can see in Fig. 4, the model performed well, but
incorrectly predicted a negative outcome instead of a positive one 22 times, and a
positive outcome instead of a negative one 53 times.
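The four cells of a binary confusion matrix can be tallied as in the sketch below. The label vectors are hypothetical and much smaller than the test set used in the chapter.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred):
    """Count true/false negatives and positives for binary labels (1 = positive)."""
    counts = Counter(zip(y_true, y_pred))
    return {
        "tn": counts[(0, 0)],
        "fp": counts[(0, 1)],   # negative case predicted as positive
        "fn": counts[(1, 0)],   # positive case predicted as negative
        "tp": counts[(1, 1)],
    }


# Hypothetical ground truth and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
cm = confusion_matrix(y_true, y_pred)
```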
Furthermore, in order to identify the subset of predictors that are more informative
and contribute more towards the decision, SHapley Additive exPlanations (SHAP)
values [17] are utilised. The Shapley value is a concept from game theory that
represents the average marginal contribution of a player across all possible
coalitions (features in this case). It makes it possible to explain the prediction
of a classifier by computing the Shapley value of each feature, in order to
determine how much that feature contributes to the classifier's prediction.
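For a handful of features, Shapley values can be computed exactly by averaging each feature's marginal contribution over all orderings. The sketch below does this for a hypothetical value function with three features; the feature names and payoffs are made up for illustration, and this brute-force enumeration is not how SHAP tooling handles models with many features.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


def value(coalition):
    """Hypothetical model output when only the features in `coalition` are known."""
    score = 0.0
    if "age" in coalition:
        score += 0.3
    if "leukocytes" in coalition:
        score += 0.2
    if {"age", "leukocytes"} <= coalition:
        score += 0.1  # interaction effect, shared equally between the two features
    return score


features = ["age", "leukocytes", "platelets"]
phi = shapley_values(features, value)
```

The resulting values sum to the difference between the full-coalition output and the empty-coalition output, which is the property that lets them be read as an additive explanation of a single prediction.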
The top 20 most important features are plotted in Fig. 5. The most important
features identified are the patient age quantile and a high count of white cells (leuko-
cytes, eosinophils, monocytes, lymphocytes), which is clearly a sign that a patient’s
body is reacting to a pathogen. This is in line with results reported in the
literature, such as [2], where positive COVID-19 cases were strongly correlated
with an increased white blood cell count.

Fig. 5 20 most important features

4 Conclusion

COVID-19 is an unprecedented global pandemic, and various strands of research and
practice have been invested in fighting it. In line with recent trends of employing
machine learning techniques to facilitate its diagnosis, this paper has engineered
a predictive model, through data sampling and hyperparameter optimisation, that
achieves promising diagnostic accuracy on anonymised results of over 5600 entries.
Whilst promising, additional improvements could be obtained by experimenting with
alternative sampling techniques and by exploring advanced interpolation techniques
[11] in response to the existence of numerous missing values.

References

1. Alves MA, Castro GZ, Oliveira BAS, Ferreira LA, Ramírez JA, Silva R, Guimarães FG (2021)
Explaining machine learning based diagnosis of covid-19 from routine blood tests with decision
trees and criteria graphs. Comput Biol Med 132:104335
2. Alves MA, Castro GZ, Oliveira BAS, Ferreira LA, Ramírez JA, Silva R, Guimarães
FG (2021) Explaining machine learning based diagnosis of covid-19 from routine blood
tests with decision trees and criteria graphs. Comput Biol Med 132:104335. Accessed
from https://www.sciencedirect.com/science/article/pii/S0010482521001293. https://doi.org/
10.1016/j.compbiomed.2021.104335

3. Barbier EB, Burgess JC (2020) Sustainability and development after covid-19. World Develop
135:105082
4. Booth AL, Abels E, McCaffrey P (2021). Development of a prognostic model for mortality
in covid-19 infection using machine learning. Modern Pathol 34(3):522–531. Accessed from
https://doi.org/10.1038/s41379-020-00700-x
5. Chen C (2004) Using random forest to learn imbalanced data
6. Chen J, Li K, Zhang Z, Li K, Yu PS (2021) A survey on applications of artificial intelligence
in fighting against covid-19. ACM Comput Surv (CSUR) 54(8):1–32
7. Chen T, Antoniou G, Adamou M, Tachmazidis I, Su P (2021) Automatic diagnosis of attention
deficit hyperactivity disorder using machine learning. Appl Artif Intell 1–13
8. Chen T, Guestrin C (2016) Xgboost: a scalable tree boosting system. In: Proceedings of the
22nd ACM SIGKDD international conference on knowledge discovery and data mining
9. Chen T, Keravnou-Papailiou E, Antoniou G (2021) Medical analytics for healthcare intelligence
- recent advances and future directions. Artif Intell Med 112:102009
10. Chen T, Lucock M (2022) The mental health of university students during the covid-19 pan-
demic: an online survey in the UK. Plos One 17(1):e0262562
11. Chen T, Shang C, Yang J, Li F, Shen Q (2020) A new approach for transformation-based fuzzy
rule interpolation. IEEE Trans Fuzzy Syst. Accessed from https://doi.org/10.1109/TFUZZ.
2019.2949767
12. Chen T, Su P, Shen Y, Chen L, Mahmud M, Zhao Y, Antoniou G (2022) A dominant set-
informed interpretable fuzzy system for automated diagnosis of dementia. Front Neurosci
13. Coppock H, Gaskell A, Tzirakis P, Baird A, Jones L, Schuller B (2021) End-to-end convolu-
tional neural network enables covid-19 detection from breath and cough audio: a pilot study.
BMJ Innov 7(2):356–362. Accessed from https://innovations.bmj.com/content/7/2/356. http://
orcid.org/10.1136/bmjinnov-2021-000668
14. Friedman J (2002) Stochastic gradient boosting. Comput Stat Data Anal 38:367–378
15. Kaiser MS, Mahmud M, Noor MBT, Zenia NZ, Al Mamun S, Mahmud KA et al (2021)
iworksafe: towards healthy workplaces during covid-19 with an intelligent phealth app for
industrial settings. IEEE Access 9:13814–13828
16. Li L, Qin L, Xu Z, Yin Y, Wang X, Kong B, Xia J (2020) Using artificial intelligence to
detect covid-19 and community-acquired pneumonia based on pulmonary ct: evaluation of
the diagnostic accuracy. Radiology 296(2):E65–E71. Accessed from https://europepmc.org/
articles/PMC7233473. https://doi.org/10.1148/radiol.2020200905
17. Lundberg SM, Lee S-I (2017) A unified approach to interpreting model predictions. In: Guyon
I et al (eds) Advances in neural information processing systems, vol 30, pp 4765–4774. Cur-
ran Associates, Inc. Accessed from http://papers.nips.cc/paper/7062-a-unified-approach-to-
interpreting-model-predictions.pdf
18. Roberts M, Driggs D, Thorpe M, Gilbey J, Yeung M, Ursprung S, AIX-COVNET (2021)
Common pitfalls and recommendations for using machine learning to detect and prognosticate
for covid-19 using chest radiographs and ct scans. Nat Mach Intell 3(3):199–217. Accessed
from https://doi.org/10.1038/s42256-021-00307-0
19. Stirling J, Chen T, Bucholc M (2020) Diagnosing alzheimer’s disease using a self-organising
fuzzy classifier. In: Fuzzy logic recent applications and developments. Springer
20. Su P, Chen T, Xie J, Zheng Y, Qi H, Borroni D, Liu J (2020). Corneal nerve tortuosity grading
via ordered weighted averaging-based feature extraction. Med Phys
21. Syeda HB, Syed M, Sexton KW, Syed S, Begum S, Syed F, Yu Jr F (2021) Role of machine learn-
ing techniques to tackle the covid19 crisis: systematic review. JMIR Med Inform 9(1):e23811.
Accessed from http://medinform.jmir.org/2021/1/e23811/
22. Tartaglione E, Barbano C. A, Berzovini C, Calandri M, Grangetto M (2020) Unveiling covid-19
from chest x-ray with deep learning: a hurdles race with small data. Int J Environ Res Public
Health 17(18). Accessed from https://www.mdpi.com/1660-4601/17/18/6933
23. Wang J, Yue-Xin L, Chun-Ying W (2019) Survey of recommendation based on collaborative
filtering. J Phys: Conf Ser 1314

24. World Health Organisation (2022) Coronavirus disease (COVID-19) pandemic. https://www.
who.int/emergencies/diseases/novel-coronavirus-2019
25. Zoabi Y, Deri-Rozov S, Shomron N (2021) Machine learning-based prediction of covid-19
diagnosis based on symptoms. npj Digit Med 4(1):3. Accessed from https://doi.org/10.1038/
s41746-020-00372-6
