
EXPLORING THE RELATION BETWEEN BLOOD TESTS AND COVID-19 USING MACHINE LEARNING

Jenil Gandhi
Electronics and Communication Engineering
Institute of Technology, Nirma University
Sarkhej – Gandhinagar Highway
Ahmedabad - 382482, India
20bec046@nirmauni.ac.in

Jugal Upadhyay
Electronics and Communication Engineering
Institute of Technology, Nirma University
Sarkhej – Gandhinagar Highway
Ahmedabad - 382482, India
20bec048@nirmauni.ac.in

Abstract - The COVID-19 pandemic has had a significant impact on global public health and has posed significant challenges for healthcare systems worldwide. In this paper, we explore the relationship between blood tests and COVID-19 using machine learning techniques. Specifically, we analyze the blood test results of COVID-19 patients and non-COVID-19 patients to identify potential biomarkers and their correlation with the disease. We also investigate the use of machine learning algorithms to predict COVID-19 infection based on blood test results. Our results show that certain blood test parameters, such as lymphocyte count, neutrophil count, and C-reactive protein (CRP) levels, are significantly associated with COVID-19. This research has important implications for the early detection and management of COVID-19, as well as the development of personalized treatment plans for infected individuals.

Keywords – COVID-19, Machine Learning, Blood Tests, KNN, SVM, AdaBoost, Random Forest, Ensemble Learning

I. INTRODUCTION
The COVID-19 pandemic has had a significant impact on global public health since its emergence in late 2019. In early 2020, the World Health Organization declared it a public health emergency of international concern, leading to urgent efforts to contain its spread and mitigate its impact. One of the key challenges in the management of COVID-19 has been the timely and accurate diagnosis of infected individuals. With the increasing number of cases and limited resources, healthcare systems have faced the daunting task of diagnosing patients quickly and efficiently. In response to this challenge, researchers have turned to machine learning as a promising tool to aid in the diagnosis of COVID-19. In this paper, we present an algorithm trained on patients' blood reports and COVID-19 test results to predict the outcome for new patients. This approach aims to provide doctors with a quick and reliable method of diagnosing COVID-19, which can potentially save lives and help curb the spread of the disease. Our method involves training a machine learning model on a large dataset of blood reports and COVID-19 test results to identify key biomarkers that are indicative of the disease. We demonstrate the effectiveness of our approach through extensive experimentation and evaluation, achieving high accuracy in predicting COVID-19 infection from blood reports. Our work has significant implications for the management of COVID-19 and can be a valuable addition to the toolkit of healthcare professionals in the fight against this global health crisis.

In the following section, we will delve into a dataset acquired from a Brazilian hospital [1] that contains over 5000 blood test results along with their COVID-19 diagnoses. While the dataset covers various blood tests, the ultimate aim is to establish a connection between crucial independent variables and the binary classification of COVID-19 (positive or negative). We will discuss various models and compare their efficacy in the upcoming sections.

II. LITERATURE REVIEW
Machine learning has been applied in a great deal of research, especially during the ongoing COVID-19 pandemic, to help detect patterns and insights leading to or related to the infection. This section will address some of the published papers relevant to the subject.

The paper "Detection COVID-19 using Machine Learning from Blood Tests" [2] presents a study aimed at detecting COVID-19 in its early stages using blood tests, with the goal of minimizing the number of deaths caused by the disease and reducing the economic impact of the pandemic. The authors employed machine learning techniques, including Random Forest, Support Vector Machine, and Naive Bayes classifiers, to predict COVID-19 status based on the blood test results. The accuracy of the classifiers was evaluated, and the Support Vector Machine classifier achieved the highest accuracy of 88%.

The paper "COVID-19 Infection Detection Using Machine Learning" [3] discusses the use of machine learning techniques in the detection of COVID-19. The authors introduce a machine learning-based COVID-19 infection predictor and measure the prediction accuracy of five different machine learning models.

Another study applied machine learning to detect COVID-19 in a fast and accurate manner through deep learning methods [4]. This research relied on X-ray and CT scan images based on data obtained from Iran. The researchers obtained 84.67% accuracy from X-ray images and 98.78% accuracy from CT scans.

The study "Experiment on Deep Learning Models for COVID-19 Detection from Blood Testing" [5] proposes the use of hematochemical data and deep learning algorithms as an alternative method for COVID-19 detection. The authors utilize two different deep learning architectures, namely a custom-built DNN and TabNet, to classify coronavirus infections using complete blood count test results. Three datasets obtained from hospitals in Italy, Brazil, and Indonesia are used to train the models. The average AUC scores obtained for the models trained with datasets from San Raphael Hospital in Italy, Albert Einstein Hospital in Brazil, and Pasar Minggu Hospital in Indonesia are 0.87, 0.90, and 0.88, respectively.

III. DATA SET
The research we conducted involved the use of a dataset obtained from the Albert Einstein Hospital located in São Paulo, Brazil. The data comprised information on patients who had their samples collected for COVID-19 tests, as well as other laboratory tests, all of which were anonymized using appropriate procedures. The hospital itself uploaded the dataset to Kaggle, and it encompasses the period from 28th March to 3rd April 2020, consisting of 5644 observations and 111 variables, one of which is the dependent variable indicating the COVID-19 test result (positive or negative).

FEATURE                   RESULT
Number of observations    5644
Total attributes          111

POSITIVE VS NEGATIVE CASES
The dataset we used in our study contained 5644 records of COVID-19 tests conducted at the Albert Einstein Hospital in São Paulo, Brazil. Among these records, approximately 10%, or 558 cases, were found to be positive, while around 90%, or 5086 cases, were negative. One issue with this dataset is that it is imbalanced, meaning that the number of positive cases is much lower than the number of negative cases. However, this reflects the reality of the situation, where the number of individuals infected with COVID-19 is significantly smaller than the number of uninfected individuals.
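
This class split can be checked directly from the target column. A minimal sketch, assuming the Kaggle export has been loaded into a pandas DataFrame 'df' and that the label column carries the Kaggle name 'SARS-Cov-2 exam result':

import pandas as pd

df = pd.read_csv("dataset.csv")   # Kaggle export from Albert Einstein Hospital
counts = df["SARS-Cov-2 exam result"].value_counts()
print(counts)                     # roughly 5086 negative vs 558 positive
print(counts / len(df))           # about 90% negative, 10% positive
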
A heatmap is a graphical representation of data that uses a color-coding scheme to show the degree of variation or magnitude of the data. In our case, a heatmap created in Python was used to visualize the missing values in the dataset. The heatmap allowed us to easily spot patterns of missing values and identify any columns or rows that contained a large number of missing values. This information is crucial for data analysis, as missing values can have a significant impact on the accuracy of statistical models and the conclusions drawn from the data.
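
A minimal sketch of such a missing-value heatmap, assuming the DataFrame 'df' from above; the figure size and the disabled color bar are presentation choices, not part of the original code:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(20, 10))
sns.heatmap(df.isna(), cbar=False)   # one cell per value: missing vs present
plt.show()
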

We generated distribution plots consisting of histograms with density curves for each numerical (float) column in the DataFrame. The loop iterates over each numerical column, and a new figure is created for each plot. The seaborn library's distplot function is used to plot the distribution of values in each column. The histogram shows the values of the column on the x-axis, while the y-axis represents their frequency or density. A kernel density curve is also overlaid on top of the histogram to display the estimated probability density function of the data. These plots are helpful in visualizing the distribution of the numerical variables in the DataFrame, enabling the identification of any patterns or anomalies in the data. This technique can be useful in data analysis and machine learning applications.
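
A sketch of the loop described above, under the same 'df' assumption (distplot is the function named in the text; newer seaborn versions replace it with histplot/displot):

import matplotlib.pyplot as plt
import seaborn as sns

for col in df.select_dtypes("float"):
    plt.figure()                     # one figure per numerical column
    sns.distplot(df[col].dropna())   # histogram with overlaid density curve
    plt.title(col)
plt.show()
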
A box plot, also known as a box-and-whisker plot,
is a graphical representation of the distribution of a
dataset. It shows the median, quartiles, and outliers
of the data in a compact and easy-to-read format.
The box in the plot represents the interquartile range
(IQR), which is the range between the first quartile
(Q1) and the third quartile (Q3). The whiskers
extend from the box to the minimum and maximum
values that are within 1.5 times the IQR. Box plots
are useful for comparing the distributions of
multiple datasets or for identifying potential outliers
in a dataset. They are commonly used in statistical
analysis, data visualization, and exploratory data
analysis.
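
For example, a single call can draw one box per numerical column, which is a compact way to compare distributions and spot outliers across features (a sketch under the same 'df' assumption):

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 8))
sns.boxplot(data=df.select_dtypes("float"), orient="h")   # one box per column
plt.show()
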
In machine learning, correlation refers to the
relationship between the input features and the
target variable in a dataset. It measures the strength
and direction of the linear association between the
features and the target variable, which is the
variable that the machine learning model is trying to
predict.
Correlation analysis in machine learning is important because it can help identify which features are most relevant for predicting the target variable. Highly correlated features can be problematic in machine learning because they can introduce multicollinearity, which is a situation where two or more features are highly correlated with each other, making it difficult for the machine learning model to determine the independent effect of each feature on the target variable. In such cases, it may be necessary to remove one of the correlated features from the dataset to avoid overfitting and improve the model's performance.
Correlation analysis is commonly used in feature selection and feature engineering in machine learning, where the goal is to select the most relevant features for the machine learning model and to transform the features to improve the model's performance.
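
As an illustration, the correlation of each numerical feature with the COVID-19 result can be computed once the target is encoded numerically. A sketch, assuming the Kaggle column name and a 0/1 encoding of our choosing:

# encode the target: negative -> 0, positive -> 1
target = df["SARS-Cov-2 exam result"].map({"negative": 0, "positive": 1})
# Pearson correlation of every float column with the encoded target
corr = df.select_dtypes("float").corrwith(target)
print(corr.sort_values(ascending=False))
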
IV. DATA PREPROCESSING
Data preprocessing in machine learning refers to the process of preparing and cleaning data before it is used to train a machine learning model. This is an essential step in the machine learning pipeline because the quality of the data can significantly impact the performance of the model. Data preprocessing involves several steps, including:
1. Data cleaning: This involves removing or correcting any errors or inconsistencies in the data, such as missing values, incorrect data types, or outliers.
2. Data normalization: This involves scaling the data to a standard range, such as between 0 and 1, to ensure that all features contribute equally to the model.
3. Feature selection: This involves selecting the most relevant features for the model and removing any redundant or irrelevant features.
4. Data splitting: This involves dividing the data into training, validation, and testing sets to evaluate the model's performance and prevent overfitting.
By performing these preprocessing steps, the quality of the data is improved, and the machine learning model can better identify patterns and make accurate predictions. Data preprocessing is an iterative process that requires careful consideration and experimentation to achieve optimal results.

Preprocessing performed by us:
The first line of code selects the column names where the missing value rate is less than 0.9 and greater than 0.88, i.e., columns where the proportion of missing values falls between 88% and 90%. These columns are assigned to a list called 'blood_columns'. The second line selects the column names where the missing value rate is less than 0.88 and greater than 0.75, i.e., columns where the proportion of missing values falls between 75% and 88%. These columns are assigned to a list called 'viral_columns'.
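
The original code is shown only as a screenshot, so the following is a reconstruction of those two lines from the description above; the variable name 'missing_rate' is an assumption:

# fraction of missing values in each column
missing_rate = df.isna().sum() / df.shape[0]

# columns with 88-90% missing values (blood test results)
blood_columns = list(df.columns[(missing_rate < 0.9) & (missing_rate > 0.88)])
# columns with 75-88% missing values (viral test results)
viral_columns = list(df.columns[(missing_rate < 0.88) & (missing_rate > 0.75)])
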
This line of code selects a subset of columns from the pandas DataFrame 'df' and assigns it back to 'df'. The selected columns include the ones specified in three separate lists: 'key_columns', 'blood_columns', and 'viral_columns', which contain the column names relevant to our analysis and modeling task. The '+' operator is used to concatenate the three lists into a single list of column names. By selecting only these columns, the original DataFrame 'df' is effectively reduced to a smaller DataFrame that includes only the relevant columns, and the new DataFrame is assigned back to 'df'.
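
Again reconstructing from the description (the contents of 'key_columns' are not shown in the paper; the two names below are an assumption based on the dataset):

# assumed key columns: patient age quantile and the target label
key_columns = ["Patient age quantile", "SARS-Cov-2 exam result"]

# keep only the columns relevant to the modeling task
df = df[key_columns + blood_columns + viral_columns]
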

Result-

V. LEARNING METHODS & RESULTS
Various machine learning methods have been applied and tested with different parameters to achieve the best classification outcomes. In this section, the different applied methods and results will be discussed.

A. RANDOM FOREST
Random Forest is a popular ensemble learning algorithm used in machine learning for classification and regression tasks. It is a collection of decision trees that work together to improve the accuracy and stability of predictions.
In the Random Forest algorithm, multiple decision trees are built on randomly selected subsets of the training data and features, resulting in a forest of trees. During training, each tree is grown by recursively splitting the data into smaller subsets based on the most discriminative features until a stopping criterion is reached. During prediction, the Random Forest algorithm combines the outputs of the individual trees to make a final prediction.
One of the main advantages of Random Forest is its ability to handle high-dimensional data with a large number of features.

Result-
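
A minimal sketch of how such a classifier could be fitted with scikit-learn; the feature selection, imputation, hyperparameters, and stratified split below are assumptions, not the paper's exact configuration:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# assumed setup: 'df' is the reduced DataFrame from Section IV
y = df["SARS-Cov-2 exam result"].map({"negative": 0, "positive": 1})
X = df.select_dtypes("float").fillna(-1)   # simple placeholder imputation

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))            # mean accuracy on the held-out test set
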
B. SVM
Support Vector Machine (SVM) is a popular supervised learning algorithm used in machine learning for classification and regression tasks. SVM aims to find the optimal hyperplane in a high-dimensional space that maximally separates the data points of different classes.
In SVM, each data point is represented as a vector in the feature space and is classified based on its position relative to the hyperplane. The hyperplane is chosen such that the margin between the hyperplane and the closest data points of each class is maximized. The closest data points are called support vectors, and they define the margin of the hyperplane.
SVM has several advantages, including its ability to handle high-dimensional data with a large number of features and its robustness to outliers. SVM can also provide a clear boundary between classes, making it easy to interpret and visualize the results.

Result-
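
A corresponding scikit-learn sketch, reusing the train/test split from the Random Forest example above; the feature scaling step is our addition, since SVMs are sensitive to feature ranges:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))   # mean accuracy on the held-out test set
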

C. ADAPTIVE BOOSTING
Adaptive Boosting (AdaBoost) is a popular ensemble learning algorithm used in machine learning for classification tasks. It works by combining multiple weak classifiers into a strong classifier, where each weak classifier is trained on a different subset of the training data.
During training, AdaBoost assigns higher weights to misclassified data points, which allows subsequent weak classifiers to focus on the hard-to-classify examples. The final strong classifier is a weighted combination of the individual weak classifiers, with higher weights assigned to the more accurate classifiers.
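
A sketch with scikit-learn's AdaBoostClassifier, again reusing the earlier split; the number of estimators is an assumption:

from sklearn.ensemble import AdaBoostClassifier

# the default weak classifier in scikit-learn is a depth-1 decision tree (a stump)
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
ada.fit(X_train, y_train)
print(ada.score(X_test, y_test))   # mean accuracy on the held-out test set
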
VI. CONCLUSION
In this paper, we investigated various machine learning algorithms to explore the relation between blood tests and COVID-19. We applied SVM, AdaBoost, Random Forest, and K-nearest neighbors. A real dataset obtained from a Brazilian hospital has been used to test the models. When accuracy and F-measure are considered, SVM gives the highest accuracy. The imbalance of the data may also reduce the true positive rate. Future research directions include applying these and other machine learning models to more and varied real datasets.

VII. REFERENCES
[1] N. Crokidakis, "Data analysis and modeling of the evolution of COVID-19 in Brazil," arXiv preprint arXiv:2003.12150, 2020.
[2] N. Hany, N. Atef, N. Mostafa, S. Mohamed, M. ElSahhar and A. AbdelRaouf, "Detection COVID-19 using Machine Learning from Blood Tests," 2021 International Mobile, Intelligent, and Ubiquitous Computing Conference (MIUCC), Cairo, Egypt, 2021, pp. 229-234, doi: 10.1109/MIUCC52538.2021.9447639.

[3] L. Wang, H. Shen, K. Enfield and K. Rheuban, "COVID-19 Infection Detection Using Machine Learning," 2021 IEEE International Conference on Big Data (Big Data), Orlando, FL, USA, 2021, pp. 4780-4789, doi: 10.1109/BigData52589.2021.9671700.

[4] M. Z. Alom, M. M. Rahman, M. S. Nasrin, T. M. Taha and V. K. Asari, "COVID_MTNet: COVID-19 Detection with Multi-Task Deep Learning Approaches," arXiv preprint arXiv:2004.03747, 2020.

[5] F. Bismadhika, N. N. Qomariyah and A. A. Purwita, "Experiment on Deep Learning Models for COVID-19 Detection from Blood Testing," 2021 IEEE International Biomedical Instrumentation and Technology Conference (IBITeC), Yogyakarta, Indonesia, 2021, pp. 136-141, doi: 10.1109/IBITeC53045.2021.9649254.

[6] https://www.javatpoint.com/machine-learning-random-forest-algorithm

[7] https://www.geeksforgeeks.org/boosting-in-machine-learning-boosting-and-adaboost/
