
Analyzing and Predicting Coronavirus Infections Using Machine Learning Algorithms


Abstract- Since December 2019, the novel coronavirus disease (COVID-19) has caused over 2 million deaths, with more than 100 million people infected worldwide, and the numbers are increasing day by day. Bangladesh, one of the most densely populated countries in the world, is now under community transmission of COVID-19. This has created huge health, social, and economic burdens for this small country. Bangladesh had reported over 250,000 infected cases and 3,000 deaths as of 7 August 2020. To prevent further loss, prediction of future consequences is crucial. Previous studies show that machine learning (ML) algorithms can provide precise information regarding COVID-19. This knowledge helps decision-makers act accordingly. In this study, we examine several machine learning regression algorithms to find the one that best estimates future cases for overpopulated countries like Bangladesh. Based on this result, the government and policymakers can make decisions about lockdowns, resource mobilization, socio-economic progress, the education of children, and more.

Keywords- COVID-19, Coronavirus, Machine Learning, Time-series Analysis, Data Analysis, Data Regression

I. INTRODUCTION
Coronaviruses belong to a large family of viruses known to cause illnesses ranging from the common cold to more severe diseases such as Middle East Respiratory Syndrome (MERS) and Severe Acute Respiratory Syndrome (SARS). These two diseases are caused by the coronaviruses named MERS-CoV and SARS-CoV. SARS was first seen in 2002 in China and MERS was first seen in 2012 in Saudi Arabia. The latest virus, first seen in Wuhan, China, is called SARS-CoV-2, and it causes coronavirus disease (COVID-19). Since the first case was reported in December 2019, the number of coronavirus cases has been increasing, along with a high death toll. The virus spread from that one city to the whole country and then to the whole world.

As COVID-19 spreads from person to person, Artificial Intelligence and Machine Learning based electronic devices can play a pivotal role in preventing the spread of this virus. The increasing availability of electronic health data can be used for training machine learning algorithms to improve their decision-making in predicting diseases.

As the number of infected people quickly outnumbered the available medical resources in hospitals worldwide, it has placed a substantial burden on health care systems [1] [2] [3]. Due to the limited availability of resources at hospitals and the delay in obtaining medical test results, it is difficult for health workers to give proper medical treatment to patients. In our thesis, we used machine learning techniques to predict the spread of coronavirus in patients.

Our study shows that, among the machine learning models examined, the Support Vector Regression model provided the best accuracy in predicting the pandemic situation and the spread of the virus. We believe that, using the output of our model, the authorities can take decisions that will save countless lives. Additionally, this will help to reduce the immeasurable economic burden our country is facing at this time.

II. BACKGROUND STUDIES
Researchers all over the world are wholeheartedly trying to help the general public fight the coronavirus pandemic. Researchers in the field of Computer Science are attempting to forecast potential future cases to help countries plan for upcoming situations and act as needed. Different machine learning and deep learning models [4] [5] have done a noteworthy job so far in forecasting COVID-19 confirmed cases and the numbers of deaths and recoveries. Shinde et al. [6] identified several important parameters for forecasting a pandemic: big data, mathematical models, machine learning techniques, statistical and analytical parameters, daily death count, daily infected cases, medical facilities, mobility, transmission rate, age, gender, and geographical location. They also stated that a lack of proper data can have a negative impact on forecasting and that choosing the wrong prediction algorithm can be misleading. Researchers in [7] used several machine learning models to forecast COVID-19 confirmed cases, the number of deaths, and the number of recoveries for the 10 upcoming days. Another group of researchers [8] worked on the growth rate of infected cases in India and used ES and Polynomial Regression (PR) models to predict future cases. The study in [9] used LR, Multilayer Perceptron (MLP), and Vector Auto-Regressive (VAR) models for forecasting COVID-19 cases in India. Unfortunately, not much significant research has yet been carried out on data from Bangladesh. Some of the works regarding this country can be found in [10] [11] [12] [13].

III. METHODOLOGY
Our research work involves finding appropriate approaches that can predict the number of COVID-19 cases, the number of patients recovering from COVID-19, and the deaths due to COVID-19 using continuous data. The flowchart of the implemented system is shown in Figure 1
below.

Figure 1: Flowchart for COVID-19 analysis and prediction

A. Dataset preparation
We used the time-series dataset from Johns Hopkins University, which collected data from different government-published sources and other sources (e.g. WHO, ECDC, US CDC, BNO News). This real-time dataset provided by Johns Hopkins University, USA, can be found at [14]. The data was preprocessed after collection. Keeping in mind that the first confirmed case of COVID-19 in Bangladesh was on 8 March 2020, we collected the time-series data from 22 January 2020 to 22 January 2021 for our research.

B. Training model development
At first we split the dataset into a training set and a testing set. Then we examined the following three machine learning models used for regression on continuous data to find out which algorithm performs best on the dataset.

• Polynomial Regression: Polynomial Regression [15] [16] is a special kind of regression approach that models the relationship between two variables as an nth-degree polynomial of the independent variable. Equation 1 shows how the dependent and independent variables are related:

y = c_0 + c_1 x + c_2 x^2 + \dots + c_n x^n    (1)

Here y is the dependent variable, x is the independent variable, c is the set of coefficients, and n is the degree of the polynomial. For our research, we used a sixth-degree polynomial.

• Support Vector Machine: The objective of the Support Vector Machine (SVM) [17] [18] is to find the hyperplane between the support vectors that maximizes the margin between two separate classes, and it can be expressed as in Equation 2:

f(x) = \sum_i \alpha_i y_i \exp\left( -\frac{\| x_i - x_j \|^2}{2\sigma^2} \right) + b    (2)

where \alpha_i is the Lagrangian multiplier and \| x_i - x_j \|^2 is the squared Euclidean distance between the two feature vectors. \sigma is the sigma parameter of the SVM kernel, which controls the width of the Gaussian term. For our system, we are using the polynomial kernel, as it outperforms the other kernels when tested on our dataset.

• Bayesian Ridge Regression: Bayesian Ridge Regression [19] estimates a probabilistic model of the regression problem. Here the prior for the coefficient w is given by a spherical Gaussian, as in Equation 3:

p(w \mid \lambda) = \mathcal{N}(w \mid 0, \lambda^{-1} I)    (3)

C. Data Visualization
We used Python as the programming language for our work. We used Pandas Visualization [20] and Matplotlib [21] for data visualization in Python.

D. Performance Evaluation Techniques
• Accuracy: The accuracy of a test [22] is its ability to differentiate the correct and incorrect scenarios correctly. To estimate the accuracy of a test, we calculate the proportion of true positives and true negatives among all evaluated cases, which can be stated as in Equation 4:

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}    (4)

where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.

• Root Mean Square Error (RMSE): RMSE [23] is the standard deviation of the residuals, or prediction errors, where residuals are a measure of how far the data points are from the regression line. Lower values of RMSE indicate a better fit. The formula for calculating RMSE is shown in Equation 5:

RMSE = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (f_i - o_i)^2 }    (5)

where n is the number of samples, f_i are the forecasts, and o_i are the observed values.

IV. RESULTS
This section describes the results achieved by experimenting with the ML models described in the previous section. At first we trained and tested each of the three models with the dataset in use. Their accuracy in predicting infections for the 10 upcoming days was then used to find the best model. Finally, the pros and cons of the different models were evaluated on the basis of the RMSE values of their predictions. Based on our assessments, SVM has proved itself to be the best model for predicting the number of COVID-19 cases for the 10 upcoming days. The results generated by the regression models are given below.
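The two evaluation measures from Section III-D, together with the idea of holding out the most recent days for testing, can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code used in the study: the function names, the `horizon` parameter, the naive "repeat the last value" forecast, and the sample counts are all assumptions made for the example.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Equation 4: share of correct outcomes among all evaluated cases."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one evaluated case is required")
    return (tp + tn) / total


def rmse(forecasts, observed):
    """Equation 5: square root of the mean squared forecast error."""
    if len(forecasts) != len(observed) or not forecasts:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(f - o) ** 2 for f, o in zip(forecasts, observed)]
    return math.sqrt(sum(squared) / len(squared))


def holdout_last_days(series, horizon=10):
    """Split a chronological series into a training part and the last `horizon` points."""
    if not 0 < horizon < len(series):
        raise ValueError("horizon must be between 1 and len(series) - 1")
    return series[:-horizon], series[-horizon:]


# Made-up cumulative counts, for illustration only (not data from the study).
cases = [100, 120, 150, 190, 240, 300, 370, 450, 540, 640, 750, 870]
train, test = holdout_last_days(cases, horizon=3)

# Hypothetical baseline: repeat the last training value for every test day.
naive_forecast = [train[-1]] * len(test)
print("RMSE of naive forecast:", rmse(naive_forecast, test))
print("Accuracy example:", accuracy(tp=40, tn=45, fp=5, fn=10))
```

Any of the three regression models could take the place of the naive forecast here; the model with the lowest RMSE on the held-out days would then be preferred.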
Figures 2 and 3 show the worldwide confirmed cases prediction using the Polynomial Regression model. The number of confirmed cases from 22 January 2020 to 22 January 2021 is 112M according to the model's prediction, whereas the real number of confirmed cases in this period is 98.204M. The accuracy obtained for this model is 97.8%.
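To make the polynomial model of Equation 1 concrete, the following is a minimal sketch of a least-squares polynomial fit in plain Python. It is not the implementation used in the study, which does not state how the fit was computed; the function names and the toy data are assumptions, and the example uses a second-degree polynomial so the result is easy to check by hand.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients c_0..c_n of Equation 1 via the normal equations."""
    n = degree + 1
    if len(xs) != len(ys) or len(xs) < n:
        raise ValueError("need at least degree + 1 matching points")
    # Augmented system (X^T X | X^T y), where X is the Vandermonde matrix of xs.
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)]
         + [sum(y * x ** i for x, y in zip(xs, ys))]
         for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("singular system; try a lower degree")
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n + 1):
                a[row][k] -= factor * a[col][k]
    # Back substitution.
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (a[i][n] - tail) / a[i][i]
    return coeffs


def predict(coeffs, x):
    """Evaluate c_0 + c_1 x + ... + c_n x^n with Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


# Toy data generated from y = 1 + 2x + 3x^2 (illustration only, not study data).
days = [0, 1, 2, 3, 4, 5]
counts = [1 + 2 * d + 3 * d ** 2 for d in days]
coeffs = fit_polynomial(days, counts, degree=2)
print("coefficients:", coeffs)
print("forecast for day 6:", predict(coeffs, 6))
```

For a sixth-degree fit over a full year of daily data, the raw day index would make the normal equations badly conditioned, so in practice the days would usually be rescaled to a small range (for example 0 to 1) or a numerical library would be used instead.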

Figure 2: Worldwide confirmed cases prediction graph using Polynomial Regression model

Figure 3: Prediction of confirmed cases for 10 upcoming days by Polynomial Regression model

Figures 4 and 5 show the worldwide confirmed cases prediction using the Bayesian Ridge Regression model. The number of confirmed cases from 22 January 2020 to 22 January 2021 is 113M according to the model's prediction. The accuracy obtained for this model is 97.3%.

Figure 4: Worldwide confirmed cases prediction graph using Bayesian Ridge Regression model

Figure 5: Prediction of confirmed cases for upcoming 10 days by Bayesian Ridge Regression model

Figures 6 and 7 show the worldwide confirmed cases prediction using the Support Vector Machine (SVM) Regression model. The number of confirmed cases from 22 January 2020 to 22 January 2021 is 107M according to the model's prediction. The accuracy obtained for this model is 98.62%.

Figure 6: Worldwide confirmed cases prediction graph using SVM Regression model

Figure 7: Prediction of confirmed cases for upcoming 10 days by SVM Regression model

Polynomial Regression works well for smooth time-series curves, and the COVID-19 growth curve had yet to start smoothing; hence the PR model did not fit our dataset well. Also, according to the RMSE values of these models, SVM Regression outperforms PR and BRR. It has shown better accuracy than the others too. Table 1 below shows the comparison of the RMSE values of our models.
Table 1: RMSE values of our models

Models Used    RMSE for predicting confirmed cases
SVM            1244128.9155
PR             3190010.5892
BRR            3530575.996

V. CONCLUSION
Using the forecasts provided by our system, the authorities will be able to take appropriate decisions in advance. When utilized correctly, this advance preparation will lead to minimal loss of lives. Furthermore, it will support the maximum utilization of our resources to keep our economy growing. The proposed prediction system will be valuable for countries like Bangladesh that are aiming to withstand the pandemic.

REFERENCES

[1] Z. Ratan, H. Hosseinzadeh, N. Runa, B. Uddin, M. F. Haidere, S. Sarker and S. Zaman, "Novel Coronavirus: A New Challenge for Medical Scientist?," Bangladesh Journal of Infectious Diseases, vol. 7, pp. 58-60, 2020.
[2] U. Tiwari, A. Bano and M. K. A. Khan, "A review on the Covid-19: Facts and current situation," NeuroPharmac Journal, pp. 180-191, 2021.
[3] D. Lewis, "Is the coronavirus airborne? experts can’t agree," Nature, vol. 580, no. 7802, p. 175.
[4] S. Ardabili, A. Mosavi, P. Ghamisi, F. Ferdinand, A. R. Varkonyi-Koczy, U. Reuter, T. Rabczuk and P. M. Atkinson, "Covid-19 outbreak prediction with machine learning," Algorithms, vol. 13, no. 10, 2020.
[5] D. Fanelli and F. Piazza, "Analysis and forecast of covid-19 spreading in," Chaos, Solitons & Fractals, vol. 134, 2020.
[6] G. R. Shinde, A. B. Kalamkar, P. N. Mahalle, N. Dey, J. Chaki and A. E. Hassanien, "Forecasting models for coronavirus disease (covid-19): A survey of the state-of-the-art," SN Computer Science, no. 1, 2020.
[7] F. Rustam, A. A. Reshi, A. Mehmood, S. Ullah, B.-W. On, W. Aslam and G. Sang, "COVID-19 Future Forecasting Using Supervised Machine Learning Models," IEEE Access, vol. 8, pp. 101489-101499, 2020.
[8] R. Gupta, S. K. Pal and G. Pandey, "A Comprehensive Analysis of COVID-19 Outbreak situation in India," COVID-19 SARS-CoV-2 preprints from medRxiv and bioRxiv, 2020.
[9] R. Sujath, J. M. Chatterjee and A. E. Hassanien, "A machine learning forecasting model for covid-19 pandemic in india," Stochastic Environmental Research and Risk Assessment, pp. 959-972, 2020.
[10] A. K. Mohiuddin, "An Extensive Review of Health and Economy of Bangladesh Amid Covid-19 Pandemic," European Journal of Sustainable Development Research, vol. 4, no. 4, 2020.
[11] A. A. Zabir, A. Mahmud, M. A. Islam, S. C. Antor, F. Yasmin and A. Dasgupta, "Covid-19 and food supply in bangladesh: A review," South Asian Journal of Social Studies and Economics, vol. 10, no. 1, pp. 15-23, 2020.
[12] K. A. Mottaleb, M. Mainuddin and T. Sonobe, "COVID-19 induced economic loss and ensuring food security for vulnerable groups: Policy implications from Bangladesh," PLoS ONE, vol. 15, no. 10, 2020.
[13] M. Saifuzzaman, M. M. Rahman, S. F. Shetu and N. N. Moon, "COVID-19 and Bangladesh: Situation report, comparative analysis, and case study," Current Research in Behavioral Sciences, vol. 2, 2021.
[14] [Online]. Available: https://github.com/CSSEGISandData/COVID-19.
[15] A. Agarwal, "Polynomial Regression," [Online]. Available: https://towardsdatascience.com/polynomial-regression-bbe8b9d97491.
[16] B. Sun, H. Liu, S. Zhou and W. Li, "Evaluating the Performance of Polynomial Regression Method with Different Parameters during Color Characterization," Mathematical Problems in Engineering, 2014.
[17] "Understanding Support Vector Machine Regression," [Online]. Available: https://www.mathworks.com/help/stats/understanding-support-vector-machine-regression.html.
[18] M. Haugh, "Machine Learning for OR & FE: Support Vector Machines (and the Kernel Trick)," [Online]. Available: http://www.columbia.edu/~mh2078/MachineLearningORFE/SVMs_MasterSlides.pdf.
[19] S. M. Mostafa, A. S. Eladimy, S. Hamad and H. Amano, "CBRL and CBRC: Novel Algorithms for Improving Missing Value Imputation Accuracy Based on Bayesian Ridge Regression," Symmetry, vol. 12, no. 10.
[20] "Chart Visualization," [Online]. Available: https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html.
[21] "Matplotlib: Visualization with Python," [Online]. Available: https://matplotlib.org/.
[22] A. Baratloo, M. Hosseini, A. Negida and G. El Ashal, "Part 1: Simple Definition and Calculation of Accuracy, Sensitivity and Specificity," Emergency, vol. 3, no. 2, pp. 48-49, 2015.
