Covid Analysis
Seminar Report
On
Submitted to
Submitted by:
Binayak Acharya
Exam Roll No: 19460006
PU Reg. No.: 2019-1-46-0006
Table of Contents
1.0 Introduction......................................................................................................................................1
1.2 Motivation in the context of Nepal...............................................................................................3
1.3 Present status of Covid-19 in Nepal.............................................................................................4
Why Covid19 analysis of Nepal?....................................................................................................7
1.4 Aims and Objectives.....................................................................................................................8
1.5 Research Questions.......................................................................................................................8
1.6 Report Structure............................................................................................................................9
2.0 Literature Review...........................................................................................................................10
2.1 Analysis of Problem domain of Covid19...................................................................................11
2.2 Analyzing the most feasible solution to tackle the aforementioned problem domain................12
2.3 Implementation and analysis of derived results..........................................................................13
2.4 Literature Gap and Contributions...............................................................................................16
3.0 Theoretical Foundation...................................................................................................................17
3.1 Machine Learning.......................................................................................................................17
Why Clustering?...........................................................................................................................19
Clustering Algorithm(K-Means)...................................................................................................19
Pros:.............................................................................................................................19
Cons:.............................................................................................................................................19
VAR...............................................................................................................................20
Prophet:.........................................................................................................................................22
Why Prophet and VAR used in this project?................................................................................23
3.2 Data Visualization Features and Classification..........................................................................24
3.3 Research Methods and Techniques Applied...............................................................................25
4.0 Methodology..................................................................................................................................26
4.1 Research Methodology...............................................................................................................26
4.2 Software Development Methodology.........................................................................................28
Why USDP not suitable for this project?......................................................................................29
Principles:......................................................................................................................................30
Reasons for Selecting DSDM for this project...............................................................................30
Limitations of DSDM in this project:...........................................................................................31
5.0 Findings..........................................................................................................................................32
5.1 Data Understanding....................................................................................................................32
5.2 Data Preparation.........................................................................................................................33
5.3 Data Manipulation......................................................................................................................34
5.4 Exploratory Data Analysis..........................................................................................37
5.5 Machine Learning.......................................................................................................41
Modelling Data (using VAR method)...........................................................................................48
Checking for Residuals' Autocorrelation......................................................................................49
Forecasting....................................................................................................................................49
Predicted vs Actual Visualization.................................................................................................51
Checking the Correlation of actual with predicted.......................................................................51
Predicting......................................................................................................................................54
Graphical Representation of Predicted Screening.........................................................................54
Modelling......................................................................................................................................56
Predicted........................................................................................................................................56
Graphical Representation of Predicted Confirmation...................................................................57
Recovery Cases Prediction............................................................................................................57
Modelling......................................................................................................................................58
Predicting......................................................................................................................................58
Graphical Representation of Predicted Recovery.........................................................................58
Death Prediction............................................................................................................................59
Data Modelling.............................................................................................................................59
Predicting......................................................................................................................................60
Graphical Representation of Predicted Death...............................................................................60
Future Ratios.................................................................................................................................61
6.0 Discussion......................................................................................................................................62
6.1 Result Summary..........................................................................................................................62
6.2 Justifications of research questions............................................................................................62
6.3 Research Contributions...............................................................................................................63
6.4 Limitations..................................................................................................................................64
7.0 Conclusion......................................................................................................................................65
8.0 References......................................................................................................................................66
Table of Figures
Figure 1 A timeline of five pandemics since 1918 (liu, kuo, & shih, 2020)............................................1
Figure 2 Protection Motivation Theory framework (Rad, et al.).............................................................2
Figure 3 Covid -19 cases as of Jan in Nepal............................................................................................4
Figure 4 Daily new cases in Nepal (Worldmeter, 2022)..........................................................................4
Figure 5 Active Cases in Nepal (Worldmeter, 2022)...............................................................................5
Figure 6 Total Deaths per day by Covid19 in Nepal (Worldmeter, 2022)..............................................6
Figure 7 New Cases Vs New Recoveries (Worldmeter, 2022)................................................................6
Figure 8 Data analysis and visualization architecture (G & L, 2021)....................................................10
Figure 9 Covid19 data distribution in studies........................................................................................12
Figure 10 Process flow of EDA to COVID 19 data (Dsouza & Velan , 2020).....................................13
Figure 11 Data analysis technique, source, and findings from existing studies (Alsunaidi & Ibrahim ,
2021)......................................................................................................................................................15
Figure 12 Clustering Example (geeksforgeeks, 2020)...........................................................................19
Figure 13 Python Libraries used for visualizations (Dsouza & Velan , 2020)......................................24
Figure 14 Research model for Study......................................................................................................26
Figure 15 Overview of the workflow of ML (Pant, 2019).....................................................................26
Figure 16 USDP Process lifecycle (Wells, 2009)..................................................................................28
Figure 17 DSDM Methodology (Iqra & Khan, 2018).........................................................................31
Acknowledgement
I would like to express my deepest appreciation to all those who made it possible for me
to complete this report. I owe special gratitude to Mr. Niranjan Khakurel, whose
stimulating suggestions and encouragement helped me complete this project within the
stipulated time.
Abstract
The outbreak of the 2019 novel coronavirus disease has adversely affected many countries in
the world. The unexpectedly large number of COVID-19 cases has disrupted the health systems
of many countries, including Nepal. Consequently, predicting the number of COVID-19 cases is
imperative for governments to take appropriate action. The number of cases can be predicted
more accurately by considering historical or reported case data alongside the external
factors that influence the spread of the virus. Therefore, the main objective of this study
is to consider historical data and external factors simultaneously. This is accomplished
through data analytics and visualization, which reveal the relationships between different
variables.
The viability and superiority of the developed algorithm are demonstrated by predicting
future cases. Moreover, the experiments are extended to forecast cases for the period from
April 2020 until December 2020.
By using such predictions, both the government and the people of the affected countries can
take appropriate measures to resume pre-epidemic activities.
Figure 1 A timeline of five pandemics since 1918 (liu, kuo, & shih, 2020)
Page | 1
1.1 Motivation
The deadly impact of COVID-19 is driving a massive amount of research aimed at
understanding various characteristics of the pandemic. The speed with which the disease has
spread throughout the world demands agile solutions to understand and estimate its
progression. The global spread of COVID-19 has generated a huge and varied amount of data,
which is increasing rapidly.
The high prevalence and mortality of COVID-19 have made it the most important health and
social challenge around the world. However, the disease can largely be prevented by
adherence to hygienic principles and protective behaviours. Identifying the processes
involved in protective health behaviours can be effective in planning and implementing
suitable interventions to encourage communities toward such behaviours. Therefore, the
present study draws on the Protection Motivation Theory to predict the preventive
behaviours for COVID-19. (Rad, et al.)
Despite still being in the middle of the outbreak, there is an urgent need to understand
the impact of COVID-19. The objective is to clarify how it spread so fast worldwide in
such an unprecedented fashion.
Moreover, data analysis of COVID-19 can produce future predictions of cases, deaths, and
home confinement, which helps establish the status of an infected country. It also provides
insights into which areas are infected and which are most heavily infected, so that people
can be cared for accordingly.
In order to analyze the data successfully, several data requirements should be incorporated:
1. Geography:
One challenge is to incorporate data by area so that case numbers are arranged in a
way that reflects their spatial relationships. This requires studying cases in
different areas to show how they spread through different regions, and how each
region is affected by the pandemic.
Page | 2
2. Absolute number
This indicates the case count by area, for example total or cumulative case counts.
It establishes the total number of cases in the different regions of the country.
3. Relative number
This also indicates case counts by area, but expressed relative to population, for
example cumulative cases as a share of population size.
4. Rate of change
This shows the extent of growth in cases: whether growth by area is speeding up or
slowing down.
5. Time elapsed
This shows the difference from an absolute or relative starting point in time:
whether cases are going up or down over a given period.
In addition to these key requirements, the following properties should hold for the analysis to be successful:
1. Concurrent
All data items must be shown simultaneously to support comparison, exploration, and
other synoptic tasks.
2. Discernible
All marks or data must be discernible with limited or manageable occlusion.
3. Prioritized
Phenomena and patterns that are important must be visually salient.
4. Estimable
Graphical techniques used to encode quantities must allow estimation. (Beecham,
n.d.)
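The relative-number and rate-of-change requirements above can be sketched in pandas. This is a minimal illustration only: the region names, populations, and case counts below are invented, not the study's actual data.

```python
import pandas as pd

# Hypothetical cumulative case counts for two regions (illustrative values only)
df = pd.DataFrame({
    "region": ["Bagmati", "Bagmati", "Gandaki", "Gandaki"],
    "date": pd.to_datetime(["2021-05-01", "2021-05-02",
                            "2021-05-01", "2021-05-02"]),
    "cumulative_cases": [1000, 1100, 400, 420],
    "population": [6000000, 6000000, 2400000, 2400000],
})

# Relative number: cumulative cases as a share of population (per 100k people)
df["cases_per_100k"] = df["cumulative_cases"] / df["population"] * 1e5

# Rate of change: day-on-day growth in cumulative cases, computed per region
df["daily_growth"] = df.groupby("region")["cumulative_cases"].diff()

print(df[["region", "date", "cases_per_100k", "daily_growth"]])
```

Per-capita rates make regions of different sizes comparable, while the grouped difference shows whether each region's outbreak is speeding up or slowing down.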
Page | 3
1.3 Present status of Covid-19 in Nepal.
As of January 2022, the number of active coronavirus (COVID-19) infections in Nepal was
831,748. Among these, 1,643 infected individuals were being treated in intensive care units
(statistica.com, 2021). Another 11 thousand infected individuals were hospitalized with
symptoms, while around 819 thousand were in isolation at home.
[Bar chart: ICU 1,643; hospitalized 11,018; home quarantine 819,105; total current infections 831,748]
Figure 3 Covid -19 cases as of Jan in Nepal
The total number of coronavirus cases in Nepal (including active cases, individuals who
recovered, and individuals who died) surpassed 8 million as of January 2022. The region hit
hardest by the spread of the virus was Kathmandu, which counted more than 256 thousand
cases. (Nepal Government, 2020)
Page | 4
From the graph we can see that daily new COVID-19 cases in Nepal have decreased. Cases
peaked in June 2021, reaching up to 10k per day; the daily count has now fallen below 300.
The graph also shows more than 10k people infected in the month of January, which is lower
than in the previous month. Most infections occurred between May and July 2021; the cases
now seem to be under control.
Page | 5
Figure 6 Total Deaths per day by Covid19 in Nepal (Worldmeter, 2022)
This figure shows the number of deaths caused by COVID-19 in Nepal from February 2020 to
December 2021. To date, the death count has reached 11,602. Nepal suffered its greatest
losses in May and June 2021. These statistics suggest the situation in Nepal is now out of
danger.
This figure shows the number of new cases and new recovered patients of COVID-19 in Nepal.
The number of recovered cases is higher than the number of new cases recorded. A high
number of cases was found in October 2020, and a high number of recoveries from November
2020 until now.
Page | 6
Why Covid19 analysis of Nepal?
The novel coronavirus that has been spreading worldwide since December 2019 has sickened
millions of people, locked down major cities and some countries, and prompted unprecedented
global travel restrictions. The analysis clearly shows that Nepal's COVID-19 fatality and
mortality rates are in line with the world scenario, as are the number of tests performed
in Nepal and in its different regions. This up-to-date analysis may elucidate the evolution
of the COVID-19 pandemic in Nepal and may help predict its future outcome in the different
states of Nepal, from which the number of affected and unaffected regions can be found.
Hence, this research could be the first of its kind aimed at reducing COVID-19 in Nepal,
and the methodologies developed here could be applied to analyse other affected regions as
well.
Page | 7
1.4 Aims and Objectives
The main aim of the dissertation is to analyse Nepal's COVID-19 data using
different machine learning algorithms and to create visualizations for different regions.
The following objectives have been set, based on appropriate research, to achieve this
aim:
1. To analyse and visualize state-level data from Nepal's regions.
2. To predict the spread of Covid19 ahead of time to take preventive measures.
3. To provide estimates of basic measures of the infectiousness and severity of Covid19.
4. To investigate the predictive ability of simple mathematical models and provide
simple forecasts for the future incidence of Covid19 in Nepal.
1.5 Research Questions
RQ 1. What are the short-term predictions for the number of cases in Nepal for the next 2-3
weeks, based on the current situation?
Page | 8
1.6 Report Structure
Chapter 1: This chapter introduces COVID-19 in Nepal and the world. It also covers the
motivation for doing this research in Nepal, along with the aims and objectives, and
identifies the research questions.
Chapter 2: This chapter covers the literature review. Currently available COVID-19 data
analyses are examined and a comparative table is created, establishing the importance of
doing this project for Nepal. Existing analyses are compared with the analysis to be
carried out here, and the literature gap is identified.
Chapter 3: The theoretical foundations required for analysis and visualization are
highlighted, and the classes of algorithms required for this project are discussed.
Chapter 4: This chapter deals with the research methodology as well as the system
methodology. The research methodology covers the results obtained from the quantitative
analysis of the collected data; the system methodology describes the steps taken to
create the analysis.
Chapter 5: This chapter interprets the results obtained from the data analysis.
Chapter 6: This chapter discusses the research, its limitations, future extensions, and
the research contributions.
Chapter 7: This final chapter concludes the report by providing suggestions and
recommendations.
Page | 9
2.0 Literature Review
The category of EDA on COVID-19 is extremely varied in terms of functions, technical
aspects, and architecture. A tangible classification is necessary to understand the
systems' characteristics, contrasts, and respective pros and cons. Studying the systems
one by one may prove improper, as the study may focus on single aspects and miss
significant components like complexity. Hence, this study first classifies the systems and
then compares all dimensions of the data. The focus of the research is COVID-19 analysis
and its in-depth investigation.
The figure below demonstrates the general architecture of data analysis and visualization.
This research does not focus on technical aspects, as that has already been done before.
The focus here is to gather data with relevant content from various sources and apply the
data to generate results. The data could comprise various elements with multiple
dimensions.
Page | 10
2.1 Analysis of Problem domain of Covid19
Numerous authors have worked on COVID-19 since the pandemic started. Researchers have
extracted information about infected cases and analysed the medical information that could
explain the spread of the coronavirus, suggesting that spread could relate to sex, birth
year, or region of origin.
Another similar work was done on data analysis and visualization of COVID-19. The study
focused on freely available COVID-19 datasets and provided data analytics on a number of
aspects, including the symptoms of the disease and the differences between COVID-19 and
other diseases caused by severe acute respiratory syndrome (SARS), Middle East respiratory
syndrome (MERS), and swine flu (A, S, Z, & Kousa, 2020). Data visualization comparing
infections in males and females showed that males are more prone to the disease and that
older people are more at risk. Based on the data, the pattern of increase in confirmed
cases was found to be exponential in nature. The relative numbers of confirmed, recovered,
and death cases in different countries were also shown with data visualizations in the
study. S.C. Gamoura, 2020 worked on real-time data analytics and prediction of the
COVID-19 pandemic from February to March 26th, 2020, visualizing current COVID-19 cases in
real time (chehbi, 2020).
Mudr discussed the use of data analytics during the COVID-19 pandemic, stating the number
of challenges that COVID-19 posed to the success of clinical trials (Mudr, 2020). Mohamed
and Abdurrahman worked on predicting the COVID-19 epidemic in Algeria using the SIR model,
aiming to predict the daily number of infected cases. The SIR model was applied to data
from 25 February 2020 to 24 April 2020. Based on the simulation of the two models, the
epidemic peak of COVID-19 was predicted to be attained on 24th July 2020 in the worst-case
scenario, and the disease was expected to disappear between September 2020 and November
2020 at the latest. (Boudrioua & Abderrahmane , 2021)
Kraichat discussed the influencing factors of COVID-19 spread in Thailand, covering the
situation, the spread, and the factors influencing spread and control. The confirmed
COVID-19 case data were obtained from the official website of the Department of Disease
Control, Ministry of Public Health. Researchers analysed the situation from the first case
found in Thailand until 15 April 2020 against a timeline of influencing factors. The
correlation between tourist data and infected cases was calculated with the Pearson
correlation coefficient. From this, it was found that the number of tourists and their
activities were significantly associated with the number of confirmed COVID-19 cases.
Public education and social support played key roles in regulation enforcement and
implementation. (Tantrakarnapa, Bhopdhornangkul, & Nakhaapakorn, 2020, pp. 4-8)
Page | 11
2.2 Analyzing the most feasible solution to tackle the aforementioned problem
domain
Many solutions have been designed to control the COVID-19 pandemic, including forecasting
and decision-making solutions. Demographic data is useful in understanding the main
characteristics of the population and can be used to classify study samples into
categories such as male and female. Social data is used by solutions that study the impact
of the COVID-19 pandemic's repercussions on the human psychological state. Travel data is
used to identify suspected COVID-19 cases arriving from countries where the pandemic has
spread. (Alsunaidi & Ibrahim , 2021)
Likewise, many researchers have utilized machine learning techniques along with
Spark-based linear models, multilayer perceptrons, and long short-term memory networks in
a two-stage cascading platform to enhance prediction accuracy across different datasets.
They applied their methods to these datasets, and their models performed with higher
accuracy and lower computation time.
To overcome these problems, various machine learning models are used for forecasting, such
as VAR. This model predicts future cases, deaths, and recoveries from the current
scenario.
Page | 12
2.3 Implementation and analysis of derived results
Exploratory data analysis (EDA) and machine learning algorithms are used to visually
represent the knowledge embedded deep in a dataset. This technique is widely used to
generate inferences from a dataset. Datasets on the current COVID-19 pandemic are widely
available from standard dataset repositories, and EDA can be applied to them to generate
inferences. Data visualization is applied to a dataset to reveal patterns, giving better
insight into the effects of the pandemic with respect to its variables. A web application
tool called JupyterLab is used to generate graphs in the Python language, as it provides
the libraries used for EDA, and visualization is produced for the attributes showing
higher correlation. Based on the graphs obtained, we can draw conclusions about the
current situation from the available data.
Figure 10 Process flow of EDA to COVID 19 data (Dsouza & Velan , 2020)
We can analyse the different results and obtain important insights from them. A visual
representation is appealing and easy to understand; results produced in the form of graphs
help us comprehend the current situation easily. Exploratory data analysis yields an
enriched form of the data and provides data visualizations. These outputs are combined
with various algorithms and models to make decisions or to build a product that is
advantageous when used in real time.
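The EDA steps described above can be sketched in a few lines of pandas: inspect structure, compute summary statistics, check for missing values, and compute correlations to decide which attribute pairs to visualize. The column names and values below are hypothetical, not the study's actual dataset.

```python
import pandas as pd

# Hypothetical extract of a daily COVID-19 dataset (values invented)
df = pd.DataFrame({
    "date": pd.date_range("2021-05-01", periods=5),
    "new_cases": [8000, 8500, 9200, 8800, 7900],
    "new_deaths": [150, 160, 180, 170, 140],
    "new_recoveries": [5000, 5600, 6100, 6500, 7000],
})

# Typical first EDA steps: summary statistics and missing-value counts
print(df.describe())
print(df.isna().sum())

# Correlation between numeric attributes guides which pairs to visualize
corr = df[["new_cases", "new_deaths", "new_recoveries"]].corr()
print(corr)
```

In a JupyterLab notebook, the highly correlated pairs found this way would then be plotted (e.g. with matplotlib or seaborn) to produce the figures discussed in Chapter 5.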
Page | 13
The table below summarizes the area, aim, technique used, data type, data source, and
findings of existing studies:

Area: Diagnosis
Aim: Develop a diagnosis model for COVID-19 detection and diagnosis of symptoms to define
appropriate care measures
Technique used: Best Worst Method
Data type: Symptoms and CT scans
Data source: Body sensors
Findings: The model can differentiate COVID-19 from four other viral chest diseases with
98% accuracy

Area: Diagnosis
Aim: Design a medical device to detect and track respiratory symptoms of COVID-19
Data type: Symptoms
Data source: Headsets and mobile phone
Findings: The approach provided good and stable results and can be expanded to include
more sensors to detect other COVID-19 symptoms

Area: Diagnosis
Aim: Develop a remote patient monitoring program (RPM) for discharged COVID-19 cases
Technique used: Mixed-effects logistic regression model
Data type: Demographics, medical data
Data source: The remote monitoring program, pulse oximeter, and thermometer
Findings: RPM provides scalable remote monitoring capabilities and decreases readmission
risk

Area: Estimate or predict risk score
Aim: Specify the effect of COVID-19 on the cardiovascular system
Technique used: Multi-factor logistic regression model
Data type: Demographics and medical data
Data source: EHRs
Findings: Cardiac function and vital signs should be monitored in COVID-19 patients,
especially those with hypotension, pericardial effusion, or severe myocardial injury

Area: Estimate or predict risk score
Aim: Verify whether the COVID-19 virus can be transmitted through indirect contact
Data type: Demographics, medical, environmental, and other data
Data source: Guangzhou CDC database and sample collection
Findings: The virus can survive for a short period on surfaces, allowing indirect
transmission of infection to uninfected people

Area: Healthcare decision-making
Aim: Provide a platform for data collection and analysis to estimate disease incidence and
develop risk mitigation strategies and resource allocation
Technique used: Weighted prediction model
Data type: Demographics, medical, COVID-19, and other data
Data source: Mobile app
Findings: Existing data collection methods can be repurposed to track and obtain real-time
data for the population during any rapid global health crisis
Figure 11 Data analysis technique, source, and findings from existing studies (Alsunaidi & Ibrahim , 2021)
Page | 15
2.4 Literature Gap and Contributions
This paper makes the following contributions. Firstly, in contrast to the existing
approaches, which focus only on the historical data of persons infected with COVID-19, we
propose a more robust approach that simultaneously considers the historical COVID-19 case
data alongside most of the external factors that affect the spread of the disease. To
handle this large number of factors, I have used a time-series forecasting approach to
predict the future numbers of cases, deaths, and infected patients.
Second, instead of predicting the number of cases only, I use this method to predict the
numbers of deaths, hospitalized patients, and so on. This is fruitful, as it gives broad
information about the spread of COVID-19 in the different regions of Nepal.
Lastly, it has been observed that most research papers in the literature have not provided
future predictions of the number of COVID-19 cases. In contrast to these previous papers,
we use the trained model to make future predictions of the number of COVID-19 cases. By
using such predictions, both the government and people in the affected countries can take
appropriate measures to resume pre-epidemic activities.
Based on Sections 2.1 to 2.4, the research aims and objectives of this dissertation are
grounded in a sound and thorough critical analysis of numerous validated scientific
research materials.
Page | 16
3.0 Theoretical Foundation
3.1 Machine Learning
Machine learning and artificial intelligence have gained traction over the past years.
Machine learning is an application of artificial intelligence that gives systems the
ability to automatically learn and improve from experience without being explicitly
programmed. It focuses on the development of computer programs that can access data and
use it to learn for themselves. (expert.ai, 2021)
Machine learning is the field of study that gives computers the capability to learn
without being explicitly programmed, and it is one of the most exciting technologies one
could come across.
We will discuss about supervised and unsupervised learning in this section as it is related to
the project.
1. Supervised learning
Supervised learning, as the name indicates, involves a supervisor acting as a teacher.
Basically, supervised learning is when we teach or train the machine using data that is
well labeled. After that, the machine is provided with a new set of examples so that the
supervised learning algorithm analyses the training data and produces a correct outcome
from the labeled data. (Geeks for geeks, 2021)
Pros: -
1. Supervised learning allows collecting data and produces data output from previous
experiences
2. Helps to optimize performance criteria with the help of experience.
3. Supervised machine learning helps to solve various types of real-world
computation problems.
Cons: -
1. Classifying big data can be challenging.
2. Training for supervised learning needs a lot of computation time.
We have two types of supervised learning processes:
Classification
It involves grouping the data into classes. If we are thinking of extending
credit to a person, we can use classification to determine whether or not a
person would be a loan defaulter. When the supervised learning algorithm
labels input data into two distinct classes, it is called binary classification.
Multiple classification means categorizing data into more than two classes.
Regression
In regression, a single output value is produced using the training data. This value
has a probabilistic interpretation, ascertained after considering the strength
of correlation among the input variables. For example, regression can help
predict the price of a house based on its locality, size, etc.
In logistic regression, the output has discrete values based on a set of
independent variables. This method can flounder when dealing with non-linear
and multiple decision boundaries. Also, it is not flexible enough to capture
complex relationships in datasets.
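As a toy illustration of the regression described above (hypothetical house-size and price figures, not the project's dataset), a single-feature model can be fit by ordinary least squares:

```python
import numpy as np

# Hypothetical training data: house size (square metres) vs price.
size = np.array([60.0, 80.0, 100.0, 120.0, 150.0])
price = np.array([45.0, 58.0, 72.0, 85.0, 106.0])

# Fit price = a * size + b by ordinary least squares.
a, b = np.polyfit(size, price, deg=1)

# Regression produces a single continuous output value for an unseen input.
predicted = a * 110 + b
```

The strength of the linear fit is what gives such a prediction its probabilistic interpretation; logistic regression would instead map inputs to discrete class labels.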
2. Unsupervised Learning
Unsupervised learning is the training of a machine using information that is neither
classified nor labeled, allowing the algorithm to act on that information without
guidance. In this model, we do not need to supervise the model; instead we
allow the model to work on its own to discover information. In this type of machine
learning, the responsibility of the machine is to work and learn the data on its own and
group the unsorted information according to its similarities and patterns. Here, the
machine is left to search for the hidden structure in unlabeled data by itself.
(Geeks for geeks, 2021)
For example, the data points in the graph below that cluster together can be
classified into a single group. We can distinguish the clusters and
identify that there are 3 clusters in the picture below.
Figure 12 Clustering Example (geeksforgeeks, 2020)
However, the clusters need not be spherical; differently shaped groups of points can also
form clusters.
Why Clustering?
Clustering is important because it determines the intrinsic grouping among the
unlabelled data at hand. There are no universal criteria for a good clustering; it depends on
the user and on what criteria satisfy their need. For instance, we could be
interested in finding representatives for homogeneous groups (data reduction), in
finding “natural clusters” and describing their unknown properties (“natural” data
types), in finding useful and suitable groupings (“useful” data classes), or in finding
unusual data objects (outlier detection). The algorithm must make some assumptions
about what constitutes the similarity of points, and each assumption yields different,
equally valid clusters. (geeksforgeeks, 2020)
Clustering Algorithm(K-Means)
The k-means algorithm is an unsupervised machine learning technique used to
identify clusters of data objects in a dataset. There are many different types of
clustering methods, but k-means is one of the oldest and most approachable. These
traits make implementing k-means clustering in Python reasonably straightforward.
Pros:
It is easy to implement k-means and identify unknown groups of data from
complex data sets.
The algorithm is good at segmenting large data sets. Its efficiency
depends on the shape of the clusters; k-means works well on hyper-spherical
clusters.
K-means is linear in the number of data objects, which keeps execution time low.
It does not take as long as hierarchical algorithms to classify similar characteristics
in data.
Cons:
The way to initialize the means is not specified. One popular way to start is
to randomly choose k of the data points as the initial means.
The results produced depend on the initial values for the means and it
frequently happens that suboptimal partitions are found. The standard solution
is to try several different starting points.
The results depend on the metric used to measure ||x − m_i||. A popular
solution is to normalize each variable by its standard deviation, though this is
not always desirable.
The results depend on the value of k. (Holy python, 2021)
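The pros and cons above can be made concrete with a minimal NumPy sketch (illustrative only, not the project's actual code). It uses a deterministic farthest-point initialization to sidestep the random-start pitfall listed in the cons:

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Minimal k-means with farthest-point initialization."""
    # Start from the first point, then repeatedly add the point farthest
    # from the means chosen so far (avoids two initial means in one cluster).
    means = [X[0]]
    for _ in range(k - 1):
        d = ((X[:, None, :] - np.array(means)[None]) ** 2).sum(-1).min(axis=1)
        means.append(X[np.argmax(d)])
    means = np.array(means)
    for _ in range(n_iter):
        # Assign each point to its nearest mean under the metric ||x - m_i||.
        labels = ((X[:, None, :] - means[None]) ** 2).sum(-1).argmin(axis=1)
        new_means = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_means, means):
            break
        means = new_means
    return labels, means

# Three well-separated blobs, like the clustering figure above.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])
labels, means = kmeans(X, k=3)
```

Because the blobs here are hyper-spherical and well separated, each one converges to its own cluster, matching the "works well in hyper-spherical clusters" point above.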
VAR
Vector Autoregression is a forecasting algorithm that can be used when two or more time
series influence each other, meaning the relationship between the time series involved is
bi-directional. The basic requirements for using VAR are: -
1. We need at least two time series (variables).
2. The time series should influence each other.
It is considered an autoregressive model because each variable (time series) is modelled
as a function of its past values; that is, the predictors are nothing but the lags (time-delayed
values) of the series (Prabhakaran, 2019).
The vector autoregression (VAR) model extends the idea of univariate autoregression to k
time series regressions, where the lagged values of all k series appear as regressors. Put
differently, in a VAR model we regress a vector of time series variables on lagged vectors of
these variables. As for AR(p) models, the lag order is denoted by p, so the VAR(p)
model of two variables Xt and Yt (k = 2) is given by the equations
Yt=β10+β11Yt−1+⋯+β1pYt−p+γ11Xt−1+⋯+γ1pXt−p+u1t,
Xt=β20+β21Yt−1+⋯+β2pYt−p+γ21Xt−1+⋯+γ2pXt−p+u2t.
The βs and γs can be estimated using OLS on each equation. The assumptions for VARs
are the time series assumptions presented in Key Concept 14.6, applied to each of the
equations. It is straightforward to estimate VAR models in R; a feasible approach is to
simply use lm() for estimation of the individual equations (Stock & Watson, 2015).
Pros: -
A systematic but flexible approach for capturing complex real-world behaviour.
It supports better forecasting performance.
It has ability to capture the intertwined dynamics of the time series data.
Cons: -
VARs use little theoretical information about the relationships between the variables
to guide the specification of the model
It is often not clear how the VAR estimates of coefficient should be interpreted.
There are so many parameters to be estimated.
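The equation-by-equation OLS idea above can be sketched in NumPy (simulated data, not the project's dataset): a bivariate VAR(1) is estimated exactly as described, with each series regressed on a constant and one lag of every series.

```python
import numpy as np

# Simulate a stationary bivariate VAR(1): y_t = c + A y_{t-1} + u_t.
rng = np.random.default_rng(0)
A_true = np.array([[0.5, 0.2],
                   [0.1, 0.4]])
c_true = np.array([1.0, 0.5])
T = 2000
y = np.zeros((T, 2))
for t in range(1, T):
    y[t] = c_true + A_true @ y[t - 1] + rng.normal(0, 0.1, size=2)

# OLS equation by equation: regressors are [1, Y_{t-1}, X_{t-1}].
Z = np.column_stack([np.ones(T - 1), y[:-1]])
coef, *_ = np.linalg.lstsq(Z, y[1:], rcond=None)
c_hat, A_hat = coef[0], coef[1:].T  # recovered intercepts and lag matrix
```

With enough observations, the estimated lag matrix `A_hat` closely recovers the βs and γs used in the simulation; in practice a library such as statsmodels would handle lag selection and diagnostics.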
2. Check the data and make the proper adjustments
Once we select the variables, we can make some adjustments to the data that will improve
the estimation and interpretation of the model. It is useful to look at summary statistics
and a plot of each series to detect outliers, missing data, and other strange
behaviours.
8. Make Predictions
After we are sure our model is well specified, we can use the predict function to
generate forecasts. We can even plot impulse responses to check how the variables
respond to a particular shock, using the irf function.
9. Evaluate predictions
Once the predictions are done, we must evaluate them and compare them against
another model (Stock & Watson, 2015).
Prophet:
Prophet is a procedure for forecasting time series data based on an additive model in which
non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works
best with time series that have strong seasonal effects and several seasons of historical data.
Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.
Pros:
It is accurate and fast
It is fully automatic
It supports tunable forecasts.
The procedure makes use of a decomposable time series model with three main model
components: trend, seasonality, and holidays.
Similar to a generalized additive model, with time as the regressor, Prophet fits several linear
and non-linear functions of time as components.
Its simplest equation is:
y(t) = g(t) + s(t) + h(t) + e(t)
where:
g(t): the trend, modelling non-periodic changes (growth over time)
s(t): seasonality, representing periodic changes (weekly, monthly, yearly)
h(t): the effects of holidays (on potentially irregular schedules of one or more days)
e(t): the error term, covering idiosyncratic changes not accommodated by the model
(Robson, 2019)
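The decomposition above can be illustrated with a small least-squares sketch in NumPy (synthetic data; Prophet itself fits g, s and h with a more sophisticated procedure): a linear trend g(t) plus a weekly seasonal term s(t) is recovered from a noisy daily series.

```python
import numpy as np

# Synthetic daily series: y(t) = g(t) + s(t) + e(t), with a linear trend,
# a weekly (period-7) seasonality, and noise; h(t) is omitted for brevity.
rng = np.random.default_rng(0)
t = np.arange(200.0)
y = 0.5 * t + 3.0 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 0.2, t.size)

# Least squares on [1, t, sin(2*pi*t/7), cos(2*pi*t/7)] separates the
# non-periodic trend from the periodic weekly component.
X = np.column_stack([np.ones_like(t), t,
                     np.sin(2 * np.pi * t / 7), np.cos(2 * np.pi * t / 7)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
trend_slope, weekly_amp = beta[1], np.hypot(beta[2], beta[3])
```

The fit cleanly recovers both the trend slope and the weekly amplitude, which is the essence of the additive decomposition.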
Why are Prophet and VAR used in this project?
These methods are implemented in this project to forecast the future outcomes of
covid19 cases in Nepal. Both are useful for predicting time-series data; from them we can
find the future positive cases, deaths, and recovered cases of Nepal. Since both methods
are suited to time-series forecasting, they were selected for this project.
3.2 Data Visualization Features and Classification
Confronted with complicated, high-dimensional data, people often find that they do not
understand its implications. Statistical data visualization is acknowledged as an innovation
for rendering complicated data in its simplest form. Data visualization therefore has the
following characteristics: -
Graphical fidelity: Visualization effectively organizes the information.
Multidimensional: It draws on multiple aspects (characteristics, features) of the
information, but presents them in a one-dimensional manner in front of the people.
Visibility: The data is eventually presented in the form of graphs, diagrams and
patterns, and its correlations and interdependencies are envisioned. Through the
continuing evolution of technology, not only can single-scale information be
referenced, but vast-scale and higher-dimensional data and information are now being
interpreted and expressed seamlessly. (Li & Hou, 2017)
Graphical systems display a high degree of usability, fast and easy-to-use features, powerful
visualization abilities and valuable techniques for handling and utilizing data resources. Data
mining visualization methods and technologies, as a hybrid term, derive from the combination
of data mining techniques and visualization methods as a creative achievement in the
creation and exploitation of data resources.
Figure 13 Python Libraries used for visualizations (Dsouza & Velan , 2020)
Data mining may help to discover knowledge of interest in the field more easily or to draw
new conclusions. Data visualization technology can visualize and simplify complex
data, and make understanding the underlying patterns in the data simple for researchers
like us. (Keim, Qu, & Ma, 2013)
3.3 Research Methods and Techniques Applied
Programming Tools
A programming tool can be defined as a software program used for the development
process or for testing. The development tool best suited to this project is JupyterLab. The main
reason for selecting this IDE is that it is simple and user-friendly, which helps to write code
easily and also to see the output easily. The main advantage of using JupyterLab is that it
is a cross-platform editing tool: it works across different operating systems. It also has a
built-in package manager.
4.0 Methodology
4.1 Research Methodology
To answer the different research questions, specific methodologies have been used, covering
different datasets, data sources, modelling techniques and outcomes. The
overall variables covered in the study are shown in the figure: -
Covid19 in Nepal
VAR Prediction
Prophet Prediction
For the first research question, time series forecasting has been done using the Prophet
method and the VAR method. Similarly, for the second question, different analyses and
visualizations are done in order to see the relationships between the variables.
The workflow of ML is described below: -
The following steps are carried out during the machine learning calculations: -
1. Gathering data
The process of gathering data depends on the type of project we desire to make, for instance
an ML project that uses real-time data. The dataset can be collected
from various sources such as files and many other sources, but the collected data
cannot be used directly for performing the analysis, as there might be a lot of
missing data or extremely large values. To solve this, data preparation is done.
2. Data pre-processing
It is one of the most important steps in machine learning, and the one that most
helps in building machine learning models accurately. In machine learning
there is an 80/20 rule: every data scientist should spend 80% of their time on data
pre-processing and 20% on actually performing the analysis.
3. Researching the model that will be best for the type of data
Our main goal is to train the best-performing model possible using the pre-processed data.
Learning is of two types, supervised and unsupervised, which we have
discussed in an earlier section.
5. Evaluation
Model Evaluation is an integral part of the model development process. It helps to
find the best model that represents our data and how well the chosen model will work
in the future.
4.2 Software Development Methodology
According to the nature of the project, some of the methodologies that could be considered
could be Unified Software Development process (USDP) and Dynamic System Development
Method (DSDM).
Business modelling
Requirements
Analysis and Design
Implementation
Testing
Deployment
Elaboration
The project’s architecture and required resources are further evaluated.
Developers consider possible applications of the software and costs
associated with the development.
Construction
The project is developed and completed. The software is designed, written,
and tested.
Transition
The software or system is released to the public. Final adjustments or updates
are made based on feedback from end users.
USDP may deliver only certain modules of the system at the end of a
phase; a complete product is delivered only at the end, i.e. the Transition
phase, and never a usable product at the close of the earlier phases. Hence,
we would have to wait to see the final product, which is not feasible in
today's world.
2. Dynamic System Development Method (DSDM)
It is an agile software development approach that provides a framework
for building and maintaining systems. DSDM helps when requirements change
frequently and focuses on early delivery to provide real benefits to the business
(geeksforgeeks, 2019).
Principles:
Focus on the business need
Deliver on time
Collaborate
Never compromise quality
Build incrementally from firm foundations
Develop iteratively
Communicate continuously and clearly
Demonstrate control
Figure 17 DSDM Methodology (Iqra & Khan, 2018)
Those features that could enhance the software development process for a single
individual developer are considered and applied in this project.
5.0 Findings
5.1 Data Understanding
Summary of the dataset:
1. The dataset is in .csv format
2. The dataset is of size 4872 rows and 17 columns
3. Each column represents the relevant information related to Covid19 throughout the
pandemic
4. There were not any missing values in the dataset.
Dataset: covid19_Nepal_region

Column                     Description                                          Data type         Variable type
Date                       Shows the date                                       Non-null int64    Discrete
RegionCode                 Shows the region code                                Non-null int64    Discrete
RegionName                 Shows the country region name                        Non-null int64    Discrete
Latitude                   Shows the distance north or south of the equator     Non-null float64  Discrete
Longitude                  Shows the distance east or west of the equator       Non-null float64  Discrete
HospitalizedPatients       Shows the number of patients hospitalized            Non-null int64    Discrete
                           due to covid 19
IntensiveCarePatients      Shows the number of serious patients                 Non-null int64    Discrete
                           due to covid 19
TotalHospitalizedPatients  Shows the total number of patients                   Non-null int64    Discrete
HomeConfinement            Shows the number of people in home                   Non-null int64    Discrete
                           confinement in restricted locations
CurrentPositiveCases       Shows the current number of covid 19 cases           Non-null int64    Discrete
NewPositiveCases           Shows the number of new covid 19 cases               Non-null int64    Discrete
Recovered                  Shows the number of recovered covid 19 patients      Non-null int64    Discrete
Deaths                     Shows the number of covid 19 deaths                  Non-null int64    Discrete
TotalPositiveCases         Shows the total number of covid 19 cases             Non-null int64    Discrete
TestsPerformed             Shows the total number of tests carried out          Non-null float64  Discrete
                           in Nepal
Since there are no missing values in the dataset, all the data corresponding to each header are shown
above.
5.2 Data Manipulation
Data is made ready for further analysis in this phase. Some of the tasks that were carried out
in this phase are discussed below: -
5. Symptoms of Corona virus
From the research, it can be found that the main symptoms of the coronavirus are fever, dry
cough, fatigue, sputum production, shortness of breath, muscle pain and so on. People with a
high fever and dry cough have a high chance of having caught the coronavirus.
5.3 Exploratory Data Analysis
2. Description of Data
3. Grouping data according to Date
This shows the number of total cases, deaths, recoveries and hospitalized patients across
the regions of Nepal.
From this, we get to know that the most affected region of Nepal is Bagmati, and the least
affected region is Karnali.
6. Confirmed cases vs Region
From the chart, it can be stated that Bagmati has the highest number of patients
hospitalized, with none at Karnali.
8. Death vs Region
From the chart, it is seen that the death ratio of Bagmati is very high. Analysing the
confirmed, death and recovery cases of the different regions of Nepal, we can clearly say
that Bagmati is the region most affected by the coronavirus.
5.4 Machine Learning
There are altogether two algorithms used for this project. The two algorithms are VAR and
Prophet as we have discussed earlier in the machine learning portions.
1. VAR
First, we drop all the columns that are components of other columns.
E.g. Total Hospitalized Patients = Hospitalized Patients + Intensive Care Patients
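This step can be sketched with pandas (toy values; the column names follow the dataset description above): the derived total is checked against its components and then dropped.

```python
import pandas as pd

# Toy frame with the two component columns and their derived total.
df = pd.DataFrame({
    "HospitalizedPatients": [10, 12, 15],
    "IntensiveCarePatients": [2, 3, 3],
    "TotalHospitalizedPatients": [12, 15, 18],
})

# The total is the sum of its components, so it adds no information
# and is dropped before fitting the VAR.
assert (df["TotalHospitalizedPatients"]
        == df["HospitalizedPatients"] + df["IntensiveCarePatients"]).all()
df = df.drop(columns=["TotalHospitalizedPatients"])
```

Keeping such linearly dependent columns would make the VAR regressors collinear, so removing them is the safer choice.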
Then we plot the graph of date against recovered, hospitalized, tests performed, and home
confinement.
Checking for Causality
Granger causality is a concept of causality derived from the notion that causes cannot
occur after effects, and that if one variable is the cause of another, knowing the
state of the cause at an earlier point in time can enhance prediction of the effect at a
later point in time.
The test's null hypothesis is that the coefficients of the corresponding past values are zero;
that is, X does not cause Y. The p-values in the table are less than our significance
level (0.05), which implies that the null hypothesis can be rejected.
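The test behind these p-values can be sketched by hand with NumPy (simulated series, not the project's data): fit a restricted model (lags of Y only) and an unrestricted model (lags of Y and X) by OLS, and compare their residual sums of squares with an F statistic.

```python
import numpy as np

def rss(Z, y):
    """Residual sum of squares of an OLS fit of y on Z."""
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return resid @ resid

# Simulate x Granger-causing y through one lag.
rng = np.random.default_rng(0)
T = 500
x = rng.normal(size=T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.3 * y[t - 1] + 0.8 * x[t - 1] + rng.normal(0, 0.5)

p = 1  # lag order
ones = np.ones(T - p)
rss_r = rss(np.column_stack([ones, y[:-p]]), y[p:])          # lags of y only
rss_u = rss(np.column_stack([ones, y[:-p], x[:-p]]), y[p:])  # plus lags of x
F = ((rss_r - rss_u) / p) / (rss_u / (T - p - 3))

# A large F rejects the null that the X-lag coefficients are zero,
# i.e. that X does not Granger-cause Y.
```

Here the X lags sharply reduce the residual sum of squares, so F is far above the 5% critical value and the null is rejected, just as in the table above.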
Train-Test Split
Checking ADF 1st difference on column
Checking ADF on 2nd Difference
As we can see, after differencing twice, we have 2 stationary columns at the 5%
significance level, 1 stationary column at the 0.1% significance level, and 2
non-stationary columns (at any plausible significance level). This is not ideal;
however, because we're using "short" time series, I've decided to go on with only 2
differences and not to add more.
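The role of differencing can be seen on a deterministic toy example: a series with a quadratic trend is non-stationary, its first difference still trends, and only its second difference is constant, mirroring why two differences were taken here.

```python
import numpy as np

# Quadratic-trend series: clearly non-stationary.
t = np.arange(100.0)
series = 0.5 * t ** 2 + 2.0 * t + 10.0

d1 = np.diff(series)        # first difference: t + 2.5, still trending
d2 = np.diff(series, n=2)   # second difference: constant 1.0
```

On real data the differenced series still carries noise, which is why the ADF test is used to judge stationarity rather than inspection alone.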
Choosing the number of lags to insert into the model is a matter of trial and error,
and it can be changed according to the regression results (above), the Durbin-Watson
test results (explained in a moment), and other metrics (e.g., RMSE, MAE,
etc.).
d = 2 indicates no autocorrelation.
If d > 2, successive error terms are negatively correlated; in regressions, this can
imply an underestimation of the level of statistical significance.
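The Durbin-Watson statistic is simple to compute directly; the sketch below (synthetic residuals, not the model's actual residuals) shows d landing near 2 for uncorrelated errors and above 2 for negatively correlated ones.

```python
import numpy as np

def durbin_watson(resid):
    """d = sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(0)
n = 5000

# White-noise residuals: no autocorrelation, d should be close to 2.
e_white = rng.normal(size=n)

# AR(1) residuals with a negative coefficient: successive errors are
# negatively correlated, pushing d above 2.
e_neg = np.zeros(n)
w = rng.normal(size=n)
for t in range(1, n):
    e_neg[t] = -0.7 * e_neg[t - 1] + w[t]

d_white = durbin_watson(e_white)
d_neg = durbin_watson(e_neg)
```

Since d is approximately 2(1 − r1), where r1 is the lag-1 autocorrelation of the residuals, the negatively correlated series gives d near 3.4.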
Forecasting
Predicted vs Actual Visualization
Considering the length of our data, the results seem reasonable. It may be that the model's
predictions will improve as we get more up-to-date data to feed into it.
2. Prophet Algorithm
Modelling
Here, the number of periods is 15, which indicates that we are predicting for the next 15 days.
The model will estimate how many tests will be performed during that time.
Predicting
The prediction shows that there will be up to 8.41434 million tests performed across the
different regions of Nepal altogether. Based on the current scenario, the prediction shows the
results for the upcoming 15 days, i.e. up to 21 December 2021.
Showing the tests performed per week
As we can see from the graph, the highest numbers of tests are performed on Friday,
Saturday, and Sunday.
Modelling
Data Modelling for next 15 days to predict the number of confirmed cases in
Nepal according to the date.
Predicted
Graphical Representation of Predicted Confirmation
Predicting the number of confirmed cases till Dec 21, more than
12 million people will be affected by the coronavirus.
Modelling
Data Modelling for next 15 days to predict the number of recovered cases in
Nepal according to date.
Predicting
The predicted number of recoveries till Dec 21, 2021 is 814k.
Death Prediction
Data Modelling
Predicting
The current situation is not under control. The current prediction shows 15k
deaths till 21 Dec 2021, and the rest will be in isolation.
How the future looks
Future Ratios
Present Ratios:
6.0 Discussion
6.1 Result Summary
The report focuses on two things: exploratory data analysis and machine
learning. Exploratory data analysis is used to show the relationships between the different
variables of the dataset. It was found that the most affected region in Nepal is Bagmati. Data
are reported from each area of Nepal from April 19, 2021 to December 06, 2021.
These latest data were selected to predict the future outcome of cases with a high expected
influence on the newly detected infections. Data are plotted against the day they were
recognized. One probable scenario justifying the prevalence of the virus in Bagmati is that
COVID19 cases outside of China may have spread and remained undetected for a considerable
period, resulting in delayed countermeasures, and may have gone unnoticed in Nepal from
January. It is also known that Bagmati is the most populated area compared to the others. The
report also describes the machine learning process for VAR and Prophet. VAR shows the
correlation between the predicted and actual data variables. Prophet is used for forecasting
the time series; from it, the cases for the 15 upcoming days were predicted using the
latest recorded cases.
RQ 1. What are the short-term predictions for number of cases in Nepal for the next 2-3
weeks based on current situation?
By using prediction models built with machine learning methods such as VAR and Prophet,
we can predict the cases for a short period of time. For example, if we want the
prediction of confirmed cases for the next 15 days, we can simply design a model using
Prophet and forecast the values using the latest data. Similarly, we can do this for the death
rate, home confinement, and tests performed.
6.3 Research Contributions
The project helped the author understand how the different theories learnt were applicable
in the real world during the analysis. Various machine learning theories, techniques and
algorithms were studied during this project, which helped to forecast the future
values of different variables. During the research, it was found that the covid19 dataset of
Nepal can be analyzed and visualized using different libraries, and that various machine
learning algorithms can be used for future prediction of cases. For the dataset used in this
project, JupyterLab was used as the IDE, and libraries such as seaborn (sns), VAR, Prophet,
NumPy and pandas were used for data analysis and visualization.
The research conducted could be extremely helpful for Nepal. We describe and analyze
various versions of prediction and analysis in relation to the present status of COVID-19
research, to effectively promote scientific research that will help prevent and control the
epidemic and to remedy the lack of analysis of the current situation. This study will help
improve the standards for the prevention and control of major epidemics. There are seven
different aspects covered in this study, and two research questions have been answered
comprehensively. They relate to presenting the numbers of infected cases, recovered
cases and tests performed, and to predictions of the numbers of infected cases, recovered
cases and tests performed. The current study implemented various techniques to present the
data analysis, and the results are in sync with the few limited studies available in the literature.
The results could be useful contributions to health policy divisions or government interventions.
This study will be useful for the Government of Nepal and various states or regions of Nepal.
This study will also be favourable for administrative units of other countries to consider
various aspects related to the control of COVID19 outspread in their respective regions.
6.4 Limitations
First and foremost, lack of data is the major drawback preventing the system from analysing
effectively. Collecting data from different official sites takes considerable time; hence it is a
continuous process of improvement. Moreover, the data available to us may not capture all
the factors relevant to the prediction of covid19.
Similarly, the project is large in scale: it is difficult for an individual to research and collect
all the data of Nepal, including confirmed cases, recovered cases, and
so on from its different areas and cities. Hence, there must be a team to continue this
project in real time. Due to the lack of updated data, the performance and classification of the
algorithm might degrade. A platform could therefore be created where hospital-sector staff
update the details of COVID19 information; if this were done, the collective effort of a large
number of people could help gather information in a short period of time.
Although we have predicted the cases for 15 days, it is not guaranteed that exactly these
cases will occur in that time; there are other factors that help to increase or decrease the
cases, and we have made the prediction based only on the given data. The scope of the
project was limited mostly to research, with covid19 data analysis and prediction of
future outcomes as the artefact.
Since this is an academic project, the entire report had to be finished by a particular deadline,
which limited the scope. However, the scope of the project can be extended by
incorporating new studies, other algorithms and so on.
7.0 Conclusion
This study investigates how new COVID-19 cases can be predicted while considering
the historical data of COVID-19 cases alongside the external factors that affect the
spread of the virus. To do so, data analytics was adopted by using various python
libraries. The effectiveness and superiority of the developed algorithm are
demonstrated by conducting experiments using covid19 data collected from the
different regions of Nepal. The results show improved accuracy compared with the
existing methods. Moreover, the experiments are extended to make future prediction
of the affected COVID-19 cases during the period from March 2020 until December
2020. The predicted COVID-19 cases help in providing some recommendations for
both the government and people of the affected countries. This study provides a novel
way for predicting the number of COVID-19 cases. However, there are some venues
that might be suitable for future directions. For example, predicting the number of
deaths could be one direction. Another direction might be predicting the number of
recovered people. One of the fruitful ideas is predicting the number of COVID-19
cases in the top affected cities, while considering the seasonality factor.
AI has the potential to be a tool in the fight against COVID19 and similar pandemics.
Clearly, data is central to whether AI will be an effective tool against future epidemics
and pandemics. The current study implemented various techniques to present the data
and analysis, and the results are in sync with the few limited studies available in the
literature. This study will be useful for the Government of Nepal and various regions of
Nepal. This study will also be favourable for the administrative units of other
countries to consider various aspects related to the control of COVID19 outspread in
their respective regions.
8.0 References
A, K., S, D. A., Z, W. A., & Kousa. (2020). Healthcare Providers on the Frontline.
Agile Business Consortium Limited. (2021). What is DSDM? Retrieved from www.agilebusiness.org:
https://www.agilebusiness.org/page/whatisdsdm
Alsunaidi, S., & Ibrahim , N. (2021). Applications of Big Data Analytics to Control. Sensors.
Beecham, R. (n.d.). On the use of ‘glyphmaps’ for analysing the scale and temporal spread of Covid-19
reported cases. Retrieved from https://www.roger-beecham.com/: https://www.roger-
beecham.com/covid
Boudrioua, M. s., & Abderrahmane , B. (2021). Predicting the COVID-19 epidemic in Algeria using the
SIR model.
Dsouza, J., & Velan , S. s. (2020). Using Exploratory Data Analysis for Generating Inferences on the
Correlation of COVID-19. 2020 11th International Conference on Computing, Communication
and Networking Technologies (ICCCNT).
G, A., & L, T. (2021, dec 15). Treinish. Retrieved from An Extended Data-Flow Architecture for Data
Analysis and Visualization: https://sci-hub.do/10.1109/VISUAL.1995.480821
Geeks for geeks. (2021, dec 16). Geeks for geeks. Retrieved from Geeks for geeks:
https://www.geeksforgeeks.org/supervised-unsupervised-learning/
Guru99. (2011). unsupervised Machine Learning: What is, Algorithms, Example. Retrieved from
unsupervised Machine Learning: What is, Algorithms, Example:
https://www.guru99.com/unsupervised-machine
Holy python. (2021). holypython.com. Retrieved from K-Means Pros & Cons: https://holypython.com/k-
means/k-means-pros-cons/
Hui, D., & I , A. (2020). The continuing 2019-nCoV epidemic threat of novel coronaviruses to global
health — The latest 2019 novel coronavirus outbreak in Wuhan, China. International Journal of
Infectious Diseases, 91, 264-266.
Iqra, Z., & Khan, N. A. (2018). The Impact of Agile Methodology (DSDM) on Software Project.
Retrieved from www.semanticscholar.org: https://www.semanticscholar.org/paper/The-Impact-of-
Agile-Methodology-(-DSDM-)-on-Project-Zafar-Nazir/
843733664dc56367e0c61a6a854a84b844798c45
Keim, D., Qu, H., & Ma, K.-L. (2013). Big-data visualization. IEEE Computer Graphics and Applications,
20-21.
Li, Y., & Hou, S. (2017). Methods and Techniques in Data Visualization Model. In: S. Peng, R. Hao & S.
Pal, eds. . First International Conference on Mathematical Modeling and Computational Science
(pp. 71-74). China, India: Springer Nature.
Liu, Y.-C., Kuo, R.-L., & Shih, S.-R. (2020). The first documented coronavirus pandemic in history.
Biomedical Journal.
Mudr, A. K. (2020). The Use of Data Analytics During the COVID-19 Pandemic. Signant Health.
Nepal Government. (2020, Jan 3). Retrieved from Health and Population Ministry:
https://covid19.mohp.gov.np/
Prabhakaran, S. (2019, july 7). Vector Autoregression (VAR) – Comprehensive Guide with Examples in
Python. Retrieved from machine learning +:
https://www.machinelearningplus.com/time-series/vector-autoregression-examples-python/
Rad, R. E., Mohseni, S., Takhti, H. K., Azad, M. H., Shahabi, N., Aghamolaei, T., & Norozian, F. (n.d.).
Application of the protection motivation theory for predicting COVID-19 preventive behaviors in
Hormozgan, Iran: a crosssectional study. BMC Public Health, 3.
Robson, W. (2019, june 17). The Math of Prophet. Retrieved from medium: https://medium.com/future-
vision/the-math-of-prophet-46864fa9c55a
Stock, J. H., & Watson, M. W. (2015). Introduction to Econometrics. Pearson Education Limited.
Tantrakarnapa, K., Bhopdhornangkul, B., & Nakhaapakorn, K. (2020). Influencing factors of COVID-19
spreading: a case study of Thailand. 4-8.