
School of Computing and Creative Media

XBIS 2023 Data Science

Assignment Report

Name: 1) Esmond Liaw Chu Siang


2) Nicholas Tan Hau Shen

Student ID: 1) 0117040


2) 0125360

Course: Bachelor of Information System (EIS)


1.0 Introduction
An epidemic disease is an infectious disease that spreads rapidly to a large number of people in
a given population within a short period of time. Epidemic diseases have always represented
challenging problems to address and cannot be ignored in human history. The worst epidemic
in modern history was the Spanish flu of 1918, which killed more than fifteen million people.
Nowadays, as the world becomes more interconnected, epidemics have the potential to
spread faster. On February 11, 2020, the World Health Organization announced the official
name for the disease causing the 2019 novel coronavirus outbreak, first identified in
Wuhan, China. The infection caused by the novel coronavirus now affects about
118 countries, raising widespread fear and increasing anxiety in individuals
subjected to the threat of the virus. With the development of science and technology and the
continuous efforts of scientific research institutions, relevant data on epidemics and
viruses have accumulated at an increasing rate. However, much of this data has not been
analyzed to extract new knowledge and value. There are three main challenges
associated with these enormous amounts of data: first, gaps exist between researchers in
different fields, and different approaches and methods make it difficult to understand the
problem in depth; second, epidemiological data is so vast that we need a method to extract valuable
data, remove irrelevant data, and guide potential applications in a targeted manner; and third, the
field still lacks efficient methods and models for data utilization and
application in practice, as well as the corresponding tools. The exploding amount of data
in many applications makes it crucial that the advancements in Big Data research, Deep
Learning, Data Analytics, and Data Science find their way from the research labs to practical
applications, and that these research results can be successfully integrated into drug
screening, crowd disease prevention and control, trend prediction, epidemic surveillance and
other fields.

2.0 Hypothesis
COVID-19 data from the outbreak that originated in Wuhan, China can be analyzed and used to
build models that predict the progression of the epidemic as accurately as possible.

3.0 Research Question


How can Big Data be used for trend prediction of epidemic diseases? Specifically, our research
question is whether an epidemic crisis such as COVID-19 can be predicted from the available data.

4.0 Methodology
4.1 Programming Language
4.1.1 Python
Python is a widely used general-purpose, high-level programming language. It
was initially designed by Guido van Rossum, first released in 1991, and is developed by the Python
Software Foundation. It was designed mainly with an emphasis on code readability,
and its syntax allows programmers to express concepts in fewer lines of code.
4.1.2 Why use Python?
Easy to use and consistent
Python is a high-level, interpreted, general-purpose
dynamic programming language that focuses on code readability. Python's
syntax lets programmers express the same logic in fewer lines of code than
Java or C++. Python is widely used in larger
organizations because it supports multiple programming paradigms.

Extensive selection of libraries and frameworks


Implementing AI and ML algorithms can be tricky and requires a lot of
time. It’s vital to have a well-structured and well-tested environment to
enable developers to come up with the best coding solutions.
To reduce development time, programmers turn to a number of Python
frameworks and libraries. A software library is pre-written code that
developers use to solve common programming tasks. Python, with its rich
technology stack, has an extensive set of libraries for artificial intelligence
and machine learning. For example, Keras, TensorFlow, and Scikit-learn
for machine learning. Scikit-learn features various classification,
regression, and clustering algorithms, including support vector machines,
random forests, gradient boosting, k-means, and DBSCAN, and is
designed to work with the Python numerical and scientific libraries
NumPy and SciPy.

Great community and popularity


In the Developer Survey 2018 by Stack Overflow, Python was among the
top 10 most popular programming languages, which ultimately means that
you can find and hire a development company with the necessary skill set
to build your AI-based project. If you look closely at figure 1 below,
you’ll see that Python is the language that people Google more than any
other.
Figure 1

5.0 Libraries
5.1 Pandas
Pandas is an open-source Python library providing high-performance data manipulation
and analysis tools using its powerful data structures. The name Pandas is derived from
"panel data", an econometrics term for multidimensional structured data sets. In 2008, developer Wes
McKinney started developing Pandas when he needed a high-performance, flexible tool for
data analysis. Prior to Pandas, Python was mainly used for data munging and
preparation; it contributed very little to data analysis itself. Pandas solved this
problem. Using Pandas, we can accomplish five typical steps in the processing and
analysis of data, regardless of the origin of the data: load, prepare, manipulate, model, and
analyze. Python with Pandas is used in a wide range of academic and
commercial domains, including finance, economics, statistics, and analytics.
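As a brief illustration, a minimal sketch of loading and inspecting a dataset with Pandas is shown below. The file name covid_19_data.csv is an assumption for illustration, not necessarily the exact file used in this report.

```python
import pandas as pd

# Load the COVID-19 dataset (file name assumed for illustration)
df = pd.read_csv("covid_19_data.csv")

# Inspect the first rows and count the null values in each column,
# the same kind of check shown before plotting in Figure 3
print(df.head())
print(df.isnull().sum())
```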
5.2 Numpy
NumPy is a Python package. It stands for 'Numerical Python'. It is a library consisting of
multidimensional array objects and a collection of routines for processing arrays.
Numeric, the ancestor of NumPy, was developed by Jim Hugunin. Another package,
Numarray, was also developed with some additional functionalities. In 2005, Travis
Oliphant created the NumPy package by incorporating the features of Numarray into the
Numeric package. There are many contributors to this open-source project. Using
NumPy, a developer can perform mathematical and logical operations on arrays, Fourier
transforms, shape manipulation, and operations related to linear algebra.
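A minimal sketch of the kind of array operations described above (the numbers are illustrative only):

```python
import numpy as np

# Daily confirmed cases as a NumPy array (illustrative numbers only)
cases = np.array([100, 150, 230, 310, 405])

# Element-wise arithmetic and simple aggregates
daily_growth = np.diff(cases)   # new cases added each day
print(daily_growth, cases.mean(), cases.max())
```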
5.3 Seaborn
Seaborn is a library for making statistical graphics in Python. It is built on top
of Matplotlib and closely integrated with Pandas data structures. It provides a dataset-oriented API
for examining relationships between multiple variables, specialized support for using
categorical variables to show observations or aggregate statistics, options for
visualizing univariate or bivariate distributions and for comparing them between subsets
of data, automatic estimation and plotting of linear regression models for different
kinds of dependent variables, and convenient views onto the overall structure of complex
datasets.
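A minimal sketch of Seaborn's automatic regression plotting (the file name and the column names Confirmed and Deaths are assumptions for illustration):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("covid_19_data.csv")   # file name assumed

# Scatter plot with an automatically fitted and plotted regression line
sns.regplot(x="Confirmed", y="Deaths", data=df)
plt.show()
```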
5.4 Matplotlib
Matplotlib is a powerful visualization library in Python for 2D plots of arrays. Matplotlib
is a multi-platform data visualization library built on NumPy arrays and designed to work
with the broader SciPy stack. It was introduced by John Hunter in 2002. One of
the greatest benefits of visualization is that it gives us visual access to huge amounts of
data in easily digestible form. Matplotlib offers several plot types such as line, bar, scatter,
and histogram.
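A minimal sketch of a Matplotlib scatter plot of the kind used later in this report (file and column names are assumptions for illustration):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("covid_19_data.csv")   # file name assumed

# Simple scatter plot of confirmed cases against total deaths
plt.scatter(df["Confirmed"], df["Deaths"])
plt.xlabel("Confirmed Case")
plt.ylabel("Total Deaths")
plt.show()
```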
5.5 Statsmodels
Statsmodels is a Python module that provides classes and functions for the estimation of
many different statistical models, as well as for conducting statistical tests and exploring
statistical data. An extensive list of result statistics is available for each estimator. The
results are tested against existing statistical packages to ensure that they are correct.
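A minimal sketch of fitting an ordinary least squares (OLS) model with statsmodels; the data here is illustrative only, and the real application to our dataset is described in Section 7.0:

```python
import numpy as np
import statsmodels.api as sm

# Illustrative data: y is roughly linear in x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Add a constant column so an intercept is estimated
X = sm.add_constant(x)
results = sm.OLS(y, X).fit()

# Full table of result statistics for the fitted model
print(results.summary())
```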
5.6 Linear Regression
Linear regression is a basic predictive analytics technique that uses historical data to
predict an output variable. It is popular for predictive modelling because it is easily
understood and can be explained using plain English.
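In its general form, the model fitted throughout this report is Y = b0 + b1X, where b0 is the intercept and b1 is the regression coefficient estimated from the historical data; the equation Y = 109.007399 + 0.046859X reported in Figure 10 is one instance of this form.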
5.7 R2-Score
The coefficient of determination is the proportion of the variance in the dependent variable
that is predictable from the independent variables. It is used in the context of statistical
models whose main purpose is either the prediction of future outcomes or the testing
of hypotheses, on the basis of other related information. It provides a measure of how well
observed outcomes are replicated by the model, based on the proportion of total variation of
outcomes explained by the model.
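Concretely, the coefficient of determination can be written as R² = 1 − SSres / SStot, where SSres is the sum of squared residuals of the model and SStot is the total sum of squares of the observed outcomes. A minimal sketch of computing it in Python with scikit-learn (the numbers are illustrative only):

```python
from sklearn.metrics import r2_score

# y_true: observed outcomes, y_pred: outcomes predicted by the model
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.8, 5.1, 7.3, 8.9]

print(r2_score(y_true, y_pred))  # closer to 1 means more variance is explained
```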
6.0 Train and Test Data for COVID-19
As our first model, we use train and test data to predict the outcome of COVID-19.
Train/test is a method to measure the accuracy of our model. We split the data into two sets:
a training set and a test set. We use 80% of the data for training and 20% for testing.

Figure 2: We start by importing all the libraries we need.

Figure 3: Showing the null values of our data.

Figure 4: Plotting X against Y from our dataset.

Figure 5: Showing the method of predicting with train/test data. The last line of code shows
an MSE of 0.17695; a smaller MSE is better, since it implies closer agreement between the
prediction and reality. A smaller value of MSE generally indicates a better estimate.

Figure 2
Figure 3

Figure 4
Figure 5
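A minimal sketch of the train/test procedure described above; the file name covid_19_data.csv and the column names Confirmed and Deaths are assumptions for illustration, not necessarily the exact ones used in Figures 2 to 5:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

df = pd.read_csv("covid_19_data.csv")   # file name assumed
X = df[["Confirmed"]]                   # column names assumed
y = df["Deaths"]

# 80% of the rows for training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Mean squared error on the held-out test set
print(mean_squared_error(y_test, y_pred))
```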

7.0 Confirmed Case and Total Death prediction


We are using the confirmed cases and total deaths from our dataset to predict the outcome
of COVID-19. The reason we used confirmed cases and total deaths is that we wanted to
predict whether the number of confirmed cases affects the number of total deaths in the data. For
example, if there is a strong relationship between these two variables, our prediction will be more reliable.

Figure 6: The libraries we included for our prediction.

Figure 8: Showing the X-axis (Confirmed Case) and Y-axis (Total Deaths).

Figure 10: Showing the linear regression line of X and Y. The linear model is Y =
109.007399 + 0.046859X.

Figure 11: We have to import statsmodels.api as sm in order to get the results. Looking at the
R-squared result, 86% of the variation in total deaths is explained by the confirmed cases. If we add
in more factors, such as comparing confirmed cases with recovered cases, the R-squared result will
be different. When a P>|t| value is close to 0, the relationship between confirmed
cases and total deaths is statistically significant. As the reported value is 0.733, the result is not
significant because it is greater than the usual 0.05 threshold; we therefore fail to reject the null
hypothesis and cannot accept the alternative hypothesis for this coefficient. The R-squared value of
86% indicates that the regression model fits the data well, so we can consider that X and
Y have a strong relationship in the COVID-19 data.

Figure 6

Figure 7
Figure 8

Figure 9
Figure 10

Figure 11
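A minimal sketch of the OLS approach behind Figures 6 to 11; the file name and the column names Confirmed and Deaths are assumptions for illustration:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("covid_19_data.csv")   # file name assumed

X = df["Confirmed"]                     # column names assumed
y = df["Deaths"]

# Add a constant so the model estimates an intercept (as in Figure 10)
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()

# The summary table reports R-squared, the coefficients and the P>|t| values
print(model.summary())
```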
8.0 Confirmed Case with Recovered Case
We used confirmed cases and recovered cases for the COVID-19 prediction. The result can be
compared with the confirmed cases and total deaths model, so we can determine which of these two
relationships is influenced most strongly by COVID-19.
Figure 12: Libraries that we include for the prediction.
Figure 14: The prediction graph of confirmed cases and recovered cases.
Figure 16: We add the linear regression line to the model, which generates the straight blue line
in the graph. The linear model is Y = 109.0073993 + 0.046859X. We can compare the graphs in
Figure 16 and Figure 10 shown above; the linear regression lines are different.
Figure 17: Looking at the R-squared result, 82% of the variation in recovered cases is explained by
the confirmed cases. Looking at the P>|t| result, the value is close to 0 and less than 0.05, so we can
reject the null hypothesis in favour of the alternative hypothesis. There is a strong relationship
between X (Confirmed Case) and Y (Recovered Case). The R-squared value of 82% indicates that the
model fits the regression well, so we can consider that X and Y have a strong relationship. R-squared
reflects the fit of the model; values range from 0 to 1, and a higher value generally indicates a better
fit. A p-value below 0.05 is considered statistically significant, and the value shown in Figure 17
is 0.011.

Figure 12
Figure 13

Figure 14
Figure 15

Figure 16
Figure 17
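A minimal sketch of the corresponding model with recovered cases as the dependent variable; the file and column names are again assumptions for illustration:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("covid_19_data.csv")     # file name assumed

# Recovered cases as the dependent variable this time (column names assumed)
X = sm.add_constant(df["Confirmed"])
y = df["Recovered"]

print(sm.OLS(y, X).fit().summary())
```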
9.0 Different method to get Confirmed Case and Total Death prediction
We are using the same confirmed cases and total deaths from our dataset to predict the
outcome of COVID-19 through a different method of obtaining the linear regression line and
its coefficients.

Figure 18: The libraries we included for our prediction and to show the values in the dataset.

Figure 19: It shows the prediction graph with the X (Confirmed Case) axis and Y (Total
Deaths) axis together with the linear regression line.

Figure 20: Then we import the necessary libraries to get the intercept of the linear
regression line, which is 60.69577885, and the value of our regression coefficient,
which is 0.04628304.

The output generated with this method is the same as with the previous method; however, for
comparison purposes the previous method is better, as it generates an OLS Regression Results
table that provides more detailed information for the prediction.

Figure 18
Figure 19

Figure 20
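A minimal sketch of this alternative scikit-learn method (Figures 18 to 20); the file and column names are assumptions for illustration:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

df = pd.read_csv("covid_19_data.csv")   # file name assumed
X = df[["Confirmed"]]                   # column names assumed
y = df["Deaths"]

model = LinearRegression().fit(X, y)

# Intercept and regression coefficient, as reported in Figure 20
print(model.intercept_, model.coef_)

# Scatter plot with the fitted regression line, as in Figure 19
plt.scatter(X, y)
plt.plot(X, model.predict(X), color="red")
plt.xlabel("Confirmed Case")
plt.ylabel("Total Deaths")
plt.show()
```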

10.0 New Cases with New Deaths & New Deaths with New Recovered
We used both new cases with new deaths and new deaths with new recovered cases to predict
COVID-19. As a result, we can determine which of these two related pairs of cases is influenced
most strongly by COVID-19.
Figure 21: The libraries we included for our prediction and to show the values in the dataset.
Figure 22: It shows the prediction graph with the X (New Cases) axis and Y (New Deaths) axis
together with the linear regression line.
Figure 23: It shows the prediction graph with the X (New Deaths) axis and Y (New Recovered) axis
together with the linear regression line.
Figure 24: Then we import the necessary libraries to get the intercept of the linear regression
line, which is 1.14704734, and the value of our regression coefficient, which is 0.03171409, for
the first prediction (new cases with new deaths). The regression coefficient is statistically
significant because its value is less than the usual significance level. For new deaths with new
recovered cases, the intercept of the linear regression line is 95.21543898 and the value of our
regression coefficient is 0.57106593. This regression coefficient is not statistically significant
because its value is greater than the usual significance level.

Figure 21
Figure 22
Figure 23
Figure 24
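A minimal sketch of the two models described for Figures 21 to 24; the file name and the column names New cases, New deaths, and New recovered are assumptions for illustration:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("covid_19_data.csv")   # file name assumed

# Model 1: new deaths predicted from new cases (column names assumed)
m1 = LinearRegression().fit(df[["New cases"]], df["New deaths"])
print(m1.intercept_, m1.coef_)

# Model 2: new recovered predicted from new deaths (column names assumed)
m2 = LinearRegression().fit(df[["New deaths"]], df["New recovered"])
print(m2.intercept_, m2.coef_)
```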
