Cav Mini Project
ON
COVID-19 ANALYSIS AND VISUALIZATION
By
Sakshi Chauhan (2001320100117)
Yash Dixit (2001320100168)
Utkarsh Poswal (2001320100155)
Shalini (2001320100127)
We hereby declare that the Project entitled “COVID-19 ANALYSIS AND VISUALIZATION”,
submitted to the Department of Computer Science and Engineering, G.N.I.O.T., Greater Noida,
U.P., in partial fulfilment of the requirements for the award of the degree of BACHELOR OF
TECHNOLOGY in the session 2020-2024, is an authentic record of our own work carried out
under the guidance of Mr. RAVIN KUMAR, and that the project has not previously formed the
basis for the award of any other degree or diploma.
Signature of student:
This is to certify that the above statement made by the candidates is correct to the
best of my knowledge.
Signature of Guide:
We are very thankful to the Director of GNIOT College for his constant support and
encouragement throughout this project.
We are also thankful to the Head of the Computer Science Department, Dr. SANDEEP SAXENA,
for his guidance and support during this project.
We are especially grateful to our Project Guide, Mr. RAVIN KUMAR, for giving his valuable
time and constructive guidance in preparing this Project.
It would not have been possible to complete this work in such a short period of time
without his kind encouragement and valuable guidance.
S NO. TITLE
1. Certificate
2. Acknowledgement
3. Introduction
4. Requirements
5. Abstract
6. Data Science Introduction
7. Python Programming
8. Libraries in Python
9. Various learning Algorithms
10. Data Collection
11. Methodology
12. Data Analysis
13. Power BI
14. Result
15. Conclusion
16. Bibliography
INTRODUCTION
The COVID-19 pandemic, also known as the coronavirus pandemic, is an ongoing global
pandemic of coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome
coronavirus 2 (SARS-CoV-2). Most people who fall sick with COVID-19 will experience mild
to moderate symptoms and recover without special treatment. However, some will become
seriously ill and require medical attention.
The best way to prevent and slow down transmission is to be well informed about the disease
and how the virus spreads. So we analyzed and visualized COVID-19 data: if another disease
emerges in the future, we can find out which regions are infected the most, which are
affected less, and where treatment needs to start as soon as possible, without neglecting
the regions with fewer cases, since we visualize that data too. We took our data from the
WHO site and other sources, analyzed it, and then visualized it. To visualize and analyze
the data, we created a dashboard through Power BI.
So now you may ask: what is Power BI?
Power BI is a collection of software services, apps, and connectors that work together to turn
your unrelated sources of data into coherent, visually immersive, and interactive insights. Your
data may be an Excel spreadsheet, or a collection of cloud-based and on-premises hybrid data
warehouses. Through Power BI we covered the following measures:
• Sum of Recovered
• Sum of Confirmed
• Sum of Active
• Sum of Deaths
First of all, we import our data with the help of libraries (a library is a collection of
precompiled code that can be used later in a program for specific, well-defined operations);
we use the NumPy, Pandas, Matplotlib, Seaborn and sklearn libraries in our analysis.
After that we removed all the null values, which could otherwise affect our output and the
later analysis in Power BI. The data contains many null values, which are of no use, so we
cleaned the data: removing duplicate, irrelevant and unwanted observations, fixing
structural errors, filtering unwanted outliers and handling missing data. After data
cleaning we use EDA. So what is EDA? Exploratory Data Analysis (EDA) is an approach to
analyzing data using visual techniques. It is used to discover trends and patterns, or to
check assumptions, with the help of statistical summaries and graphical representations.
Through this we covered many objectives and found the region-wise records to analyze and
visualize.
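As a rough sketch of the cleaning step described above (the column names and values here are made up for illustration; the real data came from the WHO site and other sources):

```python
import pandas as pd

# Made-up stand-in for the real COVID-19 records
df = pd.DataFrame({
    'Region': ['Asia', 'Asia', 'Europe', None, 'Europe'],
    'Confirmed': [100, 100, 250, 80, None],
})

# Remove duplicate observations, then drop rows with null values
df = df.drop_duplicates()
df = df.dropna()
df = df.reset_index(drop=True)
print(df)
```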
Through this dashboard, we aim to provide a frequently updated data visualization, data
dissemination and data exploration resource, while linking users to other useful and informative
resources.
REQUIREMENTS
In this project, we are going to work with the COVID19 data that we found from different sites,
which consists of the data related to the cumulative number of confirmed cases, active cases,
recovered cases and death cases per day, in each Country/Region.
• Importing COVID-19 data and preparing it for the analysis by dropping columns and
aggregating rows.
• Deciding on and calculating a good measure for our analysis, and preprocessing the data.
• Removing superfluous columns.
• Grouping where needed.
• Renaming the columns.
• Finding correlations in our data.
• Visualizing our analysis results using Seaborn.
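The steps above can be sketched as follows (the column names and values are assumptions for illustration, not the project's actual dataset):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical per-day records; the real data has one row per Country/Region per day
df = pd.DataFrame({
    'Country/Region': ['India', 'India', 'Italy', 'Italy'],
    'Lat': [20.6, 20.6, 41.9, 41.9],  # superfluous for this analysis
    'Confirmed': [100, 150, 200, 260],
    'Recovered': [10, 30, 50, 90],
})

# Remove superfluous columns, group rows per country and rename the columns
df = df.drop(columns=['Lat'])
grouped = df.groupby('Country/Region').max()
grouped = grouped.rename(columns={'Confirmed': 'max_confirmed',
                                  'Recovered': 'max_recovered'})

# Find correlations and visualize them with Seaborn
corr = grouped.corr()
sns.heatmap(corr, annot=True)
plt.savefig('correlations.png')
```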
This will help us in the future: if another disease emerges, we can find out which regions
are infected the most, which are affected less, and where treatment needs to start as soon
as possible, without neglecting regions with fewer cases, since we visualize that data too.
Through this we have covered a lot of ground.
Our main objective is to visualize the confirmed, active, recovered and death cases for
future work.
DATA SCIENCE INTRODUCTION
Data Science
Data science is the field of data analytics and data visualization in which raw
or unstructured data is cleaned and made ready for analysis. Data scientists use
this data to extract the information required for future use. “Data science uses
many processes and methods on big data; the data may be structured or
unstructured”. The data frames available on the internet are the raw data we get,
either in unstructured or semi-structured format. This data is filtered, cleaned,
and then the required tasks are performed for the analysis using a high-level
programming language. The data is then analysed and presented for better
understanding and evaluation.
One must be clear that data science is not about making complicated models,
making awesome visualizations or writing code; it is about using data to create
an impact for your company, and for this impact we need tools like complicated
data models and data visualizations.
Stages of Data Science
There are many tools used to handle the big data available to us.
“Data scientists use programming tools such as Python, R, SAS, Java,
Perl, and C/C++ to extract knowledge from prepared data”.
Data scientists use many algorithms and mathematical models on
the data.
Following are the stages and their cycle performed on the
unstructured data.
• Statistical analysis
• Implementation, development
Data science finds application in many fields. With the assistance of data
science, it is easy to serve a search query on a search engine in very little
time. The role of the data scientist is to have a deep understanding of the data
as well as a good command of a programming language; he should also know how to
work with the raw data extracted from the data source. Many programming languages
and tools are used to analyze and evaluate the data, such as Python, Java,
MATLAB, Scala, Julia, R, SQL and TensorFlow, among which Python is the most
user-friendly and most widely used programming language in the field of data
science.
This life cycle applies in every field; in this project we will be considering
all seven stages of data science to analyze the data. The process starts from
data collection and moves through data preparation and data modeling to data
evaluation. For instance, as we have a huge amount of data, we can create an
energy model for a particular country by collecting its previous energy data,
and we can also predict its future requirements from the same data.
PYTHON PROGRAMMING
Python is the most used and the easiest among all programming languages, due
to the following reasons.
Data structures in Python
Data structures are ways of storing data so that we can easily perform different
operations on it whenever required. When data has been collected from the data
source, it is available in different forms; once it is sorted into different
data structures, it is easier for data scientists to perform different
operations on it. Data structures are mainly classified into two categories,
and then further into the subcategories shown below.
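The figure with the subcategories is not reproduced here; as a minimal sketch, the built-in Python structures most often used in analysis are:

```python
# List: ordered and mutable
cases = [100, 250, 80]
cases.append(120)

# Tuple: ordered and immutable
point = (28.6, 77.2)

# Dictionary: key-value mapping
totals = {'India': 450, 'Italy': 300}
totals['Spain'] = 210

# Set: unordered collection of unique items
regions = {'Asia', 'Europe', 'Europe'}

print(len(cases), len(totals), len(regions))  # 4 3 2
```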
Condition statements
If else statements
“The most common type of statement is the if statement. An if statement consists
of a block called a clause”: the block after the if statement is executed if the
condition is True and skipped if the condition is False, in which case the
statements in the else part run instead. An if statement consists of the
following:
• the if keyword itself
• a condition
• a colon
• an indented clause
The figure below shows how if and else statements are used, with a description
inside it.
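Since the figure is not reproduced here, a minimal sketch of an if/else statement (the variable names are made up):

```python
new_cases = 120
threshold = 100

if new_cases > threshold:   # if keyword, condition, colon
    status = 'rising'       # clause: runs only when the condition is True
else:
    status = 'stable'       # runs when the condition is False

print(status)  # rising
```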
Elif statements
In an if/elif chain only one clause is executed; there are many cases in which
only one possibility should run. “The elif statement is an ‘else if’ statement
that always follows an if or another elif statement”. The elif statement
provides another condition that is checked only if all of the previous
conditions were False. The only difference between elif and else is that an
elif statement has a condition, whereas an else statement does not. In code, an
elif statement consists of the elif keyword, a condition, a colon and an
indented clause.
Range statement - The range() function is used with for loop statements, and
you can specify a single value: for example, if you specify 10, the loop
variable starts at 0 and ends at 9, i.e. n-1. You can also specify the start
and end values. The following examples demonstrate loop statements.
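A small sketch combining both constructs (the values are made up):

```python
# elif: only the first branch whose condition is True runs
active_cases = 0
if active_cases > 0:
    label = 'active outbreak'
elif active_cases == 0:
    label = 'no active cases'
else:
    label = 'invalid data'

# range(10) yields 0..9 (n-1); range(2, 6) yields 2..5
total = 0
for day in range(10):
    total += day

print(label, total)  # no active cases 45
```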
While loop
While loops repeat a section of code, but not in the same way as a for loop: a
while loop does not run n times, but until a defined condition is no longer met.
If the condition is initially False, the loop body is not executed at all.
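A minimal sketch of a while loop (the halving model is made up for illustration):

```python
cases = 1000
days = 0

# Repeat until the condition is no longer met
while cases > 100:
    cases = cases // 2  # cases halve each day in this toy model
    days += 1

print(days, cases)  # 4 62
```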
Module, Package and Functions
• Module
Modules are Python files with the extension .py; the name of the module is the
name of the file. A Python module can define and implement a set of functions,
classes or variables. The reason for using modules is that they organize your
Python code into groups, making it easier to use.
• Package
A package is a collection of modules together with a file named __init__.py.
Any directory on the Python path that contains a file named __init__.py is
treated as a package by Python. Packages are used to organize modules using
dotted names.
For example, if we have a package named simple_package consisting of two
modules a and b, we can import the modules from the package as follows:
from simple_package import a, b
• Functions
• User-defined functions - These are functions the user defines, starting with
the keyword def. In the example below we define a function named temperature
and the task it performs when called.
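The referenced example is not reproduced here; a minimal sketch of a user-defined function (only the name temperature comes from the text; the body is an assumption):

```python
def temperature(celsius):
    """Convert a Celsius reading to Fahrenheit."""
    return celsius * 9 / 5 + 32

print(temperature(100))  # 212.0
```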
LIBRARIES IN PYTHON
The Python library is vast. There are built-in functions in the library that
are written in the C language; the library provides access to system
functionality such as file input/output that would otherwise be inaccessible to
Python programmers. These modules and libraries provide solutions to many
programming problems.
NumPy
”NumPy is a library for the Python programming language, adding support for
large, multidimensional arrays and matrices, along with a large collection of
high-level mathematical functions to operate on these arrays”. The predecessor
of NumPy is Numeric, which was originally created by Jim Hugunin with
contributions from several other developers. In 2005, Travis Oliphant created
NumPy by incorporating features of the competing Numarray into Numeric, with
extensive modifications. [12] It is an open-source library and free of cost.
Pandas + HTML
Required libraries: pandas, jinja2
Creating an HTML report with pandas works similarly to what we just did with
Excel: if you want a tiny bit more than just dumping a DataFrame as a raw HTML
table, you're best off combining Pandas with a templating engine like Jinja:
<html>
<head>
<style>
* {
font-family: sans-serif;
}
body {
padding: 20px;
}
table {
border-collapse: collapse;
text-align: right;
}
table tr {
border-bottom: 1px solid
}
table th, table td {
padding: 10px 20px;
}
</style>
</head>
<body>
<h1>My Report</h1>
{{ my_table }}
</body>
</html>
Then, in the same directory, let’s run the following Python script that will create
our HTML report:
import pandas as pd
import numpy as np
import jinja2

# Sample DataFrame
df = pd.DataFrame(np.random.randn(5, 4), columns=['one', 'two', 'three', 'four'],
                  index=['a', 'b', 'c', 'd', 'e'])

# See: https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html#Building-styles
def color_negative_red(val):
    color = 'red' if val < 0 else 'black'
    return f'color: {color}'

styler = df.style.applymap(color_negative_red)

# Template handling
env = jinja2.Environment(loader=jinja2.FileSystemLoader(searchpath='.'))
template = env.get_template('template.html')
html = template.render(my_table=styler.render())
with open('report.html', 'w') as f:
    f.write(html)

# Plot
ax = df.plot.bar()
fig = ax.get_figure()
fig.savefig('plot.svg')
The result is a nice looking HTML report that could also be printed as a PDF by
using something like WeasyPrint:
Note that for such an easy example, you wouldn’t necessarily need to use a Jinja
template. But when things start to become more complex, it’ll definitely come in very
handy.
xlwings
xlwings allows you to program and automate Excel with Python instead of VBA.
The difference to XlsxWriter or OpenPyXL (used above in the Pandas section) is
the following: XlsxWriter and OpenPyXL write Excel files directly on disk. They
work wherever Python works and don’t require an installation of Microsoft Excel.
xlwings, on the other hand, can write, read and edit Excel files via the Excel
application, i.e. a local installation of Microsoft Excel is required. xlwings also
allows you to create macros and user-defined functions in Python rather than in
VBA, but for reporting purposes, we won’t really need that.
While XlsxWriter/OpenPyXL are the best choice if you need to produce reports in
a scalable way on your Linux web server, xlwings does have the advantage that it
can edit pre-formatted Excel files without losing or destroying anything.
OpenPyXL on the other hand (the only writer library with xlsx editing capabilities)
will drop some formatting and sometimes leads to Excel raising errors during
further manual editing.
import xlwings as xw
import pandas as pd
import numpy as np
# Sample DataFrame
df = pd.DataFrame(np.random.randn(5, 4), columns=['one', 'two', 'three', 'four'],
index=['a', 'b', 'c', 'd', 'e'])
This allows us to create a good looking report in your corporate design very fast.
The best part is that the Python developer doesn’t necessarily have to do the
formatting but can leave it to the business user who owns the report.
Note that you could instruct xlwings to run the report in a separate and hidden
instance of Excel so it doesn’t interfere with your other work.
xlwings PRO
The Pandas + Excel as well as the xlwings (open source) samples both have a few
issues:
• If, for example, you insert a few rows below the title, you have to adjust
the cell references accordingly in the Python code. Using named ranges could
help, but they have other limitations (like the one mentioned at the end of
this list).
• The number of rows in the table might be dynamic. This leads to two issues:
(a) data rows might not be formatted consistently and (b) content below the
table might get overwritten if the table is too long.
• Placing the same value in a lot of different cells (e.g. a date in the source
note of every table or chart) will cause duplicated code or unnecessary loops.
To fix these issues, xlwings PRO comes with a dedicated reports package:
• Separation of code and design: users without coding skills can change the
template on their own without having to touch the Python code.
• Template variables: Python variables (between double curly braces) can be
used directly in cells, e.g. {{ title }}. They act as placeholders that are
replaced by the values of the variables.
• Frames for dynamic tables: Frames are vertical containers that dynamically
align and style tables with a variable number of rows. To see how Frames work,
have a look at the documentation.
You can get a free trial for xlwings PRO here. When using the xlwings PRO reports
package, your code simplifies to the following:
import pandas as pd
import numpy as np
from xlwings.pro.reports import create_report # part of xlwings PRO
# Sample DataFrame
df = pd.DataFrame(np.random.randn(5, 4), columns=['one', 'two', 'three', 'four'],
index=['a', 'b', 'c', 'd', 'e'])
Plotly Dash
Required libraries: pandas, dash
To create a report, we're using their latest product, Plotly Dash, an open-
source framework that allows the creation of interactive web dashboards with
Python only (no need to write JavaScript code). Plotly Dash is also available
as an Enterprise plan.
How it works is best explained by looking at some code, adopted with minimal
changes from the official getting started guide:
import pandas as pd
import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output

# Sample DataFrame
df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/gapminderDataFiveYear.csv')

app = dash.Dash(__name__)

# Layout: a chart plus a year slider (restored from the official getting started guide)
app.layout = html.Div([
    dcc.Graph(id='graph-with-slider'),
    dcc.Slider(
        id='year-slider',
        min=df['year'].min(),
        max=df['year'].max(),
        value=df['year'].min(),
        marks={str(year): str(year) for year in df['year'].unique()},
        step=None
    )
])

# This code runs every time the slider below the chart is changed
@app.callback(Output('graph-with-slider', 'figure'), [Input('year-slider', 'value')])
def update_figure(selected_year):
    filtered_df = df[df.year == selected_year]
    traces = []
    for i in filtered_df.continent.unique():
        df_by_continent = filtered_df[filtered_df['continent'] == i]
        traces.append(dict(
            x=df_by_continent['gdpPercap'],
            y=df_by_continent['lifeExp'],
            text=df_by_continent['country'],
            mode='markers',
            opacity=0.7,
            marker={'size': 15, 'line': {'width': 0.5, 'color': 'white'}},
            name=i
        ))
    return {
        'data': traces,
        'layout': dict(
            xaxis={'type': 'log', 'title': 'GDP Per Capita', 'range': [2.3, 4.8]},
            yaxis={'title': 'Life Expectancy', 'range': [20, 90]},
            margin={'l': 40, 'b': 40, 't': 10, 'r': 10},
            legend={'x': 0, 'y': 1},
            hovermode='closest',
            transition={'duration': 500},
        )
    }

if __name__ == '__main__':
    app.run_server(debug=True)
The charts look great by default and it’s very easy to make your dashboard
interactive by writing simple callback functions in Python: You can choose the
year by clicking on the slider below the chart. In the background, every change to
our year-slider will trigger the update_figure callback function and hence update
the chart.
By arranging your documents properly, you could create an interactive web
dashboard that can also act as the source for your PDF factsheet; see for
example their financial factsheet demo together with its source code.
Datapane
Required libraries: datapane
Using Datapane, you can either generate one-off reports, or deploy your Jupyter
Notebook or Python script so others can generate reports dynamically by entering
parameters through an automatically generated web app.
import datapane as dp
import pandas as pd
import altair as alt
df = pd.read_csv('https://query1.finance.yahoo.com/v7/finance/download/GOOG?period2=1585222905&interval=1mo&events=history')

chart = alt.Chart(df).encode(
    x='Date:T',
    y='Open'
).mark_line().interactive()

r = dp.Report(dp.Table(df), dp.Plot(chart))
r.save(path='report.html')
This code renders a standalone HTML document with an interactive, searchable
table and plot component.
If you want to publish your report, you can login to Datapane (via $ datapane login) and use
the publish method, which will give you a URL such as this which you can share or embed.
r.publish(name='my_report')
# stocks.py
import datapane as dp
import altair as alt
import yfinance as yf
dp.Params.load_defaults('./stocks.yaml')
tickers = dp.Params.get('tickers')
plot_type = dp.Params.get('plot_type')
period = dp.Params.get('period')
data = yf.download(tickers=' '.join(tickers), period=period, groupby='ticker').Close
ReportLab
Required libraries: pandas, reportlab
ReportLab OpenSource
In its most basic functionality, ReportLab uses a canvas where you can place
objects using a coordinate system:
import pandas as pd
import numpy as np
from reportlab.pdfgen.canvas import Canvas
from reportlab.lib import colors
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib.units import inch
from reportlab.platypus import Paragraph, Frame, Table, Spacer, TableStyle

# Sample DataFrame
df = pd.DataFrame(np.random.randn(5, 4), columns=['one', 'two', 'three', 'four'],
                  index=['a', 'b', 'c', 'd', 'e'])

# Style Table
df = df.reset_index()
df = df.rename(columns={"index": ""})
data = [df.columns.to_list()] + df.values.tolist()
table = Table(data)
table.setStyle(TableStyle([
    ('INNERGRID', (0, 0), (-1, -1), 0.25, colors.black),
    ('BOX', (0, 0), (-1, -1), 0.25, colors.black)
]))

# Build the "story": a title, some space and the table
# (this block was missing; reconstructed from the unused imports above)
styles = getSampleStyleSheet()
story = [Paragraph('My Report', styles['Heading1']), Spacer(1, 20), table]

# Use a Frame to dynamically align the components and write the PDF file
c = Canvas('report.pdf')
f = Frame(inch, inch, 6 * inch, 9 * inch)
f.addFromList(story, c)
c.save()
ReportLab PLUS additionally offers:
• a templating language
• the ability to include vector graphics
The templating language is called RML (Report Markup Language), an XML dialect.
Here is a sample of what it looks like, taken directly from the official
documentation:
<document filename="example.pdf">
<template>
<pageTemplate id="main">
<frame id="first" x1="72" y1="72" width="451" height="698" />
</pageTemplate>
</template>
<stylesheet>
</stylesheet>
<story>
<para>
This is the "story". This is the part of the RML document where
your text is placed.
</para>
<para>
It should be enclosed in "para" and "/para" tags to turn it into
paragraphs.
</para>
</story>
</document>
The idea here is that you can have any program produce such an RML document,
not just Python, which can then be transformed into a PDF document by
ReportLab PLUS.
VARIOUS LEARNING ALGORITHMS
Simply put, Machine Learning (ML) is the process of employing algorithms to help
computer systems progressively improve their performance for some specific
task. Software-based ML can be traced back to the 1950’s, but the number and
ubiquity of ML algorithms has exploded since the early 2000’s, mainly due to the
rising popularity of the Python programming language, which continues to drive
advances in ML.
The reigning ML package champ is arguably Python's scikit-learn, which offers
simple and easy syntax paired with a treasure trove of algorithms.
Decision Tree Algorithm
The Decision Tree algorithm is widely applicable to most scenarios, and can be
surprisingly effective for such a simple algorithm. It requires minimal data
preparation, and can even work with blank values. This algorithm focuses on
learning simple decision rules inferred from the data. It then compiles them into a
set of “if-then-else” decision rules:
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Xtrain, ytrain, Xtest come from an earlier train/test split
model = DecisionTreeClassifier()
model.fit(Xtrain, ytrain)
ypred = model.predict(Xtest)

# Export the learned tree for visualization with Graphviz
dot_file = 'visualizations/decision_tree.dot'
export_graphviz(model, out_file=dot_file)
If you think one decision tree is great, imagine what a forest of them could do!
That’s essentially what a Random Forest Classifier does. The classification starts
off by using multiple trees with slightly different training data. The predictions
from all of the trees are then averaged out, resulting in better performance than
any single tree in the model.
Random Forests can be used to solve classification or regression problems. Let’s
have a look at how it solves the following classification problem:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(Xtrain, ytrain)
ypred = model.predict(Xtest)
The k-nearest neighbor (KNN) algorithm is a simple and efficient algorithm that
can be used to solve both classification and regression problems. If you know the
saying, “birds of a feather flock together” you have the essence of KNN in a
nutshell. It assumes that similar “things” exist in close proximity to each other.
Although you need to perform a certain amount of data cleansing before applying
the algorithm, the benefits outweigh the burdens. Let’s have a look:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=7)
model.fit(Xtrain, ytrain)
ypred = model.predict(Xtest)
As you can see, you can tune the hyperparameters to achieve the highest
accuracy. When the above code is executed with just 1 neighbor, the
accuracy rate falls to 70%.
[Running] python -u "/top-10-machine-learning-algorithms-sklearn/knn.py"
K-Nearest Neighbor Accuracy Score: 74.0 %
Now let’s visualize it. This part is a little tricky since we will need to reduce the
model’s dimensions to be able to visualize the result on a scatter plot. You
may want to read more about Principal Component Analysis (PCA), but for
the purposes of this article, all you need to know is that PCA is used to
reduce dimensionality while preserving the meaning of the data.
from sklearn.decomposition import PCA
import pylab as pl
import matplotlib.pyplot as plt

pca = PCA(n_components=2).fit(X)
pca_2d = pca.transform(X)
# Plot survivors and non-survivors in different colours (the loop and the first
# scatter call were missing; the survivor colour/marker is an assumption)
for i in range(0, pca_2d.shape[0]):
    if y[i] == 1:
        c1 = pl.scatter(pca_2d[i, 0], pca_2d[i, 1], c='g', marker='o')
    elif y[i] == 0:
        c2 = pl.scatter(pca_2d[i, 0], pca_2d[i, 1], c='r', marker='+')
pl.title('Titanic Survivors')
plt.savefig('visualizations/knn.png')
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

model = BaggingClassifier(
    base_estimator=SVC(),
    n_estimators=10,
    random_state=0
)
model.fit(Xtrain, ytrain)
ypred = model.predict(Xtest)
In order to split up the data for multiple learners, we use a Linear Support Vector
Classifier (SVC) to fit and divide the data as equally as possible. This means that no
one set of data will lean on a column too much or have too much variability
between the data.
Boosting is very similar to bagging in the sense that it averages out the results of
multiple weak learners. However, in the case of boosting, these learners are
executed in a sequential manner such that the latest model depends on the
previous one. This leads to lower bias, meaning that it can handle a larger
variance of data.
from xgboost.sklearn import XGBClassifier
from base import Base  # local helper module from the original tutorial project

model = XGBClassifier()
model.fit(Xtrain, ytrain)
ypred = model.predict(Xtest)
It’s time to remember your high school course in probability. The Naive Bayes
algorithm determines the probability of each feature set and uses that to
determine the probability of the classification itself.
Here’s a fantastic example from Naive Bayes for Dummies:
“A fruit may be considered to be an apple if it is red, round, and about 3″ in
diameter. A Naive Bayes classifier considers each of these “features” (red, round,
3” in diameter) to contribute independently to the probability that the fruit is an
apple, regardless of any correlations between features.”
Now let’s look at the code:
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(Xtrain, ytrain)
ypred = model.predict(Xtest)
print("\n\nNaive Bayes Accuracy Score:", Base.accuracy_score(ytest, ypred), "%")
[Running] python -u
"/top-10-machine-learning-algorithms-sklearn/naive_bayes.py"
# The lines computing the confusion matrix were elided; this is a standard
# reconstruction using sklearn and seaborn
from sklearn.metrics import confusion_matrix
import seaborn as sns
mat = confusion_matrix(ytest, ypred)
sns.heatmap(mat, square=True, annot=True, cbar=False)
plt.xlabel('true label')
plt.ylabel('predicted label')
plt.savefig("visualizations/naive_bayes_confusion_matrix.png")
Now, we can see that the possibility of false positives is higher than false
negatives.
Support Vector Machines (SVMs) are robust, non-probabilistic models that can be
used to predict both classification and regression problems. SVMs maximize space
to widen the gap between categories and increase accuracy.
Let’s have a look:
from sklearn import svm
model = svm.LinearSVC(random_state=800)
model.fit(Xtrain, ytrain)
ypred = model.predict(Xtest)
As you can see, SVMs are generally more accurate than other methods. If even
better data cleansing methods are applied, we can aim to reach higher accuracy.
Stochastic Gradient Descent (SGD) is popular in the neural network world, where
it’s used to optimize the cost function. However, we can also use it to classify
data. SGD is great for scenarios in which you have a large dataset with a very large
feature set. It can help to reduce the complexities involved in learning from highly
variable data.
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
model.fit(Xtrain, ytrain)
ypred = model.predict(Xtest)
[Running] python -u "/top-10-machine-learning-algorithms-sklearn/stochastic_gradient_descent_classifier.py"
Logistic Regression
This is a very basic model that still delivers decent results. It’s a statistical model
that uses logistic (sigmoid) functions to accurately predict data.
Before you use this model, you need to ensure that your training data is clean
and has little noise. Significant variance could lower the accuracy of the model.
from sklearn.linear_model import LogisticRegression
from base import Base
model = LogisticRegression()
model.fit(Xtrain, ytrain)
ypred = model.predict(Xtest)
Now, let’s plot the Receiver Operating Characteristics (ROC) curve for this model.
This curve helps us visualize accuracy by plotting the true positive rate on the Y
axis and the false positive rate on the X axis. The “larger” the area under the
curve, the more accurate the model.
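A sketch of plotting such a ROC curve with scikit-learn; the toy dataset here is an assumption standing in for the project's cleaned data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Toy binary data in place of the project's cleaned dataset
X, y = make_classification(n_samples=300, random_state=0)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(Xtrain, ytrain)
yscore = model.predict_proba(Xtest)[:, 1]

fpr, tpr, _ = roc_curve(ytest, yscore)  # false/true positive rates
auc = roc_auc_score(ytest, yscore)      # area under the curve

plt.plot(fpr, tpr, label=f'AUC = {auc:.2f}')
plt.plot([0, 1], [0, 1], linestyle='--')  # chance line
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.savefig('logistic_regression_roc.png')
```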
Voting Classifier
Voting Classifier is another ensemble method where instead of using the same
type of “weak” learners, we choose very different models. The idea is to combine
conceptually different ML algorithms and use a majority vote to predict the class
labels. This is useful for a set of equally well-performing models since it can
balance out individual weaknesses.
For this ensemble, I will combine a Logistic Regression model, a Naive Bayes
model, and a Random Forest model:
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

df = Base.clean()
X = df.drop(['Survived'], axis=1)
y = df['Survived']

model_1 = LogisticRegression()
model_1.fit(Xtrain, ytrain)
ypred = model_1.predict(Xtest)

model_2 = GaussianNB()
model_2.fit(Xtrain, ytrain)
ypred = model_2.predict(Xtest)

model_3 = RandomForestClassifier()
model_3.fit(Xtrain, ytrain)
ypred = model_3.predict(Xtest)

# Combine the three models with a majority ("hard") vote
eclf = VotingClassifier(
    estimators=[('lr', model_1), ('nb', model_2), ('rf', model_3)],
    voting='hard'
)
eclf.fit(Xtrain, ytrain)
DATA COLLECTION
Before analyzing and visualizing, we need the raw data, and this raw data can be
gathered from the different open-source data websites available on the internet.
This data will be in raw form, so it may contain null values or records that are
not useful for our analysis. We need a COVID-19 dataset for different countries
containing the number of deaths, recoveries, vaccinations, etc.
Some of the websites that we used to get Covid-19 dataset are as follows:
https://www.kaggle.com
Kaggle allows users to find and publish data sets, explore and build models
in a web-based data-science environment, work with other data scientists
and machine learning engineers, and enter competitions to solve data
science challenges. Kaggle provides a COVID-19 dataset which we can
download from their website in .csv format.
https://www.who.int/
https://ourworldindata.org/
METHODOLOGY
After data collection, we have to work on the data in order to analyse and
visualise it in a more user-friendly way. For analysis we can't use this raw
data directly, because it contains null values which have to be processed and
removed from the dataset in order to get accurate results. To do so, we need to
apply various methods to this dataset, but first of all we need to import the
libraries required for analysis and visualization.
We will be analyzing the data with the help of some questions. Below is a
figure of the data sheet in Excel that gives a hint of how the data is
available to us.
Libraries Required
After the libraries are imported, we have to remove the null values shown in
the figure.
Null Values
To remove the null values, we use the inbuilt function isnull(). The isnull()
method returns a DataFrame object where all the values are replaced with the
Boolean value True for NULL values, and False otherwise. Since sum() counts
True as 1 and False as 0, you can count the number of missing values in each
row and column by calling sum() on the result of isnull().
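A small sketch of the isnull()/sum() combination described above (the DataFrame is a made-up stand-in for the real dataset):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Confirmed': [100, np.nan, 250],
    'Deaths': [2, 5, np.nan],
})

missing_per_column = df.isnull().sum()  # True counts as 1, False as 0
total_missing = df.isnull().sum().sum()
print(missing_per_column)
print(total_missing)  # 2
```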
Seaborn and Matplotlib are the two libraries used to plot data in various
forms: we can plot graphs from the data, or create heatmaps, for example to
show the null values present in the dataset, which helps to visualise the data
rather than just reading the digits.
Heatmap of missing values
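The heatmap figure itself is not reproduced here; a sketch of how such a missing-value heatmap can be drawn (the column names are assumptions):

```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Confirmed': [100, np.nan, 250, 80],
    'Recovered': [10, 30, np.nan, np.nan],
})

# Each missing (True) cell is drawn in a contrasting colour
sns.heatmap(df.isnull(), cbar=False)
plt.title('Heatmap of missing values')
plt.savefig('missing_values_heatmap.png')
```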
Now particular data can be fetched from the datasets and used for analysis or
to target any problem or query. As we can see, the data is not yet visually
appealing; our analysed information and graphs are not well managed or properly
structured, and it should be our priority to build the project with a
structured, good interface. In data analytics, visualisation is very important.
After the data is analysed properly, it is time to visualise it with a more
user-friendly interface, which will make our project more effective and
interactive. To do so, we will use the Power BI software tool for visualisation.
POWER BI
Microsoft Power BI is a business intelligence platform that provides nontechnical
business users with tools for aggregating, analysing, visualizing and sharing data.
Power BI's user interface is fairly intuitive for users familiar with Excel and its
deep integration with other Microsoft products makes it a very versatile self-
service tool that requires little upfront training.
Power BI is a powerful tool used to give insight into the data in a more
organised and visually appealing way. When the backend part is done, it is time
to build the frontend.
RESULT
The COVID-19 Data Analysis and Visualisation project provides an insight into
how badly COVID-19 affected different parts of different countries. It covers
various sections: it shows the number of confirmed cases, the number of people
recovered from COVID-19 and the number of deaths to date, and it presents the
data in graphical form and in the form of pie charts. You can also select a
country from the list and get the data for that country, as shown in the
figures below.
Sum of Deaths, Recovered, Confirmed and Active cases all over the world
CONCLUSION
The COVID-19 pandemic is a huge struggle for all of us. The project we are
making seeks to answer the most pertinent questions about what makes COVID-19
such a tragedy and which people are most affected by it. It seeks to find the
appropriate response that can be mounted by the authorities concerned, so that
we can reach a place of proper discussion about the problem and solve it in the
best possible manner. It can also lead to a solution for any medical condition
we might encounter later on, where we can apply data science for medical
diagnostics. This project saves the already limited resources that India has
and helps prevent the spread, as people can use it to get an idea of whether
they should go and get tested. It also helps unhealthy and infected people to
isolate themselves. Using this system we can effectively and efficiently
mitigate the burden on our completely stressed healthcare system.
BIBLIOGRAPHY
1. https://en.wikipedia.org/wiki/Choropleth_map
2. https://github.com/CSSEGISandData/COVID-19
3. https://www.who.int/
4. https://ourworldindata.org/
5. https://data.humdata.org/