
MINI PROJECT

ON
COVID-19 ANALYSIS AND VISUALIZATION
By
Sakshi Chauhan (2001320100117)
Yash Dixit (2001320100168)
Utkarsh Poswal (2001320100155)
Shalini (2001320100127)

Submitted to the Department Of Computer Science


In Partial Fulfilment Of The Requirements
For The Degree Of
Bachelor Of Technology
Submitted to Mr. Ravin Kumar

Department of Computer Science and Engineering

GREATER NOIDA INSTITUTE OF TECHNOLOGY, GREATER NOIDA, UP


(201310)
CERTIFICATE

We hereby declare that the Project entitled “COVID-19 ANALYSIS AND VISUALIZATION”
submitted to the Department of Computer Science and Engineering, G.N.I.O.T., Greater Noida,
U.P. in partial fulfilment for the award of the degree of BACHELOR OF TECHNOLOGY in
session 2020-2024 is an authentic record of our own work carried out under the guidance of
Mr. RAVIN KUMAR, and that the project has not previously formed the basis for the award of any
other degree/diploma.

Signature of student:

Place: Greater Noida, UP Sakshi Chauhan


Date:30-10-2022 (2001320100117)
Yash Dixit
(2001320100168)
Utkarsh Poswal
(2001320100155)
Shalini
(2001320100127)

This is to certify that the above statement made by the candidates is correct to the
best of my knowledge.

Signature of Guide:

MR. RAVIN KUMAR

(ASSISTANT PROFESSOR, GNIOT)


ACKNOWLEDGEMENT

We are very thankful to the Director of GNIOT for his constant support and encouragement
during this project.

We are also thankful to the Head of the Computer Science Department, Dr. SANDEEP SAXENA,
for his valuable support and efforts towards this project.

We are very grateful to our Project Guide, Mr. RAVIN KUMAR, for giving his valuable
time and constructive guidance in preparing the Project.

It would not have been possible to complete this work in such a short period of time without his
kind encouragement and valuable guidance.

Date: 30/10/2022 Signature of Student:


Sakshi Chauhan
(2001320100117)
Yash Dixit
(2001320100168)
Utkarsh Poswal
(2001320100155)
Shalini
(2001320100127)
TABLE OF CONTENTS

S NO. TITLE
1. Certificate
2. Acknowledgement
3. Introduction
4. Requirements

5. Abstract
6. Data Science Introduction
7. Python Programming
8. Libraries in Python
9. Various learning Algorithms
10. Data Collection
11. Methodology
12. Data Analysis
13. Power BI
14. Result
15. Conclusion
16. Bibliography

INTRODUCTION

The COVID-19 pandemic, also known as the coronavirus pandemic, is an ongoing global
pandemic of coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome
coronavirus 2 (SARS-CoV-2). Most people who fall sick with COVID-19 experience mild
to moderate symptoms and recover without special treatment. However, some become
seriously ill and require medical attention.
The best way to prevent and slow down transmission is to be well informed about the disease
and how the virus spreads. We therefore analyzed and visualized COVID-19 data: if another
disease emerges in the future, we can find out in which regions people get infected the most,
which regions are affected less, and where treatment needs to start as soon as possible, and we
do not neglect the regions with fewer recorded cases; that data is visualized too. We took our
data from the WHO site and other sources, analyzed it, and then visualized it. To visualize and
analyze the data, we have created a dashboard in Power BI.
So what is Power BI?
Power BI is a collection of software services, apps, and connectors that work together to turn
your unrelated sources of data into coherent, visually immersive, and interactive insights. Your
data may be an Excel spreadsheet, or a collection of cloud-based and on-premises hybrid data
warehouses. Through Power BI we covered the following measures:
• Sum of Recovered
• Sum of Confirmed
• Sum of Active
• Sum of Deaths

First of all, we import our data with the help of libraries (a library is a collection of precompiled
code that can be used later in a program for specific, well-defined operations); we use the
NumPy, Pandas, Matplotlib, Seaborn and Sklearn libraries in our analysis.
After that we removed all the null values which could affect our output/analysis later in
Power BI. The data contains a lot of null values that are of no use, so we cleaned it by removing
duplicate, irrelevant and unwanted observations, fixing structural errors, filtering unwanted
outliers and handling missing data. After data cleaning we performed EDA. So what is EDA?
Exploratory Data Analysis (EDA) is an approach to analyzing data using visual techniques. It is
used to discover trends and patterns, or to check assumptions, with the help of statistical
summaries and graphical representations. Through this we covered many objectives and worked
out the region-wise records to analyze and visualize.
Through this dashboard, we aim to provide a frequently updated data visualization, data
dissemination and data exploration resource, while linking users to other useful and informative
resources.

REQUIREMENTS

1. Install Power BI (Business Intelligence) Tool.


2. Install Jupyter Notebook.
3. Libraries:
 Numpy
 Pandas
 Matplotlib
 Seaborn
 Sklearn
4. Python Programming.
ABSTRACT

In this project, we work with COVID-19 data gathered from different sites, which consists of
the cumulative number of confirmed cases, active cases, recovered cases and death cases per
day, in each Country/Region.

 Importing COVID-19 data and preparing it for the analysis by dropping columns and
aggregating rows.
 Deciding on and calculating a good measure for our analysis and preprocessing the data.
 Removing superfluous columns.
 Grouping where needed.
 Renaming the columns.
 Finding correlations in our data.
 Visualizing our analysis results using Seaborn.
This will help us in the future: if another disease emerges, we can find out in which regions
people get infected the most, which regions are affected less, and where treatment needs to start
as soon as possible, without neglecting the regions with fewer recorded cases; that data is
visualized too. Through this we have covered a lot of ground.
Our main objective is to visualize the confirmed, active, recovered and death cases for future
work.
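
These steps can be sketched in pandas and Seaborn roughly as follows; the file name and column names are assumptions and will differ with the dataset actually used:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('covid_19_data.csv')                      # hypothetical input file

# Drop columns that are not needed for the analysis.
df = df.drop(columns=['Province/State', 'Last Update'], errors='ignore')

# Aggregate rows so that there is one record per country.
grouped = df.groupby('Country/Region', as_index=False)[
    ['Confirmed', 'Recovered', 'Deaths', 'Active']].sum()

# Rename the columns to shorter, consistent names.
grouped = grouped.rename(columns={'Country/Region': 'Country'})

# Correlations among the measures, visualized with Seaborn.
corr = grouped[['Confirmed', 'Recovered', 'Deaths', 'Active']].corr()
sns.heatmap(corr, annot=True, cmap='Blues')
plt.title('Correlation between COVID-19 measures')
plt.show()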
DATA SCIENCE INTRODUCTION

Data Science

Data science is the field of data analytics and data visualization in which raw or
unstructured data is cleaned and made ready for analysis. Data scientists use this
data to extract the information required for future use. “Data science uses many
processes and methods on big data; the data may be structured or unstructured.”
The data frames available on the internet are the raw data we get, either in
unstructured or semi-structured format. This data is filtered and cleaned, and then
the required tasks are performed for the analysis using a high-level programming
language. The data is then analysed and presented for better understanding and
evaluation.

One must be clear that data science is not about building complicated models,
making impressive visualizations or writing code for its own sake; it is about
using the data to create an impact for your company, and for that impact we need
tools such as complicated data models and data visualization.
Stages of Data Science

There are many tools used to handle the big data available to us.
“Data scientists use programming tools such as Python, R, SAS, Java,
Perl, and C/C++ to extract knowledge from prepared data.”
Data scientists apply many algorithms and mathematical models to the data.
Following are the stages, and their cycle, performed on the
unstructured data.

• Identifying the problem

• Identifying available data sources

• Identifying whether additional data sources are needed

• Statistical analysis

• Implementation and development

• Communicating the results

• Maintenance


7 steps that together constitute this life-cycle model of Data science

Data science finds its application in many fields. With the assistance of data
science, it is easy to answer search queries on search engines in very little time.
The role of the data scientist is to have a deep understanding of the data as well
as a good command of a programming language; he or she should also know how
to work with the raw data extracted from the data source. Many programming
languages and libraries are used to analyze and evaluate the data, such as Python,
Java, MATLAB, Scala, Julia, R, SQL and TensorFlow, among which Python is the
most user-friendly and most widely used programming language in the field of data
science.
This life cycle is applied in each and every field; in this project we will be
considering all seven stages of data science to analyze the data. The process
starts from data collection and proceeds through data preparation and data
modeling to data evaluation. For instance, as we have a huge amount of data, we
can create an energy model for a particular country by collecting its previous
energy data, and we can also predict its future requirements with the same data.
PYTHON PROGRAMMING

Why only Python?

“Python is an interpreted, object-oriented, high-level programming language
with dynamic semantics”.[6] The language's built-in data structures make it very
easy for data scientists to analyse data effectively. It does not only help in
forecasting and analysis, it also helps in connecting two different languages.
Two of the best features of this programming language are that it has no separate
compilation step, unlike other programming languages in which compilation is done
before the program is executed, and that it supports reuse of code: it consists of
modules and packages, so previously written code can be used anywhere in the
program whenever required.
There are multiple languages, for example R, Java, SQL, Julia, Scala and MATLAB,
available on the market which can be used to analyze and evaluate the data, but
due to some outstanding features Python is the most famous language used in
the field of data science.

Python is the most used and easiest among all these programming languages for
the following reasons.
Data structures in Python

Data structures are ways of storing data so that we can easily perform
different operations on it whenever required. When the data has been
collected from the data source it is available in different forms, so it is
easier for data scientists to perform different operations on the data once
it is sorted into different data structures.
Data structures are mainly classified into two categories, and then further
into the subcategories shown below.

Following are the data structures.


 Array – An array is a collection of elements of the same data type. Array data
structures are used mostly in the NumPy library of Python. In the sketch after
this list we first import NumPy, define an array in the variable arr, divide the
array by 7 and print the result.

 List – “A list is a value that contains multiple values in an ordered
sequence”. The values in a list can be stored in a variable or passed to a
function. Lists are changeable and their values are enclosed inside square
brackets; we can perform multiple operations on them such as indexing, slicing,
adding and multiplying.

 Tuple – A tuple is a list of non-changeable objects. The differences
between tuples and lists are that tuples cannot be changed, and tuples
use parentheses whereas lists use square brackets.

 Dictionary – A dictionary is a data structure consisting of key-value pairs
enclosed in curly brackets. It is like any dictionary we use in day-to-day
life to find the meaning of a particular word: the word is the key and its
meaning is the value. In the example after this list, name, occupation and
hobby are the keys, and Suraj, data analyst and vlogging are the values
assigned to those keys.

 Sets – Sets are used for mathematical operations such as
union, intersection and symmetric difference.

Below is the data structure tree which explains the category and sub-category
of each data type.
A data structure tree at a glance
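
The figures from the original report are not reproduced here; the following is a minimal sketch of the examples described above (the values, such as the dictionary entries, follow the text):

import numpy as np

# Array: a NumPy array divided by 7, as described in the Array item above.
arr = np.array([7, 14, 21, 28])
print(arr / 7)              # [1. 2. 3. 4.]

# List: ordered and changeable, written with square brackets.
cases = [10, 20, 30]
cases.append(40)            # adding
print(cases[1:3])           # slicing -> [20, 30]

# Tuple: ordered but non-changeable, written with parentheses.
point = (28.6, 77.2)

# Dictionary: key-value pairs enclosed in curly brackets.
person = {"name": "Suraj", "occupation": "data analyst", "hobby": "vlogging"}
print(person["name"])       # Suraj

# Set: mathematical operations such as union, intersection, symmetric difference.
a, b = {1, 2, 3}, {3, 4, 5}
print(a | b, a & b, a ^ b)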

Condition statements
If else statements
“The most common type of statement is the if statement. An if statement consists
of a block which is called a clause”: the block after the if statement is executed
if the condition is True and skipped if the condition is False, in which case the
statement in the else part is executed instead.

An if statement consists of the following:

• The if keyword itself

• A condition which may be True or False

• A colon

• The if clause, or a block of code

The sketch below shows how if and else statements are used.
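
A minimal sketch of an if/else statement (the variable name and threshold are illustrative):

confirmed_cases = 120

if confirmed_cases > 100:        # condition after the if keyword, followed by a colon
    print("High number of confirmed cases")        # if clause
else:
    print("Confirmed cases are under control")     # runs only when the condition is False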
Elif statements
In an if/elif chain only one of the blocks is executed; there are many cases in
which only one of several possibilities should run. “The elif statement is an else if
statement that always follows an if or another elif statement”. The elif
statement provides another condition that is checked only if all of the
previous conditions were False. The only difference between an elif and an else
statement is that an elif statement has a condition whereas an else statement
does not. An elif statement consists of the following:

 The elif keyword itself
 A condition which may be True or False
 A colon
 The elif clause, or a block of code
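
Extending the previous sketch with an elif branch (again with illustrative values):

confirmed_cases = 60

if confirmed_cases > 100:
    print("High")
elif confirmed_cases > 50:       # checked only if the previous condition was False
    print("Moderate")
else:
    print("Low")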
Loops in python
For loop
When do we use for loops?
For loops are traditionally used when you have a block of code which you want
to repeat a fixed number of times. The Python for statement iterates over the
members of a sequence in order, executing the block each time.

Range statement – The range() function is used with for loop statements, and you
can pass it a single value: for example, range(10) produces the numbers 0 to 9,
i.e. it stops at n-1. You can also specify the start and end values. The following
example demonstrates a for loop.
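
A minimal sketch of a for loop using range() (the loop body is illustrative):

# range(10) yields 0 to 9; range(1, 11) yields 1 to 10.
for day in range(1, 11):
    print("Processing data for day", day)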
While loop
While loops are also used for repeating a section of code, but unlike a for
loop, a while loop does not run n times; it runs until a defined condition is no
longer met. If the condition is initially False, the loop body is not executed
at all.
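
A minimal sketch of a while loop (the counter is illustrative):

active_cases = 5

# The body keeps running until the condition becomes False.
while active_cases > 0:
    print("Active cases remaining:", active_cases)
    active_cases -= 1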
Module, Package and Functions
• Module
Modules are Python files with the extension .py; the name of the module is
the name of the file. A Python module can have a set of functions, classes or
variables defined and implemented.

A module contains Python code which can define classes, functions and
variables. The reason for using modules is that they organize your Python
code by grouping related code together, which makes it easier to use.

• Package
A package consists of a collection of modules together with a file named
__init__.py. Any directory on the Python path that contains a file named
__init__.py is treated as a package by Python. Packages are used for
organizing modules using dotted names.

For example, suppose we have a package named simple_package which consists of
two modules, a and b. We can import the modules from the package in the
following way:

from simple_package import a, b

• Functions

A function is a piece of Python code which can be reused at any time in the
whole program. A function performs a specific task whenever it is called
during the program. With the help of functions the program is divided into
multiple blocks of code.

• Built-in functions – Functions which are already part of the Python
language and have a specific action to perform are called built-in functions;
they come ready to use as part of the language. Some examples are chr(),
which returns the character for a given code point, print(), which prints an
object to the terminal, and min(), which returns the minimum of the given values.

• User-defined functions – These functions are defined by the user and start
with the keyword def. In the sketch below we define a function named
temperature and the task to be performed when it is called.
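
The original figure is not reproduced here; a minimal sketch of such a user-defined function (the temperature-conversion logic is an assumption) is:

def temperature(celsius):
    """Convert a temperature from Celsius to Fahrenheit."""
    return celsius * 9 / 5 + 32

print(temperature(37))   # calling the user-defined function -> 98.6

# A few built-in functions for comparison:
print(chr(65))           # 'A'  - character for code point 65
print(min(3, 7, 1))      # 1    - smallest of the given values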
LIBRARIES IN PYTHON

The Python standard library is vast. It contains built-in modules, some written in
C, that provide access to system functionality such as file input/output that would
otherwise not be accessible to Python programmers. These modules and libraries
provide solutions to many problems in programming.

Following are some Python libraries.


 Matplotlib
 Pandas
 TensorFlow
 NumPy
 Keras
 PyTorch
 Datapane
 Plotly Dash
 xlwings PRO
 Jinja2
 LightGBM
 SciPy
Matplotlib
“Matplotlib is a plotting library for the Python programming language and its
numerical mathematics extension NumPy”. Matplotlib provides an object-oriented
API for embedding plots into applications that use general-purpose graphical user
interface toolkits. A related interface is pylab, which is designed to closely
resemble MATLAB.

It is a library for 2D graphics, and it finds its application in web application
servers, graphical user interface toolkits and shells. Below is an example of a
basic plot in Python.
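
A minimal sketch of such a basic plot (the data points are illustrative):

import matplotlib.pyplot as plt

days = [1, 2, 3, 4, 5]
confirmed = [10, 25, 40, 80, 130]

plt.plot(days, confirmed, marker='o')
plt.xlabel('Day')
plt.ylabel('Confirmed cases')
plt.title('A basic Matplotlib line plot')
plt.show()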

Pandas

Pandas is a library, or data analysis tool, for Python that is used for data
analysis and data manipulation. It also provides data structures for working
with tabular data and time series.
We can see the application of pandas in many fields, such as economics,
recommendation systems (Spotify, Netflix and Amazon), stock prediction,
neuroscience, statistics, advertising, analytics and natural language processing.
Data can be represented in pandas in two ways:
DataFrames – the data is two-dimensional and consists of multiple series; it is
always represented as a rectangular table.
Series – the data is one-dimensional and consists of a single list with an index.
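
A minimal sketch of a pandas Series and DataFrame (the values are illustrative):

import pandas as pd

# Series: one-dimensional, a single list of values with an index.
deaths = pd.Series([12, 7, 3], index=['India', 'US', 'Italy'])

# DataFrame: two-dimensional, multiple series forming a rectangular table.
df = pd.DataFrame({
    'Country': ['India', 'US', 'Italy'],
    'Confirmed': [100, 250, 80],
    'Recovered': [60, 200, 50],
})

print(deaths)
print(df.head())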
NumPy

”NumPy is a library for the Python programming language, adding support for
large, multidimensional arrays and matrices, along with a large collection of
high-level mathematical functions to operate on these arrays”. The predecessor
of NumPy, Numeric, was originally created by Jim Hugunin with contributions from
several other developers. In 2005, Travis Oliphant created NumPy by incorporating
features of the competing Numarray into Numeric, with extensive modifications. [12]
It is an open-source library and free of cost.
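
A minimal sketch of NumPy's multidimensional arrays and mathematical functions:

import numpy as np

# A 2 x 3 multidimensional array.
cases = np.array([[10, 20, 30],
                  [5, 15, 25]])

print(cases.sum(axis=0))   # column-wise totals -> [15 35 55]
print(cases.mean())        # overall mean       -> 17.5
print(cases * 2)           # element-wise arithmetic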
Pandas + HTML

Required libraries: pandas, jinja2

Creating an HTML report with pandas works in a similar way to creating Excel
reports: if you want a tiny bit more than just dumping a DataFrame as a raw HTML
table, then you're best off combining pandas with a templating engine like Jinja.

First, let's create a file called template.html:

<html>
<head>
<style>
* {
font-family: sans-serif;
}
body {
padding: 20px;
}
table {
border-collapse: collapse;
text-align: right;
}
table tr {
border-bottom: 1px solid
}
table th, table td {
padding: 10px 20px;
}
</style>
</head>
<body>

<h1>My Report</h1>

{{ my_table }}

<img src='plot.svg' width="600">

</body>
</html>

Then, in the same directory, let’s run the following Python script that will create
our HTML report:

import pandas as pd
import numpy as np
import jinja2

# Sample DataFrame
df = pd.DataFrame(np.random.randn(5, 4), columns=['one', 'two', 'three', 'four'],
                  index=['a', 'b', 'c', 'd', 'e'])

# See: https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html#Building-styles
def color_negative_red(val):
    color = 'red' if val < 0 else 'black'
    return f'color: {color}'

styler = df.style.applymap(color_negative_red)

# Template handling
env = jinja2.Environment(loader=jinja2.FileSystemLoader(searchpath=''))
template = env.get_template('template.html')
html = template.render(my_table=styler.render())

# Plot
ax = df.plot.bar()
fig = ax.get_figure()
fig.savefig('plot.svg')

# Write the HTML file
with open('report.html', 'w') as f:
    f.write(html)

The result is a nice looking HTML report that could also be printed as a PDF by
using something like WeasyPrint:
Note that for such an easy example, you wouldn’t necessarily need to use a Jinja
template. But when things start to become more complex, it’ll definitely come in very
handy.

xlwings
xlwings allows you to program and automate Excel with Python instead of VBA.
The difference from XlsxWriter or OpenPyXL is the following: XlsxWriter and
OpenPyXL write Excel files directly on disk. They work wherever Python works and
don't require an installation of Microsoft Excel.
xlwings, on the other hand, can write, read and edit Excel files via the Excel
application, i.e. a local installation of Microsoft Excel is required. xlwings also
allows you to create macros and user-defined functions in Python rather than in
VBA, but for reporting purposes, we won’t really need that.

While XlsxWriter/OpenPyXL are the best choice if you need to produce reports in
a scalable way on your Linux web server, xlwings does have the advantage that it
can edit pre-formatted Excel files without losing or destroying anything.
OpenPyXL on the other hand (the only writer library with xlsx editing capabilities)
will drop some formatting and sometimes leads to Excel raising errors during
further manual editing.

xlwings (Open Source)


Replicating the sample we had under Pandas is easy enough with the open-source
version of xlwings:

import xlwings as xw
import pandas as pd
import numpy as np

# Open a template file
wb = xw.Book('mytemplate.xlsx')

# Sample DataFrame
df = pd.DataFrame(np.random.randn(5, 4), columns=['one', 'two', 'three', 'four'],
                  index=['a', 'b', 'c', 'd', 'e'])

# Assign data to cells
wb.sheets[0]['A1'].value = 'My Report'
wb.sheets[0]['A3'].value = df

# Save under a new file name
wb.save('myreport.xlsx')

Running this will produce the following report:


So where does all the formatting come from? The formatting is done directly in
the Excel template before running the script. This means that instead of having to
program tens of lines of code to format a single cell with the proper font, colors
and borders, I can just make a few clicks in Excel. xlwings then merely opens the
template file and inserts the values.

This allows you to create a good-looking report in your corporate design very
quickly. The best part is that the Python developer doesn't necessarily have to do
the formatting but can leave it to the business user who owns the report.

Note that you could instruct xlwings to run the report in a separate and hidden
instance of Excel so it doesn’t interfere with your other work.
xlwings PRO
The Pandas + Excel as well as the xlwings (open source) sample both have a few
issues:

 If, for example, you insert a few rows below the title, you will have to adjust
the cell references accordingly in the Python code. Using named ranges
could help but they have other limitations (like the one mentioned at the
end of this list).
 The number of rows in the table might be dynamic. This leads to two issues:
(a) data rows might not be formatted consistently and (b) content below
the table might get overwritten if the table is too long.
 Placing the same value in a lot of different cells (e.g. a date in the source
note of every table or chart) will cause duplicated code or unnecessary
loops.

To fix these issues, xlwings PRO comes with a dedicated reports package:

 Separation of code and design: Users without coding skills can change the
template on their own without having to touch the Python code.
 Template variables: Python variables (between double curly braces) can be
directly used in cells , e.g. {{ title }}. They act as placeholders that will be
replaced by the values of the variables.
 Frames for dynamic tables: Frames are vertical containers that dynamically
align and style tables that have a variable number of rows. To see how
Frames work, have a look at the documentation.

You can get a free trial for xlwings PRO here. When using the xlwings PRO reports
package, your code simplifies to the following:

import pandas as pd
import numpy as np
from xlwings.pro.reports import create_report  # part of xlwings PRO

# Sample DataFrame
df = pd.DataFrame(np.random.randn(5, 4), columns=['one', 'two', 'three', 'four'],
                  index=['a', 'b', 'c', 'd', 'e'])

# Create the report by passing in all variables as kwargs
wb = create_report('mytemplate.xlsx',
                   'myreport.xlsx',
                   title='My Report',
                   df=df)

All that’s left is to create a template with the placeholders for title and df:


Running the script will produce the same report that we generated with the open
source version of xlwings above. The beauty of this approach is that there are no
hard coded cell references anymore in your Python code. This means that the
person who is responsible for the layout can move the placeholders around and
change the fonts and colors without having to bug the Python developer
anymore.

Plotly Dash
Required libraries: pandas, dash

Plotly is best known for their beautiful and open-source JavaScript charting


library which builds the core of Chart Studio, a platform for collaboratively
designing charts (no coding required).

To create a report though, we're using their latest product Plotly Dash, an
open-source framework that allows the creation of interactive web dashboards with
Python only (no need to write JavaScript code). Plotly Dash is also available
with an Enterprise plan.

How it works is best explained by looking at some code, adopted with minimal
changes from the official getting started guide:

import pandas as pd
import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output

# Sample DataFrame
df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/gapminderDataFiveYear.csv')

# Dash app - The CSS code is pulled in from an external file
app = dash.Dash(__name__,
                external_stylesheets=['https://codepen.io/chriddyp/pen/bWLwgP.css'])

# This defines the HTML layout
app.layout = html.Div([
    html.H1('My Report'),
    dcc.Graph(id='graph-with-slider'),
    dcc.Slider(
        id='year-slider',
        min=df['year'].min(),
        max=df['year'].max(),
        value=df['year'].min(),
        marks={str(year): str(year) for year in df['year'].unique()},
        step=None
    )
])

# This code runs every time the slider below the chart is changed
@app.callback(Output('graph-with-slider', 'figure'), [Input('year-slider', 'value')])
def update_figure(selected_year):
    filtered_df = df[df.year == selected_year]
    traces = []
    for i in filtered_df.continent.unique():
        df_by_continent = filtered_df[filtered_df['continent'] == i]
        traces.append(dict(
            x=df_by_continent['gdpPercap'],
            y=df_by_continent['lifeExp'],
            text=df_by_continent['country'],
            mode='markers',
            opacity=0.7,
            marker={'size': 15, 'line': {'width': 0.5, 'color': 'white'}},
            name=i
        ))

    return {
        'data': traces,
        'layout': dict(
            xaxis={'type': 'log', 'title': 'GDP Per Capita', 'range': [2.3, 4.8]},
            yaxis={'title': 'Life Expectancy', 'range': [20, 90]},
            margin={'l': 40, 'b': 40, 't': 10, 'r': 10},
            legend={'x': 0, 'y': 1},
            hovermode='closest',
            transition={'duration': 500},
        )
    }

if __name__ == '__main__':
    app.run_server(debug=True)

Running this script and navigating to http://localhost:8050 in your browser will
give you this dashboard:

The charts look great by default and it’s very easy to make your dashboard
interactive by writing simple callback functions in Python: You can choose the
year by clicking on the slider below the chart. In the background, every change to
our year-slider will trigger the update_figure callback function and hence update
the chart.
By arranging your documents properly, you could create an interactive web
dashboard that can also act as the source for your PDF factsheet; see for example
their financial factsheet demo together with its source code.

Alternatives to Plotly Dash


If you are looking for an alternative to Plotly Dash, make sure to check out Panel.
Panel was originally developed with the support of Anaconda Inc., and is now
maintained by Anaconda developers and community contributors. Unlike Plotly
Dash, Panel is very inclusive and supports a wide range of plotting libraries
including: Bokeh, Altair, Matplotlib and others (including also Plotly).

Datapane
Required libraries: datapane

Datapane is a framework for reporting which allows you to generate interactive


reports from pandas DataFrames, Python visualisations (such as Bokeh and
Altair), and Markdown. Unlike solutions such as Dash, Datapane allows you to
generate standalone reports which don’t require a running Python server—but it
doesn’t require any HTML coding either.

Using Datapane, you can either generate one-off reports, or deploy your Jupyter
Notebook or Python script so others can generate reports dynamically by entering
parameters through an automatically generated web app.

Datapane (open-source library)


Datapane’s open-source library allows you to create reports from components,
such as a Table component, a Plot component, etc. These components are
compatible with Python objects such as pandas DataFrames, and many
visualisation libraries, such as Altair:

import datapane as dp
import pandas as pd
import altair as alt

df = pd.read_csv('https://query1.finance.yahoo.com/v7/finance/download/GOOG?period2=1585222905&interval=1mo&events=history')

chart = alt.Chart(df).encode(
    x='Date:T',
    y='Open'
).mark_line().interactive()

r = dp.Report(dp.Table(df), dp.Plot(chart))
r.save(path='report.html')
This code renders a standalone HTML document with an interactive, searchable
table and plot component.

If you want to publish your report, you can login to Datapane (via $ datapane login) and use
the publish method, which will give you a URL such as this which you can share or embed.

r.publish(name='my_report')

Hosted Reporting Apps


Datapane can also be used to deploy Jupyter Notebooks and Python scripts so
that other people who are not familiar with Python can generate custom reports.
By adding a YAML file to your folder, you can specify input parameters as well as
dependencies (through pip, Docker, or local folders). Datapane also has support
for managing secret variables, such as database passwords, and for storing and
persisting files. Here is a sample script (stocks.py) and YAML file (stocks.yaml):

# stocks.py
import datapane as dp
import altair as alt
import yfinance as yf

dp.Params.load_defaults('./stocks.yaml')

tickers = dp.Params.get('tickers')
plot_type = dp.Params.get('plot_type')
period = dp.Params.get('period')
data = yf.download(tickers=' '.join(tickers), period=period, groupby='ticker').Close

df = data.reset_index().melt('Date', var_name='symbol', value_name='price')

base_chart = alt.Chart(df).encode(x='Date:T', y='price:Q', color='symbol').interactive()
chart = base_chart.mark_line() if plot_type == 'line' else base_chart.mark_bar()
dp.Report(dp.Plot(chart), dp.Table(df)).publish(name='stock_report',
                                                headline=f'Report on {" ".join(tickers)}')
# stocks.yaml
name: stock_analysis
script: stocks.py
# Script parameters
parameters:
  - name: tickers
    description: A list of tickers to plot
    type: list
    default: ['GOOG', 'MSFT', 'IBM']
  - name: period
    description: Time period to plot
    type: enum
    choices: ['1d','5d','1mo','3mo','6mo','1y','2y','5y','10y','ytd','max']
    default: '1mo'
  - name: plot_type
    type: enum
    default: line
    choices: ['bar', 'line']

# Python packages required for the script
requirements:
  - yfinance
Publishing this as a reporting app is as easy as running $ datapane script deploy.
For a full example see this example GitHub repository or read the docs.

ReportLab
Required libraries: pandas, reportlab

ReportLab writes PDF files directly. Most prominently, Wikipedia uses ReportLab
to generate its PDF exports. One of the key strengths of ReportLab is that it
builds PDF reports “at incredible speeds”, to cite their homepage. Let's have a
look at some sample code for both the open-source and the commercial version!

ReportLab OpenSource
In its most basic functionality, ReportLab uses a canvas where you can place
objects using a coordinate system:

from reportlab.pdfgen import canvas


c = canvas.Canvas("hello.pdf")
c.drawString(50, 800, "Hello World")
c.showPage()
c.save()
ReportLab also offers an advanced mode called PLATYPUS (Page Layout and
Typography Using Scripts), which is able to define dynamic layouts based on
templates at the document and page level. Within pages, Frames would then
arrange Flowables (e.g. text and pictures) dynamically according to their height.
Here is a very basic example of how you put PLATYPUS at work:

import pandas as pd
import numpy as np
from reportlab.pdfgen.canvas import Canvas
from reportlab.lib import colors
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib.units import inch
from reportlab.platypus import Paragraph, Frame, Table, Spacer, TableStyle

# Sample DataFrame
df = pd.DataFrame(np.random.randn(5, 4), columns=['one', 'two', 'three', 'four'],
                  index=['a', 'b', 'c', 'd', 'e'])

# Style Table
df = df.reset_index()
df = df.rename(columns={"index": ""})
data = [df.columns.to_list()] + df.values.tolist()
table = Table(data)
table.setStyle(TableStyle([
    ('INNERGRID', (0, 0), (-1, -1), 0.25, colors.black),
    ('BOX', (0, 0), (-1, -1), 0.25, colors.black)
]))

# Components that will be passed into a Frame
story = [Paragraph("My Report", getSampleStyleSheet()['Heading1']),
         Spacer(1, 20),
         table]

# Use a Frame to dynamically align the components and write the PDF file
c = Canvas('report.pdf')
f = Frame(inch, inch, 6 * inch, 9 * inch)
f.addFromList(story, c)
c.save()

Running this script will produce the following PDF:


ReportLab PLUS
In comparison to the open-source version of ReportLab, the most prominent
features of Reportlab PLUS are

 a templating language
 the ability to include vector graphics

The templating language is called RML (Report Markup Language), an XML dialect.
Here is a sample of what it looks like, taken directly from the official
documentation:

<document filename="example.pdf">
  <template>
    <pageTemplate id="main">
      <frame id="first" x1="72" y1="72" width="451" height="698" />
    </pageTemplate>
  </template>
  <stylesheet>
  </stylesheet>
  <story>
    <para>
      This is the "story". This is the part of the RML document where
      your text is placed.
    </para>
    <para>
      It should be enclosed in "para" and "/para" tags to turn it into
      paragraphs.
    </para>
  </story>
</document>
The idea here is that you can have any program produce such an RML document,
not just Python, which can then be transformed into a PDF document by
ReportLab PLUS.
VARIOUS LEARNING ALGORITHMS

Simply put, Machine Learning (ML) is the process of employing algorithms to help
computer systems progressively improve their performance on some specific
task. Software-based ML can be traced back to the 1950s, but the number and
ubiquity of ML algorithms has exploded since the early 2000s, partly due to the
rising popularity of the Python programming language, which continues to drive
advances in ML.
The reigning ML library is arguably Python's scikit-learn package, which
offers simple and easy syntax paired with a treasure trove of algorithms.
Decision Tree Algorithm

The Decision Tree algorithm is widely applicable to most scenarios, and can be
surprisingly effective for such a simple algorithm. It requires minimal data
preparation, and can even work with blank values. This algorithm focuses on
learning simple decision rules inferred from the data. It then compiles them into a
set of “if-then-else” decision rules:
from sklearn.tree import DecisionTreeClassifier
from base import Base

Xtrain, Xtest, ytrain, ytest = Base.clean_and_split()

model = DecisionTreeClassifier()
model.fit(Xtrain, ytrain)
ypred = model.predict(Xtest)

print("\n\nDecision Tree Accuracy Score:", Base.accuracy_score(ytest, ypred), "%")
And here’s the result:
[Running] python -u
"/top-10-machine-learning-algorithms-sklearn/decision_tree.py"
Decision Tree Accuracy Score: 75.0 %

[Done] exited with code=0 in 1.248 seconds


Now, let’s chart a visualization of the tree itself. It’s quite easy to do since sklearn
provides export_graphviz as part of its tree module:
from sklearn.tree import export_graphviz

dot_file = 'visualizations/decision_tree.dot'
export_graphviz(model, out_file=dot_file, feature_names=Xtrain.columns.values)


Now that you have the file, just convert it into a PNG image:
dot -Tpng decision_tree.dot -o decision_tree.png

Random Forest Classifier Algorithm

If you think one decision tree is great, imagine what a forest of them could do!
That’s essentially what a Random Forest Classifier does. The classification starts
off by using multiple trees with slightly different training data. The predictions
from all of the trees are then averaged out, resulting in better performance than
any single tree in the model.
Random Forests can be used to solve classification or regression problems. Let’s
have a look at how it solves the following classification problem:
from sklearn.ensemble import RandomForestClassifier
from base import Base

Xtrain, Xtest, ytrain, ytest = Base.clean_and_split()

model = RandomForestClassifier()
model.fit(Xtrain, ytrain)
ypred = model.predict(Xtest)

print("\n\nRandom Forest Classifier Accuracy Score:", Base.accuracy_score(ytest, ypred), "%")
And now for the result:
[Running] python -u
"/top-10-machine-learning-algorithms-sklearn/random_forest.py"

Random Forest Classifier Accuracy Score: 81.0 %

[Done] exited with code=0 in 1.106 seconds

K-Nearest Neighbor Algorithm

The k-nearest neighbor (KNN) algorithm is a simple and efficient algorithm that
can be used to solve both classification and regression problems. If you know the
saying, “birds of a feather flock together” you have the essence of KNN in a
nutshell. It assumes that similar “things” exist in close proximity to each other.
Although you need to perform a certain amount of data cleansing before applying
the algorithm, the benefits outweigh the burdens. Let’s have a look:

from sklearn.neighbors import KNeighborsClassifier
from base import Base

Xtrain, Xtest, ytrain, ytest = Base.clean_and_split()

model = KNeighborsClassifier(n_neighbors=7)
model.fit(Xtrain, ytrain)
ypred = model.predict(Xtest)

print("\n\nK-Nearest Neighbor Accuracy Score:", Base.accuracy_score(ytest, ypred), "%")

As you can see, you can tune the hyperparameters to achieve the highest
accuracy. When the above code is executed with just 1 neighbor, the
accuracy rate falls to 70%.
[Running] python -u "/top-10-machine-learning-algorithms-sklearn/knn.py"
K-Nearest Neighbor Accuracy Score: 74.0 %

[Done] exited with code=0 in 0.775 seconds

Now let’s visualize it. This part is a little tricky since we will need to reduce the
model’s dimensions to be able to visualize the result on a scatter plot. You
may want to read more about Principal Component Analysis (PCA), but for
the purposes of this article, all you need to know is that PCA is used to
reduce dimensionality while preserving the meaning of the data.

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Transforming n features into 2 principal components
pca = PCA(n_components=2).fit(X)
pca_2d = pca.transform(X)

for i in range(0, pca_2d.shape[0]):
    if y[i] == 1:
        c1 = plt.scatter(pca_2d[i, 0], pca_2d[i, 1], c='g', marker='o')
    elif y[i] == 0:
        c2 = plt.scatter(pca_2d[i, 0], pca_2d[i, 1], c='r', marker='+')

plt.legend([c1, c2], ['Survived', 'Deceased'])
plt.title('Titanic Survivors')
plt.savefig('visualizations/knn.png')

Bagging Classifier Algorithm


Before we look into Bagging Classifiers, we must understand ensemble learning.
In the previous example of Random Forests and Decision Trees, we learned that
the former is an averaging of the results of the latter. 
This is essentially what bagging is: it’s a paradigm in which multiple “weak”
learners are trained in parallel to solve the same problem, and then combined to
get better results.
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from base import Base

Xtrain, Xtest, ytrain, ytest = Base.clean_and_split()

model = BaggingClassifier(
    base_estimator=SVC(),
    n_estimators=10,
    random_state=0
)
model.fit(Xtrain, ytrain)
ypred = model.predict(Xtest)

print("\n\nBagging Accuracy Score:", Base.accuracy_score(ytest, ypred), "%")

Here a Support Vector Classifier (SVC) is used as the base estimator: the
BaggingClassifier trains ten SVC models, each on a random subset of the training
data, and aggregates their predictions, so that no single subset of the data
dominates the result.

Let’s see what happens:

[Running] python -u "/top-10-machine-learning-algorithms-sklearn/bagging.py"

Bagging Accuracy Score: 72.0 %

[Done] exited with code=0 in 1.302 seconds

Boosting Classifier Algorithm

Boosting is very similar to bagging in the sense that it averages out the results of
multiple weak learners. However, in the case of boosting, these learners are
executed in a sequential manner such that the latest model depends on the
previous one. This leads to lower bias, meaning that it can handle a larger
variance of data.
from xgboost.sklearn import XGBClassifier
from base import Base

Xtrain, Xtest, ytrain, ytest = Base.clean_and_split()

model = XGBClassifier()
model.fit(Xtrain, ytrain)
ypred = model.predict(Xtest)

print("\n\nXG Boost Classifier Accuracy Score:", Base.accuracy_score(ytest, ypred), "%")

And here is the result:

[Running] python -u "/top-10-machine-learning-algorithms-sklearn/xgboost.py"

XG Boost Classifier Accuracy Score: 77.0 %

[Done] exited with code=0 in 1.302 seconds


Naive Bayes Algorithm

It’s time to remember your high school course in probability. The Naive Bayes
algorithm determines the probability of each feature set and uses that to
determine the probability of the classification itself.
Here’s a fantastic example from Naive Bayes for Dummies:
“A fruit may be considered to be an apple if it is red, round, and about 3″ in
diameter. A Naive Bayes classifier considers each of these “features” (red, round,
3” in diameter) to contribute independently to the probability that the fruit is an
apple, regardless of any correlations between features.”
Now let’s look at the code:
from sklearn.naive_bayes import GaussianNB
from base import Base

Xtrain, Xtest, ytrain, ytest = Base.clean_and_split()

model = GaussianNB()
model.fit(Xtrain, ytrain)
ypred = model.predict(Xtest)

print("\n\nNaive Bayes Accuracy Score:", Base.accuracy_score(ytest, ypred), "%")

And the result is pretty good:

[Running] python -u
"/top-10-machine-learning-algorithms-sklearn/naive_bayes.py"

Naive Bayes Accuracy Score: 78.0 %

[Done] exited with code=0 in 1.182 seconds


What the accuracy score doesn’t show is the probability of false positives and
false negatives. We can only capture “how right we are.” However, it is more
important to know the extent of “how wrong we are” in many scenarios
(including life!).
We can plot this using a confusion matrix. This matrix shows the distribution of
true positives, true negatives, false positives, and false negatives :
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

mat = confusion_matrix(ytest, ypred)

sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('true label')
plt.ylabel('predicted label')
plt.savefig("visualizations/naive_bayes_confusion_matrix.png")

Now, we can see that the possibility of false positives is higher than false
negatives.

Support Vector Machines

Support Vector Machines (SVMs) are robust, non-probabilistic models that can be
used to predict both classification and regression problems. SVMs maximize space
to widen the gap between categories and increase accuracy.
Let’s have a look:
from sklearn import svm
from base import Base

Xtrain, Xtest, ytrain, ytest = Base.clean_and_split()

model = svm.LinearSVC(random_state=800)
model.fit(Xtrain, ytrain)
ypred = model.predict(Xtest)

print("\n\nSVM Accuracy Score:", Base.accuracy_score(ytest, ypred), "%")

As you can see, SVMs are generally more accurate than other methods. If even
better data cleansing methods are applied, we can aim to reach higher accuracy.

[Running] python -u "/top-10-machine-learning-algorithms-sklearn/svm.py"

SVM Accuracy Score: 79.0 %

[Done] exited with code=0 in 1.379 seconds

Now, instead of visualizing model data, let's look at model performance. We
can look at the classification report using scikit-learn's metrics module:

             precision    recall  f1-score   support


          0       0.85      0.82      0.83       144

          1       0.69      0.73      0.71        79

   accuracy                           0.79       223

  macro avg       0.77      0.78      0.77       223

weighted avg       0.79      0.79      0.79       223

Stochastic Gradient Descent Classification

Stochastic Gradient Descent (SGD) is popular in the neural network world, where
it’s used to optimize the cost function. However, we can also use it to classify
data. SGD is great for scenarios in which you have a large dataset with a very large
feature set. It can help to reduce the complexities involved in learning from highly
variable data.
from sklearn.linear_model import SGDClassifier
from base import Base

Xtrain, Xtest, ytrain, ytest = Base.clean_and_split()

model = SGDClassifier()
model.fit(Xtrain, ytrain)
ypred = model.predict(Xtest)

print("\n\nStochastic Gradient Descent Classifier Accuracy Score:", Base.accuracy_score(ytest, ypred), "%")

Let’s look at how accurate it is:

[Running] python -u "/top-10-machine-learning-algorithms-sklearn/stochastic_gradient_descent_classifier.py"

Stochastic Gradient Descent Classifier Accuracy Score: 76.0 %

[Done] exited with code=0 in 0.779 seconds

Logistic Regression

This is a very basic model that still delivers decent results. It’s a statistical model
that uses logistic (sigmoid) functions to accurately predict data.
Before you use this model, you need to ensure that your training data is clean and
has less noise. Significant variance could lower the accuracy of the model.
from sklearn.linear_model import LogisticRegression
from base import Base

Xtrain, Xtest, ytrain, ytest = Base.clean_and_split()

model = LogisticRegression()
model.fit(Xtrain, ytrain)
ypred = model.predict(Xtest)

print("\n\nLogistic Regression Accuracy Score:", Base.accuracy_score(ytest, ypred), "%")

=> Logistic Regression Accuracy Score: 79.0 %

Now, let’s plot the Receiver Operating Characteristics (ROC) curve for this model.
This curve helps us visualize accuracy by plotting the true positive rate on the Y
axis and the false positive rate on the X axis. The “larger” the area under the
curve, the more accurate the model.
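
A minimal sketch of how such an ROC curve could be plotted with scikit-learn, assuming the fitted LogisticRegression model and the Xtest/ytest split from the snippet above:

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Probability of the positive class for each test sample.
yscore = model.predict_proba(Xtest)[:, 1]

fpr, tpr, _ = roc_curve(ytest, yscore)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label='ROC curve (AUC = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], linestyle='--', label='Random guess')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.savefig('visualizations/logistic_regression_roc.png')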
Voting Classifier

Voting Classifier is another ensemble method where instead of using the same
type of “weak” learners, we choose very different models. The idea is to combine
conceptually different ML algorithms and use a majority vote to predict the class
labels. This is useful for a set of equally well-performing models since it can
balance out individual weaknesses.
For this ensemble, I will combine a Logistic Regression model, a Naive Bayes
model, and a Random Forest model:
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from base import Base

df = Base.clean()

X = df.drop(['Survived'], axis=1)
y = df['Survived']

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

model_1 = LogisticRegression()
model_1.fit(Xtrain, ytrain)
ypred = model_1.predict(Xtest)

model_2 = GaussianNB()
model_2.fit(Xtrain, ytrain)
ypred = model_2.predict(Xtest)

model_3 = RandomForestClassifier()
model_3.fit(Xtrain, ytrain)
ypred = model_3.predict(Xtest)

# The estimator labels now match the models they refer to (lr, gnb, rf).
eclf = VotingClassifier(
    estimators=[('lr', model_1), ('gnb', model_2), ('rf', model_3)],
    voting='hard'
)

for clf, label in zip([model_1, model_2, model_3, eclf],
                      ['Logistic Regression', 'Naive Bayes', 'Random Forest', 'Ensemble']):
    scores = cross_val_score(clf, X, y, scoring='accuracy', cv=5)
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
Let’s see how it performs:
[Running] python -u "/top-10-machine-learning-algorithms-sklearn/voting.py"

Accuracy: 0.79 (+/- 0.02) [Logistic Regression]

Accuracy: 0.78 (+/- 0.03) [Naive Bayes]

Accuracy: 0.80 (+/- 0.03) [Random Forest]

Accuracy: 0.80 (+/- 0.02) [Ensemble]


[Done] exited with code=0 in 2.413 seconds
As you can see, the Voting Ensemble learned from all three models and
outperformed them all.
DATA COLLECTION

Before analyzing and visualizing, we need the raw data, and this raw data can be
gathered from different open-source data websites available on the internet.
This data will be in raw form, so it may contain null values or records that are
not useful for our analysis. We need COVID-19 datasets for different countries
containing the number of deaths, recoveries, vaccinations, etc.

Some of the websites that we used to get Covid-19 dataset are as follows:

 https://www.kaggle.com

Kaggle allows users to find and publish data sets, explore and build models
in a web-based data-science environment, work with other data scientists
and machine learning engineers, and enter competitions to solve data
science challenges. Kaggle provides COVID-19 datasets which we can
download from their website in .csv format.

 https://www.who.int/

The WHO coronavirus (COVID-19) dashboard presents official daily counts
of COVID-19 cases, deaths and vaccine utilisation reported by countries,
territories and areas. Through this dashboard, WHO aims to provide a
frequently updated data visualization, data dissemination and data
exploration resource, while linking users to other useful and informative
resources.

 https://ourworldindata.org/

Our World in Data (OWID) is a scientific online publication that focuses on
global problems such as poverty, disease, hunger, climate change, war,
existential risks, and inequality.
OWID provides a COVID-19 dataset that others can use for data analysis. The
data is hosted on GitHub, from where we can download it in .csv format to our
system.
 https://data.humdata.org/

The Humanitarian Data Exchange (data.humdata.org) is an online platform where
different organisations upload their datasets or research data for everyone to
use. This website contains a large number of datasets for the COVID-19 pandemic
from different sources, free of charge, in .csv format.
METHODOLOGY

After data collection, we have to work on that data in order to analyse and
visualise it in a more user-friendly way. We can't use this raw data directly
for analysis because it contains null values which have to be processed and
removed from the dataset in order to get accurate results.
To do so, we need to apply various methods to this dataset, but first of all we
need to import the libraries required for analysis and visualization.

• DATA VISUALIZATION AND ANALYSIS

We will be analyzing the data with the help of some questions. The figure below
shows the data sheet in Excel, which gives a hint of how the data is available
to us.

Data Sets in Excel Sheet

 Libraries Required
Libraries Used

After the libraries are imported, we have to remove the null values shown in
the figure.

Null Values

To remove the null values, we use an inbuilt function called isnull(). The
isnull() method returns a DataFrame object where all the values are replaced
with a Boolean value: True for NULL values and False otherwise. Since sum()
counts True as 1 and False as 0, you can count the number of missing values
in each row and column by calling sum() on the result of isnull().
Seaborn and Matplotlib are the two libraries used to plot the data in various
forms: we can plot graphs from the data or create heatmaps, for example to show
the null values present in the dataset, which helps to visualise the data
rather than just reading the raw digits.
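
A minimal sketch of this null-value check and cleanup (the file name is a placeholder for the dataset actually used):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('covid_19_data.csv')

# isnull() marks missing values as True; sum() counts the Trues per column.
print(df.isnull().sum())

# Visualise the missing values as a heatmap instead of reading raw digits.
sns.heatmap(df.isnull(), cbar=False)
plt.title('Missing values in the dataset')
plt.show()

# Drop the rows containing null values before further analysis.
df = df.dropna()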
Heatmap of missing values

Plotting the graph: confirmed cases in different regions


DATA ANALYSIS

Data Analysis is the process of systematically applying statistical and/or logical


techniques to describe and illustrate, condense and recap, and evaluate data.
According to Shamoo and Resnik (2003) various analytic procedures “provide a
way of drawing inductive inferences from data and distinguishing the signal (the
phenomenon of interest) from the noise (statistical fluctuations) present in the
data”.
Using data analysis, we can target any query and give an explanation for that
particular query. For example, in our dataset, if we want to find where the
maximum number of confirmed cases was recorded, it can be done as in the sketch
below.

US with maximum number of confirmed cases
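
A minimal sketch of such a query in pandas (the column names are assumptions based on the dataset description):

# Row with the maximum number of confirmed cases.
max_row = df.loc[df['Confirmed'].idxmax()]
print(max_row['Country/Region'], max_row['Confirmed'])

# Or the top five countries by confirmed cases.
print(df.sort_values('Confirmed', ascending=False).head())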

Now that particular data can be fetched from the datasets, it can be used for
analysis or to answer any problem or query. As we can see, however, the raw
output is not visually appealing: the analysed information and graphs are not
well managed or properly structured, and it should be our priority to build the
project with a structured and good interface. In data analytics, visualisation
is very important. After the data is analysed properly, it is time to visualise
the results through a more user-friendly interface, which will make our project
more effective and interactive. In order to do so, we use the Power BI software
tool for visualisation.

POWER BI
Microsoft Power BI is a business intelligence platform that provides non-technical
business users with tools for aggregating, analysing, visualizing and sharing data.
Power BI's user interface is fairly intuitive for users familiar with Excel, and its
deep integration with other Microsoft products makes it a very versatile self-service
tool that requires little upfront training.

Power BI is a powerful tool used to give insight into the data in a more organised
and visually appealing way. Once the backend part is done, it is time to build the
frontend.

RESULT
The COVID-19 Data Analysis and Visualisation project provides an insight into how
badly COVID-19 affected different parts of different countries. It covers various
sections: it shows the number of confirmed cases, the number of people recovered
from COVID-19 and the number of deaths that have occurred to date, and it presents
the data in graphical form and in the form of pie charts. You can select a country
from the listed countries and get the data for that country, as shown in the
figures below.

Sum of Deaths, Recovered, Confirmed and Active cases all over the world

User Interface of Covid-19 Data Analysis and visualization.


CONCLUSION

The COVID-19 pandemic is a huge struggle for all of us. The project we are
making seeks to answer the most pertinent questions: what makes COVID-19 such
a tragedy, and which people are affected by it the most. It seeks to find the
appropriate response which can be mounted by the authorities concerned, so that
we can reach a place of proper discussion about the problem and solve it in the
best possible manner. It can also lead to a solution for any medical condition
we might encounter later in our lives where data science can be applied to
medical diagnostics. This project saves the already limited resources that
India has and helps prevent the spread, as people can use it to get an idea of
whether they should go and get tested. It also helps unhealthy and infected
people to isolate themselves. Using this system we can effectively and
efficiently mitigate the burden on our healthcare system, which is completely
stressed out.
BIBLIOGRAPHY

1. https://en.wikipedia.org/wiki/Choropleth_map
2. https://github.com/CSSEGISandData/COVID-19
3. https://www.who.int/
4. https://ourworldindata.org/
5. https://data.humdata.org/
