
lOMoARcPSD|15772120

Sample report on Data Science course on Coursera

btech (Guru Gobind Singh Indraprastha University)

StuDocu is not sponsored or endorsed by any college or university


Downloaded by Kapil? (kapil.kukreja07@gmail.com)

TRAINING REPORT ON: DATA SCIENCE

A report submitted in partial fulfilment of the requirements for the award of the degree of

Bachelor of Technology
in
ELECTRONICS AND COMMUNICATION ENGINEERING

Submitted by
LAKSHYA KHARAYAT
(01296302818)

Maharaja Surajmal Institute of Technology

C-4 Janak Puri, New Delhi-110058

Affiliated to GGSIPU | NAAC Accredited 'A' Grade | ISO 9001:2015 Certified

Approved by AICTE


Certificate


STUDENT DECLARATION

I, Lakshya Kharayat (Enrolment Number: 01296302818) of Maharaja Surajmal Institute of
Technology, hereby declare that this report, based on the training undergone online through
Coursera during the summer break of 2020 as part of my Bachelor of Technology (Electronics
and Communication) degree (2018-2022), is an original work. In case the project report, or any
part of it, is found to be copied or quoted without reference, I shall be held solely accountable
for the repercussions arising thereof.

Lakshya Kharayat
01296302818


CONTENTS

Certificate
Declaration
List of Figures
Abstract
Acknowledgement
Company Profile

Chapters
1. Data Science with Python
1.1 Introduction
1.2 Pandas
1.3 Jupyter Notebook

2. Data Structures in Pandas
2.1 Series Data Structure
2.2 Data Frame Loading and Indexing
2.3 Querying and Indexing a DataFrame

3. Advanced Python Idioms
3.1 Merging DataFrames and Pandas Idioms
3.2 Group by and Scales
3.3 Date Functionality

4. Project: COVID-19 Data Analysis
4.1 Problem Statement
4.2 About the DataSet
4.3 Processing the DataSet
4.4 Flow Diagram
4.5 Source Codes, Outputs And Analysis

5. Bibliography


LIST OF FIGURES

1.1 Pandas Trend

1.2 Jupyter Notebook Logo

2.1 Series Data Structure

2.2 Data Frame Indexing and Loading

2.3 Boolean Mask

3.1 Venn Diagram

3.2 Merging DataFrames

3.3 Date Functionality

4.1 Flow Diagram

4.2 Importing Modules

4.3 Loading Dataset

4.4 Generating Global Spread


ABSTRACT

Every company will say it is doing some form of data science, but what exactly does that
mean? The field is growing so rapidly, and revolutionizing so many industries, that it is
difficult to fence in its capabilities with a formal definition. Generally, though, data science is
devoted to the extraction of clean information from raw data for the formulation of actionable
insights. Digital data, commonly referred to as the "oil of the 21st century", is the field's most
important resource. It has incalculable benefits in business, research, and our everyday
lives. Your route to work, your most recent Google search for the nearest coffee shop, your
Instagram post about what you ate, and even the health data from your fitness tracker are all
important to different data scientists in different ways. Sifting through massive lakes of data,
looking for connections and patterns, data science is responsible for bringing us new products,
delivering breakthrough insights, and making our lives more convenient.


ACKNOWLEDGEMENT

First and foremost, I am grateful to Mr. Pradeep Sangwan (HoD, Department of Electronics and
Communication Engineering, Maharaja Surajmal Institute of Technology) for allowing me to
undergo this fruitful four-week training on Coursera, and to Coursera for providing it.

I am deeply indebted to the numerous instructors on Coursera who provided valuable online
material and lectures with detailed descriptions that let me understand the subject from a very
basic level, and whose guidance and platform helped me complete the training with ease.
Their resources helped me gain valuable knowledge of Data Science, ML, and Python, making
me a better fit for the industry than I was before.

Lastly, I would like to express my gratitude to my peers, who provided valuable insights and
collaborated on the project, helping me improve every step of the way.

SUBMITTED BY:
Lakshya Kharayat
01296302818


COMPANY PROFILE

Our Story
Coursera was founded by Daphne Koller and Andrew Ng with a vision of providing
life-transforming learning experiences to anyone, anywhere. It is now a leading online learning
platform for higher education, where 73 million learners from around the world come to learn
skills of the future. More than 200 of the world's top universities and industry educators
partner with Coursera to offer courses, Specializations, certificates, and degree programs.
Thousands of companies trust the company’s enterprise platform Coursera for Business to
transform their talent. Coursera for Government equips government employees and citizens
with in-demand skills to build a competitive workforce. Coursera for Campus empowers any
university to offer high-quality, job-relevant online education to students, alumni, faculty, and
staff. Coursera is backed by leading investors that include Kleiner Perkins, New Enterprise
Associates, Learn Capital, and SEEK Group.

Courses

Learn something new

Every course on Coursera is taught by top instructors from world-class universities and
companies, so you can learn something new anytime, anywhere. Hundreds of free
courses give you access to on-demand video lectures, homework exercises, and
community discussion forums. Paid courses provide additional quizzes and projects as
well as a shareable Course Certificate upon completion.

• 100% online

• Learn something new in 4-6 weeks

• Priced starting at $39 (USD)

• Earn a Course Certificate

Guided Projects


Gain a job-relevant skill in under 2 hours

Enrol in Guided Projects to learn job-relevant skills and industry tools in under 2 hours.
Guided Projects are self-paced, require a smaller time commitment, and provide practice
using tools in real-world scenarios, so you can build the job skills you need, right when
you need them.

• 100% online with no setup required

• Interactive learning experience with step-by-step, visual instruction from
subject-matter experts

• Priced starting at $9.99 (USD)

• Earn a Guided Project certificate

Specializations

Master a skill

If you want to master a specific career skill, consider enrolling in a Specialization. You
will complete a series of rigorous courses at your own pace, tackle hands-on projects
based on real business challenges, and earn a Specialization Certificate to share with
your professional network and potential employers.

• 100% online

• Master a skill in 4-6 months

• Priced starting at $39 (USD) per month

• Earn a Specialization Certificate

Professional Certificates


Get job-ready for an in-demand career

Whether you are looking to start a new career or change your current one, Professional
Certificates on Coursera help you become job ready. Learn at your own pace from top
companies and universities, apply your new skills to hands-on projects that showcase
your expertise to potential employers, unlock access to career support resources, and
earn a career credential to kickstart your new career.

• 100% online

• Get job-ready in less than a year with hands-on projects

• Priced starting at $39 per month (USD)

• Earn a shareable Certificate

Online Degrees

Top degrees that fit your life

Transform your career with a degree online from a world-class university on Coursera.
Our modular degree learning experience gives you the ability to study on your own
schedule and earn credit as you complete your course assignments. For a breakthrough
price, you will learn from top university instructors and graduate with an industry-
relevant university credential.

• Flexible online learning, with open degree courses you can start today

• Build your own schedule over 1-4 years of study

• All-in pricing starting at around $9,000 (USD) with the option to pay in
instalments

• Earn an accredited university bachelor's or master's degree


Chapter 1

Data Science with Python

1.1 Introduction

Data science is all about using data to solve problems. The problem could be a decision, such
as identifying which email is spam and which is not; a product recommendation, such as which
movie to watch; or a prediction, such as who will be the next President of the USA. So, the
core job of a data scientist is to understand the data, extract useful information from it, and
apply this to solving problems.

The history of data science goes back a little further than 2004, which is where the Google
search term history begins. But this, at least, gives a sense of how popular the area is now. I
think the interest in the area comes from the networked, data-driven society we find ourselves
living in. When people think of the term data scientist, they tend to think of Google or Amazon
or Facebook, places with big artificial intelligence research teams, and certainly these are
amazing companies doing great things with data science.

Interest in data science is at an all-time high and has exploded in popularity in the last couple
of years. A fun way to see this is to visit the Google Trends website, which shows search
keyword information over time. There we can see that the term 'data science' is massively
popular across the globe.

But data science is not limited to careers with tech companies. Almost every company is
turning to data science to better understand how to build products, serve customers, and
leverage new opportunities. And companies are not the only ones.


1.2 Pandas

Pandas is an open-source Python library providing high-performance data manipulation and
analysis tools built on its powerful data structures. The name Pandas is derived from "panel
data", an econometrics term for multidimensional data.

Google Trends also shows that related queries include topics like Python. For instance, here is
the trend for python pandas:

Fig 1.1 Pandas Trend

Key Features of Pandas

• Fast and efficient Data Frame object with default and customized indexing.
• Tools for loading data into in-memory data objects from different file formats.
• Data alignment and integrated handling of missing data.
• Reshaping and pivoting of data sets.
• Label-based slicing, indexing and subsetting of large data sets.
• Columns from a data structure can be deleted or inserted.
• Group by data for aggregation and transformations.
• High performance merging and joining of data.
• Time Series functionality.


Prior to Pandas, Python was mainly used for data munging and preparation and contributed
little to data analysis itself. Pandas solved this problem. Using Pandas, we can accomplish five
typical steps in the processing and analysis of data, regardless of its origin: load, prepare,
manipulate, model, and analyse.

Python with Pandas is used in a wide range of academic and commercial domains, including
finance, economics, statistics, and analytics.

1.3 Jupyter Notebook

The Jupyter Notebook is an open source web application that you can use to create and share

documents that contain live code, equations, visualizations, and text. Jupyter Notebook is

maintained by the people at Project Jupyter.

Jupyter Notebook is a spin-off of the IPython project, which used to have an IPython Notebook
project of its own. The name Jupyter comes from the core programming languages that it
supports: Julia, Python, and R. Jupyter ships with the IPython kernel, which allows you to write
your programs in Python, but there are currently over 100 other kernels that you can also use.

Fig 1.2 Jupyter Notebook Logo


Chapter 2

Data Structures in Pandas

2.1 Series Data Structure

The Series is one of the core data structures in pandas. You can think of it as a cross between a
list and a dictionary. The items are all stored in order, and there are labels with which you can
retrieve them. An easy way to visualize this is as two columns of data. The first is the special
index, a lot like the keys of a dictionary, while the second is your actual data. It is important to
note that the data column has a label of its own, which can be retrieved using the .name
attribute. This is different from dictionaries and is useful when it comes to merging multiple
columns of data.

You can create a series by passing in a list of values. When you do this, pandas automatically
assigns an index starting with zero and sets the name of the series to None. Let us see an
example of this:

Fig 2.1 Series Data Structure

We do not have to use strings. If we pass in a list of whole numbers, for instance, we see that
pandas sets the type to int64. Underneath, pandas stores series values in a typed array using the
NumPy library. This offers a significant speed-up when processing data versus traditional
Python lists. Underneath, pandas also does some type conversion.

If we create a list of strings and one element is None, pandas inserts it as None and uses the
type object for the underlying array. If we create a list of numbers, integers or floats, and put in
None, pandas automatically converts it to a special floating-point value designated as NaN,
which stands for "not a number".
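The behaviour described above can be checked with a few lines. This is a minimal sketch with made-up values, not the code from Fig 2.1; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# A list of strings: pandas assigns the labels 0, 1, 2 and leaves the name unset.
animals = pd.Series(["Tiger", "Bear", "Moose"])
print(animals.index.tolist())
print(animals.name)

# A list of whole numbers is stored in a typed integer array.
numbers = pd.Series([1, 2, 3])
print(numbers.dtype)

# A None among numbers forces a conversion to floating point, with NaN in its place.
with_gap = pd.Series([1, 2, None])
print(with_gap.dtype)
print(np.isnan(with_gap.iloc[2]))
```

Note that NaN is not equal to itself, so a test such as `np.isnan` or `pd.isna` is needed to detect it rather than `==`.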

2.2 Data Frame Loading and Indexing

The common workflow is to read your data into a DataFrame and then reduce this DataFrame
to the columns or rows that you are interested in working with. The pandas toolkit tries to give
you views on a DataFrame. This is much faster than copying data and much more memory
efficient too.

But it does mean that if you are manipulating the data, you have to be aware that any changes
to the DataFrame you are working on may have an impact on the base DataFrame you used
originally. Here is an example using a DataFrame:

Fig 2.2 Data Frame Indexing and Loading

We can create a series based on just the cost category using the square brackets. Then we can
increase the cost in this series using broadcasting. Now if we look at our original DataFrame, we
see those costs have risen as well. This is an important consideration to watch out for. If you
want to explicitly use a copy, then you should call the copy method on the DataFrame first.
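A minimal sketch of the safe pattern, using a hypothetical purchases table rather than the one in Fig 2.2. Whether an in-place change to a selected column propagates back to the original depends on the pandas version and its copy-on-write setting, so the sketch only demonstrates the explicit copy, which behaves the same everywhere.

```python
import pandas as pd

# Hypothetical purchase records; the store labels and prices are made up.
purchases = pd.DataFrame(
    {"Item": ["Dog Food", "Kitty Litter", "Bird Seed"], "Cost": [22.5, 2.5, 5.0]},
    index=["Store 1", "Store 1", "Store 2"],
)

# Take an explicit copy of the column, then broadcast an increase over it.
costs = purchases["Cost"].copy()
costs += 2

print(costs.tolist())              # the copy carries the increase
print(purchases["Cost"].tolist())  # the original DataFrame is untouched
```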

Pandas has built-in support for delimited files such as CSV files as well as a variety of other

data formats including relational databases, Excel, and HTML tables.

We can read a file into a DataFrame by calling the read_csv function of the module. When we
look at the DataFrame, we see that the first cell has a NaN in it since it is an empty value, and
the rows have been automatically indexed for us. It seems clear that the first row of data in the
DataFrame is what we really want to see as the column names.

read_csv has several parameters that we can use to indicate to pandas how rows and columns
should be labelled. For instance, we can use the index_col parameter to indicate which column
should be the index, and we can use the header parameter to indicate which row from the data
file should be used as the header. Pandas stores a list of all the columns in the columns
attribute. We can change the column names by iterating over this list and calling the rename
method of the DataFrame.
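The sketch below shows those two parameters and the rename step on a small in-memory file. The file contents, the column labels and the new names are all hypothetical; an `io.StringIO` object stands in for a CSV file on disk.

```python
import io
import pandas as pd

# Hypothetical file: the real column names sit on the second line (row 1).
raw = io.StringIO(
    ",Summer,Summer,Summer\n"
    "Country,01 !,02 !,03 !\n"
    "Afghanistan,0,0,2\n"
    "Algeria,5,2,8\n"
)

# header=1 takes the column names from row 1; index_col=0 makes the first column the index.
medals = pd.read_csv(raw, index_col=0, header=1)

# Build a mapping from the awkward labels to readable ones, then rename in one call.
new_names = {}
for col, name in zip(medals.columns, ["Gold", "Silver", "Bronze"]):
    new_names[col] = name
medals = medals.rename(columns=new_names)

print(medals)
```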

2.3 Querying and Indexing a DataFrame

Before we talk about how to query data frames, we need to talk about Boolean masking.
Boolean masking is the heart of fast and efficient querying in NumPy. It is somewhat
analogous to masking used in other computational areas. A Boolean mask is an array, which
can be of one dimension like a series or two dimensions like a data frame, where each of the
values in the array is either true or false. This array is essentially overlaid on top of the data
structure that we are querying. Any cell aligned with a true value will be admitted into our
result, and any cell aligned with a false value will not.

Boolean masking is powerful conceptually and is the cornerstone of efficient NumPy and
pandas querying. This technique is well used in other areas of computer science, for instance,
in graphics. But it does not really have an analogue in traditional relational databases, so I
think it is worth pointing out here. Boolean masks are created by applying operators directly to
pandas Series or DataFrame objects.

Fig 2.3 Boolean Mask

What we want to do next is overlay that mask on the data frame. We can do this using the
where function. The where function takes a Boolean mask as a condition, applies it to the data
frame or series, and returns a new data frame or series of the same shape. One more thing to
keep in mind if you are not used to Boolean or bit masking for data reduction: the output of
two Boolean masks being compared with logical operators is another Boolean mask. This
means that you can chain together a bunch of and/or statements to create more complex
queries, and the result is a single Boolean mask.
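A short sketch of these steps on a hypothetical medal table (the numbers are made up and this is not the code from Fig 2.3). In pandas the element-wise "and" and "or" operators are written `&` and `|`, and each comparison needs its own parentheses.

```python
import pandas as pd

# Hypothetical medal counts.
medals = pd.DataFrame(
    {"Gold": [0, 5, 18], "Silver": [0, 2, 24]},
    index=["Afghanistan", "Algeria", "Argentina"],
)

# Applying an operator to a column yields a Boolean mask of the same length.
has_gold = medals["Gold"] > 0

# where() keeps the shape and blanks out rows whose mask value is False.
overlaid = medals.where(has_gold)

# dropna() then removes the blanked rows.
only_gold = overlaid.dropna()

# Two masks combined with & give a single new mask.
gold_and_many_silver = medals[(medals["Gold"] > 0) & (medals["Silver"] > 10)]

print(overlaid)
print(gold_and_many_silver)
```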

Both Series and DataFrames can have indices applied to them. The index is essentially a
row-level label, and we know that rows correspond to axis zero.

Indices can either be inferred, such as when we create a new series without an index, in which
case we get numeric values, or they can be set explicitly, like when we use a dictionary object to
create the series, or when we load data from a CSV file and specify the index column. Another
option for setting an index is to use the set_index function. This function takes a list of columns
and promotes those columns to an index. set_index is a destructive process: it does not keep
the current index. If you want to keep the current index, you need to manually create a new
column and copy the values from the index attribute into it. One nice feature of pandas is that
it has the option to do multi-level indexing. This is like composite keys in relational database
systems. To create a multi-level index, we simply call set_index and give it a list of columns
that we are interested in promoting to an index. Pandas will search through these in order,
finding the distinct data and forming composite indices. A good example of this is often found
when dealing with geographical data, which is sorted by regions or demographics.
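A minimal sketch of both points, preserving the old index and building a two-level index, on a hypothetical table of counties. The column names and population figures are placeholders.

```python
import pandas as pd

# Hypothetical geographic data; POP values are made up.
counties = pd.DataFrame(
    {
        "STNAME": ["Alabama", "Alabama", "Michigan"],
        "CTYNAME": ["Autauga", "Baldwin", "Washtenaw"],
        "POP": [100, 200, 300],
    }
)

# set_index discards the current index, so copy it into a column first.
counties["old_index"] = counties.index

# Passing a list of columns produces a multi-level (composite) index.
keyed = counties.set_index(["STNAME", "CTYNAME"])

# A row is then addressed by one value per level.
print(keyed.loc[("Michigan", "Washtenaw"), "POP"])
```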


Chapter 3

Advanced Python Pandas

3.1 Merging DataFrames and Pandas Idioms

In the previous chapter we were introduced to the pandas data manipulation and analysis
library. We saw that there are two core data structures which are very similar: the
one-dimensional Series object and the two-dimensional DataFrame object. Querying these two
data structures is done in a few different ways, such as using the iloc or loc attributes for
row-based querying, or using the square brackets on the object itself for column-based
querying. Most importantly, we saw that one can query DataFrame and Series objects through
Boolean masking. Boolean masking is a powerful filtering method which allows us to use
broadcasting to determine what data should be kept in our analysis.

We have already seen how to add new data to an existing DataFrame. We simply use the
square bracket operator with the new column name, and the data is added if an index is shared.
If there is no shared index and a scalar value is passed in (remember that a scalar value is just a
single value like an integer or a string), the new column is added with the scalar value as the
default value. What if we wanted to assign a different value for every row? Well, it gets
trickier. If we can hardcode the values into a list, then pandas will unpack them and assign
them to the rows. But if the list we have is not long enough, then we cannot do this, since
pandas does not know where the missing values should go.
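The three cases above (a scalar, a full-length list, and index-aligned data) can be sketched as follows. The table and all values are hypothetical.

```python
import pandas as pd

# Hypothetical purchases, one per store.
purchases = pd.DataFrame(
    {"Cost": [22.5, 2.5, 5.0]},
    index=["Store 1", "Store 2", "Store 3"],
)

# A scalar is repeated for every row.
purchases["Delivered"] = True

# A list must supply exactly one value per row.
purchases["Date"] = ["December 1", "January 1", "mid-May"]

# A Series is aligned on the shared index; rows it does not mention become NaN.
purchases["Feedback"] = pd.Series({"Store 1": "Positive", "Store 3": "Negative"})

# A list that is too short is rejected, because pandas cannot place the gaps.
try:
    purchases["Short"] = ["a", "b"]
except ValueError as err:
    print("rejected:", err)

print(purchases)
```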

More commonly, we want to join two larger DataFrames together, and this is a bit more

complex. Before we jump into the code, we need to address a little relational theory, and to get

some language conventions down. This is a Venn diagram: -


Fig 3.1 Venn diagram

A Venn diagram is traditionally used to show set membership. For example, the circle on the

left is the population of students at a university. The circle on the right is the population of staff

at a university. And the overlapping region in the middle is all those students who are also

staff. Maybe these students run tutorials for a course, or grade assignments, or engage in

running research experiments. We could think of these two populations as indices in separate

DataFrames, maybe with the label of Person Name. When we want to join the DataFrames

together, we have some choices to make. First what if we want a list of all the people regardless

of whether they are staff or student, and all the information we can get on them? In database

terminology, this is called a full outer join. And in set theory, it is called a union. In the Venn

diagram, it represents everyone in any circle. It is quite possible though that we only want those

people who we have maximum information for, those people who are both staff and students.

In database terminology, this is called an inner join. Or in set theory, the intersection. And this

is represented in the Venn diagram as the overlapping parts of each circle. Okay, so let us see

an example of how we would do this in pandas, where we would use the merge function.


First we create two DataFrames, staff and students. There is some overlap in these DataFrames,
in that James and Sally are both students and staff, but Mike and Kelly are not. Importantly,
both DataFrames are indexed along the value we want to merge them on, which is called Name.
If we want the union of these, we call merge, passing in the DataFrame on the left and the
DataFrame on the right and telling merge that we want it to use an outer join. We tell merge
that we want to use the left and right indices as the joining columns. We see in the resulting
DataFrame that everyone is listed. And since Mike does not have a role, and Kelly does not
have a school, those cells are listed as missing values. If we wanted to get the intersection, that
is, just those students who are also staff, we could set the how parameter to inner. And we see
that the resulting DataFrame has only James and Sally in it.
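A sketch of the two joins described above. The names follow the text; the roles and schools are illustrative placeholders and may differ from the values shown in the figure.

```python
import pandas as pd

# Staff and students, both indexed by Name.
staff = pd.DataFrame(
    [
        {"Name": "Kelly", "Role": "Director of HR"},
        {"Name": "Sally", "Role": "Course liaison"},
        {"Name": "James", "Role": "Grader"},
    ]
).set_index("Name")

students = pd.DataFrame(
    [
        {"Name": "James", "School": "Business"},
        {"Name": "Mike", "School": "Law"},
        {"Name": "Sally", "School": "Engineering"},
    ]
).set_index("Name")

# Union: everyone from either table, with NaN where information is missing.
union = pd.merge(staff, students, how="outer", left_index=True, right_index=True)

# Intersection: only the people present in both tables.
intersection = pd.merge(staff, students, how="inner", left_index=True, right_index=True)

print(union)
print(intersection)
```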

Fig 3.2 Merging DataFrames

Python programmers will often suggest that there are many ways the language can be used to
solve a particular problem, but that some are more appropriate than others. The best solutions
are celebrated as idiomatic Python, and there are lots of great examples of this on Stack
Overflow and other websites. An idiomatic solution is often one which has both high
performance and high readability, although this is not always the case. As a sort of
sub-language within Python, pandas has its own set of idioms.


We have alluded to some of these already, such as using vectorization whenever possible and
not using iterative loops if you do not need to. Several developers and users within the pandas
community have used the term pandorable for these idioms.

The first of these is called method chaining. We saw previously that you can chain pandas
calls together when you are querying DataFrames. For instance, if you wanted to select rows
based on an index like county name, and then project only certain columns like the total
population, you could write a query like df.loc["Washtenaw"]["Total Population"]. This is a
form of chaining called chain indexing, and it is generally a bad practice, because pandas could
be returning either a copy or a view of the DataFrame depending upon the underlying NumPy
library. Method chaining, though, is a little different. The general idea behind method chaining
is that every method on an object returns a reference to an object, in pandas usually a new
DataFrame or Series, on which the next method can be called. The beauty of this is that you
can condense many different operations on a DataFrame, for instance, into one line or at least
one statement of code.
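The contrast between chain indexing and method chaining can be sketched on a hypothetical census-style table. The column names (SUMLEV, STNAME, CTYNAME, POPESTIMATE2015) and the figures are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical rows: SUMLEV 40 marks a state summary, 50 marks a county.
census = pd.DataFrame(
    {
        "SUMLEV": [40, 50, 50],
        "STNAME": ["Michigan", "Michigan", "Michigan"],
        "CTYNAME": ["Michigan", "Washtenaw", "Wayne"],
        "POPESTIMATE2015": [1000, 300, 700],
    }
)

# Method chaining: each call returns a DataFrame that the next call works on.
counties = (
    census[census["SUMLEV"] == 50]
    .set_index(["STNAME", "CTYNAME"])
    .rename(columns={"POPESTIMATE2015": "Total Population"})
)

# One .loc call with row and column labels, instead of chained square brackets.
washtenaw_pop = counties.loc[("Michigan", "Washtenaw"), "Total Population"]
print(washtenaw_pop)
```

Wrapping the chain in parentheses lets each step sit on its own line, which keeps a long statement readable.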

3.2 Group by and Scales

We have seen that even though pandas allows us to iterate over every row in a data frame, this
is generally a slow way to accomplish a given task, and it is not very pandorable. For instance,
suppose we wanted to write some code to iterate over all of the states and generate a list of the
average census population numbers. We could do so using a loop and the unique function.

Another option is to use the DataFrame groupby function. This function takes some column
name or names and splits the data frame up into chunks based on those names. It returns a
DataFrameGroupBy object, which can be iterated upon and yields tuples where the first item is
the group condition and the second item is the data frame reduced by that grouping. Since each
tuple is made up of two values, you can unpack it and project just the column that you are
interested in to calculate the average.
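A minimal sketch of that iteration, with made-up states and populations. The column names STNAME and CENSUS2010POP are assumptions for illustration.

```python
import pandas as pd

# Hypothetical county populations, two counties per state.
census = pd.DataFrame(
    {
        "STNAME": ["Alabama", "Alabama", "Michigan", "Michigan"],
        "CENSUS2010POP": [100, 300, 200, 600],
    }
)

# Each iteration unpacks into the group key and the rows belonging to it.
averages = {}
for state, frame in census.groupby("STNAME"):
    averages[state] = frame["CENSUS2010POP"].mean()

# The same result without an explicit loop.
averages_direct = census.groupby("STNAME")["CENSUS2010POP"].mean()

print(averages)
print(averages_direct)
```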


Now, 99% of the time, you will use group by on one or more columns. But you can provide a

function to group by as well and use that to segment your data. This is a bit of a fabricated example

but let’s say that you have a big batch job with lots of processing and you want to work on only a

third or so of the states at a given time. We could create some function which returns a number

between zero and two based on the first character of the state name. Then we can tell group by to

use this function to split up our data frame. It is important to note that in order to do this you need

to set the index of the data frame to be the column that you want to group by first.
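One way such a function might look is sketched below. The function name, the letter boundaries and the data are all hypothetical; the only point is that groupby calls the function on each index label and groups by the value it returns.

```python
import pandas as pd

def batch_number(state_name):
    """Assign a state to batch 0, 1 or 2 from the first letter of its name."""
    first = state_name[0]
    if first < "M":
        return 0
    if first < "Q":
        return 1
    return 2

# Hypothetical populations; the state name is set as the index first.
census = pd.DataFrame(
    {
        "STNAME": ["Alabama", "Michigan", "Texas", "Ohio"],
        "CENSUS2010POP": [100, 200, 300, 400],
    }
).set_index("STNAME")

# groupby applies the function to every index label.
for batch, frame in census.groupby(batch_number):
    print("batch", batch, "has", len(frame), "state(s)")

batch_sizes = census.groupby(batch_number).size().to_dict()
```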

Let us say that we have a data frame of students and their academic levels, such as being in
grade 1, grade 2, or grade 3. Is the difference between a student in grade 1 and a student in
grade 2 the same as the difference between a student in grade 8 and a student in grade 9? Or
let us think about the final grades these students might get. Is the difference between an A and
an A minus the same as the difference between an A minus and a B plus? At the University of
Michigan at least, the answer is usually no. We have intuitively seen some different scales, and
as we move through data cleaning, statistical analysis, and machine learning, it is important to
clarify our knowledge and terminology. As a data scientist, there are four scales that are worth
knowing. The first is the ratio scale. In the ratio scale the measurement units are equally
spaced, and mathematical operations such as subtraction, division, and multiplication are all
valid. Good examples of ratio scale measurements are height and weight. The next scale is the
interval scale. In the interval scale the measurement units are equally spaced, as in the ratio
scale, but there is no clear absence of value. That is, there is no true zero, and so operations
such as multiplication and division are not valid. For most of the work you do in data mining,
the differences between the ratio and interval scales might not be clearly apparent or important
to the algorithm you are applying. But it is important to have this clear in your mind when you
are applying advanced statistical tests.


3.3 Date Functionality

Pandas has four main time-related classes: Timestamp, DatetimeIndex, Period, and
PeriodIndex. First, let us look at Timestamp. Timestamp represents a single point in time and
associates values with it. For example, let us create a timestamp using the string
9/1/2016 10:05AM, and here we have our timestamp. Timestamp is interchangeable with
Python's datetime in most cases.

Fig 3.3 Date Functionality

Suppose we were not interested in a specific point in time, and instead wanted a span of time. This

is where Period comes into play. Period represents a single time span, such as a specific day or

month. Here we are creating a period that is January 2016, and here is an example of a period that is

March 5th, 2016. The index of a timestamp is DatetimeIndex. Looking at the type of our series

index, we see that it is DatetimeIndex. Similarly, the index of period is PeriodIndex.
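The four classes can be seen side by side in a short sketch. The date strings follow the text; the series values and the remaining dates are placeholders, and this is not the code from Fig 3.3.

```python
import pandas as pd

# A single point in time.
ts = pd.Timestamp("9/1/2016 10:05AM")

# Spans of time: a whole month and a single day.
january = pd.Period("1/2016")
march_5th = pd.Period("3/5/2016")

# A series indexed by timestamps gets a DatetimeIndex.
by_timestamp = pd.Series(
    list("abc"),
    index=[pd.Timestamp("2016-09-01"), pd.Timestamp("2016-09-02"), pd.Timestamp("2016-09-03")],
)

# A series indexed by periods gets a PeriodIndex.
by_period = pd.Series(
    list("def"),
    index=[pd.Period("2016-09"), pd.Period("2016-10"), pd.Period("2016-11")],
)

print(ts, january, march_5th)
print(type(by_timestamp.index).__name__)
print(type(by_period.index).__name__)
```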

Another useful thing we can do is change the frequency of the dates in our DataFrame
using asfreq. If we use this to change the frequency from bi-weekly to weekly, we will end up
with missing values every other week. So, let us use the forward fill method on those missing
values. One last thing I wanted to briefly touch upon is plotting time series. Importing
matplotlib.pyplot and using the IPython magic %matplotlib inline will allow you to visualize
the time series in the notebook.
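The frequency change and forward fill can be sketched as follows, assuming a hypothetical bi-weekly count series that starts on Sunday 2 October 2016. Plotting is left out so that the sketch runs without a display.

```python
import pandas as pd

# Hypothetical counts recorded every second Sunday.
dates = pd.date_range("2016-10-02", periods=4, freq="2W-SUN")
counts = pd.DataFrame({"Count": [100, 110, 120, 130]}, index=dates)

# Converting to weekly frequency introduces a gap every other week.
weekly_with_gaps = counts.asfreq("W")

# Forward fill copies the last known value into each gap.
weekly = counts.asfreq("W", method="ffill")

print(weekly_with_gaps)
print(weekly)
```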


CHAPTER 4 : COVID-19 DATA ANALYSIS

ABSTRACT
This project is an amalgamation of two guided projects successfully completed on
the Coursera Project Network, namely "COVID-19 Data Analysis Using Python"
and "COVID-19 Data Visualization using Python". It builds on the concepts taught
in those two projects and applies them to data sets for India and the world to draw
comparisons that are relevant to the current time.

4.1 PROBLEM STATEMENT:

To analyse the spread of COVID-19 in the world and compare the spread in New
Zealand and India using python and associated libraries and modules.

4.2 ABOUT THE DATASET:

The World dataset contains the daily total of cases in different countries, along with
the total number of deaths, and recovered cases.
The dataset for India has been taken from the Ministry of Health and Family Welfare.
It comprises a daily total of cases in different states/UTs, along with the total number
of deaths, and recovered cases.

4.3 PROCESSING THE DATASET:

4.3.1 WORLD DATASET:


The countries have been taken into consideration from the period when there
was at least 1 active case.
For further country-wise analysis, new data frames have been created with the
rows and columns corresponding to the selected country (India/New Zealand).
New columns have been created to calculate the infection, death, and recovery
rates.
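
The processing steps above can be sketched as follows. This is only an illustration under stated assumptions: the column names (Country, Date, Confirmed, Deaths, Recovered), the helper country_frame and the sample numbers are hypothetical, and each rate is assumed to be the day-to-day change in the corresponding cumulative total.

```python
import pandas as pd

# Made-up cumulative totals for two countries (not real data).
world = pd.DataFrame({
    "Country": ["India"] * 5 + ["New Zealand"] * 5,
    "Date": list(pd.date_range("2020-01-28", periods=5)) * 2,
    "Confirmed": [0, 0, 1, 1, 3, 0, 0, 0, 1, 5],
    "Deaths": [0] * 10,
    "Recovered": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
})


def country_frame(data, country):
    """Hypothetical helper: one country's rows, from its first active case."""
    df = data[data["Country"] == country].sort_values("Date")
    # Active cases = confirmed minus deaths minus recoveries.
    active = df["Confirmed"] - df["Deaths"] - df["Recovered"]
    # Keep every row from the first day with at least 1 active case onward.
    df = df[active.ge(1).cummax()].copy()
    # Assumed definition: rate = daily change in the cumulative total.
    df["Infection Rate"] = df["Confirmed"].diff().fillna(0)
    df["Death Rate"] = df["Deaths"].diff().fillna(0)
    df["Recovery Rate"] = df["Recovered"].diff().fillna(0)
    return df.reset_index(drop=True)


india = country_frame(world, "India")
new_zealand = country_frame(world, "New Zealand")
print(india)
```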


4.3.2 INDIA DATASET:


A similar approach has been followed, and state-level data has been filtered,
with Maharashtra and Delhi as the case studies.

4.4 FLOW DIAGRAM:

Fig 4.1 Flow Diagram of the Project

4.5 SOURCE CODES, OUTPUTS AND ANALYSIS:

Fig 4.2 Importing required modules

4.5.1 WORLD ANALYSIS

Fig 4.3 Loading World Dataset

Fig 4.4 Generating Global Spread of Covid-19


Fig 4.5 Country-wise total confirmed cases viewed using choropleth
module of pyplot library (2020-04-03)

Fig 4.6 Country-wise total confirmed cases viewed using choropleth
module of pyplot library (2020-09-16)

From the above figures, it can be seen that the initial spread was mainly
in China, the USA, Russia and Italy.
China, Russia and Italy managed to curb their numbers, whereas in India and
the USA the cases grew to 6M.

Fig 4.7 Generating Global Deaths of Covid-19


Fig 4.8 Country-wise total deaths viewed using choropleth module of
pyplot library (2020-09-16)

Fig 4.9 Generating Global Recovered Cases from Covid-19

Fig 4.10 Country-wise total recovered cases viewed using choropleth
module of pyplot library (2020-09-16)


4.5.2 ANALYSIS FOR NEW ZEALAND

Fig 4.11 Processing New Zealand Data

Fig 4.12 Setting Variables for Lockdown dates in New Zealand


Fig 4.13 Generating the curve for New Zealand

Fig 4.14 Infection Rate curve along with dates of different levels of
lockdown in New Zealand viewed on a line graph using pyplot library


Level 1 of lockdown (the lowest level) was implemented only after the
wave had died down. As new cases emerged in Auckland, different levels of
lockdown were applied, and the second wave was limited to a
maximum rate of 20 new cases in a day.

4.5.3 ANALYSIS FOR INDIA

Fig 4.15 Processing India Data

Fig 4.16 Setting Variables for Different Phases of Lockdown in India


Fig 4.17 Generating the curve for India

Fig 4.18 Infection and Recovery Rate curves along with dates of
different phases of lockdown and un-lockdown in India viewed on a line
graph using pyplot library

Lockdown was implemented in 4 different phases.


When the recovery rate became greater than the infection rate, the phases
of unlock were started.
Un-Lockdown 5 was implemented after the recovery rate had been
consistently greater than the infection rate and the wave was steadily
dying down.
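
One way to check when the recovery rate has been consistently greater than the infection rate is sketched below. The numbers are invented, and the 3-day window used to define "consistently" is an assumption made for this example, not a rule taken from the project.

```python
import pandas as pd

# Made-up daily rates for illustration only.
dates = pd.date_range("2020-09-01", periods=8)
rates = pd.DataFrame({
    "Infection Rate": [90, 95, 92, 88, 85, 80, 78, 75],
    "Recovery Rate":  [70, 80, 93, 90, 89, 86, 84, 83],
}, index=dates)

# 1 on days when recoveries outnumber new infections, else 0.
ahead = (rates["Recovery Rate"] > rates["Infection Rate"]).astype(int)

# Assumed rule: "consistently" means ahead on each of the last 3 days.
window = 3
consistent = ahead.rolling(window).sum() == window

# First date on which the rule is satisfied.
first_consistent = consistent.idxmax()
print(first_consistent.date())
```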

4.5.3.1 NATIONAL ANALYSIS

Fig 4.19 Loading India Dataset

Fig 4.20 Generating National Spread of Covid-19

Fig 4.21 State-wise total confirmed cases viewed using choropleth
module of pyplot library (2020-09-16)

Fig 4.22 Generating National Deaths due to Covid-19


Fig 4.23 State-wise total deaths viewed using choropleth module of
pyplot library (2020-09-16)

Fig 4.24 Calculating Maximum Infection Rates for all States/UT

Fig 4.25 Plotting Bar graph for maximum infection rates

Fig 4.26 State-wise maximum infection rate viewed on a bar graph
using pyplot library

Maharashtra has the highest maximum infection rate, at 24.8k.
From the above figures, it is evident that the largest number of cases in
India is from Maharashtra.
On diving deeper, it is known that most of these cases came from
Mumbai, due to the spread in Dharavi, Asia’s largest slum.


Delhi, the national capital, also made a major contribution, with a high
number of cases despite its small area.
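
The maximum infection rate per state can be computed along the following lines. This is a sketch on made-up numbers; the column names (State, Date, Confirmed) are hypothetical and the infection rate is assumed to be the daily change in confirmed cases.

```python
import pandas as pd

# Made-up cumulative confirmed cases for two states (not real data).
india = pd.DataFrame({
    "State": ["Maharashtra"] * 4 + ["Delhi"] * 4,
    "Date": list(pd.date_range("2020-09-01", periods=4)) * 2,
    "Confirmed": [100, 130, 190, 210, 50, 60, 95, 100],
})

india = india.sort_values(["State", "Date"])

# Daily change in confirmed cases, computed separately for each state.
india["Infection Rate"] = india.groupby("State")["Confirmed"].diff()

# Largest single-day rise per state, highest first.
max_rates = (
    india.groupby("State")["Infection Rate"].max().sort_values(ascending=False)
)
print(max_rates)

# max_rates.plot(kind="bar") would draw the bar graph (needs matplotlib).
```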

4.5.3.2 ANALYSIS FOR DELHI

Fig 4.27 Creating new data frame for Delhi

Fig 4.28 Normalising Data

Plotting normalised values of Infection and Death rates of Delhi

Fig 4.29 Normalised values of Infection and Death rates of Delhi
viewed on a line graph using Pyplot library

Fig 4.30 Plotting normalised values of Infection Rate and Confirmed
Cases in Delhi


Fig 4.31 Normalised values of Infection Rate and Confirmed Cases in
Delhi plotted on a line graph using pyplot library

Delhi's Infection Rate has a bimodal curve (two peaks).
The first peak can be seen around 8 July 2020. The infection rate then
decreased until about 27 July 2020.
With the new phase of un-lockdown, the cases then rose and hit a new
peak around 20 September 2020.
As the Infection Rate increased, a similar trend can be seen in the
Death Rate curve.
The Death Rate hit its peak around 16 June 2020.
The impact of the infection rate can be seen on the confirmed cases, since
the infection rate is the slope of the confirmed cases curve.
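
The relationship between the curves can be illustrated with a short sketch. The numbers are invented, and dividing each column by its own maximum is an assumed normalisation; the exact method used in the project is shown only in the figures.

```python
import pandas as pd

# Made-up cumulative totals for one state (not real data).
delhi = pd.DataFrame(
    {"Confirmed": [10, 15, 25, 40, 50], "Deaths": [0, 1, 1, 3, 4]},
    index=pd.date_range("2020-07-01", periods=5),
)

# The infection rate is the slope of the confirmed cases curve,
# i.e. the day-to-day difference of the cumulative total.
delhi["Infection Rate"] = delhi["Confirmed"].diff().fillna(0)
delhi["Death Rate"] = delhi["Deaths"].diff().fillna(0)

# Assumed normalisation: scale every column so that its peak is 1,
# which lets curves of very different magnitude share one axis.
normalised = delhi / delhi.max()

# Date on which the infection rate peaks.
peak_day = normalised["Infection Rate"].idxmax()
print(normalised)
print(peak_day.date())
```

Because the confirmed total is cumulative, its normalised curve only ever rises, while the normalised infection rate rises and falls with the steepness of that curve.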

4.5.3.3 ANALYSIS FOR MAHARASHTRA

Fig 4.32 Creating new data frame for Maharashtra

Fig 4.33 Normalising Data

Fig 4.34 Plotting normalised values of Infection and Death rates of
Maharashtra


Fig 4.35 Plotting values of Infection Rate and Confirmed Cases in
Maharashtra

Fig 4.36 Normalised values of Infection and Death rates of
Maharashtra viewed on a line graph using Pyplot library

Fig 4.37 Normalised values of Infection Rate and Confirmed Cases in
Maharashtra plotted on a line graph using pyplot library

Maharashtra's Infection Rate has a unimodal curve (single peak).
The maximum fresh cases were observed around 16 September 2020.
As the Infection Rate increased, a similar trend can be seen in the
Death Rate curve.
The Death Rate hit its peak around 20 June 2020.
The impact of the infection rate can be seen on the confirmed cases, since
the infection rate is the slope of the confirmed cases curve.


CONCLUSION

The project gives an insight into the COVID-19 pandemic worldwide
and in New Zealand and India. It also shows how the governments' decisions on
the phases of lockdown and un-lockdown related to the infection and recovery
rates. Pandas, NumPy, matplotlib and pyplot have been heavily used in the project.
In the data analysed, developed countries such as the USA and Italy recorded
more cases than most other countries. It can also be seen that the infection
rates for the countries have started to fall and the pandemic has now started
to subside.




MAHARAJA SURAJMAL INSTITUTE OF TECHNOLOGY

Summer/Industrial Training Evaluation Form F05 (MSIT-EXM-P-02)

(Year 2020 -- 2021)


Details of the Student

Name: Lakshya Kharayat
Roll No: 02196302818
Branch and Semester: ECE 5th Sem
Mobile No: 9958846298
E-mail ID: cartoonshow57@gmail.com

Details of the Organisation

Name and address of organization: Coursera Inc. (Online), 381 E. Evelyn Ave, Mountain View, CA 94041
Broader Area: Data Science with Python
Name of Instructor: Christopher Brooks
Designation and Contact No: Associate Professor (University of Michigan)

Student Performance Record

Week    No. of days scheduled   Number of days      Curriculum scheduled for the student        Curriculum actually covered by the student
        for the training        actually attended
Week 1  5                       5                   Python Fundamentals                         Python Fundamentals
Week 2  5                       5                   Basic Data Processing with Pandas           Basic Data Processing with Pandas
Week 3  5                       5                   Advanced Python Pandas                      Advanced Python Pandas
Week 4  5                       5                   Statistical Analysis in Python and Project  Statistical Analysis in Python and Project
Week 5  -                       -                   -                                           -
Week 6  -                       -                   -                                           -

(Signature of the student)

Any comments or suggestions on the student's performance during the training program (to be filled in by
the instructor): ……………………………………………………………………………………………


(Signature of the Instructor, along with Seal)

Note: Every student has to fill in and submit this proforma, duly signed by his/her instructor, to the
faculty-in-charge by the first week of September.
