Sample Report On Data Science Course On Coursera

btech (Guru Gobind Singh Indraprastha University)
TRAINING REPORT ON: DATA SCIENCE
A report submitted in partial fulfilment of the requirements for the award of the degree of
Bachelor of Technology
in
ELECTRONICS AND COMMUNICATION ENGINEERING
Submitted by
LAKSHYA KHARAYAT
(01296302818)
Maharaja Surajmal Institute of Technology
C-4 Janak Puri, New Delhi-110058
Affiliated to GGSIPU | NAAC Accredited 'A' Grade | ISO 9001:2015 Certified
Approved by AICTE

Certificate

STUDENT DECLARATION
I, Lakshya Kharayat (Enrolment Number: 01296302818) of Maharaja Surajmal Institute of
Technology, hereby declare that this report, based on the training undergone through Coursera
in online mode during the summer break of 2020 as part of my Bachelor of Technology
(Electronics and Communication) degree (2018-2022), is an original work. In case
the project report, or any part of it, is found to be copied or quoted without reference, I shall be
solely held accountable for the repercussions arising thereof.
Lakshya Kharayat
01296302818
CONTENTS
Certificate
Declaration
List of Figures
Abstract
Acknowledgement
Company Profile
Chapters
1. Data Science with Python
1.1 Introduction
1.2 Pandas
1.3 Jupyter Notebook
2. Data Structures in Pandas
2.1 Series Data Structure
2.2 Data Frame Loading and Indexing
2.3 Querying and Indexing a DataFrame
3. Advanced Python Pandas
3.1 Merging DataFrames and Pandas Idioms
3.2 Group by and Scales
3.3 Date Functionality
4. Project: COVID-19 Data Analysis
4.1 Problem Statement
4.2 About the DataSet
4.3 Processing the DataSet
4.4 Flow Diagram
4.5 Source Codes, Outputs And Analysis
5. Bibliography
LIST OF FIGURES
1.1 Pandas Trend
1.2 Jupyter Notebook Logo
2.1 Series Data Structure
2.2 Data Frame Indexing and Loading
2.3 Boolean Mask
3.1 Venn Diagram
3.2 Merging DataFrames
3.3 Date Functionality
4.1 Flow Diagram
4.2 Importing Modules
4.3 Loading Dataset
4.4 Generating Global Spread
ABSTRACT
Every company will say they are doing a form of data science, but what exactly does that
mean? The field is growing so rapidly, and revolutionizing so many industries, it's difficult
to fence in its capabilities with a formal definition, but generally data science is devoted to
the extraction of clean information from raw data for the formulation of actionable insights.
Commonly referred to as the “oil of the 21st century,” our digital data carries the most
importance in the field. It has incalculable benefits in business, research, and our everyday
lives. Your route to work, your most recent Google search for the nearest coffee shop, your
Instagram post about what you ate, and even the health data from your fitness tracker are all
important to different data scientists in different ways. Sifting through massive lakes of data,
looking for connections and patterns, data science is responsible for bringing us new products,
delivering breakthrough insights, and making our lives more convenient.
ACKNOWLEDGEMENT
First and foremost, I am grateful to Mr. Pradeep Sangwan (HoD, Department of Electronics and
Communication Engineering, Maharaja Surajmal Institute of Technology) for allowing me to
undergo this fruitful four-week training on Coursera, and, in turn, to Coursera itself.
I am deeply indebted to the numerous instructors on Coursera that provided valuable online
material and lectures with detailed descriptions for me to understand from a very basic level,
and for their guidance and platform provisions that helped me complete the training with ease.
Their numerous resources and guidance helped me gain valuable knowledge on Data Science,
ML, and Python, making me a better fit for the industry than I was before.
Lastly, I would like to express my gratitude to my peers, who provided valuable insights and
collaborations on the project, helping me improve every step of the way.
SUBMITTED BY:
Lakshya Kharayat
01296302818
COMPANY PROFILE
Our Story
Coursera was founded by Daphne Koller and Andrew Ng with a vision of providing life-
transforming learning experiences to anyone, anywhere. It is now a leading online learning
platform for higher education, where 73 million learners from around the world come to learn
skills of the future. More than 200 of the world’s top universities and industry educators
partner with Coursera to offer courses, Specializations, certificates, and degree programs.
Thousands of companies trust the company’s enterprise platform Coursera for Business to
transform their talent. Coursera for Government equips government employees and citizens
with in-demand skills to build a competitive workforce. Coursera for Campus empowers any
university to offer high-quality, job-relevant online education to students, alumni, faculty, and
staff. Coursera is backed by leading investors that include Kleiner Perkins, New Enterprise
Associates, Learn Capital, and SEEK Group.
Courses
Learn something new
Every course on Coursera is taught by top instructors from world-class universities and
companies, so you can learn something new anytime, anywhere. Hundreds of free
courses give you access to on-demand video lectures, homework exercises, and
community discussion forums. Paid courses provide additional quizzes and projects as
well as a shareable Course Certificate upon completion.
• 100% online
• Learn something new in 4-6 weeks
• Priced starting at $39 (USD)
• Earn a Course Certificate
Guided Projects
Gain a job-relevant skill in under 2 hours
Enrol in Guided Projects to learn job-relevant skills and industry tools in under 2 hours.
Guided Projects are self-paced, require a smaller time commitment, and provide practice
using tools in real-world scenarios, so you can build the job skills you need, right when
you need them.
• 100% online with no setup required
• Interactive learning experience with step-by-step, visual instruction from
subject-matter experts
• Priced starting at $9.99 (USD)
• Earn a Guided Project certificate
Specializations
Master a skill
If you want to master a specific career skill, consider enrolling in a Specialization. You
will complete a series of rigorous courses at your own pace, tackle hands-on projects
based on real business challenges, and earn a Specialization Certificate to share with
your professional network and potential employers.
• 100% online
• Master a skill in 4-6 months
• Priced starting at $39 (USD) per month
• Earn a Specialization Certificate
Professional Certificates
Get job-ready for an in-demand career
Whether you are looking to start a new career or change your current one, Professional
Certificates on Coursera help you become job ready. Learn at your own pace from top
companies and universities, apply your new skills to hands-on projects that showcase
your expertise to potential employers, unlock access to career support resources, and
earn a career credential to kickstart your new career.
• 100% online
• Get job-ready in less than a year with hands-on projects
• Priced starting at $39 per month (USD)
• Earn a shareable Certificate
Online Degrees
Top degrees that fit your life
Transform your career with a degree online from a world-class university on Coursera.
Our modular degree learning experience gives you the ability to study on your own
schedule and earn credit as you complete your course assignments. For a breakthrough
price, you will learn from top university instructors and graduate with an industry-
relevant university credential.
• Flexible online learning, with open degree courses you can start today
• Build your own schedule over 1-4 years of study
• All-in pricing starting at around $9,000 (USD) with the option to pay in
instalments
• Earn an accredited university bachelor’s or master’s degree
Chapter 1
Data Science with Python
1.1 Introduction
Data science is all about using data to solve problems. The problem could be a decision,
such as identifying which email is spam and which is not; a product recommendation, such
as which movie to watch; or a prediction, such as who will be the next President of
the USA. The core job of a data scientist is therefore to understand the data, extract useful
information from it, and apply that information to solve problems.
The history of data science goes back a little further than 2004, which is where the Google
search term history begins, but this at least gives a sense of how popular the area is now. I
think the interest in the area comes from the networked, data-driven society we
find ourselves living in. When people think of the term data scientist, they tend to think of
Google, Amazon, or Facebook, places with big artificial intelligence research teams, and
certainly these are amazing companies doing great things with data science.
Interest in data science is at an all-time high and has exploded in popularity in the last
couple of years. A fun way to see this is to visit the Google Trends website, which
shows search keyword information over time. There we can see that the term ‘data science’ is
massively popular across the globe.
But data science is not limited to careers with tech companies. Almost every company is
turning to data science to better understand how to build products, serve customers, and
leverage new opportunities. And companies are not the only ones.
1.2 Pandas
Pandas is an open-source Python library that provides high-performance data manipulation and
analysis tools built on its powerful data structures. The name Pandas is derived from “panel
data”, an econometrics term for multidimensional data sets.
On Google Trends, the queries related to data science include topics like Python. For instance, here is the trend
for Python pandas:
Fig 1.1 Pandas Trend
Key Features of Pandas
• Fast and efficient DataFrame object with default and customized indexing.
• Tools for loading data into in-memory data objects from different file formats.
• Data alignment and integrated handling of missing data.
• Reshaping and pivoting of data sets.
• Label-based slicing, indexing, and subsetting of large data sets.
• Columns from a data structure can be deleted or inserted.
• Group by data for aggregation and transformations.
• High performance merging and joining of data.
• Time Series functionality.
Prior to Pandas, Python was used mainly for data munging and preparation and contributed
little to data analysis itself. Pandas solved this problem. Using Pandas, we can
accomplish five typical steps in the processing and analysis of data, regardless of the origin of
the data: load, prepare, manipulate, model, and analyse.
Python with Pandas is used in a wide range of academic and commercial
domains, including finance, economics, statistics, and analytics.
1.3 Jupyter Notebook
The Jupyter Notebook is an open-source web application that you can use to create and share
documents that contain live code, equations, visualizations, and text. Jupyter Notebook is
maintained by the people at Project Jupyter.
Jupyter Notebooks are a spin-off from the IPython project, which used to have an
IPython Notebook project itself. The name Jupyter comes from the core
programming languages that it supports: Julia, Python, and R. Jupyter ships with the IPython
kernel, which allows you to write your programs in Python, but there are currently over 100
other kernels that you can also use.
Fig 1.2 Jupyter Notebook Logo
Chapter 2
Data Structures in Pandas
2.1 Series Data Structure
The Series is one of the core data structures in pandas. You can think of it as a cross between a list and a
dictionary. The items are all stored in order, and there are labels with which you can retrieve them.
An easy way to visualize this is as two columns of data. The first is the special index, a lot like
dictionary keys, while the second is your actual data. It is important to note that the data column
has a label of its own, which can be retrieved using the .name attribute. This is different from
dictionaries and is useful when it comes to merging multiple columns of data.
You can create a Series by passing in a list of values. When you do this, pandas automatically
assigns an index starting with zero and sets the name of the Series to None. Let us see an
example of this:
Fig 2.1 Series Data Structure
We do not have to use strings. If we pass in a list of whole numbers, for instance, we
see that pandas sets the type to int64. Underneath, pandas stores Series values in a typed array
using the NumPy library. This offers a significant speed-up when processing data compared with
traditional Python lists. Pandas also does some type conversion for us.
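The two cases described above can be sketched as follows. This is a minimal illustration: the animal names and numbers are made-up values and are not taken from the report's figure.

```python
import pandas as pd

# A list of strings: pandas assigns a default integer index starting at 0
# and leaves the Series name unset (None).
animals = pd.Series(["Tiger", "Bear", "Moose"])
print(animals.index.tolist())
print(animals.name)

# A list of whole numbers is stored in a typed NumPy integer array.
numbers = pd.Series([1, 2, 3])
print(numbers.dtype)
```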
If we create a list of strings and one element is None, pandas inserts it as None
and uses the type object for the underlying array. If we create a list of numbers, integers or
floats, and put in None, pandas automatically converts it to a special floating-point
value designated as NaN, which stands for “not a number”.
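A short sketch of this conversion, using illustrative values. The dtype that pandas reports for the string case can differ between pandas versions, so the example only checks which entries are treated as missing there.

```python
import pandas as pd

# None among numbers is converted to NaN, so the Series becomes floating point.
numbers = pd.Series([1, 2, None])
print(numbers.dtype)
print(numbers.isna().tolist())

# None among strings is also treated as a missing value; the dtype shown
# for this Series depends on the installed pandas version.
animals = pd.Series(["Tiger", "Bear", None])
print(animals.isna().tolist())
```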
2.2 Data Frame Loading and Indexing
The common workflow is to read your data into a DataFrame and then reduce this DataFrame to the
columns or rows that you are interested in working with. The pandas toolkit tries to give you
views on a DataFrame. This is much faster than copying data and much more memory efficient
too.
But it does mean that if you are manipulating the data, you have to be aware that any changes to the
DataFrame you are working on may have an impact on the base DataFrame you used originally.
Here is an example using a DataFrame:
Fig 2.2 Data Frame Indexing and Loading
We can create a Series based on just the cost category using the square brackets. Then we can
increase the cost in this Series using broadcasting. Now if we look at our original DataFrame, we
see those costs have risen as well. This is an important consideration to watch out for. If you
want to explicitly use a copy, then you should consider calling the copy method on the
DataFrame first.
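The safe pattern can be sketched as below. The store names and costs are made-up values. Whether an uncopied selection shares memory with the original depends on the pandas version and its copy-on-write settings, so only the explicit copy is shown here.

```python
import pandas as pd

purchases = pd.DataFrame(
    {"Item": ["Dog Food", "Kitty Litter", "Bird Seed"], "Cost": [22.5, 2.5, 5.0]},
    index=["Store 1", "Store 1", "Store 2"],
)

# Take an explicit copy of the column before modifying it.
costs = purchases["Cost"].copy()
costs += 2  # broadcasting: 2 is added to every element

print(costs.tolist())              # the modified copy
print(purchases["Cost"].tolist())  # the original column is left untouched
```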
Pandas has built-in support for delimited files such as CSV files as well as a variety of other
data formats including relational databases, Excel, and HTML tables.
We can read a CSV file into a DataFrame by calling the read_csv function of the module. When we look at
the DataFrame, we see that the first cell has a NaN in it since it is an empty value, and the rows
have been automatically indexed for us. It seems clear that the first row of data in the
DataFrame is what we really want to see as the column names.
The read_csv function has several parameters that we can use to indicate to pandas how rows and columns
should be labelled. For instance, we can use the index_col parameter to indicate which column should be
the index, and we can also use the header parameter to indicate which row from the data file
should be used as the header. Pandas stores a list of all the columns in the columns attribute. We
can change the column names by iterating over this list and calling the rename
method of the DataFrame.
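A minimal sketch of this loading step follows. The CSV content is an in-memory stand-in with made-up medal counts, not the file used in the training, and skiprows=1 is one assumed way of discarding an unwanted first line (the header parameter mentioned above is another way to choose the header row).

```python
import io
import pandas as pd

# Stand-in for a CSV file whose first line is not the real header.
raw = """,0,1,2
Country,Gold,Silver,Bronze
Norway,10,5,3
Kenya,2,4,1
"""

# skiprows=1 drops the unwanted first line so the next line becomes the header;
# index_col=0 uses the first column (Country) as the row index.
medals = pd.read_csv(io.StringIO(raw), index_col=0, skiprows=1)

# Rename the columns by iterating over the columns attribute.
for col in medals.columns:
    medals = medals.rename(columns={col: col.lower() + "_medals"})

print(medals.columns.tolist())
print(medals.loc["Norway", "gold_medals"])
```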
2.3 Querying and Indexing a DataFrame
Before we talk about how to query DataFrames, we need to talk about Boolean masking.
Boolean masking is the heart of fast and efficient querying in NumPy. It is somewhat analogous to
masking used in other computational areas. A Boolean mask is an array which can be of one
dimension like a Series, or two dimensions like a DataFrame, where each of the values in the
array is either True or False. This array is essentially overlaid on top of the data structure that
we are querying. Any cell aligned with a True value will be admitted into our result, and
any cell aligned with a False value will not.
Boolean masking is powerful conceptually and is the cornerstone of efficient NumPy and
pandas querying. This technique is well used in other areas of computer science, for instance,
in graphics. But it does not really have an analogue in traditional relational databases, so I
think it is worth pointing out here. Boolean masks are created by applying operators directly to
the pandas Series or DataFrame objects.
Fig 2.3 Boolean Mask
What we want to do next is overlay that mask on the DataFrame. We can do this using the where
function. The where function takes a Boolean mask as a condition, applies it to the DataFrame or
Series, and returns a new DataFrame or Series of the same shape. One more thing to keep in
mind if you are not used to Boolean or bit masking for data reduction: the output of two
Boolean masks being compared with logical operators is another Boolean mask. This means
that you can chain together a bunch of and/or statements to create more complex queries, and
the result is a single Boolean mask.
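The masking steps above can be sketched as follows, using a small made-up medal table (the country names and counts are illustrative only).

```python
import pandas as pd

medals = pd.DataFrame(
    {"Gold": [10, 0, 3], "Silver": [5, 2, 0]},
    index=["Norway", "Kenya", "Peru"],
)

# Applying an operator to a column produces a Boolean mask (a Series of True/False).
has_gold = medals["Gold"] > 0

# where() keeps the original shape and fills rows that fail the mask with NaN.
masked = medals.where(has_gold)

# Masks combine with & (and) and | (or); each comparison needs its own parentheses.
gold_and_silver = medals[(medals["Gold"] > 0) & (medals["Silver"] > 0)]
gold_or_silver = medals[(medals["Gold"] > 0) | (medals["Silver"] > 0)]

print(has_gold.tolist())
print(masked)
print(gold_and_silver.index.tolist())
```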
Both Series and DataFrames can have indices applied to them. The index is essentially a row-level
label, and we know that rows correspond to axis zero.
Indices can either be inferred, such as when we create a new Series without an index, in which case
we get numeric values, or they can be set explicitly, like when we use a dictionary object to create
the Series, or when we loaded data from the CSV file and specified the header. Another option for
setting an index is to use the set_index function. This function takes a list of columns
and promotes those columns to an index. set_index is a destructive process: it does not keep the
current index. If you want to keep the current index, you need to manually create a new column
and copy into it the values from the index attribute. One nice feature of pandas is that it has the
option to do multi-level indexing. This is like composite keys in relational database systems. To
create a multi-level index, we simply call set_index and give it a list of columns that we are
interested in promoting to an index. Pandas will search through these in order, finding the
distinct data and forming composite indices. A good example of this is often found when
dealing with geographical data which is sorted by regions or demographics.
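As a hedged illustration of set_index and multi-level indexing, the sketch below uses a tiny made-up table; the column names State, County, and Population and all values are assumptions for the example.

```python
import pandas as pd

counties = pd.DataFrame({
    "State": ["Michigan", "Michigan", "Ohio"],
    "County": ["Washtenaw", "Wayne", "Franklin"],
    "Population": [370000, 1750000, 1300000],
})

# set_index replaces the current index, so copy it into a column first
# if it needs to be kept.
counties["row_id"] = counties.index

# Promoting a list of columns creates a multi-level (composite) index.
by_region = counties.set_index(["State", "County"])

# A row is then addressed by a tuple with one label per index level.
print(by_region.loc[("Michigan", "Washtenaw"), "Population"])
```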
Chapter 3
Advanced Python Pandas
3.1 Merging DataFrames and Pandas Idioms
In the previous chapter we were introduced to the pandas data manipulation and analysis library. We saw that there are
two core data structures which are very similar: the one-dimensional Series object and the
two-dimensional DataFrame object. Querying these two data structures is done in a few different
ways, such as using the iloc or loc attributes for row-based querying or using the square
brackets on the object itself for column-based querying. Most importantly, we saw that one can
query the DataFrame and Series objects through Boolean masking, a
powerful filtering method which allows us to use broadcasting to determine what data should
be kept in our analysis.
We have already seen how we add new data to an existing DataFrame. We simply use the
square bracket operator with the new column name, and the data is added if an index is shared.
If there is no shared index and a scalar value (a single value like an integer or a string) is passed in,
the new column is added with the scalar value
as the default value. What if we wanted to assign a different value for every row? Well, it gets
trickier. If we can hardcode the values into a list, then pandas will unpack them and assign
them to the rows. But if the list we have is not long enough, then we cannot do this, since pandas
does not know where the missing values should go.
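The three ways of adding a column described above can be sketched as follows. The store names, dates, and feedback values are made up for the illustration.

```python
import pandas as pd

purchases = pd.DataFrame(
    {"Cost": [22.5, 2.5, 5.0]},
    index=["Store 1", "Store 2", "Store 3"],
)

# A scalar is broadcast to every row.
purchases["Delivered"] = True

# A list is unpacked row by row, so its length must match the number of rows.
purchases["Date"] = ["December 1", "January 1", "mid-May"]

# A Series is aligned on the shared index; rows without a match get NaN.
purchases["Feedback"] = pd.Series({"Store 1": "Positive", "Store 3": "Negative"})

# A list that is too short is rejected rather than padded.
try:
    purchases["Short"] = ["only one value"]
    short_list_rejected = False
except ValueError:
    short_list_rejected = True

print(purchases)
```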
More commonly, we want to join two larger DataFrames together, and this is a bit more
complex. Before we jump into the code, we need to address a little relational theory and
settle some language conventions. This is a Venn diagram:
Fig 3.1 Venn diagram
A Venn diagram is traditionally used to show set membership. For example, the circle on the
left is the population of students at a university, and the circle on the right is the population of staff
at the same university. The overlapping region in the middle contains all those students who are also
staff. Maybe these students run tutorials for a course, grade assignments, or engage in
running research experiments. We could think of these two populations as indices in separate
DataFrames, maybe with the label of Person Name. When we want to join the DataFrames
together, we have some choices to make. First, what if we want a list of all the people, regardless
of whether they are staff or student, and all the information we can get on them? In database
terminology, this is called a full outer join, and in set theory, it is called a union. In the Venn
diagram, it represents everyone in either circle. It is quite possible, though, that we only want those
people for whom we have maximum information, those who are both staff and students.
In database terminology, this is called an inner join, or in set theory, the intersection. It
is represented in the Venn diagram as the overlapping part of the circles. Let us see
an example of how we would do this in pandas, where we would use the merge function.
First we create two DataFrames, staff and students. There is some overlap in these DataFrames,
in that James and Sally are both students and staff, but Mike and Kelly are not. Importantly,
both DataFrames are indexed along the value we want to merge them on, which is called Name.
If we want the union of these, we call merge, passing in the DataFrame on the left and the
DataFrame on the right and telling merge that we want it to use an outer join. We also tell merge
that we want to use the left and right indices as the joining columns. We see in the resulting
DataFrame that everyone is listed. Since Mike does not have a role and Kelly does not have
a school, those cells are listed as missing values. If we wanted to get the intersection, that is,
just those students who are also staff, we could set the how parameter to inner. We then see that the
resulting DataFrame has only James and Sally in it.
Fig 3.2 Merging DataFrames
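The merge described above can be sketched as below. The names follow the text; the Role and School values are assumed for illustration, since the original figure is not reproduced here.

```python
import pandas as pd

staff = pd.DataFrame([
    {"Name": "Kelly", "Role": "Director of HR"},
    {"Name": "Sally", "Role": "Course liaison"},
    {"Name": "James", "Role": "Grader"},
]).set_index("Name")

students = pd.DataFrame([
    {"Name": "James", "School": "Business"},
    {"Name": "Mike", "School": "Law"},
    {"Name": "Sally", "School": "Engineering"},
]).set_index("Name")

# Outer join (union): everyone appears, with NaN where information is missing.
everyone = pd.merge(staff, students, how="outer",
                    left_index=True, right_index=True)

# Inner join (intersection): only people present in both DataFrames.
both = pd.merge(staff, students, how="inner",
                left_index=True, right_index=True)

print(everyone)
print(both)
```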
Python programmers will often suggest that there are many ways the language can be used to solve
a particular problem, but that some are more appropriate than others. The best solutions are
celebrated as idiomatic Python, and there are lots of great examples of this on Stack Overflow
and other websites. An idiomatic solution is often one which has both high performance and high
readability, though this is not necessarily the case. As a sort of sub-language within Python, pandas has its
own set of idioms.
We have alluded to some of these already, such as using vectorization whenever possible and
not using iterative loops if you do not need to. Several developers and users within the pandas
community have used the term pandorable for these idioms.
The first of these is called method chaining. We saw previously that you can chain
pandas calls together when you are querying DataFrames. For instance, if you wanted to select
rows based on an index like county name and then project only certain columns like
the total population, you could write a query like df.loc["Washtenaw"]["Total Population"].
This is a form of chaining called chain indexing, and it is generally a bad practice because
pandas could be returning a copy or a view of the DataFrame depending upon
the underlying NumPy library. Method chaining, though, is a little bit different. The general idea
behind method chaining is that every method on an object returns a reference to that object. The
beauty of this is that you can condense many different operations on a DataFrame, for instance,
into one line or at least one statement of code.
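A minimal sketch of method chaining follows. The column names imitate a census table and the figures are invented; they are assumptions for the example rather than data from the training.

```python
import pandas as pd

census = pd.DataFrame({
    "SUMLEV": [40, 50, 50, 50],
    "STNAME": ["Michigan", "Michigan", "Michigan", "Ohio"],
    "CTYNAME": ["Michigan", "Washtenaw", "Wayne", "Franklin"],
    "POPESTIMATE2015": [9900000, 358000, 1760000, 1250000],
})

# Each method returns a new DataFrame, so the steps read as one statement:
# keep county-level rows, index by state and county, then rename a column.
counties = (
    census
    .loc[lambda d: d["SUMLEV"] == 50]
    .set_index(["STNAME", "CTYNAME"])
    .rename(columns={"POPESTIMATE2015": "Total Population"})
)

# Row and column are selected in a single .loc call, which avoids chained indexing.
washtenaw = counties.loc[("Michigan", "Washtenaw"), "Total Population"]
print(washtenaw)
```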
3.2 Group by and Scales
We have seen that even though pandas allows us to iterate over every row in a DataFrame, this
is generally a slow way to accomplish a given task and it is not very pandorable. For instance,
suppose we wanted to write some code to iterate over all of the states and generate a list of the
average census population numbers. We could do so using a loop over the values returned by the unique function.
Another option is to use the DataFrame groupby function. This function takes some column
name or names and splits the DataFrame up into chunks based on those names. It returns a
DataFrameGroupBy object, which can be iterated upon and yields tuples where the first
item is the group condition and the second item is the DataFrame reduced by that grouping.
Since each tuple is made up of two values, you can unpack it and project just the column that you are
interested in to calculate the average.
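Both approaches can be sketched on a small made-up table (the column names and population figures are assumptions for the example).

```python
import pandas as pd

census = pd.DataFrame({
    "STNAME": ["Michigan", "Michigan", "Ohio", "Ohio"],
    "CENSUS2010POP": [100, 300, 200, 400],
})

# Loop-and-filter approach: the whole table is scanned once per state.
by_loop = {}
for state in census["STNAME"].unique():
    by_loop[state] = census[census["STNAME"] == state]["CENSUS2010POP"].mean()

# groupby approach: each iteration yields (group key, sub-DataFrame).
by_group = {}
for state, frame in census.groupby("STNAME"):
    by_group[state] = frame["CENSUS2010POP"].mean()

print(by_loop)
print(by_group)
```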
Now, 99% of the time, you will use groupby on one or more columns. But you can provide a
function to groupby as well and use that to segment your data. This is a bit of a fabricated example,
but let us say that you have a big batch job with lots of processing and you want to work on only a
third or so of the states at a given time. We could create some function which returns a number
between zero and two based on the first character of the state name. Then we can tell groupby to
use this function to split up our DataFrame. It is important to note that in order to do this you need
to first set the index of the DataFrame to be the column that you want to group by.
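A sketch of grouping by a function is shown below. The helper batch_number and its letter cut-offs are hypothetical choices for the illustration, and the population figures are made up.

```python
import pandas as pd

census = pd.DataFrame({
    "STNAME": ["Alabama", "Michigan", "Ohio", "Texas"],
    "CENSUS2010POP": [100, 200, 300, 400],
}).set_index("STNAME")  # the function is applied to the index values


def batch_number(state_name):
    """Return 0, 1, or 2 based on the first letter of the state name."""
    first = state_name[0]
    if first < "M":
        return 0
    if first < "Q":
        return 1
    return 2


# groupby calls batch_number on each index label and groups rows by the result.
batches = {batch: frame.index.tolist()
           for batch, frame in census.groupby(batch_number)}
print(batches)
```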
Let us say that we have a DataFrame of students and their academic levels, such as being in
grade 1, grade 2, or grade 3. The difference between a student in grade 1 and a student
in grade 2 is the same as the difference between a student in grade 8 and a student in grade 9.
Now let us think about the final exam scores these students might get. Is the
difference between an A and an A minus the same as the difference between an A minus and a
B plus? At the University of Michigan at least, the answer is usually no. We have intuitively
seen some different scales, and as we move through data cleaning, statistical analysis, and
machine learning, it is important to clarify our knowledge and terminology. As a data scientist,
there are four scales that are worth knowing. The first is the ratio scale. In the ratio scale the
measurement units are equally spaced and mathematical operations such as subtraction, division,
and multiplication are all valid. Good examples of ratio scale measurements are height
and weight. The next scale is the interval scale. In the interval scale the measurement units are
equally spaced, as in the ratio scale, but there is no clear absence of value. That is, there is no
true zero, and so operations such as multiplication and division are not valid. For most of the
work you do in data mining, the differences between the ratio and interval scales might not be
clearly apparent or important to the algorithm you are applying. But it is important to have
this clear in your mind when you are applying advanced statistical tests.
3.3 Date Functionality
Pandas has four main time-related classes: Timestamp, DatetimeIndex, Period, and PeriodIndex.
First, let us look at Timestamp. A Timestamp represents a single point in time and is used to associate values
with points in time. For example, let us create a timestamp using the string 9/1/2016 10:05AM,
and here we have our timestamp. Timestamp is interchangeable with Python's datetime in most
cases.
Fig 3.3 Date Functionality
Suppose we were not interested in a specific point in time and instead wanted a span of time. This
is where Period comes into play. A Period represents a single time span, such as a specific day or
month. Here we are creating a period that is January 2016, and here is an example of a period that is
March 5th, 2016. A Series indexed by Timestamp objects has a DatetimeIndex: looking at the type of our Series
index, we see that it is DatetimeIndex. Similarly, a Series indexed by Period objects has a PeriodIndex.
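The four classes can be sketched as follows. The timestamp and period strings are the ones mentioned in the text; the Series values and their dates are made up.

```python
import pandas as pd

# A single point in time.
stamp = pd.Timestamp("9/1/2016 10:05AM")

# Spans of time: a whole month and a single day.
january = pd.Period("1/2016")
march_5th = pd.Period("3/5/2016")

# A Series indexed by Timestamps gets a DatetimeIndex.
by_stamp = pd.Series(
    list("abc"),
    [pd.Timestamp("2016-09-01"), pd.Timestamp("2016-09-02"), pd.Timestamp("2016-09-03")],
)

# A Series indexed by Periods gets a PeriodIndex.
by_period = pd.Series(
    list("def"),
    [pd.Period("2016-09"), pd.Period("2016-10"), pd.Period("2016-11")],
)

print(stamp, january, march_5th)
print(type(by_stamp.index).__name__, type(by_period.index).__name__)
```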
Another useful thing we can do is change the frequency of the dates in our DataFrame
using asfreq. If we use this to change the frequency from bi-weekly to weekly, we will end up
with missing values every other week. So, let us use the forward fill method on those missing
values. One last thing I wanted to briefly touch upon is plotting time series. Importing
matplotlib.pyplot and using the IPython magic %matplotlib inline will allow you to visualize
the time series in the notebook.
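The frequency change can be sketched as below, assuming a made-up bi-weekly count series that starts on Sunday 2 October 2016. Plotting is left out so that the example runs without a notebook.

```python
import pandas as pd

# Four bi-weekly observations, each falling on a Sunday.
dates = pd.date_range("2016-10-02", periods=4, freq="2W-SUN")
counts = pd.DataFrame({"Count": [100, 110, 120, 130]}, index=dates)

# Converting to weekly frequency inserts rows for the missing weeks (NaN values).
weekly = counts.asfreq("W-SUN")

# Forward fill copies the last known value into each inserted week.
weekly_filled = counts.asfreq("W-SUN", method="ffill")

print(weekly)
print(weekly_filled)
```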
CHAPTER 4: COVID-19 DATA ANALYSIS
ABSTRACT
This project is the amalgamation of two guided projects successfully completed on
the Coursera Project Network, namely “COVID-19 Data Analysis Using Python”
and “COVID-19 Data Visualization using Python”. The concepts taught
in these two projects form the basis of this combined project, which applies
them to the data sets of India and the world to draw comparisons that are
relevant to the current time.
4.1 PROBLEM STATEMENT:
To analyse the spread of COVID-19 in the world and compare the spread in New
Zealand and India using Python and associated libraries and modules.
4.2 ABOUT THE DATASET:
The World dataset contains the daily total of cases in different countries, along with
the total number of deaths and recovered cases.
The dataset for India has been taken from the Ministry of Health and Family Welfare.
It comprises a daily total of cases in different states/UTs, along with the total number
of deaths and recovered cases.
4.3 PROCESSING THE DATASET:
4.3.1 WORLD DATASET:

The countries have been taken into consideration from the period when there
was at least 1 active case.
For further country-wise analysis, new data frames have been created with the
rows and columns corresponding to the selected country (India/New Zealand).
New columns have been created to calculate the infection, death, and recovery
rates.
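The processing steps above might look roughly like the sketch below. It is not the project's actual code: the column names, the sample numbers, the helper country_frame, the use of the confirmed count to decide when a country enters the analysis, and the definition of each rate as the day-over-day change in the cumulative total are all assumptions made for illustration.

```python
import pandas as pd

# Made-up cumulative totals for two countries over four days.
world = pd.DataFrame({
    "Date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03", "2020-03-04"] * 2),
    "Country": ["India"] * 4 + ["New Zealand"] * 4,
    "Confirmed": [0, 3, 5, 10, 0, 0, 1, 4],
    "Deaths": [0, 0, 0, 1, 0, 0, 0, 0],
    "Recovered": [0, 0, 1, 2, 0, 0, 0, 1],
})


def country_frame(data, country):
    """Select one country from its first recorded case and add rate columns."""
    mask = (data["Country"] == country) & (data["Confirmed"] > 0)
    frame = data[mask].copy()
    # Assumed definition: rate = change in the cumulative total since the previous day.
    frame["Infection Rate"] = frame["Confirmed"].diff()
    frame["Death Rate"] = frame["Deaths"].diff()
    frame["Recovery Rate"] = frame["Recovered"].diff()
    return frame


india = country_frame(world, "India")
new_zealand = country_frame(world, "New Zealand")
print(india)
```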
4.3.2 INDIA DATASET:

A similar approach has been followed, and state-level data has been filtered
with Maharashtra and Delhi as the selected cases.
4.4 FLOW DIAGRAM:
Fig 4.1 Flow Diagram of the Project
4.5 SOURCE CODES, OUTPUTS AND ANALYSIS:
Fig 4.2 Importing required modules
4.5.1 WORLD ANALYSIS
Fig 4.3 Loading World Dataset
Fig 4.4 Generating Global Spread of Covid-19
Fig 4.5 Country-wise total confirmed cases viewed using choropleth module of pyplot library (2020-04-03)
Fig 4.6 Country-wise total confirmed cases viewed using choropleth

From the above figures, it can be seen that the initial spread was mainly
in China, the USA, Russia, and Italy.
China, Russia, and Italy managed to curb their numbers, whereas in India and
the USA the cases grew to 6M.
Fig 4.7 Generating Global Deaths of Covid-19
Fig 4.8 Country-wise total deaths viewed using choropleth module of pyplot library (2020-09-16)
Fig 4.9 Generating Global Recovered Cases from Covid-19
Fig 4.10 Country-wise total recovered cases viewed using choropleth

4.5.2 ANALYSIS FOR NEW ZEALAND
Fig 4.11 Processing New Zealand Data
Fig 4.12 Setting Variables for Lockdown dates in New Zealand
Fig 4.13 Generating the curve for New Zealand
Fig 4.14 Infection Rate curve along with dates of different levels of
lockdown in New Zealand viewed on a line graph using pyplot library
Level 1 of the lockdown (the lowest level) was implemented only after the
first wave had died down. When new cases emerged in Auckland, different levels
of lockdown were applied, and the second wave was limited to a
maximum rate of 20 new cases in a day.
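One way to relate the infection rate to the lockdown levels is to label every day with the level in force and then take the peak rate within each level. The following is a sketch under stated assumptions: the start dates are placeholders rather than the actual New Zealand dates, the daily figures are invented, and the column names are hypothetical.

```python
import pandas as pd

days = pd.date_range("2020-03-01", periods=12, freq="D")

# Placeholder start dates for each level (illustrative, not the real dates).
level_starts = pd.DataFrame({
    "Date": days[[0, 4, 8]],
    "Level": ["Level 4", "Level 3", "Level 1"],
})

# Hypothetical daily new cases (the "infection rate" column).
nz = pd.DataFrame({
    "Date": days,
    "Infection Rate": [5, 9, 7, 4, 3, 2, 1, 1, 0, 0, 2, 1],
})

# merge_asof attaches to each day the most recent level that started
# on or before that day (both frames must be sorted by Date).
labelled = pd.merge_asof(nz, level_starts, on="Date")

# Peak daily new cases observed while each level was in force.
peak_by_level = labelled.groupby("Level")["Infection Rate"].max()
```

The same labelled frame could also be used to draw the vertical lockdown markers on the line graph.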
4.5.3 ANALYSIS FOR INDIA
Fig 4.15 Processing India Data
Fig 4.16 Setting Variables for Different Phases of Lockdown in India
Fig 4.17 Generating the curve for India
Fig 4.18 Infection and Recovery Rate curves along with dates of
different phases of lockdown and un-lockdown in India viewed on a line
graph using pyplot library
Lockdown was implemented in 4 different phases.
When the recovery rate became greater than the infection rate, the phases of
unlock were started.
Un-Lockdown 5 was implemented after the recovery rate had been
consistently greater than the infection rate and the wave was steadily
subsiding.
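The condition "recovery rate consistently greater than infection rate" can be checked programmatically. The sketch below is only an illustration: the figures are invented, the column names are hypothetical, and "consistently" is assumed to mean every day in a trailing 3-day window, which is not a definition taken from the report.

```python
import pandas as pd

# Hypothetical daily rates (new cases and new recoveries per day); not real data.
india = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=10, freq="D"),
    "Infection Rate": [90, 95, 97, 92, 88, 85, 80, 78, 75, 70],
    "Recovery Rate": [70, 80, 99, 90, 91, 89, 86, 84, 80, 77],
}).set_index("Date")

# True on days when more people recovered than were newly infected.
ahead = india["Recovery Rate"] > india["Infection Rate"]

# Assumption: "consistently" means every day in a trailing window of 3 days.
window = 3
consistent = ahead.astype(int).rolling(window).sum() == window

# First date on which recoveries have led infections for the whole window.
first_consistent = consistent[consistent].index.min()
```

A single day on which recoveries exceed infections (the third day in this toy data) does not satisfy the condition; only a full window does.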
4.5.3.1 NATIONAL ANALYSIS
Fig 4.19 Loading India Dataset
Fig 4.20 Generating National Spread of Covid-19
Fig 4.21 State-wise total confirmed cases viewed using choropleth

Fig 4.22 Generating National Deaths due to Covid-19
Fig 4.23 State-wise total deaths viewed using choropleth module of pyplot library (2020-09-16)
Fig 4.24 Calculating Maximum Infection Rates for all States/UT
Fig 4.25 Plotting Bar graph for maximum infection rates
Fig 4.26 State-wise maximum infection rate viewed on a bar graph using pyplot library
Maharashtra has the highest maximum infection rate, at 24.8k new cases in a day.
From the above figures, it is evident that the largest number of cases in India
is from Maharashtra.
Looking deeper, most of these cases came from Mumbai, due to the spread in
Dharavi, one of Asia's largest slums.
The national capital, Delhi, also contributed significantly, recording a
high number of cases despite its small area.
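The state-wise maximum infection rate discussed above can be computed with a grouped difference followed by a grouped maximum. This is a minimal sketch, assuming the infection rate is the daily change in cumulative confirmed cases. The column name `State/UT` and the numbers are hypothetical.

```python
import pandas as pd

# Hypothetical cumulative confirmed totals for three states (not real figures).
india = pd.DataFrame({
    "Date": pd.to_datetime(["2020-06-01", "2020-06-02", "2020-06-03"] * 3),
    "State/UT": ["Maharashtra"] * 3 + ["Delhi"] * 3 + ["Kerala"] * 3,
    "Confirmed": [100, 160, 250, 50, 90, 120, 10, 12, 20],
})

india = india.sort_values(["State/UT", "Date"])

# Daily new cases within each state: first difference of the cumulative total.
india["Infection Rate"] = india.groupby("State/UT")["Confirmed"].diff()

# Largest single-day rise per state, highest first (ready for a bar graph).
max_rates = (
    india.groupby("State/UT")["Infection Rate"].max().sort_values(ascending=False)
)
```

Grouping before calling `diff()` matters: without it, the first day of one state would be differenced against the last day of the previous state.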
4.5.3.2 ANALYSIS FOR DELHI
Fig 4.27 Creating new data frame for Delhi
Fig 4.28 Normalising Data
Plotting normalised values of Infection and Death rates of Delhi
Fig 4.29 Normalised values of Infection and Death rates of Delhi viewed on a line graph using Pyplot library
Fig 4.30 Plotting normalised values of Infection Rate and Confirmed Cases in Delhi
Fig 4.31 Normalised values of Infection Rate and Confirmed Cases in Delhi plotted on a line graph using pyplot library
Delhi's Infection Rate has a bimodal curve (two peaks).
The first peak can be seen around 8 July 2020. The infection rate then
decreased until about 27 July 2020.
With the new phase of un-lockdown, the cases then rose and hit a new
peak around 20 September 2020.
As the Infection Rate increased, a similar trend can be seen in the
Death Rate curve.
The Death Rate hit its peak around 16 June 2020.
The impact of the infection rate can be seen on the confirmed cases, since
the infection rate is the slope of the confirmed-cases curve.
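The relationship between the two curves, and the normalisation that lets them share one axis, can be illustrated as follows. The report does not state which normalisation was used, so min-max scaling is an assumption here, as are the helper `min_max` and the toy numbers.

```python
import pandas as pd

# Hypothetical cumulative confirmed cases for one region (not real figures).
confirmed = pd.Series(
    [10, 30, 70, 150, 210, 240, 250],
    index=pd.date_range("2020-01-01", periods=7, freq="D"),
    name="Confirmed",
)

# Infection rate as the day-to-day slope of the cumulative curve.
infection_rate = confirmed.diff().fillna(0)


def min_max(series):
    """Rescale a series linearly so its minimum maps to 0 and its maximum to 1."""
    return (series - series.min()) / (series.max() - series.min())


# Both curves now lie in [0, 1] and can be drawn on a single line graph.
normalised = pd.DataFrame({
    "Confirmed": min_max(confirmed),
    "Infection Rate": min_max(infection_rate),
})
```

In this toy series the cumulative curve is steepest on the day the infection rate peaks, and it flattens as the rate falls, which is the effect described above.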
4.5.3.3 ANALYSIS FOR MAHARASHTRA
Fig 4.32 Creating new data frame for Maharashtra
Fig 4.33 Normalising Data
Fig 4.34 Plotting normalised values of Infection and Death rates of Maharashtra
Fig 4.35 Plotting values of Infection Rate and Confirmed Cases in Maharashtra
Fig 4.36 Normalised values of Infection and Death rates of Maharashtra viewed on a line graph using Pyplot library
Fig 4.37 Normalised values of Infection Rate and Confirmed Cases in Maharashtra plotted on a line graph using pyplot library
Maharashtra's Infection Rate has a unimodal curve (single peak).
The maximum number of fresh cases was observed around 16 September 2020.
As the Infection Rate increased, a similar trend can be seen in the
Death Rate curve.
The Death Rate hit its peak around 20 June 2020.
The impact of the infection rate can be seen on the confirmed cases, since
the infection rate is the slope of the confirmed-cases curve.
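Whether a rate curve is unimodal or bimodal can also be checked numerically rather than by eye. The sketch below counts local maxima of a lightly smoothed series. The helper `local_peaks`, the window size and the two toy series are illustrative assumptions, and real daily data would likely need a wider window to suppress noise.

```python
import pandas as pd


def local_peaks(rate, window=3):
    """Return dates where the smoothed rate is higher than both neighbours."""
    smooth = rate.rolling(window, center=True, min_periods=1).mean()
    is_peak = (smooth > smooth.shift(1)) & (smooth > smooth.shift(-1))
    return smooth[is_peak].index


dates = pd.date_range("2020-01-01", periods=11, freq="D")

# Hypothetical daily rates: one with a single peak, one with two peaks.
one_peak = pd.Series([1, 2, 4, 7, 9, 10, 9, 7, 4, 2, 1], index=dates)
two_peaks = pd.Series([1, 4, 8, 4, 2, 1, 2, 5, 9, 5, 2], index=dates)

unimodal_dates = local_peaks(one_peak)   # expected to contain one date
bimodal_dates = local_peaks(two_peaks)   # expected to contain two dates
```

The first and last days can never be flagged, because they have only one neighbour to compare against.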
CONCLUSION
The project gives an insight into the COVID-19 pandemic worldwide and in
New Zealand and India. It also sheds light on the governments' decisions about
the phases of lockdown and un-lockdown and the basis for them. Pandas, NumPy,
matplotlib and pyplot have been heavily used in the project. In this data,
first-world countries such as the USA and Italy appear to have been hit harder
by the virus than many other countries. It can also be seen that the infection
rates for the countries studied have started to fall, suggesting that the
pandemic has begun to subside.
MAHARAJA SURAJMAL INSTITUTE OF TECHNOLOGY
Summer/Industrial Training Evaluation Form F05 (MSIT-EXM-P-02)
(Year 2020 -- 2021)

Details of the Student
Name: Lakshya Kharayat
Roll No: 01296302818
Branch and Semester: ECE 5th Sem
Mobile No: 9958846298
E-mail ID: cartoonshow57@gmail.com

Details of the Organisation
Name and address of organization: Coursera Inc. (Online), 381 E. Evelyn Ave, Mountain View, CA 94041
Broader Area: Data Science with Python
Name of Instructor: Christopher Brooks
Designation and Contact No: Associate Professor (University of Michigan)

Student Performance Record
Week | No. of days scheduled for the training | No. of days student actually attended | Curriculum scheduled for the student | Curriculum actually covered by the student
Week 1 | 5 | 5 | Python Fundamentals | Python Fundamentals
Week 2 | 5 | 5 | Basic Data Processing with Pandas | Basic Data Processing with Pandas
Week 3 | 5 | 5 | Advanced Python Pandas | Advanced Python Pandas
Week 4 | 5 | 5 | Statistical Analysis in Python and Project | Statistical Analysis in Python and Project
Week 5 | - | - | - | -
Week 6 | - | - | - | -

(Signature of the student)
Any comments or suggestions for the student performance during the training program (to be filled by
instructor): ……………………………………………………………………………………………………………
(Signature of the Instructor) along with Seal
Note: Every student has to fill in and submit this proforma, duly signed by his/her instructor, to the
faculty-in-charge by the first week of September.