
2 Data Analysis With Python

Workshop | Learning Material


By Maaheen & Megna
Data Analytics
◻ Data Analytics consists of:
🞑 Data
🞑 Information generated from the data
🞑 Statistical Analysis
🞑 Quantitative Methods
🞑 Modelling

For stakeholders to gain insights about their problem statement

©2021 by Megna Roy and Maaheen Jaiswal. All rights reserved. No part of this document may be reproduced or
transmitted in any form or by any means, electronic or otherwise, without prior written permission of the owners.
Types of Analytics

Descriptive Analytics

Prescriptive Analytics

Predictive Analytics
Predicting on the data

Categories of Data
◻ There are 4 types of data:
🞑 Categorical data
🞑 Ordinal data
🞑 Interval data
🞑 Ratio data

Categorical data
1. Categorical data is information that has been classified into
groups. For example, when a company or organization collects
biodata on its employees, much of the resulting information is
categorical.
2. It is called categorical because it can be grouped by the
attributes present in the biodata, such as gender, state of
residence, and so on.
3. Categorical data can be numerically coded (for example, 0 and 1
for two groups), but the numbers act only as labels and have no
arithmetic meaning.
4. There are 2 types: nominal (names categories, with no order or
numeric value) and ordinal (categories with a set order or scale).
Ordinal data
1. Ordinal data is a type of categorical data that has a
fixed order or scale.
2. Ordinal data is gathered, for example, when a
respondent rates his or her financial happiness on a
scale of 1 to 10.
3. In ordinal data the order of the scores is meaningful,
but there is no standard scale for the difference
between them: the gap between 1 and 2 need not equal
the gap between 9 and 10.
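
The difference between nominal and ordinal data shows up in which operations make sense. The sketch below uses only the Python standard library; the survey values and the names `states`, `satisfaction`, and `order` are hypothetical and chosen purely for illustration.

```python
from collections import Counter
from statistics import median_low

# Hypothetical survey responses (illustrative values only).
states = ["Goa", "Kerala", "Goa", "Punjab", "Goa"]          # nominal: no order
satisfaction = ["low", "high", "medium", "medium", "high"]  # ordinal: ordered

# Nominal data: counting and the mode are meaningful; ranking is not.
state_counts = Counter(states)
most_common_state = state_counts.most_common(1)[0][0]

# Ordinal data: an explicit order lets us sort and pick a median category.
order = ["low", "medium", "high"]
rank = {label: i for i, label in enumerate(order)}
sorted_satisfaction = sorted(satisfaction, key=rank.get)
median_satisfaction = order[median_low(rank[s] for s in satisfaction)]

print(most_common_state)
print(sorted_satisfaction)
print(median_satisfaction)
```

A mean is deliberately not computed for `satisfaction`: since the gaps between ordinal categories are not known to be equal, averaging the ranks would assume more than the data provides.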

Interval data
1. An interval scale has order, and the difference
between two values is meaningful. It has no true
zero point.
2. Temperature (Fahrenheit), temperature
(Celsius), pH, SAT score (200-800), and credit
score (300-850) are examples of interval
variables.

Ratio data
1. The measurement of heights is an excellent
example of ratio data. The height of a person
can be measured in centimeters, meters,
inches, or feet.
2. There is no such thing as a negative height.
3. Unlike interval data, ratio data has a true
zero: a temperature can be –10°C, but a height
cannot be negative, as stated above.
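
A short arithmetic sketch of why ratios are meaningful for ratio data but not for interval data. The heights and temperatures are made-up values, and `celsius_to_kelvin` is a helper introduced here only for the illustration.

```python
import math


def celsius_to_kelvin(c):
    """Convert Celsius (interval scale) to Kelvin (ratio scale)."""
    return c + 273.15


# Ratio data: height has a true zero, so "twice as tall" is meaningful
# and does not depend on the unit of measurement.
height_a_cm, height_b_cm = 180.0, 90.0
ratio_cm = height_a_cm / height_b_cm
ratio_in = (height_a_cm / 2.54) / (height_b_cm / 2.54)

# Interval data: differences in Celsius are meaningful...
temp_a_c, temp_b_c = 20.0, 10.0
difference_c = temp_a_c - temp_b_c

# ...but the ratio depends on where the arbitrary zero sits.
naive_ratio = temp_a_c / temp_b_c
kelvin_ratio = celsius_to_kelvin(temp_a_c) / celsius_to_kelvin(temp_b_c)

print(ratio_cm, ratio_in)
print(difference_c)
print(naive_ratio, round(kelvin_ratio, 3))
```

The height ratio is the same in centimeters and inches, while the temperature "ratio" changes as soon as the zero point moves, which is why 20°C should not be described as twice as hot as 10°C.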
Data example-

What is a Model?
1. A model is an abstraction or representation
of a real system, concept, or object.
2. It captures the most important aspects of
what it represents.
3. A model may take the form of a written or
verbal description, a visual display, a
mathematical formula, or a spreadsheet.
What is a Decision Model?
A decision model is a model used to understand,
analyze, or facilitate decision-making.
Once built, the model can be used to make predictions.
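
As one possible illustration, a decision model can be as small as a single formula. The sketch below is an assumption made for this example, not something taken from the workshop: a hypothetical `profit` function, with made-up prices, quantities, and costs, used to compare two pricing options.

```python
def profit(price, unit_cost, fixed_cost, quantity):
    """Hypothetical profit model: margin per unit times quantity, minus fixed cost."""
    return (price - unit_cost) * quantity - fixed_cost


# Illustrative inputs: each option is (price, expected quantity sold).
options = {"low price": (8.0, 1200), "high price": (10.0, 900)}
unit_cost, fixed_cost = 5.0, 2000.0

# Use the model to predict the outcome of each option.
predicted = {
    name: profit(price, unit_cost, fixed_cost, quantity)
    for name, (price, quantity) in options.items()
}
best_option = max(predicted, key=predicted.get)

print(predicted)
print(best_option)
```

The formula is the model; evaluating it for each option is the prediction that supports the decision.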

Steps to Problem Solving-
1. Business Understanding- Understand the industry and the project requirements, get an idea of the data to be
used, and draw up a preliminary plan.
2. Understanding the data- Identify the relevant data from the many sources (internal database and
external sources), understand the attributes, dependent/independent variables, distribution of the data, and
issues with the data; use graphical & statistical techniques to understand the data.
3. Data preprocessing- important for the model's accuracy.
🞑 Data consolidation – combine data from multiple sources
🞑 Data cleaning – handling missing data, noisy data, data inconsistency
🞑 Data transformation – normalization, discretization, aggregation, construct new variables,
etc
🞑 Data reduction – reduce dimension of data, remove records/columns, etc
4. Modelling- Build and compare different models by their scores, split your data into separate training and
test sets, and evaluate each model on different metrics (confusion matrix, precision, recall, F-score, etc.).
Go back and try different metrics to check the model's performance.
5. Deployment- Depending on the requirements, the deployment phase can be:
🞑 As simple as generating a report
🞑 Or as complex as implementing a system that uses the model for daily operations
Monitoring and maintenance of models
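
Two of the steps above can be sketched with the standard library alone: data cleaning and transformation from the preprocessing step, and metric calculation from the modelling step. All values (`ages`, `y_true`, `y_pred`) are hypothetical, and mean imputation and min-max scaling are just one reasonable choice among many.

```python
from statistics import mean

# --- Data preprocessing: cleaning and transformation ---
# Hypothetical feature column with one missing value (None).
ages = [20, 30, None, 40, 50]

# Cleaning: replace the missing value with the mean of the observed values.
observed = [a for a in ages if a is not None]
fill_value = mean(observed)
ages_clean = [fill_value if a is None else a for a in ages]

# Transformation: min-max normalization to the range [0, 1].
lo, hi = min(ages_clean), max(ages_clean)
ages_scaled = [(a - lo) / (hi - lo) for a in ages_clean]

# --- Modelling: evaluation metrics ---
# Hypothetical true labels and model predictions for a binary classifier.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Confusion matrix entries.
pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

precision = tp / (tp + fp)  # of predicted positives, how many were right
recall = tp / (tp + fn)     # of actual positives, how many were found
f_score = 2 * precision * recall / (precision + recall)

print(ages_scaled)
print(tp, tn, fp, fn)
print(precision, recall, f_score)
```

In practice a library such as scikit-learn is commonly used for these calculations; writing them out by hand here only shows what the metrics mean.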
Types of plots for EDA-
1. Matplotlib: Bar graph

It is a type of visualization that represents categorical data. It has rectangular bars (hence the name,
bar graph) that can be drawn horizontally or vertically.
Syntax- plt.bar(x, height), where x holds the categories (or their index positions) and height holds the values
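
A minimal sketch of a bar graph, assuming Matplotlib is installed. The department names and the output file name `bar_chart.png` are hypothetical.

```python
from collections import Counter

import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line to open a window
import matplotlib.pyplot as plt

# Hypothetical categorical data (illustrative values only).
departments = ["Sales", "HR", "Sales", "IT", "IT", "Sales"]
counts = Counter(departments)  # category -> frequency

categories = list(counts.keys())
frequencies = list(counts.values())

plt.figure()
bars = plt.bar(categories, frequencies)  # plt.barh(...) draws horizontal bars
plt.xlabel("Department")
plt.ylabel("Number of employees")
plt.title("Employees per department")
plt.savefig("bar_chart.png")
```

Each category gets one bar whose height is its frequency, so the chart answers "how many in each group" at a glance.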

2. Matplotlib: Pie chart

Also known as a circle chart or Emma chart, it is used to represent proportions of data. The central angle and the
area of each slice are proportional to the quantity it represents. It is called a pie chart because it
resembles a sliced pie.
Usually the data shown in a pie chart are in percentages.
Syntax- plt.pie(x, explode=None, labels=None, colors=None, autopct=None, pctdistance=0.6, shadow=False,
labeldistance=1.1, startangle=None, radius=None, counterclock=True, wedgeprops=None, textprops=None,
center=(0, 0), frame=False, rotatelabels=False, *, data=None)
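
A minimal sketch of a pie chart, assuming Matplotlib is installed. The product shares, labels, and the output file name `pie_chart.png` are hypothetical; only a few of the parameters listed in the syntax are used.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line to open a window
import matplotlib.pyplot as plt

# Hypothetical market shares (illustrative values only).
shares = [45, 30, 25]
labels = ["Product A", "Product B", "Product C"]

# The central angle of each slice is proportional to its share of the total.
angles = [360 * s / sum(shares) for s in shares]

plt.figure()
plt.pie(shares, labels=labels, autopct="%1.1f%%", startangle=90)
plt.axis("equal")  # keep the pie circular
plt.title("Market share by product")
ax = plt.gca()
plt.savefig("pie_chart.png")

print(angles)
```

`autopct` formats the percentage printed on each slice, and `startangle=90` rotates the chart so the first slice starts at the top.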
3. Matplotlib: Scatter plot
It helps visualize the relationship between 2 or more variables. It also helps in identifying
outliers (abnormalities).

Syntax- plt.scatter(x, y, c=colors, alpha=0.5)
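
A minimal sketch of a scatter plot, assuming Matplotlib is installed. The study-hours data, the hand-picked outlier, and the output file name `scatter_plot.png` are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line to open a window
import matplotlib.pyplot as plt

# Hypothetical data: hours studied vs. exam score (illustrative values only).
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 64, 70, 30, 79, 83]

# The point at index 5 breaks the upward trend; it is marked by hand
# here in a different color to show how an outlier stands out.
outlier_index = 5
colors = ["tab:red" if i == outlier_index else "tab:blue" for i in range(len(hours))]

plt.figure()
points = plt.scatter(hours, scores, c=colors, alpha=0.5)
plt.xlabel("Hours studied")
plt.ylabel("Exam score")
plt.title("Exam score vs. hours studied")
plt.savefig("scatter_plot.png")
```

`c` sets the color of each point and `alpha` its transparency, which helps when many points overlap.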

4. Matplotlib: Histogram
Histograms represent grouped data. The X-axis shows the value ranges (bins) and the Y-axis shows the
frequencies. Strictly, the frequency is represented by the area of each bar, which matches its height only when
all bins have equal width.
Syntax- plt.hist(x, bins=n_bins)
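
A minimal sketch of a histogram, assuming Matplotlib is installed. The ages and the output file name `histogram.png` are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line to open a window
import matplotlib.pyplot as plt

# Hypothetical ages of respondents (illustrative values only).
values = [22, 25, 27, 31, 33, 35, 36, 38, 41, 44, 45, 47, 52, 58, 63]
n_bins = 5

plt.figure()
# plt.hist returns the count per bin, the bin edges, and the drawn bars.
counts, bin_edges, patches = plt.hist(values, bins=n_bins)
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.title("Distribution of ages")
plt.savefig("histogram.png")

print(counts)
print(bin_edges)
```

With an integer `bins`, the bins have equal width, so bar height and bar area tell the same story; every value falls into exactly one bin, so the counts add up to the number of values.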

Thank You

