
ZG 536

Foundations of Data Science


Pravin Mhaske
BITS Pilani, Pilani Campus

M2 Data Science Foundations


Lecture 4: Descriptive Analytics
Data Visualizations
Storytelling with data

Types of Analytics

• Descriptive analytics: "What happened?" Summarizes historical data.
• Diagnostic analytics: "Why did it happen?" Analyzes the reasons behind the trends and patterns identified.
• Predictive analytics: "What will happen?" Uses statistical models and machine learning to predict future outcomes.
• Prescriptive analytics: "What to do?" Recommends what to do next based on insights from historical and predictive data.

BITS Pilani, Pilani Campus


Descriptive Analytics and Descriptive Statistics

Statistical interpretation used to analyze historical data to identify patterns and relationships.

Standalone applications: charts, tables, figures, plots, reports, dashboards

As an input to predictive analytics: Exploratory Data Analysis



Descriptive Statistics

• Measures of Central Tendency


– Mathematical averages – Mean (Arithmetic, Harmonic, Geometric…)
– Positional averages – Median, Mode, Quartiles, Percentiles

• Measures of Dispersion
– Range
– Standard Deviation/Variance
– IQR
– Coefficient of Variation

• Measures of Association
– Covariance
– Correlation
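
As a small illustration of the measures of central tendency listed above, the following sketch uses Python's standard `statistics` module. The data values are made up for the example, and the quartile and percentile values depend on the interpolation convention (here the module's default), so other tools may report slightly different cut points.

```python
import statistics

# Made-up sample, chosen only for illustration.
data = [2, 4, 4, 8, 16]

# Mathematical averages
arithmetic_mean = statistics.mean(data)
harmonic_mean = statistics.harmonic_mean(data)
geometric_mean = statistics.geometric_mean(data)

# Positional averages
median = statistics.median(data)
mode = statistics.mode(data)
quartiles = statistics.quantiles(data, n=4)      # [Q1, Q2, Q3]
percentiles = statistics.quantiles(data, n=100)  # 99 cut points: P1 ... P99

print("arithmetic mean:", arithmetic_mean)
print("harmonic mean:  ", harmonic_mean)
print("geometric mean: ", geometric_mean)
print("median:", median, "mode:", mode)
print("quartiles:", quartiles)
```

For positive data the harmonic mean never exceeds the geometric mean, which never exceeds the arithmetic mean, which is a quick sanity check on the output.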



Measures of Dispersion

• Dispersion indicates the variability, scatter, or spread of data; it shows how stretched or squeezed the underlying distribution is.
• Extremely useful in a number of practical applications: quality control, reliability analysis, banking, insurance, portfolio management, etc.
• Common measures: range, interquartile range, variance, standard deviation, and coefficient of variation.



Measures of Dispersion

Variance/standard deviation
• Numerous applications in descriptive statistics, statistical inference, hypothesis testing, Monte Carlo
simulation, analysis of variance.
• Wide applications in physics, biology, chemistry, economics, and finance.
Sample standard deviation: $s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
Coefficient of Variation (CV)
• A relative measure of dispersion.
• It has enormous applications in quality assurance studies.
• Useful in comparing dispersions of two distributions having different measurement units.
Coefficient of variation for sample data: $CV = \dfrac{s}{\bar{x}}$

Coefficient of variation for population: $CV = \dfrac{\sigma}{\mu}$
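
The sample standard deviation and coefficient of variation can be sketched in a few lines of Python. The function names `sample_std` and `coefficient_of_variation` and both data sets are hypothetical, chosen to show how CV allows a comparison between variables measured in different units.

```python
import math
import statistics

def sample_std(values):
    """Sample standard deviation with the n - 1 divisor."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))

def coefficient_of_variation(values):
    """CV = s / x-bar, a unit-free measure of relative dispersion."""
    return sample_std(values) / (sum(values) / len(values))

# Made-up measurements in different units.
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [55, 60, 70, 80, 85]

print("s (heights):", sample_std(heights_cm))
print("s (weights):", sample_std(weights_kg))
print("CV (heights):", coefficient_of_variation(heights_cm))
print("CV (weights):", coefficient_of_variation(weights_kg))

# The hand-written version agrees with the standard library.
assert math.isclose(sample_std(heights_cm), statistics.stdev(heights_cm))
```

The two standard deviations cannot be compared directly because one is in centimetres and the other in kilograms; the two CV values can, and in this made-up data the weights are relatively more dispersed than the heights.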

Measures of Dispersion

• Range: difference between the largest and the smallest values in a dataset.
• Interquartile range: difference between the third (upper) quartile and first (lower)
quartile. IQR = Q3 – Q1
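
A minimal sketch of both measures, using made-up data and Python's `statistics.quantiles`. Quartiles can be computed under several conventions; this example relies on the module's default method, so other software may give slightly different Q1 and Q3 for the same data.

```python
import statistics

# Made-up data for illustration.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# Range: largest value minus smallest value.
data_range = max(data) - min(data)

# Interquartile range: Q3 - Q1.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print("range:", data_range)
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
```

The range reacts to a single extreme value, while the IQR only describes the middle half of the data and is therefore less sensitive to extremes.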



Box and whisker plot, Five number summary

• The box-and-whisker plot (box plot) is a graphical representation of a set of observations, based on the five-number summary.
• It is a very useful tool for detecting outliers and for summarizing the distribution of data.
• Five-number summary:
– Min
– Q1
– Q2 (Median)
– Q3
– Max
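
The five-number summary behind a box plot can be computed as in the sketch below. The helper name `five_number_summary` and the data are hypothetical, and the quartiles follow the default convention of `statistics.quantiles`, which is only one of several in use.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, q2, q3, max(values)

# Made-up observations for illustration.
observations = [3, 5, 7, 8, 9, 11, 13, 15, 18, 21, 40]

labels = ["Min", "Q1", "Q2 (Median)", "Q3", "Max"]
for label, value in zip(labels, five_number_summary(observations)):
    print(f"{label:12} {value}")
```

In a box plot, Q1 and Q3 form the edges of the box and the median is the line inside it.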



Outlier

• An outlier (spurious data point) is an observation that is distant from other observations.

• Outlier detection methods:
– Standardized values (z-scores)
– Using quartiles and IQR: find lower limit = Q1 – 1.5 × IQR and upper limit = Q3 + 1.5 × IQR. Data outside this range could be flagged as outliers.
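
Both detection methods can be sketched as follows. The function names, the data, and the z-score cutoff of 3 are assumptions made for this example (the slide does not specify a cutoff); the 1.5 × IQR multiplier is the one given above.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

def zscore_outliers(values, threshold=3.0):
    """Flag values whose |z-score| exceeds the threshold (assumed cutoff)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [v for v in values if abs((v - mean) / std) > threshold]

# Made-up data: twenty readings, one of them far from the rest.
readings = [10, 10, 11, 11, 12, 12, 12, 12, 12, 12,
            12, 13, 13, 13, 13, 14, 14, 15, 15, 102]

print("IQR rule:    ", iqr_outliers(readings))
print("z-score rule:", zscore_outliers(readings))
```

The z-score rule uses the mean and standard deviation, which are themselves pulled by extreme values, so in very small samples it can fail to flag an obvious outlier; the IQR rule is less affected by this.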



Skew and Kurtosis



Measures of Association
Covariance
• It is an absolute measure of how much two variables change together.
• The sign of the covariance shows the direction of the linear relationship between the variables. The magnitude of the covariance is hard to interpret because it depends on the units of the variables.
• If two variables tend to move in the same direction, the covariance is positive; if they tend to move in opposite directions, it is negative. Zero covariance implies the variables are not linearly related.

Sample covariance: $s_{xy} = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$

Population covariance: $\sigma_{xy} = \dfrac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$
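
A minimal sketch of the sample and population covariance in plain Python. The function name `covariance`, its `sample` flag, and the data are hypothetical and used only to illustrate the two divisors.

```python
def covariance(xs, ys, sample=True):
    """Covariance of two equal-length sequences.

    sample=True divides by n - 1 (sample covariance);
    sample=False divides by N (population covariance).
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross_products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross_products / (n - 1 if sample else n)

# Made-up paired observations.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]     # moves with x
y_down = [10, 8, 6, 4, 2]   # moves against x

print("sample covariance (x, y_up):    ", covariance(x, y_up))
print("population covariance (x, y_up):", covariance(x, y_up, sample=False))
print("sample covariance (x, y_down):  ", covariance(x, y_down))
```

Only the sign is easy to read here: rescaling either variable (for example, metres to centimetres) changes the magnitude of the covariance without changing the relationship.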
Measures of Association

Correlation
• Correlation is a normalized covariance. It lies between -1 and +1.
• It provides a measure of the linear relationship or association between two variables.
• If two variables tend to move in the same direction, the correlation is positive; if they tend to move in opposite directions, it is negative.

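Sample correlation: $r_{xy} = \dfrac{s_{xy}}{s_x s_y} = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$

Population correlation: $\rho_{xy} = \dfrac{\sigma_{xy}}{\sigma_x \sigma_y}$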
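
The sample correlation can be sketched directly from the sums of deviations. The function name `pearson_r` and the data are hypothetical; the example also rescales one variable to show that, unlike covariance, correlation does not depend on measurement units.

```python
import math

def pearson_r(xs, ys):
    """Sample (Pearson) correlation between two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / (math.sqrt(ss_x) * math.sqrt(ss_y))

# Made-up paired observations.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 70, 72]

r = pearson_r(hours, scores)
minutes = [h * 60 for h in hours]  # same variable, different unit
r_rescaled = pearson_r(minutes, scores)

print("r (hours, scores):  ", r)
print("r (minutes, scores):", r_rescaled)
```

The two printed values agree up to floating-point rounding, because the unit change scales the numerator and the denominator by the same factor.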



Correlation



Exploratory Data Analysis (EDA)

• An approach to analyzing data using visual techniques; it serves as an initial investigation.

Objectives:
1. Explore the data to become familiar with it
2. Discover patterns, trends, and relationships
3. Spot anomalies
4. Test hypotheses or assumptions
5. Summarize data
6. Identify missing/null values
7. Explain outcomes or results of analysis
8. Tell a story with data
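
Two of the objectives above, identifying missing values and summarizing data, can be sketched without any plotting library. The records, column names, and helper functions below are hypothetical; in practice a data-frame library would usually do this work.

```python
import statistics

# Made-up records; None marks a missing value.
records = [
    {"region": "North", "sales": 120.0},
    {"region": "South", "sales": None},
    {"region": None, "sales": 95.5},
    {"region": "West", "sales": 310.0},
]

def missing_counts(rows):
    """Count missing (None) values per column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (1 if value is None else 0)
    return counts

def summarize(rows, column):
    """Basic summary of a numeric column, ignoring missing values."""
    values = [row[column] for row in rows if row[column] is not None]
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }

print("missing values:", missing_counts(records))
print("sales summary: ", summarize(records, "sales"))
```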



Types of Visualization

Text:
• Simple text
• Tables
• Heatmap

Region   Q1 Sales   Q2 Sales   Q3 Sales   Q4 Sales
North       12000      23456      20000      12345
South       45678      12000      67890      12346
West        34567      12345      12000      45678
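
A heatmap encodes each cell of a table as a colour intensity so that high and low values stand out without reading every number. The sketch below imitates this in plain text for the sales table above, mapping each value to one of five ASCII shades. The shade characters and the linear min-max scaling are assumptions for illustration; a real heatmap would use a plotting or BI tool.

```python
# Five shade levels, from lowest to highest value (illustrative choice).
SHADES = " .:*#"

sales = {
    "North": [12000, 23456, 20000, 12345],
    "South": [45678, 12000, 67890, 12346],
    "West": [34567, 12345, 12000, 45678],
}

def shade(value, lo, hi):
    """Map a value to a shade using linear min-max scaling."""
    position = (value - lo) / (hi - lo)
    return SHADES[round(position * (len(SHADES) - 1))]

all_values = [v for row in sales.values() for v in row]
lo, hi = min(all_values), max(all_values)

heatmap = {
    region: "".join(shade(v, lo, hi) for v in row)
    for region, row in sales.items()
}

print("Region  Q1-Q4")
for region, cells in heatmap.items():
    print(f"{region:7} [{cells}]")
```

The darkest cell corresponds to South in Q3 (67890), the largest value in the table.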



Types of Visualization

• Graphs
◦ Points (Scatter)
◦ Lines
◦ SlopeGraph

[Figures: example scatter, line, and slopegraph charts. Surviving labels: Category 1 to Category 4; Series 1 to Series 3; Jan-18, Feb-18; North, South, East, West, Middle]



Types of Visualization

◦ Bars
◦ Horizontal / Vertical
◦ Stacked
◦ Waterfall
◦ Area

[Figures: example bar, stacked bar, and area charts. Surviving labels: South, North, East, West; Sales, Cost, Computer, Electronics; Series 1 to Series 3]



Types of Visualization

Univariate - distribution

Bivariate - relationships

• Categorical Vs Categorical
• Continuous Vs Continuous
• Continuous Vs Categorical

Multivariate



Visualization Cheat Sheet



Visualizations to be avoided

• Pie charts/Donut charts
• 3D charts
• Dual Axis charts

[Figures: example charts to avoid. Surviving labels: Sales; 1st Qtr to 4th Qtr; Series 1, Series 2]



Bad Design



Good Design



Decluttering



Storytelling with data



How to lie?



Visualization tools

• Excel

• Tableau

• Power BI

• Python (Matplotlib, seaborn, Bokeh)

• R/RStudio

• QlikView/Qlik Sense…
