ZG536 - L4 - Descriptive Analytics - 030224
• Measures of Dispersion
– Range
– Standard Deviation/Variance
– IQR
– Coefficient of Variation
• Measures of Association
– Covariance
– Correlation
Variance/standard deviation
• Numerous applications in descriptive statistics, statistical inference, hypothesis testing, Monte Carlo
simulation, analysis of variance.
• Wide applications in physics, biology, chemistry, economics, and finance.
Sample standard deviation: $s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
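The sample standard deviation formula above can be sketched in a few lines of Python. The function name `sample_std` and the data values are illustrative assumptions, not part of the lecture material.

```python
import math


def sample_std(values):
    """Sample standard deviation with the n - 1 denominator."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    # Sum of squared deviations from the sample mean.
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))


# Illustrative data (made up for this sketch).
data = [2, 4, 4, 4, 5, 5, 7, 9]
s = sample_std(data)
variance = s ** 2  # the sample variance is the square of s
print(f"s = {s:.4f}, variance = {variance:.4f}")
```

Python's standard library offers `statistics.stdev` for the same calculation; the explicit version is shown only to mirror the formula term by term.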
Coefficient of Variation (CV)
• A relative measure of dispersion.
• It is widely used in quality assurance studies.
• Useful in comparing dispersions of two distributions having different measurement units.
Coefficient of variation for sample data: $CV = \dfrac{s}{\bar{x}}$
Coefficient of variation for population: $CV = \dfrac{\sigma}{\mu}$
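As a rough illustration of comparing dispersion across different measurement units, the sketch below computes the sample CV for two made-up variables. The function name `coefficient_of_variation` and both data sets are assumptions introduced for this example.

```python
import statistics


def coefficient_of_variation(values):
    """Sample CV: sample standard deviation divided by the sample mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean


# Illustrative data in different units (made up for this sketch).
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [55, 60, 70, 80, 95]

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)
print(f"CV height = {cv_height:.3f}, CV weight = {cv_weight:.3f}")
```

Because the CV is unitless, the two values can be compared directly; here the weights are relatively more dispersed than the heights.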
BITS Pilani, Pilani Campus
Measures of Dispersion
• Range: difference between the largest and the smallest values in a dataset.
• Interquartile range: difference between the third (upper) quartile and first (lower)
quartile. IQR = Q3 – Q1
• An outlier (spurious data point) is an observation point that is distant from other
observations.
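A minimal sketch of these three ideas follows. Two assumptions are made that the slides do not state: quartiles are computed with the "inclusive" interpolation method of `statistics.quantiles` (other conventions give slightly different values), and outliers are flagged with the common 1.5 × IQR rule. The data values are made up.

```python
import statistics

# Illustrative data (made up); 50 sits far away from the rest.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]

# Range: largest value minus smallest value.
data_range = max(data) - min(data)

# Quartiles; the "inclusive" method is one of several conventions.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed rule: flag points more than 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"range = {data_range}, IQR = {iqr}, outliers = {outliers}")
```

Note how the single distant point inflates the range, while the IQR stays small because it depends only on the middle half of the data.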
Sample covariance: $s_{xy} = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
Population covariance: $\sigma_{xy} = \dfrac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$
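The two covariance definitions differ only in the denominator, which the following sketch makes explicit. The function name `covariance`, its `sample` flag, and the data are illustrative assumptions.

```python
def covariance(xs, ys, sample=True):
    """Covariance of paired data; n - 1 denominator if sample, else N."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of paired deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / (n - 1 if sample else n)


# Illustrative data (made up): y is exactly 2 * x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

s_xy = covariance(xs, ys)                    # sample covariance
sigma_xy = covariance(xs, ys, sample=False)  # population covariance
print(s_xy, sigma_xy)
```

The sign of the result indicates the direction of the joint movement, but the magnitude depends on the units of both variables, which is what motivates normalizing it.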
Measures of Association
Correlation
• Correlation is a normalized covariance. It lies between -1 and +1.
• It provides a measure of linear relationship or association between two variables.
• If two variables tend to move in the same direction, the correlation is positive; if they
tend to move in opposite directions, it is negative.
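Sample correlation: $r_{xy} = \dfrac{s_{xy}}{s_x s_y} = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$
Population correlation: $\rho_{xy} = \dfrac{\sigma_{xy}}{\sigma_x \sigma_y}$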
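The sample correlation can be sketched directly from the sums of deviations, since the n - 1 factors cancel between numerator and denominator. The function name `sample_correlation` and the data are illustrative assumptions.

```python
import math


def sample_correlation(xs, ys):
    """Pearson sample correlation r_xy computed from sums of deviations."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cross / math.sqrt(ss_x * ss_y)


# Illustrative data (made up for this sketch).
xs = [1, 2, 3, 4, 5]
r_up = sample_correlation(xs, [2, 4, 6, 8, 10])    # perfectly increasing line
r_down = sample_correlation(xs, [10, 8, 6, 4, 2])  # perfectly decreasing line
print(r_up, r_down)
```

A value near +1 or -1 indicates a strong linear relationship; a value near 0 indicates little linear association, although a nonlinear relationship may still exist.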
Objectives:
1. Explore data to become familiar with data
2. Discover patterns, trends, relationships
3. Spot anomalies
4. Test hypotheses or assumptions
5. Summarize data
6. Identify missing/null values
7. Explain outcomes or results of analysis
8. Tell a story with data
Text:
• Simple text
• Tables
• Heatmap
• Graphs
• Points (Scatter)
• Lines
• SlopeGraph
◦ Bars
◦ Horizontal / Vertical
◦ Stacked
◦ Waterfall
◦ Area
Univariate - distribution
Bivariate - relationships
• Categorical Vs Categorical
• Continuous Vs Continuous
• Continuous Vs Categorical
Multivariate
• Excel
• Tableau
• Power BI
• R/RStudio
• QlikView/Qlik Sense…