04 Exploratory Data Analysis
Outliers Detection using Box Plot
Measuring Relationship between Variables
Covariance
Pearson’s Correlation
Spearman’s Rank Correlation
Sasadhar Bera, IIM Ranchi
Box Plot
A box plot (also called a box-and-whisker plot) is a graphical
presentation of a data set that shows its central tendency,
spread, skewness, and the existence of outliers.
Box Plot (Contd.)
[Figure: anatomy of a box plot, showing the left and right whiskers, the lower and upper quartiles, and the outer fences at Q1 - 3(IQR) and Q3 + 3(IQR); panels illustrate a symmetric distribution and a negative (left) skewed distribution.]
Example1: Box Plot
Two different vendors supply a similar type of steel wire to a
rope manufacturer. Determine which vendor is better and
provides the more consistent product. The available sample
dataset is given below.
Obs. No.   Vendor   Wire Strength   Vendor   Wire Strength
1          A        346             B        470
2          A        338             B        573
3          A        323             B        520
4          A        438             B        425
5          A        398             B        382
6          .        .               .        .
7          .        .               .        .
8          .        .               .        .
9          .        .               .        .
10         .        .               .        .
11         A        368             B        526
12         A        376             B        371
13         A        311             B        452
14         A        379             B        300
15         A        216             B        598
Example1: Box Plot (Contd.)
Descriptive Statistics   Vendor A   Vendor B
Min.                     216        300
1st Qu.                  330.5      407
Median                   376        452
Mean                     375.5      466.7
Standard deviation       92.34      89.44
3rd Qu.                  400        523
Max.                     635        620
IQR                      69.5       116
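
As a rough check on the summary statistics above, the following Python sketch computes the fences for each vendor and tests whether the minimum or maximum falls outside them. The outer-fence multiplier of 3 follows the Q1 - 3(IQR) and Q3 + 3(IQR) definition given earlier; the inner-fence multiplier of 1.5 is the common textbook convention and is an assumption here, as are the names `stats`, `fences`, and `results`.

```python
# Summary statistics copied from the descriptive statistics table above.
stats = {
    "Vendor A": {"min": 216, "q1": 330.5, "q3": 400, "max": 635},
    "Vendor B": {"min": 300, "q1": 407, "q3": 523, "max": 620},
}


def fences(q1, q3, k):
    """Return the (lower, upper) fences placed k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


results = {}
for vendor, s in stats.items():
    inner = fences(s["q1"], s["q3"], 1.5)  # assumed convention, not from the slides
    outer = fences(s["q1"], s["q3"], 3.0)  # outer fence as defined on the slides
    results[vendor] = {
        "iqr": s["q3"] - s["q1"],
        "inner": inner,
        "outer": outer,
        # True when the smallest or largest observation lies outside the fence
        "beyond_inner": s["min"] < inner[0] or s["max"] > inner[1],
        "beyond_outer": s["min"] < outer[0] or s["max"] > outer[1],
    }
    print(vendor, results[vendor])
```

Under these assumptions, Vendor A's extreme values fall outside the fences while Vendor B's do not. Only the minimum and maximum are checked, because the full data set is not listed.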
Example1: Box Plot (Contd.)
[Figure: box plots of wire strength for the two vendors, with an outlier marked.]
Example2: Box Plot
Profit by year and quarter:
Year   Q1      Q2      Q3      Q4
1992   45.5    59.3    82.8    69.4
1993   74.7    88.2    109.6   87.7
1994   97.4    109.0   123.3   118.9
1995   122.2   137.7   155.8   140.9
1996   144.3   161.9   169.8   162.4
Example2: Box Plot (Contd.)
The medians rise from year to year, which indicates an increasing
trend in profit. The interquartile range (IQR) gets smaller over
the years. Most of the box plots are symmetric or very slightly
positively skewed. There are no outliers.
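
The trend in the medians can be checked with a short Python sketch over the profit table. The quartile convention used here (the medians of the lower and upper halves of the sorted data) is an assumption; with only four values per year, other conventions give different IQR values. The names `profit`, `hinge_iqr`, `medians`, and `iqrs` are illustrative.

```python
from statistics import median

# Quarterly profit figures (Q1..Q4) copied from the profit table.
profit = {
    1992: [45.5, 59.3, 82.8, 69.4],
    1993: [74.7, 88.2, 109.6, 87.7],
    1994: [97.4, 109.0, 123.3, 118.9],
    1995: [122.2, 137.7, 155.8, 140.9],
    1996: [144.3, 161.9, 169.8, 162.4],
}


def hinge_iqr(values):
    """IQR taken as the gap between the medians of the lower and upper halves.

    This is one of several quartile conventions (an assumption here).
    """
    data = sorted(values)
    n = len(data)
    lower_half = data[: n // 2]
    upper_half = data[(n + 1) // 2:]
    return median(upper_half) - median(lower_half)


medians = {year: median(values) for year, values in profit.items()}
iqrs = {year: hinge_iqr(values) for year, values in profit.items()}

for year in sorted(profit):
    print(year, "median:", round(medians[year], 2), "IQR:", round(iqrs[year], 2))
```

With this convention the medians increase every year, and the IQR for 1996 is smaller than for 1992, although the decrease is not strictly monotone in between.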
Measuring Relationship between Variables
So far we have discussed different methods to summarize
the data for one variable at a time.
Covariance
The covariance is a measure of the strength of the linear
relationship between two variables.
Formula for Covariance
The covariance between two variables x and y is computed
as follows:
For sample data:
$$S_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
For population data:
$$\sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$$
n → Sample size (i.e. number of sample observations)
N → Population size
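
The only difference between the two formulas is the divisor, which the following minimal Python sketch makes explicit. The function name `covariance`, its `population` flag, and the toy data are hypothetical and are not taken from the slides.

```python
def covariance(x, y, population=False):
    """Covariance of two equal-length sequences.

    Divides by n - 1 for sample data (default) or by N for population data.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("at least two observations are required")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products of deviations from the means
    total = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return total / (n if population else n - 1)


# Hypothetical toy data, chosen only to illustrate the two divisors.
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
sample_cov = covariance(x, y)
population_cov = covariance(x, y, population=True)
print(sample_cov, population_cov)
```

For the same data the sample covariance is always larger in magnitude than the population version, because it divides the same sum by n - 1 instead of N.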
Pearson’s Correlation
Pearson’s correlation coefficient is a measure of the linear
relationship between two variables that is not affected by
the units of measurement.
For sample data, Pearson’s correlation coefficient is
commonly known as the sample correlation coefficient.
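For sample data:
$$r_{xy} = \frac{1}{n - 1} \sum_{i=1}^{n} \frac{(x_i - \bar{x})}{S_x} \frac{(y_i - \bar{y})}{S_y}$$
S_x, S_y → sample standard deviations of the x and y variables
For population data:
$$\rho_{xy} = \frac{1}{N} \sum_{i=1}^{N} \frac{(x_i - \mu_x)}{\sigma_x} \frac{(y_i - \mu_y)}{\sigma_y}$$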
[Figure: scatter plots of Y against X for r = +1 and for r = -1.]
Relation between Covariance and Correlation
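Correlation:
$$r_{xy} = \frac{1}{n - 1} \sum_{i=1}^{n} \frac{(x_i - \bar{x})}{S_x} \frac{(y_i - \bar{y})}{S_y}$$
$$= \frac{1}{S_x S_y} \sum_{i=1}^{n} \frac{(x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
$$= \frac{1}{S_x S_y} S_{xy}$$
$$\rightarrow r_{xy} S_x S_y = S_{xy}$$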
Example: Covariance and Correlation Coefficient
Sample data of 9 observations on automobile tires are
given below. Determine the covariance and the correlation
coefficient.
Observation Number   Tire Rating   Load Carrying Capacity
1                    75            853
2                    82            1047
3                    85            1135
4                    87            1201
5                    88            1235
6                    91            1356
7                    92            1389
8                    93            1433
9                    105           2039
Example: Covariance and Correlation Coefficient (Contd.)
                     x       y         x - xbar (1)   y - ybar (2)   (1) * (2)
                     75      853       -13.67         -445.67        6091
                     82      1047      -6.67          -251.67        1678
                     85      1135      -3.67          -163.67        600
                     87      1201      -1.67          -97.67         163
                     88      1235      -0.67          -63.67         42
                     91      1356      2.33           57.33          134
                     92      1389      3.33           90.33          301
                     93      1433      4.33           134.33         582
                     105     2039      16.33          740.33         12092
Average              88.67   1298.67                  Sum            21683
Standard deviation   8.29    331.65
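
The computation in the table can be reproduced with a short Python sketch, which also carries it through to the covariance and the correlation coefficient using the sample formulas (divisor n - 1). The variable names are illustrative assumptions; the data are the nine tire observations given above.

```python
from math import sqrt

# Tire data copied from the example table.
tire_rating = [75, 82, 85, 87, 88, 91, 92, 93, 105]
load_capacity = [853, 1047, 1135, 1201, 1235, 1356, 1389, 1433, 2039]

n = len(tire_rating)
x_bar = sum(tire_rating) / n
y_bar = sum(load_capacity) / n

# Sum of the (1) * (2) column: products of deviations from the means
sum_products = sum(
    (x - x_bar) * (y - y_bar) for x, y in zip(tire_rating, load_capacity)
)

# Sample covariance and sample standard deviations (divisor n - 1)
s_xy = sum_products / (n - 1)
s_x = sqrt(sum((x - x_bar) ** 2 for x in tire_rating) / (n - 1))
s_y = sqrt(sum((y - y_bar) ** 2 for y in load_capacity) / (n - 1))

# Correlation from the relation r_xy * S_x * S_y = S_xy
r_xy = s_xy / (s_x * s_y)

print("sum of products:", round(sum_products, 2))
print("covariance:", round(s_xy, 3))
print("S_x, S_y:", round(s_x, 2), round(s_y, 2))
print("correlation:", round(r_xy, 3))
```

The sum of products matches the table total of 21683, so the sample covariance is 21683 / 8 = 2710.375. The resulting correlation is close to +1, which is consistent with a strong positive linear relationship between tire rating and load carrying capacity.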
Spearman’s Rank Correlation
Pearson’s correlation coefficient characterizes only
linear association. Spearman’s correlation coefficient
measures the extent to which the two variables have the same
ordering.
Spearman’s Rank Correlation
Spearman’s rank correlation coefficient (r_S)_{xy} for sample
data is defined as follows.
For sample data:
$$(r_S)_{xy} = 1 - \frac{6 \sum_{i=1}^{n} (x_r - y_r)^2}{n (n^2 - 1)}$$
x_r, y_r → ranks assigned to the same individual in the x and y
variables
n → number of observations
Example: Spearman’s Rank Correlation
You are given the ranks of 10 students in midterm and end
term examinations in statistics. Compute Spearman’s
coefficient of rank correlation and interpret it.
Student name   MidTerm Rank   EndTerm Rank
Priyank        1              3
Sunita         3              2
Rohit          7              8
Sam            10             7
Tamal          9              9
Rohini         5              6
Santanu        4              5
Vijay          8              10
Anirudh        2              1
Sania          6              4
Example: Spearman’s Rank Correlation (Contd.)
Student name   MidTerm Rank   EndTerm Rank   Rank difference
Priyank        1              3              -2
Sunita         3              2              1
Rohit          7              8              -1
Sam            10             7              3
Tamal          9              9              0
Rohini         5              6              -1
Santanu        4              5              -1
Vijay          8              10             -2
Anirudh        2              1              1
Sania          6              4              2
$$(r_S)_{xy} = 1 - \frac{6 \left[ (-2)^2 + (1)^2 + \dots + (2)^2 \right]}{10 (10^2 - 1)} = 0.842$$
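
The same calculation can be sketched in Python directly from the rank table. The dictionary name `ranks` and the other variable names are illustrative assumptions; the formula is the rank-difference form given earlier, which applies when there are no tied ranks, as is the case here.

```python
# (MidTerm rank, EndTerm rank) for each student, copied from the table.
ranks = {
    "Priyank": (1, 3),
    "Sunita": (3, 2),
    "Rohit": (7, 8),
    "Sam": (10, 7),
    "Tamal": (9, 9),
    "Rohini": (5, 6),
    "Santanu": (4, 5),
    "Vijay": (8, 10),
    "Anirudh": (2, 1),
    "Sania": (6, 4),
}

n = len(ranks)

# Sum of squared rank differences across all students
sum_d_squared = sum((mid - end) ** 2 for mid, end in ranks.values())

# Spearman's rank correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1))
r_s = 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))

print("sum of squared differences:", sum_d_squared)
print("Spearman's rank correlation:", round(r_s, 3))
```

The squared differences sum to 26, giving 1 - 156/990, which rounds to 0.842 and agrees with the value above.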
Comments on Correlation Coefficient (Contd.)
A so-called spurious correlation (or false correlation) arises
when two variables appear to be highly correlated, but the
association is due to some unseen factor.