
Exploratory Data Analysis

Sasadhar Bera, IIM Ranchi


Outline
- Box Plot
- Outlier Detection using Box Plot
- Measuring Relationship between Variables
- Covariance
- Pearson's Correlation
- Spearman's Rank Correlation

Box Plot
A box plot (also called a box-and-whisker plot) is a graphical presentation of a data set that shows its central tendency, spread, skewness, and the presence of outliers.

A box plot presents five summary measures of the data set:

- Median of the data
- Lower quartile (Q1)
- Upper quartile (Q3)
- Smallest observation
- Largest observation

Box Plot (Contd.)

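[Figure: box plot along a number line marking Xsmallest, Q1 (lower quartile), Median (Q2), Q3 (upper quartile), and Xlargest. The left whisker runs from Xsmallest to Q1, the box from Q1 to Q3, and the right whisker from Q3 to Xlargest. Each of the four segments contains 25% of the data.]

- The median gives the location of the data set.
- The spread is the length of the box, i.e. the interquartile range (IQR): IQR = Q3 - Q1.
- The IQR is the range of the middle 50% of the data.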
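The five summary measures and the IQR can be computed directly. The sketch below is a minimal illustration, assuming Python's standard `statistics` module and a hypothetical sample that does not come from the slides. Quartile conventions differ between textbooks and software, so Q1 and Q3 may not match another tool exactly.

```python
import statistics

def five_number_summary(values):
    """Return (smallest, Q1, median, Q3, largest) for a list of numbers."""
    data = sorted(values)
    # method="inclusive" interpolates between order statistics; other
    # quartile conventions exist and can give slightly different Q1 and Q3.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]

# Hypothetical sample, not taken from the slides.
sample = [2, 4, 4, 5, 7, 8, 9, 11, 12]
smallest, q1, median, q3, largest = five_number_summary(sample)
iqr = q3 - q1  # length of the box: range of the middle 50% of the data
print(smallest, q1, median, q3, largest, iqr)
```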
Box Plot (Contd.)
[Figure: three box plots, each marked Xsmallest, Q1, Median, Q3, and Xlargest, illustrating a positive (right) skewed distribution, a symmetric distribution, and a negative (left) skewed distribution.]

Detecting Outliers using Box Plot
Observations beyond the inner fences but within the outer
fences are suspected as outliers. Observations beyond the
outer fences are identified as outliers.

[Figure: box plot with fences. Each whisker ends at the smallest or largest data point not exceeding the inner fence. A point between the inner and outer fences is marked as a suspected outlier; a point beyond an outer fence is marked as an outlier.]

Inner fences: Q1 - 1.5(IQR) and Q3 + 1.5(IQR)
Outer fences: Q1 - 3(IQR) and Q3 + 3(IQR)
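The fence rule can be sketched as a small function. This is an illustration only: the function name `classify_by_fences`, the quartiles, and the observations are hypothetical and were chosen to exercise each case.

```python
def classify_by_fences(values, q1, q3):
    """Label each value 'ok', 'suspected outlier', or 'outlier' using box-plot fences."""
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr
    labels = []
    for v in values:
        if v < outer_low or v > outer_high:
            labels.append((v, "outlier"))            # beyond an outer fence
        elif v < inner_low or v > inner_high:
            labels.append((v, "suspected outlier"))  # between inner and outer fence
        else:
            labels.append((v, "ok"))                 # within the inner fences
    return labels

# Hypothetical quartiles and observations: inner fences (-5, 35), outer fences (-20, 50).
labels = classify_by_fences([12, 20, 40, 60], q1=10.0, q3=20.0)
print(labels)
```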
Example 1: Box Plot
Two vendors supply a similar type of steel wire to a rope manufacturer. Determine which vendor is better and provides a more consistent product. The available sample data set is given below.

Obs. No.   Vendor   Wire Strength   Vendor   Wire Strength
1          A        346             B        470
2          A        338             B        573
3          A        323             B        520
4          A        438             B        425
5          A        398             B        382
6          .        .               .        .
7          .        .               .        .
8          .        .               .        .
9          .        .               .        .
10         .        .               .        .
11         A        368             B        526
12         A        376             B        371
13         A        311             B        452
14         A        379             B        300
15         A        216             B        598
Example 1: Box Plot (Contd.)
Descriptive Statistics   Vendor A   Vendor B
Min.                     216        300
1st Qu.                  330.5      407
Median                   376        452
Mean                     375.5      466.7
Standard deviation       92.34      89.44
3rd Qu.                  400        523
Max.                     635        620
IQR                      69.5       116

Example 1: Box Plot (Contd.)
[Figure: side-by-side box plots of wire strength for vendor A and vendor B, with an outlier marked for vendor A.]

- The median wire strength of vendor B is higher than that of vendor A, which indicates that wire from vendor B is stronger.
- For vendor B the left and right whiskers are of equal length, which indicates that vendor B produces more consistent wire.
- An outlier is present for vendor A, so there is a chance of vendor A producing wire of extremely high or low strength.
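As a rough check of the outlier remark, the inner fences can be computed from the summary statistics tabulated for Example 1. This is a sketch under the assumption that the tabulated quartiles are the ones used for the plot. The helper `inner_fences` is a name introduced here for illustration.

```python
# Summary statistics copied from the Example 1 descriptive statistics table.
vendors = {
    "A": {"min": 216, "q1": 330.5, "q3": 400, "max": 635},
    "B": {"min": 300, "q1": 407, "q3": 523, "max": 620},
}

def inner_fences(q1, q3):
    """Return (lower, upper) inner fences: Q1 - 1.5(IQR) and Q3 + 1.5(IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

beyond_inner = {}
for name, s in vendors.items():
    low, high = inner_fences(s["q1"], s["q3"])
    # If the extreme observations lie outside the inner fences, the vendor
    # has at least one suspected outlier.
    beyond_inner[name] = s["min"] < low or s["max"] > high
    print(name, low, high, beyond_inner[name])
```

Under these assumptions only vendor A has extreme values outside its inner fences, which agrees with the interpretation above.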
Example 2: Box Plot
The quarterly profits, in thousands of dollars, of a small but growing company are shown in the table below. Make a graphical plot to understand the data.

Profit by quarter (Q1 to Q4 here are calendar quarters, not quartiles)
Year   Q1      Q2      Q3      Q4
1992   45.5    59.3    82.8    69.4
1993   74.7    88.2    109.6   87.7
1994   97.4    109.0   123.3   118.9
1995   122.2   137.7   155.8   140.9
1996   144.3   161.9   169.8   162.4

Example 2: Box Plot (Contd.)
[Figure: box plots of quarterly profit, one per year from 1992 to 1996.]

The medians rise from year to year, which indicates an increasing trend in profit. The interquartile range (IQR) tends to get smaller over the years. Most of the box plots are symmetric or very slightly positively skewed. There are no outliers.
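The yearly medians and IQRs behind this reading can be recomputed from the profit table. The sketch below assumes Python's `statistics.quantiles` with the "inclusive" convention. With only four values per year the quartiles depend strongly on the convention, so the figures may differ from those used to draw the plot.

```python
import statistics

# Quarterly profits (thousands of dollars) copied from the Example 2 table.
profits = {
    1992: [45.5, 59.3, 82.8, 69.4],
    1993: [74.7, 88.2, 109.6, 87.7],
    1994: [97.4, 109.0, 123.3, 118.9],
    1995: [122.2, 137.7, 155.8, 140.9],
    1996: [144.3, 161.9, 169.8, 162.4],
}

summary = {}
for year, values in profits.items():
    # quantiles() sorts the data itself and returns Q1, median, Q3 for n=4.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    summary[year] = {"median": median, "iqr": q3 - q1}
    print(year, round(median, 2), round(q3 - q1, 2))
```

Under this convention the medians increase every year, while the IQR is smaller in 1996 than in 1992 but does not shrink in every single year.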
Measuring Relationship between Variables
So far we have discussed different methods to summarize the data for one variable at a time.

Often a manager or decision maker is interested in the relationship between two variables.

Three descriptive measures of the relationship between two variables are covariance, Pearson's correlation, and Spearman's rank correlation.

Covariance
The covariance is a measure of the strength of the linear relationship between two variables.

A positive covariance indicates a positive relationship between the two variables.

A negative covariance indicates a negative relationship between the two variables.

However, the value of the covariance depends on the units of measurement of the two variables, so covariances computed on different scales cannot be compared directly.

Formula for Covariance
The covariance between two variables x and y is computed as follows:

For sample data:
\[ S_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \]

For population data:
\[ \sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N} \]

n → sample size (i.e. number of sample observations)
N → population size
\( \bar{x}, \bar{y} \) → sample averages of x and y respectively
\( \mu_x, \mu_y \) → population means of x and y respectively

Pearson’s Correlation
A measure of the linear relationship between two variables that is not affected by the units of measurement is Pearson's correlation coefficient. For sample data it is commonly known as the sample correlation coefficient.

For sample data:
\[ r_{xy} = \frac{1}{n - 1} \sum_{i=1}^{n} \frac{(x_i - \bar{x})}{S_x} \, \frac{(y_i - \bar{y})}{S_y} \]

\( S_x, S_y \) → sample standard deviations of the x and y variables

For population data:
\[ \rho_{xy} = \frac{1}{N} \sum_{i=1}^{N} \frac{(x_i - \mu_x)}{\sigma_x} \, \frac{(y_i - \mu_y)}{\sigma_y} \]

\( \sigma_x, \sigma_y \) → population standard deviations of the x and y variables


Pearson’s Correlation (Contd.)
The correlation coefficient (rxy) lies between -1 and +1. A value of +1 corresponds to a perfect positive linear relationship between the two variables, and a value of -1 to a perfect negative linear relationship.

A value close to +1 or -1 indicates a strong linear relationship. A value close to 0 indicates a weak linear relationship.

Correlation only measures the strength of the relationship between two variables; it does not prove a cause-and-effect relationship.

A correlation coefficient ≈ 0 indicates no linear relationship between the variables, but the true relationship may still be non-linear.
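A small illustration of the last point: in the hypothetical data below, y is completely determined by x through y = x², yet the correlation coefficient is zero up to rounding error because the relationship is not linear.

```python
import statistics

def pearson_r(x, y):
    """Sample correlation: sum of cross deviations over the root of the product of squared deviations."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = (sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)) ** 0.5
    return num / den

# Hypothetical data: y depends on x exactly, but through a parabola.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]
r = pearson_r(x, y)
print(r)
```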
Pearson’s Correlation (Contd.)
r = +1: perfect positive relationship between two variables.
r = -1: perfect negative relationship between two variables.

[Figure: two scatter plots of Y against X, one with r = +1 and one with r = -1.]

Relation between Covariance and Correlation

\[
r_{xy} = \frac{1}{n - 1} \sum_{i=1}^{n} \frac{(x_i - \bar{x})}{S_x} \, \frac{(y_i - \bar{y})}{S_y}
       = \frac{1}{S_x S_y} \cdot \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}
       = \frac{1}{S_x S_y} \, S_{xy}
\]

\[ \Rightarrow \quad r_{xy} \, S_x \, S_y = S_{xy} \]

(Correlation of x and y) * (Standard deviation of x) * (Standard deviation of y) = Covariance of x and y

Example: Covariance and Correlation Coefficient
A sample of 9 observations on automobile tires is given below. Determine the covariance and the correlation coefficient.

Observation Number   Tire Rating   Load Carrying Capacity
1                    75            853
2                    82            1047
3                    85            1135
4                    87            1201
5                    88            1235
6                    91            1356
7                    92            1389
8                    93            1433
9                    105           2039
Example: Covariance and Correlation Coefficient
x      y       (1) x - xbar   (2) y - ybar   (1) * (2)
75     853     -13.67         -445.67        6091
82     1047    -6.67          -251.67        1678
85     1135    -3.67          -163.67        600
87     1201    -1.67          -97.67         163
88     1235    -0.67          -63.67         42
91     1356    2.33           57.33          134
92     1389    3.33           90.33          301
93     1433    4.33           134.33         582
105    2039    16.33          740.33         12092

Sum of (1) * (2) = 21683
Average: xbar = 88.67, ybar = 1298.67
Standard deviation: Sx = 8.29, Sy = 331.65

n = 9
covariance = 21683 / (9 - 1) = 2710.38
correlation = 21683 / ((9 - 1) * 8.29 * 331.65) = 0.986
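The hand calculation above can be reproduced in a few lines. This sketch uses the tire data from the table and Python's standard `statistics` module; the variable names are chosen here for illustration.

```python
import statistics

# Tire rating (x) and load carrying capacity (y) from the example table.
rating = [75, 82, 85, 87, 88, 91, 92, 93, 105]
capacity = [853, 1047, 1135, 1201, 1235, 1356, 1389, 1433, 2039]

n = len(rating)
mean_x = statistics.mean(rating)
mean_y = statistics.mean(capacity)

# Column (1) * (2) of the table, summed over all observations.
cross_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(rating, capacity))

cov_xy = cross_sum / (n - 1)
# Correlation = covariance divided by the product of the sample standard deviations.
r_xy = cov_xy / (statistics.stdev(rating) * statistics.stdev(capacity))
print(round(cross_sum), round(cov_xy, 2), round(r_xy, 3))
```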
Example: Covariance and Correlation Coefficient
Correlation coefficient value +0.986 indicates strong
positive linear association (or relationship), and this is also
borne out by the scatter plot of the data.

Spearman’s Rank Correlation
Pearson's correlation coefficient characterizes only linear association. Spearman's correlation coefficient measures the extent to which the two variables have the same ordering.

To determine Spearman's correlation coefficient, first rank both variables, and then compute the correlation coefficient on the ranks. Because it works on ranks, it can handle non-linear association. Furthermore, it is not disturbed by outliers.

When the data have been collected in ranked form, a rank correlation coefficient can be computed directly.
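The rank-then-correlate procedure can be sketched as follows. The data are hypothetical (y doubles with every unit step in x), and the helper `ranks` assumes there are no tied values.

```python
import statistics

def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def pearson_r(x, y):
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = (sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)) ** 0.5
    return num / den

# Hypothetical monotone but non-linear data: y = 2**x.
x = list(range(1, 11))
y = [2 ** v for v in x]

r_pearson = pearson_r(x, y)                  # below 1: the relationship is not linear
r_spearman = pearson_r(ranks(x), ranks(y))   # exactly 1: the orderings agree perfectly
print(r_pearson, r_spearman)
```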

Spearman’s Rank Correlation
Spearman's rank correlation coefficient (rS)xy for sample data is defined as follows.

For sample data:
\[ (r_S)_{xy} = 1 - \frac{6 \sum_{i=1}^{n} (x_{r,i} - y_{r,i})^2}{n (n^2 - 1)} \]

\( x_{r,i}, y_{r,i} \) → ranks assigned to the same individual i in the x and y variables
n → number of observations

Example: Spearman’s Rank Correlation
You are given the ranks of 10 students in midterm and end
term examinations in statistics. Compute Spearman’s
coefficient of rank correlation and interpret it.
Student name   MidTerm Rank   EndTerm Rank
Priyank        1              3
Sunita         3              2
Rohit          7              8
Sam            10             7
Tamal          9              9
Rohini         5              6
Santanu        4              5
Vijay          8              10
Anirudh        2              1
Sania          6              4
Example: Spearman’s Rank Correlation (Contd.)
Student name   MidTerm Rank   EndTerm Rank   Rank difference
Priyank        1              3              -2
Sunita         3              2              1
Rohit          7              8              -1
Sam            10             7              3
Tamal          9              9              0
Rohini         5              6              -1
Santanu        4              5              -1
Vijay          8              10             -2
Anirudh        2              1              1
Sania          6              4              2

\[ (r_S)_{xy} = 1 - \frac{6 \left[ (-2)^2 + (1)^2 + \dots + (2)^2 \right]}{10 (10^2 - 1)} = 1 - \frac{6 \times 26}{990} = 0.842 \]

There is a high degree of correlation between the students' midterm and end term ranks: a student who ranks higher in the midterm also tends to rank higher in the end term.
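The same calculation can be written as a short function. This sketch applies the rank-difference formula to the ranks in the table; the function name `spearman_from_ranks` is introduced here for illustration, and the formula is used on the assumption that there are no tied ranks, as in this example.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Spearman's coefficient from two lists of ranks without ties."""
    n = len(rank_x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Midterm and end term ranks of the 10 students, in table order.
midterm = [1, 3, 7, 10, 9, 5, 4, 8, 2, 6]
endterm = [3, 2, 8, 7, 9, 6, 5, 10, 1, 4]

sum_d_squared = sum((a - b) ** 2 for a, b in zip(midterm, endterm))
r_s = spearman_from_ranks(midterm, endterm)
print(sum_d_squared, round(r_s, 3))
```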
Comments on Correlation Coefficient
The correlation coefficient may vary depending on the sample size.

If samples of different sizes are drawn from a population, it is quite likely that different values of the correlation coefficient will be observed.

Hence the sample size must be taken into account before a correlation coefficient is interpreted.

If two variables are highly correlated, this does not mean that one must be the cause of the other.

Comments on Correlation Coefficient (Contd.)
A so-called spurious (or false) correlation arises when two variables are highly correlated only because of some unseen factor.

Example: A group of 50 adults was chosen at random, and the lengths of their left arm and right leg were recorded. A high positive correlation was observed between left arm length and right leg length, but it makes no sense to suggest that the length of the left arm is a “cause” of the length of the right leg.

Thus it is very important to pay attention to other factors that may affect both of the variables being investigated.
