211MAT1302 Unit-3

211MAT1302- Statistics for Engineers Course Material

KALASALINGAM ACADEMY OF RESEARCH AND EDUCATION


DEPARTMENT OF MATHEMATICS
211MAT1302- Statistics for Engineers
Course Material
Unit 3 - Correlation, Rank Correlation and Regression Analysis
1. Correlation
Scatter Diagram: For the bivariate distribution $(x_i, y_i)$, $i = 1, 2, \dots, n$,
if the values of the variables X and Y are plotted along the x-axis and y-axis
respectively in the xy-plane, the resulting diagram of dots is known as a
scatter diagram.
Note: From the scatter diagram, we can judge whether the variables are
correlated or not. If the points are very dense, i.e., very close to each other,
we should expect a fairly good amount of correlation between the variables,
and if the points are widely scattered, a poor correlation is expected.

Karl Pearson’s Coefficient of Correlation

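The correlation coefficient between two random variables X and Y, denoted by
$r(X, Y)$ or $r_{XY}$ or simply $r$, is a numerical measure of the linear relationship
between them and is defined as
$$r(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}.$$
For the bivariate distribution $(x_i, y_i)$, $i = 1, 2, \dots, n$,
$$\operatorname{Cov}(X, Y) = \frac{1}{n}\sum (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n}\sum xy - \bar{x}\,\bar{y},$$
$$\sigma_X^2 = \frac{1}{n}\sum (x_i - \bar{x})^2 = \frac{1}{n}\sum x^2 - (\bar{x})^2,$$
$$\sigma_Y^2 = \frac{1}{n}\sum (y_i - \bar{y})^2 = \frac{1}{n}\sum y^2 - (\bar{y})^2,$$
where
$$\bar{x} = \frac{1}{n}\sum x \quad\text{and}\quad \bar{y} = \frac{1}{n}\sum y.$$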

Properties:
1. −1 ≤ r(X, Y ) ≤ 1
2. Correlation coefficient is independent of change of origin,
i.e. r(X + a, Y + b) = r(X, Y ) where a and b are constants.
3. Correlation coefficient is independent of scale,
i.e., r(aX, bY ) = r(X, Y ) where a > 0 and b > 0 are constants.

Compiled by: Dr. K. Karuppasamy, www.drkk.in, KARE Page: 1



4. If X and Y are independent random variables, then they are uncorrelated,
i.e., r(X, Y) = 0. But the converse need not be true:
r(X, Y) = 0 does not imply that X and Y are independent.
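
Property 4 can be illustrated with a small counterexample. The sketch below is not from the course material: it assumes X takes the values −1, 0, 1 with equal probability and sets Y = X², so Y is completely determined by X, yet the covariance (and hence r) is zero.

```python
from fractions import Fraction

# Assumed toy distribution: X is -1, 0 or 1, each with probability 1/3; Y = X^2.
xs = [-1, 0, 1]
p = Fraction(1, 3)

e_x = sum(p * x for x in xs)            # E(X)
e_y = sum(p * x**2 for x in xs)         # E(Y) = E(X^2)
e_xy = sum(p * x * x**2 for x in xs)    # E(XY) = E(X^3)
cov_xy = e_xy - e_x * e_y               # Cov(X, Y), so r = 0 when this is 0

# Independence would need P(X = 1, Y = 0) = P(X = 1) * P(Y = 0).
p_joint = Fraction(0)                   # Y = 0 only when X = 0
p_product = p * p                       # P(X = 1) * P(Y = 0) = 1/3 * 1/3

print("Cov(X, Y) =", cov_xy)
print("independent?", p_joint == p_product)
```

Here the covariance vanishes while the joint probability differs from the product of the marginals, so X and Y are uncorrelated but not independent.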

Note: Scatter diagrams for r > 0, r < 0, r = 0, r = +1 and r = −1.

[Figure: five scatter diagrams in the XY-plane with origin O, one panel each for r > 0, r < 0, r = 0, r = +1 and r = −1.]

Note: If a, b, c and d are constants, then

1. $E(a) = a$.

2. $E(aX + bY) = aE(X) + bE(Y)$.

3. If X and Y are independent, then $E(XY) = E(X)E(Y)$.

4. $\operatorname{Var}(a) = 0$.

5. $\operatorname{Var}(aX) = a^2 \operatorname{Var}(X)$.

6. $\operatorname{Var}(aX + bY) = a^2 \operatorname{Var}(X) + b^2 \operatorname{Var}(Y) + 2ab\operatorname{Cov}(X, Y)$.

7. If X and Y are independent, then $\operatorname{Var}(aX + bY) = a^2 \operatorname{Var}(X) + b^2 \operatorname{Var}(Y)$.

8. $\operatorname{Cov}(X + a, Y + b) = \operatorname{Cov}(X, Y)$.

9. $\operatorname{Cov}(aX, bY) = ab\operatorname{Cov}(X, Y)$.

10. $\operatorname{Cov}(aX + c, bY + d) = ab\operatorname{Cov}(X, Y)$.

11. If X and Y are independent, then $\operatorname{Cov}(X, Y) = 0$.

Problem: If Y = −2X + 3, find Cov(X, Y ).

Solution: Given Y = −2X + 3.
$$\begin{aligned}
\operatorname{Cov}(X, Y) &= E(XY) - E(X)E(Y) \\
&= E[X(-2X + 3)] - E(X)E(-2X + 3) \\
&= E(-2X^2 + 3X) - E(X)[-2E(X) + 3] \\
&= -2E(X^2) + 3E(X) + 2\{E(X)\}^2 - 3E(X) \\
&= -2\left[E(X^2) - \{E(X)\}^2\right] = -2\operatorname{Var}(X).
\end{aligned}$$

Problem: If X and Y are independent random variables having means 16
and 9, respectively, find the correlation coefficient between X and Y.

Solution: Given X and Y are independent with E(X) = 16 and E(Y) = 9.
Since X and Y are independent, we have E(XY) = E(X)E(Y).
Hence Cov(X, Y) = E(XY) − E(X)E(Y) = 0, and therefore
$$r = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} = 0.$$
Problem: If X, Y, Z are uncorrelated random variables having the same
variance, find the correlation coefficient between (X + Y) and (X − Y).

Solution: Given X, Y, Z are uncorrelated random variables with
Var(X) = Var(Y) = Var(Z) = k (say).
X and Y are uncorrelated ⇒ r(X, Y) = 0 ⇒ Cov(X, Y) = 0
⇒ E(XY) − E(X)E(Y) = 0, and hence E(XY) = E(X)E(Y).
Similarly, Y and Z are uncorrelated ⇒ E(YZ) = E(Y)E(Z).
Also, X and Z are uncorrelated ⇒ E(XZ) = E(X)E(Z).
Let U = X + Y and V = X − Y.
Now E(U) = E(X + Y) = E(X) + E(Y);
E(V) = E(X − Y) = E(X) − E(Y);
$E(UV) = E\{(X + Y)(X - Y)\} = E\{X^2 - Y^2\} = E(X^2) - E(Y^2)$.
$$\begin{aligned}
\operatorname{Cov}(U, V) &= E(UV) - E(U)E(V) \\
&= \{E(X^2) - E(Y^2)\} - \{E(X) + E(Y)\}\{E(X) - E(Y)\} \\
&= \{E(X^2) - E(Y^2)\} - \left[\{E(X)\}^2 - \{E(Y)\}^2\right] \\
&= \left[E(X^2) - \{E(X)\}^2\right] - \left[E(Y^2) - \{E(Y)\}^2\right] \\
&= \operatorname{Var}(X) - \operatorname{Var}(Y) = k - k = 0.
\end{aligned}$$
Now Var(U) = Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)
= Var(X) + Var(Y) (since Cov(X, Y) = 0)
= k + k = 2k.
Also Var(V) = Var(X − Y) = Var(X) + Var(Y) − 2Cov(X, Y)
= Var(X) + Var(Y) (since Cov(X, Y) = 0)
= k + k = 2k.
$$r(U, V) = \frac{\operatorname{Cov}(U, V)}{\sigma_U \sigma_V} = \frac{0}{\sqrt{2k}\,\sqrt{2k}} = 0.$$
Problem: Two random variables X and Y are related as Y = 4X + 9. Find
the correlation coefficient between X and Y .

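Solution: Given Y = 4X + 9.
$$\begin{aligned}
\operatorname{Cov}(X, Y) &= E(XY) - E(X)E(Y) \\
&= E\{X(4X + 9)\} - E(X)E(4X + 9) \\
&= E\{4X^2 + 9X\} - E(X)\{4E(X) + 9\} \\
&= 4E(X^2) + 9E(X) - 4\{E(X)\}^2 - 9E(X) \\
&= 4\left[E(X^2) - \{E(X)\}^2\right] \\
&= 4\operatorname{Var}(X) = 4\sigma_X^2.
\end{aligned}$$
Now $\operatorname{Var}(Y) = \operatorname{Var}(4X + 9) = 4^2 \operatorname{Var}(X)$.
Hence $\sigma_Y = 4\sigma_X$.
$$r(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{4\sigma_X^2}{\sigma_X \cdot 4\sigma_X} = 1.$$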
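
The two linear-relation problems (Y = −2X + 3 and Y = 4X + 9) are instances of one general computation. The following is a sketch for $Y = aX + b$, where the constants $a \neq 0$ and $b$ are introduced here for illustration and $\operatorname{Var}(X) > 0$ is assumed; it uses only the properties of Cov and Var listed earlier.

$$\begin{aligned}
\operatorname{Cov}(X, aX + b) &= a\operatorname{Cov}(X, X) = a\operatorname{Var}(X) = a\sigma_X^2, \\
\operatorname{Var}(aX + b) &= a^2\operatorname{Var}(X) \;\Rightarrow\; \sigma_Y = |a|\,\sigma_X, \\
r(X, Y) &= \frac{a\sigma_X^2}{\sigma_X \cdot |a|\,\sigma_X} = \frac{a}{|a|} = \pm 1.
\end{aligned}$$

With a = 4 this gives r = +1, and with a = −2 it gives Cov(X, Y) = −2Var(X) and r = −1, in agreement with the worked solutions.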
Problem: Let X, Y, Z be uncorrelated random variables with mean zero and
SD 5, 12 and 9 respectively. If U = X + Y and V = Y + Z, find the
correlation coefficient between U and V.

Solution: Given X, Y, Z are uncorrelated random variables, E(X) = E(Y) = E(Z) = 0,
$\sigma_X = 5$, $\sigma_Y = 12$ and $\sigma_Z = 9$.

X and Y are uncorrelated ⇒ r(X, Y) = 0 ⇒ Cov(X, Y) = 0
⇒ E(XY) − E(X)E(Y) = 0, and hence E(XY) = 0.

Similarly, Y and Z are uncorrelated ⇒ E(YZ) = 0, and
X and Z are uncorrelated ⇒ E(XZ) = 0.

Thus E(XY) = E(YZ) = E(XZ) = 0.

$$\begin{aligned}
E(UV) &= E\{(X + Y)(Y + Z)\} = E\{XY + XZ + Y^2 + YZ\} \\
&= E(XY) + E(XZ) + E(Y^2) + E(YZ) \\
&= 0 + 0 + E(Y^2) + 0 = E(Y^2) \\
&= \operatorname{Var}(Y) + \{E(Y)\}^2 = 12^2 + 0 = 144.
\end{aligned}$$

Now E(U) = E(X + Y) = E(X) + E(Y) = 0 + 0 = 0 and
E(V) = E(Y + Z) = E(Y) + E(Z) = 0 + 0 = 0.

Therefore, Cov(U, V) = E(UV) − E(U)E(V) = 144 − 0 = 144.

Now Var(U) = Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)
$= 5^2 + 12^2 + 0 = 169$, and hence $\sigma_U = \sqrt{169} = 13$.

Now Var(V) = Var(Y + Z) = Var(Y) + Var(Z) + 2Cov(Y, Z)
$= 12^2 + 9^2 + 0 = 225$, and hence $\sigma_V = \sqrt{225} = 15$.

$$r(U, V) = \frac{\operatorname{Cov}(U, V)}{\sigma_U \sigma_V} = \frac{144}{13 \times 15} = \frac{48}{65} \approx 0.7385.$$
Problem: Compute the coefficient of correlation between x and y, from the
following data.
x 1 3 5 7 8 10
y 8 12 15 17 18 20

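Solution:

x      y      xy     x^2    y^2
1      8      8      1      64
3      12     36     9      144
5      15     75     25     225
7      17     119    49     289
8      18     144    64     324
10     20     200    100    400
Total:
34     90     582    248    1446

$$\bar{x} = \frac{1}{6}\sum x = \frac{34}{6} = \frac{17}{3}; \qquad \bar{y} = \frac{1}{6}\sum y = \frac{90}{6} = 15$$
$$\sigma_x^2 = \frac{1}{6}\sum x^2 - \bar{x}^2 = \frac{248}{6} - \left(\frac{17}{3}\right)^2 = \frac{83}{9}; \qquad \sigma_y^2 = \frac{1}{6}\sum y^2 - \bar{y}^2 = \frac{1446}{6} - 15^2 = 16$$
$$\operatorname{Cov}(x, y) = \frac{1}{6}\sum xy - \bar{x}\,\bar{y} = \frac{582}{6} - \left(\frac{17}{3}\right)(15) = 12$$
$$r = \frac{\operatorname{Cov}(x, y)}{\sigma_x \sigma_y} = \frac{12}{\sqrt{\frac{83}{9}}\,\sqrt{16}} = 0.9879$$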
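
The tabular calculation can be checked with a few lines of Python. This is a minimal sketch, assuming the divide-by-n (population) formulas used in this unit; the function name `pearson_r` is chosen here for illustration and is not part of the course material.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from Cov = (1/n)*sum(xy) - mean_x*mean_y and the
    matching divide-by-n variances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    var_x = sum(x * x for x in xs) / n - mean_x ** 2
    var_y = sum(y * y for y in ys) / n - mean_y ** 2
    return cov / sqrt(var_x * var_y)

# Data of the problem above.
x = [1, 3, 5, 7, 8, 10]
y = [8, 12, 15, 17, 18, 20]
print(round(pearson_r(x, y), 4))  # 0.9879
```

Because r is a ratio, dividing by n or by n − 1 in both the covariance and the variances would give the same value.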

Problem: Compute the correlation coefficient between heights (in inches)
of fathers (x) and their sons (y).

x   65   67   66   71   67   70   68   69
y   67   68   68   70   64   67   72   79

Solution:

x      y      xy      x^2     y^2
65     67     4355    4225    4489
67     68     4556    4489    4624
66     68     4488    4356    4624
71     70     4970    5041    4900
67     64     4288    4489    4096
70     67     4690    4900    4489
68     72     4896    4624    5184
69     79     5451    4761    6241
Total:
543    555    37694   36885   38647

$$\bar{x} = \frac{1}{8}\sum x = \frac{543}{8} = 67.875; \qquad \bar{y} = \frac{1}{8}\sum y = \frac{555}{8} = 69.375$$
$$\sigma_x^2 = \frac{1}{8}\sum x^2 - \bar{x}^2 = \frac{36885}{8} - (67.875)^2 = \frac{231}{64}$$
$$\sigma_y^2 = \frac{1}{8}\sum y^2 - \bar{y}^2 = \frac{38647}{8} - (69.375)^2 = \frac{1151}{64}$$
$$\operatorname{Cov}(x, y) = \frac{1}{8}\sum xy - \bar{x}\,\bar{y} = \frac{37694}{8} - (67.875)(69.375) = \frac{187}{64}$$
$$r = \frac{\operatorname{Cov}(x, y)}{\sigma_x \sigma_y} = \frac{\frac{187}{64}}{\sqrt{\frac{231}{64}}\,\sqrt{\frac{1151}{64}}} = 0.3627$$

Problem: Find the correlation coefficient between industrial production and
export using the following data:

Production (x)   55   56   58   59   60   60   62
Export (y)       35   38   37   39   44   43   44

Solution:

x      y      xy      x^2     y^2
55     35     1925    3025    1225
56     38     2128    3136    1444
58     37     2146    3364    1369
59     39     2301    3481    1521
60     44     2640    3600    1936
60     43     2580    3600    1849
62     44     2728    3844    1936
Total:
410    280    16448   24050   11280

$$\bar{x} = \frac{1}{7}\sum x = \frac{410}{7}; \qquad \bar{y} = \frac{1}{7}\sum y = \frac{280}{7} = 40$$
$$\sigma_x^2 = \frac{1}{7}\sum x^2 - \bar{x}^2 = \frac{24050}{7} - \left(\frac{410}{7}\right)^2 = 5.1020$$
$$\sigma_y^2 = \frac{1}{7}\sum y^2 - \bar{y}^2 = \frac{11280}{7} - 40^2 = \frac{80}{7}$$
$$\operatorname{Cov}(x, y) = \frac{1}{7}\sum xy - \bar{x}\,\bar{y} = \frac{16448}{7} - \left(\frac{410}{7}\right)(40) = 6.8571$$
$$r = \frac{\operatorname{Cov}(x, y)}{\sigma_x \sigma_y} = \frac{6.8571}{\sqrt{5.1020}\,\sqrt{\frac{80}{7}}} = 0.8980$$

Spearman’s Rank Correlation

Let $(x_i, y_i)$, $i = 1, 2, \dots, n$ be the ranks of n individuals in two characteristics
X and Y. Spearman’s rank correlation is given by
$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}, \quad\text{where } d_i = x_i - y_i.$$
Note 1: $\sum d_i = \sum (x_i - y_i) = \sum x_i - \sum y_i = 0$.
Note 2: If two or more individuals are tied with respect to characteristic
X or Y, a common rank is given to the repeated values. The common rank is
the average of the ranks that the tied items would otherwise have received.
Note 3: For tied ranks, we add the correction factor $\dfrac{m(m^2 - 1)}{12}$ to $\sum d_i^2$,
where m is the number of times an item is repeated. This correction factor (CF)
is added once for each repeated value.
Note 4: −1 ≤ ρ ≤ 1.

Problem: Ten students got the following percentage of marks in Mathematics
and Physics:

Students:      1    2    3    4    5    6    7    8    9    10
Mathematics:   78   36   98   25   75   82   90   62   65   39
Physics:       84   51   91   60   68   62   86   58   63   47

Find the rank correlation between Mathematics marks and Physics marks.

Solution:

X      Y      Rank in X (x)   Rank in Y (y)   d = x − y   d^2
78     84     4               3               1           1
36     51     9               9               0           0
98     91     1               1               0           0
25     60     10              7               3           9
75     68     5               4               1           1
82     62     3               6               −3          9
90     86     2               2               0           0
62     58     7               8               −1          1
65     63     6               5               1           1
39     47     8               10              −2          4
Total:                                                    26

Here n = 10 and $\sum d^2 = 26$.
$$\text{Rank correlation, } \rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 26}{10(10^2 - 1)} = 0.8424$$
Problem: Obtain the rank correlation for the following data:

X:   68   64   75   50   64   80   75   40   55   64
Y:   62   58   68   45   81   60   68   48   50   70

Solution:

X      Y      Rank in X (x)   Rank in Y (y)   d = x − y   d^2
68     62     4               5               −1          1
64     58     6               7               −1          1
75     68     2.5             3.5             −1          1
50     45     9               10              −1          1
64     81     6               1               5           25
80     60     1               6               −5          25
75     68     2.5             3.5             −1          1
40     48     10              9               1           1
55     50     8               8               0           0
64     70     6               2               4           16
Total:                                                    72

Here n = 10 and $\sum d^2 = 72$.
$$\text{CF for } X = 75: \quad \frac{m(m^2 - 1)}{12} = \frac{2(2^2 - 1)}{12} = 0.5$$
$$\text{CF for } X = 64: \quad \frac{m(m^2 - 1)}{12} = \frac{3(3^2 - 1)}{12} = 2$$
$$\text{CF for } Y = 68: \quad \frac{m(m^2 - 1)}{12} = \frac{2(2^2 - 1)}{12} = 0.5$$
∴ Total CF = 0.5 + 2 + 0.5 = 3
$$\text{Rank correlation, } \rho = 1 - \frac{6\left(\sum d^2 + CF\right)}{n(n^2 - 1)} = 1 - \frac{6 \times (72 + 3)}{10(10^2 - 1)} = 0.5455$$
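
The ranking, averaging of tied ranks and correction factors can be sketched in Python as follows. The helper names (`average_ranks`, `correction_factor`, `spearman_rho`) are hypothetical and chosen only for this illustration; rank 1 is assumed to go to the largest value, as in the solutions of this unit.

```python
from collections import Counter

def average_ranks(values):
    """Rank 1 = largest value; tied values share the average of their ranks."""
    positions = {}
    for pos, v in enumerate(sorted(values, reverse=True), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def correction_factor(values):
    """Sum of m(m^2 - 1)/12 over every value repeated m > 1 times."""
    return sum(m * (m * m - 1) / 12 for m in Counter(values).values() if m > 1)

def spearman_rho(xs, ys):
    """Rank correlation 1 - 6(sum d^2 + CF) / (n(n^2 - 1)); CF is 0 without ties."""
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    cf = correction_factor(xs) + correction_factor(ys)
    return 1 - 6 * (sum_d2 + cf) / (n * (n * n - 1))

# Data with repeated values, as in the problem above.
X = [68, 64, 75, 50, 64, 80, 75, 40, 55, 64]
Y = [62, 58, 68, 45, 81, 60, 68, 48, 50, 70]
print(round(spearman_rho(X, Y), 4))  # 0.5455
```

When no value is repeated, the correction term is zero and the function reduces to the basic formula for ρ.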
Problem: Ten competitors in a musical contest were ranked by three judges
A, B and C as follows:
Competitors: 1 2 3 4 5 6 7 8 9 10
Rank by A: 1 6 5 10 3 2 4 9 7 8
Rank by B: 3 5 8 4 7 10 2 1 6 9
Rank by C: 6 4 9 8 1 2 3 10 5 7
Using rank correlation technique, find which pair of judges have more or less
the same taste in music.

Solution: Let x, y and z denote the ranks given by judges A, B and C
respectively, and let d1 = x − y, d2 = y − z and d3 = x − z.

x      y      z      d1     d2     d3     d1^2   d2^2   d3^2
1      3      6      −2     −3     −5     4      9      25
6      5      4      1      1      2      1      1      4
5      8      9      −3     −1     −4     9      1      16
10     4      8      6      −4     2      36     16     4
3      7      1      −4     6      2      16     36     4
2      10     2      −8     8      0      64     64     0
4      2      3      2      −1     1      4      1      1
9      1      10     8      −9     −1     64     81     1
7      6      5      1      1      2      1      1      4
8      9      7      −1     2      1      1      4      1
Total:                                    200    214    60

Here n = 10, $\sum d_1^2 = 200$, $\sum d_2^2 = 214$ and $\sum d_3^2 = 60$.
$$\rho(A, B) = 1 - \frac{6\sum d_1^2}{n(n^2 - 1)} = 1 - \frac{6 \times 200}{10(10^2 - 1)} = -0.2121$$
$$\rho(B, C) = 1 - \frac{6\sum d_2^2}{n(n^2 - 1)} = 1 - \frac{6 \times 214}{10(10^2 - 1)} = -0.2970$$
$$\rho(A, C) = 1 - \frac{6\sum d_3^2}{n(n^2 - 1)} = 1 - \frac{6 \times 60}{10(10^2 - 1)} = 0.6364$$

Since ρ(A, C) > ρ(A, B) > ρ(B, C), judges A and C have more or less the
same taste in music.
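
The pairwise comparison of judges can be sketched as a short loop over all pairs. This is an illustrative sketch only; the dictionary layout and the helper `rho_from_ranks` are assumptions made here, and the formula without correction factor is used because no rank is repeated.

```python
from itertools import combinations

# Ranks awarded by each judge to competitors 1..10.
ranks = {
    "A": [1, 6, 5, 10, 3, 2, 4, 9, 7, 8],
    "B": [3, 5, 8, 4, 7, 10, 2, 1, 6, 9],
    "C": [6, 4, 9, 8, 1, 2, 3, 10, 5, 7],
}

def rho_from_ranks(r1, r2):
    """Spearman's rho for two untied rankings of the same n items."""
    n = len(r1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

# One coefficient per pair of judges: (A, B), (A, C), (B, C).
rho = {pair: rho_from_ranks(ranks[pair[0]], ranks[pair[1]])
       for pair in combinations(ranks, 2)}

best_pair = max(rho, key=rho.get)   # pair with the largest rho
for pair, value in rho.items():
    print(pair, round(value, 4))
print("closest taste:", best_pair)
```

The pair with the largest coefficient is taken as the one with the most similar taste, which is the criterion used in the solution.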

Regression Analysis

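The regression line of Y on X is given by
$$y - \bar{y} = b_{yx}(x - \bar{x}), \quad\text{where } b_{yx} = r\frac{\sigma_y}{\sigma_x} = \frac{\operatorname{Cov}(x, y)}{\sigma_x^2}.$$
The regression line of X on Y is given by
$$x - \bar{x} = b_{xy}(y - \bar{y}), \quad\text{where } b_{xy} = r\frac{\sigma_x}{\sigma_y} = \frac{\operatorname{Cov}(x, y)}{\sigma_y^2}.$$

Note 1: The two regression lines intersect at $(\bar{x}, \bar{y})$.

Note 2: $b_{yx} \cdot b_{xy} = r\dfrac{\sigma_y}{\sigma_x} \cdot r\dfrac{\sigma_x}{\sigma_y} = r^2 \;\Rightarrow\; r = \pm\sqrt{b_{yx} \cdot b_{xy}}$.
$r = +\sqrt{b_{yx} \cdot b_{xy}}$ if both $b_{yx}$ and $b_{xy}$ are positive.
$r = -\sqrt{b_{yx} \cdot b_{xy}}$ if both $b_{yx}$ and $b_{xy}$ are negative.

Note 3: The acute angle θ between the regression lines is given by
$$\tan\theta = \frac{1 - r^2}{|r|} \cdot \frac{\sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2}.$$

The two regression lines are perpendicular if r = 0.

The two regression lines are coincident if r = ±1.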

Problem: If $\bar{x} = 970$, $\bar{y} = 18$, $\sigma_x = 38$, $\sigma_y = 2$ and r = 0.6, find the regression
line of y on x and of x on y.

Solution: Given $\bar{x} = 970$, $\bar{y} = 18$, $\sigma_x = 38$, $\sigma_y = 2$ and r = 0.6.
$$b_{yx} = r\frac{\sigma_y}{\sigma_x} = 0.6 \times \frac{2}{38} = 0.0316; \qquad b_{xy} = r\frac{\sigma_x}{\sigma_y} = 0.6 \times \frac{38}{2} = 11.4$$
The regression line of Y on X is $y - \bar{y} = b_{yx}(x - \bar{x})$
⇒ y − 18 = 0.0316(x − 970)
⇒ y = 0.0316x − 12.652

The regression line of X on Y is $x - \bar{x} = b_{xy}(y - \bar{y})$
⇒ x − 970 = 11.4(y − 18)
⇒ x = 11.4y + 764.8.
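
The same two lines can be obtained from the summary statistics with a small helper. This is a sketch under the formulas of this section; the function name `regression_lines` and its return layout are assumptions made for illustration.

```python
def regression_lines(mean_x, mean_y, sd_x, sd_y, r):
    """Return (b_yx, a_yx, b_xy, a_xy) such that
    y = b_yx * x + a_yx  (Y on X)  and  x = b_xy * y + a_xy  (X on Y)."""
    b_yx = r * sd_y / sd_x
    b_xy = r * sd_x / sd_y
    a_yx = mean_y - b_yx * mean_x   # line passes through (mean_x, mean_y)
    a_xy = mean_x - b_xy * mean_y
    return b_yx, a_yx, b_xy, a_xy

b_yx, a_yx, b_xy, a_xy = regression_lines(970, 18, 38, 2, 0.6)
print(f"y = {b_yx:.4f} x + ({a_yx:.4f})")
print(f"x = {b_xy:.4f} y + ({a_xy:.4f})")
```

The code keeps the slope unrounded, so the intercept of the first line comes out near −12.63 rather than −12.652; the small difference arises because the worked solution rounds the slope to 0.0316 before multiplying by 970.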

Problem: Find the correlation coefficient and the regression lines from the
following data:

x 62 64 65 69 70 71 72 74
y 126 125 139 145 165 152 180 208

Solution:

x      y      xy      x^2     y^2
62     126    7812    3844    15876
64     125    8000    4096    15625
65     139    9035    4225    19321
69     145    10005   4761    21025
70     165    11550   4900    27225
71     152    10792   5041    23104
72     180    12960   5184    32400
74     208    15392   5476    43264
Total:
547    1240   85546   37527   197840

$$\bar{x} = \frac{1}{8}\sum x = \frac{547}{8} = 68.375; \qquad \bar{y} = \frac{1}{8}\sum y = \frac{1240}{8} = 155$$
$$\sigma_x^2 = \frac{1}{8}\sum x^2 - \bar{x}^2 = \frac{37527}{8} - 68.375^2 = 15.7344$$
$$\sigma_y^2 = \frac{1}{8}\sum y^2 - \bar{y}^2 = \frac{197840}{8} - 155^2 = 705$$
$$\operatorname{Cov}(x, y) = \frac{1}{8}\sum xy - \bar{x}\,\bar{y} = \frac{85546}{8} - (68.375)(155) = 95.125$$
$$r = \frac{\operatorname{Cov}(x, y)}{\sigma_x \sigma_y} = \frac{95.125}{\sqrt{15.7344}\,\sqrt{705}} = 0.9032$$
$$b_{yx} = \frac{\operatorname{Cov}(x, y)}{\sigma_x^2} = \frac{95.125}{15.7344} = 6.0457$$
$$b_{xy} = \frac{\operatorname{Cov}(x, y)}{\sigma_y^2} = \frac{95.125}{705} = 0.1349$$

The regression line of Y on X is $y - \bar{y} = b_{yx}(x - \bar{x})$
⇒ y − 155 = 6.0457(x − 68.375)
⇒ y = 6.0457x − 258.3747

The regression line of X on Y is $x - \bar{x} = b_{xy}(y - \bar{y})$
⇒ x − 68.375 = 0.1349(y − 155)
⇒ x = 0.1349y + 47.4655.
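
For raw data, the correlation coefficient and both regression coefficients come from the same three quantities: the two variances and the covariance. The sketch below assumes the divide-by-n formulas of this unit; the function `fit_regression` and its dictionary keys are hypothetical names used only here.

```python
from math import sqrt

def fit_regression(xs, ys):
    """Means, r, and slopes/intercepts of both regression lines."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum(x * x for x in xs) / n - mean_x ** 2
    var_y = sum(y * y for y in ys) / n - mean_y ** 2
    cov = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    b_yx, b_xy = cov / var_x, cov / var_y
    return {
        "r": cov / sqrt(var_x * var_y),
        "b_yx": b_yx, "a_yx": mean_y - b_yx * mean_x,   # y = b_yx*x + a_yx
        "b_xy": b_xy, "a_xy": mean_x - b_xy * mean_y,   # x = b_xy*y + a_xy
    }

x = [62, 64, 65, 69, 70, 71, 72, 74]
y = [126, 125, 139, 145, 165, 152, 180, 208]
result = fit_regression(x, y)
print(round(result["r"], 4), round(result["b_yx"], 4), round(result["b_xy"], 4))
```

The intercepts computed this way agree with the worked solution to about two decimal places; the remaining difference comes from rounding the slopes to four decimals in the hand calculation.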

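Problem: The two regression lines are 4x − 5y + 33 = 0 and 20x − 9y = 107,
and Var(X) = 25. Find (i) the means of X and Y, (ii) the correlation coefficient
and Var(Y), (iii) the angle between the regression lines.

Solution: Let 4x − 5y + 33 = 0 · · · (1) and 20x − 9y − 107 = 0 · · · (2) be
the two regression lines, and $\sigma_x^2 = 25$.
(i) Since both regression lines pass through $(\bar{x}, \bar{y})$, we have
$4\bar{x} - 5\bar{y} = -33$ · · · (3)
$20\bar{x} - 9\bar{y} = 107$ · · · (4)
Solving (3) and (4) by Cramer's rule, we get $(\bar{x}, \bar{y})$.
$$\Delta = \begin{vmatrix} 4 & -5 \\ 20 & -9 \end{vmatrix} = -36 + 100 = 64$$
$$\Delta_x = \begin{vmatrix} -33 & -5 \\ 107 & -9 \end{vmatrix} = 297 + 535 = 832$$
$$\Delta_y = \begin{vmatrix} 4 & -33 \\ 20 & 107 \end{vmatrix} = 428 + 660 = 1088$$
$$\bar{x} = \frac{\Delta_x}{\Delta} = \frac{832}{64} = 13; \qquad \bar{y} = \frac{\Delta_y}{\Delta} = \frac{1088}{64} = 17$$

(ii) Let (1) be the regression line of y on x and (2) be the regression line of
x on y.
$$(1) \Rightarrow y = \frac{4}{5}x + \frac{33}{5}; \qquad (2) \Rightarrow x = \frac{9}{20}y + \frac{107}{20}$$
$$\Rightarrow b_{yx} = \frac{4}{5} \quad\text{and}\quad b_{xy} = \frac{9}{20}$$
$$r = \pm\sqrt{b_{yx} \cdot b_{xy}} = \pm\sqrt{\frac{4}{5} \cdot \frac{9}{20}} = \pm\frac{3}{5}$$
$r = \frac{3}{5}$, since $b_{yx}$ and $b_{xy}$ are positive.

We have $b_{xy} = r\dfrac{\sigma_x}{\sigma_y}$, so
$$\sigma_y = \frac{r\sigma_x}{b_{xy}} = \frac{\frac{3}{5} \times 5}{\frac{9}{20}} = \frac{20}{3}, \qquad \operatorname{Var}(Y) = \frac{400}{9}.$$

(iii) We have
$$\tan\theta = \frac{1 - r^2}{|r|} \cdot \frac{\sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2} = \frac{1 - \left(\frac{3}{5}\right)^2}{\left|\frac{3}{5}\right|} \cdot \frac{5 \times \frac{20}{3}}{25 + \frac{400}{9}} = 0.512$$

$\theta = \tan^{-1}(0.512) = 27^\circ\, 6'\, 44.81''$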
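
The three parts of this solution can be traced numerically. The sketch below is an illustration, not the author's method: each line is stored as coefficients (a, b, c) of a·x + b·y + c = 0, the first line is assumed to be the regression of y on x and the second of x on y, and that assumption is checked through the condition r² ≤ 1.

```python
from fractions import Fraction as F
from math import atan, degrees, sqrt

# Lines a*x + b*y + c = 0.
a1, b1, c1 = F(4), F(-5), F(33)      # 4x - 5y + 33 = 0   (assumed y on x)
a2, b2, c2 = F(20), F(-9), F(-107)   # 20x - 9y - 107 = 0 (assumed x on y)

# (i) Means: intersection of the two lines by Cramer's rule.
det = a1 * b2 - a2 * b1
mean_x = (b1 * c2 - b2 * c1) / det
mean_y = (a2 * c1 - a1 * c2) / det

# (ii) Regression coefficients, r and sigma_y.
b_yx = -a1 / b1                      # slope of y on x
b_xy = -b2 / a2                      # slope of x on y
r_squared = b_yx * b_xy
assert r_squared <= 1, "swap the roles of the two lines"
r = sqrt(r_squared) if b_yx > 0 else -sqrt(r_squared)
sd_x = 5.0                           # from Var(X) = 25
sd_y = r * sd_x / float(b_xy)

# (iii) Acute angle between the lines.
tan_theta = (1 - r ** 2) / abs(r) * sd_x * sd_y / (sd_x ** 2 + sd_y ** 2)
theta_deg = degrees(atan(tan_theta))

print(mean_x, mean_y, r, sd_y ** 2, tan_theta, theta_deg)
```

If the two lines were assigned the other way round, the product of the slopes would exceed 1, which is why the check on r² is a convenient way to decide which line is which.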
