
746 Statistics

Perfect Correlation: If two variables vary in such a way that their ratio is always constant,
then the correlation is said to be perfect.
10.17 SCATTER OR DOT-DIAGRAM
When we plot the corresponding values of two variables, taking one on x-axis and the
other along y-axis, it shows a collection of dots.
This collection of dots is called a dot diagram or a scatter diagram.

Methods of Determining Simple Correlation

10.18 KARL PEARSON’S COEFFICIENT OF CORRELATION


The coefficient of correlation r between two variables x and y is defined by the relation

$$r = \frac{\sum XY}{\sqrt{\sum X^2 \sum Y^2}} = \frac{P}{\sigma_x \sigma_y} = \frac{\text{Covariance}(x, y)}{\sqrt{\text{variance}(x)}\,\sqrt{\text{variance}(y)}},$$

where $X = x - \bar{x}$, $Y = y - \bar{y}$,
i.e. X, Y are the deviations measured from their respective means,

$$P = \frac{\sum XY}{n} = \text{covariance},$$

and $\sigma_x$, $\sigma_y$ are the standard deviations of these series.

Example 15. Ten students got the following percentage of marks in Economics and
Statistics.
Roll No. 1 2 3 4 5 6 7 8 9 10
Marks in Economics 78 36 98 25 75 82 90 62 65 39
Marks in Statistics 84 51 91 60 68 62 86 58 53 47

Calculate the coefficient of correlation.


Solution. Let the marks of two subjects be denoted by x and y respectively.
Then the mean of the x marks $= \frac{650}{10} = 65$ and the mean of the y marks $= \frac{660}{10} = 66$.
If X and Y are deviations of x’s and y’s from their respective means, then the data may
be arranged in the following form :
  x     y    X = x − 65   Y = y − 66     X²      Y²      XY
 78    84        13           18         169     324     234
 36    51       −29          −15         841     225     435
 98    91        33           25        1089     625     825
 25    60       −40           −6        1600      36     240
 75    68        10            2         100       4      20
 82    62        17           −4         289      16     −68
 90    86        25           20         625     400     500
 62    58        −3           −8           9      64      24
 65    53         0          −13           0     169       0
 39    47       −26          −19         676     361     494
650   660         0            0        5398    2224    2704   (totals)

Here $\sum X^2 = 5398$, $\sum Y^2 = 2224$, $\sum XY = 2704$.

$$\therefore\ r = \frac{\sum XY}{\sqrt{\sum X^2 \sum Y^2}} = \frac{2704}{\sqrt{5398 \times 2224}} = \frac{2704}{73.4 \times 47.1} = \frac{2704}{3457} = 0.78 \quad \text{Ans.}$$
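The arithmetic of Example 15 can be checked with a few lines of Python. This is only a minimal sketch of the deviation formula for r; the function name `pearson_r` and the variable names are illustrative choices, not part of the text.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Karl Pearson's r computed from deviations about the two means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dev_x = [x - mean_x for x in xs]          # X = x - mean of x
    dev_y = [y - mean_y for y in ys]          # Y = y - mean of y
    sum_xy = sum(dx * dy for dx, dy in zip(dev_x, dev_y))
    sum_x2 = sum(dx * dx for dx in dev_x)
    sum_y2 = sum(dy * dy for dy in dev_y)
    return sum_xy / sqrt(sum_x2 * sum_y2)

# Marks of the ten students in Example 15
economics_marks = [78, 36, 98, 25, 75, 82, 90, 62, 65, 39]
statistics_marks = [84, 51, 91, 60, 68, 62, 86, 58, 53, 47]

r_marks = pearson_r(economics_marks, statistics_marks)
print(round(r_marks, 2))  # 0.78
```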
Example 16. Find the coefficient of correlation between the age and the sum assured
from the following table:
Sum assured (in Rs.)
Age group 10,000 20,000 30,000 40,000 50,000
20–30 4 6 3 7 1
30–40 2 8 15 7 1
40–50 3 9 12 6 2
50–60 8 4 2 — —

Solution. Let the sum assured be denoted by x and the age (mid-value of the age group) by y. Take

$$x' = \frac{x - 30{,}000}{10{,}000}, \qquad y' = \frac{y - 45}{10}.$$

In the table each cell gives the frequency f followed, in brackets, by the product f x′y′.

Age group   y    y′    x = 10,000   20,000    30,000    40,000    50,000     f    f y′   f y′²   f x′y′
                       x′ = −2        −1         0         1         2     (rows)
20–30       25   −2     4 (16)      6 (12)    3 (0)     7 (−14)   1 (−4)     21   −42     84      +10
30–40       35   −1     2 (4)       8 (8)    15 (0)     7 (−7)    1 (−2)     33   −33     33       +3
40–50       45    0     3 (0)       9 (0)    12 (0)     6 (0)     2 (0)      32     0      0        0
50–60       55    1     8 (−16)     4 (−4)    2 (0)       —         —        14    14     14      −20
f (columns)             17          27        32         20         4        N = 100, Σ f y′ = −61, Σ f y′² = 131, Σ f x′y′ = −7
f x′                   −34         −27         0         20         8        Σ f x′ = −33
f x′²                   68          27         0         20        16        Σ f x′² = 131
f x′y′                   4          16         0        −21        −6        Σ f x′y′ = −7

$$r = \frac{N \sum f x'y' - \sum f x' \sum f y'}{\sqrt{\left[N \sum f x'^2 - \left(\sum f x'\right)^2\right]\left[N \sum f y'^2 - \left(\sum f y'\right)^2\right]}}$$

$$= \frac{100(-7) - (-33)(-61)}{\sqrt{\left[100(131) - (-33)^2\right]\left[100(131) - (-61)^2\right]}} = \frac{-700 - 2013}{\sqrt{(13100 - 1089)(13100 - 3721)}}$$

$$= \frac{-2713}{\sqrt{12011 \times 9379}} = \frac{-2713}{109.59 \times 96.85} = \frac{-2713}{10613.7915} = -0.2556$$
Hence, the age and sum assured are negatively correlated, i.e., as age goes up the sum
assured comes down. Ans.
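The two-way table calculation can be sketched in code as a check. The helper name `grouped_correlation` and the list layout (rows = age groups, columns = sums assured, empty cells entered as 0) are assumptions made for this illustration.

```python
from math import sqrt

def grouped_correlation(x_steps, y_steps, freq):
    """r from a two-way frequency table.

    x_steps[j] and y_steps[i] are the coded deviations of column j and row i;
    freq[i][j] is the cell frequency (0 for an empty cell).
    """
    n = sum(sum(row) for row in freq)
    s_x = s_y = s_xx = s_yy = s_xy = 0
    for y_dev, row in zip(y_steps, freq):
        for x_dev, f in zip(x_steps, row):
            s_x += f * x_dev
            s_y += f * y_dev
            s_xx += f * x_dev ** 2
            s_yy += f * y_dev ** 2
            s_xy += f * x_dev * y_dev
    numerator = n * s_xy - s_x * s_y
    denominator = sqrt((n * s_xx - s_x ** 2) * (n * s_yy - s_y ** 2))
    return numerator / denominator

# Example 16: coded sum assured (columns) and coded age (rows)
sum_assured_steps = [-2, -1, 0, 1, 2]
age_steps = [-2, -1, 0, 1]
policies = [
    [4, 6, 3, 7, 1],    # 20-30
    [2, 8, 15, 7, 1],   # 30-40
    [3, 9, 12, 6, 2],   # 40-50
    [8, 4, 2, 0, 0],    # 50-60
]

r_assured = grouped_correlation(sum_assured_steps, age_steps, policies)
print(round(r_assured, 4))  # -0.2556
```

Because r is unchanged by a change of origin and scale, the coded steps can be used directly without converting back to rupees or years.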
10.19 SHORT-CUT METHOD

$$r = \frac{\dfrac{\sum X'Y'}{N} - \left(\dfrac{\sum X'}{N}\right)\left(\dfrac{\sum Y'}{N}\right)}{\sqrt{\left[\dfrac{\sum X'^2}{N} - \left(\dfrac{\sum X'}{N}\right)^2\right]\left[\dfrac{\sum Y'^2}{N} - \left(\dfrac{\sum Y'}{N}\right)^2\right]}}$$

where r is the coefficient of correlation,
X′ = deviation from assumed mean of x = x – a,
Y′ = deviation from assumed mean of y = y – b,
N = total number of items.
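The point of the short-cut method is that any convenient assumed means a and b give the same r. A small sketch, reusing the marks of Example 15 with the hypothetical assumed means a = 60 and b = 60 (these values are chosen here only for illustration):

```python
from math import sqrt

def shortcut_r(xs, ys, a, b):
    """r by the short-cut method, with deviations taken from assumed means a and b."""
    n = len(xs)
    dev_x = [x - a for x in xs]   # X' = x - a
    dev_y = [y - b for y in ys]   # Y' = y - b
    mean_dx = sum(dev_x) / n
    mean_dy = sum(dev_y) / n
    numerator = sum(p * q for p, q in zip(dev_x, dev_y)) / n - mean_dx * mean_dy
    var_x = sum(p * p for p in dev_x) / n - mean_dx ** 2
    var_y = sum(q * q for q in dev_y) / n - mean_dy ** 2
    return numerator / sqrt(var_x * var_y)

economics_marks = [78, 36, 98, 25, 75, 82, 90, 62, 65, 39]
statistics_marks = [84, 51, 91, 60, 68, 62, 86, 58, 53, 47]

r_assumed = shortcut_r(economics_marks, statistics_marks, 60, 60)  # assumed means
r_actual = shortcut_r(economics_marks, statistics_marks, 65, 66)   # true means
print(round(r_assumed, 2), round(r_actual, 2))  # 0.78 0.78
```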
Statistics 749

Example 17. Calculate the coefficient of correlation for the following table :
Marks (y) \ Age (x)   0–4   4–8   8–12   12–16
0–5 7 — — —
5–10 6 8 — —
10–15 — 5 3 —
15–20 — 7 2 —
20-25 — — — 9

Solution. Replace the class-intervals for x and y by their mid-points and then let

$$X' = \frac{x - 10}{4} \quad \text{and} \quad Y' = \frac{y - 12.5}{5}.$$

In the table each cell gives the frequency f followed, in brackets, by the product f X′Y′.

Marks     y      Y′    x = 2      6        10        14        f     f Y′   f Y′²   f X′Y′
                       X′ = −2    −1        0         1      (row)
0–5       2.5    −2    7 (28)     —         —         —        7    −14     28       28
5–10      7.5    −1    6 (12)    8 (8)      —         —       14    −14     14       20
10–15    12.5     0     —        5 (0)     3 (0)      —        8      0      0        0
15–20    17.5     1     —        7 (−7)    2 (0)      —        9      9      9       −7
20–25    22.5     2     —         —         —       9 (18)     9     18     36       18
f (columns)            13        20         5         9        N = 47, Σ f Y′ = −1, Σ f Y′² = 87, Σ f X′Y′ = 59
f X′                  −26       −20         0         9        Σ f X′ = −37
f X′²                  52        20         0         9        Σ f X′² = 81
f X′Y′                 40         1         0        18        Σ f X′Y′ = 59

Here $\sum f X' = -37$, $\sum f X'^2 = 81$, $\sum f Y' = -1$, $\sum f Y'^2 = 87$, $\sum f X'Y' = 59$.

$$r = \frac{\dfrac{\sum f X'Y'}{N} - \left(\dfrac{\sum f X'}{N}\right)\left(\dfrac{\sum f Y'}{N}\right)}{\sqrt{\left[\dfrac{\sum f X'^2}{N} - \left(\dfrac{\sum f X'}{N}\right)^2\right]\left[\dfrac{\sum f Y'^2}{N} - \left(\dfrac{\sum f Y'}{N}\right)^2\right]}}$$

$$= \frac{\dfrac{59}{47} - \left(\dfrac{-37}{47}\right)\left(\dfrac{-1}{47}\right)}{\sqrt{\left[\dfrac{81}{47} - \left(\dfrac{-37}{47}\right)^2\right]\left[\dfrac{87}{47} - \left(\dfrac{-1}{47}\right)^2\right]}} = \frac{1.255 - 0.017}{\sqrt{(1.723 - 0.620)(1.851 - 0.0005)}}$$

$$= \frac{1.238}{\sqrt{1.103 \times 1.8505}} = \frac{1.238}{1.05 \times 1.36} = \frac{1.238}{1.428} = 0.87 \quad \text{Ans.}$$

10.20 SPEARMAN’S RANK CORRELATION

$$r = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$$

Solution. Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be the ranks of n individuals corresponding to two
characteristics.
Assuming no two individuals are equal in either classification, each variable takes the
values 1, 2, 3, ... n and hence their arithmetic means are, each,

$$\frac{\sum n}{n} = \frac{1}{n} \cdot \frac{n(n+1)}{2} = \frac{n+1}{2}.$$

Let $x_1, x_2, x_3, \ldots, x_n$ be the ranks in the first characteristic and $y_1, y_2, y_3, \ldots, y_n$ those in the second.

$$\text{Then } d = X - Y = \left(x - \frac{n+1}{2}\right) - \left(y - \frac{n+1}{2}\right) = x - y,$$

where X and Y are deviations from the mean.

$$\sum X^2 = \sum \left(x - \frac{n+1}{2}\right)^2 = \sum x^2 - (n+1)\sum x + \sum \left(\frac{n+1}{2}\right)^2$$

$$= \frac{n(n+1)(2n+1)}{6} - (n+1)\frac{n(n+1)}{2} + n\left(\frac{n+1}{2}\right)^2 = \frac{n(n^2 - 1)}{12}$$

Clearly, $\sum X = \sum Y$ and $\sum X^2 = \sum Y^2$,

$$\therefore\ \sum Y^2 = \frac{n(n^2 - 1)}{12}$$

Hence

$$\sum d^2 = \sum (X - Y)^2 = \sum X^2 + \sum Y^2 - 2\sum XY$$

$$\therefore\ \sum XY = \frac{1}{2}\left[\frac{n(n^2 - 1)}{6} - \sum d^2\right] = \frac{1}{12}\,n(n^2 - 1) - \frac{1}{2}\sum d^2$$

Putting these values in

$$r = \frac{\sum XY}{\sqrt{\sum X^2 \sum Y^2}} = \frac{\dfrac{1}{12}\,n(n^2 - 1) - \dfrac{1}{2}\sum d^2}{\dfrac{n(n^2 - 1)}{12}} = 1 - \frac{6\sum d^2}{n(n^2 - 1)} \quad \text{Ans.}$$
10.21 SPEARMAN’S RANK CORRELATION COEFFICIENT

$$r = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$$

where r denotes rank coefficient of correlation and d refers to the difference of ranks
between paired items in two series.

Example 18. Compute Spearman’s rank correlation coefficient r for the following data:
Person A B C D E F G H I J
Rank in statistics 9 10 6 5 7 2 4 8 1 3
Rank in income 1 2 3 4 5 6 7 8 9 10

Solution.
Person   Rank in statistics (R₁)   Rank in income (R₂)   d = R₁ − R₂    d²
A                 9                        1                  8          64
B                10                        2                  8          64
C                 6                        3                  3           9
D                 5                        4                  1           1
E                 7                        5                  2           4
F                 2                        6                 −4          16
G                 4                        7                 −3           9
H                 8                        8                  0           0
I                 1                        9                 −8          64
J                 3                       10                 −7          49
                                                                  Σ d² = 280

$$r = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 280}{10(100 - 1)} = 1 - 1.697 = -0.697 \quad \text{Ans.}$$
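The rank formula is easy to check numerically. The sketch below assumes untied ranks, as the derivation does; the function name `spearman_rho` is an illustrative choice.

```python
def spearman_rho(ranks_1, ranks_2):
    """Spearman's rank correlation 1 - 6*sum(d^2) / (n*(n^2 - 1)) for untied ranks."""
    n = len(ranks_1)
    sum_d2 = sum((r1 - r2) ** 2 for r1, r2 in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Ranks of persons A to J in Example 18
rank_statistics = [9, 10, 6, 5, 7, 2, 4, 8, 1, 3]
rank_income = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

rho = spearman_rho(rank_statistics, rank_income)
print(round(rho, 3))  # -0.697
```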
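Example 19. Establish the formula

$$\sigma_{x+y}^2 = \sigma_x^2 + \sigma_y^2 + 2r\,\sigma_x \sigma_y$$

where r is the correlation coefficient between x and y.

Solution. We know that

$$\sigma_x^2 = \frac{\sum (x - \bar{x})^2}{n}, \qquad \sigma_{x+y}^2 = \frac{\sum \left[(x + y) - (\overline{x + y})\right]^2}{n}$$

$\overline{x + y}$ = mean of the (x + y) series = mean of x + mean of y = $\bar{x} + \bar{y}$

$$\sigma_{x+y}^2 = \frac{\sum [x + y - \bar{x} - \bar{y}]^2}{n} = \frac{\sum [(x - \bar{x}) + (y - \bar{y})]^2}{n}$$

$$= \frac{\sum [(x - \bar{x})^2 + (y - \bar{y})^2 + 2(x - \bar{x})(y - \bar{y})]}{n}$$

$$= \frac{\sum (x - \bar{x})^2}{n} + \frac{\sum (y - \bar{y})^2}{n} + \frac{2\sum (x - \bar{x})(y - \bar{y})}{n}$$

$$= \sigma_x^2 + \sigma_y^2 + \frac{2\sum (x - \bar{x})(y - \bar{y})}{n} \qquad ...(1)$$

We know that

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{n\,\sigma_x \sigma_y} \quad \text{or} \quad \frac{\sum (x - \bar{x})(y - \bar{y})}{n} = r\,\sigma_x \sigma_y$$

Putting this value in (1) we get,

$$\sigma_{x+y}^2 = \sigma_x^2 + \sigma_y^2 + 2r\,\sigma_x \sigma_y \qquad \text{Proved}$$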
Example 20. If X and Y are uncorrelated random variables, find the coefficient of
correlation between X + Y and X – Y.
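Solution.
Let u = X + Y and v = X – Y.

$$\text{Then } r = \frac{\sum (u - \bar{u})(v - \bar{v})}{n\,\sigma_u \sigma_v}$$

Now u = X + Y, so $\bar{u} = \bar{X} + \bar{Y}$.
Similarly $\bar{v} = \bar{X} - \bar{Y}$.
Writing $x = X - \bar{X}$ and $y = Y - \bar{Y}$ for the deviations from the means,

$$\sum (u - \bar{u})(v - \bar{v}) = \sum [(X - \bar{X}) + (Y - \bar{Y})]\,[(X - \bar{X}) - (Y - \bar{Y})]$$

$$= \sum (x + y)(x - y) = \sum x^2 - \sum y^2 = n\sigma_x^2 - n\sigma_y^2$$

$$\text{Also } \sigma_u^2 = \frac{\sum (u - \bar{u})^2}{n} = \frac{1}{n}\sum [(X - \bar{X}) + (Y - \bar{Y})]^2 = \frac{1}{n}\sum (x + y)^2$$

$$= \frac{1}{n}\left[\sum x^2 + \sum y^2 + 2\sum xy\right] = \sigma_x^2 + \sigma_y^2$$

(As X and Y are not correlated, we have $\sum xy = 0$.)
Similarly $\sigma_v^2 = \sigma_x^2 + \sigma_y^2$.

$$\therefore\ r = \frac{\sum (u - \bar{u})(v - \bar{v})}{n\,\sigma_u \sigma_v} = \frac{n(\sigma_x^2 - \sigma_y^2)}{n\sqrt{\sigma_x^2 + \sigma_y^2}\,\sqrt{\sigma_x^2 + \sigma_y^2}} = \frac{\sigma_x^2 - \sigma_y^2}{\sigma_x^2 + \sigma_y^2} \quad \text{Ans.}$$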
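The result of Example 20 can be checked on a small data set. The four pairs below are hypothetical values, constructed here so that X and Y are exactly uncorrelated with variances 4 and 1; they do not come from the text.

```python
from math import sqrt

def pearson_r(us, vs):
    """Correlation coefficient from deviations about the means."""
    n = len(us)
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    s_uv = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
    s_uu = sum((u - mean_u) ** 2 for u in us)
    s_vv = sum((v - mean_v) ** 2 for v in vs)
    return s_uv / sqrt(s_uu * s_vv)

def variance(values):
    """Population variance (divisor n), as used throughout this chapter."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

# Hypothetical uncorrelated data: variance of X is 4, variance of Y is 1
X = [2, -2, 2, -2]
Y = [1, 1, -1, -1]

u = [x + y for x, y in zip(X, Y)]
v = [x - y for x, y in zip(X, Y)]

r_uv = pearson_r(u, v)
predicted = (variance(X) - variance(Y)) / (variance(X) + variance(Y))
print(r_uv, predicted)  # 0.6 0.6
```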
10.22 REGRESSION
If the scatter diagram indicates some relationship between two variables x and y, then the
dots of the scatter diagram will be concentrated round a curve. This curve is called the curve
of regression.
Regression analysis is the method used for estimating the unknown values of one variable
corresponding to the known value of another variable.
10.23 LINE OF REGRESSION
When the curve is a straight line, it is called a line of regression. A line of regression is
the straight line which gives the best fit in the least square sense to the given frequency.

10.24 EQUATIONS TO THE LINES OF REGRESSION


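Let y = a + bx ...(1)
be the equation of the line of regression of y on x.
Let $(x_r, y_r)$ be any point of the dot diagram.
From the figure
$PR = y_r$
$QR = a + b x_r$
$PQ = PR - QR = y_r - a - b x_r$
Let S be the sum of the squares of such distances, then

$$S = \sum (y - a - bx)^2$$

According to the principle of least squares, we have to choose a and b so that S is
minimum. The method of least squares gives the condition for minimum value of S.

$$\frac{\partial S}{\partial a} = -2\sum (y - a - bx), \qquad \frac{\partial S}{\partial b} = -2\sum (y - a - bx)\,x$$

$$\frac{\partial S}{\partial a} = 0, \quad \frac{\partial S}{\partial b} = 0, \quad \text{for S minimum}$$

i.e. $\sum (y - a - bx) = 0$ or $\sum y - na - b\sum x = 0$
or $\sum y = na + b\sum x$ ...(2)
and $\sum (xy - ax - bx^2) = 0$ or $\sum xy - a\sum x - b\sum x^2 = 0$
$\sum xy = a\sum x + b\sum x^2$ ...(3)
Dividing (2) by n we get

$$\frac{\sum y}{n} = a + b\,\frac{\sum x}{n} \qquad \left[\bar{y} = \frac{\sum y}{n},\ \bar{x} = \frac{\sum x}{n}\right]$$

$$\therefore\ \bar{y} = a + b\bar{x}$$

where $\bar{x}$ and $\bar{y}$ are the means of x series and y series.
This shows that $(\bar{x}, \bar{y})$ lies on the line of regression (1). Shifting the origin to $(\bar{x}, \bar{y})$, the
equation (3) becomes

$$\sum (x - \bar{x})(y - \bar{y}) = a\sum (x - \bar{x}) + b\sum (x - \bar{x})^2$$

But $\sum (x - \bar{x}) = 0$, i.e. $\sum (x - \bar{x})(y - \bar{y}) = b\sum (x - \bar{x})^2$

$$\text{or } b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{\sum XY}{\sum X^2} \qquad ...(4)$$

We know

$$r = \frac{\sum XY}{\sqrt{\sum X^2 \sum Y^2}} = \frac{\sum XY}{n\sqrt{\dfrac{\sum X^2}{n}}\sqrt{\dfrac{\sum Y^2}{n}}} = \frac{\sum XY}{n\,\sigma_x \sigma_y}$$

or $\sum XY = n r\,\sigma_x \sigma_y$
Putting the value of $\sum XY$ in (4) we get

$$b = \frac{n r\,\sigma_x \sigma_y}{\sum X^2} = \frac{r\,\sigma_x \sigma_y}{\dfrac{\sum X^2}{n}} = \frac{r\,\sigma_x \sigma_y}{\sigma_x^2} = r\,\frac{\sigma_y}{\sigma_x}$$

i.e. slope of the line of regression $= b = r\,\dfrac{\sigma_y}{\sigma_x}$.
The line of regression passes through $(\bar{x}, \bar{y})$.
Hence the equation to the line of regression is

$$y - \bar{y} = r\,\frac{\sigma_y}{\sigma_x}\,(x - \bar{x})$$

Similarly the regression line of x on y is

$$x - \bar{x} = r\,\frac{\sigma_x}{\sigma_y}\,(y - \bar{y}).$$

Note. $b_{yx} = r\,\dfrac{\sigma_y}{\sigma_x}$ and $b_{xy} = r\,\dfrac{\sigma_x}{\sigma_y}$ are known as the coefficients of regression.

$$b_{yx} \cdot b_{xy} = \left(r\,\frac{\sigma_y}{\sigma_x}\right)\left(r\,\frac{\sigma_x}{\sigma_y}\right) = r^2$$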
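As a numerical illustration of the note above, the two regression coefficients can be computed from raw data and their product compared with r². The sketch reuses the marks of Example 15; the function name `regression_coefficients` is an illustrative choice.

```python
from math import sqrt

def regression_coefficients(xs, ys):
    """Return (b_yx, b_xy, r) computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    b_yx = s_xy / s_xx              # slope of the line of y on x
    b_xy = s_xy / s_yy              # coefficient of the line of x on y
    r = s_xy / sqrt(s_xx * s_yy)
    return b_yx, b_xy, r

economics_marks = [78, 36, 98, 25, 75, 82, 90, 62, 65, 39]
statistics_marks = [84, 51, 91, 60, 68, 62, 86, 58, 53, 47]

b_yx, b_xy, r = regression_coefficients(economics_marks, statistics_marks)
print(round(b_yx * b_xy, 4), round(r * r, 4))  # both print the same value
```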
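Example 21. If θ be the acute angle between the two regression lines in the case of
two variables x and y, show that

$$\tan\theta = \frac{1 - r^2}{r} \cdot \frac{\sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2}$$

where r, $\sigma_x$, $\sigma_y$ have their usual meanings. Explain the significance where r = 0 and
r = ±1. (A.M.I.E., Winter 2001)
Solution. Lines of regression are

$$y - \bar{y} = r\,\frac{\sigma_y}{\sigma_x}\,(x - \bar{x}) \quad ...(1) \qquad \Rightarrow\ m_1 = r\,\frac{\sigma_y}{\sigma_x}$$

$$\text{and } x - \bar{x} = r\,\frac{\sigma_x}{\sigma_y}\,(y - \bar{y}) \quad ...(2) \qquad \Rightarrow\ m_2 = \frac{1}{r}\,\frac{\sigma_y}{\sigma_x}$$

$$\tan\theta = \frac{m_2 - m_1}{1 + m_1 m_2} = \frac{\dfrac{1}{r}\dfrac{\sigma_y}{\sigma_x} - r\,\dfrac{\sigma_y}{\sigma_x}}{1 + r\,\dfrac{\sigma_y}{\sigma_x} \cdot \dfrac{1}{r}\dfrac{\sigma_y}{\sigma_x}} = \frac{\left(\dfrac{1}{r} - r\right)\dfrac{\sigma_y}{\sigma_x}}{1 + \dfrac{\sigma_y^2}{\sigma_x^2}}$$

$$= \frac{1 - r^2}{r} \cdot \frac{\sigma_y}{\sigma_x} \cdot \frac{\sigma_x^2}{\sigma_x^2 + \sigma_y^2} = \frac{1 - r^2}{r} \cdot \frac{\sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2} \qquad ...(3) \quad \text{Proved}$$

(a) If r = 0, then there is no linear relationship between the two variables, i.e. they are uncorrelated.
On putting the value of r = 0 in (3) we get $\tan\theta = \infty$, $\theta = \dfrac{\pi}{2}$. So the lines (1) and (2)
are perpendicular. (A.M.I.E., Summer 1998)
(b) If r = 1 or –1
On putting these values of r in (3) we get, $\tan\theta = 0$ or $\theta = 0$,
i.e. lines (1) and (2) coincide.
The correlation between the variables is perfect. Ans.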
Example 22. Find the correlation coefficient between x and y, when the lines of regression
are:
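2x – 9y + 6 = 0, x – 2y + 1 = 0
Solution. Let the line of regression of x on y be 2x – 9y + 6 = 0
Then, the line of regression of y on x is x – 2y + 1 = 0

$$\therefore\ 2x - 9y + 6 = 0 \ \Rightarrow\ x = \frac{9}{2}\,y - 3 \ \Rightarrow\ b_{xy} = \frac{9}{2}$$

$$\text{and } x - 2y + 1 = 0 \ \Rightarrow\ y = \frac{1}{2}\,x + \frac{1}{2} \ \Rightarrow\ b_{yx} = \frac{1}{2}$$

$$\therefore\ r = \sqrt{b_{xy} \cdot b_{yx}} = \sqrt{\frac{9}{2} \cdot \frac{1}{2}} = \frac{3}{2} > 1, \text{ which is not possible.}$$

So our choice of regression lines is incorrect.
∴ The regression line of x on y is x – 2y + 1 = 0
And, the regression line of y on x is 2x – 9y + 6 = 0

$$\therefore\ x - 2y + 1 = 0 \ \Rightarrow\ x = 2y - 1 \ \Rightarrow\ b_{xy} = 2$$

$$\text{And } 2x - 9y + 6 = 0 \ \Rightarrow\ y = \frac{2}{9}\,x + \frac{2}{3} \ \Rightarrow\ b_{yx} = \frac{2}{9}$$

$$\therefore\ r = \sqrt{b_{xy} \cdot b_{yx}} = \sqrt{2 \cdot \frac{2}{9}} = \frac{2}{3}$$

Hence the correlation coefficient between x and y is $\dfrac{2}{3}$.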
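The trial-and-error step of Example 22 can be sketched as a small routine: try both assignments of the two lines and keep the one for which the product of the regression coefficients does not exceed 1. The names `coefficients` and `correlation_from_lines` are hypothetical, and the sketch assumes neither line is parallel to an axis (both coefficients of x and y non-zero).

```python
from math import copysign, sqrt

def coefficients(a, b, c):
    """For the line a*x + b*y + c = 0 return (b_yx, b_xy) if it played either role."""
    return -a / b, -b / a

def correlation_from_lines(line_1, line_2):
    """Pick the assignment with 0 <= b_yx * b_xy <= 1 and return r with the sign of b_yx."""
    for y_on_x, x_on_y in ((line_1, line_2), (line_2, line_1)):
        b_yx = coefficients(*y_on_x)[0]
        b_xy = coefficients(*x_on_y)[1]
        product = b_yx * b_xy
        if 0 <= product <= 1:
            return copysign(sqrt(product), b_yx)
    raise ValueError("the two lines cannot both be regression lines")

# Example 22: 2x - 9y + 6 = 0 and x - 2y + 1 = 0
r_lines = correlation_from_lines((2, -9, 6), (1, -2, 1))
print(round(r_lines, 4))  # 0.6667
```

The two candidate products are reciprocals of each other, so at most one of them can be below 1; that is why the wrong choice in the worked solution is detected immediately.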
Example 23. The following regression equations were obtained from a correlation table:
y = 0.516x + 33.73, x = 0.512y + 32.52
Find the value of (a) the correlation coefficient, (b) the mean of x’s and (c) the mean of
y’s.
Solution. y = 0.516x + 33.73 ...(1)
x = 0.512y + 32.52 ...(2)

$$\text{(a) From (1), } r\,\frac{\sigma_y}{\sigma_x} = 0.516 \quad ...(3)$$

$$\text{From (2), } r\,\frac{\sigma_x}{\sigma_y} = 0.512 \quad ...(4)$$

From (3) and (4)

$$\left(r\,\frac{\sigma_y}{\sigma_x}\right)\left(r\,\frac{\sigma_x}{\sigma_y}\right) = 0.516 \times 0.512$$

$r^2 = 0.516 \times 0.512$ or r = 0.514
Coefficient of correlation = 0.514. Ans.
(b) (1) and (2) pass through the point $(\bar{x}, \bar{y})$.
∴ $\bar{y} = 0.516\,\bar{x} + 33.73$ ...(5)
$\bar{x} = 0.512\,\bar{y} + 32.52$ ...(6)
On solving (5) and (6), we get
$\bar{x} = 67.67$, $\bar{y} = 68.65$ Ans.
Example 24. The two regression equations of the variables x and y are
x = 19.13 – 0.87y and y = 11.64 – 0.50x.
Find (i) Mean of x’s; (ii) Mean of y’s; (iii) The correlation coefficient between x and y.
(A.M.I.E., Summer 1997, 1996)
Solution. x = 19.13 – 0.87y ...(1)
y = 11.64 – 0.50x ...(2)
As (1) and (2) pass through $(\bar{x}, \bar{y})$:
$\bar{x} = 19.13 - 0.87\,\bar{y}$ ...(3)
$\bar{y} = 11.64 - 0.50\,\bar{x}$ ...(4)
On solving (3) and (4) we get
$\bar{x} = 15.935$, $\bar{y} = 3.67$

$$\text{From (1) } r\,\frac{\sigma_x}{\sigma_y} = -0.87 \quad ...(5)$$

$$\text{From (2) } r\,\frac{\sigma_y}{\sigma_x} = -0.50 \quad ...(6)$$

As $\sigma_x$ and $\sigma_y$ are always positive, r is negative.
Multiplying (5) and (6) we get

$$r\,\frac{\sigma_x}{\sigma_y} \times r\,\frac{\sigma_y}{\sigma_x} = (-0.87)(-0.50)$$

$r^2 = 0.435$ or r = –0.66 Ans.
Example 25. The regression equations calculated from a given set of observations for
two random variables are
x = –0.4y + 6.4 and y = –0.6x + 4.6
Calculate $\bar{x}$, $\bar{y}$ and r. (A.M.I.E., Winter 1997)
Solution. The regression equations are
x = –0.4y + 6.4 ...(1)
y = –0.6x + 4.6 ...(2)

$$\text{From (1) coefficient of regression of x on y} = r\,\frac{\sigma_x}{\sigma_y} = -0.4 \quad ...(3)$$

$$\text{From (2) coefficient of regression of y on x} = r\,\frac{\sigma_y}{\sigma_x} = -0.6 \quad ...(4)$$

From (3) and (4)

$$\left(r\,\frac{\sigma_x}{\sigma_y}\right)\left(r\,\frac{\sigma_y}{\sigma_x}\right) = (-0.4)(-0.6)$$

or $r^2 = 0.24$
r = ±0.49
In (3) and (4), $\sigma_x$ and $\sigma_y$ are (always) positive, so r is negative.
∴ r = –0.49
To find $\bar{x}$ and $\bar{y}$, we solve the equations (1) and (2) simultaneously. Their point of intersection
is $(\bar{x}, \bar{y})$.
$\bar{x} = 6$, $\bar{y} = 1$ Ans.
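The procedure used in Examples 23 to 25 (intersect the two lines to get the means, multiply the regression coefficients to get r²) can be sketched as follows. The function name `means_and_r` and its argument order are illustrative assumptions; the lines are taken in the form x = a₁ + b₁y and y = a₂ + b₂x.

```python
from math import copysign, sqrt

def means_and_r(a1, b1, a2, b2):
    """Given x = a1 + b1*y (x on y) and y = a2 + b2*x (y on x), return (x_bar, y_bar, r)."""
    # The two lines intersect at the point of means.
    x_bar = (a1 + b1 * a2) / (1 - b1 * b2)
    y_bar = a2 + b2 * x_bar
    # r^2 = b_xy * b_yx, and r carries the common sign of the two coefficients.
    r = copysign(sqrt(b1 * b2), b1)
    return x_bar, y_bar, r

# Example 25: x = 6.4 - 0.4y and y = 4.6 - 0.6x
x_bar, y_bar, r = means_and_r(6.4, -0.4, 4.6, -0.6)
print(round(x_bar, 2), round(y_bar, 2), round(r, 2))  # 6.0 1.0 -0.49
```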
Example 26. Show that the geometric mean of the coefficients of regression is the
coefficient of correlation.
Solution. The coefficients of regression are $r\,\dfrac{\sigma_y}{\sigma_x}$ and $r\,\dfrac{\sigma_x}{\sigma_y}$.

$$\therefore\ \text{G.M.} = \sqrt{r\,\frac{\sigma_y}{\sigma_x} \cdot r\,\frac{\sigma_x}{\sigma_y}} = r = \text{coefficient of correlation. Proved.}$$
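Example 27. Prove that arithmetic mean of the coefficients of regression is greater than
the coefficient of correlation. (A.M.I.E., Summer 2000)
Solution. Coefficients of regression are $r\,\dfrac{\sigma_y}{\sigma_x}$, $r\,\dfrac{\sigma_x}{\sigma_y}$.
We have to prove that A.M. > r

$$\text{or } \frac{1}{2}\left[r\,\frac{\sigma_y}{\sigma_x} + r\,\frac{\sigma_x}{\sigma_y}\right] > r \quad \text{or} \quad \frac{1}{2}\left[\frac{\sigma_y}{\sigma_x} + \frac{\sigma_x}{\sigma_y}\right] > 1 \quad \text{(dividing by r, taken as positive)}$$

$$\text{or } \frac{\sigma_y}{\sigma_x} + \frac{\sigma_x}{\sigma_y} - 2 > 0 \quad \text{or} \quad \frac{1}{\sigma_x \sigma_y}\left[\sigma_x^2 + \sigma_y^2 - 2\sigma_x \sigma_y\right] > 0$$

$$\text{or } \frac{1}{\sigma_x \sigma_y}\,[\sigma_x - \sigma_y]^2 > 0$$

which is true (with equality only when $\sigma_x = \sigma_y$). Proved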
Example 28. Find the regression line of y on x for the following data :
x 1 3 4 6 8 9 11 14
y 1 2 4 4 5 7 8 9
Estimate the value of y, when x = 10.
Solution.
S. No. x y xy x2
1 1 1 1 1
2 3 2 6 9
3 4 4 16 16
4 6 4 24 36
5 8 5 40 64
6 9 7 63 81
7 11 8 88 121
8 14 9 126 196
Total 56 40 364 524

Let y = a + bx be the line of regression of y on x, where a and b are given by the
following equations:
$\sum y = na + b\sum x$ or 40 = 8a + 56b ...(1)
$\sum xy = a\sum x + b\sum x^2$ or 364 = 56a + 524b ...(2)
On solving (1) and (2) we get,

$$a = \frac{6}{11} \quad \text{and} \quad b = \frac{7}{11}$$

The equation of the required line is

$$y = \frac{6}{11} + \frac{7}{11}\,x \quad \text{or} \quad 7x - 11y + 6 = 0 \quad \text{Ans.}$$

$$\text{If } x = 10, \quad y = \frac{6}{11} + \frac{7}{11}(10) = \frac{76}{11} = 6\frac{10}{11} \quad \text{Ans.}$$
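The normal equations of Example 28 can be solved exactly with rational arithmetic. This is a minimal sketch using Cramer's rule on the two equations; the function name `fit_line` is an illustrative choice, and integer data is assumed so that the fractions stay exact.

```python
from fractions import Fraction

def fit_line(xs, ys):
    """Solve the normal equations for y = a + b*x; returns exact fractions (a, b)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Normal equations:  sum_y  = n*a     + b*sum_x
    #                    sum_xy = a*sum_x + b*sum_x2
    determinant = n * sum_x2 - sum_x ** 2
    a = Fraction(sum_y * sum_x2 - sum_x * sum_xy, determinant)
    b = Fraction(n * sum_xy - sum_x * sum_y, determinant)
    return a, b

xs = [1, 3, 4, 6, 8, 9, 11, 14]
ys = [1, 2, 4, 4, 5, 7, 8, 9]

a, b = fit_line(xs, ys)
estimate_at_10 = a + b * 10
print(a, b, estimate_at_10)  # 6/11 7/11 76/11
```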
Example 29. In a study between the amount of rainfall and the quantity of air pollution
removed the following data were collected.
Daily Rainfall x (in 0.01 cm)      4.3   4.5   5.9   5.6   6.1   5.2   3.8   2.1
Pollution Removed y (mg/m³)       12.6  12.1  11.6  11.8  11.4  11.8  13.2  14.1

Find the regression line of y on x. (A.M.I.E., Summer 2000)


Solution.
S.N. x y xy x²
1 4.3 12.6 54.18 18.49
2 4.5 12.1 54.45 20.25
3 5.9 11.6 68.44 34.81
4 5.6 11.8 66.08 31.36
5 6.1 11.4 69.54 37.21
6 5.2 11.8 61.36 27.04
7 3.8 13.2 50.16 14.44
8 2.1 14.1 29.61 4.41
37.5 98.6 453.82 188.01

Let y = a + bx be the equation of the line of regression of y on x, where a and b are
given by the following equations.
$\sum y = na + b\sum x$ or 98.6 = 8a + 37.5b ...(1)
$\sum xy = a\sum x + b\sum x^2$ or 453.82 = 37.5a + 188.01b ...(2)
On solving (1) and (2), we get a = 15.53 and b = – 0.684.
The equation of the line of regression is y = 15.53 – 0.684x Ans.
Example 30. The following data regarding the heights (y) and the weights (x) of 100
college students are given:
$\sum x = 15000$, $\sum x^2 = 2272500$
$\sum y = 6800$, $\sum y^2 = 463025$
$\sum xy = 1022250$

Find the correlation coefficient between height and weight and state the equation of
regression of height on weight.

$$\text{Solution. } \bar{x} = \frac{\sum x}{n} = \frac{15000}{100} = 150, \qquad \bar{y} = \frac{\sum y}{n} = \frac{6800}{100} = 68$$

$$\sigma_x = \sqrt{\frac{\sum x^2}{n} - \left(\frac{\sum x}{n}\right)^2} = \sqrt{\frac{2272500}{100} - \left(\frac{15000}{100}\right)^2} = \sqrt{22725 - 22500} = \sqrt{225} = 15$$

$$\sigma_y = \sqrt{\frac{\sum y^2}{n} - \left(\frac{\sum y}{n}\right)^2} = \sqrt{\frac{463025}{100} - \left(\frac{6800}{100}\right)^2} = \sqrt{4630.25 - 4624} = \sqrt{6.25} = 2.5$$

$$r = \frac{\dfrac{\sum xy}{n} - \bar{x}\,\bar{y}}{\sigma_x \sigma_y} = \frac{\dfrac{1022250}{100} - (150)(68)}{15 \times 2.5} = \frac{10222.5 - 10200}{15 \times 2.5} = \frac{22.5}{15 \times 2.5} = \frac{1.5}{2.5} = 0.6$$

The regression equation of y on x is

$$y - \bar{y} = r\,\frac{\sigma_y}{\sigma_x}\,(x - \bar{x}), \qquad y - 68 = 0.6 \times \frac{2.5}{15}\,(x - 150)$$

$$y - 68 = \frac{1}{10}\,(x - 150) \quad \text{or} \quad 10y - 680 = x - 150$$

10y – x = 530 Ans.
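When only the sums are given, as in Example 30, the same quantities follow directly from them. A minimal sketch; the function name `summary_regression` and its argument order are assumptions made for illustration, and $\sum y^2$ is taken as 463025, the value used in the working.

```python
from math import sqrt

def summary_regression(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Return (r, slope, intercept) of the regression of y on x from summary sums."""
    mean_x = sum_x / n
    mean_y = sum_y / n
    sd_x = sqrt(sum_x2 / n - mean_x ** 2)
    sd_y = sqrt(sum_y2 / n - mean_y ** 2)
    r = (sum_xy / n - mean_x * mean_y) / (sd_x * sd_y)
    slope = r * sd_y / sd_x                  # b_yx = r * sigma_y / sigma_x
    intercept = mean_y - slope * mean_x      # the line passes through the means
    return r, slope, intercept

r_hw, slope_hw, intercept_hw = summary_regression(
    100, 15000, 6800, 2272500, 463025, 1022250
)
print(round(r_hw, 3), round(slope_hw, 3), round(intercept_hw, 3))  # 0.6 0.1 53.0
```

The line y = 53 + 0.1x is the same as 10y – x = 530.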
10.25 ERROR OF PREDICTION
The root-mean-square deviation of the predicted values from the observed values is known as the
standard error of prediction. It is given by

$$E_{yx} = \sqrt{\frac{\sum (y - y_r)^2}{n}}$$

where y is the actual value and $y_r$ the predicted value.
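Example 31. Prove that
(i) $E_{yx} = \sigma_y \sqrt{1 - r^2}$
(ii) $E_{xy} = \sigma_x \sqrt{1 - r^2}$
Solution. The equation of the line of regression of y on x is

$$y - \bar{y} = r\,\frac{\sigma_y}{\sigma_x}\,(x - \bar{x})$$

$$\therefore\ y_r = \bar{y} + r\,\frac{\sigma_y}{\sigma_x}\,(x - \bar{x})$$

$$\text{So, } E_{yx} = \left[\frac{\sum (y - y_r)^2}{n}\right]^{1/2} = \left[\frac{1}{n}\sum \left\{(y - \bar{y}) - r\,\frac{\sigma_y}{\sigma_x}\,(x - \bar{x})\right\}^2\right]^{1/2}$$

$$= \left[\frac{1}{n}\sum \left\{(y - \bar{y})^2 + r^2\,\frac{\sigma_y^2}{\sigma_x^2}\,(x - \bar{x})^2 - 2r\,\frac{\sigma_y}{\sigma_x}\,(x - \bar{x})(y - \bar{y})\right\}\right]^{1/2}$$

$$= \left[\frac{\sum (y - \bar{y})^2}{n} + r^2\,\frac{\sigma_y^2}{\sigma_x^2}\,\frac{\sum (x - \bar{x})^2}{n} - 2r\,\frac{\sigma_y}{\sigma_x}\,\frac{\sum (x - \bar{x})(y - \bar{y})}{n}\right]^{1/2}$$

$$= \left[\sigma_y^2 + r^2\,\frac{\sigma_y^2}{\sigma_x^2}\,\sigma_x^2 - 2r\,\frac{\sigma_y}{\sigma_x}\,r\,\sigma_x \sigma_y\right]^{1/2}$$

$$= \left[\sigma_y^2 + r^2 \sigma_y^2 - 2r^2 \sigma_y^2\right]^{1/2} = \left[\sigma_y^2 - r^2 \sigma_y^2\right]^{1/2} = \sigma_y \sqrt{1 - r^2} \qquad \text{Proved.}$$

(ii) Similarly (ii) may be proved.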
Example 32. Find the standard error of estimate of y on x for the data given below:
x 1 3 4 6 8 9 11 14
y 1 2 4 4 5 7 8 9

Solution. The equation of the line of regression of y on x is

$$y = \frac{7}{11}\,x + \frac{6}{11}. \quad \text{So } y_r = \frac{7x}{11} + \frac{6}{11} \quad \text{(See Example 28 on page 757)}$$

S. No.    x     y     y_r       y − y_r    (y − y_r)²
1         1     1     13/11     −2/11        4/121
2         3     2     27/11     −5/11       25/121
3         4     4     34/11     10/11      100/121
4         6     4     48/11     −4/11       16/121
5         8     5     62/11     −7/11       49/121
6         9     7     69/11      8/11       64/121
7        11     8     83/11      5/11       25/121
8        14     9    104/11     −5/11       25/121
                                     Σ (y − y_r)² = 308/121

$$E_{yx} = \sqrt{\frac{\sum (y - y_r)^2}{n}} = \sqrt{\frac{308}{121 \times 8}} = \sqrt{\frac{7}{22}} = 0.564 \quad \text{Ans.}$$
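The standard error of estimate in Example 32 can be reproduced directly from its definition. A minimal sketch, taking the fitted line y = 6/11 + (7/11)x from Example 28 as given; the function name is an illustrative choice.

```python
from math import sqrt

def standard_error_of_estimate(xs, ys, a, b):
    """Root-mean-square deviation of y from the fitted line y = a + b*x."""
    n = len(xs)
    residual_squares = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return sqrt(residual_squares / n)

xs = [1, 3, 4, 6, 8, 9, 11, 14]
ys = [1, 2, 4, 4, 5, 7, 8, 9]

e_yx = standard_error_of_estimate(xs, ys, 6 / 11, 7 / 11)
print(round(e_yx, 3))  # 0.564
```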
Exercise 10.2
1. Find the coefficient of correlation between x and y from the table of their values :
x 1 3 4 6 8 9 11 14
y 1 2 4 4 5 7 8 9
Ans. 0.977.

2. Find the coefficient of correlation of the following data taking new origin of x at 70 and for y at
67.

x 67 68 64 68 72 70 69 70
y 65 66 67 67 68 69 71 73
(AMIE winter 2002 ) Ans. 0.472
3. x and y are two random variables with the same standard deviation and correlation coefficient r. Show
that the coefficient of correlation between x and x + y is $\sqrt{\dfrac{1 + r}{2}}$.
4. Find the regression line of y on x for the data :
x 1 4 2 3 5
y 3 1 2 5 4
Ans. y = 2.7 + 0.1x
5. Find the correlation coefficient and the equations of regression lines from the following data :
x 1 2 3 4 5
y 2 5 3 8 7
Ans. r = 0.81, x = 0.5y + 0.5, y = 1.3x + 1.1
6. Find the regression line of y on x if
x 40 70 50 60 80 50 90 40 60 60
y 2.5 6.0 4.5 5.0 4.5 2.0 5.5 3.0 4.5 3.0
Ans. y = 0.55 + 0.0583 x
7. The following marks have been obtained by a class of students in statistics.
Paper I 80 45 55 56 58 60 65 68 70 75 85
Paper II 81 56 50 48 60 62 64 65 70 74 90
Compute the coefficient of correlation for the above data. Find the lines of regression.
Ans. r = .918, y – 65.45 = 0.981 (x – 65.18)
x – 65.18 = 0.859 (y – 65.45)
8. Find the equations to the lines of regression and the coefficient of correlation for the following data:
x 2 4 5 6 8 11
y 18 12 10 8 7 5
Ans. y – 10 = – 1.34 (x – 6), x – 6 = – 0.632 (y – 10), r = – 0.92
9. Obtain normal equations for fitting a curve of the form $y = ax + \dfrac{b}{x}$
for n points $(x_r, y_r)$, r = 1, 2, ... n.
Ans. $\sum xy = nb + a\sum x^2$, $\sum \dfrac{y}{x} = na + b\sum \dfrac{1}{x^2}$
10. The following results were obtained from lineups in Applied Mechanics and Engineering Mathematics
in an examination :
Applied Mechanics Engg. Maths.
(x) (y)
Mean 47.5 10.5
Standard deviation 16.8 10.8
r  0.95
Find both the regression equations. Also estimate the value of y for x = 30.
Ans. y  0.611x  10.5, x  1.478 y  1.143, y  28.83
11. The following results were obtained from records of age (x) and systolic blood pressure (y) of a group
of 10 men :
