
Module III: Syllabus

Correlation and regression (6 hours)

Correlation (Discrete Data) – Scatter diagram – Karl Pearson’s Correlation coefficient – Spearman’s Rank Correlation – Regression lines (Discrete Data).

CORRELATION AND REGRESSION

Correlation

Correlation refers to the study of the relationship between two or more variables.

Let X and Y measure some characteristics of a particular system. If X and Y vary in such a way
that a change in one variable corresponds to a change in the other variable, then the variables X and Y
are correlated.

Types of correlation: Positive and negative

If an increase in one variable is accompanied by an increase in the other variable, then the variables
are said to be positively correlated.
If an increase in one variable is accompanied by a decrease in the other variable, then the variables
are said to be negatively correlated.
Methods of studying correlation:
(i) Scatter diagram method
(ii) Karl Pearson’s correlation coefficient
(iii) Spearman’s rank correlation coefficient

Scatter diagram method

The given data are plotted on a graph as dots, i.e., one dot for each pair of X and Y values. By
looking at the scatter of the points, we form an idea as to whether the two variables are
related or not. The more the plotted points scatter over the chart, the lower the degree of
relationship between the two variables. The nearer the points come to a line, the stronger the
relationship. If the points lie in a haphazard manner, it shows the absence of any relationship
between the variables.
[Figure: three scatter diagrams showing positive correlation, negative correlation, and no correlation]

Positive Correlations in Science

• As the number of trees cut down increases, the probability of erosion increases.
• In archaeology, a more stable landform means more site visibility.
• As the temperature decreases, the speed at which molecules move decreases.
• As the speed of a wind turbine increases, the amount of electricity that is generated
increases.
• As the amount of moisture increases in an environment, the growth of mold spores
increases.
• As the amount of algae in a lake increases, the population of a certain species of algae-eating fish increases.
• As the percentage of salt in salty water increases, buoyancy increases.
• As you eat more antioxidants, your immune system improves.
Negative Correlation
• COX activity decreases with larger body sizes

No correlation
• Between height and IQ

Levels of correlation
[Figure: example of negatively correlated variables]

Correlation and causation

Correlation is a statistical measure (expressed as a number) that describes the size and direction of
a relationship between two or more variables. A correlation between variables, however, does not
automatically mean that the change in one variable is the cause of the change in the values of the
other variable.

Causation indicates that one event is the result of the occurrence of the other event; i.e. there is a
causal relationship between the two events. This is also referred to as cause and effect.

Theoretically, the difference between the two types of relationships is easy to identify: an
action or occurrence can cause another (e.g. smoking causes an increase in the risk of developing
lung cancer), or it can correlate with another (e.g. smoking is correlated with alcoholism, but it
does not cause alcoholism). In practice, however, it is much harder to establish cause and effect
than to establish correlation.
Karl Pearson’s correlation coefficient
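This gives us a measure of correlation which indicates the degree of correlation in quantitative
terms. It is defined as
\[
r(x, y) = r_{xy} = \frac{\operatorname{Cov}(X, Y)}{\sigma_x \sigma_y}
= \frac{\frac{1}{n}\sum (x - \bar{x})(y - \bar{y})}{\sigma_x \sigma_y}
= \frac{\sum (x - \bar{x})(y - \bar{y})}{n\,\sigma_x \sigma_y}
\]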

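Note:
$\frac{1}{n}\sum (x - \bar{x})(y - \bar{y})$ is called the covariance between X and Y, written Cov(X, Y).

\[
r_{xy} = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}
\]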

Properties of correlation coefficient


(i) The correlation coefficient lies between -1 and +1, i.e. $|r| \le 1$.

Note: If r = 1, there is perfect positive correlation


If r = -1, there is perfect negative correlation
If r = 0, the variables are uncorrelated.

(ii) The correlation coefficient is independent of change of origin and scale of the variables X
and Y, i.e., if $U = \frac{X - a}{h}$, $V = \frac{Y - b}{k}$, where a, b, h, k are constants with h > 0, k > 0, then
r(X, Y) = r(U, V).
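Property (ii) can be checked numerically. The sketch below is only an illustration: the data and the constants a, h, b, k are assumed values chosen for the demonstration, and `pearson_r` is a helper name introduced here, not something defined in the notes.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r via the raw-sum formula given in the notes."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

# Illustrative data, assumed for this sketch (not taken from the notes).
x = [10, 20, 30, 40, 50]
y = [12, 25, 31, 48, 52]

# Change of origin and scale: U = (X - a)/h, V = (Y - b)/k with h, k > 0.
a, h = 30, 10
b, k = 25, 2
u = [(xi - a) / h for xi in x]
v = [(yi - b) / k for yi in y]

r_xy = pearson_r(x, y)
r_uv = pearson_r(u, v)
print(round(r_xy, 6), round(r_uv, 6))  # the two values agree
```

In hand computation this property is what allows large values to be shifted and scaled to small ones before the sums are formed.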
Examples
1. Find Karl Pearson’s correlation coefficient for the following heights in inches of fathers
(x) and their sons (y):
X: 65 66 67 67 68 69 70 72
Y: 67 68 65 68 72 72 69 71
Ans:
 x     y     x²      y²      xy
65    67    4225    4489    4355
66    68    4356    4624    4488
67    65    4489    4225    4355
67    68    4489    4624    4556
68    72    4624    5184    4896
69    72    4761    5184    4968
70    69    4900    4761    4830
72    71    5184    5041    5112

n = 8, $\sum x = 544$, $\sum y = 552$, $\sum x^2 = 37028$, $\sum y^2 = 38132$, $\sum xy = 37560$

\[
r_{xy} = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}
= \frac{(8 \times 37560) - (544 \times 552)}{\sqrt{(8 \times 37028) - (544)^2}\,\sqrt{(8 \times 38132) - (552)^2}}
= \frac{192}{\sqrt{288}\,\sqrt{352}} = 0.603
\]
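The column totals and the final value in this example can be reproduced with a few lines of Python. This is a minimal check of the arithmetic using only the standard library; the variable names are chosen here for illustration.

```python
from math import sqrt

x = [65, 66, 67, 67, 68, 69, 70, 72]  # fathers' heights (inches)
y = [67, 68, 65, 68, 72, 72, 69, 71]  # sons' heights (inches)

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v * v for v in x)
sum_y2 = sum(v * v for v in y)
sum_xy = sum(a * b for a, b in zip(x, y))

# Raw-sum form of Karl Pearson's coefficient.
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
r = numerator / denominator

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)
print(round(r, 3))
```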

2. Calculate Karl Pearson’s Coefficient of correlation


X: 25 30 28 29 32 24 36 28 27 21
Y: 18 20 21 16 14 13 22 15 19 12

Ans:
 x     y     x²      y²     xy
25    18     625    324    450
30    20     900    400    600
28    21     784    441    588
29    16     841    256    464
32    14    1024    196    448
24    13     576    169    312
36    22    1296    484    792
28    15     784    225    420
27    19     729    361    513
21    12     441    144    252

n = 10, $\sum x = 280$, $\sum y = 170$, $\sum x^2 = 8000$, $\sum y^2 = 3000$, $\sum xy = 4839$

\[
r_{xy} = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}
= \frac{(10 \times 4839) - (280 \times 170)}{\sqrt{(10 \times 8000) - (280)^2}\,\sqrt{(10 \times 3000) - (170)^2}}
= \frac{790}{\sqrt{1600}\,\sqrt{1100}} = 0.5955
\]

3. A computer, while calculating $r_{xy}$ from 25 pairs of observations, obtained the following
constants: n = 25, $\sum x = 125$, $\sum x^2 = 650$, $\sum y = 100$, $\sum y^2 = 460$, $\sum xy = 508$. A recheck
showed that two pairs of values, (6, 14) and (8, 6), were wrong, while the correct values were (8, 12) and
(6, 8). Obtain the correct value of the correlation coefficient.
Ans:
Correct value of $\sum x$ = 125 - 6 - 8 + 8 + 6 = 125

Correct value of $\sum y$ = 100 - 14 - 6 + 12 + 8 = 100

Correct value of $\sum x^2$ = 650 - 36 - 64 + 64 + 36 = 650

Correct value of $\sum y^2$ = 460 - 196 - 36 + 144 + 64 = 436

Correct value of $\sum xy$ = 508 - 84 - 48 + 96 + 48 = 520

The correct correlation coefficient is
\[
r_{xy} = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}
= \frac{(25 \times 520) - (125 \times 100)}{\sqrt{(25 \times 650) - 125^2}\,\sqrt{(25 \times 436) - 100^2}}
= 0.6667
\]
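The correction procedure in this example (subtract the contribution of each wrong pair from every sum, then add the contribution of each correct pair) can be sketched in Python. The dictionary keys and the helper `apply_pairs` are names assumed for this illustration.

```python
from math import sqrt

n = 25
# Sums as first reported by the computation (from the example).
sums = {"x": 125, "y": 100, "x2": 650, "y2": 460, "xy": 508}

wrong_pairs = [(6, 14), (8, 6)]
correct_pairs = [(8, 12), (6, 8)]

def apply_pairs(totals, pairs, sign):
    """Add (sign=+1) or remove (sign=-1) each pair's contribution to the sums."""
    for x, y in pairs:
        totals["x"] += sign * x
        totals["y"] += sign * y
        totals["x2"] += sign * x * x
        totals["y2"] += sign * y * y
        totals["xy"] += sign * x * y

apply_pairs(sums, wrong_pairs, -1)    # take out the wrong observations
apply_pairs(sums, correct_pairs, +1)  # put in the correct ones

r = (n * sums["xy"] - sums["x"] * sums["y"]) / (
    sqrt(n * sums["x2"] - sums["x"] ** 2) * sqrt(n * sums["y2"] - sums["y"] ** 2)
)
print(sums)
print(round(r, 4))
```

Note that n does not change, because each wrong pair is replaced by a correct one rather than dropped.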

Rank Correlation

Sometimes we have to deal with problems in which data cannot be quantitatively measured but
qualitative measurement is possible. Here, we give ranks to the values in each series separately
and calculate Spearman’s rank correlation coefficient as
\[
\rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)}
\]
where d is the difference between the ranks of paired items in the two series.

Rank correlation coefficient varies between -1 and +1.


Note: 1. $\sum d$ will always be equal to 0.

2. Spearman’s rank correlation coefficient and Karl Pearson’s correlation coefficient for
a given data, are usually different.
3. Spearman’s rank correlation coefficient has the same value as Karl Pearson’s correlation
coefficient between the ranks.
Repeated ranks
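If two or more individuals have the same value in a series, then each individual is given the
average of the ranks they would otherwise occupy. The rank correlation coefficient is then
\[
\rho = 1 - \frac{6\left[\sum d^2 + \frac{1}{12}\left(m^3 - m\right) + \frac{1}{12}\left(m^3 - m\right) + \dots\right]}{n(n^2 - 1)}
\]
where m is the number of items whose ranks are equal. One term $\frac{1}{12}(m^3 - m)$ is added for each
group of tied values, in either series.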

Examples
1. Calculate the rank correlation coefficient between marks in the selection test (X) and the
proficiency test (Y) of 9 recruits.
Sl.No. 1 2 3 4 5 6 7 8 9
X: 10 15 12 17 13 16 24 14 22
Y: 30 42 45 46 33 34 40 35 39
Ans:
n = 9.

 x     y    u = Rank in x   v = Rank in y   d = u - v   d²
10    30         9               9              0        0
15    42         5               3              2        4
12    45         8               2              6       36
17    46         3               1              2        4
13    33         7               8             -1        1
16    34         4               7             -3        9
24    40         1               4             -3        9
14    35         6               6              0        0
22    39         2               5             -3        9

$\sum d^2 = 72$

\[
\rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 72}{9(81 - 1)} = 0.4
\]
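The ranking step and the formula in this example can be sketched as follows. The helper `ranks_desc` is a name introduced here; it gives rank 1 to the largest value and assumes there are no ties, as is the case for this data.

```python
def ranks_desc(values):
    """Rank 1 for the largest value; valid only when all values are distinct."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

x = [10, 15, 12, 17, 13, 16, 24, 14, 22]  # selection test marks
y = [30, 42, 45, 46, 33, 34, 40, 35, 39]  # proficiency test marks

u = ranks_desc(x)
v = ranks_desc(y)
d = [ui - vi for ui, vi in zip(u, v)]

n = len(x)
sum_d2 = sum(di * di for di in d)
rho = 1 - 6 * sum_d2 / (n * (n * n - 1))

# sum(d) should be 0, which is a quick check on the ranking.
print(sum(d), sum_d2, round(rho, 3))
```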

2. Ten competitors in a music competition are ranked by three judges in the following order:
Competitor: 1 2 3 4 5 6 7 8 9 10
Judge A: 1 6 5 10 3 2 4 9 7 8
Judge B: 3 5 8 4 7 10 2 1 6 9
Judge C: 6 4 9 8 1 2 3 10 5 7
Using rank correlation coefficient, determine which pair of judges have common taste in music.
Ans:

Let x, y, z be the ranks given by judges A, B, C respectively.

 x     y     z    d1² = (x-y)²   d2² = (y-z)²   d3² = (x-z)²
 1     3     6         4              9             25
 6     5     4         1              1              4
 5     8     9         9              1             16
10     4     8        36             16              4
 3     7     1        16             36              4
 2    10     2        64             64              0
 4     2     3         4              1              1
 9     1    10        64             81              1
 7     6     5         1              1              4
 8     9     7         1              4              1

$\sum d_1^2 = 200$, $\sum d_2^2 = 214$, $\sum d_3^2 = 60$

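n = 10

The rank correlation coefficient between Judge A and Judge B is
\[
\rho_1 = 1 - \frac{6\sum d_1^2}{n(n^2 - 1)} = 1 - \frac{6 \times 200}{10(100 - 1)} = -0.212
\]

The rank correlation coefficient between Judge B and Judge C is
\[
\rho_2 = 1 - \frac{6\sum d_2^2}{n(n^2 - 1)} = 1 - \frac{6 \times 214}{10(100 - 1)} = -0.297
\]

The rank correlation coefficient between Judge A and Judge C is
\[
\rho_3 = 1 - \frac{6\sum d_3^2}{n(n^2 - 1)} = 1 - \frac{6 \times 60}{10(100 - 1)} = 0.636
\]

Since $\rho_3$ is the largest of the three coefficients (and the only positive one), Judges A and C
have the nearest approach to common taste in music.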
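The pairwise comparison of the judges can be sketched in Python. The function name `spearman_from_ranks` and the dictionary layout are assumptions made for this illustration; the pair with the largest coefficient is taken as the pair with the closest taste.

```python
from itertools import combinations

ranks = {
    "A": [1, 6, 5, 10, 3, 2, 4, 9, 7, 8],
    "B": [3, 5, 8, 4, 7, 10, 2, 1, 6, 9],
    "C": [6, 4, 9, 8, 1, 2, 3, 10, 5, 7],
}

def spearman_from_ranks(r1, r2):
    """Spearman's rho for two rankings without ties."""
    n = len(r1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

# One coefficient per pair of judges: (A, B), (A, C), (B, C).
rho = {
    (p, q): spearman_from_ranks(ranks[p], ranks[q])
    for p, q in combinations(sorted(ranks), 2)
}
closest_pair = max(rho, key=rho.get)

for pair, value in rho.items():
    print(pair, round(value, 3))
print("closest:", closest_pair)
```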

3. Calculate the rank coefficient of correlation for the following data:


X: 68 64 75 50 64 80 75 40 55 64
Y: 62 58 68 45 81 60 68 48 50 70
 x     y     R1     R2    d = R1 - R2    d²
68    62     4      5         -1          1
64    58     6      7         -1          1
75    68     2.5    3.5       -1          1
50    45     9     10         -1          1
64    81     6      1          5         25
80    60     1      6         -5         25
75    68     2.5    3.5       -1          1
40    48    10      9          1          1
55    50     8      8          0          0
64    70     6      2          4         16

$\sum d^2 = 72$

In the X series,
75 is repeated 2 times (m = 2)
64 is repeated 3 times (m = 3)
In the Y series,
68 is repeated 2 times (m = 2)

\[
\rho = 1 - \frac{6\left[\sum d^2 + \frac{1}{12}\left(m^3 - m\right) + \frac{1}{12}\left(m^3 - m\right) + \dots\right]}{n(n^2 - 1)}
= 1 - \frac{6\left[72 + \frac{1}{12}\left(2^3 - 2\right) + \frac{1}{12}\left(3^3 - 3\right) + \frac{1}{12}\left(2^3 - 2\right)\right]}{10(10^2 - 1)}
= 0.545
\]
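The handling of repeated ranks can be sketched in Python as below. The helpers `average_ranks` and `tie_correction` are names assumed for this illustration; the first assigns tied values the average of the ranks they occupy, and the second sums (m³ - m)/12 over every group of equal values in a series.

```python
from collections import Counter

def average_ranks(values):
    """Rank 1 for the largest value; tied values share the average rank."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 + (order.count(v) - 1) / 2 for v in values]

def tie_correction(values):
    """Sum of (m^3 - m)/12 over each group of m equal values (0 when m = 1)."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values())

x = [68, 64, 75, 50, 64, 80, 75, 40, 55, 64]
y = [62, 58, 68, 45, 81, 60, 68, 48, 50, 70]

r1, r2 = average_ranks(x), average_ranks(y)
n = len(x)
sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
correction = tie_correction(x) + tie_correction(y)
rho = 1 - 6 * (sum_d2 + correction) / (n * (n * n - 1))

print(sum_d2, correction, round(rho, 3))
```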

Exercises:
1. Calculate Karl Pearson’s Coefficient of correlation between price and supply of a commodity
from the following data:
Price (Rs.): 17 18 19 20 21 22 23 24 25 26
Supply (Kg): 38 37 38 33 32 33 34 29 26 23
2. Compute the coefficient of correlation between the corresponding values of x and y in the
following table:
X: 2 4 5 6 8 11
Y: 18 12 10 8 7 5
3. Calculate Karl Pearson’s correlation coefficient from the data:
Roll No. 1 2 3 4 5 6 7 8 9 10
Marks in Economics: 78 36 98 25 75 82 90 62 65 39
Marks in Maths: 84 51 91 60 68 62 86 58 53 47
4. Calculate the coefficient of correlation from the following data:
X: 9 8 7 6 5 4 3 2 1
Y: 15 16 14 13 11 12 10 8 9
5. Calculate the correlation coefficient between infant gestational age and birth weight from the
following table:
Infant ID:       1      2      3      4      5      6      7      8      9     10     11
Gest. age:    34.7   36.0   29.3   40.1   35.7   42.4   40.3   37.3   40.9   38.3   38.5
Birth weight: 1895   2030   1440   2835   3090   3827   3260   2690   3285   2920   3430

Infant ID:      12     13     14     15     16     17
Gest. age:    41.4   39.7   39.7   41.1   38.0   38.7
Birth weight: 3657   3685   3345   3260   2680   2005

6. Calculate the coefficient of rank correlation from the following data:


X: 48 33 40 9 16 16 65 24 16 57
Y: 13 13 24 6 15 4 20 9 6 19
7. The ranking of 10 students in two subjects, maths and physics, are as follows:
Maths: 3 5 8 4 7 10 2 1 6 9
Physics: 6 4 9 8 1 2 3 10 5 7
Find the correlation coefficient.

8. The coefficient of rank correlation between the debenture prices and share prices of a company
was + 0.8. If the sum of the squares of the difference in ranks was 33, find the value of n.
9. If covariance between X and Y is 10 and the variance of X and Y are respectively 16 and 9, find
the coefficient of correlation
10. Calculate Karl Pearson’s correlation between X and Y from the following data:
N = 13, $\sum X = 117$, $\sum X^2 = 1313$, $\sum Y = 260$, $\sum Y^2 = 6580$, $\sum XY = 2827$
11. In two sets of variables X and Y with 50 observations each, the following data were observed:
Mean of X = 10, S.D. of X = 3
Mean of Y = 6, S.D. of Y = 2
Coefficient of correlation between X and Y is 0.3. However, on subsequent verification, it
was found that one value of X (=10) and Y (=6) were inaccurate and hence weeded out. With
the remaining 49 pairs of values, how is the original value of correlation coefficient affected?

Regression

Regression is the measure of the average relationship between two or more variables in
terms of the original units of the data. It provides a mechanism for predicting or
forecasting.

If two variables X and Y are correlated, we see that the scatter diagram will be more or less
concentrated around a curve, called the curve of regression. If this curve is a straight line,
then it is called line of regression.

When there is a reasonable amount of scatter, we can draw two different regression
lines depending upon which variable we consider to be the more accurate: the regression
line of Y on X and the regression line of X on Y.
The regression line of Y on X gives the most probable value of Y for given values of X.
The regression line of X on Y gives the most probable value of X for given values of Y.
The equation of the line of regression of Y on X is
\[
y - \bar{y} = \frac{r\sigma_y}{\sigma_x}(x - \bar{x})
\]
where $b_{yx} = \frac{r\sigma_y}{\sigma_x}$ is the regression coefficient of y on x.

The equation of the line of regression of X on Y is
\[
x - \bar{x} = \frac{r\sigma_x}{\sigma_y}(y - \bar{y})
\]
where $b_{xy} = \frac{r\sigma_x}{\sigma_y}$ is the regression coefficient of x on y.
Note:
1. A regression equation allows us to express the relationship
between two (or more) variables algebraically. It indicates the
nature of the relationship between two (or more) variables. In
particular, it indicates the extent to which you can predict some
variables by knowing others, or the extent to which some are
associated with others.
2. A regression line is a line drawn through the points on a scatterplot
to summarize the relationship between the variables being studied.
When it slopes down (from top left to bottom right), this indicates a
negative or inverse relationship between the variables; when it
slopes up (from bottom left to top right), a positive or direct
relationship is indicated.
3. In terms of raw sums, the regression coefficients are
\[
b_{yx} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}
\]
\[
b_{xy} = \frac{n\sum xy - \sum x \sum y}{n\sum y^2 - \left(\sum y\right)^2}
\]
4. Both the regression lines pass through the point $(\bar{x}, \bar{y})$. Hence, by solving the two
regression equations, we can find the means of X and Y.
5. Both the regression coefficients will have the same sign; either both will be positive or both
will be negative.
6. Correlation coefficient is the geometric mean between the regression coefficients,
i.e., $r_{xy} = \pm\sqrt{b_{yx} \times b_{xy}}$

If both the regression coefficients are positive, r will be positive; if both the regression
coefficients are negative, r will be negative.
7. Regression coefficients are independent of the change of origin, but not of scale.

Angle between regression lines

If $\theta$ is the angle between the two regression lines, then
\[
\tan\theta = \frac{1 - r^2}{r} \cdot \frac{\sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2}
\]
Note:

1. When r = 0, $\tan\theta = \infty$, $\theta = \frac{\pi}{2}$, i.e., the two regression lines are perpendicular
to each other. Their equations are $y = \bar{y}$ and $x = \bar{x}$.

2. If $r = \pm 1$, then $\tan\theta = 0$, $\theta = 0$ or $\pi$, i.e., the two regression lines coincide. They
cannot be parallel since they have a common point $(\bar{x}, \bar{y})$.
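The expression for $\tan\theta$ is stated in these notes without proof. One standard way to obtain it, sketched here under the assumption that both lines are written with y as a function of x, uses the formula for the angle between two lines with slopes $m_1$ and $m_2$:

```latex
% Slope of the line of Y on X:            m_1 = r \sigma_y / \sigma_x
% Slope of the line of X on Y (as dy/dx): m_2 = \sigma_y / (r \sigma_x)
\begin{align*}
\tan\theta &= \frac{m_2 - m_1}{1 + m_1 m_2}
  = \frac{\dfrac{\sigma_y}{r\sigma_x} - \dfrac{r\sigma_y}{\sigma_x}}
         {1 + \dfrac{\sigma_y^2}{\sigma_x^2}} \\[4pt]
 &= \frac{\sigma_y (1 - r^2)}{r\,\sigma_x} \cdot \frac{\sigma_x^2}{\sigma_x^2 + \sigma_y^2}
  = \frac{1 - r^2}{r} \cdot \frac{\sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2}
\end{align*}
```

Setting r = 0 makes the right-hand side unbounded, and setting $r = \pm 1$ makes it zero, which matches the two notes above.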

Examples

1. Find the correlation coefficient and the equations of the regression lines for the
following data:

X: 1 2 3 4 5
Y: 2 5 3 8 7

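Ans:
   x     y     x²     y²    xy
   1     2      1      4     2
   2     5      4     25    10
   3     3      9      9     9
   4     8     16     64    32
   5     7     25     49    35
Total:
  15    25     55    151    88

n = 5, $\sum x = 15$, $\sum y = 25$, $\sum x^2 = 55$, $\sum y^2 = 151$, $\sum xy = 88$, so $\bar{x} = 3$ and $\bar{y} = 5$.

\[
r_{xy} = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} = \frac{65}{\sqrt{50}\,\sqrt{130}} = 0.8062
\]
\[
b_{yx} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} = \frac{65}{50} = 1.3
\]
\[
b_{xy} = \frac{n\sum xy - \sum x \sum y}{n\sum y^2 - \left(\sum y\right)^2} = \frac{65}{130} = 0.5
\]

The equation of the line of regression of Y on X is $y - \bar{y} = \frac{r\sigma_y}{\sigma_x}(x - \bar{x})$:
y - 5 = 1.3 (x - 3)
y = 1.3 x + 1.1

The equation of the line of regression of X on Y is $x - \bar{x} = \frac{r\sigma_x}{\sigma_y}(y - \bar{y})$:
x - 3 = 0.5 (y - 5)
x = 0.5 y + 0.5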
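The two regression lines of this example can be reproduced with the raw-sum formulas for the regression coefficients. The sketch below also checks two of the properties listed earlier: the intercepts follow from both lines passing through the means, and r is the geometric mean of the coefficients. Variable names such as `a_yx` and `a_xy` are chosen here for illustration.

```python
from math import sqrt

x = [1, 2, 3, 4, 5]
y = [2, 5, 3, 8, 7]

n = len(x)
sx, sy = sum(x), sum(y)
sxx = sum(v * v for v in x)
syy = sum(v * v for v in y)
sxy = sum(a * b for a, b in zip(x, y))

cov_term = n * sxy - sx * sy
b_yx = cov_term / (n * sxx - sx ** 2)  # regression coefficient of y on x
b_xy = cov_term / (n * syy - sy ** 2)  # regression coefficient of x on y
x_bar, y_bar = sx / n, sy / n

# Intercepts follow from each line passing through (x_bar, y_bar).
a_yx = y_bar - b_yx * x_bar  # line of Y on X: y = b_yx * x + a_yx
a_xy = x_bar - b_xy * y_bar  # line of X on Y: x = b_xy * y + a_xy

# r is the geometric mean of the coefficients, with their common sign.
r = sqrt(b_yx * b_xy) if b_yx >= 0 else -sqrt(b_yx * b_xy)

print(b_yx, b_xy, round(a_yx, 4), round(a_xy, 4), round(r, 4))
```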

2. Marks obtained by 10 students in Mathematics (x) and Statistics (y) are given below:
X: 60 34 40 50 45 40 22 43 42 64
Y: 75 32 33 40 45 33 12 30 34 51
Find the two regression lines. Also find y when x = 55
Ans:
   x      y      x²      y²      xy
  60     75    3600    5625    4500
  34     32    1156    1024    1088
  40     33    1600    1089    1320
  50     40    2500    1600    2000
  45     45    2025    2025    2025
  40     33    1600    1089    1320
  22     12     484     144     264
  43     30    1849     900    1290
  42     34    1764    1156    1428
  64     51    4096    2601    3264
Total:
 440    385   20674   17253   18499

n = 10

\[
\bar{x} = \frac{\sum x}{n} = \frac{440}{10} = 44, \qquad \bar{y} = \frac{\sum y}{n} = \frac{385}{10} = 38.5
\]
\[
b_{yx} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} = \frac{15590}{13140} = 1.1865
\]
\[
b_{xy} = \frac{n\sum xy - \sum x \sum y}{n\sum y^2 - \left(\sum y\right)^2} = \frac{15590}{24305} = 0.6414
\]

The equation of the line of regression of Y on X is $y - \bar{y} = \frac{r\sigma_y}{\sigma_x}(x - \bar{x})$:
y - 38.5 = 1.1865 (x - 44)
y = 1.1865 x - 13.706

When x = 55, y = (1.1865 × 55) - 13.706 = 51.55

The equation of the line of regression of X on Y is $x - \bar{x} = \frac{r\sigma_x}{\sigma_y}(y - \bar{y})$:
x - 44 = 0.6414 (y - 38.5)
x = 0.6414 y + 19.3061
3. For the following data, find the most likely price at Madras corresponding to the price 70
at Bombay, and that at Bombay corresponding to the price 68 at Madras.
                  Madras   Bombay
Average price       65       67
S.D. of price       0.5      3.5
The S.D. of the difference between the prices at Madras and Bombay is 3.1.

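Ans:
Let X denote the price at Madras and Y denote the price at Bombay.
Given $\bar{x} = 65$, $\bar{y} = 67$, $\sigma_x = 0.5$, $\sigma_y = 3.5$, $\sigma_{x-y} = 3.1$

\[
r_{xy} = \frac{\sigma_x^2 + \sigma_y^2 - \sigma_{x-y}^2}{2\sigma_x \sigma_y} = \frac{(0.5)^2 + (3.5)^2 - (3.1)^2}{2(0.5)(3.5)} = 0.83
\]

$b_{yx} = \frac{r\sigma_y}{\sigma_x} = \frac{0.83 \times 3.5}{0.5} = 5.81$

$b_{xy} = \frac{r\sigma_x}{\sigma_y} = \frac{0.83 \times 0.5}{3.5} = 0.12$

The regression line of Y on X is y - 67 = 5.81 (x - 65)

When x = 68, we get y = 84.43, the most likely price at Bombay.

The regression line of X on Y is x - 65 = 0.12 (y - 67)

When y = 70, we get x = 65.36, the most likely price at Madras.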
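The formula for $r_{xy}$ used in this example is not derived in the notes. It follows from the standard expression for the variance of a difference, using $\operatorname{Cov}(X, Y) = r\,\sigma_x \sigma_y$ from the definition of Karl Pearson's coefficient:

```latex
\begin{align*}
\sigma_{x-y}^2 &= \operatorname{Var}(X - Y)
  = \sigma_x^2 + \sigma_y^2 - 2\operatorname{Cov}(X, Y)
  = \sigma_x^2 + \sigma_y^2 - 2 r\,\sigma_x \sigma_y \\[4pt]
\Rightarrow\quad r &= \frac{\sigma_x^2 + \sigma_y^2 - \sigma_{x-y}^2}{2\,\sigma_x \sigma_y}
  = \frac{0.25 + 12.25 - 9.61}{3.5} = \frac{2.89}{3.5} \approx 0.83
\end{align*}
```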

4. In a partially destroyed laboratory record of an analysis of correlation data, only the following
results are legible.
Variance of X = 9
Regression equations are 8x - 10y + 66 = 0
                         40x - 18y = 214
Find
(a) The mean values of X and Y
(b) The standard deviation of Y
(c) The coefficient of correlation between X and Y

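Solution:
(a) Both the regression lines pass through the point $(\bar{x}, \bar{y})$, so
$8\bar{x} - 10\bar{y} + 66 = 0$
$40\bar{x} - 18\bar{y} - 214 = 0$

Solving the equations, we get
$\bar{x} = 13$, $\bar{y} = 17$

(c) The equations of the regression lines can be written as
y = 0.8x + 6.6, x = 0.45y + 5.35
Hence the regression coefficient of Y on X is
$b_{yx} = \frac{r\sigma_y}{\sigma_x} = 0.8$
The regression coefficient of X on Y is
$b_{xy} = \frac{r\sigma_x}{\sigma_y} = 0.45$
Multiplying, we get $\frac{r\sigma_y}{\sigma_x} \times \frac{r\sigma_x}{\sigma_y} = r^2 = 0.8 \times 0.45 = 0.36$
Since both regression coefficients are positive, r = 0.6.

(b) Given Var(X) = 9, $\sigma_x = 3$. From $b_{yx} = \frac{r\sigma_y}{\sigma_x} = 0.8$, we get
$\sigma_y = \frac{0.8 \times 3}{0.6} = 4$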

5. Calculate the correlation coefficient from the following data: N = 10,
$\sum X = 350$, $\sum Y = 310$, $\sum (X - 35)^2 = 162$, $\sum (Y - 31)^2 = 222$, $\sum (X - 35)(Y - 31) = 92$

Also find the regression line of Y on X.


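Ans: Given:
$\sum X = 350$, $\sum Y = 310$, $\sum (X - 35)^2 = 162$, $\sum (Y - 31)^2 = 222$, $\sum (X - 35)(Y - 31) = 92$

$\bar{x} = \frac{350}{10} = 35$
$\bar{y} = \frac{310}{10} = 31$

\[
r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2}\,\sqrt{\sum (y - \bar{y})^2}} = \frac{92}{\sqrt{162}\,\sqrt{222}} = 0.485
\]
\[
b_{yx} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{92}{162} = 0.568
\]

Regression line of Y on X is y - 31 = 0.568 (x - 35), i.e. y = 0.568x + 11.12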

Exercise:
1. From the following data, obtain the two regression equations
Sales: 91 97 108 121 67 124 51 73 111 57
Purchase: 71 75 69 97 70 91 39 61 80 47
2. Two variables gave the following data: $\bar{x} = 20$, $\bar{y} = 15$, $\sigma_x = 4$, $\sigma_y = 3$, r = +0.7. Obtain
the two regression equations and find the most likely value of Y when X = 24.

3. You are given the following information about advertising and sales:

          Adv. Expenses (x)   Sales (y)
          (Rs. lakhs)         (Rs. lakhs)
Mean           10                90
S.D.            3                12

Correlation coefficient is 0.8

(a) Find the two regression lines.
(b) Find the likely sales when advertisement expenditure is Rs. 15 lakhs.
(c) What should be the advertisement expenditure if the company wants to attain sales
target of Rs. 120 lakhs?

4. The equations of the two lines of regression for a bivariate data are Y = 10(X – 5) and
X = 2.5(Y – 14). Find the arithmetic means of X and Y as well as the coefficient of
correlation between X and Y.
