A Positive Relationship


Chapter 7

Correlation and Regression

Correlation: The correlation between two variables X and Y indicates whether they are related to each
other and, if so, to what extent. There are two types of correlation: positive and negative.
If the two variables move in the same direction, i.e. they increase or decrease together,
there is a positive correlation between them. If they move in opposite directions, i.e. one
increases while the other decreases, there is a negative correlation between them.
The correlation between two variables can be expressed in several ways.
(i) Scatter Diagrams: By plotting the values of two variables X and Y, we can understand the
relationship between them.
[Figure: three scatter diagrams showing a positive relationship, a negative relationship and no apparent relationship]

If, for a given change in X, the change in Y is the same at every point and in the same
direction, then all the points lie on a straight line whose slope is constant throughout. This is
called perfect positive correlation between X and Y.
Similarly, if for a given change in X the change in Y is the same at every point but in
the opposite direction, there is perfect negative correlation between them.
(ii) Karl Pearson’s coefficient of correlation ‘r’: The value of the correlation coefficient lies
between -1 and +1. It tells us to what extent the variables are linearly related.
If r is between 0.7 and 1, it shows high positive correlation; if it is between -0.7 and -1, it shows
high negative correlation. If r is close to 0.5 or -0.5, it shows moderate positive or negative
correlation. If r is close to 0, it shows low positive or negative correlation.
Karl Pearson’s correlation coefficient can be calculated as
r = Cov(X, Y) / (σx σy)
where Cov(X, Y) is the covariance between X and Y, σx is the standard deviation of variable X
and σy is the standard deviation of variable Y.
Cov(X, Y) = ∑((X – X bar)(Y – Y bar)) / n, where n is the number of observations.
σx = sqrt(∑(X – X bar)² / n), σy = sqrt(∑(Y – Y bar)² / n)
Substituting the expressions for Cov(X, Y), σx and σy in the formula for r, the factors of n cancel and we get
r = ∑xy / sqrt(∑x² * ∑y²)
where x = X – X bar and y = Y – Y bar.
Q1. The following data relates to Sales (Crores of rupees) and Profits (in lakhs of rupees):
Sales X : 5 7 8 10 15
Profits Y : 12 15 17 20 21
Find the correlation coefficient between sales and profits.
Solution:
X    Y    x = X – X bar   y = Y – Y bar   xy   x²   y²
5    12   -4              -5              20   16   25
7    15   -2              -2              4    4    4
8    17   -1              0               0    1    0
10   20   1               3               3    1    9
15   21   6               4               24   36   16
X bar = 45/5 = 9, Y bar = 85/5 = 17

∑xy = 51, ∑x² = 58, ∑y² = 54

r = ∑xy / sqrt(∑x² * ∑y²) = 51 / sqrt(58 × 54) = 51 / sqrt(3132) = 0.91


This means that there is a very high positive correlation between X and Y: sales and profits
move in the same direction, and the relationship between them is very strong.
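
The calculation above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `pearson_r` is our own choice, and it follows the deviation formula r = ∑xy / sqrt(∑x² * ∑y²) rather than any particular library routine.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from deviations about the means (hypothetical helper)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    dx = [x - x_bar for x in xs]          # x = X - X bar
    dy = [y - y_bar for y in ys]          # y = Y - Y bar
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_x2 = sum(a * a for a in dx)
    sum_y2 = sum(b * b for b in dy)
    return sum_xy / sqrt(sum_x2 * sum_y2)

sales = [5, 7, 8, 10, 15]        # X, crores of rupees
profits = [12, 15, 17, 20, 21]   # Y, lakhs of rupees
r = pearson_r(sales, profits)
print(round(r, 2))  # 0.91
```

Because the formula works on deviations from the means, the units of X and Y do not affect r.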

Q2. The following data relates to advertising expenditure (in lakhs of rupees) and sales (in crores
of rupees):

Advertising expenditure X : 10 12 15 23 20 22

Sales Y : 14 17 23 25 21 22

Find the correlation coefficient between advertising expenditure and sales.


Solution:

X    Y    x = X – X bar   y = Y – Y bar   xy     x²   y²
10   14   -7              -6.3            44.1   49   39.69
12   17   -5              -3.3            16.5   25   10.89
15   23   -2              2.7             -5.4   4    7.29
23   25   6               4.7             28.2   36   22.09
20   21   3               0.7             2.1    9    0.49
22   22   5               1.7             8.5    25   2.89

X bar = 102/6 = 17, Y bar = 122/6 = 20.3

∑xy = 94, ∑x² = 148, ∑y² = 83.34

r = ∑xy / sqrt(∑x² * ∑y²) = 94 / sqrt(148 × 83.34) = 0.846

This means that there is a high positive correlation between X and Y. So both X and Y move in
the same direction and the extent of relationship between them is high.

(iii) Spearman’s coefficient of rank correlation ‘rs’: If the data are available in the form of
rankings, we can compute the rank correlation coefficient to measure the extent of similarity or
dissimilarity between the two rankings. The value of rs lies between -1 and +1. We can calculate rs as
follows:

rs = 1 – 6∑d² / (n(n² – 1)), where n is the number of observations and d is the difference
between the two ranks of each observation.

Suppose we ask two employees to rank various factors affecting job
satisfaction. We can then find out to what extent the opinions of these two employees are similar or
dissimilar. If rs is positive, the opinions are similar; if it is negative, the opinions are
dissimilar.

Factors affecting job satisfaction   Rank by employee 1 (R1)   Rank by employee 2 (R2)   d²
Salary                               1                         3                         4
Job conditions                       2                         2                         0
Growth opportunities                 3                         1                         4
Relations with peers                 4                         4                         0

∑d² = 8, n = 4

rs = 1 – 6∑d² / (n(n² – 1)) = 1 – (6 × 8) / (4 × (16 – 1)) = 1 – 48/60 = 0.2

So there is a low positive correlation between the opinions of the two employees. This means
that their opinions are similar, but the relationship is weak.

Now, we take an example where marks of students in two subjects are given. We can convert
these marks into ranks and find the rank correlation coefficient.

Marks in Test 1   Marks in Test 2   R1   R2   d²
14                7                 3    5    4
18                16                2    2    0
9                 19                6    1    25
10                12                5    4    1
20                15                1    3    4
11                6                 4    6    4
∑d² = 38, n = 6

rs = 1 – 6∑d² / (n(n² – 1)) = 1 – (6 × 38) / (6 × (36 – 1)) = 1 – 228/210 = -0.0857

So there is a low negative correlation between the marks in the two tests. This means that the
marks in the two tests tend to move in opposite directions, but the relationship is very
weak.
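
As a rough sketch of the marks example above, the following Python converts marks to ranks (rank 1 for the highest mark) and applies rs = 1 – 6∑d² / (n(n² – 1)). The helper names `ranks_descending` and `spearman_rs` are hypothetical, and the ranking helper assumes that no value is repeated.

```python
def ranks_descending(values):
    """Rank 1 goes to the largest value; assumes no repeated values."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

def spearman_rs(r1, r2):
    """Spearman's rank correlation from two lists of untied ranks."""
    n = len(r1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

test1 = [14, 18, 9, 10, 20, 11]
test2 = [7, 16, 19, 12, 15, 6]
r1 = ranks_descending(test1)   # [3, 2, 6, 5, 1, 4]
r2 = ranks_descending(test2)   # [5, 2, 1, 4, 3, 6]
rs = spearman_rs(r1, r2)
print(round(rs, 4))  # -0.0857
```

Passing the employees' ranks directly, `spearman_rs([1, 2, 3, 4], [3, 2, 1, 4])`, gives approximately 0.2, matching the job satisfaction example.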

If some values repeat in one or both variables, we give the average rank to the tied
values and add a correction factor to ∑d² in the formula. The correction factor can be written as

Correction factor = (1/12) * [(m₁³ – m₁) + (m₂³ – m₂) + …], where m₁, m₂, … are the numbers of
times each repeated value occurs.

Then rs = 1 – 6(∑d² + correction factor) / (n(n² – 1))

Q1. For the following data, find the rank correlation coefficient after making adjustment for tied
ranks.
X: 48 33 40 9 16 16 65 24 16 57
Y: 13 13 24 6 15 4 20 9 6 19

Solution:
X    Y    R1   R2    d²
48   13   3    5.5   6.25
33   13   5    5.5   0.25
40   24   4    1     9
9    6    10   8.5   2.25
16   15   8    4     16
16   4    8    10    4
65   20   1    2     1
24   9    6    7     1
16   6    8    8.5   0.25
57   19   2    3     1
∑d² = 41, n = 10

Here 16 occurs three times in X, while 13 and 6 each occur twice in Y.

Correction factor = (1/12) * [(m₁³ – m₁) + (m₂³ – m₂) + …]

= (1/12) * [(3³ – 3) + (2³ – 2) + (2³ – 2)] = (1/12) * [24 + 6 + 6] = 3

Then rs = 1 – 6(∑d² + correction factor) / (n(n² – 1))

= 1 – 6(41 + 3) / (10 × (10² – 1)) = 1 – 264/990 = 0.733
There is a high positive correlation between X and Y.

Q2. Find Spearman’s rank correlation coefficient.


X    Y    R1    R2    d²
5    14   5     1.5   12.25
12   7    1     4     9
10   14   2.5   1.5   1
8    5    4     5     1
10   8    2.5   3     0.25
∑d² = 23.5, n = 5

Correction factor = (1/12) * [(m₁³ – m₁) + (m₂³ – m₂) + …]

= (1/12) * [(2³ – 2) + (2³ – 2)] = (1/12) * [6 + 6] = 1

Then rs = 1 – 6(∑d² + correction factor) / (n(n² – 1))

= 1 – 6(23.5 + 1) / (5 × (5² – 1)) = 1 – 147/120 = -0.225
There is a low negative correlation between X and Y.
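
The tied-rank procedure above can be sketched in Python as follows. The function names are hypothetical, and the code implements the correction-factor formula used in this chapter, which is not necessarily how statistical libraries handle ties.

```python
from collections import Counter

def average_ranks(values):
    """Rank 1 = largest; tied values share the average of the positions they occupy."""
    order = sorted(values, reverse=True)
    positions = {}
    for pos, v in enumerate(order, start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def tie_correction(*series):
    """(1/12) * sum of (m^3 - m) over every group of m equal values in each series."""
    total = 0
    for values in series:
        for m in Counter(values).values():
            total += m ** 3 - m   # a value occurring once contributes 0
    return total / 12

def spearman_rs_tied(xs, ys):
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    cf = tie_correction(xs, ys)
    return 1 - 6 * (sum_d2 + cf) / (n * (n ** 2 - 1))

x = [48, 33, 40, 9, 16, 16, 65, 24, 16, 57]
y = [13, 13, 24, 6, 15, 4, 20, 9, 6, 19]
print(round(spearman_rs_tied(x, y), 3))  # 0.733
```

For the second data set, `spearman_rs_tied([5, 12, 10, 8, 10], [14, 7, 14, 5, 8])` evaluates to approximately -0.225.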

Regression: In the case of regression, we express the relation between two variables in the form of
a cause-effect relationship, which can be written as a linear equation

Y = a1 + byx X

where Y is called the dependent variable, X is called the independent variable, and a1 and byx
are called the regression constants. This is called the
simple linear regression equation of Y on X. This equation can be used for forecasting or
predicting the value of the dependent variable Y for a given value of the independent
variable X.

Example: Y = 1 + 2X

For a given set of values of X and Y, many lines can be drawn through the plotted points, but
only one line is closest to these points overall; this is called the best fit line. The
values of a1 and byx can be found by the method of least squares. In this method, we
minimise the value of ∑e², where e is the difference between the Y coordinate of a plotted
point and the Y coordinate of the point on the straight line at the same X.

The formulae for a1 and byx can be written as

byx = ∑xy / ∑x², a1 = Y bar – byx * X bar, where x = X – X bar and y = Y – Y bar

These values of a1 and byx can be substituted in the equation Y = a1 + byx X, and this equation can
be used to forecast the value of Y for a given value of X.

Similarly, we can find the simple linear regression equation of X on Y as

X = a2 + bxy Y

where bxy = ∑xy / ∑y² and a2 = X bar – bxy Y bar

The relation between byx and bxy can be written as

byx × bxy = r², where r² is called the coefficient of determination. It gives the proportion of the
variation in the dependent variable that is explained by the independent variable.

r = ± sqrt(byx × bxy)

byx and bxy always have the same sign, because both share the numerator ∑xy; otherwise r would be
imaginary. If byx and bxy are positive, r takes the positive sign, and if byx and bxy are
negative, r is also negative.

Q1. For the following data, find the simple linear regression equation of Y on X and forecast Y
when X = 20.

X    Y    x = X – X bar   y = Y – Y bar   xy   x²   y²
5    12   -4              -5              20   16   25
7    15   -2              -2              4    4    4
8    17   -1              0               0    1    0
10   20   1               3               3    1    9
15   21   6               4               24   36   16
X bar = 45/5 = 9, Y bar = 85/5 = 17

∑xy = 51, ∑x² = 58, ∑y² = 54

byx = ∑xy / ∑x² = 51/58 = 0.8793, a1 = Y bar – byx * X bar = 17 – (0.8793)(9) = 9.086

Hence the simple linear regression equation of Y on X is Y = 9.086 + 0.8793 X

Putting X = 20, we get Y = 9.086 + 0.8793(20) = 26.672
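
A minimal Python sketch of this least squares fit is given below. The name `fit_line` is our own, and the code simply applies byx = ∑xy / ∑x² and a1 = Y bar – byx * X bar to the data of this question.

```python
def fit_line(xs, ys):
    """Return (a1, byx) for the least squares line Y = a1 + byx * X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_x2 = sum((x - x_bar) ** 2 for x in xs)
    b = sum_xy / sum_x2          # slope byx
    a = y_bar - b * x_bar        # intercept a1
    return a, b

a, b = fit_line([5, 7, 8, 10, 15], [12, 15, 17, 20, 21])
forecast = a + b * 20            # predicted Y when X = 20
print(round(a, 3), round(b, 4), round(forecast, 3))  # 9.086 0.8793 26.672
```

Note that X = 20 lies outside the observed range of X (5 to 15), so the forecast is an extrapolation and should be read with some caution.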

Q2. The coefficient of correlation between the variables x and y is 0.64, their covariance is 16.
The variance of x is 9. Find the standard deviation of y.

Solution: r = Cov(X, Y) / (σx σy)

r = 0.64, V(X) = 9 so σx = 3, Cov(X, Y) = 16, σy = ?

0.64 = 16 / (3 σy), so σy = 16 / (3 × 0.64) = 8.33

Q3. Calculate the regression of Y on X and the regression equation of X on Y for the following
data. Also find the coefficient of determination.

X : 10 12 15 23 20

Y : 14 17 23 25 21

Solution: X bar = ∑X / n = 80/5 = 16, Y bar = ∑Y / n = 100/5 = 20

x = X – X bar:  -6   -4   -1   7    4
y = Y – Y bar:  -6   -3   3    5    1
xy:             36   12   -3   35   4
x²:             36   16   1    49   16
y²:             36   9    9    25   1

∑xy = 84, ∑x² = 118, ∑y² = 80

byx = ∑xy / ∑x² = 84 / 118 = 0.71, a1 = Y bar – byx X bar = 20 – (0.71)(16) = 8.64

Simple linear regression equation of Y on X is

Y = a1 + byx X, i.e. Y = 8.64 + 0.71 X

bxy = ∑xy / ∑y² = 84 / 80 = 1.05, a2 = X bar – bxy Y bar = 16 – (1.05)(20) = -5

So the simple linear regression equation of X on Y is X = -5 + 1.05 Y

Coefficient of determination: r² = byx × bxy = 0.71 × 1.05 = 0.7455 (with the unrounded slope 84/118, r² = 0.7475)
Karl Pearson’s coefficient of correlation: r = ∑xy / sqrt(∑x² * ∑y²) = 84 / sqrt(118 × 80) = 0.8646
There is a high positive correlation between X and Y.
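
The relation byx × bxy = r² can be checked numerically for this data set. The sketch below uses our own variable names and no intermediate rounding, so r² comes out near 0.7475 rather than the 0.7455 obtained from the two-decimal slopes.

```python
from math import sqrt, copysign

xs = [10, 12, 15, 23, 20]
ys = [14, 17, 23, 25, 21]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sum_x2 = sum((x - x_bar) ** 2 for x in xs)
sum_y2 = sum((y - y_bar) ** 2 for y in ys)

b_yx = sum_xy / sum_x2            # slope of Y on X
b_xy = sum_xy / sum_y2            # slope of X on Y
r_squared = b_yx * b_xy           # coefficient of determination
r = copysign(sqrt(r_squared), sum_xy)   # r takes the sign of the slopes

print(round(r_squared, 4), round(r, 4))  # 0.7475 0.8646
```

Taking the sign of r from ∑xy reflects the fact that both slopes share that numerator.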

Q4. Find Karl Pearson’s coefficient of correlation and the equation of the best fit simple linear
regression line for the following data.
X 9 8 7 6 5 4 3 2 1
Y 15 16 14 13 11 12 10 8 9
Solution:
X    Y    x = X – X bar   y = Y – Y bar   xy   x²   y²
9    15   4               3               12   16   9
8    16   3               4               12   9    16
7    14   2               2               4    4    4
6    13   1               1               1    1    1
5    11   0               -1              0    0    1
4    12   -1              0               0    1    0
3    10   -2              -2              4    4    4
2    8    -3              -4              12   9    16
1    9    -4              -3              12   16   9
X bar = 45/9 = 5, Y bar = 108/9 = 12

∑xy = 57, ∑x² = 60, ∑y² = 60

byx = ∑xy / ∑x² = 57 / 60 = 0.95, a1 = Y bar – byx X bar = 12 – (0.95)(5) = 7.25

Simple linear regression equation of Y on X is

Y = a1 + byx X, i.e. Y = 7.25 + 0.95 X

r = ∑xy / sqrt(∑x² * ∑y²) = 57 / sqrt(60 × 60) = 0.95

This means that there is a very high positive correlation between X and Y.

Q5. Following are the average prices of a particular stock and the values of Stock Exchange
index for 6 years:

Stock price X (Rs.) Index Y


245 307
255 322
240 337
390 310
655 350
393 360
Calculate the coefficient of correlation between the share price and the SE index. Also find the
simple linear regression equation of Y on X.

Solution:
X     Y     x = X – X bar   y = Y – Y bar   xy     x²      y²
245   307   -118            -24             2832   13924   576
255   322   -108            -9              972    11664   81
240   337   -123            6               -738   15129   36
390   310   27              -21             -567   729     441
655   350   292             19              5548   85264   361
393   360   30              29              870    900     841
X bar = 2178/6 = 363, Y bar = 1986/6 = 331

∑xy = 8917, ∑x² = 127610, ∑y² = 2336

byx = ∑xy / ∑x² = 8917 / 127610 = 0.07, a1 = Y bar – byx X bar = 331 – (0.07)(363) = 305.59

Simple linear regression equation of Y on X is

Y = a1 + byx X, i.e. Y = 305.59 + 0.07 X

r = ∑xy / sqrt(∑x² * ∑y²) = 8917 / sqrt(127610 × 2336) = 8917 / (357.22 × 48.33) = 0.516

This means that there is a moderate positive correlation between X and Y.

Q6. Find Karl Pearson’s coefficient of correlation and the equation of the best fit simple linear
regression line for the following data.
Age X 56 42 36 47 49 42 60 68
Blood pressure Y 147 125 118 128 145 140 155 162

Solution:
X    Y     x = X – X bar   y = Y – Y bar   xy    x²    y²
56   147   6               7               42    36    49
42   125   -8              -15             120   64    225
36   118   -14             -22             308   196   484
47   128   -3              -12             36    9     144
49   145   -1              5               -5    1     25
42   140   -8              0               0     64    0
60   155   10              15              150   100   225
68   162   18              22              396   324   484
X bar = 400/8 = 50, Y bar = 1120/8 = 140

∑xy = 1047, ∑x² = 794, ∑y² = 1636

byx = ∑xy / ∑x² = 1047 / 794 = 1.319, a1 = Y bar – byx X bar = 140 – (1.319)(50) = 74.05

Simple linear regression equation of Y on X is

Y = a1 + byx X, i.e. Y = 74.05 + 1.319 X

r = ∑xy / sqrt(∑x² * ∑y²) = 1047 / sqrt(794 × 1636) = 1047 / (28.18 × 40.45) = 0.919

This means that there is a high positive correlation between X and Y.

Q7. A research project was undertaken to determine if there is a relationship between the years
of experience on the job (X) and efficiency rating of employees (Y). The objective of the study
was to predict the efficiency rating of the employee. The sample results are as follows:

Years on job (X):        1   20   6   8   2   1   14   8   4   6
Efficiency rating (Y):   6   5    3   5   2   2   4    5   4   4

(a) Find the correlation coefficient between X and Y.

(b) Find the linear regression of Y on X.

Solution:

X    Y   x = X – X bar   y = Y – Y bar   xy    x²    y²
1    6   -6              2               -12   36    4
20   5   13              1               13    169   1
6    3   -1              -1              1     1     1
8    5   1               1               1     1     1
2    2   -5              -2              10    25    4
1    2   -6              -2              12    36    4
14   4   7               0               0     49    0
8    5   1               1               1     1     1
4    4   -3              0               0     9     0
6    4   -1              0               0     1     0

X bar = 70/10 = 7, Y bar = 40/10 = 4

∑xy = 26, ∑x² = 328, ∑y² = 16

byx = ∑xy / ∑x² = 26 / 328 = 0.079, a1 = Y bar – byx X bar = 4 – (0.079)(7) = 3.447

Simple linear regression equation of Y on X is

Y = a1 + byx X, i.e. Y = 3.447 + 0.079 X

r = ∑xy / sqrt(∑x² * ∑y²) = 26 / sqrt(328 × 16) = 26 / (18.11 × 4) = 0.36

This means that there is a low positive correlation between X and Y.

Q8. Quinine may be determined by measuring the fluorescence intensity in 1 M sulphuric acid.
Standard solutions of quinine gave the following fluorescence values. Calculate the correlation
coefficient.

Concentration of quinine Y:   0.00   0.10   0.20   0.30    0.40
Fluorescence intensity X:     0.00   5.20   9.80   12.30   17.70

If the intensity was observed to be 14.85, what is the concentration of quinine Y likely to be in
the solution?

Solution:
X      Y     x = X – X bar   y = Y – Y bar   xy     x²      y²
0      0     -9              -0.2            1.8    81      0.04
5.2    0.1   -3.8            -0.1            0.38   14.44   0.01
9.8    0.2   0.8             0               0      0.64    0
12.3   0.3   3.3             0.1             0.33   10.89   0.01
17.7   0.4   8.7             0.2             1.74   75.69   0.04

X bar = 45/5 = 9, Y bar = 1/5 = 0.2

∑xy = 4.25, ∑x² = 182.66, ∑y² = 0.1

byx = ∑xy / ∑x² = 4.25 / 182.66 = 0.023

a1 = Y bar – byx X bar = 0.2 – (0.023)(9) = -0.007

Simple linear regression equation of Y on X is

Y = a1 + byx X, i.e. Y = -0.007 + 0.023 X

Putting X = 14.85, we get Y = -0.007 + (0.023)(14.85) = 0.33455

r = ∑xy / sqrt(∑x² * ∑y²) = 4.25 / sqrt(182.66 × 0.1) = 4.25 / sqrt(18.266) = 4.25 / 4.274 = 0.994

This means that there is a very high positive correlation between X and Y.

Q9. The manufacturers of a particular brand of chocolate were interested in examining the
relationship between the sales of chocolates and shelf space allocated to that brand of chocolate
by various stores. Data from 10 stores are as follows:

Sales (Rs. in thousands) Y:   25   15    28    30    17    16    12    21    19    27
Shelf space (sq ft) X:        5    3.2   5.4   6.1   4.3   3.1   2.6   6.4   4.9   6

Determine the regression to predict sales using shelf space as the independent variable. Also
find the Karl Pearson’s correlation coefficient between X and Y.

Solution:

X     Y    x = X – X bar   y = Y – Y bar   xy     x²     y²
5     25   0.3             4               1.2    0.09   16
3.2   15   -1.5            -6              9      2.25   36
5.4   28   0.7             7               4.9    0.49   49
6.1   30   1.4             9               12.6   1.96   81
4.3   17   -0.4            -4              1.6    0.16   16
3.1   16   -1.6            -5              8      2.56   25
2.6   12   -2.1            -9              18.9   4.41   81
6.4   21   1.7             0               0      2.89   0
4.9   19   0.2             -2              -0.4   0.04   4
6     27   1.3             6               7.8    1.69   36
X bar = 47/10 = 4.7, Y bar = 210/10 = 21

∑xy = 63.6, ∑x² = 16.54, ∑y² = 344

byx = ∑xy / ∑x² = 63.6 / 16.54 = 3.845, a1 = Y bar – byx X bar = 21 – (3.845)(4.7) = 2.9285

Simple linear regression equation of Y on X is Y = a1 + byx X, i.e. Y = 2.9285 + 3.845 X

r = ∑xy / sqrt(∑x² * ∑y²) = 63.6 / sqrt(16.54 × 344) = 63.6 / sqrt(5689.76) = 63.6 / 75.43 = 0.843

This means that there is a high positive correlation between X and Y.

Multiple linear regression


Here the dependent variable Y depends on more than one independent variable. The equation is of the form
Y = a + b1X1 + b2X2 + … + bkXk
where X1, X2, …, Xk are the independent variables.
With two independent variables, the values of a, b1 and b2 are obtained from three normal equations, where n is the number of observations:
∑Y = na + b1∑X1 + b2∑X2
∑X1Y = a∑X1 + b1∑X1² + b2∑X1X2
∑X2Y = a∑X2 + b1∑X1X2 + b2∑X2²

Consider the following example:

The data below show the profit (in Rs. ’000), sales (in Rs. lakhs) and advertising expenditure (in
Rs. ’00). Find the multiple regression equation of profit on sales and advertising expenditure.

Sales (X1)   Advertising expenditure (X2)   Profit (Y)   X1²    X1X2   X2²   X1Y   X2Y
24           16                             10           576    384    256   240   160
35           17                             11           1225   595    289   385   187
38           18                             12           1444   684    324   456   216
41           19                             13           1681   779    361   533   247
42           20                             14           1764   840    400   588   280
∑X1 = 180, ∑X2 = 90, ∑Y = 60, ∑X1² = 6690, ∑X1X2 = 3282, ∑X2² = 1630, ∑X1Y = 2202,
∑X2Y = 1090, n = 5

Substituting in the above equations, we get

60= 5a + 180 b1 + 90 b2

2202= 180a + 6690 b1 + 3282 b2

1090= 90a + 3282 b1 + 1630 b2

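Solving the above equations simultaneously gives a = -6, b1 = 0 and b2 = 1, which we substitute in the
equation

Y = a + b1X1 + b2X2
Y = -6 + 0 X1 + 1 X2, i.e. Y = X2 – 6

These values satisfy all three equations: 5(-6) + 90 = 60, 180(-6) + 3282 = 2202 and 90(-6) + 1630 = 1090. In this data set the profit Y is exactly 6 less than the advertising expenditure X2 in every row, so sales X1 adds nothing to the fit.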
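
As a rough check, the three normal equations stated above can be built from the data and solved in Python. The sketch below uses a small hand-written Gauss-Jordan elimination with exact fractions; the helper names `solve_linear` and `dot` are our own, and a linear algebra library could be used instead. For this data set the profit column equals X2 – 6 in every row, and the exact solution reflects that.

```python
from fractions import Fraction

def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # find a row with a non-zero entry in this column and move it up
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [row[-1] for row in M]

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

x1 = [24, 35, 38, 41, 42]   # sales
x2 = [16, 17, 18, 19, 20]   # advertising expenditure
y = [10, 11, 12, 13, 14]    # profit
n = len(y)

# coefficient matrix and right-hand side of the three normal equations
A = [[n,       sum(x1),     sum(x2)],
     [sum(x1), dot(x1, x1), dot(x1, x2)],
     [sum(x2), dot(x1, x2), dot(x2, x2)]]
rhs = [sum(y), dot(x1, y), dot(x2, y)]

a, b1, b2 = solve_linear(A, rhs)
print(a, b1, b2)  # -6 0 1
```

Exact fractions are used here only to avoid rounding noise in a small hand calculation; floating point arithmetic would usually be adequate.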
