
JAIPUR NATIONAL UNIVERSITY

SCHOOL OF BUSINESS AND MANAGEMENT

QUANTITATIVE TECHNIQUES FOR MANAGEMENT

ASSIGNMENT ON CORRELATION AND REGRESSION

SUBMITTED TO: MR. SWADHEEN JAIN (FACULTY – JNU)
SUBMITTED BY: CHARU SHARMA, MBA (I-B)

DATED: 21-10-2010

CORRELATION

Introduction to Correlation :

Correlation is the relationship between two or more interrelated series of
variables. Because it is mainly an inter-relationship between two series of
variables, it is also known as bivariate analysis. There are three
types of distributions:

 Univariate distribution or analysis.


 Bivariate distribution or analysis.
 Multi-variate distribution or analysis.
In a univariate analysis one analyses only one variable. Variables are data
which vary with some event. In the study of mean, median, mode and measures
of dispersion we study univariate data, while in correlation we study bivariate
data. In multi-variate analysis we study more than two variables for each unit of study.

Meaning of Correlation:

In daily practice we come across a large number of problems involving the use
of two or more than two variables. If two quantities vary in such a way that
movements in one are accompanied by movements in the other, these quantities
are correlated. The degree of relationship between the variables under
consideration is measured through the correlation analysis. The measure of
correlation called the correlation coefficient or correlation index
summarizes in one figure the direction and degree of correlation.

The correlation analysis refers to the techniques used in measuring the


closeness/variation of the relationship between the variables. The detection and
analysis of correlation between two statistical variables requires relationship of
some sort which associates the observation in pairs, one of each pair being a
value of each of the two variables. The computation concerning the degree of
closeness is based on the regression equation. However, it is possible to perform
correlation analysis without actually having a regression equation.

Definitions of Correlation :
“Statistical inquiry into concomitant variation is correlation analysis.”

-Simpson & Kafka

“When the relationship is of a quantitative nature, the appropriate statistical tool


for discovering and measuring the relationship and expressing it in brief
formula is known as correlation.”

-Croxton & Cowden

“If two or more quantities vary in sympathy so that movements in one tend to
be accompanied by corresponding movements in the other then they are said to
be correlated.”

-L. R. Conner

“Correlation means that between two series or groups of data there exists some
causal connection.”

-Prof. King

“Whenever some definite connection exists between two or more groups,


classes or series of data, there is said to be correlation.”

-Boddington

Significance of the Study of Correlation :

The study of correlation is of immense use in practical life because of the


following reasons:

 Most of the variables show some kind of relationship. With the help
of correlation analysis we can measure in one figure the degree of
relationship existing between the variables.
 Once we know that the two variables are closely related, we can
estimate the value of one variable given the value of another. This is
known with the help of regression analysis.
 Correlation analysis contributes to the understanding of economic
behaviour, aids in locating the critically important variables on which
others depend, may reveal to the economist the connection by which
disturbances spread and suggest to him the paths through which
stabilizing forces may become effective.

Types of Correlation :

Correlation is described or classified in several different ways. Three of the


most important ways of classifying correlation are :

1. Positive and Negative Correlation : Whether correlation is


positive(direct) or negative(inverse) would depend upon the direction of
change of the variables. If both the variables are varying in the same
direction , i.e., if as one variable is increasing the other, on an average, is
also increasing or, if as one variable is decreasing the other, on an
average, is also decreasing, correlation is said to be positive. If, on the
other hand, the variables are varying in opposite directions, i.e., as one
variable is increasing, the other is decreasing or vice-versa, correlation is
said to be negative. The following examples would illustrate the
difference between positive and negative correlation.

Positive correlation, where expenditure is increasing with the increasing


income-

Income Expenditure (in 1000’s)


15 10
20 12
22 15
25 18
37 20
50 30
Positive correlation, where supply decreases as price decreases-

Price(rs.) Supply (in 100’s of units)


15 41
14 36
13 30
12 26
11 20
10 14

More examples:

-relationship between age of husband and age of wife

-no. of accidents and the no. of vehicles

-more the food, more are the calories gained

-more you smile, more happy you feel, etc.

Negative correlation, where demand of product decreases with the


increase in its price-

Price (rs.) Demand (in 100’s of units)


100 73
110 70
115 68
125 63
130 60
135 55

Negative correlation, where crime rate is increasing with the decreasing


employment rate-

Employment rate Crime rate


40 20
30 30
22 40
15 60
10 80
8 85

2. Simple, Partial and Multiple Correlation : The distinction between


simple, partial and multiple correlation is based upon the number of
variables studied. When only two variables are studied it is a problem of
simple correlation. When three or more variables are studied it is a
problem of either multiple or partial correlation. In multiple correlation,
three or more variables are studied simultaneously.

For example :

Relationship between the yield of rice per acre and both the amount of
rainfall and the amount of fertilizers used.

On the other hand, in partial correlation we recognize more than two
variables, but consider only two of them to be influencing each other, the
effect of the other influencing variables being kept constant.

For example :

In the rice problem taken above if we limit our correlation analysis of


yield and rainfall to periods when a certain average daily temperature
existed it becomes a problem relating to partial correlation only.

3. Linear and Non-linear Correlation : The distinction between linear and


non-linear correlation is based upon the constancy of the ratio of change
between the variables. If the amount of change in one variable tends to
bear constant ratio to the amount of change in the other variable then the
correlation is said to be linear.

For example :

Rainfall (in cms) Wheat (in 1000’s tonnes)


70 10
140 20
210 30
280 40
350 50

[Graph : wheat production (in 1000’s tonnes) plotted against rainfall (in cms); the plotted points lie on a straight line.]

It is clear that the ratio of change between the two variables is the same.
If such variables are plotted on a graph paper all the plotted points would fall
on a straight line.

Correlation would be called non-linear or curvilinear if the amount of
change in one variable does not bear a constant ratio to the amount of
change in the other variable.

For example :

If we double the amount of rainfall the production of rice or wheat, etc.,


would not necessarily be doubled. It may be pointed out that in most of the
practical situations, we find a non-linear relationship between the variables.

However, since techniques of analysis for measuring non-linear


correlation are far more complicated than those for linear correlation, we
generally make an assumption that the relationship between the variables
is of the linear type.

Thus it is clear from the above discussion that:

1) If changes in two series of variables are in the same direction and having a
constant ratio, the correlation is linear positive.
2) If changes in two groups of variables are in an opposite direction in a
constant ratio, the correlation will be known as linear negative.

3) If changes in two groups of variables are in the same direction but not in a
constant ratio, the correlation is positive non-linear.

4) If changes in two groups of variables are in opposite direction and not in


constant ratio, the correlation is negative non-linear.
These are the various linear and curvilinear correlations which also display the
different degrees of correlation :

a. Perfect Correlation : When the changes in the two variables are in a
constant ratio and in the same direction, the correlation is said to be
perfect positive correlation; here the coefficient of correlation is +1. If
the changes are in a constant ratio but in opposite directions, the
correlation is said to be perfect negative correlation, where the
coefficient of correlation is -1.
b. Partial (Limited) Correlation : This is also called limited correlation,
as the absolute value of the coefficient lies between 0 and 1 (not to be
confused with the partial correlation of three or more variables discussed
earlier). It is of three types-
High Degree Correlation : The coefficient lies between 0.75 and 1; it is
high degree positive correlation when the coefficient is positive, and
high degree negative correlation when it is negative.
Moderate Degree Correlation : The coefficient lies between 0.3 and 0.75,
being positive or negative depending upon its sign.
Low Degree Correlation : The coefficient lies between 0 and 0.3, being
positive or negative depending upon its sign.
c. Absence of Correlation : When the variables don’t have any type of
relationship neither in positive nor in negative direction then it is said to
be absence of correlation or no correlation. In such a case coefficient of
correlation is 0.
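The sign-and-degree classification above can be sketched as a small helper function. This is an illustrative sketch only: the function name and the handling of the shared endpoints 0.3 and 0.75 (which the text lists in both adjacent ranges) are my assumptions.

```python
def correlation_degree(r):
    """Classify a correlation coefficient by sign and degree, per the scheme above.

    Boundary handling (>= at 0.75 and 0.3) is an assumed convention, since the
    text gives overlapping interval endpoints.
    """
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    a = abs(r)
    if a == 1:
        degree = "perfect"
    elif a >= 0.75:
        degree = "high degree"
    elif a >= 0.3:
        degree = "moderate degree"
    elif a > 0:
        degree = "low degree"
    else:
        return "absence of correlation"
    sign = "positive" if r > 0 else "negative"
    return f"{degree} {sign}"
```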

Methods of Studying Correlation :

The various methods of ascertaining whether two variables are correlated or not
are :

 Scatter Diagram
 Karl Pearson’s Coefficient of Correlation
 Rank Correlation

1.Scatter Diagram Method :

Scatter diagram is the elementary method of knowing the direction of the
relationship between two variables. We take the independent variable on the
X-axis and the dependent variable on the Y-axis and plot the graph as a dotted
chart, i.e., for each pair of X and Y values we put a dot, and thus obtain as
many points as the number of observations. By looking at the scatter we can
form an idea as to whether the variables are related or not.

The greater the scatter of the plotted points on the chart, the lesser is the
relationship between the two variables. If all the points lie on a straight
line rising from the lower left-hand corner to the upper right-hand corner,
correlation is said to be perfectly positive ( i.e. r= 1 as in Fig 1).

On the other hand, if all the points lie on a straight line falling from the
upper left-hand corner to the lower right-hand corner of the diagram, correlation
is said to be perfectly negative(i.e. r= -1 as in Fig 5).

If the plotted points fall in a narrow band there would be a high degree of
correlation between the variables- correlation shall be positive if the points
show a tendency from the lower left-hand corner to the upper-right hand corner
(0<r<1 as in Fig 2) and negative if the points show a declining tendency from
the upper left-hand corner to the lower right-hand corner of the diagram(-1<r<0
as in Fig 4).

On the other hand, if the points are widely scattered over the diagram it
indicates very little relationship between the variables.

If the points plotted lie on the straight line parallel to the X-axis or in a
haphazard manner, it shows absence of any relationship between the
variables(i.e. r=0 as in Fig 3).
Illustration 1 :

Represent the following sales and advertisement data by scatter diagram and
comment whether there is any correlation between them :

Sales(00’s units) Advertisements (000’s)


40 2
30 4
70 6
50 8
80 10
90 12

[Scatter diagram : advertisements plotted against sales; the points run from the lower left-hand corner to the upper right-hand corner in a narrow band.]

By the inspection of the plotted points it is clear that they are from lower left-
hand to upper right-hand side and they do not scatter much. Therefore, the two
variables have high degree of positive correlation.

Merits of Scatter Diagram :

 It is a very simple and non-mathematical way of studying correlation


between the variables. As such it can be easily understood and a rough
idea can easily be formed as to whether or not the variables are related.
 It is not influenced by the size of the extreme items whereas most of the
mathematical methods of finding correlation are influenced by extreme
items.
 It clearly defines whether relationship is positive or negative and at the
same time it also tells about its linearity and non-linearity.

Limitations of Scatter Diagram :


By applying this method we can get an idea about the direction of correlation
and also whether it is high or low, but we cannot establish the exact degree of
correlation between the variables as is possible by applying the mathematical
methods. To overcome the shortcomings of this method Karl Pearson’s method
is used.

2.Karl Pearson’s Coefficient of Correlation :

The Karl Pearson’s method, popularly known as Pearson’s coefficient of


correlation, is most widely used in practice. The Pearson’s Coefficient of
correlation is denoted by the symbol (r). And it is also known as Product
Moment Correlation.

The formula for computing “r” is :

r = Σ(X − X̄)(Y − Ȳ) / [n · SD(X) · SD(Y)]    ...(i)

or, it can be transformed as

r = COV(X, Y) / [SD(X) · SD(Y)]

Where, COV(X, Y) = covariance of X & Y

SD(X) = standard deviation of X

SD(Y) = standard deviation of Y

X̄ & Ȳ = actual means of X & Y

Also,

COV(X, Y) = Σ(X − X̄)(Y − Ȳ) / n

SD(X) = √[Σ(X − X̄)² / n] = √COV(X, X)

SD(Y) = √[Σ(Y − Ȳ)² / n] = √COV(Y, Y)        (n = number of observations)
Note : The covariance of two variables can never exceed, in absolute value, the
product of their standard deviations, and so the value of the coefficient of
correlation always lies between -1 & 1 :

-1 ≤ r ≤ 1
So substituting these values in the following equation, we get :

r = COV(X, Y) / [SD(X) · SD(Y)]

= [Σ(X − X̄)(Y − Ȳ) / n] / {√[Σ(X − X̄)² / n] · √[Σ(Y − Ȳ)² / n]}

= Σ(X − X̄)(Y − Ȳ) / √[Σ(X − X̄)² · Σ(Y − Ȳ)²]    ...(ii)

Or, r = Σ ∂(X)·∂(Y) / √[Σ{∂(X)}² · Σ{∂(Y)}²]

Where, ∂(X) and ∂(Y) = deviations from the ACTUAL means of X & Y respectively.

Also, COV(X, Y) = Σ(X − X̄)(Y − Ȳ) / n

= (1/n) Σ(XY − XȲ − X̄Y + X̄Ȳ)

= (1/n) (ΣXY − ΣXȲ − ΣX̄Y + ΣX̄Ȳ)

= (1/n) (ΣXY − nX̄Ȳ − nX̄Ȳ + nX̄Ȳ)

= (1/n) (ΣXY − nX̄Ȳ)

= ΣXY/n − X̄Ȳ

Note:
Because ΣXȲ = Ȳ ΣX, and X̄ = ΣX/n can also be written as nX̄ = ΣX, we get
ΣXȲ = nX̄Ȳ.

In the same way, ΣX̄Y = nX̄Ȳ has been substituted in the above derivation.

And again, COV(X, Y) = ΣXY/n − X̄Ȳ

= ΣXY/n − (ΣX/n)(ΣY/n)

Therefore, COV(X, Y) = (nΣXY − ΣX·ΣY) / n²

Also, SD(X) = √COV(X, X) = √[ΣX²/n − (ΣX/n)²] = √[nΣX² − (ΣX)²] / n

SD(Y) = √COV(Y, Y) = √[ΣY²/n − (ΣY/n)²] = √[nΣY² − (ΣY)²] / n

So, r = [(nΣXY − ΣX·ΣY)/n²] / {[√(nΣX² − (ΣX)²)/n] · [√(nΣY² − (ΣY)²)/n]},
or we can say that Karl Pearson’s coefficient of correlation can also be
determined by the following formula :

r = (nΣXY − ΣX·ΣY) / {√[nΣX² − (ΣX)²] · √[nΣY² − (ΣY)²]}    ...(iii)
This is the derivation of the DIRECT METHOD of finding out the coefficient
of correlation, and this method is applied only where deviations of items are
taken from actual mean and not from the assumed mean. Out of the above
derived three formulas for correlation coefficient, any can be used to determine
the value of “r” in case of actual mean being used for getting deviations. It will
be more clear from the following illustration.
Steps to follow :

o Find out the actual mean of both the items i.e., X́ and Ý .
o Take the deviations of both the items from their actual mean and sum
them up separately i.e., firstly X − X́ and Y −Ý and then, ∑ (X − X́ ) and
∑ (Y −Ý ).
o Now find the squares of these deviations and then get their individual
2 2
totals i.e., firstly ( X − X́ )2 and ( Y −Ý )2 and then ∑ ( X− X́ ) and ∑ ( Y −Ý ) .
o Then substitute the values in the stated formula and correlation
coefficient is so obtained.

Illustration 2 :

The following table gives the indices of industrial production and the number
of registered unemployed ( in hundred thousand ). Calculate the value of the
coefficient of correlation between them.

Year : 1991 1992 1993 1994 1995 1996 1997 1998

Index of Production : 100 102 104 107 105 112 103 99

No. of Unemployed : 15 12 13 11 12 12 19 26

Solution : Calculation of Karl Pearson’s Coefficient of Correlation –

Year | Production (X) | X − X̄ | (X − X̄)² | Unemployed (Y) | Y − Ȳ | (Y − Ȳ)² | (X − X̄)(Y − Ȳ)
1991 | 100 | -4 | 16 | 15 | 0 | 0 | 0
1992 | 102 | -2 | 4 | 12 | -3 | 9 | 6
1993 | 104 | 0 | 0 | 13 | -2 | 4 | 0
1994 | 107 | 3 | 9 | 11 | -4 | 16 | -12
1995 | 105 | 1 | 1 | 12 | -3 | 9 | -3
1996 | 112 | 8 | 64 | 12 | -3 | 9 | -24
1997 | 103 | -1 | 1 | 19 | 4 | 16 | -4
1998 | 99 | -5 | 25 | 26 | 11 | 121 | -55
Σ | 832 | 0 | 120 | 120 | 0 | 184 | -92

Now by applying the formula :

r = Σ(X − X̄)(Y − Ȳ) / [√Σ(X − X̄)² · √Σ(Y − Ȳ)²]

= −92 / √(120 × 184) = -0.619
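The direct (actual-mean) method above can be checked numerically. Here is a minimal Python sketch (the helper name `pearson_r` is mine, not from the text) that reproduces Illustration 2:

```python
from math import sqrt

def pearson_r(xs, ys):
    # Direct method: deviations from the actual means, per formula (ii) above.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

production = [100, 102, 104, 107, 105, 112, 103, 99]
unemployed = [15, 12, 13, 11, 12, 12, 19, 26]
r = pearson_r(production, unemployed)  # -92 / sqrt(120 * 184), about -0.619
```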

Now in case the deviations of the items are taken from an assumed mean, we use
the assumed mean method for finding out the coefficient of correlation. The
formula for this method is :

r = [nΣ∂(X)·∂(Y) − Σ∂(X)·Σ∂(Y)] / {√[nΣ{∂(X)}² − {Σ∂(X)}²] · √[nΣ{∂(Y)}² − {Σ∂(Y)}²]}

Where, n = No. of Observations


∂( X) = deviation of X from assumed mean

∂(Y ) = deviation of Y from assumed mean

Steps to follow :

o Take the deviations of both the X & Y series from their assumed means
and obtain their totals separately i.e. firstly ∂( X) and ∂(Y ) and then,
∑ ∂( X ) and ∑ ∂(Y ).
o Then, obtain the squares of the deviations [ ∂( X) and ∂(Y )] i.e., { ∂( X ) }2
2
and { ∂(Y ) } .
2
o Also, obtain the squares of the totals of the deviations i.e., {∑ ∂( X ) } and
2
{∑ ∂(Y )} .
o Now obtain the sum of the product of the individual deviations i.e.,
∑ ∂ ( X ) . ∂(Y ).
o And now substitute the values in the stated formula and the resultant is
the required correlation coefficient.

Illustration 3 :

The following table gives the distribution of items of production and also the
relatively defective items among them, according to the size-groups. Find the
correlation coefficient between the size and defect in quality and its probable
error.

Group Size : 15-16 16-17 17-18 18-19 19-20 20-21


No. of Items(NoI) : 200 270 340 360 400 300

No. of Defective Items(NoDI) : 150 162 170 180 180 114

Solution : Let the average size of group be denoted by X and the % of


defective items by Y.

Assumed mean for X is 17.5 and for Y is 50.

Calculation of coefficient of correlation –

Group size | Avg. size (X) = (LL+UL)/2 | ∂(X) = X − 17.5 | {∂(X)}² | % of defective items (Y) = (NoDI/NoI)×100 | ∂(Y) = Y − 50 | {∂(Y)}² | ∂(X)·∂(Y)
15-16 | 15.5 | -2 | 4 | 75 | 25 | 625 | -50
16-17 | 16.5 | -1 | 1 | 60 | 10 | 100 | -10
17-18 | 17.5 | 0 | 0 | 50 | 0 | 0 | 0
18-19 | 18.5 | 1 | 1 | 50 | 0 | 0 | 0
19-20 | 19.5 | 2 | 4 | 45 | -5 | 25 | -10
20-21 | 20.5 | 3 | 9 | 38 | -12 | 144 | -36
Σ | | 3 | 19 | | 18 | 894 | -106

Substitute the values obtained in the following formula :

r = [nΣ∂(X)·∂(Y) − Σ∂(X)·Σ∂(Y)] / {√[nΣ{∂(X)}² − {Σ∂(X)}²] · √[nΣ{∂(Y)}² − {Σ∂(Y)}²]}

= [6 × (−106) − (3 × 18)] / [√(6 × 19 − 9) · √(6 × 894 − 324)]

= −690 / (√105 · √5040)

= -0.949
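The assumed-mean calculation can likewise be checked in code. A minimal sketch (the function name is assumed) using the group-size data from Illustration 3:

```python
from math import sqrt

def pearson_r_assumed(xs, ys, ax, ay):
    # Assumed-mean method: deviations are taken from assumed means ax and ay.
    n = len(xs)
    dx = [x - ax for x in xs]
    dy = [y - ay for y in ys]
    num = n * sum(p * q for p, q in zip(dx, dy)) - sum(dx) * sum(dy)
    den = (sqrt(n * sum(d * d for d in dx) - sum(dx) ** 2)
           * sqrt(n * sum(d * d for d in dy) - sum(dy) ** 2))
    return num / den

sizes = [15.5, 16.5, 17.5, 18.5, 19.5, 20.5]   # group midpoints (X)
defect_pct = [75, 60, 50, 50, 45, 38]          # % defective (Y)
r = pearson_r_assumed(sizes, defect_pct, 17.5, 50)  # about -0.949
```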
Assumptions of Karl Pearson’s Coefficient of Correlation :

 Linear Relationship – Both the variables have linear relationship, i.e.,


when plotted on the scatter diagram, it gives a straight line.
 Normal Distribution – The two variables are affected by a large number
of independent causes so as to form a normal distribution. Variables like
height, weight, price, demand, supply, etc., are affected by such forces
that a normal distribution is formed.
 Cause and Effect Relationship –There is a cause and effect relationship
between the forces affecting the distribution of the items in the two series.
If such a relationship is not formed between the variables, i.e., if the
variables are independent, there can not be any correlation.

Limitations of Karl Pearson’s Coefficient of Correlation :

 The correlation coefficient always assumes linear relationship regardless


of the fact whether it is correct or not.
 Great care must be exercised while interpreting the coefficient’s value, as
very often the coefficient is misinterpreted.
 The value of the coefficient is unduly affected by the extreme items.
 As compared to other methods, this method takes more time to compute
the value of coefficient.

Coefficient of Correlation and Probable Error :

The probable error of coefficient of correlation helps in interpreting its value.


With the help of probable error it is possible to determine the reliability of the
value of the coefficient in so far as it depends on the conditions of random
sampling. The probable error of the coefficient of correlation is obtained as
follows :

P(E) = (2/3) · (1 − r²)/√n

Also, P(E) = 0.6745 · (1 − r²)/√n

Where, n = no. of pairs of observations

r = coefficient of correlation

o If the value of “r” is less than the P(E), there is no evidence of


correlation, i.e., the value of “r” is not at all significant.
o If the value of “r” is more than 6 times the P(E), the coefficient of
correlation is practically certain , i.e., the value of “r” is significant.
o By adding and subtracting the value of P(E) from the coefficient of
correlation we get respectively the upper and the lower limits within
which “r” can be expected to lie.

Probable Error and Standard Error :

The standard error of a statistic is the standard deviation of the sampling


distribution of that statistic. Standard errors are important because they reflect
how much sampling fluctuation a statistic will show. The inferential
statistics involved in the construction of confidence intervals and significance
testing are based on standard errors. The standard error of a statistic depends on
the sample size. In general, the larger the sample size the smaller the standard
error. The standard error of a statistic is usually designated by the Greek letter
sigma (σ) with a subscript indicating the statistic. For instance, the standard
error of the mean is indicated by the symbol: σM. The formula used to obtain
the Standard Error is :

S(E) = (1 − r²)/√n

So, the relationship between standard error and probable error comes out to be :

P(E) = (2/3) · S(E)
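Both error measures follow directly from r and n. A minimal sketch (function names assumed), applied here to the r = -0.619 obtained in Illustration 2 with n = 8 pairs:

```python
from math import sqrt

def standard_error(r, n):
    # S(E) = (1 - r^2) / sqrt(n)
    return (1 - r ** 2) / sqrt(n)

def probable_error(r, n):
    # P(E) = (2/3) * S(E), equivalently 0.6745 * (1 - r^2) / sqrt(n)
    return (2 / 3) * standard_error(r, n)

pe = probable_error(-0.619, 8)
# |r| = 0.619 is less than 6 * pe (about 0.872), so by the rule above this
# value of r would not be regarded as practically certain at n = 8.
```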

3. Rank Correlation Coefficient :

When facts cannot be measured quantitatively, or it is known that data is not


normal, or when the shape of the distribution is not known, then in such cases
Karl Pearson’s Coefficient of Correlation cannot be calculated. It is difficult to
measure directly the beauty, the intelligence, the harshness, etc. But these can
be ranked according to their quality. Under these cases correlation is calculated
by Rank Correlation method which was developed by Charles Edward
Spearman in 1904; that is why this method is also known as Spearman’s Rank
Differences Method.

Rank Correlation may be studied under two situations –

(a) When ranks are not given : For quantitative data, ranks are not given and
we have to assign them.
(b) When ranks are given : For qualitative data, ranks are always given.
Again, the data here can have two conditions –

o Without repetitive ranks


o With repetitive ranks

Without repetitive ranks :

Spearman’s Coefficient of Correlation (R) is defined as :

R = 1 − 6Σd² / [n(n² − 1)]

Where, d = difference of ranks between paired items in the two series

n = no. of items

Steps to follow :

o Assign different ranks to X and Y variables, i.e., R X and RY . These ranks


are, generally, awarded in the descending order of the values of variables.
o Calculate the rank differences by subtracting the ranks of Y from that of
X, i.e., { R X −RY } and denote it as “d”.
o Now get the squares of these differences and sum them up, i.e.,
∑ { R X −R Y }2.
o And substitute the obtained values in the above stated formula.

Note : In case of qualitative data, ranks are already given. So there we don’t
need to go for the first step and rest of the procedure is same.

Illustration 4: (for qualitative data)

Two ladies were asked to rank 7 different types of lipsticks. The ranks given by
them are as follows :

Lipsticks : A B C D E F G

Neelu : 2 1 4 3 5 7 6

Neena : 1 3 2 4 5 6 7

Calculate Spearman’s rank coefficient of correlation.

Solution : Neelu’s rank be R X and Neena’s ranks be RY .

Calculation of Spearman’s Correlation Coefficient-


RX RY d=R X−R Y d2
2 1 1 1
1 3 -2 4
4 2 2 4
3 4 -1 1
5 5 0 0
7 6 1 1
6 7 -1 1
∑ d 2=12

Substituting the values in the following formula :

R = 1 − 6Σd² / [n(n² − 1)]

= 1 − (6 × 12) / [7(7² − 1)] = 0.786
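The rank-difference computation for untied ranks can be sketched as follows (the helper name is assumed); it reproduces the lipstick rankings of Illustration 4:

```python
def spearman_r(rank_x, rank_y):
    # R = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), for ranks without ties
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

neelu = [2, 1, 4, 3, 5, 7, 6]
neena = [1, 3, 2, 4, 5, 6, 7]
r = spearman_r(neelu, neena)  # 1 - 6*12/(7*48), about 0.786
```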
Illustration 5: (for quantitative data)

Quotations of index numbers of security prices of a certain joint stock company


are given below :

Year : 1991 1992 1993 1994 1995 1996 1997

Debenture price : 97.8 99.2 98.8 98.3 98.4 96.7 97.1

Share price : 73.2 85.8 78.9 75.8 77.2 87.2 83.8

Using rank correlation method, find out the relationship between debenture
prices and share prices.

Solution : Taking debenture prices as X variable and share prices as Y.

Calculation of Rank Correlation Coefficient-

Debenture RX Share RY d=R X−R Y d2


price (X) price(Y)
97.8 3 73.2 1 2 4
99.2 7 85.8 6 1 1
98.8 6 78.9 4 2 4
98.3 4 75.8 2 2 4
98.4 5 77.2 3 2 4
96.7 1 87.2 7 -6 36
97.1 2 83.8 5 -3 9
∑ d 2=62

Substituting these values in the formula :

R = 1 − 6Σd² / [n(n² − 1)]

= 1 − (6 × 62) / [7(7² − 1)] = -0.107

With repetitive ranks :

Spearman’s Coefficient of Correlation is defined as :

R = 1 − 6[Σd² + m(m² − 1)/12 + m(m² − 1)/12 + …] / [n(n² − 1)]

Where, d = difference of ranks between paired items in the two series

n = no. of items

m = no. of items whose ranks are common (one correction term m(m² − 1)/12 is
added for each group of repeated ranks)

Steps to follow :

o Assign different ranks to X and Y variables, i.e., R X and RY . These ranks


are, generally, awarded in the descending order of the values of variables.
o If in the series some items are of uniform value, then the average rank is
given to all those items. Eg: if in the X series the values are 40, 45, 50,
40, 36 and 40, then the 1st rank is given to 50, the 2nd to 45, and all
three 40’s get the (3+4+5)/3 = 4th rank. The 3rd and 5th ranks are not given
to any value, and the next value, i.e., 36, gets the 6th rank.
o Calculate the rank differences by subtracting the ranks of Y from that of
X, i.e., { R X −RY } and denote it as “d”.
o Now get the squares of these differences and sum them up, i.e.,
∑ { R X −R Y }2.
o And substitute the obtained values in the above stated formula.

Illustration 6:
Obtain the rank correlation coefficient between the variables X and Y from the
following pairs of observed values :

X : 50 55 65 50 55 60 50 65 70 75

Y : 110 110 115 125 140 115 130 120 115 160

Solution : Calculation of Rank Correlation Coefficient –

X RX Y RY d=R X−R Y d2
50 2 110 1.5 0.5 0.25
55 4.5 110 1.5 3 9
65 7.5 115 4 3.5 12.25
50 2 125 7 -5 25
55 4.5 140 9 -4.5 20.25
60 6 115 4 2 4
50 2 130 8 -6 36
65 7.5 120 6 1.5 2.25
70 9 115 4 5 25
75 10 160 10 0 0
∑ d 2=134
In the series X, 50 has repeated 3 times (m=3), 55 has repeated 2 times (m=2),
and 65 also has repeated 2 times (m=2). In series Y, 110 has repeated 2 times
(m=2) and 115 has repeated 3 times (m=3).

Substituting the values in the following formula :

R = 1 − 6[Σd² + m(m² − 1)/12 + …] / [n(n² − 1)]

R = 1 − 6[134 + 3(3² − 1)/12 + 2(2² − 1)/12 + 2(2² − 1)/12 + 2(2² − 1)/12 + 3(3² − 1)/12] / [10(10² − 1)]

R = 1 − 6(134 + 2 + 0.5 + 0.5 + 0.5 + 2) / 990 = 1 − 837/990

R = 0.155
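With ties, the average-rank assignment and the m(m² − 1)/12 correction can be sketched together (helper names are assumed; ranks here are assigned ascending, which gives the same R as the descending convention so long as both series use the same direction):

```python
from collections import Counter

def average_ranks(values):
    # Rank 1 = smallest; tied values share the average of their positions.
    order = sorted(values)
    positions = {}
    for pos, v in enumerate(order, start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def spearman_tied(xs, ys):
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    # One correction term m(m^2 - 1)/12 per group of tied values in each series.
    cf = sum(m * (m ** 2 - 1) / 12 for m in Counter(xs).values() if m > 1)
    cf += sum(m * (m ** 2 - 1) / 12 for m in Counter(ys).values() if m > 1)
    return 1 - 6 * (d2 + cf) / (n * (n ** 2 - 1))

X = [50, 55, 65, 50, 55, 60, 50, 65, 70, 75]
Y = [110, 110, 115, 125, 140, 115, 130, 120, 115, 160]
r = spearman_tied(X, Y)  # about 0.155, matching Illustration 6
```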

Merits of Rank Correlation :

 It is easier and simpler to understand and apply as compared to Karl


Pearson’s correlation coefficient. The answers obtained by both the
methods will be same provided no value is repeated , i.e., all the items are
different.
 Where the data are of some qualitative nature like honesty, efficiency,
etc., this method works with great advantage.
 This is the only method that can be used where we are given only ranks
and not the actual data.

Limitations of Rank Correlation :

 This method can not be used for finding out correlation in a grouped
frequency distribution.
 Where the number of items exceeds 30, the calculations become quite
tedious and require a lot of time. So this method shouldn’t be used where
“n” exceeds 30 unless we are given the ranks and not the actual values of
the variables.

REGRESSION ANALYSIS

Introduction to Regression :

By correlation we know the direction and extent of relationship in two related


series. But if we want the best estimate of the related value of a dependent series
from the known value of an independent series, the same cannot be calculated
from Correlation. For this purpose we have to make a regression analysis.

For example :

If we know that the yield of rice and rainfall are closely related, we may find
out the amount of rain required to achieve a certain production figure by
regression.

Regression is that method of statistical analysis with the help of which the value
of other series can be estimated from the known value of one series. Regression
analysis reveals average relationship between two variables and this makes the
estimation or prediction possible.
Meaning of Regression :

The dictionary meaning of this term regression is the ‘act of returning’ or ‘going
back’. The meaning of regression is just reverse of Progression. Progression, in
general, means to move forward while regression means to move backward or
going back or in statistical terms the return to the mean value. Regression is a
statistical technique to construct a mathematical relationship in the form of
equations between two correlated variables. It is a statistical device with the
help of which we are in a position to estimate the unknown values of one
variable from known values of another variable. The variable which is used to
predict the variable of interest is called the independent variable or
explanatory variable denoted by X and the variable we are trying to predict is
called the dependent variable or explained variable denoted by Y. The
analysis used is called the simple linear regression analysis – simple because
there is only one predictor or independent variable and linear because of the
assumed linear relationship between the dependent and independent variables.
The term “linear” means that an equation of a straight line of the form
Y = a + bX, where a and b are constants, is used to describe the average
relationship that exists between the two variables.

Definitions of Regression :

“Regression is the measure of the average relationship between two or more


variables in terms of the original units of the data.”

-Morris Myers Blair

“Regression analysis attempts to establish the ‘nature of the relationship’


between variables – that is, to study the functional relationship between the
variables and thereby provide a mechanism for prediction, or forecasting.”

-Ya Lun Chou

“One of the most frequently used techniques in economics and business


research, to find a relation between two or more variables that are related
causally, is regression analysis.”

-Taro Yamane
Utility of Regression Analysis :

Regression analysis is a branch of statistical theory that is widely used in almost


all the scientific disciplines. In economics, it is the basic technique for
measuring or estimating the relationship among the economic variables that
constitute the essence of economic theory and economic life.

For example :

1. If we know that the two variables, price(X) and demand(Y), are closely
related we can find out the most probable value of X for a given value of
Y or the most probable value of Y for the given value of X.
2. If we know that the amount of tax and the rise in price of the commodity
are closely related, we can find out the expected price for a certain
amount of tax levy.

The regression analysis attempts to accomplish the following :

 Regression analysis provides estimates of values of the dependent


variable from values of the independent variable. The device used to
accomplish the estimation procedure is the regression line. The
regression line describes the average relationship existing between the X and
Y variables, i.e., it displays the mean value of Y for each given value of X.
The equation of this line, called the regression equation, provides
estimates of the dependent variable when values of the independent
variable are inserted into the equation.
 A second goal of regression analysis is to obtain a measure of the error
involved in using the regression line as a basis for estimation. For this
purpose, the standard error of estimate is calculated.
 With the help of the regression coefficients we can calculate the
correlation coefficient. The square of the coefficient of correlation
(r²), called the coefficient of determination, measures the degree of
association or correlation that exists between the two variables.

Regression lines :
The lines of best fit drawn to show the mutual relationship between the X and Y
variables are known as regression lines. Every linear regression problem has
two such lines on the same graph: one representing the regression of X on Y
(which minimizes the errors in X) and the other the regression of Y on X
(which minimizes the errors in Y). These two regression lines always intersect
at (X̄, Ȳ), i.e., at the point of means.

Regression Equation of Y on X :

Here, Y is the dependent variable and X the independent variable; this
regression line minimizes the errors in Y. It is represented by the following
regression equation:

Y = a + b_YX·X

Where, a = intercept parameter,

b_YX = regression coefficient of Y on X, or slope parameter

This b_YX is further defined as:

b_YX = COV(X, Y) / (SD_X)²          .........(i)
Where, COV. (X,Y) = covariance of X & Y

SD(X) = standard deviation of X

Because, in the case of the Direct Method, both quantities can be scaled by n²:

n²·COV(X, Y) = n∑XY − (∑X)(∑Y)

and n²·(SD_X)² = n∑X² − (∑X)²

The common factor n² cancels in the ratio, so b_YX can also be defined as:

b_YX = [n∑XY − (∑X)(∑Y)] / [n∑X² − (∑X)²]          .............(ii)

Also, COV(X, Y) = r·SD_X·SD_Y

Hence, b_YX = r·SD_X·SD_Y / (SD_X)² = r·(SD_Y / SD_X)          ............(iii)

Now, in the case of deviations taken from the actual means:

Regression equation of Y on X: (Y − Ȳ) = r·(SD_Y / SD_X)·(X − X̄)

Where, X̄ and Ȳ are the actual means of the X and Y variables.

Also, if r is not known, the following substitution can be made:

r·(SD_Y / SD_X) = b_YX = ∑(X − X̄)(Y − Ȳ) / ∑(X − X̄)²
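The deviation formulas above can be sketched in Python (a minimal illustration working on plain lists of numbers; the helper names `b_yx` and `intercept_yx` are ours, not a standard API):

```python
# Regression coefficient of Y on X from deviations about the actual means:
# b_YX = sum((X - Xbar)(Y - Ybar)) / sum((X - Xbar)^2)
def b_yx(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

# The regression line passes through the point of means (Xbar, Ybar),
# so the intercept is a = Ybar - b_YX * Xbar.
def intercept_yx(xs, ys):
    return sum(ys) / len(ys) - b_yx(xs, ys) * sum(xs) / len(xs)
```

With these, the estimated value of Y for any X is `intercept_yx(xs, ys) + b_yx(xs, ys) * x`.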

Regression Equation of X on Y :

Here, X is the dependent variable and Y the independent variable; this
regression line minimizes the errors in X. It is represented by the following
regression equation:

X = a + b_XY·Y

Where, a = intercept parameter,

b_XY = regression coefficient of X on Y, or slope parameter

This b_XY can be further defined as:

b_XY = COV(X, Y) / (SD_Y)²          ..........(i)
Where, COV. (X,Y) = covariance of X & Y

SD(Y) =standard deviation of Y

Because, in the case of the Direct Method, both quantities can be scaled by n²:

n²·COV(X, Y) = n∑XY − (∑X)(∑Y)

and n²·(SD_Y)² = n∑Y² − (∑Y)²

So, after the n² factors cancel, b_XY can also be defined as:

b_XY = [n∑XY − (∑X)(∑Y)] / [n∑Y² − (∑Y)²]          ..........(ii)

Also, COV(X, Y) = r·SD_X·SD_Y

Hence, b_XY = r·SD_X·SD_Y / (SD_Y)² = r·(SD_X / SD_Y)          ........(iii)

Now, in the case of deviations taken from the actual means:

Regression equation of X on Y: (X − X̄) = r·(SD_X / SD_Y)·(Y − Ȳ)

Where, X̄ and Ȳ are the actual means of the X and Y variables.

Also, if r is not known, the following substitution can be made:

r·(SD_X / SD_Y) = b_XY = ∑(X − X̄)(Y − Ȳ) / ∑(Y − Ȳ)²
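The mirrored X-on-Y computation can be sketched the same way (again a hypothetical helper with plain lists, not a standard API):

```python
# Regression of X on Y: slope b_XY = sum((X - Xbar)(Y - Ybar)) / sum((Y - Ybar)^2).
# The line passes through the point of means, giving intercept a = Xbar - b_XY * Ybar.
def regression_x_on_y(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b_xy = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((y - y_bar) ** 2 for y in ys))
    a = x_bar - b_xy * y_bar
    return a, b_xy

# Estimating X for a given Y with X = a + b_XY * Y:
a, b = regression_x_on_y([1, 2, 3, 4], [2, 3, 5, 6])
x_hat = a + b * 4.0   # estimate of X when Y = 4
```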

Relationship between Regression Coefficients and Correlation Coefficient :


From the equations above,

b_YX · b_XY = {COV(X, Y)}² / [(SD_X)²·(SD_Y)²] = r²

Hence, r can be defined in terms of the regression coefficients as:

r = √(b_XY · b_YX)

where r takes the same sign as the two regression coefficients (which always
agree in sign).
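A quick numerical check of this identity (a sketch; the sample data here are arbitrary):

```python
import math

# Deviation sums for a sample: S_xy, S_xx, S_yy about the means.
def _sums(xs, ys):
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xb) * (y - yb) for x, y in zip(xs, ys))
    sxx = sum((x - xb) ** 2 for x in xs)
    syy = sum((y - yb) ** 2 for y in ys)
    return sxy, sxx, syy

sxy, sxx, syy = _sums([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
b_yx = sxy / sxx                    # regression coefficient of Y on X
b_xy = sxy / syy                    # regression coefficient of X on Y
r = sxy / math.sqrt(sxx * syy)      # Karl Pearson's r
prod, r2 = b_yx * b_xy, r ** 2      # these two agree: b_YX * b_XY = r^2
```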

Illustration 7:

The following data give the experience of eight machine operators and their
performance ratings, measured as the number of good parts turned out per 100
pieces:

Operator : 1 2 3 4 5 6 7 8

Experience : 16 12 18 4 3 10 5 12

Performance Rating : 87 88 89 68 78 80 75 83

Calculate the regression lines of performance rating on experience as well as


experience on performance rating.

Solution : Let the experience be denoted by X and performance rating by Y.

We are required to get regression line of performance rating on experience, i.e.,


Y on X

And , experience on performance rating, i.e., X on Y


The arithmetic means calculated are: X̄ = 10 and Ȳ = 81

Calculating Regression Lines-


Experience  X − X̄  (X − X̄)²  Performance  Y − Ȳ  (Y − Ȳ)²  (X − X̄)(Y − Ȳ)
   (X)                        Rating (Y)
    16        6       36          87          6       36            36
    12        2        4          88          7       49            14
    18        8       64          89          8       64            64
     4       −6       36          68        −13      169            78
     3       −7       49          78         −3        9            21
    10        0        0          80         −1        1             0
     5       −5       25          75         −6       36            30
    12        2        4          83          2        4             4
  ∑ = 80      0      218       ∑ = 648        0      368           247

Regression eq. of Y on X: (Y − Ȳ) = [∑(X − X̄)(Y − Ȳ) / ∑(X − X̄)²]·(X − X̄)

(Y − 81) = (247/218)·(X − 10)

So we get, Y = 69.67 + 1.133X

Regression eq. of X on Y: (X − X̄) = [∑(X − X̄)(Y − Ȳ) / ∑(Y − Ȳ)²]·(Y − Ȳ)

(X − 10) = (247/368)·(Y − 81)

So we get, X = −44.37 + 0.671Y

Hence, the required regression line of performance rating on experience is :


Y =69.67+1.133 X

And that of experience on performance rating is : X =−44.37 +0.671Y
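Illustration 7 can be reproduced in a few lines of Python (a sketch; the rounding matches the two and three decimal places used above):

```python
experience = [16, 12, 18, 4, 3, 10, 5, 12]     # X
rating     = [87, 88, 89, 68, 78, 80, 75, 83]  # Y

n = len(experience)
x_bar = sum(experience) / n                        # 10
y_bar = sum(rating) / n                            # 81
sxy = sum((x - x_bar) * (y - y_bar)
          for x, y in zip(experience, rating))     # 247
sxx = sum((x - x_bar) ** 2 for x in experience)    # 218
syy = sum((y - y_bar) ** 2 for y in rating)        # 368

b_yx = sxy / sxx             # 247/218 ≈ 1.133  (Y on X)
a_yx = y_bar - b_yx * x_bar  # ≈ 69.67
b_xy = sxy / syy             # 247/368 ≈ 0.671  (X on Y)
a_xy = x_bar - b_xy * y_bar  # ≈ -44.37

print(f"Y = {a_yx:.2f} + {b_yx:.3f} X")  # Y = 69.67 + 1.133 X
print(f"X = {a_xy:.2f} + {b_xy:.3f} Y")  # X = -44.37 + 0.671 Y
```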


Data Sources:

1. Google Images
2. Statistical Methods – S.P. Gupta
3. Business Statistics – Yadav, Jain, Mittal
4. Classroom Notes – Mr. Swadheen Jain
