Unit 2 - (A) Correlation & Regression
Semester –I
Unit – II
Correlation and Regression
Introduction:
So far we have studied series in which each item takes a value of a single variable. For such
series, measures of central tendency and measures of dispersion are calculated for the purpose of
comparison and analysis, and with the help of these measures the data can be easily understood.
There can, however, also be series in which each item assumes the values of two or more
variables. For example, if the heights and weights of a group of persons are measured, we get a
series in which each member of the group assumes two values, one relating to height and the
other relating to weight. Such a distribution is known as a bivariate distribution.
Sometimes the values of the variables so obtained appear to be interrelated. Such a relationship
is likely in two series relating to the heights and weights of a group of persons: it may be
observed that weight increases with height, so that tall people are, on average, heavier than
short people. Similarly, if data are collected on the prices of a commodity and the quantities
sold at those prices, two series are obtained in which we are again likely to find some
relationship: as the price of the commodity increases, the quantity sold generally decreases. We
can thus conclude that there is some relationship between price and demand. Such relationships
can be found in many types of series, for example price and supply, heights and weights of
persons, price of sugar and sugarcane, ages of husbands and wives, etc. So we can say that “The
term correlation (or co-variation) indicates the relationship between two such variables in
which, with changes in the values of one variable, the values of the other variable also
change.” Thus correlation is a statistical tool for studying the relationship between two
variables. For correlation to be meaningful, the two phenomena should have a cause-effect
relationship. If such a relationship does not exist, one should not talk of correlation.
Types of correlation:
1) By direction of change (Positive and Negative)
Positive Correlation: While studying the relationship between any two related variables, if the
values of the variables deviate in the same direction, i.e. if one variable increases (or
decreases) and the corresponding value of the second variable also increases (or decreases), the
relationship is called a positive correlation. For example, height and weight of human beings,
demand and supply, and amount of rainfall and yield of crop have positive correlation.
Negative Correlation: While studying the relationship between any two related variables, if the
values of the variables deviate in opposite directions, i.e. if one variable increases (or
decreases) and the corresponding value of the second variable decreases (or increases), the
relationship is called a negative correlation. For example, price and demand of a commodity, and
temperature and sales of woollen clothes, have negative correlation.
If the dots plotted on a scatter diagram lie on a straight line rising from the lower left-hand
corner to the upper right-hand corner, the correlation is said to be perfect positive
correlation.
If the plotted dots lie on a straight line falling from the upper left-hand corner to the lower
right-hand corner, the correlation is said to be perfect negative correlation.
If the plotted dots fall in a narrow band showing a rising tendency from the lower left-hand
corner to the upper right-hand corner, there is a high degree of positive correlation. As the
band becomes wider, the degree of correlation becomes lower, and we call it low degree positive
correlation.
If the plotted dots fall in a narrow band showing a decreasing tendency from the upper left-hand
corner to the lower right-hand corner, there is a high degree of negative correlation. As the
band becomes wider, the degree of correlation becomes lower, and we call it low degree negative
correlation.
If the dots are widely scattered in a haphazard manner, this indicates no correlation between
the two study variables.
Karl Pearson’s Correlation coefficient: It measures the degree of linear correlation between two
variables x and y. It is denoted by r (or rxy). It can be written as
$$r = \frac{\operatorname{Cov}(x, y)}{\sigma_x \, \sigma_y}$$
Where, Cov(x, y) is the covariance of x and y, and σx and σy are the standard deviations of x
and y respectively.
If the correlation coefficient is close to -1, there is a strong negative relationship between
the two variables.
Formulas:
(a)For ungrouped bivariate data (without frequency)
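The formula for ungrouped data did not survive in these notes, so the following is only a minimal Python sketch assuming the usual computational form r = (N∑xy - ∑x∑y) / sqrt[(N∑x² - (∑x)²)(N∑y² - (∑y)²)]. The function name `pearson_r` and the five data pairs are hypothetical and chosen purely for illustration.

```python
import math


def pearson_r(x, y):
    """Karl Pearson's r for ungrouped paired data (computational form)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired and contain at least two points")
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical illustrative data (not taken from the notes).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 4))  # 0.7746
```

The function works directly from the five totals N, ∑x, ∑y, ∑x², ∑y² and ∑xy, which is the form in which several of the exercises supply their data.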
Rank Correlation:
In statistics, a rank correlation is any of several statistics that measure an ordinal association—the relationship
between rankings of different ordinal variables or different rankings of the same variable, where a "ranking" is
the assignment of the ordering labels "first", "second", "third", etc. to different observations of a particular
variable. A rank correlation coefficient measures the degree of similarity between two rankings, and can be used
to assess the significance of the relation between them.
If, for example, one variable is the identity of a college basketball program and another variable is the identity of
a college football program, one could test for a relationship between the poll rankings of the two types of
program: do colleges with a higher-ranked basketball program tend to have a higher-ranked football program? A
rank correlation coefficient can measure that relationship, and the measure of significance of the rank
correlation coefficient can show whether the measured relationship is small enough to likely be a coincidence.
If there is only one variable, the identity of a college football program, but it is subject to two different poll
rankings (say, one by coaches and one by sportswriters), then the similarity of the two different polls' rankings
can be measured with a rank correlation coefficient.
The Spearman correlation coefficient, rs, can take values from +1 to -1.
A rs of +1 indicates a perfect association of ranks, a rs of zero indicates no association between ranks and
a rs of -1 indicates a perfect negative association of ranks.
The closer rs is to zero, the weaker the association between the ranks.
To calculate a Spearman rank-order correlation on data without any ties we will use the following data:
Exam Marks
English 56 75 45 71 62 64 58 80 76 61
Maths 66 70 40 60 65 56 59 77 67 63
We then complete the following table:
English   Maths   Rank (English)   Rank (Maths)   d   d²
56        66      9                4              5   25
75        70      3                2              1   1
45        40      10               10             0   0
71        60      4                7              3   9
62        65      6                5              1   1
64        56      5                9              4   16
58        59      8                8              0   0
80        77      1                1              0   0
76        67      2                3              1   1
61        63      7                6              1   1
We then substitute this into the main equation with the other information as follows:
$$\rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 54}{10(10^2 - 1)} = 1 - \frac{324}{990} \approx 0.67$$
as n = 10 and ∑d² = 54. Hence, we have a ρ (or rs) of 0.67. This indicates a strong positive relationship between the ranks
individuals obtained in the maths and English exams. That is, the higher you ranked in maths, the higher you
ranked in English also, and vice versa.
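The worked example above can be reproduced with a short Python sketch. It assumes the no-ties formula ρ = 1 - 6∑d² / (n(n² - 1)) and ranks the highest mark as 1, as in the table; the helper names `ranks_descending` and `spearman_rho` are illustrative only.

```python
def ranks_descending(values):
    """Rank values so that the largest gets rank 1 (assumes no ties)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_rho(x, y):
    """Return (rho, sum of squared rank differences) for untied data."""
    n = len(x)
    rank_x = ranks_descending(x)
    rank_y = ranks_descending(y)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    rho = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
    return rho, sum_d2


english = [56, 75, 45, 71, 62, 64, 58, 80, 76, 61]
maths = [66, 70, 40, 60, 65, 56, 59, 77, 67, 63]
rho, sum_d2 = spearman_rho(english, maths)
print(sum_d2, round(rho, 2))  # 54 0.67
```

This sketch does not handle tied marks; tied observations would need averaged ranks and a correction to ∑d².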
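When ranks are tied, a correction term m(m² - 1)/12 is added to ∑d² for each group of m tied
observations:
$$\rho = 1 - \frac{6\left\{\sum d^2 + \frac{m(m^2 - 1)}{12} + \dots\right\}}{n(n^2 - 1)}$$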
Regression Analysis:
Regression Analysis: It is the mathematical measure of the average relationship between two
or more variables in terms of the original units of the data.
Dependent Variable (Regressed or Explained Variable): The variable whose value is to be
predicted.
Independent Variable (Regressor or Predictor or Explanatory Variable): The variable
which influences the values of the dependent variable or is used for prediction.
Simple Linear Regression: It is the technique for estimating the unknown value of the
dependent variable from the known value of the independent variable.
Regression Lines:
If we take the case of two variables X and Y, we shall have two regression lines: the regression
line of X on Y and the regression line of Y on X. The regression line of Y on X gives the most
probable values of Y for given values of X, and the regression line of X on Y gives the most
probable values of X for given values of Y. However, when there is either perfect positive or
perfect negative correlation between the two variables, the two regression lines coincide, i.e.
we have only one line. The farther apart the two regression lines are, the lower the degree of
correlation; the nearer they are to each other, the higher the degree of correlation. If the
variables are independent, the correlation coefficient (r) is zero and the lines of regression
are perpendicular to each other.
It should be noted that the regression lines cut each other at the point of the averages of X
and Y, i.e. if from the point where the two regression lines cut each other a perpendicular is
drawn to the X-axis, we get the mean value of X, and if from the same point a horizontal line is
drawn to the Y-axis, we get the mean value of Y.
Regression equations: The regression equations, also known as estimating equations, are
algebraic expressions of the regression lines. There are two regression equations: the
regression equation of X on Y is used to describe the variation in the values of X for given
changes in Y, and the regression equation of Y on X is used to describe the variation in the
values of Y for given changes in X.
Regression Equation of Y on X:
The regression equation of Y on X is expressed as follows:
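Y = a + bX
For example, with a = 500 and b = 100 the equation is y = 500 + 100X, which gives:
X = 0: y = 500
X = 1: y = 100 + 500 = 600
X = 2: y = 2*100 + 500 = 700
X = 3: y = 3*100 + 500 = 800
X = 4: y = 4*100 + 500 = 900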
It may be noted that in this equation ‘y’ is the dependent variable and ‘x’ is the independent
variable.
‘a’ is the Y-intercept and
‘b’ is the slope of the line; it represents the change in the Y variable for a unit change in
the X variable.
The values of the numerical constants ‘a’ and ‘b’ are obtained by fitting the line of best fit,
which is based on the principle of least squares. The principle of least squares is that we
minimize the sum of squares of the deviations, or errors of estimate. These deviations are the
differences between the observed values of the variable and the corresponding estimated values
given by the line of best fit.
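As a rough illustration of the principle of least squares, the Python sketch below computes ‘a’ and ‘b’ from the standard closed-form solution b = ∑(x - x̄)(y - ȳ) / ∑(x - x̄)² and a = ȳ - b·x̄. The function names are hypothetical, and the data points are assumed to be the values of y = 500 + 100X at X = 0, 1, 2, 3, 4.

```python
def fit_line(x, y):
    """Return (a, b) minimising the sum of squared errors of y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    return a, b


def sum_squared_errors(x, y, a, b):
    """Sum of squared deviations between observed y and estimated a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Assumed points lying exactly on y = 500 + 100x.
x = [0, 1, 2, 3, 4]
y = [500, 600, 700, 800, 900]
a, b = fit_line(x, y)
print(a, b)  # 500.0 100.0
print(sum_squared_errors(x, y, a, b))          # 0.0 for the fitted line
print(sum_squared_errors(x, y, a + 10, b))     # larger for any other line
```

Because the points lie exactly on a line, the fitted constants recover a = 500 and b = 100 and the sum of squared errors is zero; any other choice of ‘a’ or ‘b’ gives a larger sum.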
Thus the line of regression of Y on X is written as
$$Y - \bar{Y} = b_{yx}\,(X - \bar{X})$$
where byx is the regression coefficient of Y on X.
Line of Regression of X on Y:
X = c+ dY
Regression coefficient: It gives the rate of change of the dependent variable when the
independent variable changes by one unit. It is also called the slope of the line,
i.e. byx measures how many units variable y changes when x changes by one unit,
and bxy measures how many units variable x changes when y changes by one unit.
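Formulas:
(a) For ungrouped bivariate data (without frequency)
$$b_{yx} = \frac{\operatorname{Cov}(x, y)}{\sigma_x^2} \quad\text{and}\quad b_{xy} = \frac{\operatorname{Cov}(x, y)}{\sigma_y^2}$$
$$b_{yx} = \frac{N\sum xy - \sum x \sum y}{N\sum x^2 - (\sum x)^2} \quad\text{and}\quad b_{xy} = \frac{N\sum xy - \sum x \sum y}{N\sum y^2 - (\sum y)^2}$$
$$b_{yx} = r\,\frac{\sigma_y}{\sigma_x} \quad\text{and}\quad b_{xy} = r\,\frac{\sigma_x}{\sigma_y}$$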
X: 4, 5, 6, 7, 8, 9, 10 (mean of X = 7)
Y: 10, 20, 30, 40, 50, 60, 70 (mean of Y = 40)
2. When the two regression lines are perpendicular to each other, there is no correlation
between the two study variables, i.e. rxy = 0.
3. When the two regression lines coincide, there is perfect correlation between the two study
variables, i.e. rxy = ±1.
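The coinciding-lines case can be checked numerically on the sample series X: 4 to 10 and Y: 10 to 70 listed above. The Python sketch below is a minimal illustration, assuming the standard deviation-from-mean formulas byx = ∑(x - x̄)(y - ȳ) / ∑(x - x̄)² and bxy = ∑(x - x̄)(y - ȳ) / ∑(y - ȳ)², and the relation r = ± sqrt(byx · bxy) with the sign of the coefficients; the function name is hypothetical.

```python
import math


def regression_coefficients(x, y):
    """Return (mean_x, mean_y, b_yx, b_xy) for paired ungrouped data."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    return mean_x, mean_y, s_xy / s_xx, s_xy / s_yy


x = [4, 5, 6, 7, 8, 9, 10]
y = [10, 20, 30, 40, 50, 60, 70]
mean_x, mean_y, b_yx, b_xy = regression_coefficients(x, y)

# r takes the common sign of the two regression coefficients.
r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)

print(mean_x, mean_y)        # 7.0 40.0
print(b_yx, b_xy)            # 10.0 0.1
print(round(r, 4))           # 1.0

# Y on X: Y - 40 = 10 (X - 7)   ->  Y = -30 + 10 X
# X on Y: X - 7  = 0.1 (Y - 40) ->  X = 3 + 0.1 Y, which is the same line.
```

Here byx · bxy = 1, so r = 1 and both regression equations describe the single line Y = -30 + 10X passing through the point of means (7, 40).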
Y = a + bX + ε, where the error of estimate (residual) is e = Y - Ŷ
Coefficient of Determination
It is useful to measure the strength of the relationship. This is done by calculating the
coefficient of determination R². In other words, the coefficient of determination gives the
ratio of the explained variance to the total variance. The coefficient of determination is the
square of the coefficient of correlation, i.e. r². Thus,
Coefficient of determination = Explained variance / Total variance = r²
Remark: This is true for models with only one independent variable.
For example, suppose R² has a value of 0.6483. This means 64.83% of the variation in y is
explained by the regression model. The remaining 35.17% is unexplained, i.e. due to error.
In general, the higher the value of R², the better the model fits the data.
R² = 1: Perfect match between the line and the data points.
R² = 0: There is no linear relationship between x and y.
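The ratio of explained variance to total variance can be computed directly, as in the following minimal Python sketch. It assumes a simple linear regression of y on x fitted by least squares; the function name and the five data pairs are hypothetical and are not taken from the notes.

```python
def r_squared(x, y):
    """Explained variation / total variation for the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    fitted = [a + b * xi for xi in x]
    explained = sum((f - mean_y) ** 2 for f in fitted)   # variation explained by the line
    total = sum((yi - mean_y) ** 2 for yi in y)          # total variation in y
    return explained / total


# Hypothetical illustrative data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(r_squared(x, y), 4))  # 0.6
```

For this data r ≈ 0.7746, and squaring it gives the same 0.6, i.e. 60% of the variation in y is explained by the line and 40% is unexplained.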
Correlation Analysis Vs. Regression Analysis
Exercise
Correlation
1. The following data refers to advertisement expense and no. of units sold in last six months.
Ad. Expense (in ‘000 Rs.) 14 21 26 22 15 19
3. From the following data, find out the correlation coefficient between heights of fathers and sons.
Heights of fathers(inches) 65 66 67 67 68 69 70 72
Heights of sons(inches) 67 68 65 68 72 72 69 71
4. Compute Karl Pearson’s coefficient of correlation in the following series relating to cost of living
and wages.
Wages (Rs.) 100 101 102 100 99 98 97 98 96 95
Cost of living 98 99 99 97 95 92 95 94 90 91
5. A prognostic test in Mathematics was given to 10 students who were about to begin a course in
Statistics. The scores (X) in this test were examined in relation to the scores (Y) in the final
examination in Statistics. The following results were obtained:
∑x = 71, ∑y = 70, ∑x² = 555, ∑y² = 526, and ∑xy = 527.
Find the coefficient of correlation between x and y.
6. Calculate correlation coefficient from the following results:
N = 10, ∑(x - 14)² = 180, ∑(y - 15)² = 215, and ∑(x - 14)(y - 15) = 60.
8. If coefficient of correlation between X and Y is -0.92 then find coefficient of correlation between
(i) U = 2X + 6 and V = 3Y - 15. (ii) U = 2X + 6 and V = -3Y + 15.
(iii) U = -2X + 6 and V = -3Y + 15.
9. From the following data, compute the coefficient of correlation and interpret it.
                                                 x       y
No. of pairs of observations                     15      15
Arithmetic mean                                  25      18
Standard deviation                               3.01    3.03
Sum of squares of deviations from mean           136     138
Sum of product of deviations of x and y from their respective means = 122
10. The following table gives bivariate frequency distribution of age and marks of 100 students in a
test.
Regression
13. Given the following information:
Year                                   1999  2000  2001  2002  2003  2004
Research expense (in ‘000 Rs.) (X)     5     11    4     5     3     2
Annual profit (in ‘000 Rs.) (Y)        31    40    30    34    25    20
(i) Develop the estimating equation that best describes the given data. (Hint: regression
equation of Y on X.)
(ii) Estimate the annual profit when the research expense is Rs. 7,000.
(iii) How much variation in the annual profit (Y) is explained by the variation in the research
expenditure (X)? (Hint: coefficient of determination, r².)
14. From the following data on the ages of husbands and wives, form the two regression lines.
Calculate the husband’s age when the wife’s age is 16. Calculate the wife’s age when the
husband’s age is 25.
Husband’s age   36  23  27  28  28  29  30  31  33  35
Wife’s age      29  18  20  22  27  21  29  27  29  28
15. Given the following results for the height (x) and weight (y) in appropriate units of 1000
students.
Mean of X = 68, mean of y = 150, σx =2.5, σy =20, and r=0.6.
Obtain the equations of the two regression lines. Estimate the height of a student whose weight
is 200 units, and also estimate the weight of a student whose height is 60 units.
16. Find out the regression equation showing the regression of capacity utilization on
production from the following data.
                               Average   Standard deviation
Production (in lakh units)     35.6      10.5
Capacity utilization (in %)    84.8      8.5
r = 0.62
Estimate the production, when capacity utilization is 70%.
17. To know what relationship exists between unemployment and suicide attempts, a sociologist
surveyed twelve cities and obtained the following data.
City                                         1    2    3    4    5    6    7    8    9    10   11    12
Unemployment rate (percent)                  7.3  6.4  6.2  5.5  6.4  4.7  5.8  7.9  6.7  9.6  10.3  7.2
No. of suicide attempts per 1000 residents   22   17   9    8    12   5    7    19   13   29   33    18
(i) Develop the estimating equation that best describes the given data.
(ii) Estimate attempted suicide rate when unemployment rate happens to be 6%.
(iii) Calculate coefficient of determination and interpret it.
18. The equations of two regression lines between two variables are expressed as 2x - 3y = 0 and
4y - 5x - 8 = 0.
(i) Identify which of the two can be called regression of y on x and of x on y.
(ii) Find mean of x and mean of y.
(iii) Find coefficient of correlation between x and y.
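Working: Let 2x - 3y = 0 be the regression equation of x on y (X = c + dY). Then x = (3/2)y, so
bxy = 3/2 = 1.5.
Let 4y - 5x - 8 = 0 be the regression equation of y on x (Y = a + bX). Then y = (5/4)x + 2, so
byx = 5/4 = 1.25.
Under this assumption bxy · byx = 1.875, which is greater than 1. This is impossible because
r² = bxy · byx cannot exceed 1, so the assumption must be reversed.
Actual regression coefficients: byx = 2/3 = 0.6666 (from 2x - 3y = 0) and bxy = 4/5 = 0.8 (from
4y - 5x - 8 = 0). Then r = ± sqrt(bxy · byx), taking the sign of the regression coefficients, so
r = + sqrt(0.5333) ≈ 0.73.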
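The working for Exercise 18 can be checked with a short Python sketch. It assumes the rule that a valid pair of regression coefficients must satisfy bxy · byx ≤ 1, and that the means of x and y are the point where the two lines intersect; the dictionary labels and the helper `solve_2x2` are hypothetical names used only for this illustration.

```python
import math
from fractions import Fraction

# Slopes read off each line under the two possible assignments, as (b_xy, b_yx).
# Line 1: 2x - 3y = 0       ->  y = (2/3)x      or  x = (3/2)y
# Line 2: 4y - 5x - 8 = 0   ->  y = (5/4)x + 2  or  x = (4/5)y - 8/5
candidates = {
    "line 1 is x on y, line 2 is y on x": (Fraction(3, 2), Fraction(5, 4)),
    "line 1 is y on x, line 2 is x on y": (Fraction(4, 5), Fraction(2, 3)),
}

# Keep only the assignment whose product of coefficients does not exceed 1.
valid = {name: pair for name, pair in candidates.items() if pair[0] * pair[1] <= 1}
(name, (b_xy, b_yx)), = valid.items()
r = math.sqrt(b_xy * b_yx)  # positive, because both coefficients are positive


def solve_2x2(a1, b1, c1, a2, b2, c2):
    """Solve a1*x + b1*y = c1 and a2*x + b2*y = c2 by Cramer's rule."""
    det = Fraction(a1 * b2 - a2 * b1)
    return (c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det


# The lines intersect at (mean of x, mean of y): 2x - 3y = 0 and -5x + 4y = 8.
mean_x, mean_y = solve_2x2(2, -3, 0, -5, 4, 8)

print(name)              # line 1 is y on x, line 2 is x on y
print(round(r, 4))       # 0.7303
print(mean_x, mean_y)    # -24/7 -16/7
```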
19. Find the regression equation of x on y and the coefficient of correlation from the following data.
∑x = 60, ∑y = 40, ∑x² = 4160, ∑y² = 1720, ∑xy = 1150, and N = 10.
20. From the following data, find out the probable yield when the rainfall is 29”.
                      Rainfall   Yield
Mean                  25”        40 units per hectare
Standard deviation    3”         6 units per hectare
Correlation coefficient between rainfall and production = 0.8
21. The following are the two regression equations. Find the correlation coefficient and mean of the
variables. If s.d. of x is 1.2 then find variance of y.
8x - 10y + 61 = 0 and 40x -18 y – 2/4.
22. A student obtained the following two regression equations. Do you agree with him?
6x = 15y + 21 and 21x + 14y = 56
23. Calculate the lines of regression from the following data.
Sales revenue    Advertising expenditure
                 5-15   15-25   25-35   35-45
75-125           3      4       4       8
125-175          8      6       5       7
175-225          2      2       3       4
225-275          2      3       2       2
24. A Business Statistics student has taken a random sample of starting salaries and college
grade-point averages for some recently graduated friends of his, to check whether good grades in
college are important for earning a good salary. The data are as follows:
Starting salary ($ thousand)   36   30   30   24   27   33   21   27
Grade-point average            4.0  3.0  3.5  2.0  3.0  3.5  2.5  2.5
(i) Plot the scatter diagram and interpret it.
(ii) Develop the estimating equation that best describes these data.
(iii) Predict the starting salary for a student having grade point average 3.5.