Correlation and Simple Linear Regression
CHAPTER 9
9. CORRELATION AND SIMPLE LINEAR REGRESSION
This chapter discusses the methods used to determine whether a relationship exists between two variables and to express that relationship numerically.
Example: we may need to know whether there is a relationship between:
    Income & Expenditure        Fertilizer & Plant yield
    Age & Blood pressure        Height & Weight
9.1. Correlation Analysis: a statistical technique that can be used to describe the degree to which one variable is linearly related to another variable.
I.e. Correlation is the degree (strength) of the linear relationship between two variables (say X & Y).
Two variables X & Y are said to be highly correlated if they have a strong relationship, i.e. $X\uparrow \Rightarrow Y\uparrow$, $X\downarrow \Rightarrow Y\downarrow$, or $X\uparrow \Rightarrow Y\downarrow$, or vice versa.
If higher (lower) values of one variable (say X) are accompanied by higher (lower) values of the other variable (say Y), then we say there exists a Positive (Direct) Correlation between X & Y, i.e. $X\uparrow \Rightarrow Y\uparrow$ and $X\downarrow \Rightarrow Y\downarrow$. Example: the greater the radius of a circle, the greater its circumference.
If higher (lower) values of one variable (say X) are accompanied by lower (higher) values of the other variable (say Y), then we say there exists a Negative (Indirect) Correlation between X & Y, i.e. $X\uparrow \Rightarrow Y\downarrow$ or $X\downarrow \Rightarrow Y\uparrow$. Example: saving versus expenditure.
A scatter plot can be used to see the direction & degree of correlation between X & Y.
[Figure: scatter plots of Y against X. Panels: Positive (Direct) Correlation between X & Y; Negative (Indirect) Correlation between X & Y; a third scatter plot of Y against X whose caption did not survive.]
Note: If there is a perfect (exact) linear relationship between X & Y, then all the points lie exactly on a straight line.
[Figure: scatter plots of Y against X with all points on a line. Panels: Exact (Perfect) Positive Correlation, i.e. r = +1; Exact (Perfect) Negative Correlation, i.e. r = -1.]
The simple correlation coefficient (Pearson's Correlation Coefficient) is a measure used to determine the degree of linear correlation between two variables.
$\rho$ = population correlation coefficient and r = sample correlation coefficient (r is used to estimate $\rho$).
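$$r = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2\,\sum_{i=1}^{n}(Y_i-\bar{Y})^2}}$$
The short-cut formula:
$$r = \frac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{[\,n\sum X^2 - (\sum X)^2\,]\,[\,n\sum Y^2 - (\sum Y)^2\,]}}$$
or
$$r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{[\,\sum X^2 - n\bar{X}^2\,]\,[\,\sum Y^2 - n\bar{Y}^2\,]}}$$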
Remark:
r always lies between -1 and 1 inclusive, and it is symmetric (the correlation of X with Y equals the correlation of Y with X).
Interpretation of r:
1. Perfect positive linear relationship (if r = 1)
2. Perfect negative linear relationship (if r = -1)
3. Some positive linear relationship (if r is between 0 and 1)
4. No linear relationship (if r = 0)
5. Some negative linear relationship (if r is between -1 and 0)
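Example: Compute the value of Pearson's correlation coefficient based on a study of Age (X) and Blood pressure (Y) of a person.
    Age = X    Blood Pressure = Y    XY       X^2      Y^2
    43         128                   5504     1849     16384
    48         120                   5760     2304     14400
    56         135                   7560     3136     18225
    61         143                   8723     3721     20449
    67         141                   9447     4489     19881
    70         152                   10640    4900     23104
    Totals: $\sum X = 345$, $\sum Y = 819$, $\sum XY = 47{,}634$, $\sum X^2 = 20{,}399$, $\sum Y^2 = 112{,}443$
With n = 6:
$$r = \frac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{[\,n\sum X^2 - (\sum X)^2\,]\,[\,n\sum Y^2 - (\sum Y)^2\,]}} = \frac{6(47{,}634) - (345)(819)}{\sqrt{[\,6(20{,}399) - (345)^2\,]\,[\,6(112{,}443) - (819)^2\,]}} = \frac{3249}{\sqrt{(3369)(3897)}} \approx 0.897$$
Interpretation: There is a strong positive linear relationship between age & blood pressure.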
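The age and blood pressure computation can be checked with a short Python sketch of the short-cut formula. The function name `pearson_r` and the variable names are illustrative choices, not part of the original notes.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r computed with the short-cut (raw sums) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Data from the Age (X) and Blood pressure (Y) example.
age = [43, 48, 56, 61, 67, 70]
blood_pressure = [128, 120, 135, 143, 141, 152]

r = pearson_r(age, blood_pressure)
print(round(r, 3))  # 0.897
```

Swapping the two arguments gives the same value, which illustrates the symmetry noted in the remark.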
The above formula and procedure are applicable to quantitative data. When we have qualitative data (efficiency, honesty, intelligence and others), we use Spearman's Rank Correlation Coefficient ($r_s$).
Rank Correlation Coefficient ($r_s$): a measure of correlation based on the ranks of the observations and not on their actual magnitudes (values).
Steps: 1st: Rank the different values in X & Y.
       2nd: Find the difference of the ranks in each pair, i.e. $D_i = X_i - Y_i$, where $X_i$ and $Y_i$ are the ranks.
       3rd: Then use the formula:
$$r_s = 1 - \frac{6\sum D_i^2}{n(n^2-1)}$$
Interpretation of $r_s$ is similar to that of r.
Example: The following are rankings of seven football players by two coaches. Find $r_s$ and comment on the opinion of the two coaches.
    Player    Coach-1 = X    Coach-2 = Y    Di = Xi - Yi    Di^2
    A         4              4               0              0
    B         1              2              -1              1
    C         6              5               1              1
    D         5              6              -1              1
    E         3              1               2              4
    F         2              3              -1              1
    G         7              7               0              0
    Total: $\sum D_i^2 = 8$
With n = 7:
$$r_s = 1 - \frac{6\sum D_i^2}{n(n^2-1)} = 1 - \frac{6(8)}{7(7^2-1)} = 0.857$$
Since $r_s$ is close to 1, the rankings of the two coaches agree strongly.
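The three steps above can be sketched in Python for the coaches example. The function name `spearman_rs` is an illustrative choice, and the sketch assumes the inputs are already ranks with no ties, as in this example.

```python
def spearman_rs(rank_x, rank_y):
    """Spearman's rank correlation from two lists of ranks (assumes no ties)."""
    n = len(rank_x)
    # Step 2: rank differences D_i = X_i - Y_i, squared and summed.
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    # Step 3: r_s = 1 - 6 * sum(D_i^2) / (n * (n^2 - 1)).
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Ranks given to players A..G by the two coaches.
coach_1 = [4, 1, 6, 5, 3, 2, 7]
coach_2 = [4, 2, 5, 6, 1, 3, 7]

rs = spearman_rs(coach_1, coach_2)
print(round(rs, 3))  # 0.857
```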
Exercise: Aster & Almaze were asked to rank 7 different types of lipsticks (namely: A, B, C, D, E, F & G). Using the following information, check whether there is a correlation between the tastes of the two ladies.
    Lipstick Type    A    B    C    D    E    F    G
    Aster            2    1    4    3    5    7    6
    Almaze           1    3    2    4    5    6    7
Two variables X and Y are said to be linearly related if their relationship can be expressed by the simple linear model:
$$Y = \alpha + \beta X + \varepsilon$$
referred to as the regression of Y on X.
Where: Y = dependent variable, X = independent variable, $\alpha$ = intercept of the regression line, $\beta$ = slope of the regression line, $\varepsilon$ = error (random disturbance term).
Here $\alpha$ (intercept) and $\beta$ (slope) are the parameters. They are also known as regression coefficients.
There are different methods of estimating the parameters $\alpha$ and $\beta$. The most commonly applied method is the Ordinary Least Squares (OLS) method.
The OLS method estimates the parameters by minimizing the sum of squares of the error terms, $\sum \varepsilon_i^2$.
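$$b = \frac{\sum (X_i-\bar{X})(Y_i-\bar{Y})}{\sum (X_i-\bar{X})^2} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{n\sum XY - \sum X\sum Y}{n\sum X^2 - (\sum X)^2} \quad\text{and}\quad a = \bar{Y} - b\bar{X}$$
Where a & b are the least squares estimates of $\alpha$ and $\beta$ for the regression line of Y on X.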
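For readers who want to see where the least squares estimates come from, here is a brief sketch of the standard derivation. The symbol $S(a,b)$ for the sum of squared errors is introduced here for convenience and does not appear in the original notes.

```latex
% Sum of squared errors as a function of the candidate intercept a and slope b
S(a,b) = \sum_{i=1}^{n} (Y_i - a - bX_i)^2

% Setting both partial derivatives to zero gives the normal equations
\frac{\partial S}{\partial a} = -2\sum (Y_i - a - bX_i) = 0
  \;\Rightarrow\; \sum Y = na + b\sum X
\frac{\partial S}{\partial b} = -2\sum X_i (Y_i - a - bX_i) = 0
  \;\Rightarrow\; \sum XY = a\sum X + b\sum X^2

% The first equation gives the intercept; substituting it into the second gives the slope
a = \bar{Y} - b\bar{X}, \qquad
b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}
```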
Example: The following hypothetical data set shows the income and monthly food expenditure of households, in hundreds of birr. Then,
a. Fit the least squares regression line to the given data.
b. Calculate the simple correlation coefficient (r).
c. Predict the food expenditure for an income of 800 birr (X = 8).
    Income = X    Expenditure = Y    XY       X^2       Y^2
    3.8           3.1                11.78    14.44     9.61
    4.5           3.6                16.2     20.25     12.96
    2.5           2.3                5.75     6.25      5.29
    4.8           3.7                17.76    23.04     13.69
    7.7           4.6                35.42    59.29     21.16
    5.0           4.1                20.5     25        16.81
    12.6          6.5                81.9     158.76    42.25
    8.5           5.1                43.35    72.25     26.01
    5.5           4                  22       30.25     16
    7.1           4.1                29.11    50.41     16.81
    3.5           3.2                11.2     12.25     10.24
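a. The fitted line has the form $\hat{Y} = a + bX$, where
$$b = \frac{n\sum XY - \sum X\sum Y}{n\sum X^2 - (\sum X)^2} = \frac{11(294.97) - (65.45)(44.43)}{11(472.19) - (65.45)^2} \approx 0.38 \quad\text{and}\quad a = \bar{Y} - b\bar{X}$$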
[Figure: two plots of Expenditure (Y) against Income (X).]
b. $$r = \frac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{[\,n\sum X^2 - (\sum X)^2\,]\,[\,n\sum Y^2 - (\sum Y)^2\,]}} = 0.9829$$
c. For X = 8: $\hat{Y} = a + bX = 1.79 + 0.38X = 1.79 + 0.38(8) = 4.83$
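Parts (a) and (c) of the food expenditure example can be reproduced with a minimal Python sketch that applies the least squares formulas to the eleven tabulated pairs. The function name `fit_line` is an illustrative choice. Because the sketch works from the table values directly and keeps full precision, its results can differ in the second decimal place from the rounded figures quoted in the worked solution.

```python
# Income (X) and food expenditure (Y), in hundreds of birr, from the table.
income = [3.8, 4.5, 2.5, 4.8, 7.7, 5.0, 12.6, 8.5, 5.5, 7.1, 3.5]
expenditure = [3.1, 3.6, 2.3, 3.7, 4.6, 4.1, 6.5, 5.1, 4.0, 4.1, 3.2]

def fit_line(xs, ys):
    """Return the least squares intercept a and slope b of Y on X."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # b = (n*sum(XY) - sum(X)*sum(Y)) / (n*sum(X^2) - (sum(X))^2)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # a = Y-bar - b * X-bar
    a = sum_y / n - b * sum_x / n
    return a, b

a, b = fit_line(income, expenditure)
predicted = a + b * 8  # predicted expenditure at an income of 800 birr (X = 8)
print(f"Y-hat = {a:.2f} + {b:.2f} X; prediction at X = 8: {predicted:.2f}")
```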
Exercise: A car rental agency is interested in studying the relationship between the distance driven in kilometers (Y) and the maintenance cost of its cars (X, in birr). The following summarized information is given, based on a sample of size 5:
$$\sum_{i=1}^{5} X_i = 23{,}000, \quad \sum_{i=1}^{5} Y_i = 36, \quad \sum_{i=1}^{5} X_iY_i = 212{,}000, \quad \sum_{i=1}^{5} X_i^2 = 147{,}000{,}000, \quad \sum_{i=1}^{5} Y_i^2 = 314$$
regression
equation (line) Not good (bad) fit.
Example: Compute the coefficient of determination ($r^2$) for the above data.
We already calculated r = 0.9829, so $r^2 = (0.9829)^2 \approx 0.966 \approx 0.97$.
Interpretation: 97% of the total variation in Y is explained by the regression equation (line).
Since $r^2 = 0.97$ is close to one, the fit is a good one.
Covariance of X and Y is a measure of the joint variability of two variables (say X & Y).
$$S_{XY} = \frac{\sum (X_i-\bar{X})(Y_i-\bar{Y})}{n-1} = \frac{\sum XY - n\bar{X}\bar{Y}}{n-1}$$
$S_{XY}$ also measures the correlation (degree of association) between X & Y in the same way as r, but $S_{XY}$ is not standardized, i.e. $-1 \le r \le 1$, but $S_{XY}$ can take any value.
If $S_{XY} > 0$, positive correlation exists between X & Y. If $S_{XY} < 0$, negative correlation exists between X & Y.
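$S_{YY}$ = Variance of Y = $\frac{1}{n-1}\sum (Y_i-\bar{Y})^2$, and $S_Y$ = standard deviation of Y, i.e. $S_Y = \sqrt{S_{YY}}$.
i. $r = \dfrac{S_{XY}}{S_X S_Y}$        ii. $b = \dfrac{S_{XY}}{S_X^2}$
But from (i), $S_{XY} = r\,S_X S_Y$, so
$$b = \frac{r\,S_X S_Y}{S_X^2} = r\,\frac{S_Y}{S_X} \quad\Rightarrow\quad r = b\,\frac{S_X}{S_Y}$$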
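Example: The correlation coefficient between the variables X & Y is 0.6. Their covariance is 4.8 and the standard deviation of X is 2. Find the variance of Y.
$$r = \frac{S_{XY}}{S_X S_Y} \;\Rightarrow\; S_Y = \frac{S_{XY}}{r\,S_X} = \frac{4.8}{0.6 \times 2} = 4 \;\Rightarrow\; \mathrm{Var}(Y) = S_{YY} = (4)^2 = 16$$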
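The relations between covariance, r and b can be checked numerically. The sketch below reuses the age and blood pressure data from the Pearson example and then repeats the variance calculation just shown. The helper name `sample_cov` is an illustrative choice, not something defined in the notes.

```python
from math import sqrt

def sample_cov(xs, ys):
    """Sample covariance S_XY with the n - 1 divisor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

age = [43, 48, 56, 61, 67, 70]
blood_pressure = [128, 120, 135, 143, 141, 152]

s_xy = sample_cov(age, blood_pressure)
s_x = sqrt(sample_cov(age, age))                        # S_X = sqrt(S_XX)
s_y = sqrt(sample_cov(blood_pressure, blood_pressure))  # S_Y = sqrt(S_YY)

r = s_xy / (s_x * s_y)   # relation (i)
b = s_xy / s_x ** 2      # relation (ii)
assert abs(b - r * s_y / s_x) < 1e-12  # b = r * S_Y / S_X

# Worked example: r = 0.6, S_XY = 4.8, S_X = 2.
s_y_example = 4.8 / (0.6 * 2)
var_y = s_y_example ** 2
print(round(var_y, 6))  # 16.0
```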