Correlation and Simple Linear Regression


Lecture note on Stat 173 Chapter 9: Correlation & Simple Linear Regression

CHAPTER 9
9. CORRELATION AND SIMPLE LINEAR REGRESSION
This chapter discusses the methods used to determine whether a relationship exists between two variables and to express that relationship numerically.
Example: we may need to know whether there is a relationship between:
• Income & Expenditure
• Fertilizer & Plant yield
• Age & Blood pressure
• Height & Weight
9.1. Correlation Analysis: a statistical technique that can be used to describe the degree to which one variable is linearly related to another variable.
I.e. Correlation is the degree (strength) of the linear relationship between two variables (say X & Y).
Two variables X & Y are said to be highly correlated if they have a strong linear relationship, whether they move in the same direction (X↑ with Y↑, X↓ with Y↓) or in opposite directions (X↑ with Y↓, or vice versa).
If higher (lower) values of one variable (say X) are accompanied by higher (lower) values of the other variable (say Y), then we say there exists a Positive (Direct) Correlation between X & Y, i.e. X↑ ⇒ Y↑ and X↓ ⇒ Y↓. Example: the greater the radius of a circle, the greater its circumference.

If higher (lower) values of one variable (say X) are accompanied by lower (higher) values of the other variable (say Y), then we say there exists a Negative (Indirect) Correlation between X & Y, i.e. X↑ ⇒ Y↓ or X↓ ⇒ Y↑. Example: Saving versus Expenditure.

A scatter plot can be used to see the direction and degree of correlation between X & Y.

[Figure: two scatter plots of Y against X. Left: Positive (Direct) Correlation between X & Y. Right: Negative (Indirect) Correlation between X & Y.]
[Figure: scatter plot of Y against X. No specific relationship (correlation) exists between X & Y.]

Note: If there is a perfect (exact) linear relationship between X & Y, then all the points fall exactly on a straight line.

[Figure: two scatter plots of Y against X with all points on a straight line. Left: Exact (Perfect) Positive Correlation, i.e. r = +1. Right: Exact (Perfect) Negative Correlation, i.e. r = -1.]
The simple correlation coefficient (Pearson's Correlation Coefficient) is a measure used to determine the degree of linear correlation between two variables.
$\rho$ = population correlation coefficient and $r$ = sample correlation coefficient ($r$ is used to estimate $\rho$).

$$r = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2\,\sum_{i=1}^{n}(Y_i-\bar{Y})^2}}$$

The short-cut formula:

$$r = \frac{n\sum XY-(\sum X)(\sum Y)}{\sqrt{[n\sum X^2-(\sum X)^2]\,[n\sum Y^2-(\sum Y)^2]}} \quad\text{or}\quad r = \frac{\sum XY-n\bar{X}\bar{Y}}{\sqrt{[\sum X^2-n\bar{X}^2]\,[\sum Y^2-n\bar{Y}^2]}}$$
Remark:
r always lies between -1 and 1 inclusive, and it is symmetric: the correlation of X with Y equals the correlation of Y with X.
Interpretation of r:
1. Perfect positive linear relationship (if r = 1)
2. Perfect negative linear relationship (if r = -1)
3. Some positive linear relationship (if r is between 0 and 1)
4. No linear relationship (if r = 0)
5. Some negative linear relationship (if r is between -1 and 0)
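The bounds on r stated in the remark can be justified with the Cauchy-Schwarz inequality. The following is a brief sketch, not a full proof; the shorthand $x_i = X_i - \bar{X}$ and $y_i = Y_i - \bar{Y}$ is introduced here only for illustration.

```latex
\left(\sum x_i y_i\right)^2 \le \left(\sum x_i^2\right)\left(\sum y_i^2\right)
\;\Longrightarrow\;
r^2 = \frac{\left(\sum x_i y_i\right)^2}{\sum x_i^2 \,\sum y_i^2} \le 1
\;\Longrightarrow\;
-1 \le r \le 1
```

Equality holds only when every $y_i$ is the same multiple of $x_i$, that is, when all points lie on one straight line. Symmetry follows because swapping the roles of X and Y leaves the formula for r unchanged.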

Example: Compute the value of Pearson's correlation coefficient based on the study of Age (X) and Blood pressure (Y) of a person.

Age = X   Blood Pressure = Y   XY       X²      Y²
43        128                  5504     1849    16384
48        120                  5760     2304    14400
56        135                  7560     3136    18225
61        143                  8723     3721    20449
67        141                  9447     4489    19881
70        152                  10640    4900    23104

$\sum X = 345$, $\sum Y = 819$, $\sum XY = 47{,}634$, $\sum X^2 = 20{,}399$, $\sum Y^2 = 112{,}443$ and n = 6.

$$r = \frac{n\sum XY-(\sum X)(\sum Y)}{\sqrt{[n\sum X^2-(\sum X)^2]\,[n\sum Y^2-(\sum Y)^2]}} = \frac{6(47{,}634)-(345)(819)}{\sqrt{[6(20{,}399)-(345)^2]\,[6(112{,}443)-(819)^2]}} = \frac{3249}{\sqrt{13{,}128{,}993}} \approx 0.897 \quad (\text{close to } +1)$$

Interpretation: There is a strong positive linear relationship between age & blood pressure.
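The hand calculation can be cross-checked with a few lines of Python. The sketch below uses only the standard library and the six (age, blood pressure) pairs from the example; the function names `pearson_definition` and `pearson_shortcut` are illustrative choices, not part of the lecture note.

```python
import math

# Age (X) and blood pressure (Y) from the worked example.
ages = [43, 48, 56, 61, 67, 70]
pressures = [128, 120, 135, 143, 141, 152]


def pearson_definition(xs, ys):
    """r from deviations about the means (the defining formula)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sp_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return sp_xy / math.sqrt(ss_x * ss_y)


def pearson_shortcut(xs, ys):
    """r from raw sums (the short-cut formula)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r_def = pearson_definition(ages, pressures)
r_short = pearson_shortcut(ages, pressures)
print(round(r_def, 3), round(r_short, 3))  # 0.897 0.897
```

Both formulas give the same value because the short-cut form is the defining formula with numerator and denominator multiplied by n.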

The above formula and procedure are applicable to quantitative data. When we have qualitative data that can only be ranked (efficiency, honesty, intelligence and others), we use Spearman's Rank Correlation Coefficient ($r_s$).
Rank Correlation Coefficient ($r_s$)
It is a measure of correlation based on the ranks of the observations and not on their actual magnitudes (values).
Steps: 1st: Rank the different values in X & Y.
2nd: Find the difference of the ranks in each pair, i.e. $D_i = X_i - Y_i$, where $X_i$ and $Y_i$ are the ranks.
3rd: Then use the formula:
$$r_s = 1 - \frac{6\sum D_i^2}{n(n^2-1)}$$
Interpretation of $r_s$ is similar to that of r.
Example: The following are rankings of seven football players by two coaches.
Find $r_s$ and comment on the opinion of the two coaches.

Player   Coach-1 = X   Coach-2 = Y   Di = Xi - Yi   Di²
A        4             4             0              0
B        1             2             -1             1
C        6             5             1              1
D        5             6             -1             1
E        3             1             2              4
F        2             3             -1             1
G        7             7             0              0

$\sum D_i^2 = 8$ and n = 7, so

$$r_s = 1 - \frac{6\sum D_i^2}{n(n^2-1)} = 1 - \frac{6(8)}{7(7^2-1)} \approx 0.857 \quad (\text{close to } +1)$$

Comment: the rankings given by the two coaches agree closely.
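The three steps can be sketched in Python as follows. This is a minimal illustration that assumes there are no tied ranks, which is the case the formula above covers; `ranks` and `spearman_from_ranks` are hypothetical helper names introduced here.

```python
def ranks(values):
    """Step 1: rank raw values, 1 = largest. Assumes all values are distinct."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]


def spearman_from_ranks(rank_x, rank_y):
    """Steps 2 and 3: r_s = 1 - 6*sum(D^2) / (n*(n^2 - 1)), no tied ranks."""
    if len(rank_x) != len(rank_y):
        raise ValueError("both rankings must cover the same items")
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Rankings of players A..G by the two coaches (already ranks, so step 1 is skipped).
coach_1 = [4, 1, 6, 5, 3, 2, 7]
coach_2 = [4, 2, 5, 6, 1, 3, 7]

r_s = spearman_from_ranks(coach_1, coach_2)
print(round(r_s, 3))  # 0.857
```

When the data arrive as raw scores rather than ranks, `ranks(scores)` can be applied to each list first; the direction of ranking does not matter as long as both lists are ranked the same way.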
Exercise: Aster & Almaze were asked to rank 7 different types of lipstick (namely A, B, C, D, E, F & G). Using the following information, check whether there is a correlation between the tastes of the two ladies.

Lipstick Type   A   B   C   D   E   F   G
Aster           2   1   4   3   5   7   6
Almaze          1   3   2   4   5   6   7

9.2. Simple Linear Regression (Regression of Y on X)


Regression is a relationship between a dependent (effect) variable and one or more given independent (cause) variables.
I.e. in regression the dependent variable (Y) is given as a function of the independent variable or variables (X).
Regression Analysis: a statistical technique that can be used to develop a mathematical equation showing how variables are related.
Simple Regression: a regression with one dependent variable (Y) and one independent variable (X).
Multiple Regression: a regression with one dependent variable (Y) and two or more independent variables (X).
Simple Linear Regression: a simple regression in which the relationship between the dependent variable (Y) and the independent variable (X) is linear.
Example: Study Hours (X = independent variable = Cause) versus Grade Obtained (Y = dependent variable = Effect)

Two variables X and Y are said to be linearly related if their relationship can be expressed by the simple linear model:
$$Y = \alpha + \beta X + \varepsilon$$
referred to as the Regression of Y on X.
Where: Y = dependent variable, X = independent variable, $\alpha$ = intercept of the regression line, $\beta$ = slope of the regression line, $\varepsilon$ = error (random disturbance term).

Here $\alpha$ (intercept) and $\beta$ (slope) are the parameters. They are also known as regression coefficients.
There are different methods of estimating the parameters $\alpha$ and $\beta$. The most commonly applied method is the Ordinary Least Squares (OLS) method.
The OLS method estimates the parameters by minimizing the sum of squares of the error terms, $\sum \varepsilon^2$.

The model $Y = \alpha + \beta X + \varepsilon$ is estimated by $\hat{Y} = a + bX$, where a and b are the OLS estimates of $\alpha$ and $\beta$, respectively.
Therefore, the regression line which minimizes $\sum \varepsilon^2 = \sum (Y - \hat{Y})^2$ is $\hat{Y} = a + bX$, where:

$$b = \frac{\sum (X_i-\bar{X})(Y_i-\bar{Y})}{\sum (X_i-\bar{X})^2} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} \quad\text{and}\quad a = \bar{Y} - b\bar{X}$$

Here a & b are the least squares estimates of the intercept and slope of the regression line of Y on X.
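For readers who want to see where the expressions for a and b come from, here is a short sketch of the standard least squares derivation. The symbol $S(a,b)$ for the sum of squared residuals is notation introduced here for illustration.

```latex
\begin{aligned}
S(a,b) &= \sum_{i=1}^{n} (Y_i - a - bX_i)^2 \\
\frac{\partial S}{\partial a} = -2\sum (Y_i - a - bX_i) = 0
  &\;\Longrightarrow\; \sum Y_i = na + b\sum X_i
  \;\Longrightarrow\; a = \bar{Y} - b\bar{X} \\
\frac{\partial S}{\partial b} = -2\sum X_i (Y_i - a - bX_i) = 0
  &\;\Longrightarrow\; \sum X_i Y_i = a\sum X_i + b\sum X_i^2 \\
\text{substituting } a = \bar{Y} - b\bar{X}:\quad
  \sum X_i Y_i - n\bar{X}\bar{Y} &= b\left(\sum X_i^2 - n\bar{X}^2\right)
  \;\Longrightarrow\;
  b = \frac{\sum X_i Y_i - n\bar{X}\bar{Y}}{\sum X_i^2 - n\bar{X}^2}
\end{aligned}
```

The two first-order conditions are usually called the normal equations; solving them gives the same a and b as the formulas above.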

Example: The following hypothetical data set shows the income and monthly food expenditure of households, in hundreds of birr.
a. Fit the least squares regression line to the given data.
b. Calculate the simple correlation coefficient (r).
c. Predict the food expenditure for an income of 800 birr (X = 8).

Income = X   Expenditure = Y   XY       X²       Y²
3.8          3.1               11.78    14.44    9.61
4.5          3.6               16.2     20.25    12.96
2.5          2.3               5.75     6.25     5.29
4.8          3.7               17.76    23.04    13.69
7.7          4.6               35.42    59.29    21.16
5.0          4.1               20.5     25       16.81
12.6         6.5               81.9     158.76   42.25
8.5          5.1               43.35    72.25    26.01
5.5          4                 22       30.25    16
7.1          4.1               29.11    50.41    16.81
3.5          3.2               11.2     12.25    10.24

$\sum X = 65.5$, $\sum Y = 44.3$, $\sum XY = 294.97$, $\sum X^2 = 472.19$, $\sum Y^2 = 190.83$
n = 11, $\bar{X} = 5.95$ and $\bar{Y} = 4.03$

a. $\hat{Y} = a + bX$

$$b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} = \frac{11(294.97) - (65.5)(44.3)}{11(472.19) - (65.5)^2} = \frac{343.02}{903.84} \approx 0.38$$

$$a = \bar{Y} - b\bar{X} = 4.03 - (0.38)(5.95) \approx 1.77$$

Therefore, $\hat{Y} = a + bX = 1.77 + 0.38X$.
Interpretation: when income (X) is zero, the estimated expenditure on food is 1.77 hundred birr (177 birr), and for every one-birr increase in income about 0.38 birr (38% of the increase) is spent on food.

[Figure: two scatter plots of Expenditure (Y) against Income (X).]

b. $$r = \frac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{[n\sum X^2 - (\sum X)^2]\,[n\sum Y^2 - (\sum Y)^2]}} = \frac{343.02}{\sqrt{(903.84)(136.64)}} \approx 0.976$$

c. For X = 8: $\hat{Y} = a + bX = 1.77 + 0.38(8) = 4.81$, i.e. a predicted food expenditure of about 481 birr.
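As a cross-check on parts a to c, the raw-sum formulas can be applied directly to the eleven (income, expenditure) pairs. The sketch below uses only the Python standard library; `fit_and_correlate` is a hypothetical helper name. Because it works from unrounded sums, its last digits can differ slightly from a hand calculation that rounds the means or the slope first.

```python
import math

# Income (X) and food expenditure (Y), both in hundreds of birr, from the table.
income = [3.8, 4.5, 2.5, 4.8, 7.7, 5.0, 12.6, 8.5, 5.5, 7.1, 3.5]
spending = [3.1, 3.6, 2.3, 3.7, 4.6, 4.1, 6.5, 5.1, 4.0, 4.1, 3.2]


def fit_and_correlate(xs, ys):
    """Return (a, b, r) for the line Y-hat = a + b*X using the raw-sum formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    num = n * sum_xy - sum_x * sum_y      # n*sum(XY) - sum(X)*sum(Y)
    den_x = n * sum_x2 - sum_x ** 2       # n*sum(X^2) - (sum X)^2
    den_y = n * sum_y2 - sum_y ** 2       # n*sum(Y^2) - (sum Y)^2

    b = num / den_x                       # slope
    a = sum_y / n - b * sum_x / n         # intercept: Y-bar - b * X-bar
    r = num / math.sqrt(den_x * den_y)    # Pearson's correlation coefficient
    return a, b, r


a, b, r = fit_and_correlate(income, spending)
predicted_at_8 = a + b * 8  # predicted expenditure when income is 800 birr

print(f"a = {a:.4f}, b = {b:.4f}, r = {r:.4f}, Y-hat(8) = {predicted_at_8:.4f}")
```

Computing a, b and r from the same three intermediate quantities keeps the three answers consistent with one another.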
Exercise: A car rental agency is interested in studying the relationship between the distance driven in kilometers (Y) and the maintenance cost of their cars (X, in birr). The following summarized information is given, based on a sample of size 5.

$$\sum_{i=1}^{5} X_i = 23{,}000,\quad \sum_{i=1}^{5} Y_i = 36,\quad \sum_{i=1}^{5} X_i Y_i = 212{,}000,\quad \sum_{i=1}^{5} X_i^2 = 147{,}000{,}000,\quad \sum_{i=1}^{5} Y_i^2 = 314$$

a) Find the least squares regression equation of Y on X.
b) Compute the correlation coefficient and interpret it.
c) Estimate the maintenance cost of a car which has been driven for 6 km.
Coefficient of Determination ($r^2$): a measure of the proportion of the total variation in Y that is explained by its relationship with X. That is, $r^2$ tells us how far the regression equation has been able to explain the variation in Y.
I.e. $r^2$ measures the goodness of fit of the regression line.
$0 \le r^2 \le 1$, where $r^2$ = proportion of explained variation and $1 - r^2$ = proportion of unexplained variation.

Interpretation: If $r^2$ is close to 1 ⇒ the relationship between X & Y is well explained by the regression equation (line) ⇒ good fit.
If $r^2$ is close to zero ⇒ the relationship between X & Y is not well explained by the regression equation (line) ⇒ not a good (bad) fit.
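Example: compute the Coefficient of Determination ($r^2$) for the above income and expenditure data.
From the column totals,
$$r = \frac{11(294.97) - (65.5)(44.3)}{\sqrt{[11(472.19) - (65.5)^2]\,[11(190.83) - (44.3)^2]}} = \frac{343.02}{\sqrt{(903.84)(136.64)}} \approx 0.976 \;\Longrightarrow\; r^2 \approx 0.953 \approx 0.95$$
Interpretation: about 95% of the total variation in Y is explained by the regression equation (line).
Since $r^2 \approx 0.95$ is close to one, the fit is a good one.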
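The reading of $r^2$ as "explained variation over total variation" can be checked numerically. The sketch below, using only built-in Python, fits the line to the income and expenditure data and computes $r^2$ two ways: as one minus the unexplained share of the variation in Y, and as the square of Pearson's r. The variable names are illustrative.

```python
# Income (X) and food expenditure (Y), both in hundreds of birr, from the example.
income = [3.8, 4.5, 2.5, 4.8, 7.7, 5.0, 12.6, 8.5, 5.5, 7.1, 3.5]
spending = [3.1, 3.6, 2.3, 3.7, 4.6, 4.1, 6.5, 5.1, 4.0, 4.1, 3.2]

n = len(income)
x_bar = sum(income) / n
y_bar = sum(spending) / n

# Sums of squares and cross-products about the means.
sp_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(income, spending))
ss_x = sum((x - x_bar) ** 2 for x in income)
ss_y = sum((y - y_bar) ** 2 for y in spending)   # total variation in Y

# Least squares line Y-hat = a + b*X.
b = sp_xy / ss_x
a = y_bar - b * x_bar

# Unexplained variation: squared residuals about the fitted line.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(income, spending))

r2_from_variation = 1 - sse / ss_y          # explained share of total variation
r2_from_r = sp_xy ** 2 / (ss_x * ss_y)      # square of Pearson's r

print(round(r2_from_variation, 3), round(r2_from_r, 3))
```

The two numbers agree up to floating-point rounding, which is why $r^2$ can be obtained simply by squaring r in simple linear regression.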

9.3. Covariance of X and Y

The covariance of X and Y is a measure of the joint variability of two variables (say X & Y).

$Cov(X, Y) = \sigma_{xy}$ = population covariance and $Cov(X, Y) = S_{xy}$ = sample covariance.

$$S_{xy} = \frac{\sum (X_i-\bar{X})(Y_i-\bar{Y})}{n-1} = \frac{\sum XY - n\bar{X}\bar{Y}}{n-1}$$

$S_{xy}$ also measures the association between X & Y in the same direction as r, but $S_{xy}$ is not standardized: $-1 \le r \le 1$, whereas $S_{xy}$ can take any value and depends on the units of X and Y.

If $S_{xy} > 0$ ⇒ positive correlation exists between X & Y. If $S_{xy} < 0$ ⇒ negative correlation exists between X & Y.
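To illustrate why the covariance is "not standardized", the sketch below computes $S_{xy}$ for the age and blood pressure data from Section 9.1, then repeats the calculation with age expressed in months. The unit change (years to months) is an assumption made purely for illustration, and the function names are hypothetical.

```python
import math

# Age in years (X) and blood pressure (Y) from the example in Section 9.1.
ages = [43, 48, 56, 61, 67, 70]
pressures = [128, 120, 135, 143, 141, 152]


def sample_covariance(xs, ys):
    """S_xy = sum((X - X_bar) * (Y - Y_bar)) / (n - 1)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def sample_covariance_shortcut(xs, ys):
    """S_xy = (sum(XY) - n * X_bar * Y_bar) / (n - 1)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return (sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar) / (n - 1)


def correlation(xs, ys):
    """r as covariance divided by the product of the sample standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))  # Cov(X, X) is the variance of X
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


cov_years = sample_covariance(ages, pressures)
ages_in_months = [12 * age for age in ages]      # same ages, different unit
cov_months = sample_covariance(ages_in_months, pressures)

print(cov_years, cov_months)                     # covariance scales with the unit of X
print(round(correlation(ages, pressures), 3),
      round(correlation(ages_in_months, pressures), 3))  # r is unchanged
```

The covariance grows by a factor of 12 when the unit of X changes, while r stays the same, so only r is comparable across data sets measured in different units.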

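Relationship between r, b, $S_x$, $S_y$ and $S_{xy}$

Let: $S_{xx}$ = variance of X = $\frac{1}{n-1}\sum (X_i-\bar{X})^2$ ⇒ $S_x$ = standard deviation of X, i.e. $S_x = \sqrt{S_{xx}}$

$S_{yy}$ = variance of Y = $\frac{1}{n-1}\sum (Y_i-\bar{Y})^2$ ⇒ $S_y$ = standard deviation of Y, i.e. $S_y = \sqrt{S_{yy}}$

Next, the relationship between the coefficients:

i. $r = \dfrac{S_{xy}}{S_x S_y}$    ii. $b = \dfrac{S_{xy}}{S_x^2}$

But from (i), $S_{xy} = r\,S_x S_y$, so

$$b = \frac{r\,S_x S_y}{S_x^2} = r\,\frac{S_y}{S_x} \;\Longrightarrow\; r = b\,\frac{S_x}{S_y}$$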

Example: The correlation coefficient between the variables X & Y is 0.6. Their covariance is 4.8 and the standard deviation of X is 2. Find the variance of Y.

Given: r = 0.6, $S_{xy}$ = 4.8, $S_x$ = 2

$$r = \frac{S_{xy}}{S_x S_y} \;\Longrightarrow\; S_y = \frac{S_{xy}}{r\,S_x} = \frac{4.8}{0.6 \times 2} = 4 \;\Longrightarrow\; Var(Y) = S_{yy} = (4)^2 = 16$$
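The same arithmetic can be written as a short Python check. It also computes the slope b from relation (ii), which the example does not ask for, as an illustration that $b = S_{xy}/S_x^2$ and $b = r\,S_y/S_x$ agree. Variable names are illustrative.

```python
import math

# Given values from the example.
r = 0.6      # correlation coefficient
s_xy = 4.8   # sample covariance of X and Y
s_x = 2.0    # standard deviation of X

s_y = s_xy / (r * s_x)   # rearranged from r = S_xy / (S_x * S_y)
var_y = s_y ** 2         # variance of Y
b = s_xy / s_x ** 2      # slope from relation (ii)

# The relation b = r * S_y / S_x should give the same slope.
assert math.isclose(b, r * s_y / s_x)

print(round(s_y, 6), round(var_y, 6), round(b, 6))  # 4.0 16.0 1.2
```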
