Session 19


Regression

Regression Analysis
• Having determined the correlation between X and Y, we
wish to determine a mathematical relationship between
them.
• Dependent variable: the variable you wish to explain
• Independent variables: the variables used to explain the
dependent variable
• Regression analysis is used to:
▪ Predict the value of a dependent variable based on
the value of at least one independent variable
▪ Explain the impact of changes in an independent
variable on the dependent variable
Types of Relationships

[Scatter-plot panels illustrating: linear relationships, curvilinear relationships, strong relationships, weak relationships, and no relationship between X and Y.]
Simple Linear Regression Analysis
• The simplest mathematical relationship is
• Y = a + bX + error (linear)
• Changes in Y are related to the changes in X
• What are the most suitable values of a (intercept) and b (slope)?

[Figure: the line y = a + b·x, showing the intercept a and the slope b (the change in y per unit change in x).]
Method of Least Squares
[Figure: for an observed point (xi, yi), the ERROR is the vertical distance between yi and the fitted value a + b·xi.]
The best-fitted line is the one for which the errors, taken together, are as small as possible.
Least Squares Procedure
• We want to fit a line for which the errors are as small as possible.
• That is, we want the values of a and b in Y = a + bX + error that make the errors smallest.
• To minimize all the errors together, we minimize the sum of squared errors (SSE):
$$\mathrm{SSE} = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$
• To get the values of a and b which minimize SSE, we proceed as follows:
$$\frac{\partial\,\mathrm{SSE}}{\partial a} = 0 \;\Rightarrow\; -2\sum_{i=1}^{n}(y_i - a - b x_i) = 0 \;\Rightarrow\; \sum_{i=1}^{n} y_i = n a + b \sum_{i=1}^{n} x_i \qquad (1)$$
$$\frac{\partial\,\mathrm{SSE}}{\partial b} = 0 \;\Rightarrow\; -2\sum_{i=1}^{n}(y_i - a - b x_i)\, x_i = 0 \;\Rightarrow\; \sum_{i=1}^{n} y_i x_i = a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^2 \qquad (2)$$
• Eq. (1) and (2) are called the normal equations.
• Solving these normal equations, we get
$$b = \frac{n\sum_{i=1}^{n} y_i x_i - \left(\sum_{i=1}^{n} y_i\right)\left(\sum_{i=1}^{n} x_i\right)}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} = \frac{\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\mathrm{SSXY}}{\mathrm{SSX}},$$
$$a = \bar{y} - b\,\bar{x}.$$
• The values of a and b obtained using the least squares method are called the least squares estimates (LSE) of a and b.
• Thus, the LSE of a and b are given by
$$\hat{a} = \bar{y} - \hat{b}\,\bar{x}, \qquad \hat{b} = \frac{\mathrm{SSXY}}{\mathrm{SSX}}.$$
• Also, the correlation coefficient between X and Y is
$$r_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} = \frac{\mathrm{SSXY}}{\sqrt{\mathrm{SSX}\cdot\mathrm{SSY}}} = \frac{\mathrm{SSXY}}{\mathrm{SSX}}\sqrt{\frac{\mathrm{SSX}}{\mathrm{SSY}}} = \hat{b}\sqrt{\frac{\mathrm{SSX}}{\mathrm{SSY}}}.$$
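As a minimal sketch (not part of the slides), these estimates can be computed in Python either by solving the normal equations (1) and (2) numerically or from the closed-form SSXY/SSX formula above; the function and variable names are illustrative, not from the lecture.

```python
# Minimal sketch: least-squares estimates via the normal equations and via
# the closed-form SSXY/SSX formula. Names are illustrative.
import math
import numpy as np

def least_squares(x, y):
    """Return (a_hat, b_hat, r) for the simple linear model y = a + b*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xx = sum(xi * xi for xi in x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))

    # Normal equations (1)-(2):  [ n      sum_x  ] [a]   [ sum_y  ]
    #                            [ sum_x  sum_xx ] [b] = [ sum_xy ]
    A = np.array([[n, sum_x], [sum_x, sum_xx]], dtype=float)
    rhs = np.array([sum_y, sum_xy], dtype=float)
    a_hat, b_hat = np.linalg.solve(A, rhs)

    # Closed form: b_hat = SSXY / SSX,  a_hat = y_bar - b_hat * x_bar
    x_bar, y_bar = sum_x / n, sum_y / n
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    ssy = sum((yi - y_bar) ** 2 for yi in y)
    ssxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    assert math.isclose(b_hat, ssxy / ssx, rel_tol=1e-6)
    assert math.isclose(a_hat, y_bar - b_hat * x_bar, rel_tol=1e-6)

    r = ssxy / math.sqrt(ssx * ssy)   # correlation coefficient r_XY
    return a_hat, b_hat, r
```

Called on the example data that follows, this gives â ≈ 189.91, b̂ ≈ −51.12, and r ≈ −0.957.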
Example
  x        y      x - x̄    y - ȳ    (x - x̄)²     (y - ȳ)²     (x - x̄)(y - ȳ)
 1.25     125     -0.90      45      0.8100        2025          -40.50
 1.75     105     -0.40      25      0.1600         625          -10.00
 2.25      65      0.10     -15      0.0100         225           -1.50
 2.00      85     -0.15       5      0.0225          25           -0.75
 2.50      75      0.35      -5      0.1225          25           -1.75
 2.25      80      0.10       0      0.0100           0            0.00
 2.70      50      0.55     -30      0.3025         900          -16.50
 2.50      55      0.35     -25      0.1225         625           -8.75
17.20     640      0          0      1.560 (SSX)   4450 (SSY)    -79.75 (SSXY)

x̄ = 2.15,  ȳ = 80

$$r = \frac{\mathrm{SSXY}}{\sqrt{\mathrm{SSX}\cdot\mathrm{SSY}}} = -0.957$$

$$\hat{b} = \frac{\mathrm{SSXY}}{\mathrm{SSX}} = -51.12, \qquad \hat{a} = \bar{y} - \hat{b}\,\bar{x} = 189.91$$

Fitted Line is ŷ = 189.91 − 51.12 x

[Figure: scatter plot of the eight data points with the fitted line ŷ = 189.91 − 51.12 x.]
Fitted Line is ŷ = 189.91 − 51.12 x
◙ We can predict the value of y for a given value of x.
◙ 189.91 is the predicted value of y when the value of x is zero.
◙ −51.12 is the change in the predicted value of y as a result of a one-unit change in x.
◙ For x = 2.15 (say), the predicted value of y is 189.91 − 51.12 × 2.15 = 80.002.
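As a quick cross-check (my own sketch, not part of the slides), numpy.polyfit reproduces this fitted line and the prediction at x = 2.15 from the eight observations of the example:

```python
# Sketch: a degree-1 polyfit returns [slope, intercept] for the example data.
import numpy as np

x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
y = np.array([125, 105, 65, 85, 75, 80, 50, 55], dtype=float)

b_hat, a_hat = np.polyfit(x, y, 1)
print(round(a_hat, 2), round(b_hat, 2))   # 189.91 and -51.12

x_new = 2.15                              # x_bar, so the prediction is y_bar
print(round(a_hat + b_hat * x_new, 2))    # ~ 80.0
```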
The Residuals and their analysis

[Figure: for an observed point (xi, yi), the residual is the vertical distance between yi and the fitted value ŷi.]

◙ Regression is an attempt to explain the value of the dependent variable Y in terms of the independent variable X.
◙ The residual is the unexplained part of Y:
  ◘ the variation in Y that is not explained by X.
◙ The smaller the residuals, the better the utility of the regression.
◙ The sum of the residuals is always zero; the least squares procedure ensures that (illustrated in the sketch below).
◙ Residuals play an important role in investigating the adequacy of the fitted model.
◙ We obtain the coefficient of determination (R²) using the residuals.
◙ R² is used to examine the adequacy of the fitted linear model to the given data.
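A small sketch (again mine, using the same example data) showing that the residuals from the least-squares line sum to zero, and so do the residuals weighted by x, which is exactly what normal equations (1) and (2) require:

```python
# Sketch: residuals from the least-squares line sum to (numerically) zero.
import numpy as np

x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
y = np.array([125, 105, 65, 85, 75, 80, 50, 55], dtype=float)
b_hat, a_hat = np.polyfit(x, y, 1)

residuals = y - (a_hat + b_hat * x)
print(residuals.sum())        # ~ 0, from normal equation (1)
print((x * residuals).sum())  # ~ 0, from normal equation (2)
```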
Coefficient of Determination

[Figure: decomposition of the total deviation (y − ȳ) into the explained part (ŷ − ȳ) and the residual (y − ŷ).]

◙ Total Sum of Squares: $$\mathrm{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
◙ Residual (Error) Sum of Squares: $$\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
◙ Regression Sum of Squares: $$\mathrm{SSR} = \mathrm{SST} - \mathrm{SSE} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$

◙ What fraction of the total sum of squares is explained by Regression?

◙ Coefficient of Determination: R² = SSR/SST = 1 − (SSE/SST)


• The fraction of SST explained by Regression is given by R²
• R² = SSR/SST = 1 − (SSE/SST)
• Clearly, 0 ≤ R² ≤ 1
• R² close to 1 means that regression explains most of the variability in Y. (Fit is good)
• R² close to 0 means that regression does not explain much variability in Y. (Fit is not good)
• R² is the square of the correlation coefficient between X and Y. (proof omitted)
◙ R² = 1 (r = 1 or r = −1): perfect linear relationship; 100% of the variation in Y is explained by X.
◙ 0 < R² < 1: weak linear relationship; some but not all of the variation in Y is explained by X.
◙ R² = 0: no linear relationship; none of the variation in Y is explained by X.
  X        Y       Ŷ      Y - Ȳ    Y - Ŷ    Ŷ - Ȳ    (Y - Ȳ)²   (Y - Ŷ)²   (Ŷ - Ȳ)²
 1.25     125    126.0      45     -1.0      46.0      2025       1.00      2116.00
 1.75     105    100.5      25      4.5      20.5       625      20.25       420.25
 2.25      65     74.9     -15     -9.9      -5.1       225      98.00        26.01
 2.00      85     87.7       5     -2.2       7.7        25       4.84        59.29
 2.50      75     62.1      -5     12.9     -17.7        25     166.41       313.29
 2.25      80     74.9       0      5.1      -5.1         0      26.01        26.01
 2.70      50     51.9     -30     -1.9     -28.1       900       3.61       789.61
 2.50      55     62.1     -25     -7.1     -17.9       625      50.41       320.41
17.20     640                                          4450     370.54      4079.46

• Coefficient of Determination: R² = (4450 − 370.5)/4450 = 0.916

• Correlation Coefficient: r = −0.957
• Coefficient of Determination = (Correlation Coefficient)²
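A brief sketch (mine, not from the slides) computing SST, SSE, SSR, and R² for this example directly from the definitions, and confirming that R² equals the square of the correlation coefficient:

```python
# Sketch: R^2 = SSR/SST = 1 - SSE/SST for the example, compared with r^2.
import numpy as np

x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
y = np.array([125, 105, 65, 85, 75, 80, 50, 55], dtype=float)
b_hat, a_hat = np.polyfit(x, y, 1)
y_hat = a_hat + b_hat * x

sst = ((y - y.mean()) ** 2).sum()   # total sum of squares (= 4450)
sse = ((y - y_hat) ** 2).sum()      # residual sum of squares (~ 370.5)
ssr = sst - sse                     # regression sum of squares

r2 = ssr / sst
print(round(r2, 3))                 # ~ 0.916

r = np.corrcoef(x, y)[0, 1]
print(round(r ** 2, 3))             # same value: R^2 = r^2
```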
Example:
• Watching television reduces the amount of physical exercise, causing weight gain.
• A sample of fifteen 10-year-old children was taken.
• The number of pounds each child was overweight was recorded
  (a negative number indicates the child is underweight).
• Additionally, the number of hours of television viewing per week
  was also recorded. These data are listed here.
TV 42 34 25 35 37 38 31 33 19 29 38 28 29 36 18
Overweight 18 6 0 -1 13 14 7 7 -9 8 8 5 3 14 -7
• Calculate the sample regression line and describe what the
coefficients tell you about the relationship between the two
variables.
▪ Y = −24.709 + 0.967 X and R² = 0.768
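One way to verify this answer (a sketch of mine, not part of the example) is scipy.stats.linregress, which returns the slope, intercept, and correlation for the TV data:

```python
# Sketch: verifying the stated regression line and R^2 for the TV data.
from scipy.stats import linregress

tv = [42, 34, 25, 35, 37, 38, 31, 33, 19, 29, 38, 28, 29, 36, 18]
overweight = [18, 6, 0, -1, 13, 14, 7, 7, -9, 8, 8, 5, 3, 14, -7]

res = linregress(tv, overweight)
print(round(res.intercept, 3), round(res.slope, 3))  # ~ -24.709 and 0.967
print(round(res.rvalue ** 2, 3))                     # R^2 ~ 0.768
```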
Exercise
A random sample of 14 elementary school students is selected, and
each student is measured on a creativity score (X) using a well-defined
testing instrument and on a task score (Y ) using a new instrument. The
task score is the mean time taken to perform several hand-eye
coordination tasks. The data are:
X: 28 35 37 50 69 84 40 65 29 42 51 45 31 40
Y: 4.5 3.9 3.9 6.1 4.3 8.8 2.1 5.5 5.7 3 7.1 7.3 3.3 5.2

Also given that ΣX = 646; ΣY = 70.7; ΣX² = 33312; ΣY² = 401.59; ΣXY = 3481.4.

(a) Fit a regression line for this data using the method of least squares.

(b) Obtain the correlation coefficient between X and Y .

(c) Obtain the value of the coefficient of determination and discuss the admissibility of the linear model.
Solution:
Line: Y = a + bX

$$\hat{b} = \frac{\mathrm{SSXY}}{\mathrm{SSX}} = 0.0625, \qquad \hat{a} = \bar{y} - \hat{b}\,\bar{x} = 5.0500 - 0.0625 \times 46.1429 = 2.1661$$

Fitted line: Y = 2.1661 + 0.0625 X

$$r = \frac{\mathrm{SSXY}}{\sqrt{\mathrm{SSX}\cdot\mathrm{SSY}}} = 0.5545$$

R² = r² = 0.3075
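A final sketch (mine) showing how this solution follows from the given summary statistics alone, using SSX = ΣX² − (ΣX)²/n and the analogous identities for SSY and SSXY:

```python
# Sketch: the exercise solved from the summary statistics given above.
import math

n = 14
sum_x, sum_y = 646, 70.7
sum_xx, sum_yy, sum_xy = 33312, 401.59, 3481.4

ssx = sum_xx - sum_x ** 2 / n          # ~ 3503.71
ssy = sum_yy - sum_y ** 2 / n          # ~ 44.56
ssxy = sum_xy - sum_x * sum_y / n      # ~ 219.10

b_hat = ssxy / ssx                     # ~ 0.0625
a_hat = sum_y / n - b_hat * sum_x / n  # ~ 2.1645 (slides round b_hat to 0.0625 first, giving 2.1661)
r = ssxy / math.sqrt(ssx * ssy)        # ~ 0.5545

print(round(a_hat, 4), round(b_hat, 4), round(r, 4), round(r ** 2, 4))
```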
