
CLASSICAL LINEAR REGRESSION AND ITS ASSUMPTIONS
◦ The first assumption here is that our regression curve is correctly specified. That is, the dependent variable is statistically influenced by the explanatory variables and the shape of the regression curve is linear. Formally, the dependent variable (say Y) is assumed to be a linear function of the variable(s) and parameter(s).
◦ Another assumption is that the random disturbance term U is normally distributed with mean zero. That is, U ranges from -∞ to +∞, is symmetrically distributed around its mean, and these deviations on average nullify themselves; this arises from the randomness of the disturbance term. This is denoted E(u) = 0, or E(eᵢ) = 0 for all i.
◦ In addition to the above assumption, it is stated that the way individual observations scatter around the line depends on the pattern of variation of the disturbance term, and this variation is expected to be constant. That is, we assume a constant variance of the disturbance terms whatever the value of X. Formally stated, Var(Uᵢ) = E{Uᵢ - E(Uᵢ)}² = σᵤ². This is the assumption of homoscedasticity. It is in fact a double assumption, one of homoscedasticity and the other of non-autocorrelation, i.e.

E(eᵢ²) = σᵤ² for all i
E(eᵢeⱼ) = 0 for all i ≠ j;   alternatively, Cov(uᵢ, uⱼ) = 0
◦ Another noteworthy assumption is that X is nonstochastic, that is, X is fixed in repeated sampling. In other words, we have perfect control over X when carrying out sampling. This allows us to avoid an uneven distribution of X-values over its range.

◦ However, the X's can be stochastic provided they do not co-vary with the U's. This implies that the X's are independent of the error term, symbolically denoted E(XU) = 0, or E(Xᵢuᵢ) = 0, i.e. the explanatory variable(s) is (are) pairwise uncorrelated with the error (disturbance) term.
◦ There is also the assumption of non-multicollinearity. The implication of this assumption is that there exists no exact linear relationship among the regressors in the equation, and that the number of observations (n) must always be at least as large as the number of parameters to be estimated (say k). Formally stated, X has full rank:

ρ(X) = k ≤ n

With the above assumptions, we can describe the probability distribution of Y.

Given Y = Xβ + e,
the expectation of Y, written E(Y), is

E(Y) = Xβ,   since E(e) = 0

The variance of Y is

Var(Y) = E{[Y - E(Y)][Y - E(Y)]′}
       = E[(Xβ + e - Xβ)(Xβ + e - Xβ)′]

Var(Y) = E(ee′) = σ²I
◦ Since Y is a linear function of U and Uᵢ is assumed normally distributed, then Yᵢ is normally distributed and thus

Y ~ N(Xβ, σ²I)
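These assumptions can be made concrete with a small simulation. The following sketch (Python with NumPy) generates data satisfying Y = Xβ + u with u ~ N(0, σ²I) and checks that the disturbances have approximately zero mean and constant variance; the sample size, β and σ below are illustrative assumptions, not values from the text.

```python
import numpy as np

# Illustrative simulation of the classical assumptions: Y = X*beta + u, u ~ N(0, sigma^2 I).
# The sample size, beta and sigma are arbitrary choices for demonstration.
rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([np.ones(n), rng.uniform(0.0, 10.0, n)])  # "fixed" regressors: intercept and one X
beta = np.array([4.0, 2.5])
sigma = 1.5

u = rng.normal(0.0, sigma, n)   # disturbances: zero mean, constant variance, uncorrelated
Y = X @ beta + u                # Y is then distributed N(X*beta, sigma^2 I)

print("mean of u (should be near 0):", u.mean())
print("variance of u (should be near sigma^2 = 2.25):", u.var())
```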
Estimation of the Model
◦ Here we consider estimation for both univariate and multivariate
cases. The univariate linear regression model has one explanatory
variable and is specified as follows

Y   =   Xβ   +   ε
(n×1)  (n×2)(2×1)  (n×1)

Y – (n×1) vector of observations on the regressand
X – (n×2) matrix of observations on the regressor
β – (2×1) vector of coefficients
ε – (n×1) vector of disturbances
The process of estimation is as follows:

1. Determine the estimated (sample) relationship

Y   =   Xβ̂   +   e,    where β̂ is the estimate of β and e is the estimate of ε
(n×1)  (n×2)(2×1)  (n×1)
2. Determine the residual deviations

e = Y - Xβ̂

3. Square and sum the residual deviations. The residual sum of squares is

e′e = (Y - Xβ̂)′(Y - Xβ̂)                              (3.11)
    = Y′Y - 2β̂′X′Y + β̂′X′Xβ̂

Note that Y′Xβ̂ and β̂′X′Y are equal (1×1) scalars, so the two cross-product terms combine into -2β̂′X′Y.
4. Determine the β̂ that minimizes the residual sum of squares (RSS), written S(β̂):

S(β̂) = Y′Y - 2β̂′X′Y + β̂′X′Xβ̂

To minimize, we differentiate S(β̂) with respect to β̂ and equate to zero [the first-order condition for a stationary point]:

∂S/∂β̂ = -2X′Y + 2X′Xβ̂ = 0                            (3.12)

so that

X′X β̂  =  X′Y
(2×n)(n×2)   (2×n)(n×1)

β̂ = (X′X)⁻¹ X′Y                                      (3.13)
     (2×2)  (2×1)
The expression (3.13) is the solution of the set of normal equations X′Xβ̂ = X′Y, i.e.

[ 1    1   ⋯   1 ] [1  X₁₂]            [ 1    1   ⋯   1 ] [Y₁]
[X₁₂  X₂₂  ⋯  Xₙ₂] [1  X₂₂]  [β̂₁]   =  [X₁₂  X₂₂  ⋯  Xₙ₂] [Y₂]
                   [⋮    ⋮ ]  [β̂₂]                        [⋮ ]
                   [1  Xₙ₂]                               [Yₙ]

which multiplies out to

ΣYᵢ    = nβ̂₁ + β̂₂ ΣXᵢ₂
ΣXᵢ₂Yᵢ = β̂₁ ΣXᵢ₂ + β̂₂ ΣXᵢ₂²                           (3.14)

Equations (3.14) are the normal equations.
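As a quick illustration of (3.13) and (3.14), the sketch below (Python/NumPy; the small data set is invented for illustration) forms X′X and X′Y and solves the normal equations X′Xβ̂ = X′Y directly.

```python
import numpy as np

# Hypothetical two-variable example: Y regressed on an intercept and one regressor X2.
X2 = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
Y  = np.array([3.0, 7.0, 5.0, 11.0, 14.0])

X = np.column_stack([np.ones_like(X2), X2])   # n x 2 matrix with a column of ones
XtX = X.T @ X                                  # [[n, sum X2], [sum X2, sum X2^2]]
XtY = X.T @ Y                                  # [sum Y, sum X2*Y]

beta_hat = np.linalg.solve(XtX, XtY)           # solves the normal equations X'X b = X'Y
print("beta_hat =", beta_hat)
```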

The multivariate linear regression model with k parameters and (k-1) explanatory variables is represented by

Y   =   Xβ   +   ε                                   (3.15)
(n×1)  (n×k)(k×1)  (n×1)

Y – (n×1) vector of observations on the regressand
X – (n×k) matrix of observations on the regressors
β – (k×1) vector of coefficients

Yᵢ = β₁ + β₂Xᵢ₂ + ⋯ + βₖXᵢₖ + εᵢ,   i = 1, 2, …, n    (3.16)
The following system of simultaneous equations is derived as the normal equations:

nβ̂₁      + β̂₂ ΣXᵢ₂     + ⋯ + β̂ₖ ΣXᵢₖ     = ΣYᵢ
β̂₁ ΣXᵢ₂  + β̂₂ ΣXᵢ₂²    + ⋯ + β̂ₖ ΣXᵢ₂Xᵢₖ  = ΣXᵢ₂Yᵢ
  ⋮
β̂₁ ΣXᵢₖ  + β̂₂ ΣXᵢ₂Xᵢₖ  + ⋯ + β̂ₖ ΣXᵢₖ²    = ΣXᵢₖYᵢ
Here we consider the properties of the parameter estimates:

E(β̂) = E[(X′X)⁻¹X′Y] = E[(X′X)⁻¹X′(Xβ + ε)]
     = E[β + (X′X)⁻¹X′ε] = β + (X′X)⁻¹X′E(ε) = β,   since E(ε) = 0

The mean of β̂, i.e. E(β̂) = β                                   (3.17)

The covariance of the estimator, Cov(β̂), is given as

Cov(β̂) = E{[β̂ - E(β̂)][β̂ - E(β̂)]′}                              (3.18)
        = E[(X′X)⁻¹X′εε′X(X′X)⁻¹]
        = (X′X)⁻¹X′E(εε′)X(X′X)⁻¹ = σ²(X′X)⁻¹,   because E(εε′) = σ²I
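A small Monte Carlo sketch illustrates (3.17) and (3.18): with X held fixed in repeated sampling, the average of β̂ across replications approaches β and the sample covariance of the β̂'s approaches σ²(X′X)⁻¹. The design (n, β, σ, number of replications) is assumed purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 50, 5000
X = np.column_stack([np.ones(n), rng.uniform(0.0, 5.0, n)])   # fixed across replications
beta, sigma = np.array([1.0, 2.0]), 0.8

draws = np.empty((reps, 2))
for r in range(reps):
    Y = X @ beta + rng.normal(0.0, sigma, n)
    draws[r] = np.linalg.solve(X.T @ X, X.T @ Y)              # beta_hat = (X'X)^{-1} X'Y

print("average beta_hat:", draws.mean(axis=0))                 # approximately beta
print("empirical Cov(beta_hat):\n", np.cov(draws.T))           # approx sigma^2 (X'X)^{-1}
print("sigma^2 (X'X)^{-1}:\n", sigma**2 * np.linalg.inv(X.T @ X))
```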
3.1 PROPERTIES OF THE MODEL
With the estimated parameters β̂ = (X′X)⁻¹X′Y and the vector of residuals e = Y - Xβ̂, we can present the properties of the model.

1. The sum of the cross-products of the explanatory variables and the residuals is equal to zero:

X′e = 0                                               (3.19)

Proof:
Given e = Y - Xβ̂
Premultiply by X′:
X′e = X′Y - X′Xβ̂
Recalling that X′Y = X′Xβ̂ (the normal equations),
X′e = X′Xβ̂ - X′Xβ̂ = 0
2. The sum of the cross-products of the estimated Y and the residuals is equal to zero, i.e.

Ŷ′e = 0                                               (3.20)

Proof:
Given Ŷ = Xβ̂
Transpose: Ŷ′ = β̂′X′
Post-multiply by e:
Ŷ′e = β̂′X′e = β̂′(0) = 0
Hence Ŷ′e = β̂′X′e = 0
3. The total variation of Y is the sum of the explained variation and the unexplained variation:

Y′Y = Ŷ′Ŷ + e′e,   i.e.  TSS = ESS + RSS              (3.21)

TSS – Total Sum of Squares; ESS – Explained Sum of Squares; RSS – Residual Sum of Squares.

Proof:
e = Y - Xβ̂
e′e = (Y - Xβ̂)′(Y - Xβ̂)
    = Y′Y - 2β̂′X′Y + β̂′X′Xβ̂
    = Y′Y - 2β̂′X′Xβ̂ + β̂′X′Xβ̂,   since X′Y = X′Xβ̂
    = Y′Y - β̂′X′Xβ̂
    = Y′Y - Ŷ′Ŷ,   since Xβ̂ = Ŷ and hence β̂′X′ = Ŷ′
Hence e′e = Y′Y - Ŷ′Ŷ, i.e. Y′Y = Ŷ′Ŷ + e′e.

Y′Y – Total Sum of Squares
Ŷ′Ŷ – Explained Sum of Squares
e′e – Residual Sum of Squares
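These three properties are easy to verify numerically. The sketch below uses simulated data (invented for illustration) to check X′e = 0, Ŷ′e = 0 and Y′Y = Ŷ′Ŷ + e′e.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
Y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ beta_hat
e = Y - Y_hat

print("X'e     =", X.T @ e)                                         # ~ 0  (property 1)
print("Yhat'e  =", Y_hat @ e)                                       # ~ 0  (property 2)
print("Y'Y =", Y @ Y, " Yhat'Yhat + e'e =", Y_hat @ Y_hat + e @ e)  # equal (property 3)
```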
4. The Gauss-Markov theorem states that among the class of linear and unbiased estimators, the Ordinary Least Squares (OLS) estimator has minimum variance.

The OLS estimator β̂ = (X′X)⁻¹X′Y = CY, with C = (X′X)⁻¹X′, so the OLS estimator is linear in Y, and it is unbiased since E(β̂) = β (note 3.17).
From (3.17), β̂ - β = (X′X)⁻¹X′ε, and its covariance matrix
Cov(β̂) = σ²(X′X)⁻¹, as in (3.18), is minimum.
To prove the Gauss-Markov theorem – that the least squares estimator is the best of all linear unbiased estimators (b.l.u.e.) – consider the estimator

β̃ = [(X′X)⁻¹X′ + P]Y                                  (3.22)

where P is a k×n nonstochastic perturbation matrix representing a perturbation from the β̂ estimator. The estimator β̃ becomes the least squares estimator if and only if P vanishes. Thus (3.22) defines a whole set of estimators, each determined once a P matrix is given. This set consists of all estimators that are linear in Y, and under appropriate conditions the estimators in this set are unbiased. Substituting Y = Xβ + e into (3.22) yields

β̃ = [(X′X)⁻¹X′ + P](Xβ + e) = β + (X′X)⁻¹X′e + PXβ + Pe           (3.23)

Taking expectations, all terms except β and PXβ vanish. Thus β̃ is unbiased if PX = 0. Since P can be any perturbation matrix subject only to this condition, the class of estimators defined by (3.22) contains all linear unbiased estimators of β.

To show that β̂ is best (most efficient) among this class requires comparing the covariance matrices Cov(β̃) and Cov(β̂), since both β̃ and β̂ are unbiased.
e = Y - Xβ̂
  = Y - X(X′X)⁻¹X′Y
  = [I - X(X′X)⁻¹X′]Y
  = MY

where M = I - X(X′X)⁻¹X′ is symmetric and idempotent, i.e. M = M′ and MM = M.

e = MY = M(Xβ + U) = MXβ + MU = MU,   since MX = 0  [proof an exercise]

e′e = (MU)′(MU) = U′M′MU = U′MU                       (3.24)

Taking expectations of (3.24):

E(e′e) = E(U′MU) = E[tr(U′MU)],   since U′MU is a scalar
       = E[tr(MUU′)] = tr(M σᵤ²Iₙ) = σᵤ² tr(M) = σᵤ²(n - k)        (3.25)
Note that tr[X(X′X)⁻¹X′] = tr[(X′X)⁻¹X′X] = tr(Iₖ) = k, so

tr(M) = tr(Iₙ) - tr(Iₖ) = n - k

Thus if we define

S² = e′e / (n - k)                                    (3.26)

it follows that E(S²) = σ²(n - k)/(n - k) = σ², which implies that S² is an unbiased estimator of σ². The square root S is often referred to as the standard error of the estimate, and may be regarded as the standard deviation of the Y values about the regression plane.
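The unbiased variance estimator in (3.26) can be computed directly from the residuals, as in the sketch below (the simulated data and parameter values are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
Y = X @ np.array([2.0, 1.0, -1.0]) + rng.normal(0.0, 1.5, n)   # true sigma = 1.5

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_hat

s2 = (e @ e) / (n - k)      # S^2 = e'e / (n - k), unbiased for sigma^2
print("S^2 =", s2, "  standard error of estimate S =", np.sqrt(s2))
```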
a. MODEL IN DEVIATION FORM
The essence of this approach is to express all data in terms of deviations from sample means. This enables us to estimate the slope coefficients at one stage and then the intercept term, if necessary. The approach is straightforward with the use of a transformation matrix

A = I - (1/n) ii′

where i denotes a column vector of n ones. This matrix is symmetric and idempotent. Its properties include: premultiplying any vector of identical elements by A yields a zero vector, i.e. Ai = 0; and Ae = e, since the mean of the residuals is zero, e is already in deviation form.
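The transformation matrix A = I - (1/n)ii′ can be constructed and its properties checked directly; the data vector used below is just an example.

```python
import numpy as np

n = 5
i = np.ones((n, 1))
A = np.eye(n) - (1.0 / n) * (i @ i.T)     # A = I - (1/n) i i'

v = np.array([3.0, 1.0, 8.0, 3.0, 5.0])   # any data vector
print("A @ v =", A @ v)                    # v expressed as deviations from its mean
print("A @ i =", (A @ i).ravel())          # a vector of identical elements maps to zero
print("symmetric-idempotent:", np.allclose(A, A.T) and np.allclose(A @ A, A))
```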
To illustrate, consider the least squares equation in k variables

Yₜ = b₁ + b₂X₂ₜ + b₃X₃ₜ + ⋯ + bₖXₖₜ + eₜ                          (1.26)

Averaging over the sample observations gives

Ȳ = b₁ + b₂X̄₂ + b₃X̄₃ + ⋯ + bₖX̄ₖ     (since ē = 0)                (1.27)

Subtracting (1.27) from (1.26) we obtain

yₜ = b₂x₂ₜ + b₃x₃ₜ + ⋯ + bₖxₖₜ + eₜ,    t = 1, …, n               (1.28)

where yₜ = Yₜ - Ȳ, x₂ₜ = X₂ₜ - X̄₂, and so on; lower-case letters denote deviations from sample means and the intercept vanishes. The intercept is calculated thus:

b₁ = Ȳ - b₂X̄₂ - b₃X̄₃ - ⋯ - bₖX̄ₖ
The OLS estimator b and the residual vector e are connected by y = Xb + e. Partition the X matrix as X = [X₁  X₂], where X₁ (= i) is the usual column of units and X₂ is the n×(k-1) matrix of observations on the variables X₂, X₃, …, Xₖ. Thus

Y = Xb + e = [i  X₂] [b₁]  + e
                     [b₂]

where b₂ is the (k-1)-element vector containing the coefficients b₂, b₃, …, bₖ. Premultiplying by A gives

AY = [Ai  AX₂] [b₁]  + Ae  =  [0  AX₂] [b₁]  + e
               [b₂]                    [b₂]

i.e.   AY = (AX₂)b₂ + e,   or   y* = X*b₂ + e                     (1.29)

where y* = AY and X* = AX₂. Since X′e = 0 and Ae = e, it follows that X*′e = 0, or

X₂′AY = X₂′AX₂b₂                                      (1.30)

These are the familiar normal equations, except that the data are in deviation form and the b₂ vector contains the (k-1) slope coefficients. Finally, using the symmetric idempotency of A, (1.30) is equivalent to

(AX₂)′(Ay) = (AX₂)′(AX₂)b₂                            (1.31)

(1.31) can be interpreted as follows:

b₂ is the subvector of OLS slope coefficients
Ay is the y vector expressed in deviation form
AX₂ is the matrix of explanatory variables in deviation form

Equation (1.31) is a set of normal equations in terms of deviations, whose solution yields the OLS slope coefficients.
The sum of squared deviations in the dependent variable, denoted TSS, is

TSS = y′Ay                                            (1.32)

(1.32) can be decomposed into an explained sum of squares (ESS) and a residual (unexplained) sum of squares (RSS). Recall

Ay = AX₂b₂ + e

Transposing and multiplying:

y′Ay = b₂′X₂′AX₂b₂ + e′e                              (1.33)
(TSS) =   (ESS)    +  (RSS)

The cross-product terms vanish since X′e = 0 (and Ae = e).

Dividing (1.33) by y′Ay gives

1 = b₂′X₂′AX₂b₂ / y′Ay + e′e / y′Ay                   (1.34)

i.e.   1 = ESS/TSS + RSS/TSS

so that

R²₁.₂₃…ₖ = ESS/TSS = 1 - e′e / y′Ay

Equivalently,

R² = b₂′X₂′AX₂b₂ / y′Ay = b₂′X₂′Ay / y′Ay             (1.35)
R² measures the proportion of the total variation in Y explained by the linear combination of the regressors. An adjusted R², denoted R̄², is a statistic that takes account of the number of regressors used in the equation. It is useful for comparing the fit of specifications that differ in the addition or deletion of explanatory variables. The unadjusted R² will never decrease with the addition of any variable to the set of regressors. The adjusted R̄², however, may decrease with the addition of variables of low explanatory power. It is defined as

R̄² = 1 - [RSS/(n - k)] / [TSS/(n - 1)]                (1.36)

or, alternatively,

R̄² = 1 - (1 - R²)(n - 1)/(n - k)                      (1.37)

Two other frequently used criteria for comparing the fit of various specifications involving different numbers of regressors are the Schwarz criterion,

SC = ln(e′e/n) + (k/n) ln n

and the Akaike information criterion,

AIC = ln(e′e/n) + 2k/n

One looks for specifications that will reduce the residual sum of squares, but each criterion adds a penalty, which increases with the number of regressors.
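The goodness-of-fit and model-selection measures above can be computed in a few lines; the sketch below assumes a fitted regression with residual vector e and uses simulated data purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 60, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
Y = X @ np.array([1.0, 0.8, 0.0, -0.5]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ b
rss = e @ e
tss = ((Y - Y.mean()) ** 2).sum()                # y'Ay: total variation about the mean

r2     = 1.0 - rss / tss
r2_bar = 1.0 - (rss / (n - k)) / (tss / (n - 1))
sc     = np.log(rss / n) + (k / n) * np.log(n)   # Schwarz criterion
aic    = np.log(rss / n) + 2.0 * k / n           # Akaike information criterion
print("R^2 =", r2, " adj R^2 =", r2_bar, " SC =", sc, " AIC =", aic)
```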

Example 1.1. Given the sample data below, estimate the parameters.

     [3]        [1  3  5]
     [1]        [1  1  4]
Y =  [8]    X = [1  5  6]
     [3]        [1  2  4]
     [5]        [1  4  6]

       [ 5   15   25]          [ 20]
X′X =  [15   55   81]   X′Y =  [ 76]
       [25   81  129]          [109]

The normal equations are then

[ 5   15   25] [β̂₁]   [ 20]
[15   55   81] [β̂₂] = [ 76]
[25   81  129] [β̂₃]   [109]

Gaussian elimination techniques can be used thus:

New row 2 = row 2 - 3 × row 1
New row 3 = row 3 - 5 × row 1

[5  15  25] [b₁]   [20]
[0  10   6] [b₂] = [16]
[0   6   4] [b₃]   [ 9]

New row 3 = row 3 - (6/10) × row 2

[5  15   25 ] [b₁]   [ 20 ]
[0  10    6 ] [b₂] = [ 16 ]
[0   0   0.4] [b₃]   [-0.6]

The third equation gives 0.4b₃ = -0.6, so b₃ = -1.5.

Second equation: 10b₂ + 6b₃ = 16, so 10b₂ = 16 - 6b₃ = 16 - 6(-1.5) = 16 + 9 = 25, giving b₂ = 25/10 = 2.5.

Equation 1: 5b₁ + 15b₂ + 25b₃ = 20, i.e. 5b₁ + 15(2.5) + 25(-1.5) = 20, so 5b₁ + 37.5 - 37.5 = 20 and b₁ = 20/5 = 4.

The regression equation is thus:  Ŷ = 4 + 2.5X₂ - 1.5X₃
Alternatively, in deviation form (noting X̄₂ = 3, X̄₃ = 5, Ȳ = 4):

            [-1]                [ 0   0]
            [-3]                [-2  -1]
y* = Ay =   [ 4]    X* = AX₂ =  [ 2   1]
            [-1]                [-1  -1]
            [ 1]                [ 1   1]

X*′X* = X₂′AX₂ = [10  6]
                 [ 6  4]
The relevant normal equations are then

[10  6] [b₂]   [16]
[ 6  4] [b₃] = [ 9]

[b₂]   [10  6]⁻¹ [16]   [ 1.0  -1.5] [16]   [ 2.5]
[b₃] = [ 6  4]   [ 9] = [-1.5   2.5] [ 9] = [-1.5]

TSS = y′Ay = 28

ESS = b₂′X₂′AX₂b₂ = [2.5  -1.5] [10  6] [ 2.5]  = 26.5
                               [ 6  4] [-1.5]

or, more simply, from  b₂′X*′y* = [2.5  -1.5] [16]  = 40 - 13.5 = 26.5
                                              [ 9]

RSS = TSS - ESS = 28 - 26.5 = 1.5

R² = ESS/TSS = 26.5/28 ≈ 0.95

Adjusted R²:  R̄² = 1 - (1 - R²)(n - 1)/(n - k) = 1 - (0.05)(4/2) = 1 - 0.1 = 0.90
Example 1.2. Estimate the slope coefficients in the regression model

Yₜ = β₀ + β₁X₁ₜ + β₂X₂ₜ + β₃X₃ₜ + Uₜ

given the following sums of squares and products of deviations from means for 24 observations:

Σy² = 60     Σx₁² = 10     Σx₂² = 30     Σx₃² = 20
Σyx₁ = 7     Σyx₂ = -7     Σyx₃ = -26
Σx₁x₂ = 10   Σx₁x₃ = 5     Σx₂x₃ = 15
Solution:  y′Ay = 60

           [  7]              [10  10   5]
X₂′Ay =    [ -7]    X₂′AX₂ =  [10  30  15]
           [-26]              [ 5  15  20]

Thus the normal equations are

[10  10   5] [β̂₁]   [  7]
[10  30  15] [β̂₂] = [ -7]
[ 5  15  20] [β̂₃]   [-26]

The determinant |X₂′AX₂| = 2500, and

[β̂₁]   [10  10   5]⁻¹ [  7]   [ 0.15  -0.05   0.00] [  7]   [ 1.40]
[β̂₂] = [10  30  15]   [ -7] = [-0.05   0.07  -0.04] [ -7] = [ 0.20]
[β̂₃]   [ 5  15  20]   [-26]   [ 0.00  -0.04   0.08] [-26]   [-1.80]

Alternatively, the Gaussian elimination method will yield

New row 2 = row 2 - row 1
New row 3 = 2 × row 3 - row 2

[10  10   5] [β̂₁]   [  7]
[ 0  20  10] [β̂₂] = [-14]
[ 0   0  25] [β̂₃]   [-45]

From the third row: 25β̂₃ = -45, so β̂₃ = -45/25 = -9/5 = -1.8.

From the second row: 20β̂₂ + 10β̂₃ = -14, i.e. 20β̂₂ + 10(-1.8) = -14, so 20β̂₂ - 18 = -14, 20β̂₂ = 4 and β̂₂ = 4/20 = 0.20.

From the first row: 10β̂₁ + 10β̂₂ + 5β̂₃ = 7; substituting for β̂₂ and β̂₃ yields β̂₁ = 1.4.

Thus the technique used for deriving the slope coefficients is immaterial. However, the inverse method provides a component needed for the variances of β̂₁, β̂₂ and β̂₃.
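The arithmetic can again be verified directly from the cross-product matrices given in Example 1.2.

```python
import numpy as np

X2AX2 = np.array([[10.0, 10.0,  5.0],
                  [10.0, 30.0, 15.0],
                  [ 5.0, 15.0, 20.0]])    # X2'A X2 from the sums of squares and products
X2Ay  = np.array([7.0, -7.0, -26.0])      # X2'A y

b2 = np.linalg.solve(X2AX2, X2Ay)
print("slope coefficients:", b2)                   # approximately [ 1.4  0.2 -1.8]
print("(X2'AX2)^{-1}:\n", np.linalg.inv(X2AX2))    # the 0.15 / -0.05 / ... matrix
```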

Example 1.3: Given the data in Example 1.2 and the slope coefficients in the solution, calculate the variance-covariance matrix of the estimated slope coefficients and the multiple coefficient of determination.
Solution: Given

y′Ay = 60 = TSS,    b₂ = [1.4   0.2   -1.8]′

ESS = b₂′X₂′AX₂b₂ = b₂′X₂′Ay

                       [  7]
    = [1.4  0.2  -1.8] [ -7] = 9.8 - 1.4 + 46.8 = 55.2
                       [-26]

RSS = TSS - ESS = 60 - 55.2 = 4.8

S² = e′e/(n - k) = 4.8/20 = 0.24

                                   [ 0.15  -0.05   0.00]
Var-Cov(b₂) = S²(X₂′AX₂)⁻¹ = 0.24  [-0.05   0.07  -0.04]
                                   [ 0.00  -0.04   0.08]

Var(β̂₁) = 0.24(0.15) = 0.0360,    s.e.(β̂₁) = √0.036

Var(β̂₃) = 0.24(0.08) = 0.0192

The standard errors of β̂₃ and β̂₂, and the variance of β̂₂, are left as an exercise for students.

R² = ESS/TSS = 1 - e′e/y′Ay = 1 - 4.8/60 = 55.2/60 = 0.92

R̄² = 1 - (4.8/20)/(60/23) = 1 - 0.092 = 0.908
The result indicates a high explanatory power of the variables.
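The variance-covariance matrix, standard errors and fit measures of Example 1.3 follow directly from the quantities already computed; the sketch below reproduces them (n = 24, k = 4, as in the text).

```python
import numpy as np

n, k = 24, 4
X2AX2 = np.array([[10.0, 10.0,  5.0],
                  [10.0, 30.0, 15.0],
                  [ 5.0, 15.0, 20.0]])
X2Ay  = np.array([7.0, -7.0, -26.0])
tss   = 60.0                               # y'Ay

b2  = np.linalg.solve(X2AX2, X2Ay)         # [1.4, 0.2, -1.8]
ess = b2 @ X2Ay                            # b2' X2'Ay = 55.2
rss = tss - ess                            # 4.8
s2  = rss / (n - k)                        # 0.24

vcov = s2 * np.linalg.inv(X2AX2)           # variance-covariance matrix of the slopes
print("standard errors:", np.sqrt(np.diag(vcov)))
print("R^2 =", ess / tss, "  adjusted R^2 =", 1 - (rss / (n - k)) / (tss / (n - 1)))
```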
1.7 TESTING LINEAR
HYPOTHESES ABOUT β
In the previous sections the properties of the OLS estimator of β have been established. In this section we show how to use this estimator to test hypotheses about β. The following are examples of typical hypotheses.

(i) H₀: βᵢ = 0. This is generally referred to as the significance test. It basically tests whether a particular regressor Xᵢ has no influence on the regressand.

(ii) H₀: βᵢ = βᵢ₀. This implies βᵢ has a specified value. If, for instance, βᵢ denotes the marginal propensity to consume in Sierra Leone, one might test that the marginal propensity takes a specified value, or test that a price elasticity is unitary, ε = 1.

(iii) H₀: β₂ + β₃ = 1. If these denote the capital and labour elasticities of a production function, this formulation is tantamount to testing for constant returns to scale (CRTS).

(iv) H₀: β₃ = β₄, or β₃ - β₄ = 0. This is the test of equal coefficients.

(v) H₀: β₂ = β₃ = ⋯ = βₖ = 0.

This is known as the joint significance test. It is used to test the hypothesis that the complete set of regressors has no effect on the dependent variable (say Y); it tests the overall goodness of fit.

(vi) H₀: β₂ = 0, where the β vector is partitioned into two subvectors, β₁ (containing k₁ elements) and β₂ (containing k₂ = k - k₁ elements). This sets up the hypothesis that a specified subset of regressors plays no role in the determination of the dependent variable (Y, say).
The above forms of linear hypotheses are incorporated in

Rβ = r,   where
R – q×k matrix of known constants, q ≤ k
r – q-vector of known constants

Each null hypothesis determines the relevant elements in R and r. For the above examples we have:
(i) R = [0  0  …  1  …  0], r = 0, q = 1, with 1 in the ith position.
For a 3-parameter case, R = [0  1  0], r = 0, q = 1, i.e. β₂ = 0.

(ii) R = [0 … 0  1  0 … 0], r = βᵢ₀, q = 1,

with 1 in the ith position (depicting a prescribed value for the β in that position). For a 4-parameter case in which β₄ is specified to be 1:
R = [0  0  0  1], r = 1, q = 1, i.e. β₄ = 1.

(iii) R = [0  1  1  0 … 0], r = 1, q = 1, i.e. β₂ + β₃ = 1.

(iv) R = [0  0  1  -1  0 … 0], r = 0, q = 1, i.e. β₃ - β₄ = 0.

(v) R = [0  Iₖ₋₁], r = 0, q = k - 1, where 0 is a column vector of k - 1 zeros, i.e.

R = [0  1  0  …  0]
    [0  0  1  …  0]        r = 0
    [ …  …  …  … ]
    [0  0  0  …  1]

(vi) R = [0  Iₖ₂], r = 0, q = k₂,

where 0 is a null matrix of order k₂ × k₁, Iₖ₂ is the identity matrix of order k₂, and r is a k₂-element column vector of zeros. That is, the last k₂ elements in β are jointly zero. For example, in an equation explaining the rate of inflation, the explanatory variables might be grouped into two subsets – those measuring expectations and those measuring the pressure of demand. The significance of either subset might be tested by using this formulation, with the numbering of the variables arranged so that those in the subset to be tested come at the end.

To devise a practical test procedure we need to determine the sampling distribution of Rβ̂.
E(Rβ̂) = R E(β̂) = Rβ

Var(Rβ̂) = E{R(β̂ - β)(β̂ - β)′R′} = σ² R(X′X)⁻¹R′

Since β̂ is multivariate normal,

Rβ̂ ~ N(Rβ, σ² R(X′X)⁻¹R′),   or   R(β̂ - β) ~ N(0, σ² R(X′X)⁻¹R′)            (1.38)

Under the null hypothesis we can replace Rβ in (1.38) by r, obtaining Rβ̂ - r ~ N(0, σ² R(X′X)⁻¹R′), and thus

(Rβ̂ - r)′[σ² R(X′X)⁻¹R′]⁻¹(Rβ̂ - r) ~ χ²(q)                                   (1.39)

Also e′e/σ² ~ χ²(n - k); thus we can form an F ratio so that the unknown σ² will cancel out. If Rβ = r is true, then

F = [(Rβ̂ - r)′[R(X′X)⁻¹R′]⁻¹(Rβ̂ - r) / q] / [e′e/(n - k)]  ~  F(q, n - k)     (1.40)

The test procedure is to reject the hypothesis Rβ = r if the computed F value exceeds the preselected critical value.
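The F statistic in (1.40) can be coded once and reused for any of the hypotheses (i)-(vi). The helper below is a sketch working off the deviation-form quantities of Example 1.2 (the function name and its inputs are ours, not the text's).

```python
import numpy as np

def f_stat(R, r, b, XtX_inv, s2):
    """F statistic for H0: R b = r, as in (1.40): q restrictions, residual variance s2."""
    R, r, b = np.atleast_2d(R), np.atleast_1d(r), np.asarray(b)
    q = R.shape[0]
    d = R @ b - r
    middle = np.linalg.inv(R @ XtX_inv @ R.T)
    return float(d @ middle @ d) / (q * s2)

# Example: H0: beta1 + beta2 + beta3 = 0, using the Example 1.2 quantities.
XtX_inv = np.array([[ 0.15, -0.05,  0.00],
                    [-0.05,  0.07, -0.04],
                    [ 0.00, -0.04,  0.08]])   # (X2'A X2)^{-1} for the slope coefficients
b, s2 = np.array([1.4, 0.2, -1.8]), 0.24
print(f_stat(np.array([[1.0, 1.0, 1.0]]), np.array([0.0]), b, XtX_inv, s2))  # about 1.39
```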

Example 1.4. Using the values in Example 1.2, test the following:

(i) the joint significance of X₁ₜ, X₂ₜ and X₃ₜ
(ii) β₁ + β₂ + β₃ = 0
(iii) β₁ = β₃, or β₁ - β₃ = 0
(iv) β₂ = 0
Solution
(i) H₀: β₁ = β₂ = β₃ = 0.
For joint significance, use the straightforward ratio

F = [ESS/(k - 1)] / [RSS/(n - k)] = (55.2/3) / (4.8/20) = 76.66

F₀.₀₅(3, 20) = 3.10

Since F_calc > F(3, 20), the data are not consistent with the null hypothesis β₁ = β₂ = β₃ = 0.

(ii) R = [1  1  1], r = 0, q = 1

R(X₂′AX₂)⁻¹R′ = [1  1  1] [ 0.15  -0.05   0.00] [1]
                          [-0.05   0.07  -0.04] [1]  = 0.12
                          [ 0.00  -0.04   0.08] [1]

Rβ̂ - r = [1  1  1] [ 1.4]
                   [ 0.2]  - 0 = -0.2
                   [-1.8]

F = (Rβ̂ - r)² / [S² · R(X₂′AX₂)⁻¹R′] = (-0.2)² / (0.24 × 0.12) ≈ 1.39

Since F_calc < F₀.₀₅(1, 20) = 4.35, the evidence from the data does not reject H₀.
