Classical Linear Regression and Its Assumptions
◦ The first assumption is that our regression curve is correctly specified. That is, the dependent variable is statistically influenced by the explanatory variables, and the shape of the regression curve is linear. Formally, the dependent variable (say Y) is assumed to be a linear function of the variable(s) and parameter(s).
◦ Another assumption is that the random disturbance term U is normally distributed with mean zero. That is, U ranges from −∞ to +∞, is symmetrically distributed around its mean, and these deviations on average nullify themselves; this arises from the randomness of the disturbance term. It is denoted E(U) = 0, or E(u_i) = 0 for all i.
◦ In addition to the above, it is assumed that the way individual observations scatter around the line depends on the pattern of variation of the disturbance term, and this pattern is expected to be constant. That is, we assume a constant variance of the disturbance terms whatever the value of X. Formally stated, Var(u_i) = E{u_i − E(u_i)}² = σ_u². This is the assumption of homoscedasticity. It is in fact a double assumption, one of homoscedasticity and the other of non-autocorrelation, i.e.
E(u_i²) = σ_u² for all i
E(u_i u_j) = 0 for all i ≠ j; alternatively, Cov(u_i, u_j) = 0.
◦ Another noteworthy assumption is that X is nonstochastic, that is, X is fixed in repeated sampling. In other words, we have perfect control over X when carrying out sampling. This allows us to avoid an uneven distribution of X-values over its range.
◦ However, the X’s can be stochastic provided they do not co-vary with the U’s. This implies that the X’s are independent of the error term, symbolically denoted E(XU) = 0, or E(X_i u_i) = 0, i.e. the explanatory variable(s) is (are) pairwise uncorrelated with the error (disturbance) term.
◦ There is also the assumption of non-multicollinearity. The implication of this assumption is that there exists no exact linear relationship among the regressors in the equation, and that the number of observations (n) must always be at least as large as the number of parameters to be estimated (k say). Formally stated, X has full rank: rank(X) = k ≤ n.
With the above assumptions, we can describe the probability distribution of Y.
Given Y = Xβ + e,
the expectation of Y, written E(Y), is
E(Y) = Xβ, since E(e) = 0,
so that Y ~ N(Xβ, σ²I).
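The assumptions above can be illustrated with a small simulation. The following is a minimal numpy sketch (all variable names and parameter values are hypothetical): X is held fixed, the disturbances are drawn as N(0, σ²) with mean zero and constant variance, and Y is then normal around Xβ.

```python
import numpy as np

rng = np.random.default_rng(42)

n = 1000
beta = np.array([4.0, 2.5])                               # hypothetical true parameters
X = np.column_stack([np.ones(n), np.linspace(0, 10, n)])  # X fixed in repeated sampling
sigma = 2.0

u = rng.normal(0.0, sigma, size=n)   # u ~ N(0, sigma^2): E(u) = 0, homoscedastic, uncorrelated
Y = X @ beta + u                     # hence Y ~ N(X beta, sigma^2 I)

print(u.mean())                      # sample mean of the disturbances; close to E(u) = 0
```

In repeated samples only u (and hence Y) changes; the design matrix X stays the same, which is exactly the "fixed in repeated sampling" assumption.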
Estimation of the Model
◦ Here we consider estimation for both univariate and multivariate cases. The univariate linear regression model has one explanatory variable and is specified as follows:
Y = Xβ + e
(n×1) (n×2)(2×1) (n×1)
Y – (n×1) vector of observations on the regressand
X – (n×2) matrix of observations on the regressor
β – (2×1) vector of coefficients
e – (n×1) vector of residuals
The process of estimation is as follows:
To minimize S(β̂) = e'e, we differentiate S(β̂) with respect to β̂ and equate to zero [first-order condition for stationarity]:
∂S/∂β̂ = −2X'Y + 2X'Xβ̂ = 0 (3.12)
X'Xβ̂ = X'Y
β̂ = (X'X)⁻¹X'Y (3.13)
The expression X'Xβ̂ = X'Y constitutes the set of normal equations. Written out for the two-variable case, it becomes
[1   1   …  1 ] [1  X_12]           [1   1   …  1 ] [Y_1]
[X_12 X_22 … X_N2] [1  X_22] [β̂_1]  =  [X_12 X_22 … X_N2] [Y_2]
                 [⋮   ⋮  ] [β̂_2]                     [ ⋮ ]
                 [1  X_N2]                            [Y_N]
which multiplies out to
ΣY_i = nβ̂_1 + β̂_2ΣX_i2
ΣX_i2 Y_i = β̂_1ΣX_i2 + β̂_2ΣX_i2²   (3.14)
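As a numerical sketch, the two-variable normal equations can be set up and solved directly; the data here are made up purely for illustration:

```python
import numpy as np

# Hypothetical two-variable sample: regress Y on a constant and X2
X2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y  = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
X  = np.column_stack([np.ones_like(X2), X2])   # n x 2 matrix [i, X2]

# Normal equations X'X b = X'Y, eq. (3.13), solved without forming an explicit inverse
b_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(b_hat)                                   # intercept and slope
```

Using `np.linalg.solve` on X'Xb = X'Y is numerically preferable to computing (X'X)⁻¹ explicitly, though both express the same estimator.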
The multivariate model is
Y = Xβ + ε
(n×1) (n×k)(k×1) (n×1) (3.15)
Y – (n×1) vector of observations on the regressand
X – (n×k) matrix of observations on the regressors
β – (k×1) vector of coefficients
Y_i = β_1 + β_2X_i2 + … + β_kX_ik + ε_i,  i = 1, 2, …, n (3.16)
The following system of simultaneous equations is derived as the normal equations:
nβ̂_1 + β̂_2ΣX_i2 + … + β̂_kΣX_ik = ΣY_i
β̂_1ΣX_i2 + β̂_2ΣX_i2² + … + β̂_kΣX_i2X_ik = ΣX_i2Y_i
⋮
β̂_1ΣX_ik + β̂_2ΣX_ikX_i2 + … + β̂_kΣX_ik² = ΣX_ikY_i
Here we consider properties of the estimates. For unbiasedness,
E(β̂) = E[(X'X)⁻¹X'Y] = E[(X'X)⁻¹X'(Xβ + ε)]
= β + (X'X)⁻¹X'E(ε) = β, since E(ε) = 0.
The mean of β̂, i.e. E(β̂) = β. (3.17)
The covariance of the estimator is
Cov(β̂) = E[(β̂ − β)(β̂ − β)'] = σ²(X'X)⁻¹. (3.18)
1. The sum of the cross-products of the regressors and the residuals is zero, i.e.
X'e = 0 …..(3.19)
Proof:
Given e = Y − Xβ̂, premultiply by X':
X'e = X'Y − X'Xβ̂.
Recalling the normal equations, X'Y = X'Xβ̂, so
X'e = X'Xβ̂ − X'Xβ̂ = 0.
2. The sum of the cross-products of estimated Y and the residuals is equal to zero, i.e.
Ŷ'e = 0 (3.20)
Proof:
Given Ŷ = Xβ̂,
transposing gives Ŷ' = β̂'X'.
Postmultiply by e:
Ŷ'e = β̂'X'e = β̂' × 0 = 0.
Hence Ŷ'e = 0.
3. The total variations of Y are the sum of explained variations and unexplained variations:
Y'Y = Ŷ'Ŷ + e'e, i.e. TSS = ESS + RSS (3.21)
TSS – Total Sum of Squares; ESS – Explained Sum of Squares; RSS – Residual Sum of Squares.
Proof:
e = Y − Xβ̂
e'e = (Y − Xβ̂)'(Y − Xβ̂)
= Y'Y − 2β̂'X'Y + β̂'X'Xβ̂
= Y'Y − 2β̂'X'Xβ̂ + β̂'X'Xβ̂, since X'Y = X'Xβ̂
= Y'Y − β̂'X'Xβ̂
= Y'Y − Ŷ'Ŷ, since Xβ̂ = Ŷ implies β̂'X'Xβ̂ = Ŷ'Ŷ.
Hence e'e = Y'Y − Ŷ'Ŷ, i.e. Y'Y = Ŷ'Ŷ + e'e.
Y'Y – total sum of squares
Ŷ'Ŷ – explained sum of squares
e'e – residual sum of squares
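The three properties just proved (X'e = 0, Ŷ'e = 0, TSS = ESS + RSS) hold exactly for any OLS fit, which a quick numerical check illustrates. The data below are simulated and the names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

b_hat = np.linalg.solve(X.T @ X, X.T @ Y)      # OLS estimates
Y_hat = X @ b_hat
e = Y - Y_hat

print(np.allclose(X.T @ e, 0.0))               # X'e = 0,    property (3.19)
print(np.allclose(Y_hat @ e, 0.0))             # Yhat'e = 0, property (3.20)
tss, ess, rss = Y @ Y, Y_hat @ Y_hat, e @ e
print(np.isclose(tss, ess + rss))              # Y'Y = Yhat'Yhat + e'e, (3.21)
```

The identities hold up to floating-point tolerance for any data set, because they are algebraic consequences of the normal equations rather than statistical properties.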
4. The Gauss-Markov Theorem states that among the class of linear unbiased estimators, the Ordinary Least Squares (OLS) estimator has minimum variance.
The OLS estimator β̂ = (X'X)⁻¹X'Y = CY, with C = (X'X)⁻¹X'.
Thus the OLS estimator is linear in Y, and it is unbiased since E(β̂) = β (note 3.17). From 3.17, β̂ − β = (X'X)⁻¹X'ε, so that
Cov(β̂) = E[(β̂ − β)(β̂ − β)'] = σ²(X'X)⁻¹,
which, as in (3.18), is minimum.
To prove the Gauss-Markov theorem – that the least squares estimator is the best of all linear unbiased estimators (b.l.u.e.) – consider the estimator
β̃ = [(X'X)⁻¹X' + P]Y …….(3.21)
where P is a k × n nonstochastic perturbation matrix representing a perturbation from the β̂ estimator. The estimator β̃ becomes the least squares estimator if and only if P vanishes. Thus (3.21) defines a whole set of estimators, which are determined once a P matrix is given. This set consists of all estimators that are linear in Y, and under appropriate conditions the estimators in this set are unbiased. Substituting Y = Xβ + e into (3.21) yields
β̃ = [(X'X)⁻¹X' + P](Xβ + e) = β + (X'X)⁻¹X'e + PXβ + Pe ………(3.22)
Taking expectations, all terms except β and PXβ vanish. Thus β̃ is unbiased if PX = 0. Since P can be any perturbation matrix subject only to this condition, the class of estimators defined by (3.21) contains all linear unbiased estimators of β.
To show that β̂ is best (most efficient) among this class requires calculating the covariance matrices of β̃ and β̂ and comparing them, since both β̃ and β̂ are unbiased. With PX = 0, β̃ − β = [(X'X)⁻¹X' + P]e, so
Cov(β̃) = σ²[(X'X)⁻¹ + PP'],
and since PP' is positive semidefinite, Cov(β̃) exceeds Cov(β̂) = σ²(X'X)⁻¹ unless P = 0.
The residual vector may be written
e = Y − Xβ̂
= Y − X(X'X)⁻¹X'Y
= [I − X(X'X)⁻¹X']Y
= MY
where M = I − X(X'X)⁻¹X' is said to be symmetric and idempotent, i.e. M = M' and MM = M.
Moreover, e = MY = M(Xβ + u) = MXβ + Mu = Mu, since MX = 0.
Because tr(AB) = tr(BA),
tr[X(X'X)⁻¹X'] = tr[(X'X)⁻¹X'X] = tr(I_k) = k
tr M = tr(I_n) − tr(I_k) = n − k.
Thus if we define
s² = e'e/(n − k) ……….(3.26)
then E(e'e) = E(u'Mu) = σ² tr M = (n − k)σ², so it follows that E(s²) = σ², which implies S² is an unbiased estimator of σ². The square root
S is often referred to as the standard error of the estimate, and may be regarded as the standard deviation
of the Y values about the regression plane.
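The stated properties of the residual-maker matrix M can be verified numerically. A minimal sketch with simulated data (all names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

# Residual-maker matrix M = I - X(X'X)^(-1)X'
M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(M, M.T))             # symmetric:  M = M'
print(np.allclose(M @ M, M))           # idempotent: MM = M
print(np.isclose(np.trace(M), n - k))  # tr M = n - k
```

Since tr M = n − k for any full-rank X, dividing e'e by n − k rather than n is what makes s² unbiased.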
a. MODEL IN DEVIATION FORM
The essence of this approach is to express all data in terms of deviations from sample means. This enables us to estimate the slope coefficients at one stage and then the intercept term if necessary. The approach is straightforward by the use of a transformation matrix
A = I − (1/n) i i'
where i denotes a column vector of n ones (n units). This matrix is symmetric and idempotent. Its properties include: premultiplying any vector of identical elements by A yields a zero vector, i.e. Ai = 0; and Ae = e, since the mean of the residuals is zero, so e is already in deviation form.
To illustrate, consider the least squares equation in k variables
Y_t = b_1 + b_2X_2t + b_3X_3t + … + b_kX_kt + e_t …..(1.26)
Averaging over the sample observations gives
Ȳ = b_1 + b_2X̄_2 + b_3X̄_3 + … + b_kX̄_k (since ē = 0) …..(1.27)
Subtracting (1.27) from (1.26) we obtain
y_t = b_2x_2t + b_3x_3t + … + b_kx_kt + e_t
where y_t = Y_t − Ȳ and x_t = X_t − X̄; lower-case letters denote deviations from sample means, and the intercept vanishes. The intercept is calculated thus:
b_1 = Ȳ − b_2X̄_2 − b_3X̄_3 − … − b_kX̄_k
The OLS estimator b and the residual vector e are connected by Y = Xb + e. Partition the X matrix as X = [i X_2], where i is the usual column of units and X_2 is the n×(k−1) matrix of observations on the variables X_2, X_3, …, X_k. Thus
Y = Xb + e = [i X_2][b_1; b_2] + e
where b_2 is the (k−1)-element vector containing the coefficients b_2, b_3, …, b_k. Premultiplying by A gives
AY = [Ai AX_2][b_1; b_2] + Ae = [0 AX_2][b_1; b_2] + e = AX_2b_2 + e
and premultiplying this by X_2' (noting X_2'e = 0) yields
(X_2'AX_2)b_2 = X_2'Ay …..(1.31)
Equation (1.31) is a set of normal equations in terms of deviations, whose solution yields the OLS slope coefficients.
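The deviation-form procedure can be sketched in numpy using the small data set of Example 1.1 below: build A, transform Y and the non-constant regressors, solve (1.31) for the slopes, then recover the intercept from the sample means.

```python
import numpy as np

# Data as in Example 1.1 below: Y and the two regressors (constant column excluded)
Y  = np.array([3.0, 1.0, 8.0, 3.0, 5.0])
X2 = np.array([[3.0, 5.0],
               [1.0, 4.0],
               [5.0, 6.0],
               [2.0, 4.0],
               [4.0, 6.0]])
n = len(Y)

A = np.eye(n) - np.ones((n, n)) / n            # A = I - (1/n) i i'
y_star = A @ Y                                 # deviations from sample means
X_star = A @ X2

# Normal equations in deviations, eq. (1.31): (X2'AX2) b2 = X2'Ay
b2 = np.linalg.solve(X_star.T @ X_star, X_star.T @ y_star)
b1 = Y.mean() - X2.mean(axis=0) @ b2           # intercept recovered from the means
print(b1, b2)
```

Because A is idempotent, (AX2)'(AX2) = X2'AX2, so transforming the data first and solving the ordinary normal equations gives exactly (1.31).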
The sum of squared deviations in the dependent variable, denoted by TSS, is
TSS = y'Ay …(1.32)
(1.32) can be decomposed into an explained sum of squares (ESS) and a residual (unexplained) sum of squares (RSS). Recall
Ay = AX_2b_2 + e
Transposing and multiplying:
y'Ay = b_2'X_2'AX_2b_2 + e'e ….(1.33)
(TSS) = (ESS) + (RSS)
Dividing through by TSS gives
1 = ESS/TSS + RSS/TSS
so that the coefficient of determination is
R²_{1·23…k} = ESS/TSS = 1 − e'e/(y'Ay)
Alternatively, the adjusted R² is
R̄² = 1 − (1 − R²)(n − 1)/(n − k) …..(1.37)
Two other frequently used criteria for comparing the fit of various specifications involving different numbers of regressors are the Schwarz criterion
SC = ln(e'e/n) + (k/n) ln n
and the Akaike information criterion
AIC = ln(e'e/n) + 2k/n.
One looks for specifications that will reduce the residual sum of squares, but each criterion adds a penalty, which increases with the number of regressors.
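The fit measures above are straightforward to compute from an OLS fit. A sketch (the function name is hypothetical):

```python
import numpy as np

def fit_criteria(Y, X):
    """R^2, adjusted R^2, Schwarz and Akaike criteria for an OLS fit of Y on X."""
    n, k = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ Y)
    e = Y - X @ b
    rss = e @ e
    tss = ((Y - Y.mean()) ** 2).sum()              # y'Ay
    r2 = 1.0 - rss / tss
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k)
    sc  = np.log(rss / n) + (k / n) * np.log(n)    # Schwarz criterion
    aic = np.log(rss / n) + 2.0 * k / n            # Akaike information criterion
    return r2, r2_adj, sc, aic
```

Applied to the data of Example 1.1 below, this gives R² = 26.5/28 ≈ 0.946 and R̄² ≈ 0.893, matching the rounded values in the example.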
Example 1.1. Given the sample data below, estimate the parameters.

Y = (3, 1, 8, 3, 5)'   X =
1 3 5
1 1 4
1 5 6
1 2 4
1 4 6

X'X =
 5 15  25
15 55  81
25 81 129

X'Y = (20, 76, 109)'
The normal equations X'Xb = X'Y are then
5b_1 + 15b_2 + 25b_3 = 20
15b_1 + 55b_2 + 81b_3 = 76
25b_1 + 81b_2 + 129b_3 = 109
Gaussian elimination (row 2 − 3·row 1; row 3 − 5·row 1) gives
5b_1 + 15b_2 + 25b_3 = 20
10b_2 + 6b_3 = 16
6b_2 + 4b_3 = 9
new row 3 = row 3 − (6/10)·row 2:
5b_1 + 15b_2 + 25b_3 = 20
10b_2 + 6b_3 = 16
0.4b_3 = −0.6
Third equation: b_3 = −0.6/0.4 = −1.5
Second equation: 10b_2 = 16 − 6b_3 = 16 − 6(−1.5) = 16 + 9 = 25, so b_2 = 25/10 = 2.5
Equation 1: 5b_1 + 15(2.5) + 25(−1.5) = 20, i.e. 5b_1 + 37.5 − 37.5 = 20, so b_1 = 20/5 = 4
The regression equation is thus: Ŷ = 4 + 2.5X_2 − 1.5X_3
Alternatively, in deviation form (noting X̄_2 = 3, X̄_3 = 5, Ȳ = 4):

y* = AY = (−1, −3, 4, −1, 1)'   X* = AX_2 =
 0   0
−2  −1
 2   1
−1  −1
 1   1

X*'X* = X_2'AX_2 = [10 6; 6 4]   X*'y* = X_2'Ay = (16, 9)'

The relevant normal equations are then
[10 6; 6 4][b_2; b_3] = [16; 9]
[b_2; b_3] = [10 6; 6 4]⁻¹[16; 9] = [1 −1.5; −1.5 2.5][16; 9] = [2.5; −1.5]

TSS = y'Ay = 28
ESS = b_2'(X_2'AX_2)b_2 = [2.5 −1.5][10 6; 6 4][2.5; −1.5] = 26.5
or, more simply, ESS = b_2'X_2'Ay = [2.5 −1.5][16; 9] = 40 − 13.5 = 26.5

R² = ESS/TSS = 26.5/28 ≈ 0.95
Adjusted R̄² = 1 − (1 − R²)(n − 1)/(n − k) = 1 − (0.05)(5 − 1)/(5 − 3) = 1 − 0.1 = 0.90
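Example 1.1 can be checked end-to-end with a few lines of numpy, solving the normal equations directly from the raw data:

```python
import numpy as np

# Example 1.1 data
Y = np.array([3.0, 1.0, 8.0, 3.0, 5.0])
X = np.column_stack([np.ones(5),
                     np.array([3.0, 1.0, 5.0, 2.0, 4.0]),   # X2
                     np.array([5.0, 4.0, 6.0, 4.0, 6.0])])  # X3

b = np.linalg.solve(X.T @ X, X.T @ Y)  # solves the normal equations
e = Y - X @ b
tss = ((Y - Y.mean()) ** 2).sum()
ess = tss - e @ e

print(b)             # intercept and slopes
print(tss, ess)      # total and explained sums of squares
```

The output reproduces b = (4, 2.5, −1.5), TSS = 28 and ESS = 26.5 from the worked example.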
Example 1.2. Estimate the slope coefficients in the regression model
Y_t = β_0 + β_1X_1t + β_2X_2t + β_3X_3t + U_t
given the following sums of squares and products of deviations from means for 24 observations:
Σy² = 60   Σx_1² = 10   Σx_2² = 30   Σx_3² = 20
Σyx_1 = 7   Σyx_2 = −7   Σyx_3 = −26
Σx_1x_2 = 10   Σx_1x_3 = 5   Σx_2x_3 = 15
Solution: y'Ay = 60

X_2'Ay = (7, −7, −26)'   X_2'AX_2 =
10 10  5
10 30 15
 5 15 20

Thus the normal equations are
10β̂_1 + 10β̂_2 + 5β̂_3 = 7
10β̂_1 + 30β̂_2 + 15β̂_3 = −7
5β̂_1 + 15β̂_2 + 20β̂_3 = −26

|X_2'AX_2| = 2500

Gaussian elimination reduces the system to
10β̂_1 + 10β̂_2 + 5β̂_3 = 7
20β̂_2 + 10β̂_3 = −14
25β̂_3 = −45

β̂_3 = −45/25 = −9/5 = −1.8
From the second row: 20β̂_2 = −14 − 10β̂_3 = −14 − 10(−1.8) = −14 + 18 = 4, so β̂_2 = 4/20 = 0.20
Substituting for β̂_2 and β̂_3 in 10β̂_1 + 10β̂_2 + 5β̂_3 = 7 yields β̂_1 = (7 − 10(0.2) − 5(−1.8))/10 = 14/10 = 1.4

Thus the technique used for deriving the slope coefficients is immaterial. However, the inverse method provides a component needed for the
variances of β̂_1, β̂_2 and β̂_3.
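Example 1.2 works entirely from the deviation moment matrices, so it can be verified with one linear solve:

```python
import numpy as np

# Moment matrices from Example 1.2 (deviations from means, 24 observations)
S_xx = np.array([[10.0, 10.0,  5.0],
                 [10.0, 30.0, 15.0],
                 [ 5.0, 15.0, 20.0]])   # X2'AX2
s_xy = np.array([7.0, -7.0, -26.0])     # X2'Ay

b = np.linalg.solve(S_xx, s_xy)         # slope estimates
print(b)                                # beta1_hat, beta2_hat, beta3_hat
print(np.linalg.det(S_xx))              # determinant of X2'AX2
```

This reproduces β̂ = (1.4, 0.2, −1.8) and |X_2'AX_2| = 2500 from the solution above.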
Example 1.3: Given the data in Example 1.2 and the slope coefficients in the solution, calculate the variance-covariance matrix of the estimates and the multiple coefficient of determination.
Solution: Given
y'Ay = 60 = TSS
b_2 = (1.4, 0.2, −1.8)'
ESS = b_2'X_2'AX_2b_2 = b_2'X_2'Ay = 1.4(7) + 0.2(−7) + (−1.8)(−26) = 9.8 − 1.4 + 46.8 = 55.2
so RSS = 60 − 55.2 = 4.8 and R² = ESS/TSS = 55.2/60 = 0.92.
With s² = RSS/(n − k) = 4.8/(24 − 4) = 0.24, the variance-covariance matrix of the slope estimates is s²(X_2'AX_2)⁻¹.
Common forms of hypotheses on the coefficients include:
(i) H_0: β_i = 0. This is generally referred to as the significance test. It basically tests whether a particular regressor X_i has no influence on the regressand.
(ii) H_0: β_i = β_i0. This implies β_i has a specified value. If for instance β_i denotes the marginal propensity to consume in Sierra Leone, one might be tempted to test that the marginal propensity takes a given value, or test that a price elasticity is unitary, β_i = 1.
(iii) H_0: β_2 + β_3 = 1. If β_2 and β_3 indicate the capital and labour elasticities of a production function, this formulation is tantamount to testing for constant returns to scale (CRTS).
(iv) H_0: β_3 = β_4, i.e. two coefficients are equal.
(v) H_0: β_2 = β_3 = … = β_k = 0. This is known as the joint significance test. It sets up the hypothesis that the complete set of regressors has no effect on the dependent variable (say Y); it tests the overall goodness of fit.
(vi) H_0: β_2 = 0, where the vector β is partitioned into two subvectors, β_1 (containing k_1 elements) and β_2 (containing k − k_1 = k_2 elements). This sets up the hypothesis that a specified subset of regressors plays no role in the determination of the dependent variable (Y say).
The above forms of linear hypotheses are incorporated in
Rβ = r
where
R – q × k matrix of known constants, q ≤ k
r – q × 1 vector of known constants
Each null hypothesis determines the relevant elements in R and r. For the above examples we have:
(i) R = [0 0 … 1 … 0], r = 0, q = 1, with 1 in the ith position. For a 3-parameter case, R = [0 1 0], r = 0, q = 1, i.e. β_2 = 0.
(iv) R = [0 0 1 −1 0 … 0], r = 0, q = 1, i.e. β_3 − β_4 = 0.
(v) R = [0 I_{k−1}], r = 0, q = k − 1, where 0 is a column of k − 1 zeros, i.e.
R = 0 1 0 … 0
    0 0 1 … 0
    …………………
    0 0 0 … 1,  r = 0
(vi) R = [0_{k_2×k_1} I_{k_2}], r = 0, q = k_2, where 0 is a null matrix of order k_2 × k_1 and r is a k_2-element column vector; that is, the last k_2 elements in β are jointly zero. For example, in an equation explaining the rate of inflation, the explanatory variables might be grouped into two subsets: those measuring expectations and those measuring pressure of demand. The significance of either subset might be tested by using this formulation, with the numbering of the variables so arranged that those in the subset to be tested come at the end.
To devise a practical test procedure we need to determine the sampling distribution of Rβ̂.
E(Rβ̂) = RE(β̂) = Rβ
Var(Rβ̂) = E{R(β̂ − β)(β̂ − β)'R'} = σ²R(X'X)⁻¹R'
Since β̂ is multivariate normal,
Rβ̂ ~ N(Rβ, σ²R(X'X)⁻¹R')
or Rβ̂ − Rβ ~ N(0, σ²R(X'X)⁻¹R') …(1.36)
We can replace Rβ in equation (1.36) by r. If H_0: Rβ = r is true,
Rβ̂ − r ~ N(0, σ²R(X'X)⁻¹R'), thus
(Rβ̂ − r)'[σ²R(X'X)⁻¹R']⁻¹(Rβ̂ − r) ~ χ²(q) …(1.38)
Also e'e/σ² ~ χ²(n − k); thus we can form an F ratio so that the unknown σ² will cancel out. If Rβ = r is true, then
F = {(Rβ̂ − r)'[R(X'X)⁻¹R']⁻¹(Rβ̂ − r)/q} / {e'e/(n − k)} ~ F(q, n − k) …..(1.40)
The test procedure is to reject the hypothesis Rβ = r if the computed F value exceeds the preselected critical value.
(i) For the joint test H_0: β_1 = β_2 = β_3 = 0 with the data of Example 1.2, the critical value is F(3, 20) = 3.10. Since F_cal > F(3, 20), the data are not consistent with the null hypothesis β_1 = β_2 = β_3 = 0.
(ii) R = [1 1 1], r = 0, q = 1:
R(X_2'AX_2)⁻¹R' = [1 1 1] [0.15 −0.05 0; −0.05 0.07 −0.04; 0 −0.04 0.08] [1; 1; 1] = 0.12
Rβ̂ − r = [1 1 1](1.4, 0.2, −1.8)' − 0 = −0.2
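The restriction test in part (ii) can be assembled numerically. The sketch below takes s² = RSS/(n − k) = 4.8/20 = 0.24, an assumed value computed from the sums given in Examples 1.2 and 1.3:

```python
import numpy as np

# Restriction R b = r with R = [1 1 1], r = 0, using Example 1.2 quantities
S_xx = np.array([[10.0, 10.0,  5.0],
                 [10.0, 30.0, 15.0],
                 [ 5.0, 15.0, 20.0]])  # X2'AX2
b = np.array([1.4, 0.2, -1.8])         # estimated slopes
R = np.array([[1.0, 1.0, 1.0]])
r = np.array([0.0])
q = 1
s2 = 4.8 / 20                          # assumed s^2 = RSS/(n - k)

mid = (R @ np.linalg.inv(S_xx) @ R.T).item()  # R (X2'AX2)^(-1) R', = 0.12 here
d = (R @ b - r).item()                        # R b - r, = -0.2 here
F = (d * d / mid) / q / s2                    # F statistic with (q, n - k) d.f.
print(mid, d, F)
```

The computed F is compared with the critical value F(1, 20) at the chosen significance level; the hypothesis β_1 + β_2 + β_3 = 0 is rejected only if F exceeds it.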