GLM Lec3


Stat 7410: Estimation for the General Linear Model (Prof. Goel)

Broad Outline

General Linear Model (GLM):

 Training Sample Model: Given $n$ observations $[(Y_i, x_i'),\; x_i' = (x_{i1}, \ldots, x_{ir})]$, $i = 1, 2, \ldots, n$, the sample model can be expressed as
$$Y_i = \mu(x_{i1}, x_{i2}, \ldots, x_{ir}) + \varepsilon_i, \quad i = 1, 2, \ldots, n, \qquad (1)$$
where $\varepsilon_i$, $i = 1, 2, \ldots, n$, denote the noise (random errors), each with mean zero and variance $\sigma^2$.

 From now on, we denote the features $f_j$, $j = 1, 2, \ldots, p$, themselves as coded predictor variables $x_1, x_2, \ldots, x_p$. In the simplest setting, the random errors are assumed to be uncorrelated with equal variance.

 Thus the sample GLM can be expressed as
$$Y_i = \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i, \quad E[\varepsilon_i] = 0, \quad \operatorname{Var}[\varepsilon_i] = \sigma^2, \quad \operatorname{Cov}(\varepsilon_i, \varepsilon_k) = 0, \; i \neq k. \qquad (2)$$
$$E[Y_i \mid x_i] = \mu_i = \sum_{j=1}^{p} \beta_j x_{ij}. \qquad (3)$$

 Vector/matrix notation for the response, predictor variables, error terms, and the unknown coefficients:
$$Y_{n\times 1} = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \quad \beta_{p\times 1} = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix}, \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}, \quad X_{n\times p} = \begin{pmatrix} x_{1.}' \\ x_{2.}' \\ \vdots \\ x_{n.}' \end{pmatrix}.$$
Also, let $x_{.j} = (x_{1j}, x_{2j}, \ldots, x_{nj})'$ denote the $j$th column of $X$, i.e., $X_{n\times p} = [x_{.1}, x_{.2}, \ldots, x_{.p}]$.
 Given the response vector $Y$ and the design matrix $X$, the sample GLM can be written as
$$Y = X\beta + \varepsilon, \quad E[\varepsilon] = 0, \quad \operatorname{Cov}[\varepsilon] = ((\operatorname{Cov}(\varepsilon_i, \varepsilon_j))) = ((E(\varepsilon_i \varepsilon_j))) = E[\varepsilon\varepsilon'] = \sigma^2 I. \qquad (4)$$

 Thus, $E[Y] = \mu = X\beta$, and $\operatorname{Cov}(Y) = E[(Y - X\beta)(Y - X\beta)'] = E[\varepsilon\varepsilon'] = \sigma^2 I. \qquad (5)$
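As a concrete illustration of (4)-(5), here is a minimal NumPy sketch that simulates data from the sample GLM. The sizes n and p, the coefficient values, and sigma are arbitrary choices made only for this example; they are not taken from the notes.

```python
# Minimal simulation of the sample GLM (4)-(5): Y = X beta + eps with
# E[eps] = 0 and Cov[eps] = sigma^2 I.  All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 100, 3, 0.5

X = rng.normal(size=(n, p))            # design matrix (full column rank here)
beta = np.array([1.0, -2.0, 0.5])      # assumed "true" coefficient vector
eps = rng.normal(scale=sigma, size=n)  # uncorrelated errors with variance sigma^2

Y = X @ beta + eps                     # response vector, E[Y] = X beta
```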
 Ordinary Least Squares (OLS): For a candidate value $\beta$ of the coefficient vector, the corresponding Residual Sum of Squares is
$$l_2(\beta) = \sum_{i=1}^{n} \Big( Y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 = \sum_{i=1}^{n} e_i^2(\beta) = e'(\beta)e(\beta).$$

• Problem: Find an estimated coefficient vector $\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p} l_2(\beta)$.

 In matrix notation,
$$\min_{\beta \in \mathbb{R}^p} e'(\beta)e(\beta) = \min_{\beta \in \mathbb{R}^p} (Y - X\beta)'(Y - X\beta) = e'(\hat{\beta})e(\hat{\beta}). \qquad (6)$$

 Expand $S(\beta) = (Y - X\beta)'(Y - X\beta) = Y'Y - Y'X\beta - \beta'X'Y + \beta'X'X\beta$.

 On setting the partial derivatives of $S(\beta)$ with respect to $\beta$ equal to zero, we get the Normal Equations
$$X'X\beta = X'Y. \qquad (7)$$

 Any solution of (7) is an optimal solution to the OLS problem.

 Full rank case:

 Example – Simple Linear Regression; Multiple Regression with linearly independent features.
 If the matrix $X'X$ is non-singular (the design matrix $X$ is of full column rank $p$), the inverse of $X'X$ exists, and the unique optimal least squares solution is
$$\hat{\beta} = (X'X)^{-1}X'Y. \qquad (8)$$

 Note that $E[\hat{\beta}] = (X'X)^{-1}X'E[Y] = (X'X)^{-1}X'X\beta = \beta$.
 Since $\operatorname{Cov}(TY) = T\operatorname{Cov}(Y)T'$, taking $T = (X'X)^{-1}X'$ gives
$$\operatorname{Cov}(\hat{\beta}) = \operatorname{Cov}(TY) = \sigma^2[TIT'] = \sigma^2 (X'X)^{-1}.$$
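A brief numerical sketch of the full-rank solution (8) and its covariance, continuing the simulated X, Y, and sigma from the snippet above; np.linalg.lstsq is used only as a cross-check.

```python
# Full-rank OLS via the normal equations (7)-(8), using the simulated data above.
XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ Y)        # beta_hat = (X'X)^{-1} X'Y

# Cross-check against a library least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)

cov_beta_hat = sigma**2 * np.linalg.inv(XtX)    # Cov(beta_hat) = sigma^2 (X'X)^{-1}
```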
 Not full-rank case:

 Example – ANOVA for Designed Experiments

 The normal equations (7) are consistent, but the system has infinitely many solutions. Each solution can be expressed as $\beta^0 = (X'X)^{-}X'Y$, where $(X'X)^{-}$ is a generalized inverse of $X'X$. In fact,

• $E[\beta^0] = (X'X)^{-}X'E[Y] = (X'X)^{-}(X'X)\beta = H\beta \neq \beta$, so some components of $\beta$ do not possess unbiased estimators.

 The biases in $\beta^0$ may depend on the generalized inverse used in obtaining the particular OLS solution.

• These estimators cannot all be regarded as optimal estimators of the vector $\beta$.

 Why?
 Optimal with respect to what criteria? One may need to add an additional criterion to OLS to get a unique solution, e.g.,

• Minimum norm OLS estimator: $\beta^{+} = (X'X)^{+}X'Y$, where $(X'X)^{+}$ is the Moore-Penrose generalized inverse (pseudo-inverse) of $X'X$.
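A sketch of the minimum-norm estimator in the non-full-rank case. X_def below is an artificial rank-deficient design (its third column is the sum of the first two) built only for illustration, reusing Y from the earlier simulation.

```python
# Minimum-norm OLS estimator beta_plus = (X'X)^+ X'Y via the Moore-Penrose
# pseudo-inverse, on an artificial rank-deficient design X_def (rank 2, p = 3).
X_def = np.column_stack([X[:, 0], X[:, 1], X[:, 0] + X[:, 1]])

beta_plus = np.linalg.pinv(X_def.T @ X_def) @ X_def.T @ Y
# np.linalg.pinv(X_def) @ Y returns the same minimum-norm solution.
```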

 Coordinate Free (Vector Space) Approach: Interpret the model $\mu = X\beta$ as $\mu \in C[X]$. Note that $\hat{\mu} = X\beta^0 = PY = X(X'X)^{-}X'Y$, the projection of $Y$ onto $C[X]$. The symmetric matrix $P = X(X'X)^{-}X'$ is the orthogonal projection matrix onto $C[X]$, the space spanned by the columns of $X$.

 Even though in the non-full-rank case there are infinitely many solutions (in $\beta$) to the normal equations, the projection $\hat{\mu} = PY$ (also called $\hat{Y}$) is unique, i.e., the matrix $P$ does not change with the choice of generalized inverse of $X'X$.
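This invariance can be checked numerically. The sketch below, continuing the previous snippets, builds a second generalized inverse of $X'X$ from the pseudo-inverse (any matrix of the form $A^{+} + (I - A^{+}A)W$ is a generalized inverse of $A$) and compares the resulting projection matrices.

```python
# P = X (X'X)^- X' does not depend on the choice of generalized inverse,
# and it is symmetric and idempotent.
A = X_def.T @ X_def
A_plus = np.linalg.pinv(A)

W = rng.normal(size=A.shape)                    # arbitrary matrix
A_ginv = A_plus + (np.eye(A.shape[0]) - A_plus @ A) @ W
assert np.allclose(A @ A_ginv @ A, A)           # A_ginv is a generalized inverse of A

P1 = X_def @ A_plus @ X_def.T
P2 = X_def @ A_ginv @ X_def.T
assert np.allclose(P1, P2)                      # same projection matrix
assert np.allclose(P1 @ P1, P1)                 # idempotent
assert np.allclose(P1, P1.T)                    # symmetric
```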

 For a vector $u \in C[X]$, i.e., $u = Xb$ for some vector $b$,
$$Pu = X(X'X)^{-}X'u = X(X'X)^{-}X'Xb = Xb = u.$$

• Thus the projection of a vector $u \in C[X]$ onto $C[X]$ is $u$ itself.


 Furthermore, for an arbitrary vector $Y \in V_n$, $PY \in C[X]$; therefore, $P(PY) = PY$ holds for all $Y \in \mathbb{R}^n$. Hence $(P^2 - P) = 0 \Leftrightarrow P(I - P) = 0$. Thus $P^2 = P$, i.e., $P$ is an idempotent matrix. Is $P$ a symmetric matrix?

• Fact: Every symmetric idempotent matrix is an orthogonal projection matrix onto the space spanned by its columns.

 Since $P(I - P) = 0$, the rows (columns) of $P$ are orthogonal to the columns (rows) of $(I - P)$, i.e., $Py$ and $(I - P)y$ are orthogonal.

 Note that $X\beta^0 = \hat{\mu} = PY = \hat{Y}$, and $Y - \hat{Y} = (I - P)Y = e$, the vector of residuals.

 Therefore, the vectors $\hat{Y}$ and $e$ are orthogonal, i.e., $\hat{Y}'e = \sum_{i=1}^{n} \hat{y}_i e_i = 0$.
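Continuing the sketch above, the fitted values and residuals from the projection are orthogonal up to floating-point error:

```python
Y_hat = P1 @ Y        # mu_hat = P Y, the fitted values
e = Y - Y_hat         # (I - P) Y, the residuals
print(Y_hat @ e)      # approximately 0: Y_hat and e are orthogonal
```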

 Example: Details of full-rank linear regression model.

 The Key Question:

• How do we characterize the class of linear functions $c'\beta = \sum_{j=1}^{p} c_j \beta_j$ that can be estimated uniquely through the least squares solutions?

 Estimable functions: A linear parametric function $c'\beta$ is said to be estimable if there exists at least one unbiased estimator of it.

• If there does not exist any unbiased estimator of the linear function $c'\beta$, it is said to be non-estimable.

• Why consider estimable functions? We will discuss their connection with the concept of identifiability.

 Note that $Y_i$ is an unbiased estimator of $x_{i.}'\beta$. (Why?) Thus $x_{i.}'\beta$ is estimable for each row of the matrix $X$. Hence $c'\beta$, where the vector $c'$ is some linear combination of the rows of $X$, is also estimable.

 Fact: A linear parametric function $c'\beta$ of the $\beta_j$'s is estimable if and only if $c \in C(X') \equiv$ the row space of $X$. (Prove it.)

• $c \in C(X') \Leftrightarrow c = X'l$ for some $l \in \mathbb{R}^n$.
 Therefore, for any OLS solution $\beta^0$, $c'\beta^0 = l'X(X'X)^{-}X'Y = l'PY = l'\hat{\mu}$.

 Thus $c'\beta^0$ is invariant to the choice of generalized inverse [a unique OLS-based unbiased estimator of $c'\beta$].
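A numerical illustration of this invariance, continuing the earlier sketches: for $c = X'l$ the value $c'\beta^0$ agrees across different generalized-inverse solutions of the normal equations, whereas a non-estimable direction generally does not.

```python
# c'beta0 is invariant to the generalized inverse when c lies in the row space of X_def.
l = rng.normal(size=n)
c = X_def.T @ l                      # c = X'l, hence estimable

beta0_a = A_plus @ X_def.T @ Y       # minimum-norm solution of the normal equations
beta0_b = A_ginv @ X_def.T @ Y       # another solution of the normal equations
print(c @ beta0_a, c @ beta0_b)      # equal up to rounding

# A non-estimable direction such as c = (0, 0, 1)' generally gives
# different values for the two solutions.
```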

 Gauss-Markov Theorem: $c'\beta^0$ is the Best (Minimum Variance) Linear Unbiased Estimator (B.L.U.E.) of an estimable linear function $c'\beta$.

 Generalized Least Squares: $\operatorname{Var}(\varepsilon) = \sigma^2 V$, where $V$ is a known positive definite matrix.

• Reduce this problem to an OLS problem by a non-singular transformation.
• Since $V$ is a positive definite matrix, there exists a non-singular matrix $T$ such that $V^{-1} = T'T$.
• Now consider the linear transformation $Z = TY$.
• Note $E(Z) = TE(Y) = TX\beta$ and $\operatorname{Cov}(Z) = \sigma^2 TVT' = \sigma^2 T(T'T)^{-1}T' = \sigma^2 I$.
• We can therefore solve the OLS problem for $Z$ with design matrix $TX$.
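A sketch of this reduction to OLS. Here $V$ is an arbitrary positive-definite (AR(1)-type) matrix chosen only to illustrate the mechanics, and $T$ is obtained from a Cholesky factorization so that $V^{-1} = T'T$; the simulated data from the earlier snippets are reused purely for demonstration.

```python
# GLS as OLS on the transformed data Z = T Y, with design T X and V^{-1} = T'T.
rho = 0.6
idx = np.arange(n)
V = rho ** np.abs(np.subtract.outer(idx, idx))   # illustrative p.d. covariance pattern

Lc = np.linalg.cholesky(np.linalg.inv(V))        # V^{-1} = Lc Lc'
T = Lc.T                                         # so V^{-1} = T'T
Z = T @ Y                                        # Cov(Z) = sigma^2 I under Var(eps) = sigma^2 V
XT = T @ X                                       # transformed design matrix

beta_gls, *_ = np.linalg.lstsq(XT, Z, rcond=None)   # OLS on (XT, Z) is the GLS estimate
```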
 Background – Vector differentiation:
The vectors of partial derivatives of a linear form $l'u = \sum_{i=1}^{p} l_i u_i$ and of a quadratic form $u'Au = \sum_{i=1}^{p}\sum_{j=1}^{p} a_{ij} u_i u_j$, for a symmetric matrix $A$, are
$$\frac{\partial}{\partial u}(l'u) = \begin{pmatrix} \frac{\partial}{\partial u_1}(l'u) \\ \vdots \\ \frac{\partial}{\partial u_p}(l'u) \end{pmatrix} = \begin{pmatrix} l_1 \\ \vdots \\ l_p \end{pmatrix} = l; \qquad \frac{\partial}{\partial u}(u'Au) = \begin{pmatrix} \frac{\partial}{\partial u_1}(u'Au) \\ \vdots \\ \frac{\partial}{\partial u_p}(u'Au) \end{pmatrix} = \begin{pmatrix} 2a_{11}u_1 + 2\sum_{j\neq 1} a_{1j}u_j \\ \vdots \\ 2a_{pp}u_p + 2\sum_{j\neq p} a_{pj}u_j \end{pmatrix} = 2\begin{pmatrix} a_1'u \\ \vdots \\ a_p'u \end{pmatrix} = 2Au.$$
When $A$ is not symmetric, $u'Au = u'\{(A + A')/2\}u$, with $\{(A + A')/2\}$ symmetric. Therefore,
$$\frac{\partial}{\partial u}(u'Au) = (A + A')u.$$
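These gradient formulas can be sanity-checked numerically. The sketch below, continuing the earlier snippets, compares the analytic gradient of $u'Au$ (for a general, not necessarily symmetric $A$) with a central finite-difference approximation.

```python
# Finite-difference check of d/du (u'Au) = (A + A')u.
k = 4
A_q = rng.normal(size=(k, k))               # arbitrary, not necessarily symmetric
u = rng.normal(size=k)
f = lambda v: v @ A_q @ v                   # quadratic form u'Au

grad_analytic = (A_q + A_q.T) @ u
h = 1e-6
grad_fd = np.array([(f(u + h * ei) - f(u - h * ei)) / (2 * h) for ei in np.eye(k)])
print(np.max(np.abs(grad_analytic - grad_fd)))   # small discrepancy, ~1e-8 or less
```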
