Regression Modelling and Least-Squares: GSA Short Course: Session 1 Regression
The aim of regression modelling is to explain the observed variation in a response variable, Y , by
relating it to some collection of predictor variables, X1 , X2 , ..., XD . The variation is decomposed into
a systematic component, that is, a deterministic function of the predictor variables, and a random
component that is the result of measurement procedures (or unmeasurable variability). The simplest
model is the Normal Linear Model, where the systematic component is a function of the predictors
and some model parameters, β, and the random variation is assumed to be the result of additive
normally distributed random error terms. This model is explained in section 1.
We assume that the variables to be modelled are as follows; we will observe paired data, with
response data yi paired to predictor variables stored in vector form xi = (xi1 , ..., xiD )T , and our aim is
to explain the variation in (y1 , ..., yn ). We achieve this by modelling the conditional distribution of
response variable Yi given the observed value of predictor variable Xi = xi . Specifically, we may write
Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_D x_{iD} + \varepsilon_i = \beta_0 + \sum_{j=1}^{D} \beta_j x_{ij} + \varepsilon_i    (1)

where ε_i ∼ N(0, σ²) for i = 1, ..., n are independent and identically distributed random error terms.
Note that this implies
Y_i | X_i = x_i \sim N\left( \beta_0 + \sum_{j=1}^{D} \beta_j x_{ij},\ \sigma^2 \right) \quad \therefore \quad E_{f_{Y|X}}[Y_i | X_i = x_i] = \beta_0 + \sum_{j=1}^{D} \beta_j x_{ij}.    (2)
Collecting the n observations together, the model can be written in matrix-vector form as

Y = X\beta + \varepsilon
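The matrix form of the model is straightforward to simulate from. The following sketch, in Python with NumPy, generates data from the normal linear model; the dimensions and parameter values are purely illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, D = 100, 2
# Design matrix: a column of ones for the intercept, plus D predictor columns
X = np.column_stack([np.ones(n), rng.normal(size=(n, D))])
beta = np.array([1.0, 2.0, -0.5])   # illustrative true parameters
sigma = 0.5

# Y = X beta + eps, with eps_i ~ N(0, sigma^2) independent
eps = rng.normal(scale=sigma, size=n)
y = X @ beta + eps
```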
© David A. Stephens 2005
The formulation of the linear model above can be extended to allow for more general dependence
on the predictors. Suppose that g1 , g2 , ..., gK are K (potentially non-linear) functions of the D original
predictors, that is
gk (xi ) = gk (xi1 , ..., xiD )
is some scalar function; for example, the g_k could be polynomial terms such as g_k(x_i) = x_{i1}^k, or interaction terms such as g_k(x_i) = x_{i1} x_{i2}, and so on. This reformulation does not affect our probabilistic definition of the model in (3); we can
simply redefine design matrix X as
X = \begin{pmatrix} 1 & g_1(x_1) & \cdots & g_K(x_1) \\ 1 & g_1(x_2) & \cdots & g_K(x_2) \\ 1 & g_1(x_3) & \cdots & g_K(x_3) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & g_1(x_n) & \cdots & g_K(x_n) \end{pmatrix}
now an n × (K + 1) matrix. In the discussion below, we will regard the transformed variables
(g1 (X), g2 (X), ..., gK (X)) as the predictors and drop the dependence on the transformation functions.
Hence we have
• Y as an n × 1 column vector
• β as a (K + 1) × 1 column vector
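The transformed design matrix is easy to assemble numerically. The following sketch, in Python with NumPy, builds X for a single original predictor (D = 1) using polynomial basis functions as the g_k; the choice of polynomials, and the sample size, are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(-1.0, 1.0, size=n)   # a single original predictor (D = 1)

# Illustrative choice of K = 3 transformation functions g_k
g = [lambda t: t, lambda t: t**2, lambda t: t**3]
K = len(g)

# Design matrix: a leading column of ones, then g_1(x_i), ..., g_K(x_i)
X = np.column_stack([np.ones(n)] + [gk(x) for gk in g])
print(X.shape)   # n x (K + 1)
```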
Equation (4) illustrates a way that parameter estimates can be obtained. For any parameter vector β, the fitted value is Xβ, and thus the term in the exponent

S(\beta) = (y - X\beta)^T (y - X\beta)

is a measure of goodness of fit; if S(β) is small, then y and the fitted values are closely located. It can be shown (below) that the minimum value of S(β) is obtained when

X^T X \beta = X^T y.
However, without further distributional assumptions, we cannot proceed to understand the uncertainty
in the estimation.
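The claim that S(β) is minimised at the solution of the normal equations can be checked numerically. The following sketch, in Python with NumPy, solves X^T Xβ = X^T y on simulated data and confirms that perturbing the solution only increases the sum of squares; all data and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

def S(beta):
    """Sum of squares S(beta) = (y - X beta)^T (y - X beta)."""
    r = y - X @ beta
    return r @ r

# Solve the normal equations  X^T X beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Any perturbation away from beta_hat increases S(beta)
for _ in range(100):
    assert S(beta_hat) <= S(beta_hat + rng.normal(scale=0.1, size=p))
```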
Maximum likelihood estimation for the normal linear model is straightforward. If θ = (β, σ²) then the mle θ̂ is given by

\hat{\theta}_{ML} = \arg\max_{\theta \in \Theta} f_{Y|\beta,\sigma^2}(y; \beta, \sigma^2) = \arg\max_{\theta \in \Theta} L(\beta, \sigma^2; y, x)
and thus, expanding the sum of squares,

S(\beta) = (y - X\beta)^T (y - X\beta) = y^T y - y^T X\beta - \beta^T X^T y + \beta^T X^T X\beta = y^T y - 2 y^T X\beta + \beta^T X^T X\beta.

Differentiating with respect to β, using the results

\frac{d}{d\beta}\left\{ y^T X\beta \right\} = X^T y \qquad \frac{d}{d\beta}\left\{ \beta^T X^T X\beta \right\} = 2 X^T X\beta    (6)
and setting the resulting derivative to zero recovers the normal equations, giving \hat{\beta} = (X^T X)^{-1} X^T y. Maximizing the likelihood over σ² then yields

\hat{\sigma}^2 = \frac{1}{n} (y - X\hat{\beta})^T (y - X\hat{\beta}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (9)
where \hat{y}_i = x_i^T \hat{\beta} is the fitted value, and y_i - \hat{y}_i is the residual. Note that X^T X is a symmetric matrix. The expression (y - X\hat{\beta})^T (y - X\hat{\beta}) is termed the residual sum of squares (or RSS). A common adjusted estimate is

\hat{\sigma}^2_{ADJ} = \frac{1}{n - K - 1} (y - X\hat{\beta})^T (y - X\hat{\beta})    (10)

the justification for this result depends on the sampling distribution of the estimator.
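The ML estimate (9) and the adjusted estimate (10) can be compared on simulated data. The following sketch, in Python with NumPy, fits the model and computes both; the sample size and parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, K))])
y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(scale=2.0, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
rss = resid @ resid                  # residual sum of squares (RSS)

sigma2_ml = rss / n                  # ML estimate, equation (9)
sigma2_adj = rss / (n - K - 1)       # adjusted estimate, equation (10)
print(sigma2_ml, sigma2_adj)
```

The adjusted estimate is always slightly larger, since it divides the same RSS by the smaller quantity n − K − 1.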
For the special case K = 1, writing x̄ for the sample mean of the x_i, the estimator can be computed explicitly: we have

S_x^2 = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - (\bar{x})^2

and so

\hat{\beta} = (X^T X)^{-1} X^T y = \frac{1}{n^2 S_x^2} \begin{pmatrix} \sum_{i=1}^{n} x_i^2 & -\sum_{i=1}^{n} x_i \\ -\sum_{i=1}^{n} x_i & n \end{pmatrix} \begin{pmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} x_i y_i \end{pmatrix} = \frac{1}{n^2 S_x^2} \begin{pmatrix} \sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} x_i y_i \\ n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i \end{pmatrix}
For K > 1 calculations can proceed in this fashion, but generally the matrix form

\hat{\beta} = (X^T X)^{-1} X^T y

is used directly.
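The K = 1 summation formulas and the general matrix form give identical answers, which can be verified numerically. The following sketch, in Python with NumPy, computes both on simulated data; the data-generating values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=n)

# Closed-form K = 1 solution via the summation formulas
Sx2 = np.mean(x**2) - np.mean(x)**2
b0 = (np.sum(x**2) * np.sum(y) - np.sum(x) * np.sum(x * y)) / (n**2 * Sx2)
b1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n**2 * Sx2)

# Matrix form (X^T X)^{-1} X^T y with an explicit intercept column
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```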
Fitted values are readily obtained from this model. The fitted value ŷ is obtained as

\hat{y} = X\hat{\beta} = X (X^T X)^{-1} X^T y = Hy

where H = X (X^T X)^{-1} X^T.
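The matrix H, which maps the observed y to the fitted values, is symmetric and idempotent, and its trace equals the number of fitted parameters. A numerical check in Python with NumPy (dimensions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.normal(size=n)

# H = X (X^T X)^{-1} X^T; fitted values are y_hat = H y
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y

assert np.allclose(H, H.T)      # symmetric
assert np.allclose(H @ H, H)    # idempotent
```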
or
S(\beta) = S(\hat{\beta}) + (\hat{\beta} - \beta)^T (X^T X) (\hat{\beta} - \beta)
where

S(\beta) = (y - X\beta)^T (y - X\beta)    (1)

S(\hat{\beta}) = (y - X\hat{\beta})^T (y - X\hat{\beta}) = (y - \hat{y})^T (y - \hat{y}) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (2)
\frac{S(\beta)}{\sigma^2} \sim \chi^2_{n} \qquad \frac{S(\hat{\beta})}{\sigma^2} \sim \chi^2_{n-K-1}    (12)
so that

s^2 = \frac{S(\hat{\beta})}{n - K - 1} \quad \text{is an UNBIASED estimator of } \sigma^2
and the quantity

\frac{\hat{\beta}_i - \beta_i}{s.e.(\hat{\beta}_i)} = \frac{\hat{\beta}_i - \beta_i}{s \sqrt{v_{ii}}} \sim Student(n - K - 1)

where v_{ii} is the i-th diagonal element of (X^T X)^{-1}.
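These results give coefficient standard errors and Student-t confidence intervals directly. The following sketch, in Python with NumPy and SciPy, computes them on simulated data; the sample size, parameters, and the choice of coefficient are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, K = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, K))])
y = X @ np.array([1.0, 0.5, -0.25]) + rng.normal(size=n)

V = np.linalg.inv(X.T @ X)            # (X^T X)^{-1}; diagonal entries v_ii
beta_hat = V @ X.T @ y
resid = y - X @ beta_hat
s2 = (resid @ resid) / (n - K - 1)    # unbiased estimator s^2

se = np.sqrt(s2 * np.diag(V))         # standard errors s * sqrt(v_ii)

# 95% confidence interval for beta_1 from the Student(n - K - 1) distribution
t_crit = stats.t.ppf(0.975, df=n - K - 1)
ci = (beta_hat[1] - t_crit * se[1], beta_hat[1] + t_crit * se[1])
print(ci)
```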
It also follows that
\frac{S(\beta) - S(\hat{\beta})}{\sigma^2} = \frac{(\hat{\beta} - \beta)^T (X^T X) (\hat{\beta} - \beta)}{\sigma^2} \sim \chi^2_{K+1}
so that finally

\frac{\left[ S(\beta) - S(\hat{\beta}) \right] / (K + 1)}{S(\hat{\beta}) / (n - K - 1)} \sim Fisher(K + 1, n - K - 1)
It follows that in this case the ML estimator is the Minimum Variance Unbiased Estimator (MVUE)
and the Best Linear Unbiased Estimator (BLUE).
Now, using the distributional results above, we can construct the following ANOVA Table to test the
hypothesis
H_0: \beta_1 = \cdots = \beta_K = 0
against the general alternative that H0 is not true.
This test allows a comparison of the fits of the two competing models implied by the null and alternative
hypotheses. Under the null model, if H0 is true, then Yi ∼ N(β_0, σ_0²) for i = 1, 2, ..., n,
for some β 0 and σ 20 to be estimated. Under the alternative hypothesis, there are a total of K + 1
β parameters to be estimated using equation (8). The degrees of freedom column headed (D.F.)
details how many parameters are used to describe the amount of variation in the corresponding row of
the table; for example, for the FIT row, D.F. equals K as there are K parameters used to extend the
null model to the alternative model.
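The ANOVA comparison can be carried out numerically by fitting both models and forming the F statistic with K degrees of freedom for the FIT row, as described above. The following sketch, in Python with NumPy and SciPy, simulates data under the alternative (so the test should reject); all data-generating values are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, K = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, K))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
rss_full = np.sum((y - X @ beta_hat) ** 2)   # S(beta_hat): alternative model
rss_null = np.sum((y - np.mean(y)) ** 2)     # null model: intercept only

# F statistic for H0: beta_1 = ... = beta_K = 0 (FIT row has K d.f.)
F = ((rss_null - rss_full) / K) / (rss_full / (n - K - 1))
p_value = stats.f.sf(F, K, n - K - 1)
print(F, p_value)
```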