
MATH5806 Applied Regression Analysis

Lecture 3 - Linear Gaussian models

Boris Beranger
Term 2, 2021

1/112
Chapter 3 - Linear Gaussian models
3.1 - Introduction
3.2 - Basic results
3.2.1 - Maximum likelihood estimation
3.2.2 - Least squares estimation
3.3 - Diagnostics

3.4 - Analysis of Variance (ANOVA)

3.5 - Analysis of Covariance (ANCOVA)

3.6 - General linear models


3.7 - Extensions
3.8 - Potential problems
2/112
Chapter 3 - Linear Gaussian models
3.1 - Introduction
3.2 - Basic results
3.2.1 - Maximum likelihood estimation
3.2.2 - Least squares estimation
3.3 - Diagnostics

3.4 - Analysis of Variance (ANOVA)

3.5 - Analysis of Covariance (ANCOVA)

3.6 - General linear models


3.7 - Extensions
3.8 - Potential problems
3/112
Goals

This lecture is about obtaining estimates for

• Multivariate linear regression


• ANOVA

This is a case where analytical solutions exist.

4/112
Introduction

Linear models of the form

E(Y_i) = \mu_i = x_i^\top \beta, \qquad Y_i \sim N(\mu_i, \sigma^2),

where the r.v. Yi are independent, are the basis of most analyses of continuous data.
There are two main models of this form

• Multivariate regression: association between a continuous response and several


explanatory variables
• Analysis of variance (ANOVA): comparisons of more than two means
• Analysis of covariance (ANCOVA)

5/112
Introduction

This model (usually called the general linear model) is written as

y = Xβ + ε (1)

where

y> = [Y1 , . . . , YN ]
β > = [β1 , β2 , . . . , βp ]
ε> = [ε1 , . . . , εN ]
where the εi are i.i.d. N(0, σ²) for i = 1, . . . , N.
The error term captures everything the model misses: the true relationship may not be exactly
linear, relevant variables may be omitted, and there may be measurement error. The error term
is usually assumed to be independent of X.
6/112
Introduction

Additionally, X is an N × p design matrix, which in the case of a multiple regression


above is set to

 
X = \begin{bmatrix}
1 & X_{12} & X_{13} & \cdots & X_{1p} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & X_{N2} & X_{N3} & \cdots & X_{Np}
\end{bmatrix}

The parameter βj is interpreted as the average effect on Y of a one unit increase in the
covariate xj , holding all the other predictors fixed.

7/112
Introduction

The model is linear in the parameters, which means we have, for instance:

E(Y_i) = \beta_1 + \beta_2 X_{i2} + \beta_3 X_{i3}^2

E(Y_i) = \beta_1 + \gamma_1 \delta_1 X_{i2} + \exp(\beta_2) X_{i3}.

But NOT:
E(Y_i) = \beta_1 + \beta_2 X_{i2}^{\beta_2}

E(Y_i) = \beta_1 \exp(\beta_2 X_{i2}).

8/112
Chapter 3 - Linear Gaussian models
3.1 - Introduction
3.2 - Basic results
3.2.1 - Maximum likelihood estimation
3.2.2 - Least squares estimation
3.3 - Diagnostics

3.4 - Analysis of Variance (ANOVA)

3.5 - Analysis of Covariance (ANCOVA)

3.6 - General linear models


3.7 - Extensions
3.8 - Potential problems
9/112
Chapter 3 - Linear Gaussian models
3.1 - Introduction
3.2 - Basic results
3.2.1 - Maximum likelihood estimation
3.2.2 - Least squares estimation
3.3 - Diagnostics

3.4 - Analysis of Variance (ANOVA)

3.5 - Analysis of Covariance (ANCOVA)

3.6 - General linear models


3.7 - Extensions
3.8 - Potential problems
10/112
Maximum likelihood estimation

The joint distribution is

f(Y_1, \ldots, Y_N \mid \theta_1, \ldots, \theta_N) = \prod_{i=1}^N \exp\left[ Y_i b(\theta_i) + c(\theta_i) + d(Y_i) \right]
 = \exp\left[ \sum_{i=1}^N Y_i b(\theta_i) + \sum_{i=1}^N c(\theta_i) + \sum_{i=1}^N d(Y_i) \right]

For each Yi , the log-likelihood is


`i = Yi b(θi ) + c(θi ) + d(Yi ), (2)

which gives

E(Y_i) = \mu_i = -\frac{c'(\theta_i)}{b'(\theta_i)}, \qquad
\operatorname{Var}(Y_i) = \frac{b''(\theta_i)c'(\theta_i) - c''(\theta_i)b'(\theta_i)}{[b'(\theta_i)]^3}, \qquad
g(\mu_i) = x_i^\top \beta = \eta_i.
11/112
Maximum likelihood estimation

The log-likelihood for all the Yi ’s is then


\ell(\theta; Y_1, \ldots, Y_N) = \sum_{i=1}^N \ell_i = \sum_{i=1}^N Y_i b(\theta_i) + \sum_{i=1}^N c(\theta_i) + \sum_{i=1}^N d(Y_i).

To obtain the maximum likelihood estimator for the parameter βj we derive the score
function using the chain rule:
U_j = \frac{d\ell}{d\beta_j} = \sum_{i=1}^N \frac{d\ell_i}{d\beta_j}
    = \sum_{i=1}^N \left( \frac{d\ell_i}{d\theta_i} \cdot \frac{d\theta_i}{d\mu_i} \cdot \frac{d\mu_i}{d\beta_j} \right). \tag{3}

12/112
Maximum likelihood estimation

Now, let’s consider each term separately

• \dfrac{d\ell_i}{d\theta_i} = Y_i b'(\theta_i) + c'(\theta_i) = b'(\theta_i)(Y_i - \mu_i)

• \dfrac{d\theta_i}{d\mu_i} = \left( \dfrac{d\mu_i}{d\theta_i} \right)^{-1} = \left( \dfrac{-c''(\theta_i)}{b'(\theta_i)} + \dfrac{c'(\theta_i) b''(\theta_i)}{[b'(\theta_i)]^2} \right)^{-1} = \left( b'(\theta_i)\operatorname{Var}(Y_i) \right)^{-1}

• \dfrac{d\mu_i}{d\beta_j} = \dfrac{d\mu_i}{d\eta_i} \cdot \dfrac{d\eta_i}{d\beta_j} = \dfrac{d\mu_i}{d\eta_i}\, X_{ij}

Hence, the score function is

U_j = \sum_{i=1}^N \left[ \frac{(Y_i - \mu_i)}{\operatorname{Var}(Y_i)}\, X_{ij} \left( \frac{d\mu_i}{d\eta_i} \right) \right] \tag{4}

13/112
Maximum likelihood estimation

The variance-covariance matrix of the score is

Ijk = E[Uj Uk ] (5)

which represents the information matrix


\mathcal{I}_{jk} = E\left\{ \left[ \sum_{i=1}^N \frac{(Y_i - \mu_i)}{\operatorname{Var}(Y_i)}\, X_{ij} \left( \frac{d\mu_i}{d\eta_i} \right) \right] \left[ \sum_{l=1}^N \frac{(Y_l - \mu_l)}{\operatorname{Var}(Y_l)}\, X_{lk} \left( \frac{d\mu_l}{d\eta_l} \right) \right] \right\}
 = \sum_{i=1}^N \frac{E\left[ (Y_i - \mu_i)^2 \right] X_{ij} X_{ik}}{[\operatorname{Var}(Y_i)]^2} \left( \frac{d\mu_i}{d\eta_i} \right)^2

because E[(Y_i − µ_i)(Y_l − µ_l)] = 0 for i ≠ l.

14/112
Maximum likelihood estimation

Since E[(Yi − µi )2 ] = Var(Yi )


\mathcal{I}_{jk} = \sum_{i=1}^N \frac{X_{ij} X_{ik}}{\operatorname{Var}(Y_i)} \left( \frac{d\mu_i}{d\eta_i} \right)^2 \tag{6}

Furthermore, since dµi/dηi = 1 (identity link), the information takes the form

\mathcal{I}_{jk} = \sum_{i=1}^N \frac{X_{ij} X_{ik}}{\operatorname{Var}(Y_i)}

or in matrix notation

\mathcal{I} = \frac{1}{\sigma^2} X^\top X \tag{7}
15/112
Maximum likelihood estimation

If we want to apply the method of scoring to approximate the MLE, the estimating
equation is

β̂ (m) = β̂ (m−1) + [I (m−1) ]−1 u(m−1)


[I (m−1) ]β̂ (m) = [I (m−1) ]β̂ (m−1) + u(m−1) (8)

From Equation (6), the information matrix can be written as

I = X> WX (9)
where w_{ii} = \frac{1}{\operatorname{Var}(Y_i)} \left( \frac{d\mu_i}{d\eta_i} \right)^2 = \frac{1}{\sigma^2}

16/112
Maximum likelihood estimation

Finally, the expression on the right hand side of (8) can be written as
\sum_{k=1}^p \sum_{i=1}^N \frac{X_{ij} X_{ik}}{\operatorname{Var}(Y_i)} \left( \frac{d\mu_i}{d\eta_i} \right)^2 \hat\beta_k^{(m-1)} + \sum_{i=1}^N \frac{(Y_i - \mu_i) X_{ij}}{\operatorname{Var}(Y_i)} \left( \frac{d\mu_i}{d\eta_i} \right)

which can be written in matrix terms as

[\mathcal{I}^{(m-1)}]\, \hat\beta^{(m-1)} + u^{(m-1)} = X^\top W z \tag{10}

where

z_i = \sum_{k=1}^p X_{ik} \hat\beta_k^{(m-1)} + (Y_i - \mu_i) \frac{d\eta_i}{d\mu_i} = \sum_{k=1}^p X_{ik} \hat\beta_k^{(m-1)} + Y_i - \sum_{k=1}^p X_{ik} \hat\beta_k^{(m-1)} = Y_i

17/112
Maximum likelihood estimation

And, therefore,
\frac{1}{\sigma^2} X^\top X \hat\beta = \frac{1}{\sigma^2} X^\top y \implies \hat\beta = (X^\top X)^{-1} X^\top y
Properties:

• Unbiasedness

E(β̂) = (X> X)−1 X> E(Y) = (X> X)−1 X> Xβ = β

• Consistency: the variance-covariance matrix of β̂ is I −1 = σ 2 (X> X)−1 , which (under
standard conditions on X) vanishes as N → ∞, so β̂ is consistent

• Normality
β̂ ∼ N (β, σ 2 (X> X)−1 )
18/112
Maximum likelihood estimation

[Figure 1: Left: Simulated data. Red line gives the true relationship (population regression line), blue line gives the least squares estimate. Right: light blue lines are 10 least squares lines.]
19/112
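As a quick illustrative sketch (simulated data, not part of the lecture example), the result β̂ ∼ N(β, σ²(X⊤X)⁻¹) can be checked by repeatedly generating data from a known model and comparing the empirical mean and covariance of the estimates with β and σ²(X⊤X)⁻¹:

# Illustrative sketch: sampling distribution of the least squares estimator
set.seed(1)
N <- 100; sigma <- 2
x <- runif(N, -2, 2)
X <- cbind(1, x)                         # design matrix with intercept
beta <- c(1, 3)                          # "true" coefficients assumed for the simulation

est <- replicate(1000, {
  y <- X %*% beta + rnorm(N, sd = sigma)
  as.vector(solve(t(X) %*% X) %*% t(X) %*% y)   # beta-hat = (X'X)^{-1} X'y
})

rowMeans(est)                 # close to (1, 3): unbiasedness
cov(t(est))                   # close to sigma^2 (X'X)^{-1}
sigma^2 * solve(t(X) %*% X)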
Chapter 3 - Linear Gaussian models
3.1 - Introduction
3.2 - Basic results
3.2.1 - Maximum likelihood estimation
3.2.2 - Least squares estimation
3.3 - Diagnostics

3.4 - Analysis of Variance (ANOVA)

3.5 - Analysis of Covariance (ANCOVA)

3.6 - General linear models


3.7 - Extensions
3.8 - Potential problems
20/112
Least squares estimation

It is possible to derive an estimator without making any further assumption about the
distribution of y.

Under Gauss-Markov assumptions, i.e.

• E(ε) = 0
• E(εε> ) = σ 2 IN

and assuming that N > p, the least squares function is

RSS(β) = (y − Xβ)> (y − Xβ) = ε> ε (11)

This is a multivariate function in β which is (strictly) convex.

21/112
Least squares estimation

Hence there is a unique minimiser β̂, satisfying


\frac{d}{d\beta} (y - X\beta)^\top (y - X\beta) = \frac{d}{d\beta} \left( y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X\beta \right)
 = -2 X^\top y + 2 X^\top X \beta = 0
 \;\Longrightarrow\; X^\top X \beta = X^\top y

Assuming that X has full column rank p (which requires N ≥ p), X^\top X is invertible and we can write

\hat\beta = (X^\top X)^{-1} X^\top y.

While this estimator is unbiased, it is necessary to introduce an assumption on the


distribution of the y to derive the distribution of the estimator.
22/112
Least squares estimation

23/112
Chapter 3 - Linear Gaussian models
3.1 - Introduction
3.2 - Basic results
3.2.1 - Maximum likelihood estimation
3.2.2 - Least squares estimation
3.3 - Diagnostics

3.4 - Analysis of Variance (ANOVA)

3.5 - Analysis of Covariance (ANCOVA)

3.6 - General linear models


3.7 - Extensions
3.8 - Potential problems
24/112
Confidence intervals

The estimate of the uncertainty of the estimates is given by the standard error, SE(β̂):

SE(\hat\beta_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right], \qquad
SE(\hat\beta_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
where σ 2 = Var(ε).

Similarly in the presence of other covariates.

Remark: the SE(β̂1 ) is smaller when the Xi are more spread out: intuitively, we are
more able to estimate the slope of the line in this case.
25/112
Confidence intervals

Standard errors can be used to compute (1 − α)100% confidence intervals as:


\left[ \hat\beta_k - t_{\alpha/2,\,n-2}\, SE(\hat\beta_k),\;\; \hat\beta_k + t_{\alpha/2,\,n-2}\, SE(\hat\beta_k) \right],

k = 0, 1, where t_{\alpha/2,\,n-2} is the upper α/2 critical value of a Student-t distribution with n − 2
degrees of freedom.

26/112
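As a small sketch (simulated data, not from the notes), the standard-error formulas above can be evaluated directly and compared with R's confint():

# Sketch: standard errors and 95% confidence intervals in simple regression
set.seed(2)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

sigma2.hat <- sum(residuals(fit)^2) / (n - 2)
se0 <- sqrt(sigma2.hat * (1 / n + mean(x)^2 / sum((x - mean(x))^2)))
se1 <- sqrt(sigma2.hat / sum((x - mean(x))^2))

tcrit <- qt(0.975, df = n - 2)
cbind(coef(fit) - tcrit * c(se0, se1), coef(fit) + tcrit * c(se0, se1))
confint(fit, level = 0.95)   # should match the intervals above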
Deviance

The predicted values given the MLE are Xβ̂, therefore the residuals are defined as y − Xβ̂,
the residual sum of squares is

(y - X\hat\beta)^\top (y - X\hat\beta) = \sum_{i=1}^N (Y_i - \hat{Y}_i)^2

and the deviance is

D = 2\left[ \ell(\hat\beta_{\max}) - \ell(\hat\beta) \right]
  = 2\left[ -\frac{N}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}(y - X\hat\beta)^\top(y - X\hat\beta) + \frac{N}{2}\log(2\pi\sigma^2) \right]
  = \frac{1}{\sigma^2}\left( y^\top y - 2\hat\beta^\top X^\top y + \hat\beta^\top X^\top X\hat\beta \right)
  = \frac{1}{\sigma^2}\left( y^\top y - \hat\beta^\top X^\top y \right)

27/112
Hypothesis testing

Now, let’s consider a null hypothesis H0 and an alternative hypothesis H1 .


  
H_0: \beta = \beta_0 = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_q \end{bmatrix}, \qquad
H_1: \beta = \beta_1 = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}, \qquad q < p.

The scaled deviance can be used for model comparison.


\Delta D = D_0 - D_1 = \frac{1}{\sigma^2}\left[ (y^\top y - \hat\beta_0^\top X_0^\top y) - (y^\top y - \hat\beta_1^\top X_1^\top y) \right]
 = \frac{1}{\sigma^2}\left[ \hat\beta_1^\top X_1^\top y - \hat\beta_0^\top X_0^\top y \right]

28/112
Hypothesis testing

Then we have D0 ∼ χ2 (N − q) and D1 ∼ χ2 (N − p), and thus

∆D ∼ χ2 (p − q)

when N is large.

Remark: If the models do not fit the data well, it is possible that either D0 or D1
follow a non-central χ2 distribution with non-centrality parameter ν.

If the value of ∆D falls in the critical region, then we reject H0 in favour of H1 on
the grounds that model M1 provides a significantly better description of the data.

29/112
Hypothesis testing

The variance σ 2 , however, is unknown; replacing it by its estimate σ̂ 2 makes the χ2 result
for ∆D above only approximate.

But σ 2 may be eliminated by using the ratio


F = \frac{(D_0 - D_1)/(p-q)}{D_1/(N-p)} = \frac{\left( \hat\beta_1^\top X_1^\top y - \hat\beta_0^\top X_0^\top y \right)/(p-q)}{\left( y^\top y - \hat\beta_1^\top X_1^\top y \right)/(N-p)} \tag{12}

This may be explicitly calculated from the fitted values.

• Under the null hypothesis H0 , the statistic F follows the Fp−q,N−p distribution
• Otherwise, the statistic F has a non-central distribution
30/112
Hypothesis testing

The F -statistic is usually used to test the hypothesis

H0 : β2 = β3 = . . . = βp = 0 (all coefficients except the intercept are zero)
H1 : at least one of these βj is non-zero

and

F = \frac{(TSS - RSS)/(p-1)}{RSS/(N-p)} \tag{13}

where TSS = \sum_{i=1}^N (Y_i - \bar{Y})^2 and RSS = \sum_{i=1}^N (Y_i - \hat{Y}_i)^2.

It is possible to show that E(RSS/(N − p)) = σ 2 and, under H0 ,
E[(TSS − RSS)/(p − 1)] = σ 2 . Therefore, under H0 , the F -statistic is expected to be close
to 1, while under H1 , E[(TSS − RSS)/(p − 1)] > σ 2 and the F -statistic tends to be larger than 1.
31/112
Hypothesis testing

Remark 1: the F -statistic is important because if you test for the effect of any
predictor separately (without any correction) about 5% of the p-values will be under α
(e.g. 0.05) by chance. The F -statistic does not suffer from this problem because it
adjusts for the number of predictors.

Remark 2: The approach using the F -statistic works when p < N; however, sometimes
we have a very large number of variables and p > N. In this case, multiple regression
cannot be fitted and the F -statistic cannot be used.

32/112
Hypothesis testing

The task of determining which predictors are associated with the response is referred
to as variable selection. We will discuss more about this in Week 8.

One problem is that there are 2^p possible models to compare, each containing a subset of the p
variables. Automated approaches to choose the set of models to consider are

• forward selection: starting with the null model and add variables, by selecting always the
model with the lowest RSS; can always be used
• backward selection: starting with the model with all the variables and remove the variable
with the largest p-value and continue; cannot be used when p > N
• mixed selection: combination of forward and backward selection, until all the variables
outside the model are associated with larger p-values than variables within the model

33/112
Distribution of residuals

The estimated residuals are defined as

e_i = y_i - x_i^\top \hat\beta = y_i - \hat{y}_i \tag{14}

which can also be written as

e = y − Xβ̂ = y − X(X> X)−1 X> y (15)

The hat (or projection) matrix H := X(X> X)−1 X> maps the vector of response values
to the vector of fitted values

Hy = X(X> X)−1 X> y = Xβ̂ = ŷ.

34/112
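A small sketch (using R's built-in mtcars data, not the lecture's example) computing H explicitly and checking that Hy reproduces the fitted values:

# Sketch: hat matrix H = X (X'X)^{-1} X' and its basic properties
X <- cbind(1, mtcars$wt, mtcars$hp)
y <- mtcars$mpg

H <- X %*% solve(t(X) %*% X) %*% t(X)
max(abs(H %*% y - fitted(lm(mpg ~ wt + hp, data = mtcars))))  # ~ 0: Hy = y-hat
max(abs(H - t(H)))       # ~ 0: symmetric
max(abs(H %*% H - H))    # ~ 0: idempotent
sum(diag(H))             # equals p = 3 (the number of columns of X)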
Distribution of residuals

H is symmetric and idempotent, therefore, it is a projection matrix. It defines a


transformation of y in N-dimensional space to vector ŷ in a subspace such that ŷ is as
close to y as possible.

Remark: Note also that for an idempotent matrix A: rank(A) = tr(A).

Then the residuals are given by

e = y − ŷ = (I − H)y. (16)

35/112
Distribution of residuals

The variance-covariance matrix of the estimated residuals is

Var(e) = Var[(I − H)y]


= Var[(I − H)(Xβ + ε)]
= Var[(I − H)ε]
= (I − H)Var[ε](I − H)>
= σ 2 (I − H)(I − H)>
= σ 2 (I − H)
= σ 2 [I − X(X> X)−1 X> ]

36/112
Distribution of residuals

Theorem
If the Gauss-Markov assumptions hold, then

e ∼ N (0, σ 2 (I − H)).

(Note that the covariance matrix is singular.)


Since Cov [e] is not diagonal, the residuals are not independent; since
Var[ei ] = σ 2 (1 − hii ), their variance is not constant.

37/112
Distribution of residuals
Proof.
If the model assumptions hold, y is a Gaussian random vector, and then (16) shows
that e is a Gaussian random vector. Its mean is

E[e] = (I − H)E[y] = (I − H)Xβ = 0

and its covariance matrix is

Cov [e] = Cov [(I − H)y] = (I − H)Cov [y](I − H)> = (I − H)(σ 2 I)(I − H)
= σ 2 (I − H)

since H is symmetric (H> = H) and I − H is idempotent:

(I − H)(I − H) = I(I − H) − H(I − H) = I − H − H + HH = I − H.


38/112
Distribution of residuals

Theorem
Under the Gauss-Markov assumptions,
\sigma^{-2} e^\top e = \sigma^{-2} \sum_{i=1}^N e_i^2 \sim \chi^2_{N-p}

39/112
Distribution of residuals

Proof.
Using the general result on the distribution of quadratic forms, we need to identify the
noncentrality parameter and find the rank of the matrix I − H.
The distribution of the residuals is e ∼ N (0, σ 2 (I − H)).
Additionally, since E(e) = 0 the noncentrality parameter is λ = 0. Moreover,

rank(I − H) = tr(I − H) = tr(I) − tr(H) = N − tr(X(X> X)−1 X> ) = N − tr((X> X)−1 X> X) = N − tr(Ip ) = N − p,

using the cyclic property of the trace. Equivalently, H is a projection matrix of rank p = rank(X),
so tr(H) = p.
This concludes the proof of the theorem.
40/112
Distribution of residuals

It follows that σ̂ 2 := \sum_{i=1}^N e_i^2 / (N − p) is an unbiased estimator of σ 2 .
Its square root σ̂ is called the residual standard error, and it is the estimate needed to
compute the standard errors of the coefficients.
Residual plots are useful graphical tools for identifying non-linearity in the data: we
can plot the residuals (Yi − Ŷi ) versus the fitted values Ŷi .

Ideally the residual plot will not show any discernible pattern.

If the residual plot indicates that there are non-linear associations in the data, then a
simple approach is to use non-linear transformations of the predictors, e.g. log X, √X,
X 2 , etc.

41/112
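A minimal sketch (simulated data with a deliberately quadratic signal) of the residuals-versus-fitted plot described above:

# Sketch: residuals vs fitted values reveal non-linearity
set.seed(3)
x <- runif(100, 0, 3)
y <- 1 + x^2 + rnorm(100, sd = 0.5)   # true relationship is quadratic
fit <- lm(y ~ x)                      # misspecified linear fit

plot(fitted(fit), residuals(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)                # a clear U-shape indicates non-linearity
# plot(fit, which = 1) produces the same diagnostic with a smooth curve added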
Distribution of residuals

42/112
Studentized residuals

Plotting the residuals vs. fitted values may reveal possible violations of linearity or
homoscedasticity. Standardising the residuals may lead to a better feeling for their
magnitude.
We have calculated the variance of the i-th residual to be

Var[ei ] = σ 2 [I − H]ii = σ 2 (1 − hii ). (17)

Accordingly, one defines the studentized residuals:


e'_i = \frac{e_i}{\sqrt{\hat\sigma^2 (1 - h_{ii})}}.

43/112
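A short sketch (mtcars data, illustrative) computing the studentized residuals by hand and comparing with rstandard(), which applies the same standardisation:

# Sketch: studentized residuals by hand vs rstandard()
fit <- lm(mpg ~ wt + hp, data = mtcars)
e <- residuals(fit)
h <- hatvalues(fit)                        # diagonal elements h_ii of the hat matrix
sigma2.hat <- sum(e^2) / df.residual(fit)  # unbiased estimate of sigma^2

e.stud <- e / sqrt(sigma2.hat * (1 - h))
max(abs(e.stud - rstandard(fit)))          # ~ 0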
Studentized residuals

Recommendations about outliers: Points should not be routinely deleted from an


analysis just because they do not fit the model. Outliers and bad leverage points are
signals, flagging potential problems with the model. Outliers often point out an
important feature of the problem not considered before. They may point to an
alternative model in which the points are no longer outliers.

44/112
Leverage

The i-th diagonal entry hii of H is called the leverage of the i-th observation. An
observation with high leverage can make a substantial difference to the fit of the model.

Model fits are particularly sensitive to data with high leverage.

Also note from (17) that residuals at points with high leverage have small variance.

45/112
Leverage

The sum of leverages equals the number of parameters:


\sum_{i=1}^N h_{ii} = tr(H) = tr(X(X^\top X)^{-1}X^\top) = tr(I_p) = p

[As said before: This is true because rank(H) = tr(H) and H has rank p (which is the
rank(X)): therefore the tr(H) can be seen as the tr(Ip ).]

Hence on average the leverage is p/N. As a rule of thumb, if hii is greater than two or
three times p/N, it may be a concern.

46/112
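A sketch (mtcars, illustrative) of the leverage rule of thumb:

# Sketch: leverages h_ii and the 2p/N rule of thumb
fit <- lm(mpg ~ wt + hp, data = mtcars)
h <- hatvalues(fit)
p <- length(coef(fit))
N <- nrow(mtcars)

sum(h)                  # equals p
which(h > 2 * p / N)    # observations with noticeably high leverage
plot(h, type = "h", ylab = "Leverage"); abline(h = 2 * p / N, lty = 2)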
Cook’s distance

Definition
The Cook’s distance is given by:
D_i = \frac{1}{p\,\hat\sigma^2}\, \lVert \hat{y} - \hat{y}_{-i} \rVert^2

Cook’s distance hence measures the (rescaled) sum of squared differences between
fitted values when the i-th datum is removed.
It is a measure for the influence of the i-th datum on the entire model fit.
It can be shown that

D_i = \frac{1}{p}\, \frac{h_{ii}}{1 - h_{ii}}\, (e'_i)^2 ,

where e'_i is the studentized residual defined earlier.
47/112
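A sketch (mtcars, illustrative) computing Cook's distance from the identity above and comparing with cooks.distance():

# Sketch: Cook's distance via D_i = (1/p) * h_ii/(1 - h_ii) * (e'_i)^2
fit <- lm(mpg ~ wt + hp, data = mtcars)
h <- hatvalues(fit)
p <- length(coef(fit))

D <- rstandard(fit)^2 * h / ((1 - h) * p)
max(abs(D - cooks.distance(fit)))   # ~ 0
sort(D, decreasing = TRUE)[1:3]     # look for gaps among the largest values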
Cook’s distance

One sees that Cook’s distance statistic Di is particularly large at high leverage points
with large residuals.

• According to Weisberg (2005, p. 200), if the largest value of Di is substantially


less than one, deletion of a case will not change the estimate ŷ by much;
• Fox (2002, p. 198) is among many authors who recommend 4/(n − 2) as a rough
cutoff for noteworthy values of Di for simple linear regression;
• In practice, it is important to look for gaps in the values of Cook’s distance and
not just whether values exceed the suggested cut-off.

48/112
Coefficient of determination, R 2

The strength of a linear relationship is usually measured through the sample correlation
coefficient R.
• An R close to 1 indicates a positive linear relationship
• An R close to -1 indicates a negative linear relationship.
Squaring R gives R 2 : a value of R 2 close to one gives a numerical indication of
the strength of the regression.
The residual sum of squares

RSS = \sum_{i=1}^N e_i^2 = e^\top e = (y - X\beta)^\top (y - X\beta)

is minimised by the least squares estimate β̂ = (X> X)−1 X> y, therefore

\widehat{RSS} = (y - X\hat\beta)^\top (y - X\hat\beta) = y^\top y - \hat\beta^\top X^\top y
49/112
Coefficient of determination, R 2

For the minimal model

Y_i = \beta_0 + \varepsilon_i ,

we have X> X = N and X> y = \sum_{i=1}^N y_i , so RSS is minimised for β̂0 = ȳ , giving
RSS_0 = \sum_{i=1}^N (y_i - \bar{y})^2 . This is the worst possible value for RSS. It is also sometimes
called the total sum of squares (TSS).
If parameters are added to the model, then RSS must decrease. The relative amount
of decrease

R^2 := \frac{TSS - RSS}{TSS} \tag{18}
is called the coefficient of determination. It is the proportion of the total variation in
the data which is explained by the model.
50/112
Coefficient of determination, R 2

For the maximal model, we will have RSS = 0 and thus R 2 = 1: this means that the
R 2 always increases when more variables are added to the model, even if those
variables are weakly associated with the response. The fact that adding a variable
leads to a small increase of the R 2 statistic provides evidence that the contribution of
that variable is small.
The coefficient of determination can be interpreted as the proportion of variance
explained by the model.
In the case of one covariate, the coefficient of determination is the squared correlation
between Y and x. In multiple regression, it turns out that R 2 = Cor(Y, Ŷ)2 : among all
possible linear models, the least squares fit maximises the correlation between the
response and the fitted values.

51/112
Coefficient of determination, R 2

There is a relationship between the F -statistic and the R 2 coefficient. We can write

R^2 \times TSS = TSS - RSS \qquad \text{and} \qquad \frac{RSS}{TSS} = 1 - R^2 ,

and from (13) we have

F = \frac{(TSS - RSS)/(p-1)}{RSS/(N-p)} = \frac{R^2 \times TSS}{RSS} \cdot \frac{N-p}{p-1} = \frac{R^2}{1 - R^2} \cdot \frac{N-p}{p-1} \sim F_{p-1,\,N-p}.

52/112
Example - Cheddar cheese

As cheese ages, various chemical processes take place that determine the taste of the
final product. In a study of Cheddar cheese from the La Trobe Valley of Victoria,
Australia, samples of cheese were analysed for their chemical composition and were
subjected to taste tests. Overall taste scores were obtained by combining the scores
from several tasters.
Data are measured on concentrations of various chemicals in 30 samples of mature
Cheddar cheese, and a subjective measure of taste for each sample. The variables
acetic and H2S are the natural logarithm of the concentration of acetic acid and
hydrogen sulfide respectively. The variable lactic has not been transformed.

53/112
Example - Cheddar cheese

Let’s load the data


# We use the package "faraway"
# If needed, install the package by running the command
# install.packages("faraway")
library(faraway)

data(cheddar)
attach(cheddar)

54/112
Example - Cheddar cheese

The model (M1) is


taste = β0 + β1 Acetic + β2 H2S + β3 Lactic + ε

We can analytically compute the MLE


X <- cbind(rep(1, nrow(cheddar)), Acetic, H2S, Lactic)
y <- matrix(taste, ncol = 1)

b.hat <- solve(t(X) %*% X) %*% t(X) %*% y
b.hat

#                [,1]
#        -28.8767696
# Acetic   0.3277413
# H2S      3.9118411
# Lactic  19.6705434
55/112
Example - Cheddar cheese

The estimate of the variance can be computed through the residuals


\hat\sigma^2 = \frac{1}{N - p} (y - X\hat\beta)^\top (y - X\hat\beta)

The residual standard deviation is \sqrt{\hat\sigma^2} and the standard errors of the estimates
are the square roots of the elements on the diagonal of σ̂ 2 (X> X)−1 .

56/112
Example - Cheddar cheese

N <- nrow(cheddar)
p <- ncol(X)
sigma2.hat <- t(y - X %*% b.hat) %*% (y - X %*% b.hat) / (N - p)
sigma2.hat

#          [,1]
# [1,] 102.6312

as.vector(sigma2.hat) * solve(t(X) %*% X)
sqrt(diag(as.vector(sigma2.hat) * solve(t(X) %*% X)))

#              Acetic       H2S    Lactic
# 19.735418  4.459757  1.248430  8.629055
57/112
Example - Cheddar cheese

And the R 2 statistic is


RSS <- t(y - X %*% b.hat) %*% (y - X %*% b.hat)
TSS <- t(y - mean(y)) %*% (y - mean(y))

TSS
#          [,1]
# [1,] 7662.887

RSS
#          [,1]
# [1,] 2668.411

R2 <- (TSS - RSS) / TSS
R2
#           [,1]
# [1,] 0.6517747
58/112
Example - Cheddar cheese

We check whether any of the three possible predictors are significant. Define

H0 : β1 = β2 = β3 = 0
H1 : at least one βj ≠ 0

In other terms, does the model (M1) offer any significant improvement over the
empty model (M0): taste = β0 + ε?
Fstat <- R2 / (1 - R2) * (N - p) / (p - 1)
Fstat
#          [,1]
# [1,] 16.22143

59/112
Example - Cheddar cheese

The F statistic follows an F(3, 26) distribution under H0. The p-value is the probability of
exceeding the observed test statistic, i.e.:
pval <- 1 - pf(Fstat, df1 = p - 1, df2 = N - p)
pval
#              [,1]
# [1,] 3.810185e-06

So there is very strong evidence that at least one of the three predictors is significant.

60/112
Example - Cheddar cheese

All these statistics can be computed through the lm function.


cheese.lm <- lm(taste ~ Acetic + H2S + Lactic)
summary(cheese.lm)

# Call:
# lm(formula = taste ~ Acetic + H2S + Lactic)
#
# Residuals:
#     Min      1Q  Median      3Q     Max
# -17.390  -6.612  -1.009   4.908  25.449

61/112
Example - Cheddar cheese

# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept) -28.8768    19.7354  -1.463  0.15540
# Acetic        0.3277     4.4598   0.073  0.94198
# H2S           3.9118     1.2484   3.133  0.00425 **
# Lactic       19.6705     8.6291   2.280  0.03108 *
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 10.13 on 26 degrees of freedom
# Multiple R-squared: 0.6518,  Adjusted R-squared: 0.6116
# F-statistic: 16.22 on 3 and 26 DF,  p-value: 3.81e-06

In the model M1 with three predictors, is there evidence that H2S is significant?

62/112
Chapter 3 - Linear Gaussian models
3.1 - Introduction
3.2 - Basic results
3.2.1 - Maximum likelihood estimation
3.2.2 - Least squares estimation
3.3 - Diagnostics

3.4 - Analysis of Variance (ANOVA)

3.5 - Analysis of Covariance (ANCOVA)

3.6 - General linear models


3.7 - Extensions
3.8 - Potential problems
63/112
Analysis of Variance - ANOVA

Analysis of variance is a method to compare means of groups of continuous


observations where the groups are defined by the levels of the factors.

• Y: continuous variable
• x: categorical variable(s)

The elements of the design matrix X are dummy variables.

64/112
Example: One-factor analysis

Genetically similar seeds are randomly assigned to be raised in either a nutritionally
enriched environment (treatment A or treatment B) or standard conditions (control
group), using a completely randomised experimental design. After a predetermined
time all plants are harvested, dried and weighed.
library(dobson)
data("plant.dried")
attach(plant.dried)

head(plant.dried)

#     group weight
# 1 Control   4.17
# 2 Control   5.58
# 3 Control   5.18
# 4 Control   6.11
# 5 Control   4.50
65/112
Example: One-factor analysis

Remark: if experimental units are randomly allocated to groups corresponding to J


levels of a factor, this is called a completely randomised experiment.

The responses at level j, i.e. Yj1 , . . . , Yjnj are called replicates. In the example nj = K
for j = 1, 2, 3, but this is not always true. We assume nj = K throughout for simplicity.
The response vector (of length N = JK ), is given by:
y = [Y11 , Y12 , . . . , Y1K , Y21 , . . . , Y2K , . . . , YJ1 , . . . , YJK ]

We consider three specifications of the model:


(1) E(Yjk ) = µj for k = 1, . . . , K
(2) E(Yjk ) = µ + αj for k = 1, . . . , K
(3) E(Yjk ) = µ + αj for k = 1, . . . , K , under constraint α1 = 0
66/112
Model 1 - E(Yjk ) = µj for k = 1, . . . , K

Model (1) can be re-written as E(Y_i) = \sum_{j=1}^J x_{ij}\mu_j for i = 1, . . . , N , where the x_{ij} are
elements of the design matrix defined through:

• xij = 1 if response Yi corresponds to level j


• xij = 0 otherwise

This gives E(y) = Xβ where

   
\beta = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_J \end{bmatrix}
\qquad \text{and} \qquad
X = \begin{bmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \cdots & \vdots \\
0 & 0 & \cdots & 1
\end{bmatrix}

67/112
Model 1 - E(Yjk ) = µj for k = 1, . . . , K

The design matrix is given by


X <- cbind(c(rep(1,10), rep(0,20)),
           c(rep(0,10), rep(1,10), rep(0,10)),
           c(rep(0,20), rep(1,10)))

and the estimate β̂ is the vector of sample means for each group

y <- matrix(weight, ncol = 1)
b.hat <- solve(t(X) %*% X) %*% t(X) %*% y
b.hat

#       [,1]
# [1,] 5.032
# [2,] 4.661
# [3,] 5.526

The disadvantage of this simple formulation of the model is that it cannot be extended
to consider more than one factor.
68/112
Model 2 - E(Yjk ) = µ + αj for k = 1, . . . , K

In this model µ is an average effect for all levels and αj is an additional effect due to
level j. In this case we have:

  
\beta = \begin{bmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_J \end{bmatrix}
\qquad \text{and} \qquad
X = \begin{bmatrix}
1 & 1 & 0 & \cdots & 0 \\
1 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \cdots & \vdots \\
1 & 0 & 0 & \cdots & 1
\end{bmatrix} .

The design matrix has an additional column of elements equal to 1.


The first row (or column) of the (J + 1) × (J + 1) matrix X> X is the sum of the
remaining rows (or columns), therefore X> X is singular and there is no unique solution.
69/112
Model 2 - E(Yjk ) = µ + αj for k = 1, . . . , K

The general solution can be written as

\hat\beta = \begin{bmatrix} \hat\mu \\ \hat\alpha_1 \\ \vdots \\ \hat\alpha_J \end{bmatrix}
 = \frac{1}{K} \begin{bmatrix} 0 \\ Y_{1.} \\ \vdots \\ Y_{J.} \end{bmatrix}
 - \lambda \begin{bmatrix} -1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}

where λ is an arbitrary constant and Y_{j.} = \sum_{k=1}^{n_j} Y_{jk} .

Usually a sum-to-zero constraint is used, such that \sum_{j=1}^{J} \alpha_j = 0 , i.e.

\frac{1}{K} \sum_{j=1}^{J} Y_{j.} - J\lambda = 0 \iff \lambda = \frac{1}{JK} \sum_{j=1}^{J} Y_{j.} = \frac{Y_{..}}{N}

and therefore

\hat\mu = \frac{Y_{..}}{N} \qquad \text{and} \qquad \hat\alpha_j = \frac{Y_{j.}}{K} - \frac{Y_{..}}{N} , \qquad j = 1, \ldots, J
70/112
Model 2 - E(Yjk ) = µ + αj for k = 1, . . . , K

X <- cbind(rep(1, nrow(plant.dried)),
           c(rep(1,10), rep(0,20)),
           c(rep(0,10), rep(1,10), rep(0,10)),
           c(rep(0,20), rep(1,10)))

lambda <- mean(weight)
mu.hat <- lambda
mu.hat

# [1] 5.073

alpha.hat <- aggregate(weight, list(group), mean)$x - lambda
alpha.hat

# [1] -0.041 -0.412  0.453

71/112
Model 3 - E(Yjk ) = µ + αj for k = 1, . . . , K , under constraint α1 = 0

In this case µ represents the effect of the first level and αj measures the difference
between the first level and the j-th level of the factor.

This is called a corner point parametrization. We have

 
  1 0 0 ··· 0
µ 1 1 0 · · · 0
α2 
   
1 0 1 · · ·
 
β=
 .. 
 and X= 0 .
 .   .. .. .. .. 
. . . ··· .
αJ
1 0 0 ··· 1

72/112
Model 3 - E(Yjk ) = µ + αj for k = 1, . . . , K , under constraint α1 = 0

Now X> X is not singular, so there is a unique solution

\hat\beta = \frac{1}{K} \begin{bmatrix} Y_{1.} \\ Y_{2.} - Y_{1.} \\ \vdots \\ Y_{J.} - Y_{1.} \end{bmatrix}
X <- cbind(rep(1, nrow(plant.dried)),
           c(rep(0,10), rep(1,10), rep(0,10)),
           c(rep(0,20), rep(1,10)))

b.hat <- solve(t(X) %*% X) %*% t(X) %*% y
b.hat
#        [,1]
# [1,]  5.032
# [2,] -0.371
# [3,]  0.494
73/112
Model 3 - E(Yjk ) = µ + αj for k = 1, . . . , K , under constraint α1 = 0

In the analysis of the variance, it is important to compare the alternative hypothesis


(means for each level differ) with the null hypothesis (means are all equal)
For the null model, E(Yjk ) = µ and the design matrix is a column vector of elements
equal to 1, i.e. X> X = N and X> y = Y.. .

D_1 = \frac{1}{\sigma^2} \left( y^\top y - \hat\beta^\top X^\top y \right)

D_0 = \frac{1}{\sigma^2} \left( \sum_{j=1}^{J} \sum_{k=1}^{K} Y_{jk}^2 - \frac{Y_{..}^2}{N} \right)

and the F -statistic is

F = \frac{(D_0 - D_1)/(J - 1)}{D_1/(N - J)}
74/112
Model 3 - E(Yjk ) = µ + αj for k = 1, . . . , K , under constraint α1 = 0

D1 <- t(y) %*% y - t(b.hat) %*% t(X) %*% y

# Null model:
X0 <- matrix(rep(1, nrow(plant.dried)), ncol = 1)
b0.hat <- solve(t(X0) %*% X0) %*% t(X0) %*% y
D0 <- t(y) %*% y - t(b0.hat) %*% t(X0) %*% y

Fstat <- ((D0 - D1) / (ncol(X) - 1)) / (D1 / (nrow(X) - ncol(X)))
crit.val <- qf(0.95, df1 = ncol(X) - 1, df2 = nrow(X) - ncol(X))

if (Fstat > crit.val) {
  cat("There is enough evidence to reject H0")
} else {
  cat("There is NOT enough evidence to reject H0")
}
# There is enough evidence to reject H0
75/112
Model 3 - E(Yjk ) = µ + αj for k = 1, . . . , K , under constraint α1 = 0

Similarly, we could have obtained the same results with the lm() function
res.lm <- lm(weight ~ group)
summary(res.lm)

# Call:
# lm(formula = weight ~ group)
#
# Residuals:
#     Min      1Q  Median      3Q     Max
# -1.0710 -0.4180 -0.0060  0.2627  1.3690

76/112
Model 3 - E(Yjk ) = µ + αj for k = 1, . . . , K , under constraint α1 = 0

# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)   5.0320     0.1971  25.527   <2e-16 ***
# groupGroupA  -0.3710     0.2788  -1.331   0.1944
# groupGroupB   0.4940     0.2788   1.772   0.0877 .
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.6234 on 27 degrees of freedom
# Multiple R-squared: 0.2641,  Adjusted R-squared: 0.2096
# F-statistic: 4.846 on 2 and 27 DF,  p-value: 0.01591

Have a look at the two-factor analysis of variance, which is a simple extension of this
section (Dobson and Barnett, 2018, Section 6.4.2).

77/112
Chapter 3 - Linear Gaussian models
3.1 - Introduction
3.2 - Basic results
3.2.1 - Maximum likelihood estimation
3.2.2 - Least squares estimation
3.3 - Diagnostics

3.4 - Analysis of Variance (ANOVA)

3.5 - Analysis of Covariance (ANCOVA)

3.6 - General linear models


3.7 - Extensions
3.8 - Potential problems
78/112
Analysis of Covariance (ANCOVA)

The analysis of covariance is the term used for models in which some of the
explanatory variables are dummy variables representing factor levels and others are
continuous measurements called covariates.

We compare means of subgroups defined by factor levels, but we consider that the
covariates may also affect the responses, therefore we compare the means after
adjustment for covariate effects.

79/112
Example: Training methods

• The responses Yjk are achievement scores measured at three levels of a factor
• The factor represents three different training methods
• The covariates Xjk are aptitude scores measured before training commenced

We want to compare the training methods, taking into account differences in initial
aptitude between the three groups of subjects.

80/112
Example: Training methods

data("achievement")
attach(achievement)
plot(achievement)

library(ggplot2)  # Install if needed
ggplot(achievement, aes(x = x, y = y, colour = method)) +
  geom_point(aes(shape = method, color = method)) +
  labs(x = "Initial aptitude, x") +
  labs(y = "Achievement score, y")

81/112
Example: Training methods

[Scatter plot of achievement score y against initial aptitude x, with points coloured and shaped by training method (A, B, C).]

82/112
Example: Training methods

res.lm <- lm(y ~ method + x, data = achievement)
summary(res.lm)

# Call:
# lm(formula = y ~ method + x, data = achievement)
#
# Residuals:
#      Min       1Q   Median       3Q      Max
# -1.73878 -0.57959  0.07347  0.48980  1.00408

83/112
Example: Training methods

# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)   2.8367     0.4235   6.699 3.74e-06 ***
# methodB       2.1878     0.4545   4.814 0.000162 ***
# methodC       1.8612     0.4240   4.390 0.000400 ***
# x             0.7429     0.1421   5.227 6.84e-05 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.7785 on 17 degrees of freedom
# Multiple R-squared: 0.8386,  Adjusted R-squared: 0.8101
# F-statistic: 29.43 on 3 and 17 DF,  p-value: 5.889e-07

Note: to get the results of the F -test directly we can use the glm function.
res.glm <- glm(y ~ method + x, data = achievement)
summary(res.glm)
84/112
Chapter 3 - Linear Gaussian models
3.1 - Introduction
3.2 - Basic results
3.2.1 - Maximum likelihood estimation
3.2.2 - Least squares estimation
3.3 - Diagnostics

3.4 - Analysis of Variance (ANOVA)

3.5 - Analysis of Covariance (ANCOVA)

3.6 - General linear models


3.7 - Extensions
3.8 - Potential problems
85/112
General linear models

The term general linear models is used for Gaussian models with any combination of
categorical and continuous explanatory variables.

The factors can be

• crossed: there are observations for each combination of levels of the factors (see
two factors ANOVA)
• nested: the combinations of factors are different

86/112
Example - nested factors

Two-factor nested design: comparison of two drugs, A1 tested in hospitals B1 , B2 and B3
and A2 tested in hospitals B4 and B5 .

              Drug A1                      Drug A2
Hospitals     B1       B2       B3        B4       B5
Responses     Y111     Y121     Y131      Y241     Y251
              ...      ...      ...       ...      ...
              Y11n1    Y12n2    Y13n3     Y24n4    Y25n5

We want to compare the effects of the two drugs and possible differences among
hospitals using the same drug.

87/112
Example - nested factors

The saturated model is

E(Yjkl ) = µ + α1 + α2 + (αβ)11 + (αβ)12 + (αβ)13 + (αβ)24 + (αβ)25

under constraints

• α1 = 0
• (αβ)11 = 0
• (αβ)24 = 0

Hospitals 1, 2 and 3 can be only compared within drug A1 and hospitals 4 and 5 can
be only compared within drug A2 .

88/112
Example - nested factors

This model is not different from other Gaussian models

• Response variable is assumed to be normally distributed


• Response and explanatory variables are assumed to be linearly related
• The variance σ 2 is constant
• The responses are assumed to be independent

These assumptions must be checked by looking at the residuals.

89/112
Example - nested factors

If the assumption of normality is not plausible, a transformation can be used.

A commonly used transformation is the Box-Cox transformation:

y^* = \begin{cases} \dfrac{y^\lambda - 1}{\lambda}, & \lambda \neq 0 \\[4pt] \log y, & \lambda = 0 \end{cases} \tag{19}

• if λ = 1, y is unchanged (except for a location shift which does not influence
  inference)
• if λ = 1/2, the transformation is the square root
• if λ = −1, the transformation is the reciprocal
• if λ = 0, the transformation is the logarithm

The value of λ which produces the “most normal” distribution can be estimated by the
method of maximum likelihood.
90/112
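A brief sketch (simulated data) estimating λ by profile likelihood with boxcox() from the MASS package (shipped with R):

# Sketch: estimating the Box-Cox lambda by maximum likelihood
library(MASS)
set.seed(4)
x <- runif(100, 1, 5)
y <- exp(1 + 0.5 * x + rnorm(100, sd = 0.3))     # log(y) is linear in x

bc <- boxcox(y ~ x, lambda = seq(-1, 1, 0.05))   # plots the profile log-likelihood
bc$x[which.max(bc$y)]                            # lambda with highest likelihood (near 0 here)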
Chapter 3 - Linear Gaussian models
3.1 - Introduction
3.2 - Basic results
3.2.1 - Maximum likelihood estimation
3.2.2 - Least squares estimation
3.3 - Diagnostics

3.4 - Analysis of Variance (ANOVA)

3.5 - Analysis of Covariance (ANCOVA)

3.6 - General linear models


3.7 - Extensions
3.8 - Potential problems
91/112
Non-additive associations

The additive assumption means that the effect of changes in a predictor Xj on the
response Y is independent of the values of other predictors.
In many situations, there is a synergy effect, i.e. increasing the level of one covariate
may interact with the level of another. This is called interaction in statistics.
Example
Take
Y = β0 + β1 x1 + β2 x2 + β3 x1 x2 + ε

This means that

Y = β0 + (β1 + β3 x2 )x1 + β2 x2 + ε = β0 + β̃1 x1 + β2 x2 + ε

where β̃1 = β1 + β3 x2 , i.e. β̃1 changes with x2 and the effect of x1 on Y is no longer
constant.
92/112
Non-additive associations

Remark:

• The hierarchical principle states that if we include an interaction in a model, we


should also include the main effects, even if the p-values associated with their
coefficients are not significant.
• The concept of interactions applies to qualitative variables, to quantitative
variables or to a combination of both.

93/112
Non-additive associations
Example
Using the Credit data (ISLR package), the least squares lines are shown for
prediction of Balance from Income for students and non-students. The left model does
not include an interaction, the right model does.
[Two panels: least squares lines of Balance against Income for students (red) and non-students (black); left panel from the model without interaction, right panel from the model with interaction.]
94/112
Non-additive associations

lm <- lm(Balance ~ Income + Student, data = Credit)
plot(0, 0, xlim = c(0, 150), ylim = c(200, 1400),
     xlab = "Income", ylab = "Balance")
abline(a = lm$coefficients[1] + lm$coefficients[3],
       b = lm$coefficients[2],
       col = 2, lwd = 2)
abline(a = lm$coefficients[1], b = lm$coefficients[2], col = 1, lwd = 2)
legend("topleft", legend = c("student", "non-student"),
       col = c(2, 1), lwd = 2, cex = 0.9)

95/112
Non-additive associations

lm2 <- lm(Balance ~ Income * Student, data = Credit)
plot(0, 0, xlim = c(0, 150), ylim = c(200, 1400),
     xlab = "Income", ylab = "Balance")
abline(a = lm2$coefficients[1] + lm2$coefficients[3],
       b = lm2$coefficients[2] + lm2$coefficients[4],
       col = 2, lwd = 2)
abline(a = lm2$coefficients[1],
       b = lm2$coefficients[2], col = 1, lwd = 2)
legend("topleft", legend = c("student", "non-student"),
       col = c(2, 1), lwd = 2, cex = 0.9)

96/112
Non-linear associations

So far, we have considered only linear associations between X and Y, but in other
situations the relationship can be non-linear.

A non-linear association can be suggested by looking at the residuals.

A popular model is a U-shaped association, that can be modelled by a quadratic


association
E(Y_i) = \beta_0 + \beta_1 x_i + \beta_2 x_i^2

This is still a linear regression because the regression equation is still a linear
combination of the explanatory variables X and X 2 .

97/112
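A minimal sketch (simulated data) of fitting a quadratic association within the linear-model framework:

# Sketch: a U-shaped (quadratic) association is still linear in the parameters
set.seed(5)
x <- runif(100, -2, 2)
y <- 1 + 0.5 * x + 2 * x^2 + rnorm(100)

fit.quad <- lm(y ~ x + I(x^2))       # equivalent to lm(y ~ poly(x, 2, raw = TRUE))
summary(fit.quad)$coefficients
anova(lm(y ~ x), fit.quad)           # F-test: does the quadratic term improve the fit?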
Non-linear associations

Tips:

• center the explanatory variables
• scale the explanatory variables

x̃_i = \frac{x_i - \bar{x}}{sd(x)}
Scaling has several advantages

• numerical accuracy of matrix manipulation is improved, in particular in presence


of large values of the covariate
• the intercept β0 relates the average of y to the average of x, instead of the
average of y with x = 0 (which is sometimes an impossible value)
• the slope represents a one standard deviation change which is potentially more
meaningful than a one unit change (which can be very small or very large)
98/112
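A short sketch (simulated data) showing that centring and scaling a covariate changes the interpretation of the coefficients without changing the fitted values:

# Sketch: centring/scaling changes interpretation, not the fit
set.seed(6)
x <- rnorm(100, mean = 50, sd = 10)
y <- 2 + 0.3 * x + rnorm(100)

x.std <- (x - mean(x)) / sd(x)       # same as as.vector(scale(x))
fit.raw <- lm(y ~ x)
fit.std <- lm(y ~ x.std)

coef(fit.raw)                        # intercept = expected y at x = 0
coef(fit.std)                        # intercept ~ mean(y); slope = effect of one sd of x
max(abs(fitted(fit.raw) - fitted(fit.std)))   # ~ 0: identical fitted values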
Fractional polynomials

The quadratic function is symmetric, however there may be situations where the rate
of increase is faster than the rate of decrease.

A range of functions can be investigated through fractional polynomials

E(Y_i) = \beta_0 + \beta_1 x_i^p , \qquad p \neq 0

In this case, you can test several models (p = 1 is linear, p = 2 is quadratic, p = −2 is


reciprocal quadratic) and investigate the best fit. If p = 0, we use log(xi ).

Since the shape of the association can be modified by both the chosen power and the slope
parameter, a large number of potential non-linear associations can be investigated.
99/112
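A hand-rolled sketch (simulated data) comparing a few candidate powers by residual sum of squares; dedicated packages exist for fractional polynomials, but a simple grid search illustrates the idea:

# Sketch: comparing fractional-polynomial powers by RSS
set.seed(7)
x <- runif(200, 0.5, 5)
y <- 3 + 2 / x^2 + rnorm(200, sd = 0.3)     # true power is p = -2

powers <- c(-2, -1, -0.5, 0, 0.5, 1, 2)
rss <- sapply(powers, function(p) {
  xp <- if (p == 0) log(x) else x^p
  sum(residuals(lm(y ~ xp))^2)
})
data.frame(power = powers, RSS = rss)       # the smallest RSS should point to p = -2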
Fractional polynomials

100/112
Chapter 3 - Linear Gaussian models
3.1 - Introduction
3.2 - Basic results
3.2.1 - Maximum likelihood estimation
3.2.2 - Least squares estimation
3.3 - Diagnostics

3.4 - Analysis of Variance (ANOVA)

3.5 - Analysis of Covariance (ANCOVA)

3.6 - General linear models


3.7 - Extensions
3.8 - Potential problems
101/112
Correlation of the error terms

An important assumption of the linear model is that ε1 , . . . , εN are uncorrelated.

If there is correlation among the error terms, then the estimated standard errors of the
coefficients will tend to underestimate the true standard errors.

When does it happen? A classic situation is for time series (Observations at adjacent
time points are positively correlated).

We investigate the correlation of the errors by plotting the residuals w.r.t. time:
• If errors are uncorrelated, there should be no discernible pattern;
• If the error terms are positively correlated, we may see a trend, in particular for
adjacent residuals.
102/112
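A quick sketch (simulated data) of what positively correlated errors look like in a residual plot; the errors follow an AR(1) process with ρ = 0.9:

# Sketch: residuals over "time" when errors are AR(1)-correlated (rho = 0.9)
set.seed(8)
n <- 100
x <- rnorm(n)
eps <- as.vector(arima.sim(list(ar = 0.9), n = n))   # correlated errors
y <- 1 + 2 * x + eps

fit <- lm(y ~ x)
plot(residuals(fit), type = "l", xlab = "Observation", ylab = "Residual")
abline(h = 0, lty = 2)    # long runs above/below zero suggest positive correlation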
Correlation of the error terms

[Figure 2: Residuals from a linear regression fit to data generated with error correlation ρ = 0, 0.5 and 0.9 (one panel per value of ρ, residuals plotted against observation index).]
103/112
Non-constant Variance

Another important assumption of the linear model is that the error terms have
constant variance, Var(εi ) = σ 2 for every i.

The case of non-constant variance is called heteroscedasticity.

A possible solution is to use a concave transformation of the observations, like log Y or
√Y, which results in a greater amount of shrinkage of the larger responses.

Sometimes we have an idea of the variance of each response: for example, each
observation could be an average of ni raw observations, in which case the average has
variance σi2 = σ 2 /ni . A solution in this situation is to fit weighted least squares, with
weights proportional to wi = ni .

104/112
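A sketch (simulated data) of weighted least squares when each response is an average of ni observations, so Var(Yi) = σ²/ni and the natural weights are wi = ni:

# Sketch: weighted least squares with weights w_i = n_i
set.seed(9)
m <- 60
ni <- sample(1:20, m, replace = TRUE)            # group sizes behind each averaged response
x <- runif(m, 0, 10)
y <- 2 + 0.5 * x + rnorm(m, sd = 1 / sqrt(ni))   # Var(Y_i) = sigma^2 / n_i, with sigma = 1

fit.ols <- lm(y ~ x)                # ignores the unequal variances
fit.wls <- lm(y ~ x, weights = ni)  # weights proportional to n_i
summary(fit.ols)$coefficients
summary(fit.wls)$coefficients       # typically smaller standard errors for the slope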
Non-constant Variance

[Figure 3: Residuals against fitted values for the non-transformed response Y (left) and the log-transformed response log(Y) (right). The red line is a smooth fit to the residuals; the blue lines give the outer quantiles.]
105/112
Collinearity

Collinearity refers to the situation in which two or more predictors are closely related to
each other.

The presence of collinearity makes it difficult to separate out the individual effects of
collinear variables on the response. Moreover, a small change in the data can cause the
coefficient values to be estimated very differently. This means there is a great
uncertainty in the estimates in presence of collinearity.
library(ISLR)
data(Credit)
attach(Credit)

par(mfrow = c(1, 2))
plot(Limit, Age, xlab = "Limit", ylab = "Age", col = "red")
plot(Limit, Rating, xlab = "Limit", ylab = "Rating", col = "red")
106/112
Collinearity

[Scatter plots from the Credit data: Age against Limit (left) and Rating against Limit (right).]

107/112
Collinearity

A simple way to detect collinearity is to look at the correlation matrix of the predictors.
cor(cbind(Limit, Age, Rating))
#            Limit       Age    Rating
# Limit  1.0000000 0.1008879 0.9968797
# Age    0.1008879 1.0000000 0.1031650
# Rating 0.9968797 0.1031650 1.0000000

However, it is possible that collinearity exists among three or more variables even when
no pair of variables has high correlation. This situation is called multicollinearity.

108/112
Collinearity

A way to inspect multicollinearity is the variance inflation factor, i.e. the ratio of the
variance of β̂j when fitting the full model divided by the variance of β̂j if fit on its own.
VIF(\hat\beta_j) = \frac{1}{1 - R^2_{X_j \mid X_{-j}}} \ge 1 \tag{20}

where R^2_{X_j \mid X_{-j}} is the R 2 statistic from the regression of Xj on all the other predictors.

• If VIF(β̂j ) = 1 there is no collinearity


• If VIF(β̂j ) > 5 there is a problem

Solutions:

• drop one of the problematic variables


• combine the collinear variables into a single predictor (e.g. taking the average of
each pair of predictors)
109/112
Collinearity

In the Credit data set, suppose we regress Balance on Limit, Rating and Age. We
can then compute the VIF for each of the three predictors as follows:
lm.limit <- lm(Limit ~ Rating + Age)
1 / (1 - summary(lm.limit)$r.squared)    # VIF Limit
# [1] 160.5929

lm.rating <- lm(Rating ~ Limit + Age)
1 / (1 - summary(lm.rating)$r.squared)   # VIF Rating
# [1] 160.6683

lm.age <- lm(Age ~ Rating + Limit)
1 / (1 - summary(lm.age)$r.squared)      # VIF Age
# [1] 1.011385

110/112
Collinearity

Alternatively one could install the car package in R and directly use the vif() function.
library(car)

lm.full <- lm(Balance ~ Limit + Rating + Age)
vif(lm.full)
#      Limit     Rating        Age
# 160.592880 160.668301   1.011385

111/112
References

Sources and recommended reading:

1. A. J. Dobson & A. G. Barnett (2018). An Introduction to Generalised Linear Models, Chapman and Hall. Chapter 6.
2. G. James, D. Witten, T. Hastie & R. Tibshirani (2013). An Introduction to Statistical Learning with Applications in R, Springer. Chapter 3 (Section 3.5 excluded).

112/112
