
MATH5806 Applied Regression Analysis

Lecture 3 - Linear Gaussian models

Boris Beranger
Term 2, 2021

1/112
Chapter 3 - Linear Gaussian models
3.1 - Introduction
3.2 - Basic results
3.2.1 - Maximum likelihood estimation
3.2.2 - Least squares estimation
3.3 - Diagnostics

3.4 - Analysis of Variance (ANOVA)

3.5 - Analysis of Covariance (ANCOVA)

3.6 - General linear models


3.7 - Extensions
3.8 - Potential problems
2/112
Chapter 3 - Linear Gaussian models
3.1 - Introduction
3.2 - Basic results
3.2.1 - Maximum likelihood estimation
3.2.2 - Least squares estimation
3.3 - Diagnostics

3.4 - Analysis of Variance (ANOVA)

3.5 - Analysis of Covariance (ANCOVA)

3.6 - General linear models


3.7 - Extensions
3.8 - Potential problems
3/112
Goals

This lecture is about obtaining estimates for

• Multivariate linear regression


• ANOVA

This is a case where analytical solutions exist.

4/112
Introduction

Linear models of the form

E(Y_i) = \mu_i = x_i^\top \beta, \qquad Y_i \sim N(\mu_i, \sigma^2),

where the r.v. Yi are independent, are the basis of most analyses of continuous data.
There are two main models of this form

• Multivariate regression: association between a continuous response and several


explanatory variables
• Analysis of variance (ANOVA): comparisons of more than two means
• Analysis of covariance (ANCOVA)

5/112
Introduction

This model (usually called the general linear model) is written as

y = Xβ + ε (1)

where

y> = [Y1 , . . . , YN ]
β > = [β1 , β2 , . . . , βp ]
ε> = [ε1 , . . . , εN ]
where the εi are i.i.d. N(0, σ²) for i = 1, . . . , N.
The error term captures everything the model misses: the true relationship may not be exactly
linear, relevant variables may be omitted, and there may be measurement error. The error term
is usually assumed to be independent of X.
6/112
Introduction

Additionally, X is an N × p design matrix, which in the case of a multiple regression


above is set to

 
X = \begin{bmatrix}
1 & X_{12} & X_{13} & \cdots & X_{1p} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & X_{N2} & X_{N3} & \cdots & X_{Np}
\end{bmatrix}

The parameter βj is interpreted as the average effect on Y of a one unit increase in the
covariate xj , holding all the other predictors fixed.

7/112
Introduction

The model is linear in the parameters, which means we have, for instance:

E(Y_i) = \beta_1 + \beta_2 X_{i2} + \beta_3 X_{i3}^2

E(Y_i) = \beta_1 + \gamma_1 \delta_1 X_{i2} + \exp(\beta_2) X_{i3}.

But NOT:
E(Y_i) = \beta_1 + \beta_2 X_{i2}^{\beta_2}

E(Y_i) = \beta_1 \exp(\beta_2 X_{i2}).

8/112
Chapter 3 - Linear Gaussian models
3.1 - Introduction
3.2 - Basic results
3.2.1 - Maximum likelihood estimation
3.2.2 - Least squares estimation
3.3 - Diagnostics

3.4 - Analysis of Variance (ANOVA)

3.5 - Analysis of Covariance (ANCOVA)

3.6 - General linear models


3.7 - Extensions
3.8 - Potential problems
9/112
Chapter 3 - Linear Gaussian models
3.1 - Introduction
3.2 - Basic results
3.2.1 - Maximum likelihood estimation
3.2.2 - Least squares estimation
3.3 - Diagnostics

3.4 - Analysis of Variance (ANOVA)

3.5 - Analysis of Covariance (ANCOVA)

3.6 - General linear models


3.7 - Extensions
3.8 - Potential problems
10/112
Maximum likelihood estimation

The joint distribution is

f(Y_1, \ldots, Y_N \mid \theta_1, \ldots, \theta_N) = \prod_{i=1}^N \exp\left[ Y_i b(\theta_i) + c(\theta_i) + d(Y_i) \right]
 = \exp\left[ \sum_{i=1}^N Y_i b(\theta_i) + \sum_{i=1}^N c(\theta_i) + \sum_{i=1}^N d(Y_i) \right]

For each Yi , the log-likelihood is


`i = Yi b(θi ) + c(θi ) + d(Yi ), (2)

which gives

E(Y_i) = \mu_i = -\frac{c'(\theta_i)}{b'(\theta_i)}, \qquad
\operatorname{Var}(Y_i) = \frac{b''(\theta_i)c'(\theta_i) - c''(\theta_i)b'(\theta_i)}{[b'(\theta_i)]^3}, \qquad
g(\mu_i) = x_i^\top \beta = \eta_i.
11/112
Maximum likelihood estimation

The log-likelihood for all the Yi ’s is then


\ell(\theta; Y_1, \ldots, Y_N) = \sum_{i=1}^N \ell_i = \sum_{i=1}^N Y_i b(\theta_i) + \sum_{i=1}^N c(\theta_i) + \sum_{i=1}^N d(Y_i).

To obtain the maximum likelihood estimator for the parameter βj we derive the score
function using the chain rule:
U_j = \frac{d\ell}{d\beta_j} = \sum_{i=1}^N \frac{d\ell_i}{d\beta_j}
    = \sum_{i=1}^N \left( \frac{d\ell_i}{d\theta_i} \cdot \frac{d\theta_i}{d\mu_i} \cdot \frac{d\mu_i}{d\beta_j} \right). \tag{3}

12/112
Maximum likelihood estimation

Now, let’s consider each term separately

• \dfrac{d\ell_i}{d\theta_i} = Y_i b'(\theta_i) + c'(\theta_i) = b'(\theta_i)(Y_i - \mu_i)

• \dfrac{d\theta_i}{d\mu_i} = \left( \dfrac{d\mu_i}{d\theta_i} \right)^{-1} = \left( \dfrac{-c''(\theta_i)}{b'(\theta_i)} + \dfrac{c'(\theta_i) b''(\theta_i)}{[b'(\theta_i)]^2} \right)^{-1} = \left( b'(\theta_i)\operatorname{Var}(Y_i) \right)^{-1}

• \dfrac{d\mu_i}{d\beta_j} = \dfrac{d\mu_i}{d\eta_i} \cdot \dfrac{d\eta_i}{d\beta_j} = \dfrac{d\mu_i}{d\eta_i}\, X_{ij}

Hence, the score function is

U_j = \sum_{i=1}^N \left[ \frac{(Y_i - \mu_i)}{\operatorname{Var}(Y_i)}\, X_{ij} \left( \frac{d\mu_i}{d\eta_i} \right) \right] \tag{4}

13/112
Maximum likelihood estimation

The variance-covariance matrix of the score is

Ijk = E[Uj Uk ] (5)

which represents the information matrix


\mathcal{I}_{jk} = E\left\{ \left[ \sum_{i=1}^N \frac{(Y_i - \mu_i)}{\operatorname{Var}(Y_i)}\, X_{ij} \left( \frac{d\mu_i}{d\eta_i} \right) \right] \left[ \sum_{l=1}^N \frac{(Y_l - \mu_l)}{\operatorname{Var}(Y_l)}\, X_{lk} \left( \frac{d\mu_l}{d\eta_l} \right) \right] \right\}
 = \sum_{i=1}^N \frac{E\left[ (Y_i - \mu_i)^2 \right] X_{ij} X_{ik}}{[\operatorname{Var}(Y_i)]^2} \left( \frac{d\mu_i}{d\eta_i} \right)^2

because E[(Y_i − µ_i)(Y_l − µ_l)] = 0 for i ≠ l.

14/112
Maximum likelihood estimation

Since E[(Yi − µi )2 ] = Var(Yi )


\mathcal{I}_{jk} = \sum_{i=1}^N \frac{X_{ij} X_{ik}}{\operatorname{Var}(Y_i)} \left( \frac{d\mu_i}{d\eta_i} \right)^2 \tag{6}

Furthermore, since dµi/dηi = 1 (identity link), the information takes the form

\mathcal{I}_{jk} = \sum_{i=1}^N \frac{X_{ij} X_{ik}}{\operatorname{Var}(Y_i)}

or in matrix notation

\mathcal{I} = \frac{1}{\sigma^2} X^\top X \tag{7}
15/112
Maximum likelihood estimation

If we want to apply the method of scoring to approximate the MLE, the estimating
equation is

β̂ (m) = β̂ (m−1) + [I (m−1) ]−1 u(m−1)


[I (m−1) ]β̂ (m) = [I (m−1) ]β̂ (m−1) + u(m−1) (8)

From Equation (6), the information matrix can be written as

I = X> WX (9)
where w_{ii} = \frac{1}{\operatorname{Var}(Y_i)} \left( \frac{d\mu_i}{d\eta_i} \right)^2 = \frac{1}{\sigma^2}

16/112
Maximum likelihood estimation

Finally, the expression on the right hand side of (8) can be written as
\sum_{k=1}^p \sum_{i=1}^N \frac{X_{ij} X_{ik}}{\operatorname{Var}(Y_i)} \left( \frac{d\mu_i}{d\eta_i} \right)^2 \hat\beta_k^{(m-1)} + \sum_{i=1}^N \frac{(Y_i - \mu_i) X_{ij}}{\operatorname{Var}(Y_i)} \left( \frac{d\mu_i}{d\eta_i} \right)

which can be written in matrix terms as

[\mathcal{I}^{(m-1)}]\, \hat\beta^{(m-1)} + u^{(m-1)} = X^\top W z \tag{10}

where

z_i = \sum_{k=1}^p X_{ik} \hat\beta_k^{(m-1)} + (Y_i - \mu_i) \frac{d\eta_i}{d\mu_i} = \sum_{k=1}^p X_{ik} \hat\beta_k^{(m-1)} + Y_i - \sum_{k=1}^p X_{ik} \hat\beta_k^{(m-1)} = Y_i

17/112
Maximum likelihood estimation

And, therefore,
\frac{1}{\sigma^2} X^\top X \hat\beta = \frac{1}{\sigma^2} X^\top y \implies \hat\beta = (X^\top X)^{-1} X^\top y
Properties:

• Unbiasedness

E(β̂) = (X> X)−1 X> E(Y) = (X> X)−1 X> Xβ = β

• Consistency: the variance-covariance matrix of β̂ is I −1 = σ 2 (X> X)−1 , which (under
standard conditions on X) vanishes as N → ∞, so β̂ is consistent

• Normality
β̂ ∼ N (β, σ 2 (X> X)−1 )
18/112
Maximum likelihood estimation

[Figure 1: Left: Simulated data. Red line gives the true relationship (population regression line), blue line gives the least squares estimate. Right: light blue lines are 10 least squares lines.]
19/112
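As a quick illustrative sketch (simulated data, not part of the lecture example), the result β̂ ∼ N(β, σ²(X⊤X)⁻¹) can be checked by repeatedly generating data from a known model and comparing the empirical mean and covariance of the estimates with β and σ²(X⊤X)⁻¹:

# Illustrative sketch: sampling distribution of the least squares estimator
set.seed(1)
N <- 100; sigma <- 2
x <- runif(N, -2, 2)
X <- cbind(1, x)                         # design matrix with intercept
beta <- c(1, 3)                          # "true" coefficients assumed for the simulation

est <- replicate(1000, {
  y <- X %*% beta + rnorm(N, sd = sigma)
  as.vector(solve(t(X) %*% X) %*% t(X) %*% y)   # beta-hat = (X'X)^{-1} X'y
})

rowMeans(est)                 # close to (1, 3): unbiasedness
cov(t(est))                   # close to sigma^2 (X'X)^{-1}
sigma^2 * solve(t(X) %*% X)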
Chapter 3 - Linear Gaussian models
3.1 - Introduction
3.2 - Basic results
3.2.1 - Maximum likelihood estimation
3.2.2 - Least squares estimation
3.3 - Diagnostics

3.4 - Analysis of Variance (ANOVA)

3.5 - Analysis of Covariance (ANCOVA)

3.6 - General linear models


3.7 - Extensions
3.8 - Potential problems
20/112
Least squares estimation

It is possible to derive an estimator without making any further assumption about the
distribution of y.

Under Gauss-Markov assumptions, i.e.

• E(ε) = 0
• E(εε> ) = σ 2 IN

and assuming that N > p, the least squares function is

RSS(β) = (y − Xβ)> (y − Xβ) = ε> ε (11)

This is a multivariate function in β which is (strictly) convex.

21/112
Least squares estimation

Hence there is a unique minimiser β̂, satisfying


\frac{d}{d\beta} (y - X\beta)^\top (y - X\beta) = \frac{d}{d\beta} \left( y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X\beta \right)
 = -2 X^\top y + 2 X^\top X \beta = 0
 \;\Longrightarrow\; X^\top X \beta = X^\top y

Assuming that X has full column rank p (which requires N ≥ p), X^\top X is invertible and we can write

\hat\beta = (X^\top X)^{-1} X^\top y.

While this estimator is unbiased, it is necessary to introduce an assumption on the


distribution of the y to derive the distribution of the estimator.
22/112
Least squares estimation

23/112
Chapter 3 - Linear Gaussian models
3.1 - Introduction
3.2 - Basic results
3.2.1 - Maximum likelihood estimation
3.2.2 - Least squares estimation
3.3 - Diagnostics

3.4 - Analysis of Variance (ANOVA)

3.5 - Analysis of Covariance (ANCOVA)

3.6 - General linear models


3.7 - Extensions
3.8 - Potential problems
24/112
Confidence intervals

The estimate of the uncertainty of the estimates is given by the standard error, SE(β̂):

SE(\hat\beta_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right], \qquad
SE(\hat\beta_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
where σ 2 = Var(ε).

Similarly in the presence of other covariates.

Remark: the SE(β̂1 ) is smaller when the Xi are more spread out: intuitively, we are
more able to estimate the slope of the line in this case.
25/112
Confidence intervals

Standard errors can be used to compute (1 − α)100% confidence intervals as:


\left[ \hat\beta_k - t_{\alpha/2,\,n-2}\, SE(\hat\beta_k),\;\; \hat\beta_k + t_{\alpha/2,\,n-2}\, SE(\hat\beta_k) \right],

k = 0, 1, where t_{\alpha/2,\,n-2} is the upper α/2 critical value of a Student-t distribution with n − 2
degrees of freedom.

26/112
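As a small sketch (simulated data, not from the notes), the standard-error formulas above can be evaluated directly and compared with R's confint():

# Sketch: standard errors and 95% confidence intervals in simple regression
set.seed(2)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

sigma2.hat <- sum(residuals(fit)^2) / (n - 2)
se0 <- sqrt(sigma2.hat * (1 / n + mean(x)^2 / sum((x - mean(x))^2)))
se1 <- sqrt(sigma2.hat / sum((x - mean(x))^2))

tcrit <- qt(0.975, df = n - 2)
cbind(coef(fit) - tcrit * c(se0, se1), coef(fit) + tcrit * c(se0, se1))
confint(fit, level = 0.95)   # should match the intervals above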
Deviance

The predicted values given the MLE are Xβ̂, therefore the residuals are defined as y − Xβ̂,
the residual sum of squares is

(y - X\hat\beta)^\top (y - X\hat\beta) = \sum_{i=1}^N (Y_i - \hat{Y}_i)^2

and the deviance is

D = 2\left[ \ell(\hat\beta_{\max}) - \ell(\hat\beta) \right]
  = 2\left[ -\frac{N}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}(y - X\hat\beta)^\top(y - X\hat\beta) + \frac{N}{2}\log(2\pi\sigma^2) \right]
  = \frac{1}{\sigma^2}\left( y^\top y - 2\hat\beta^\top X^\top y + \hat\beta^\top X^\top X\hat\beta \right)
  = \frac{1}{\sigma^2}\left( y^\top y - \hat\beta^\top X^\top y \right)

27/112
Hypothesis testing

Now, let’s consider a null hypothesis H0 and an alternative hypothesis H1 .


  
H_0: \beta = \beta_0 = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_q \end{bmatrix}, \qquad
H_1: \beta = \beta_1 = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}, \qquad q < p.

The scaled deviance can be used for model comparison.


\Delta D = D_0 - D_1 = \frac{1}{\sigma^2}\left[ (y^\top y - \hat\beta_0^\top X_0^\top y) - (y^\top y - \hat\beta_1^\top X_1^\top y) \right]
 = \frac{1}{\sigma^2}\left[ \hat\beta_1^\top X_1^\top y - \hat\beta_0^\top X_0^\top y \right]

28/112
Hypothesis testing

Then we have D0 ∼ χ2 (N − q) and D1 ∼ χ2 (N − p), and thus

∆D ∼ χ2 (p − q)

when N is large.

Remark: If the models do not fit the data well, it is possible that either D0 or D1
follow a non-central χ2 distribution with non-centrality parameter ν.

If the value of ∆D falls in the critical region, then we reject H0 in favour of H1 on
the grounds that model M1 provides a significantly better description of the data.

29/112
Hypothesis testing

The variance σ 2 , however, is unknown; replacing it by its estimate σ̂ 2 makes the χ2 result
for ∆D above only approximate.

But σ 2 may be eliminated by using the ratio


F = \frac{(D_0 - D_1)/(p-q)}{D_1/(N-p)} = \frac{\left( \hat\beta_1^\top X_1^\top y - \hat\beta_0^\top X_0^\top y \right)/(p-q)}{\left( y^\top y - \hat\beta_1^\top X_1^\top y \right)/(N-p)} \tag{12}

This may be explicitly calculated from the fitted values.

• Under the null hypothesis H0 , the statistic F follows the Fp−q,N−p distribution
• Otherwise, the statistic F has a non-central distribution
30/112
Hypothesis testing

The F -statistic is usually used to test the hypothesis

H0 : β2 = β3 = . . . = βp = 0 (all coefficients except the intercept are zero)
H1 : at least one of these βj is non-zero

and

F = \frac{(TSS - RSS)/(p-1)}{RSS/(N-p)} \tag{13}

where TSS = \sum_{i=1}^N (Y_i - \bar{Y})^2 and RSS = \sum_{i=1}^N (Y_i - \hat{Y}_i)^2.

It is possible to show that E(RSS/(N − p)) = σ 2 and, under H0 ,
E[(TSS − RSS)/(p − 1)] = σ 2 . Therefore, under H0 , the F -statistic is expected to be close
to 1, while under H1 , E[(TSS − RSS)/(p − 1)] > σ 2 and the F -statistic tends to be larger than 1.
31/112
Hypothesis testing

Remark 1: the F -statistic is important because if you test for the effect of any
predictor separately (without any correction) about 5% of the p-values will be under α
(e.g. 0.05) by chance. The F -statistic does not suffer from this problem because it
adjusts for the number of predictors.

Remark 2: The approach using the F -statistic works when p < N; however, sometimes
we have a very large number of variables and p > N. In this case, multiple regression
cannot be fitted and the F -statistic cannot be used.

32/112
Hypothesis testing

The task of determining which predictors are associated with the response is referred
to as variable selection. We will discuss more about this in Week 8.

One problem is that there are 2^p possible models to compare, each containing a subset of the p
variables. Automated approaches to choose the set of models to consider are

• forward selection: starting with the null model and add variables, by selecting always the
model with the lowest RSS; can always be used
• backward selection: starting with the model with all the variables and remove the variable
with the largest p-value and continue; cannot be used when p > N
• mixed selection: combination of forward and backward selection, until all the variables
outside the model are associated with larger p-values than variables within the model

33/112
Distribution of residuals

The estimated residuals are defined as

e_i = y_i - x_i^\top \hat\beta = y_i - \hat{y}_i \tag{14}

which can also be written as

e = y − Xβ̂ = y − X(X> X)−1 X> y (15)

The hat (or projection) matrix H := X(X> X)−1 X> maps the vector of response values
to the vector of fitted values

Hy = X(X> X)−1 X> y = Xβ̂ = ŷ.

34/112
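A small sketch (using R's built-in mtcars data, not the lecture's example) computing H explicitly and checking that Hy reproduces the fitted values:

# Sketch: hat matrix H = X (X'X)^{-1} X' and its basic properties
X <- cbind(1, mtcars$wt, mtcars$hp)
y <- mtcars$mpg

H <- X %*% solve(t(X) %*% X) %*% t(X)
max(abs(H %*% y - fitted(lm(mpg ~ wt + hp, data = mtcars))))  # ~ 0: Hy = y-hat
max(abs(H - t(H)))       # ~ 0: symmetric
max(abs(H %*% H - H))    # ~ 0: idempotent
sum(diag(H))             # equals p = 3 (the number of columns of X)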
Distribution of residuals

H is symmetric and idempotent, therefore, it is a projection matrix. It defines a


transformation of y in N-dimensional space to vector ŷ in a subspace such that ŷ is as
close to y as possible.

Remark: Note also that for an idempotent matrix A: rank(A) = tr(A).

Then the residuals are given by

e = y − ŷ = (I − H)y. (16)

35/112
Distribution of residuals

The variance-covariance matrix of the estimated residuals is

Var(e) = Var[(I − H)y]


= Var[(I − H)(Xβ + ε)]
= Var[(I − H)ε]
= (I − H)Var[ε](I − H)>
= σ 2 (I − H)(I − H)>
= σ 2 (I − H)
= σ 2 [I − X(X> X)−1 X> ]

36/112
Distribution of residuals

Theorem
If the Gauss-Markov assumptions hold, then

e ∼ N (0, σ 2 (I − H)).

(Note that the covariance matrix is singular.)


Since Cov [e] is not diagonal, the residuals are not independent; since
Var[ei ] = σ 2 (1 − hii ), their variance is not constant.

37/112
Distribution of residuals
Proof.
If the model assumptions hold, y is a Gaussian random vector, and then (16) shows
that e is a Gaussian random vector. Its mean is

E[e] = (I − H)E[y] = (I − H)Xβ = 0

and its covariance matrix is

Cov [e] = Cov [(I − H)y] = (I − H)Cov [y](I − H)> = (I − H)(σ 2 I)(I − H)
= σ 2 (I − H)

since H is symmetric (H> = H) and I − H is idempotent:

(I − H)(I − H) = I(I − H) − H(I − H) = I − H − H + HH = I − H.


38/112
Distribution of residuals

Theorem
Under the Gauss-Markov assumptions,
\sigma^{-2} e^\top e = \sigma^{-2} \sum_{i=1}^N e_i^2 \sim \chi^2_{N-p}

39/112
Distribution of residuals

Proof.
Using the general result on the distribution of quadratic forms, we need to identify the
noncentrality parameter and find the rank of the matrix I − H.
The distribution of the residuals is e ∼ N (0, σ 2 (I − H)).
Additionally, since E(e) = 0 the noncentrality parameter is λ = 0. Moreover,

rank(I − H) = tr(I − H) = tr(I) − tr(H) = N − tr(X(X> X)−1 X> ) = N − tr((X> X)−1 X> X) = N − tr(Ip ) = N − p,

using the cyclic property of the trace. Equivalently, H is a projection matrix of rank p = rank(X),
so tr(H) = p.
This concludes the proof of the theorem.
40/112
Distribution of residuals

It follows that σ̂ 2 := \sum_{i=1}^N e_i^2 / (N − p) is an unbiased estimator of σ 2 .
Its square root σ̂ is called the residual standard error, and it is the estimate needed to
compute the standard errors of the coefficients.
Residual plots are useful graphical tools for identifying non-linearity in the data: we
can plot the residuals (Yi − Ŷi ) versus the fitted values Ŷi .

Ideally the residual plot will not show any discernible pattern.

If the residual plot indicates that there are non-linear associations in the data, then a
simple approach is to use non-linear transformations of the predictors, e.g. log X, √X,
X 2 , etc.

41/112
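A minimal sketch (simulated data with a deliberately quadratic signal) of the residuals-versus-fitted plot described above:

# Sketch: residuals vs fitted values reveal non-linearity
set.seed(3)
x <- runif(100, 0, 3)
y <- 1 + x^2 + rnorm(100, sd = 0.5)   # true relationship is quadratic
fit <- lm(y ~ x)                      # misspecified linear fit

plot(fitted(fit), residuals(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)                # a clear U-shape indicates non-linearity
# plot(fit, which = 1) produces the same diagnostic with a smooth curve added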
Distribution of residuals

42/112
Studentized residuals

Plotting the residuals vs. fitted values may reveal possible violations of linearity or
homoscedasticity. Standardising the residuals may lead to a better feeling for their
magnitude.
We have calculated the variance of the i-th residual to be

Var[ei ] = σ 2 [I − H]ii = σ 2 (1 − hii ). (17)

Accordingly, one defines the studentized residuals:


e'_i = \frac{e_i}{\sqrt{\hat\sigma^2 (1 - h_{ii})}}.

43/112
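A short sketch (mtcars data, illustrative) computing the studentized residuals by hand and comparing with rstandard(), which applies the same standardisation:

# Sketch: studentized residuals by hand vs rstandard()
fit <- lm(mpg ~ wt + hp, data = mtcars)
e <- residuals(fit)
h <- hatvalues(fit)                        # diagonal elements h_ii of the hat matrix
sigma2.hat <- sum(e^2) / df.residual(fit)  # unbiased estimate of sigma^2

e.stud <- e / sqrt(sigma2.hat * (1 - h))
max(abs(e.stud - rstandard(fit)))          # ~ 0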
Studentized residuals

Recommendations about outliers: Points should not be routinely deleted from an


analysis just because they do not fit the model. Outliers and bad leverage points are
signals, flagging potential problems with the model. Outliers often point out an
important feature of the problem not considered before. They may point to an
alternative model in which the points are no longer outliers.

44/112
Leverage

The i-th diagonal entry hii of H is called the leverage of the i-th observation. An
observation with high leverage can make a substantial difference to the fit of the model.

Model fits are particularly sensitive to data with high leverage.

Also note from (17) that residuals at points with high leverage have small variance.

45/112
Leverage

The sum of leverages equals the number of parameters:


\sum_{i=1}^N h_{ii} = tr(H) = tr(X(X^\top X)^{-1}X^\top) = tr(I_p) = p

[As said before: This is true because rank(H) = tr(H) and H has rank p (which is the
rank(X)): therefore the tr(H) can be seen as the tr(Ip ).]

Hence on average the leverage is p/N. As a rule of thumb, if hii is greater than two or
three times p/N, it may be a concern.

46/112
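A sketch (mtcars, illustrative) of the leverage rule of thumb:

# Sketch: leverages h_ii and the 2p/N rule of thumb
fit <- lm(mpg ~ wt + hp, data = mtcars)
h <- hatvalues(fit)
p <- length(coef(fit))
N <- nrow(mtcars)

sum(h)                  # equals p
which(h > 2 * p / N)    # observations with noticeably high leverage
plot(h, type = "h", ylab = "Leverage"); abline(h = 2 * p / N, lty = 2)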
Cook’s distance

Definition
The Cook’s distance is given by:
D_i = \frac{1}{p\,\hat\sigma^2}\, \lVert \hat{y} - \hat{y}_{-i} \rVert^2

Cook’s distance hence measures the (rescaled) sum of squared differences between
fitted values when the i-th datum is removed.
It is a measure for the influence of the i-th datum on the entire model fit.
It can be shown that

D_i = \frac{1}{p}\, \frac{h_{ii}}{1 - h_{ii}}\, (e'_i)^2 ,

where e'_i is the studentized residual defined earlier.
47/112
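A sketch (mtcars, illustrative) computing Cook's distance from the identity above and comparing with cooks.distance():

# Sketch: Cook's distance via D_i = (1/p) * h_ii/(1 - h_ii) * (e'_i)^2
fit <- lm(mpg ~ wt + hp, data = mtcars)
h <- hatvalues(fit)
p <- length(coef(fit))

D <- rstandard(fit)^2 * h / ((1 - h) * p)
max(abs(D - cooks.distance(fit)))   # ~ 0
sort(D, decreasing = TRUE)[1:3]     # look for gaps among the largest values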
Cook’s distance

One sees that Cook’s distance statistic Di is particularly large at high leverage points
with large residuals.

• According to Weisberg (2005, p. 200), if the largest value of Di is substantially


less than one, deletion of a case will not change the estimate ŷ by much;
• Fox (2002, p. 198) is among many authors who recommend 4/(n − 2) as a rough
cutoff for noteworthy values of Di for simple linear regression;
• In practice, it is important to look for gaps in the values of Cook’s distance and
not just whether values exceed the suggested cut-off.

48/112
Coefficient of determination, R 2

The strength of a linear relationship is usually measured through the sample correlation
coefficient R.
• An R close to 1 indicates a positive linear relationship
• An R close to -1 indicates a negative linear relationship.
Squaring R gives R 2 : a value of R 2 close to one gives a numerical indication of
the strength of the regression.
The residual sum of squares

RSS = \sum_{i=1}^N e_i^2 = e^\top e = (y - X\beta)^\top (y - X\beta)

is minimised by the least squares estimate β̂ = (X> X)−1 X> y, therefore

\widehat{RSS} = (y - X\hat\beta)^\top (y - X\hat\beta) = y^\top y - \hat\beta^\top X^\top y
49/112
Coefficient of determination, R 2

For the minimal model

Y_i = \beta_0 + \varepsilon_i ,

we have X> X = N and X> y = \sum_{i=1}^N y_i , so RSS is minimised for β̂0 = ȳ , giving
RSS_0 = \sum_{i=1}^N (y_i - \bar{y})^2 . This is the worst possible value for RSS. It is also sometimes
called the total sum of squares (TSS).
If parameters are added to the model, then RSS must decrease. The relative amount
of decrease

R^2 := \frac{TSS - RSS}{TSS} \tag{18}
is called the coefficient of determination. It is the proportion of the total variation in
the data which is explained by the model.
50/112
Coefficient of determination, R 2

For the maximal model, we will have RSS = 0 and thus R 2 = 1: this means that the
R 2 always increases when more variables are added to the model, even if those
variables are weakly associated with the response. The fact that adding a variable
leads to a small increase of the R 2 statistic provides evidence that the contribution of
that variable is small.
The coefficient of determination can be interpreted as the proportion of variance
explained by the model.
In the case of one covariate, the coefficient of determination is the squared correlation
between Y and x. In multiple regression, it turns out that R 2 = Cor(Y, Ŷ)2 : among all
possible linear models, the least squares fit maximises the correlation between the
response and the fitted values.

51/112
Coefficient of determination, R 2

There is a relationship between the F -statistic and the R 2 coefficient. We can write

R^2 \times TSS = TSS - RSS \qquad \text{and} \qquad \frac{RSS}{TSS} = 1 - R^2 ,

and from (13) we have

F = \frac{(TSS - RSS)/(p-1)}{RSS/(N-p)} = \frac{R^2 \times TSS}{RSS} \cdot \frac{N-p}{p-1} = \frac{R^2}{1 - R^2} \cdot \frac{N-p}{p-1} \sim F_{p-1,\,N-p}.

52/112
Example - Cheddar cheese

As cheese ages, various chemical processes take place that determine the taste of the
final product. In a study of Cheddar cheese from the La Trobe Valley of Victoria,
Australia, samples of cheese were analysed for their chemical composition and were
subjected to taste tests. Overall taste scores were obtained by combining the scores
from several tasters.
Data are measured on concentrations of various chemicals in 30 samples of mature
Cheddar cheese, and a subjective measure of taste for each sample. The variables
acetic and H2S are the natural logarithm of the concentration of acetic acid and
hydrogen sulfide respectively. The variable lactic has not been transformed.

53/112
Example - Cheddar cheese

Let’s load the data


# We use the package "faraway"
# If needed, install the package by running the command
# install.packages("faraway")
library(faraway)

data(cheddar)
attach(cheddar)

54/112
Example - Cheddar cheese

The model (M1) is


taste = β0 + β1 Acetic + β2 H2S + β3 Lactic + ε

We can analytically compute the MLE


X <- cbind(rep(1, nrow(cheddar)), Acetic, H2S, Lactic)
y <- matrix(taste, ncol = 1)

b.hat <- solve(t(X) %*% X) %*% t(X) %*% y
b.hat

#                [,1]
#        -28.8767696
# Acetic   0.3277413
# H2S      3.9118411
# Lactic  19.6705434
55/112
Example - Cheddar cheese

The estimate of the variance can be computed through the residuals


\hat\sigma^2 = \frac{1}{N - p} (y - X\hat\beta)^\top (y - X\hat\beta)

The residual standard deviation is \sqrt{\hat\sigma^2} and the standard errors of the estimates
are the square roots of the elements on the diagonal of σ̂ 2 (X> X)−1 .

56/112
Example - Cheddar cheese

N <- nrow(cheddar)
p <- ncol(X)
sigma2.hat <- t(y - X %*% b.hat) %*% (y - X %*% b.hat) / (N - p)
sigma2.hat

#          [,1]
# [1,] 102.6312

as.vector(sigma2.hat) * solve(t(X) %*% X)
sqrt(diag(as.vector(sigma2.hat) * solve(t(X) %*% X)))

#              Acetic       H2S    Lactic
# 19.735418  4.459757  1.248430  8.629055
57/112
Example - Cheddar cheese

And the R 2 statistic is


RSS <- t(y - X %*% b.hat) %*% (y - X %*% b.hat)
TSS <- t(y - mean(y)) %*% (y - mean(y))

TSS
#          [,1]
# [1,] 7662.887

RSS
#          [,1]
# [1,] 2668.411

R2 <- (TSS - RSS) / TSS
R2
#           [,1]
# [1,] 0.6517747
58/112
Example - Cheddar cheese

We check whether any of the three possible predictors are significant. Define

H0 : β1 = β2 = β3 = 0
H1 : at least one βj ≠ 0

In other terms, does the model (M1) offer any significant improvement over the
empty model (M0): taste = β0 + ε?
Fstat <- R2 / (1 - R2) * (N - p) / (p - 1)
Fstat
#          [,1]
# [1,] 16.22143

59/112
Example - Cheddar cheese

The F statistic follows an F(3, 26) distribution under H0. The p-value is the probability of
exceeding the observed test statistic, i.e.:
pval <- 1 - pf(Fstat, df1 = p - 1, df2 = N - p)
pval
#              [,1]
# [1,] 3.810185e-06

So there is very strong evidence that at least one of the three predictors is significant.

60/112
Example - Cheddar cheese

All these statistics can be computed through the lm function.


cheese.lm <- lm(taste ~ Acetic + H2S + Lactic)
summary(cheese.lm)

# Call:
# lm(formula = taste ~ Acetic + H2S + Lactic)
#
# Residuals:
#     Min      1Q  Median      3Q     Max
# -17.390  -6.612  -1.009   4.908  25.449

61/112
Example - Cheddar cheese

# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept) -28.8768    19.7354  -1.463  0.15540
# Acetic        0.3277     4.4598   0.073  0.94198
# H2S           3.9118     1.2484   3.133  0.00425 **
# Lactic       19.6705     8.6291   2.280  0.03108 *
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 10.13 on 26 degrees of freedom
# Multiple R-squared: 0.6518,  Adjusted R-squared: 0.6116
# F-statistic: 16.22 on 3 and 26 DF,  p-value: 3.81e-06

In the model M1 with three predictors, is there evidence that H2S is significant?

62/112
Chapter 3 - Linear Gaussian models
3.1 - Introduction
3.2 - Basic results
3.2.1 - Maximum likelihood estimation
3.2.2 - Least squares estimation
3.3 - Diagnostics

3.4 - Analysis of Variance (ANOVA)

3.5 - Analysis of Covariance (ANCOVA)

3.6 - General linear models


3.7 - Extensions
3.8 - Potential problems
63/112
Analysis of Variance - ANOVA

Analysis of variance is a method to compare means of groups of continuous


observations where the groups are defined by the levels of the factors.

• Y: continuous variable
• x: categorical variable(s)

The elements of the design matrix X are dummy variables.

64/112
Example: One-factor analysis

Genetically similar seeds are randomly assigned to be raised in either a nutritionally
enriched environment (treatment A or treatment B) or standard conditions (control
group), using a completely randomised experimental design. After a predetermined
time all plants are harvested, dried and weighed.
library(dobson)
data("plant.dried")
attach(plant.dried)

head(plant.dried)

#     group weight
# 1 Control   4.17
# 2 Control   5.58
# 3 Control   5.18
# 4 Control   6.11
# 5 Control   4.50
65/112
Example: One-factor analysis

Remark: if experimental units are randomly allocated to groups corresponding to J


levels of a factor, this is called a completely randomised experiment.

The responses at level j, i.e. Yj1 , . . . , Yjnj are called replicates. In the example nj = K
for j = 1, 2, 3, but this is not always true. We assume nj = K throughout for simplicity.
The response vector (of length N = JK ), is given by:
y = [Y11 , Y12 , . . . , Y1K , Y21 , . . . , Y2K , . . . , YJ1 , . . . , YJK ]

We consider three specifications of the model:


(1) E(Yjk ) = µj for k = 1, . . . , K
(2) E(Yjk ) = µ + αj for k = 1, . . . , K
(3) E(Yjk ) = µ + αj for k = 1, . . . , K , under constraint α1 = 0
66/112
Model 1 - E(Yjk ) = µj for k = 1, . . . , K

Model (1) can be re-written as E(Y_i) = \sum_{j=1}^J x_{ij}\mu_j for i = 1, . . . , N , where the x_{ij} are
elements of the design matrix defined through:

• xij = 1 if response Yi corresponds to level j


• xij = 0 otherwise

This gives E(y) = Xβ where

   
\beta = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_J \end{bmatrix}
\qquad \text{and} \qquad
X = \begin{bmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \cdots & \vdots \\
0 & 0 & \cdots & 1
\end{bmatrix}

67/112
Model 1 - E(Yjk ) = µj for k = 1, . . . , K

The design matrix is given by


X <- cbind(c(rep(1,10), rep(0,20)),
           c(rep(0,10), rep(1,10), rep(0,10)),
           c(rep(0,20), rep(1,10)))

and the estimate β̂ is the vector of sample means for each group

y <- matrix(weight, ncol = 1)
b.hat <- solve(t(X) %*% X) %*% t(X) %*% y
b.hat

#       [,1]
# [1,] 5.032
# [2,] 4.661
# [3,] 5.526

The disadvantage of this simple formulation of the model is that it cannot be extended
to consider more than one factor.
68/112
Model 2 - E(Yjk ) = µ + αj for k = 1, . . . , K

In this model µ is an average effect for all levels and αj is an additional effect due to
level j. In this case we have:

  
\beta = \begin{bmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_J \end{bmatrix}
\qquad \text{and} \qquad
X = \begin{bmatrix}
1 & 1 & 0 & \cdots & 0 \\
1 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \cdots & \vdots \\
1 & 0 & 0 & \cdots & 1
\end{bmatrix} .

The design matrix has an additional column of elements equal to 1.


The first row (or column) of the (J + 1) × (J + 1) matrix X> X is the sum of the
remaining rows (or columns), therefore X> X is singular and there is no unique solution.
69/112
Model 2 - E(Yjk ) = µ + αj for k = 1, . . . , K

The general solution can be written as

\hat\beta = \begin{bmatrix} \hat\mu \\ \hat\alpha_1 \\ \vdots \\ \hat\alpha_J \end{bmatrix}
 = \frac{1}{K} \begin{bmatrix} 0 \\ Y_{1.} \\ \vdots \\ Y_{J.} \end{bmatrix}
 - \lambda \begin{bmatrix} -1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}

where λ is an arbitrary constant and Y_{j.} = \sum_{k=1}^{n_j} Y_{jk} .

Usually a sum-to-zero constraint is used, such that \sum_{j=1}^{J} \alpha_j = 0 , i.e.

\frac{1}{K} \sum_{j=1}^{J} Y_{j.} - J\lambda = 0 \iff \lambda = \frac{1}{JK} \sum_{j=1}^{J} Y_{j.} = \frac{Y_{..}}{N}

and therefore

\hat\mu = \frac{Y_{..}}{N} \qquad \text{and} \qquad \hat\alpha_j = \frac{Y_{j.}}{K} - \frac{Y_{..}}{N} , \qquad j = 1, \ldots, J
70/112
Model 2 - E(Yjk ) = µ + αj for k = 1, . . . , K

X <- cbind(rep(1, nrow(plant.dried)),
           c(rep(1,10), rep(0,20)),
           c(rep(0,10), rep(1,10), rep(0,10)),
           c(rep(0,20), rep(1,10)))

lambda <- mean(weight)
mu.hat <- lambda
mu.hat

# [1] 5.073

alpha.hat <- aggregate(weight, list(group), mean)$x - lambda
alpha.hat

# [1] -0.041 -0.412  0.453

71/112
Model 3 - E(Yjk ) = µ + αj for k = 1, . . . , K , under constraint α1 = 0

In this case µ represents the effect of the first level and αj measures the difference
between the first level and the j-th level of the factor.

This is called a corner point parametrization. We have

 
  1 0 0 ··· 0
µ 1 1 0 · · · 0
α2 
   
1 0 1 · · ·
 
β=
 .. 
 and X= 0 .
 .   .. .. .. .. 
. . . ··· .
αJ
1 0 0 ··· 1

72/112
Model 3 - E(Yjk ) = µ + αj for k = 1, . . . , K , under constraint α1 = 0

Now X> X is not singular, so there is a unique solution

\hat\beta = \frac{1}{K} \begin{bmatrix} Y_{1.} \\ Y_{2.} - Y_{1.} \\ \vdots \\ Y_{J.} - Y_{1.} \end{bmatrix}
X <- cbind(rep(1, nrow(plant.dried)),
           c(rep(0,10), rep(1,10), rep(0,10)),
           c(rep(0,20), rep(1,10)))

b.hat <- solve(t(X) %*% X) %*% t(X) %*% y
b.hat
#        [,1]
# [1,]  5.032
# [2,] -0.371
# [3,]  0.494
73/112
Model 3 - E(Yjk ) = µ + αj for k = 1, . . . , K , under constraint α1 = 0

In the analysis of the variance, it is important to compare the alternative hypothesis


(means for each level differ) with the null hypothesis (means are all equal)
For the null model, E(Yjk ) = µ and the design matrix is a column vector of elements
equal to 1, i.e. X> X = N and X> y = Y.. .

D_1 = \frac{1}{\sigma^2} \left( y^\top y - \hat\beta^\top X^\top y \right)

D_0 = \frac{1}{\sigma^2} \left( \sum_{j=1}^{J} \sum_{k=1}^{K} Y_{jk}^2 - \frac{Y_{..}^2}{N} \right)

and the F -statistic is

F = \frac{(D_0 - D_1)/(J - 1)}{D_1/(N - J)}
74/112
Model 3 - E(Yjk ) = µ + αj for k = 1, . . . , K , under constraint α1 = 0

D1 <- t(y) %*% y - t(b.hat) %*% t(X) %*% y

# Null model:
X0 <- matrix(rep(1, nrow(plant.dried)), ncol = 1)
b0.hat <- solve(t(X0) %*% X0) %*% t(X0) %*% y
D0 <- t(y) %*% y - t(b0.hat) %*% t(X0) %*% y

Fstat <- ((D0 - D1) / (ncol(X) - 1)) / (D1 / (nrow(X) - ncol(X)))
crit.val <- qf(0.95, df1 = ncol(X) - 1, df2 = nrow(X) - ncol(X))

if (Fstat > crit.val) {
  cat("There is enough evidence to reject H0")
} else {
  cat("There is NOT enough evidence to reject H0")
}
# There is enough evidence to reject H0
75/112
Model 3 - E(Yjk ) = µ + αj for k = 1, . . . , K , under constraint α1 = 0

Similarly, we could have obtained the same results with the lm() function
res.lm <- lm(weight ~ group)
summary(res.lm)

# Call:
# lm(formula = weight ~ group)
#
# Residuals:
#     Min      1Q  Median      3Q     Max
# -1.0710 -0.4180 -0.0060  0.2627  1.3690

76/112
Model 3 - E(Yjk ) = µ + αj for k = 1, . . . , K , under constraint α1 = 0

# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)   5.0320     0.1971  25.527   <2e-16 ***
# groupGroupA  -0.3710     0.2788  -1.331   0.1944
# groupGroupB   0.4940     0.2788   1.772   0.0877 .
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.6234 on 27 degrees of freedom
# Multiple R-squared: 0.2641,  Adjusted R-squared: 0.2096
# F-statistic: 4.846 on 2 and 27 DF,  p-value: 0.01591

Have a look at the two-factor analysis of variance, which is a simple extension of this
section (Dobson and Barnett, 2018, Section 6.4.2).

77/112
Chapter 3 - Linear Gaussian models
3.1 - Introduction
3.2 - Basic results
3.2.1 - Maximum likelihood estimation
3.2.2 - Least squares estimation
3.3 - Diagnostics

3.4 - Analysis of Variance (ANOVA)

3.5 - Analysis of Covariance (ANCOVA)

3.6 - General linear models


3.7 - Extensions
3.8 - Potential problems
78/112
Analysis of Covariance (ANCOVA)

The analysis of covariance is the term used for models in which some of the
explanatory variables are dummy variables representing factor levels and others are
continuous measurements called covariates.

We compare means of subgroups defined by factor levels, but we consider that the
covariates may also affect the responses, therefore we compare the means after
adjustment for covariate effects.

79/112
Example: Training methods

• The responses Yjk are achievement scores measured at three levels of a factor
• The factor represents three different training methods
• The covariates Xjk are aptitude scores measured before training commenced

We want to compare the training methods, taking into account differences in initial
aptitude between the three groups of subjects.

80/112
Example: Training methods

data("achievement")
attach(achievement)
plot(achievement)

library(ggplot2)  # Install if needed
ggplot(achievement, aes(x = x, y = y, colour = method)) +
  geom_point(aes(shape = method, color = method)) +
  labs(x = "Initial aptitude, x") +
  labs(y = "Achievement score, y")

81/112
Example: Training methods

[Scatter plot of achievement score y against initial aptitude x, with points coloured and shaped by training method (A, B, C).]

82/112
Example: Training methods

res.lm <- lm(y ~ method + x, data = achievement)
summary(res.lm)

# Call:
# lm(formula = y ~ method + x, data = achievement)
#
# Residuals:
#      Min       1Q   Median       3Q      Max
# -1.73878 -0.57959  0.07347  0.48980  1.00408

83/112
Example: Training methods

# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)   2.8367     0.4235   6.699 3.74e-06 ***
# methodB       2.1878     0.4545   4.814 0.000162 ***
# methodC       1.8612     0.4240   4.390 0.000400 ***
# x             0.7429     0.1421   5.227 6.84e-05 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.7785 on 17 degrees of freedom
# Multiple R-squared: 0.8386,  Adjusted R-squared: 0.8101
# F-statistic: 29.43 on 3 and 17 DF,  p-value: 5.889e-07

Note: to get the results of the F -test directly we can use the glm function.
res.glm <- glm(y ~ method + x, data = achievement)
summary(res.glm)
84/112
Chapter 3 - Linear Gaussian models
3.1 - Introduction
3.2 - Basic results
3.2.1 - Maximum likelihood estimation
3.2.2 - Least squares estimation
3.3 - Diagnostics

3.4 - Analysis of Variance (ANOVA)

3.5 - Analysis of Covariance (ANCOVA)

3.6 - General linear models


3.7 - Extensions
3.8 - Potential problems
85/112
General linear models

The term general linear models is used for Gaussian models with any combination of
categorical and continuous explanatory variables.

The factors can be

• crossed: there are observations for each combination of levels of the factors (see
two factors ANOVA)
• nested: the combinations of factors are different

86/112
Example - nested factors

Two-factor nested design: comparison of two drugs, A1 tested in hospitals B1 , B2 and B3
and A2 tested in hospitals B4 and B5 .

              Drug A1                      Drug A2
Hospitals     B1       B2       B3        B4       B5
Responses     Y111     Y121     Y131      Y241     Y251
              ...      ...      ...       ...      ...
              Y11n1    Y12n2    Y13n3     Y24n4    Y25n5

We want to compare the effects of the two drugs and possible differences among
hospitals using the same drug.

87/112
Example - nested factors

The saturated model is

E(Yjkl ) = µ + α1 + α2 + (αβ)11 + (αβ)12 + (αβ)13 + (αβ)24 + (αβ)25

under constraints

• α1 = 0
• (αβ)11 = 0
• (αβ)24 = 0

Hospitals 1, 2 and 3 can be only compared within drug A1 and hospitals 4 and 5 can
be only compared within drug A2 .

88/112
Example - nested factors

This model is not different from other Gaussian models

• Response variable is assumed to be normally distributed


• Response and explanatory variables are assumed to be linearly related
• The variance σ 2 is constant
• The responses are assumed to be independent

These assumptions must be checked by looking at the residuals.

89/112
Example - nested factors

If the assumption of normality is not plausible, a transformation can be used.

A commonly used transformation is the Box-Cox transformation:

y^* = \begin{cases} \dfrac{y^\lambda - 1}{\lambda}, & \lambda \neq 0 \\[4pt] \log y, & \lambda = 0 \end{cases} \tag{19}

• if λ = 1, y is unchanged (except for a location shift which does not influence
  inference)
• if λ = 1/2, the transformation is the square root
• if λ = −1, the transformation is the reciprocal
• if λ = 0, the transformation is the logarithm

The value of λ which produces the “most normal” distribution can be estimated by the
method of maximum likelihood.
90/112
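A brief sketch (simulated data) estimating λ by profile likelihood with boxcox() from the MASS package (shipped with R):

# Sketch: estimating the Box-Cox lambda by maximum likelihood
library(MASS)
set.seed(4)
x <- runif(100, 1, 5)
y <- exp(1 + 0.5 * x + rnorm(100, sd = 0.3))     # log(y) is linear in x

bc <- boxcox(y ~ x, lambda = seq(-1, 1, 0.05))   # plots the profile log-likelihood
bc$x[which.max(bc$y)]                            # lambda with highest likelihood (near 0 here)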
Chapter 3 - Linear Gaussian models
3.1 - Introduction
3.2 - Basic results
3.2.1 - Maximum likelihood estimation
3.2.2 - Least squares estimation
3.3 - Diagnostics

3.4 - Analysis of Variance (ANOVA)

3.5 - Analysis of Covariance (ANCOVA)

3.6 - General linear models


3.7 - Extensions
3.8 - Potential problems
91/112
Non-additive associations

The additive assumption means that the effect of changes in a predictor Xj on the
response Y is independent of the values of other predictors.
In many situations, there is a synergy effect, i.e. increasing the level of one covariate
may interact with the level of another. This is called interaction in statistics.
Example
Take
Y = β0 + β1 x1 + β2 x2 + β3 x1 x2 + ε

This means that

Y = β0 + (β1 + β3 x2 )x1 + β2 x2 + ε = β0 + β̃1 x1 + β2 x2 + ε

where β̃1 = β1 + β3 x2 , i.e. β̃1 changes with x2 and the effect of x1 on Y is no longer
constant.
92/112
Non-additive associations

Remark:

• The hierarchical principle states that if we include an interaction in a model, we


should also include the main effects, even if the p-values associated with their
coefficients are not significant.
• The concept of interactions applies to qualitative variables, to quantitative
variables or to a combination of both.

93/112
Non-additive associations
Example
Using the Credit data (ISLR package), the least squares lines are shown for
prediction of Balance from Income for students and non-students. The left model does
not include an interaction, the right model does.
[Two panels: least squares lines of Balance against Income for students (red) and non-students (black); left panel from the model without interaction, right panel from the model with interaction.]
94/112
Non-additive associations

lm <- lm(Balance ~ Income + Student, data = Credit)
plot(0, 0, xlim = c(0, 150), ylim = c(200, 1400),
     xlab = "Income", ylab = "Balance")
abline(a = lm$coefficients[1] + lm$coefficients[3],
       b = lm$coefficients[2],
       col = 2, lwd = 2)
abline(a = lm$coefficients[1], b = lm$coefficients[2], col = 1, lwd = 2)
legend("topleft", legend = c("student", "non-student"),
       col = c(2, 1), lwd = 2, cex = 0.9)

95/112
Non-additive associations

lm2 <- lm(Balance ~ Income * Student, data = Credit)
plot(0, 0, xlim = c(0, 150), ylim = c(200, 1400),
     xlab = "Income", ylab = "Balance")
abline(a = lm2$coefficients[1] + lm2$coefficients[3],
       b = lm2$coefficients[2] + lm2$coefficients[4],
       col = 2, lwd = 2)
abline(a = lm2$coefficients[1],
       b = lm2$coefficients[2], col = 1, lwd = 2)
legend("topleft", legend = c("student", "non-student"),
       col = c(2, 1), lwd = 2, cex = 0.9)

96/112
Non-linear associations

So far, we have considered only linear associations between X and Y, but in other
situations the relationship can be non-linear.

A non-linear association can be suggested by looking at the residuals.

A popular model is a U-shaped association, that can be modelled by a quadratic


association
E(Y_i) = \beta_0 + \beta_1 x_i + \beta_2 x_i^2

This is still a linear regression because the regression equation is still a linear
combination of the explanatory variables X and X 2 .

97/112
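A minimal sketch (simulated data) of fitting a quadratic association within the linear-model framework:

# Sketch: a U-shaped (quadratic) association is still linear in the parameters
set.seed(5)
x <- runif(100, -2, 2)
y <- 1 + 0.5 * x + 2 * x^2 + rnorm(100)

fit.quad <- lm(y ~ x + I(x^2))       # equivalent to lm(y ~ poly(x, 2, raw = TRUE))
summary(fit.quad)$coefficients
anova(lm(y ~ x), fit.quad)           # F-test: does the quadratic term improve the fit?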
Non-linear associations

Tips:

• center the explanatory variables
• scale the explanatory variables

x̃_i = \frac{x_i - \bar{x}}{sd(x)}
Scaling has several advantages

• numerical accuracy of matrix manipulation is improved, in particular in presence


of large values of the covariate
• the intercept β0 relates the average of y to the average of x, instead of the
average of y with x = 0 (which is sometimes an impossible value)
• the slope represents a one standard deviation change which is potentially more
meaningful than a one unit change (which can be very small or very large)
98/112
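A short sketch (simulated data) showing that centring and scaling a covariate changes the interpretation of the coefficients without changing the fitted values:

# Sketch: centring/scaling changes interpretation, not the fit
set.seed(6)
x <- rnorm(100, mean = 50, sd = 10)
y <- 2 + 0.3 * x + rnorm(100)

x.std <- (x - mean(x)) / sd(x)       # same as as.vector(scale(x))
fit.raw <- lm(y ~ x)
fit.std <- lm(y ~ x.std)

coef(fit.raw)                        # intercept = expected y at x = 0
coef(fit.std)                        # intercept ~ mean(y); slope = effect of one sd of x
max(abs(fitted(fit.raw) - fitted(fit.std)))   # ~ 0: identical fitted values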
Fractional polynomials

The quadratic function is symmetric, however there may be situations where the rate
of increase is faster than the rate of decrease.

A range of functions can be investigated through fractional polynomials

E(Y_i) = \beta_0 + \beta_1 x_i^p , \qquad p \neq 0

In this case, you can test several models (p = 1 is linear, p = 2 is quadratic, p = −2 is


reciprocal quadratic) and investigate the best fit. If p = 0, we use log(xi ).

Since the shape of the association can be modified by both the chosen power and the slope
parameter, a large number of potential non-linear associations can be investigated.
99/112
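A hand-rolled sketch (simulated data) comparing a few candidate powers by residual sum of squares; dedicated packages exist for fractional polynomials, but a simple grid search illustrates the idea:

# Sketch: comparing fractional-polynomial powers by RSS
set.seed(7)
x <- runif(200, 0.5, 5)
y <- 3 + 2 / x^2 + rnorm(200, sd = 0.3)     # true power is p = -2

powers <- c(-2, -1, -0.5, 0, 0.5, 1, 2)
rss <- sapply(powers, function(p) {
  xp <- if (p == 0) log(x) else x^p
  sum(residuals(lm(y ~ xp))^2)
})
data.frame(power = powers, RSS = rss)       # the smallest RSS should point to p = -2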
Fractional polynomials

100/112
Chapter 3 - Linear Gaussian models
3.1 - Introduction
3.2 - Basic results
3.2.1 - Maximum likelihood estimation
3.2.2 - Least squares estimation
3.3 - Diagnostics

3.4 - Analysis of Variance (ANOVA)

3.5 - Analysis of Covariance (ANCOVA)

3.6 - General linear models


3.7 - Extensions
3.8 - Potential problems
101/112
Correlation of the error terms

An important assumption of the linear model is that ε1 , . . . , εN are uncorrelated.

If there is correlation among the error terms, then the estimated standard errors of the
coefficients will tend to underestimate the true standard errors.

When does it happen? A classic situation is for time series (Observations at adjacent
time points are positively correlated).

We investigate the correlation of the errors by plotting the residuals w.r.t. time:
• If errors are uncorrelated, there should be no discernible pattern;
• If the error terms are positively correlated, we may see a trend, in particular for
adjacent residuals.
102/112
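A quick sketch (simulated data) of what positively correlated errors look like in a residual plot; the errors follow an AR(1) process with ρ = 0.9:

# Sketch: residuals over "time" when errors are AR(1)-correlated (rho = 0.9)
set.seed(8)
n <- 100
x <- rnorm(n)
eps <- as.vector(arima.sim(list(ar = 0.9), n = n))   # correlated errors
y <- 1 + 2 * x + eps

fit <- lm(y ~ x)
plot(residuals(fit), type = "l", xlab = "Observation", ylab = "Residual")
abline(h = 0, lty = 2)    # long runs above/below zero suggest positive correlation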
Correlation of the error terms

[Figure 2: Residuals from a linear regression fit to data generated with error correlation ρ = 0, 0.5 and 0.9 (one panel per value of ρ, residuals plotted against observation index).]
103/112
Non-constant Variance

Another important assumption of the linear model is that the error terms have
constant variance, Var(εi ) = σ 2 for every i.

The case of non-constant variance is called heteroscedasticity.

A possible solution is to use a concave transformation of the observations, like log Y or
√Y, which results in a greater amount of shrinkage of the larger responses.

Sometimes we have an idea of the variance of each response: for example, each
observation could be an average of ni raw observations, in which case the average has
variance σi2 = σ 2 /ni . A solution in this situation is to fit weighted least squares, with
weights proportional to wi = ni .

104/112
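A sketch (simulated data) of weighted least squares when each response is an average of ni observations, so Var(Yi) = σ²/ni and the natural weights are wi = ni:

# Sketch: weighted least squares with weights w_i = n_i
set.seed(9)
m <- 60
ni <- sample(1:20, m, replace = TRUE)            # group sizes behind each averaged response
x <- runif(m, 0, 10)
y <- 2 + 0.5 * x + rnorm(m, sd = 1 / sqrt(ni))   # Var(Y_i) = sigma^2 / n_i, with sigma = 1

fit.ols <- lm(y ~ x)                # ignores the unequal variances
fit.wls <- lm(y ~ x, weights = ni)  # weights proportional to n_i
summary(fit.ols)$coefficients
summary(fit.wls)$coefficients       # typically smaller standard errors for the slope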
Non-constant Variance

[Figure 3: Residuals against fitted values for the non-transformed response Y (left) and the log-transformed response log(Y) (right). The red line is a smooth fit to the residuals; the blue lines give the outer quantiles.]
105/112
Collinearity

Collinearity refers to the situation in which two or more predictors are closely related to
each other.

The presence of collinearity makes it difficult to separate out the individual effects of
collinear variables on the response. Moreover, a small change in the data can cause the
coefficient values to be estimated very differently. This means there is a great
uncertainty in the estimates in presence of collinearity.
library(ISLR)
data(Credit)
attach(Credit)

par(mfrow = c(1, 2))
plot(Limit, Age, xlab = "Limit", ylab = "Age", col = "red")
plot(Limit, Rating, xlab = "Limit", ylab = "Rating", col = "red")
106/112
Collinearity

[Scatter plots from the Credit data: Age against Limit (left) and Rating against Limit (right).]

107/112
Collinearity

A simple way to detect collinearity is to look at the correlation matrix of the predictors.
cor(cbind(Limit, Age, Rating))
#            Limit       Age    Rating
# Limit  1.0000000 0.1008879 0.9968797
# Age    0.1008879 1.0000000 0.1031650
# Rating 0.9968797 0.1031650 1.0000000

However, it is possible that collinearity exists among three or more variables even when
no pair of variables has high correlation. This situation is called multicollinearity.

108/112
Collinearity

A way to inspect multicollinearity is the variance inflation factor, i.e. the ratio of the
variance of β̂j when fitting the full model divided by the variance of β̂j if fit on its own.
VIF(\hat\beta_j) = \frac{1}{1 - R^2_{X_j \mid X_{-j}}} \ge 1 \tag{20}

where R^2_{X_j \mid X_{-j}} is the R 2 statistic from the regression of Xj on all the other predictors.

• If VIF(β̂j ) = 1 there is no collinearity


• If VIF(β̂j ) > 5 there is a problem

Solutions:

• drop one of the problematic variables


• combine the collinear variables into a single predictor (e.g. taking the average of
each pair of predictors)
109/112
Collinearity

In the Credit data set, suppose we regress Balance on Limit, Rating and Age. We
can then compute the VIF for each of the three predictors as follows:
lm.limit <- lm(Limit ~ Rating + Age)
1 / (1 - summary(lm.limit)$r.squared)    # VIF Limit
# [1] 160.5929

lm.rating <- lm(Rating ~ Limit + Age)
1 / (1 - summary(lm.rating)$r.squared)   # VIF Rating
# [1] 160.6683

lm.age <- lm(Age ~ Rating + Limit)
1 / (1 - summary(lm.age)$r.squared)      # VIF Age
# [1] 1.011385

110/112
Collinearity

Alternatively one could install the car package in R and directly use the vif() function.
library(car)

lm.full <- lm(Balance ~ Limit + Rating + Age)
vif(lm.full)
#      Limit     Rating        Age
# 160.592880 160.668301   1.011385

111/112
References

Sources and recommended reading:

1. A. J. Dobson & A. G. Barnett (2018). An Introduction to Generalised Linear Models, Chapman and Hall. Chapter 6.
2. G. James, D. Witten, T. Hastie & R. Tibshirani (2013). An Introduction to Statistical Learning with Applications in R, Springer. Chapter 3 (Section 3.5 excluded).

112/112
