
Chapter 2

Linear Regression

This chapter discusses linear regression from a predictive modelling perspective. Extensions
to penalized likelihood (ridge regression) and variable selection (LASSO) are also discussed.
The material in this chapter can be mostly found in Chapter 3 of Hastie et al. (2009) and
Chapter 1 of Wood (2017b).

2.1 Predictive Modelling

2.1.1 Setup
Given a set of inputs and outputs called a training set, prediction is the task of deriving a
function that maps new inputs to new outputs:

$$ \mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}, \qquad x_i \in \mathbb{R}^p, \; y_i \in \mathbb{R}, \qquad \hat{f}: \mathbb{R}^p \to \mathbb{R}, \qquad \hat{y}_{\text{new}} = \hat{f}(x_{\text{new}}). $$

The term predictive model is a colloquialism that usually describes either (a) a learning algorithm, a map from training sets to prediction rules,

$$ \mathcal{A}: \{(x_i, y_i)\}_{i=1}^{n} \mapsto \hat{f}, $$

or (b) a prediction rule, a map from inputs to outputs,

$$ \hat{f}: \mathbb{R}^p \to \mathbb{R}. $$

This setup is intuitive, but is not specific enough to actually execute. Any function that
maps a set of inputs and outputs to a function from inputs to outputs could be called a
“learning algorithm”, just like how in statistics any function from the sample space to the
parameter space could be called an “estimator”. A framework for evaluating the quality of a
learning algorithm is required.
Learning algorithms are evaluated through the quality of the predictions they produce: a learning algorithm is "good" if the predictions it produces are generally close to the corresponding outputs. To make this specific enough to implement, definitions of "generally" and "close" are required. We say that a prediction algorithm is good if it has small expected loss:

$$ \mathrm{E}\left[\ell\big(Y, \hat{f}(X)\big)\right], $$

where $\ell(y, \hat{y})$ is a loss function measuring how far a prediction $\hat{y}$ is from an output $y$.

If we replace “generally” by “on average” and “close” by “small loss”, we can implement
predictive modelling.

The most common choice of loss function is the square loss:

$$ \ell(y, \hat{y}) = (y - \hat{y})^2. $$
It is important to stress that we could use any loss function; the choice of square loss is convenient for development, but other loss functions are useful in other contexts. There are a number of motivations for and desirable properties of square loss:

(a) It is the negative log-likelihood in a normal model where we predict using the mean, bringing a connection with classical statistics;

(b) It admits a tractable minimizer over the class of all functions (see below);

(c) It approximates any loss function that is smooth and stationary.

On the last point, consider the Taylor expansion of a twice-differentiable loss function that is stationary at $\hat{y} = y$:

$$ \ell(y, \hat{y}) \approx \ell(y, y) + \frac{\partial \ell(y, \hat{y})}{\partial \hat{y}}\bigg|_{\hat{y} = y}(\hat{y} - y) + \frac{1}{2}\,\frac{\partial^2 \ell(y, \hat{y})}{\partial \hat{y}^2}\bigg|_{\hat{y} = y}(\hat{y} - y)^2 \;\propto\; (\hat{y} - y)^2, $$

since stationarity makes the first-order term vanish and $\ell(y, y)$ is a constant that does not affect minimization.
While none of these properties are unique or necessary, they are nice. We will use square
loss for the remainder of the course.

2.1.2 Square loss and the regression function


The choice of a loss function leads to a specific prediction function. We choose the prediction function that minimizes the expected loss:

$$ f^{*} = \operatorname*{arg\,min}_{f} \; \mathrm{E}\left[\big(Y - f(X)\big)^2\right]. $$

This expectation is taken with respect to the joint distribution of a single (input, output) pair. The minimization is over the space of all functions. Remarkably, with square loss it is tractable:

$$ f^{*}(x) = \mathrm{E}\left[\, Y \mid X = x \,\right], $$

the regression function.
The use of square loss implies that predictions of an output at a given input should be
set equal to the conditional expectation of the outputs at that input.
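To see why the conditional expectation is the minimizer, here is the standard argument (adding and subtracting $\mathrm{E}[Y \mid X]$ and using the tower property of conditional expectation):

\begin{align*}
\mathrm{E}\big[(Y - f(X))^2\big]
&= \mathrm{E}\big[\big(Y - \mathrm{E}[Y \mid X] + \mathrm{E}[Y \mid X] - f(X)\big)^2\big] \\
&= \mathrm{E}\big[\big(Y - \mathrm{E}[Y \mid X]\big)^2\big] + \mathrm{E}\big[\big(\mathrm{E}[Y \mid X] - f(X)\big)^2\big],
\end{align*}

because the cross term has expectation zero once we condition on $X$. The first term does not involve $f$, and the second term is non-negative and equals zero when $f(x) = \mathrm{E}[Y \mid X = x]$, so this choice minimizes the expected loss.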

2.1.3 Estimating the regression function


Having decided to use the regression function to predict an output given an input, the last
step is to estimate it: given a training set of data, we want to take our best guess at what
the regression function is. This involves:

(a) Choosing a model for the regression function, and

(b) Fitting the model to the training data.

The phrase “fitting the model to the data” is usually used to describe this whole process.
We now have some thinking to do. One option is to appeal to the law of large numbers/central limit theorem and estimate the conditional expectation by averaging the outputs observed at each input:

$$ \hat{f}(x) = \operatorname{ave}\left(\, y_i : x_i = x \,\right). $$

This has three glaring problems:

1. From classical statistics, we know that the standard error of a sample mean is inversely proportional to the square root of the number of observations being averaged. So we will need many repeated observations at each input for this to work.

2. What happens if a new input doesn't equal any previously-observed input? This model cannot extrapolate, and it cannot even predict at values that lie within the range of the training inputs but are not exactly equal to any of them.

3. There is no guarantee of smoothness/stability, i.e. that predictions at similar inputs will be similar. This sometimes matters and sometimes does not.

Another option is to expand the set of data used in the prediction at a given input to include any "nearby" inputs. This is what is done by the k-nearest neighbours prediction algorithm:

$$ \hat{f}(x) = \frac{1}{k}\sum_{x_i \in N_k(x)} y_i, $$

where $N_k(x)$ is the set of the $k$ training inputs closest to $x$.

The k-nearest neighbours approach to prediction is smoother and more efficient (in the
variance sense) than simple averaging and can be applied to any input value. It also makes
almost no assumptions on the form of the distribution of the training data, which sounds
like a purely good thing. However, it only uses a small subset of the available training data
to make each prediction. As the dimension of the inputs grows, predictions will become less
precise: the number of training data needed to make predictions as precise as those for a
single-dimensional input is exponential in the input dimension.
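To make this concrete, here is a minimal k-nearest neighbours sketch in Python with NumPy (an illustration only, not an implementation used in the course; the simulated data and the function name knn_predict are made up for the example):

```python
import numpy as np

def knn_predict(x_new, X_train, y_train, k=5):
    """Predict the output at x_new by averaging the outputs of the
    k training inputs closest to x_new in Euclidean distance."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # distance from x_new to every training input
    nearest = np.argsort(dists)[:k]                  # indices of the k nearest neighbours
    return y_train[nearest].mean()                   # average their outputs

# Illustrative usage on simulated one-dimensional data
rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(200, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.3, size=200)
print(knn_predict(np.array([1.0]), X_train, y_train, k=10))
```

Note that each prediction touches only $k$ of the $n$ training points, which is exactly the local behaviour discussed above.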
Another way to put it is that k-nearest neighbours is a local method: only a small
number of training data close to the input are used to predict the output. If we are willing
to make strong assumptions about the form of the regression function, global methods can
out-perform local ones, if the assumptions hold. Consider a simple polynomial model for
the regression function, based on a Taylor expansion at the sample mean of the inputs:

$$ f(x) \approx f(\bar{x}) + \sum_{j=1}^{p} \frac{\partial f}{\partial x_j}(\bar{x})\,(x_j - \bar{x}_j) + \frac{1}{2}\sum_{j=1}^{p}\sum_{k=1}^{p} \frac{\partial^2 f}{\partial x_j \partial x_k}(\bar{x})\,(x_j - \bar{x}_j)(x_k - \bar{x}_k) + \cdots $$

With one term, this might look more familiar:

$$ f(x) \approx \beta_0 + \sum_{j=1}^{p}\beta_j x_j = \beta_0 + x^{\mathsf{T}}\beta, $$

with $\beta_0 = f(\bar{x}) - \sum_j \frac{\partial f}{\partial x_j}(\bar{x})\,\bar{x}_j$ and $\beta_j = \frac{\partial f}{\partial x_j}(\bar{x})$. This is the context in which we view the familiar linear regression model.

2.2 Linear Regression via Least Squares

2.2.1 Derivation and properties


The linear regression model is the simplest predictive modelling technique:

$$ \hat{f}(x) = x^{\mathsf{T}}\beta, $$

a prediction rule in which the output is modelled as a linear combination of the inputs.

When the regression function is approximated well globally by a linear function, then
linear regression is an excellent method: easy to fit, provably lowest variance, and easy
to interpret. It also forms the basis of many of the more complicated models covered in
subsequent chapters.
Linear regression is a parametric model. Fitting a linear regression model to data is equivalent to estimating the regression parameter $\beta$, a fixed but unknown vector in $\mathbb{R}^p$: for a fixed input $x \in \mathbb{R}^p$, the fitted regression function is $\hat{f}(x) = x^{\mathsf{T}}\hat{\beta}$.

The method of least squares is used to estimate $\beta$. Write the training data as $(x_i, y_i)$, $i = 1, \ldots, n$, with $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$, and stack them as

$$ Y = (y_1, \ldots, y_n)^{\mathsf{T}} \in \mathbb{R}^n, \qquad X \in \mathbb{R}^{n \times p}, \quad X_{ij} = x_{ij}. $$

The least-squares estimate minimizes the total square loss on the training data:

$$ \hat{\beta} = \operatorname*{arg\,min}_{\beta}\; L(\beta; X, Y), \qquad L(\beta; X, Y) = \sum_{i=1}^{n}\big(y_i - x_i^{\mathsf{T}}\beta\big)^2 = \|Y - X\beta\|_2^2. $$

Setting the gradient to zero gives the normal equations:

$$ \frac{\partial L}{\partial \beta} = -2X^{\mathsf{T}}\big(Y - X\beta\big) = 0 \;\;\Longrightarrow\;\; X^{\mathsf{T}}Y = X^{\mathsf{T}}X\beta \;\;\Longrightarrow\;\; \hat{\beta} = \big(X^{\mathsf{T}}X\big)^{-1}X^{\mathsf{T}}Y. $$

Under a Gaussian model for the outputs, least squares is maximum likelihood:

$$ Y \sim \mathrm{N}\big(X\beta, \sigma^2 I_n\big), \qquad \ell(\beta, \sigma^2; Y) = \text{const} - \frac{1}{2\sigma^2}\,\|Y - X\beta\|_2^2, $$

so maximizing $\ell(\beta, \sigma^2; Y)$ over $\beta$ for fixed $\sigma^2$ is equivalent to minimizing $\|Y - X\beta\|_2^2 = L(\beta; X, Y)$.

Least squares also has a geometric interpretation: it models the mean output as the orthogonal projection of the observed outputs onto the span of the inputs:

$$ \hat{Y} = X\hat{\beta}, \qquad \hat{\beta} \text{ minimizes } \|Y - X\beta\|_2^2, $$

so $\|Y - X\hat{\beta}\|_2$ is the distance between $Y$ and

$$ \operatorname{span}(X) = \big\{ v \in \mathbb{R}^n : v = X\beta \text{ for some } \beta \in \mathbb{R}^p \big\} \subseteq \mathbb{R}^n, $$

and $\hat{Y}$ is the closest vector in $\operatorname{span}(X)$ to $Y$:

$$ \hat{Y} = \operatorname*{arg\,min}_{b \in \operatorname{span}(X)} \|Y - b\|_2^2. $$

This is interpreted as choosing the mean to be the closest vector to $Y$ in $\mathbb{R}^n$ that can be written as a linear combination of the $p$ $n$-dimensional vectors $x_1, \ldots, x_p$.

Prediction of new outputs also has a geometric flavour, as a projection into the span of
the inputs. This is simply because we use the mean output to predict:

(See above.)

We can write the whole vector of predictions as a linear combination of the outputs:

$$ \hat{Y} = X\hat{\beta} = X\big(X^{\mathsf{T}}X\big)^{-1}X^{\mathsf{T}}Y = HY, \qquad H = X\big(X^{\mathsf{T}}X\big)^{-1}X^{\mathsf{T}}. $$

The "hat matrix" $H$ is an orthogonal projection:

Definition. A matrix $H$ is an orthogonal projection if $H^2 = H$ and $H^{\mathsf{T}} = H$.

$$ H^2 = X\big(X^{\mathsf{T}}X\big)^{-1}X^{\mathsf{T}}X\big(X^{\mathsf{T}}X\big)^{-1}X^{\mathsf{T}} = X\big(X^{\mathsf{T}}X\big)^{-1}X^{\mathsf{T}} = H. $$

Properties:

The eigenvalues of $H$ are all $0$ or $1$: writing the eigendecomposition $H = U\Lambda U^{\mathsf{T}}$,

$$ H^2 = U\Lambda^2 U^{\mathsf{T}} = U\Lambda U^{\mathsf{T}} = H \;\;\Longrightarrow\;\; \Lambda^2 = \Lambda, $$

so if $\lambda$ is an eigenvalue of $H$ then $\lambda^2 = \lambda$, i.e. $\lambda \in \{0, 1\}$.

$\operatorname{rank}(H) = p$ when $X \in \mathbb{R}^{n \times p}$ has full column rank. Note that

$$ \operatorname{trace}(H) = \operatorname{trace}\!\big(X(X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}\big) = \operatorname{trace}\!\big((X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}X\big) = \operatorname{trace}(I_p) = p. $$
The geometric nature of the predictions in linear regression suggests that linear algebra is the right field to lean on for both computation and theory.
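These identities are easy to check numerically. The following sketch (simulated data; illustrative only) fits least squares via the normal equations and verifies that the hat matrix is a symmetric, idempotent projection with trace $p$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 4
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5, 0.0])
y = X @ beta_true + rng.normal(size=n)

# Least-squares estimate via the normal equations (X'X) beta = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix and fitted values
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

print(np.allclose(H, H.T))          # symmetric
print(np.allclose(H @ H, H))        # idempotent: H^2 = H
print(np.isclose(np.trace(H), p))   # trace(H) = p = rank(H)
print(np.allclose(y_hat, X @ beta_hat))
```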

2.2.2 Computation via Gram-Schmidt and QR decomposition


An understanding of computation for linear regression is important for understanding its
theoretical properties and limitations. In matrix form, the model is:

$$ Y = X\beta + \varepsilon. $$

The least-squares solution is:

$$ \hat{\beta} = \big(X^{\mathsf{T}}X\big)^{-1}X^{\mathsf{T}}Y. $$

With $p = 1$ and no intercept, this is written in terms of the inner product of the inputs and outputs and the norm of the inputs:

$$ \hat{\beta}_1 = \frac{\langle x_1, Y\rangle}{\|x_1\|^2} = \frac{x_1^{\mathsf{T}}Y}{x_1^{\mathsf{T}}x_1}. $$

Now add a new input $x_2$, and suppose it is orthogonal to the previous input:

$$ x_1^{\mathsf{T}}x_2 = 0. $$

Then the new least squares solution is:

$$ \hat{\beta}_1 = \frac{\langle x_1, Y\rangle}{\|x_1\|^2}, \qquad \hat{\beta}_2 = \frac{\langle x_2, Y\rangle}{\|x_2\|^2}. $$

If the second input is orthogonal to the first input, then the least squares estimate of the first regression coefficient doesn't change. Supposing now that all $p$ inputs are mutually orthogonal, we have:

$$ X^{\mathsf{T}}X = \operatorname{diag}\big(\|x_1\|^2, \ldots, \|x_p\|^2\big), $$

and hence the least squares estimates are:

$$ \hat{\beta}_j = \frac{\langle x_j, Y\rangle}{\|x_j\|^2}, \qquad j = 1, \ldots, p, $$

and the predictions are:

$$ \hat{Y} = X\hat{\beta} = \sum_{j=1}^{p} \frac{\langle x_j, Y\rangle}{\|x_j\|^2}\, x_j. $$

Of course, inputs will never be orthogonal outside of designed experiments (where they
are not random); in fact if their distribution is continuous then they are orthogonal with
probability zero. But, we can always orthogonalize them! The Gram-Schmidt procedure for
orthogonalization is as follows: set $z_1 = x_1$, and for $j = 2, \ldots, p$ regress $x_j$ on the previously-computed $z_1, \ldots, z_{j-1}$ and keep the residual,

$$ z_j = x_j - \sum_{\ell=1}^{j-1} \frac{\langle z_\ell, x_j\rangle}{\|z_\ell\|^2}\, z_\ell. $$

The resulting vectors $z_1, \ldots, z_p$ are mutually orthogonal and have the same span as $x_1, \ldots, x_p$.

By applying Gram-Schmidt to the columns of $X$ (and normalizing), we form the QR decomposition:

$$ X = QR, \qquad Q \in \mathbb{R}^{n \times p}, \; Q^{\mathsf{T}}Q = I_p, \qquad R \in \mathbb{R}^{p \times p} \text{ upper triangular}. $$

The QR decomposition of $X$ directly leads to estimates and predictions:

$$ \hat{\beta} = \big(X^{\mathsf{T}}X\big)^{-1}X^{\mathsf{T}}Y = \big(R^{\mathsf{T}}Q^{\mathsf{T}}QR\big)^{-1}R^{\mathsf{T}}Q^{\mathsf{T}}Y = R^{-1}Q^{\mathsf{T}}Y, \qquad \hat{Y} = X\hat{\beta} = QQ^{\mathsf{T}}Y, $$

where $R\hat{\beta} = Q^{\mathsf{T}}Y$ is solved by back-substitution because $R$ is triangular.
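A small sketch of least squares via the QR decomposition, using NumPy's built-in reduced QR rather than a hand-rolled Gram-Schmidt (simulated data; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=n)

# Reduced QR decomposition: X = QR, Q (n x p) with orthonormal columns, R (p x p) upper triangular
Q, R = np.linalg.qr(X)

# beta_hat solves R beta = Q'y; in practice back-substitution exploits the triangular structure
beta_hat = np.linalg.solve(R, Q.T @ y)
y_hat = Q @ (Q.T @ y)                 # predictions are the projection Q Q'y

# Agrees with the normal-equations solution
print(np.allclose(beta_hat, np.linalg.solve(X.T @ X, X.T @ y)))
print(np.allclose(y_hat, X @ beta_hat))
```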

The Gram-Schmidt procedure also has an interpretation within linear regression. If we consider regressing each input $x_j$ on the residuals $z_1, \ldots, z_{j-1}$ produced from the inputs that precede it, we end up with residuals,

$$ z_j = x_j - \sum_{\ell=1}^{j-1}\hat{\gamma}_{\ell j}\, z_\ell, $$

and hence estimates,

$$ \hat{\gamma}_{\ell j} = \frac{\langle z_\ell, x_j\rangle}{\|z_\ell\|^2}, \qquad \ell = 1, \ldots, j-1, $$

obtained by regressing $x_j$ on the preceding residuals. But this is exactly the Gram-Schmidt procedure; indeed, if we let $Z = (z_1, \ldots, z_p)$ and $\Gamma$ be the upper-triangular matrix with ones on its diagonal and entries $\hat{\gamma}_{\ell j}$ above it, then we have

$$ X = Z\Gamma, $$

which, after normalizing the columns of $Z$, is the QR decomposition of $X$.

2.2.3 Spectral analysis


Linear regression is fantastic when the regression function is (approximately) a linear function of a small to moderate number of input variables. To move towards more advanced predictive
modelling, it is instructive to ask when and why linear regression might not work. One case
relevant to predictive modelling, which we will see in Chapter 3, is the case of a large number
of highly correlated inputs. To understand the behaviour of linear regression in this scenario,
we consider a spectral analysis. The singular value decomposition of the input matrix X is:

$$ X = UDV^{\mathsf{T}}, \qquad U \in \mathbb{R}^{n \times p}, \; U^{\mathsf{T}}U = I_p, \qquad V \in \mathbb{R}^{p \times p}, \; V^{\mathsf{T}}V = VV^{\mathsf{T}} = I_p, $$

$$ D = \operatorname{diag}\big(d_1, \ldots, d_p\big), \qquad d_1 \ge d_2 \ge \cdots \ge d_p \ge 0, $$

where $d_1, \ldots, d_p$ are the singular values of $X$, and $d_j > 0$ for all $j$ if and only if $X$ has full column rank.

Note that

$$ X^{\mathsf{T}}X = VDU^{\mathsf{T}}UDV^{\mathsf{T}} = VD^2V^{\mathsf{T}}, $$

which is the eigendecomposition of $X^{\mathsf{T}}X$.

From this, we can write the predictions as:

$$ \hat{Y} = X\big(X^{\mathsf{T}}X\big)^{-1}X^{\mathsf{T}}Y = UDV^{\mathsf{T}}\big(VD^2V^{\mathsf{T}}\big)^{-1}VDU^{\mathsf{T}}Y = UDV^{\mathsf{T}}VD^{-2}V^{\mathsf{T}}VDU^{\mathsf{T}}Y = UDD^{-2}DU^{\mathsf{T}}Y = UU^{\mathsf{T}}Y. $$
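The following sketch (simulated, nearly collinear inputs; illustrative only) computes the SVD and the condition number, and checks that the least-squares predictions equal $UU^{\mathsf{T}}Y$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 3
X = rng.normal(size=(n, p))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=n)   # third column nearly a copy of the first
y = rng.normal(size=n)

# Thin SVD: X = U D V', U (n x p), V (p x p), D diagonal
U, d, Vt = np.linalg.svd(X, full_matrices=False)
print(d)                       # singular values; the smallest is near zero
print(d[0] / d[-1])            # condition number: large for nearly collinear inputs

# Predictions depend only on U: y_hat = U U' y
y_hat_svd = U @ (U.T @ y)
y_hat_ls = X @ np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(y_hat_svd, y_hat_ls))
```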

As we add more complicated inputs that are more highly correlated with each other, the
ratio of the largest singular value to the smallest (called the condition number) of X will
increase. Eventually, the smallest singular value will reach zero, indicating that two or more
inputs are linearly dependent, $X$ is not full rank, $X^{\mathsf{T}}X$ is not invertible, and $\hat{\beta}$ cannot be
(uniquely) determined. This sounds bad. However, predictions can still be made! So there
is no problem?
The problem comes when considering the uncertainty in the predictions. The formula for the point predictions assumed that $X^{\mathsf{T}}X$ could be inverted. We have:

$$ \hat{Y} = X\hat{\beta} = HY \quad \text{if } \big(X^{\mathsf{T}}X\big)^{-1} \text{ exists}, $$

$$ \operatorname{Cov}\big(\hat{Y}\big) = X\operatorname{Cov}\big(\hat{\beta}\big)X^{\mathsf{T}} = \sigma^2 X\big(X^{\mathsf{T}}X\big)^{-1}X^{\mathsf{T}} = \sigma^2 H, \qquad \operatorname{Cov}\big(\hat{\beta}\big) = \sigma^2\big(X^{\mathsf{T}}X\big)^{-1} = \sigma^2\, VD^{-2}V^{\mathsf{T}}. $$

As the smallest singular value $d_p \to 0$,

$$ \operatorname{Var}\big(a^{\mathsf{T}}\hat{\beta}\big) \to \infty \quad \text{for any } a \in \mathbb{R}^p \text{ with } a^{\mathsf{T}}v_p \neq 0, $$

and the uncertainty in predictions at new inputs, $\operatorname{Var}\big(x_{\text{new}}^{\mathsf{T}}\hat{\beta}\big)$, explodes.

Intuitively, it really shouldn’t be the case that adding in a single linearly dependent
predictor completely destroys a perfectly good predictive model. The problem is that all
inputs contribute equally to the predictions, precisely because of the lack of dependence
of the predictions on the singular values of X. We want directions in X space that are
(nearly) linearly dependent to have (nearly) no influence on the predictions. This can be accomplished by modifying the spectral decomposition of $X^{\mathsf{T}}X$ directly.

2.3 Ridge Regression


When we move towards more advanced methods for predictive modelling, a common strategy
will be to apply linear regression to a set of transformed inputs that is much richer than simply
using the inputs as observed. This practice is sometimes called “feature engineering”. Once
we start computing a richer set of inputs, we will inevitably have to deal with collinearity:
inputs that are linearly dependent, or nearly so.
We saw in Section 2.2.3 that the predictions only depend on the singular vectors of X,
but that their uncertainty increases without bound as any columns of X become closer to
linearly dependent. When two or more columns of X are linearly dependent, the variance
of the predictions is infinite. This is undesirable: if we have a prediction algorithm that
is working well, and we copy one of the inputs, intuitively it shouldn’t completely fail; it
should return the same predictions as if the input wasn’t copied! This is a property enjoyed
by ridge regression.

2.3.1 Spectral motivation


Consider the least-squares predictions:

$$ \hat{Y} = X\big(X^{\mathsf{T}}X\big)^{-1}X^{\mathsf{T}}Y = HY. $$

Ridge regression adds a small value to the diagonal of $X^{\mathsf{T}}X$:

$$ \hat{Y}_{\text{ridge}}(\lambda) = X\big(X^{\mathsf{T}}X + \lambda I_p\big)^{-1}X^{\mathsf{T}}Y = H(\lambda)\,Y, \qquad \lambda > 0. $$

Note that $X^{\mathsf{T}}X = VD^2V^{\mathsf{T}}$ (its eigendecomposition, obtained from the SVD of $X$), with $V^{\mathsf{T}}V = VV^{\mathsf{T}} = I_p$ and $D^2 = \operatorname{diag}(d_1^2, \ldots, d_p^2)$; $X^{\mathsf{T}}X$ is invertible if and only if $d_p > 0$. Also note that

$$ X^{\mathsf{T}}X + \lambda I_p = VD^2V^{\mathsf{T}} + \lambda VV^{\mathsf{T}} = V\big(D^2 + \lambda I_p\big)V^{\mathsf{T}}, $$

and because $D^2$ is diagonal, $D^2 + \lambda I_p = \operatorname{diag}\big(d_1^2 + \lambda, \ldots, d_p^2 + \lambda\big)$ with $d_p^2 + \lambda > 0$, so the inverse always exists when $\lambda > 0$.

The predictions now satisfy:

$$ \hat{Y}(\lambda) = X\big(X^{\mathsf{T}}X + \lambda I_p\big)^{-1}X^{\mathsf{T}}Y = UDV^{\mathsf{T}}\big(VD^2V^{\mathsf{T}} + \lambda I_p\big)^{-1}VDU^{\mathsf{T}}Y = UDV^{\mathsf{T}}V\big(D^2 + \lambda I_p\big)^{-1}V^{\mathsf{T}}VDU^{\mathsf{T}}Y $$

$$ = UD\big(D^2 + \lambda I_p\big)^{-1}DU^{\mathsf{T}}Y = \sum_{j=1}^{p}\frac{d_j^2}{d_j^2 + \lambda}\, u_j u_j^{\mathsf{T}}Y, \qquad U = (u_1, \ldots, u_p), \; u_j \in \mathbb{R}^n, $$

and in particular $\hat{Y}(0) = \sum_{j=1}^{p} u_j u_j^{\mathsf{T}}Y = UU^{\mathsf{T}}Y$, the predictions from linear regression.

The predictions now depend on the singular values of X. Specifically, the component of $Y$ in the direction of each $u_j$ is shrunk towards zero by the factor

$$ \frac{d_j^2}{d_j^2 + \lambda} \in [0, 1). $$
Inputs corresponding to large singular values have their influence on the predictions
reduced less, because this factor will be closer to 1. Inputs with small singular values have
smaller influence on the predictions. Inputs with singular values equal to zero have no
influence on the predictions.
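A sketch of ridge predictions computed directly from the SVD, making the shrinkage factors $d_j^2/(d_j^2 + \lambda)$ explicit and checking agreement with the matrix formula (simulated data; illustrative only):

```python
import numpy as np

def ridge_predict_svd(X, y, lam):
    """Ridge predictions via the SVD: y_hat = sum_j d_j^2/(d_j^2 + lam) * u_j u_j' y."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    shrink = d**2 / (d**2 + lam)          # one shrinkage factor per singular value
    return U @ (shrink * (U.T @ y))

rng = np.random.default_rng(4)
n, p = 100, 4
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.5, -1.0, 0.0]) + rng.normal(size=n)

# Agrees with the direct formula X (X'X + lam I)^{-1} X'y
lam = 2.0
direct = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(np.allclose(ridge_predict_svd(X, y, lam), direct))
```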

2.3.2 Penalized least squares


The original derivation of ridge regression was motivated along these practical, geometric
lines. A more statistical derivation allowed the properties of ridge regression estimates to be
analyzed, and led to a wide literature on these shrinkage methods.
Consider the following constrained minimization problem:

$$ \hat{\beta}_t = \operatorname*{arg\,min}_{\beta:\,\|\beta\|_2^2 \le t} \|Y - X\beta\|_2^2, \qquad t > 0. $$

This problem is another solution to collinearity: by simply forbidding the size of the
regression coefficients to become too large, we limit the influence of linearly dependent vari-
ables on the predictions. This problem always has a solution for any t > 0, even if X is low
rank. To see this, consider the Lagrangian corresponding to this objective:

$$ \hat{\beta}_\lambda = \operatorname*{arg\,min}_{\beta}\Big\{ \|Y - X\beta\|_2^2 + \lambda\|\beta\|_2^2 \Big\}, $$

where $\lambda$ and $t$ are (implicitly) related.

Here $\lambda > 0$ is referred to as a regularization or shrinkage parameter. Larger values indicate more shrinkage of the regression coefficients towards zero.
The solution can now be derived:

$$ L_\lambda(\beta) = \|Y - X\beta\|_2^2 + \lambda\|\beta\|_2^2, $$

$$ \frac{\partial}{\partial\beta}\Big\{\|Y - X\beta\|_2^2 + \lambda\|\beta\|_2^2\Big\} = -2X^{\mathsf{T}}\big(Y - X\beta\big) + 2\lambda\beta = 0 $$

$$ \Longrightarrow\; X^{\mathsf{T}}Y - X^{\mathsf{T}}X\beta - \lambda\beta = 0 \;\Longrightarrow\; \big(X^{\mathsf{T}}X + \lambda I_p\big)\beta = X^{\mathsf{T}}Y \;\Longrightarrow\; \hat{\beta}_\lambda = \big(X^{\mathsf{T}}X + \lambda I_p\big)^{-1}X^{\mathsf{T}}Y, $$

and hence the ridge regression prediction is

$$ \hat{Y}_\lambda = X\hat{\beta}_\lambda = X\big(X^{\mathsf{T}}X + \lambda I_p\big)^{-1}X^{\mathsf{T}}Y. $$

The solution to the constrained optimization problem is exactly the ridge regression solution from Section 2.3.1. Fitting a ridge regression model to data for fixed $\lambda > 0$ is trivial given the ability to fit the ordinary linear regression model (which is ridge regression with $\lambda = 0$). Estimating $\lambda$ from the data is necessary in practice; we will discuss this in Chapter 4.
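The following sketch illustrates the robustness property described at the start of this section: duplicating an input makes $X^{\mathsf{T}}X$ exactly singular, so the ordinary least-squares normal equations break down, while the ridge solution remains well defined and its predictions barely change (simulated data; illustrative only):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge coefficients (X'X + lam I)^{-1} X'y for fixed lam > 0."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(5)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -1.0, 2.0]) + rng.normal(size=n)

lam = 1.0
yhat = X @ ridge_fit(X, y, lam)

# Copy the last input: X'X is now exactly singular, but ridge is still well defined
X2 = np.column_stack([X, X[:, -1]])
yhat2 = X2 @ ridge_fit(X2, y, lam)

print(np.linalg.matrix_rank(X2.T @ X2))   # p, not p + 1
print(np.max(np.abs(yhat - yhat2)))       # predictions change only slightly
```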

2.4 Least Absolute Shrinkage and Selection Operator (LASSO)
Ridge regression can shrink the influence of an input arbitrarily close to zero. However, it
cannot shrink it to zero, and therefore cannot actually remove an input from the model. We
may wish to actually remove inputs if we desire a sparser or more parsimonious model, for
reasons of computation, interpretation, scientific plausibility, and so on.
One solution is to just remove inputs with influence below a given threshold. However,
in the case of a very large number of inputs most of which have nearly zero influence, this is
incredibly computationally wasteful. It is also not at all clear how to choose such a threshold.
Further, dropping one input will change the singular values of the remaining input matrix, so this procedure would have to be iterated until the singular values were all above the (arbitrary) threshold;
reducing the number of inputs is desirable.
The least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996) is a method that simultaneously performs shrinkage and variable selection. Incredibly, algorithms exist for computing the LASSO solution(s) that scale with the number of non-zero parameters, rather than the total number of parameters (Friedman et al., 2010). This makes the LASSO and related methods (e.g. the elastic net; Zou and Hastie, 2005) extremely efficient in high-dimensional, sparse prediction problems.

2.4.1 L1-penalties and variable selection


Consider the following modification of the ridge regression problem:

$$ \hat{\beta}_\lambda = \operatorname*{arg\,min}_{\beta}\Big\{ \|Y - X\beta\|_2^2 + \lambda\|\beta\|_1 \Big\}, \qquad \|\beta\|_1 = \sum_{j=1}^{p}|\beta_j|. $$

Throughout this section, for convenience, assume the data have been centred and scaled:

$$ \sum_{i=1}^{n} y_i = 0, \qquad \sum_{i=1}^{n} x_{ij} = 0, \qquad \sum_{i=1}^{n} x_{ij}^2 = 1, \qquad j = 1, \ldots, p. $$

Changing from the 2-norm to the 1-norm has surprising consequences: the latter can set
coefficients to zero exactly. We can illustrate this using the following infamous geometric
argument:

[Figure: contours of the least-squares objective in the $(\beta_1, \beta_2)$ plane, together with the ridge constraint region $\|\beta\|_2^2 \le t$ (a disc, whose smooth boundary typically cannot meet a contour exactly on an axis) and the LASSO constraint region $\|\beta\|_1 \le t$ (a diamond, whose corners lie on the axes, so the constrained solution can land exactly on an axis, i.e. with a coefficient equal to zero).]

The constant contours of the objective take larger values the farther they are from the centre (the unconstrained minimizer); the same is true of the constraint function. Where they first intersect is the solution to the constrained minimization problem.

2.4.2 Coordinate descent and sparsity


The LASSO problem no longer has a closed-form solution, because the penalty is not differentiable at zero. The penalty is still convex, however, and substantial research has been done on its efficient computation. Many algorithms have been proposed for computing the LASSO solution. Surprisingly, coordinate descent—one of the least efficient general convex optimization algorithms—has emerged as by far the best algorithm for computing LASSO solutions; see Friedman et al. (2010); Tay et al. (2023). The reason is that (a) the coordinate updates are available in closed form and depend on quantities which may be pre-computed, enabling extremely fast iterations depending only on primitive operations, and (b) it can be determined in advance of each iteration which coefficients will be unchanged, and in advance of the whole algorithm (for a given $\lambda$) which coefficients will be zero. This enables the algorithm to apply computation only to coefficients known to be non-zero and to change on a given iteration.
For a general convex optimization problem,

$$ \hat{x} = \operatorname*{arg\,min}_{x \in \mathbb{R}^p} f(x), \qquad f: \mathbb{R}^p \to \mathbb{R} \text{ convex}, $$

coordinate descent optimizes one coordinate of the objective at a time,

$$ \hat{x}_1 \leftarrow \operatorname*{arg\,min}_{x_1} f(x_1, \hat{x}_2, \ldots, \hat{x}_p), $$
$$ \hat{x}_2 \leftarrow \operatorname*{arg\,min}_{x_2} f(\hat{x}_1, x_2, \hat{x}_3, \ldots, \hat{x}_p), $$
$$ \vdots $$
$$ \hat{x}_j \leftarrow \operatorname*{arg\,min}_{x_j} f(\hat{x}_1, \ldots, \hat{x}_{j-1}, x_j, \hat{x}_{j+1}, \ldots, \hat{x}_p), $$
$$ \vdots $$
$$ \hat{x}_p \leftarrow \operatorname*{arg\,min}_{x_p} f(\hat{x}_1, \ldots, \hat{x}_{p-1}, x_p), $$

(an optional reference on numerical optimization is Nocedal and Wright) and cycles through the coordinates
until convergence is attained. It is generally an inefficient convex optimization algorithm,
requiring p iterations to minimize a p-dimensional quadratic (as opposed to 1 for, say, New-
ton’s method). However, it turns out that for the LASSO, the sparsity structure in the
problem is exploited well by a clever implementation of coordinate descent.
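To fix ideas, here is a minimal coordinate descent sketch for a smooth convex objective (a quadratic), before specializing to the LASSO; the function name and data are illustrative only:

```python
import numpy as np

def coordinate_descent_quadratic(A, b, n_iter=50):
    """Minimize f(x) = 0.5 x'Ax - b'x (A symmetric positive definite)
    by exact minimization over one coordinate at a time."""
    p = A.shape[0]
    x = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Holding the other coordinates fixed, the minimizer in x_j is
            # x_j = (b_j - sum_{k != j} A_{jk} x_k) / A_{jj}
            x[j] = (b[j] - A[j] @ x + A[j, j] * x[j]) / A[j, j]
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(coordinate_descent_quadratic(A, b))   # converges to the solution of Ax = b
print(np.linalg.solve(A, b))
```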

To start, consider the objective (we include a factor of $\tfrac{1}{2}$ on the least-squares term, which simply rescales $\lambda$ relative to Section 2.4.1):

$$ L_\lambda(\beta) = \frac{1}{2}\,\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1 = \frac{1}{2}\sum_{i=1}^{n}\big(y_i - x_i^{\mathsf{T}}\beta\big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|. $$

As a function of a single coordinate $\beta_j \in \mathbb{R}$, this objective is continuous, and differentiable for $\beta_j > 0$ and for $\beta_j < 0$, but not differentiable at $\beta_j = 0$. For $\beta_j \neq 0$, the partial derivative with respect to $\beta_j$ only is:

$$ \frac{\partial L_\lambda}{\partial \beta_j} = -\sum_{i=1}^{n} x_{ij}\big(y_i - x_i^{\mathsf{T}}\beta\big) + \begin{cases} \lambda, & \beta_j > 0, \\ -\lambda, & \beta_j < 0. \end{cases} $$

Setting this to zero, holding the other coordinates fixed at their current values $\hat{\beta}_\ell$, $\ell \neq j$, and using the standardization $\sum_{i} x_{ij}^2 = 1$, we have

$$ \hat{\beta}_j = \sum_{i=1}^{n} x_{ij}y_i - \sum_{\ell \neq j} x_j^{\mathsf{T}}x_\ell\,\hat{\beta}_\ell - \lambda \quad \text{if } \hat{\beta}_j > 0, \qquad \hat{\beta}_j = \sum_{i=1}^{n} x_{ij}y_i - \sum_{\ell \neq j} x_j^{\mathsf{T}}x_\ell\,\hat{\beta}_\ell + \lambda \quad \text{if } \hat{\beta}_j < 0. $$

We can express this whole operation as

$$ \hat{\beta}_j = S\Big( \sum_{i=1}^{n} x_{ij}y_i - \sum_{\ell \neq j} x_j^{\mathsf{T}}x_\ell\,\hat{\beta}_\ell,\ \lambda \Big), \qquad S(z, \lambda) = \operatorname{sign}(z)\,\big(|z| - \lambda\big)_{+}, $$

where $S$ is the soft-thresholding operator.
The dependence on the soft-thresholding operator is unique to the L1 penalty, and is why
the LASSO is able to send coefficients to zero: they get thresholded away. In fact, it can be
shown that:

$$ \sum_{i=1}^{n} x_{ij} y_i - \sum_{\ell \neq j} x_j^{\mathsf{T}}x_\ell\,\hat{\beta}_\ell = \sum_{i=1}^{n} x_{ij}\, r_i + \hat{\beta}_j, \qquad r_i = y_i - \hat{y}_i, \quad \hat{y}_i = x_i^{\mathsf{T}}\hat{\beta} \;\;\text{(current iteration)}, $$

and

$$ \sum_{i=1}^{n} x_{ij}\, r_i = x_j^{\mathsf{T}}Y - \sum_{k:\,\hat{\beta}_k \neq 0} x_j^{\mathsf{T}}x_k\,\hat{\beta}_k, \qquad x_j \in \mathbb{R}^n \text{ the } j\text{th column of } X, $$

so the coordinate update can be written

$$ \hat{\beta}_j \leftarrow S\Big( x_j^{\mathsf{T}}Y - \sum_{k \neq j:\,\hat{\beta}_k \neq 0} x_j^{\mathsf{T}}x_k\,\hat{\beta}_k,\ \lambda \Big), $$

where the inner products $x_j^{\mathsf{T}}Y$ and $x_j^{\mathsf{T}}x_k$ can be precomputed once, and the sum involves only the coefficients that are currently non-zero.

So, the software precomputes all the feature inner products, and then considers only variables for which $\hat{\beta}_j \neq 0$. Further, it can be shown that, starting from $\beta = 0$, a coefficient $\hat{\beta}_j$ can move away from zero only if $|x_j^{\mathsf{T}}Y| > \lambda$ (seen directly from the update above, with the other coefficients still at zero).

Hence, it is known whether a $\hat{\beta}_j$ will change at each iteration based on quantities computed in advance of the iteration, speeding up the algorithm.
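Putting the pieces together, here is a bare-bones coordinate descent sketch with soft thresholding and precomputed inner products (an illustration of the update above, not the glmnet implementation; it assumes the columns of X are centred and scaled to unit norm and y is centred, as in Section 2.4.1):

```python
import numpy as np

def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * (|z| - lam)_+ ."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for 0.5 * ||y - X b||^2 + lam * ||b||_1,
    assuming each column of X is scaled so that x_j' x_j = 1."""
    p = X.shape[1]
    xty = X.T @ y      # precomputed x_j' y
    xtx = X.T @ X      # precomputed x_j' x_k
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # x_j' y - sum_{k != j} x_j' x_k beta_k  (uses xtx[j, j] == 1)
            rho = xty[j] - xtx[j] @ beta + beta[j]
            beta[j] = soft_threshold(rho, lam)
    return beta

# Illustrative usage: centred output, centred and unit-norm columns
rng = np.random.default_rng(6)
n, p = 200, 10
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)
X = X / np.linalg.norm(X, axis=0)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)
y = y - y.mean()

print(lasso_cd(X, y, lam=1.5))   # many coefficients are set exactly to zero
```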

Finally, computing the solution path of the LASSO—a sequence of solutions as a function of $\lambda$—can be done efficiently. The software starts with:

$$ \lambda_{\max} = \max_{1 \le j \le p} \big|x_j^{\mathsf{T}}Y\big|; \qquad \text{note that if } \lambda \ge \lambda_{\max} \text{ then } \hat{\beta}(\lambda) = 0. $$

It then chooses $\lambda_{\min} = \varepsilon\,\lambda_{\max}$ for some small $\varepsilon$, e.g. $\lambda_{\min} = 0.0001\,\lambda_{\max}$.

A decreasing sequence of $\lambda$ values between $\lambda_{\max}$ and $\lambda_{\min}$ is then used. The computational trick is to use the coefficients from the previous value of $\lambda$ as a starting value for the next; in the optimization literature this is called warm starts, and it is a very popular technique in problems of this type. Moreover, it is known that $\hat{\beta}(\lambda)$ is a piecewise-linear function of $\lambda$:

Let

$$ \hat{\beta}(\lambda) = \operatorname*{arg\,min}_{\beta}\Big\{ \|Y - X\beta\|_2^2 + \lambda\|\beta\|_1 \Big\}. $$

As $\lambda$ decreases, more components of $\hat{\beta}(\lambda)$ become non-zero. Suppose we have $\lambda_0 > \lambda_1$ such that $\hat{\beta}(\lambda_0)$ and $\hat{\beta}(\lambda_1)$ have the same set of non-zero components. Then there exists $\gamma \in \mathbb{R}^p$ (depending on $\lambda_0$ and $\lambda_1$) such that for all $\lambda \in [\lambda_1, \lambda_0]$,

$$ \hat{\beta}(\lambda) = \hat{\beta}(\lambda_0) + (\lambda_0 - \lambda)\,\gamma $$

(see Hastie et al. 2009, Exercise 3.27 (c)). This explains why the warm starts work so well in this problem: the values of $\hat{\beta}(\lambda)$ and $\hat{\beta}(\lambda_0)$ will be close if $\lambda$ and $\lambda_0$ are close. Indeed, because $\hat{\beta}(\lambda)$ and $\hat{\beta}(\lambda_0)$ have the same non-zero components,

$$ \big\|\hat{\beta}(\lambda) - \hat{\beta}(\lambda_0)\big\| = |\lambda_0 - \lambda|\,\|\gamma\|, $$

which is small whenever $|\lambda_0 - \lambda|$ is small.

In practice, the glmnet (Friedman et al., 2010) implementation is extremely fast for very
high-dimensional, sparse regression problems.
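A sketch of the path computation with warm starts, reusing the coordinate update from the previous sketch (illustrative only; the real glmnet implementation adds screening rules and many other refinements, and this again assumes unit-norm, centred columns):

```python
import numpy as np

def lasso_path(X, y, n_lambda=100, eps=1e-4, n_iter=100):
    """LASSO solutions on a decreasing grid of lambda values, warm-starting
    each fit at the solution for the previous lambda."""
    p = X.shape[1]
    xty = X.T @ y
    xtx = X.T @ X
    lam_max = np.max(np.abs(xty))                        # smallest lambda at which beta = 0
    lambdas = lam_max * np.logspace(0, np.log10(eps), n_lambda)
    beta = np.zeros(p)                                   # warm start: the solution at lam_max
    path = []
    for lam in lambdas:
        for _ in range(n_iter):
            for j in range(p):
                rho = xty[j] - xtx[j] @ beta + beta[j]   # uses xtx[j, j] == 1
                beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0)
        path.append(beta.copy())
    return lambdas, np.array(path)

# Example (with X, y standardized as in the previous sketch):
# lambdas, path = lasso_path(X, y)
```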
