09 Regularization


Regularization

Regularization consists in adding a penalty term to the loss function:

V (θ) = J(θ) + λ g(θ)


where λ > 0 is a tuning parameter. The solution of the learning-from-data problem becomes

θ̂ = arg min_{θ ∈ M_p(θ)} V(θ).

This technique is used in both regression and classification problems. One of the most common forms is quadratic regularization:

V(θ) = J(θ) + λ ‖θ‖₂² = J(θ) + λ θᵀθ.

Why regularization?
- To counteract the effects of overfitting.
- To make the optimization problem better conditioned.

Note: the tuning parameter λ must be determined separately, just as the model complexity is.
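As a minimal sketch of this criterion (Python/NumPy; the data-fit loss J and the toy data below are hypothetical, not from the slides):

```python
import numpy as np

# Hypothetical data-fit loss: sum of squared residuals on toy data
rng = np.random.default_rng(0)
Phi = rng.normal(size=(30, 3))           # toy design matrix
Y = Phi @ np.array([1.0, -2.0, 0.5])     # noiseless toy targets

def J(theta):
    """Data-fit part of the loss."""
    r = Y - Phi @ theta
    return float(r @ r)

def V(theta, lam):
    """Quadratically regularized loss V(θ) = J(θ) + λ‖θ‖₂²."""
    return J(theta) + lam * float(theta @ theta)

print(V(np.zeros(3), lam=0.1))  # at θ = 0 the penalty vanishes: only the data-fit term remains
```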

Regularized least squares: the Ridge regression


Consider the static model

y(t) = ϕᵀ(t) θ + ε(t),   t = 1, 2, …, N

and the associated quadratically regularized LS problem

min V(θ),   V(θ) = Σ_{t=1}^{N} ε²(t) + λ θᵀθ = ‖Y − Φ θ‖² + λ ‖θ‖₂²

This estimation method is called Ridge regression. By setting the gradient of V(θ) to zero it is easy to find the normal equations

(ΦᵀΦ + λ I_p) θ = Φᵀ Y

where I_p is the p × p identity matrix and p is the complexity of the model. The regularized LS estimator is thus given by

θ̂ = (ΦᵀΦ + λ I_p)⁻¹ Φᵀ Y
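A minimal NumPy sketch of this estimator (the data below are hypothetical; solving the normal equations directly rather than forming the explicit inverse is a standard numerical choice):

```python
import numpy as np

def ridge_estimate(Phi, Y, lam):
    """Solve the regularized normal equations (ΦᵀΦ + λ I_p) θ = Φᵀ Y."""
    p = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ Y)

# Toy usage (all values hypothetical)
rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 5))
theta_true = np.array([1.0, -0.5, 0.0, 2.0, 0.3])
Y = Phi @ theta_true + 0.1 * rng.normal(size=100)
theta_hat = ridge_estimate(Phi, Y, lam=0.1)
```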


The penalty term λ ‖θ‖₂² is also called a shrinkage penalty in statistics because it has the effect of shrinking the estimated parameters towards zero.

Assume now the existence of the true model

y(t) = ϕᵀ(t) θ* + w(t),   t = 1, 2, …, N

By analyzing the statistical properties of the Ridge estimator we get

E[θ̂] = (ΦᵀΦ + λ I_p)⁻¹ ΦᵀΦ θ* = θ* − λ (ΦᵀΦ + λ I_p)⁻¹ θ* ≠ θ*

cov(θ̂) = σ_w² (ΦᵀΦ + λ I_p)⁻¹ ΦᵀΦ (ΦᵀΦ + λ I_p)⁻¹ < σ_w² (ΦᵀΦ)⁻¹

The Ridge estimator θ̂ is biased and cov(θ̂) < cov(θ̂_LS).


- As λ increases, the shrinkage of the estimated coefficients leads to a reduction in the variance of the estimates at the expense of an increase in bias =⇒ λ is a hyperparameter that can be used to manage the bias-variance tradeoff.
- If ΦᵀΦ is ill-conditioned (or even rank deficient), the use of λ leads to the better-conditioned matrix ΦᵀΦ + λ I_p (a small conditioning sketch follows below).
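A small sketch of the conditioning effect (toy data, assumed setup): two nearly collinear regressors make ΦᵀΦ almost singular, while ΦᵀΦ + λ I_p stays well behaved.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
# Second regressor is an almost exact copy of the first -> near rank deficiency
Phi = np.column_stack([x, x + 1e-6 * rng.normal(size=200)])

A = Phi.T @ Phi
lam = 0.1
print(np.linalg.cond(A))                    # enormous condition number
print(np.linalg.cond(A + lam * np.eye(2)))  # far smaller after adding λ I_p
```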

Example: polynomial fitting of the model y(t) = sin u(t) + w(t).


[Figure: four panels comparing fits on the validation set, each showing the true function, the validation data, and the estimated curve: (a) LS, 3rd-order polynomial; (b) LS, 12th-order polynomial; (c) Ridge, 12th-order polynomial, λ = 0.1; (d) Ridge, 12th-order polynomial, λ = 3.]
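A sketch reproducing the flavour of this experiment (the exact data-generating settings of the slides are not given, so the input range, noise level, and λ values below are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
N, order = 100, 12
u = np.sort(rng.uniform(-3.0, 3.0, size=N))
y = np.sin(u) + 0.1 * rng.normal(size=N)   # y(t) = sin u(t) + w(t)

Phi = np.vander(u, order + 1)              # polynomial regressors u^12, ..., u, 1

def ridge(Phi, y, lam):
    p = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)

theta_ls = ridge(Phi, y, 0.0)     # plain LS: the high-order fit chases the noise
theta_ridge = ridge(Phi, y, 3.0)  # shrunken coefficients give a smoother fit
```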


Regularization can also be exploited for the identification of dynamic models and for classification problems. In general:

- If the model complexity p is high (many parameters), it may not be possible to estimate several of them accurately =⇒ it is advantageous to pull them towards zero, since the parameters having the smallest influence on J(θ) are affected most by the shrinkage =⇒ regularization allows complex models to be trained on small data sets without severe overfitting.
- The problem of minimizing J(θ) may be ill-conditioned, especially when the complexity p is high, in the sense that the Hessian J″(θ) may be ill-conditioned =⇒ adding the norm penalty adds λ I_p to this matrix, so that it becomes better conditioned.
- The choice of λ is a crucial issue, since λ acts as a knob that controls the bias-variance tradeoff (the larger λ, the larger the number of parameters that will be close to zero). The best approach is to choose the value of λ leading to the smallest value of the loss function evaluated on the validation set (a sketch of this selection follows below).
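A sketch of this selection rule (toy data; the grid of candidate λ values and the 75/25 split are assumptions):

```python
import numpy as np

def ridge(Phi, y, lam):
    p = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)

rng = np.random.default_rng(3)
Phi = rng.normal(size=(200, 10))
y = Phi @ rng.normal(size=10) + 0.5 * rng.normal(size=200)

# Split into estimation and validation sets
Phi_tr, y_tr = Phi[:150], y[:150]
Phi_val, y_val = Phi[150:], y[150:]

lambdas = np.logspace(-4, 2, 25)
val_loss = [float(np.sum((y_val - Phi_val @ ridge(Phi_tr, y_tr, lam)) ** 2))
            for lam in lambdas]
best_lam = lambdas[int(np.argmin(val_loss))]  # λ with the smallest validation loss
```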


Alternative formulation for Ridge regression

It is possible to show that the problem

arg min_{θ ∈ M_p(θ)} V(θ),   V(θ) = J(θ) + λ θᵀθ

is equivalent to the problem

arg min_{θ ∈ M_p(θ)} J(θ),   subject to θᵀθ ≤ K

- For every λ there is some K such that the solutions θ̂ of the two optimization problems are the same. The two approaches can be related using Lagrange multipliers (a sketch follows below).
- The parameter K can be seen as a budget for how large the norm of θ can be.
- Note that K plays the role of 1/λ. In fact, decreasing K has the effect of shrinking the estimated parameters towards zero.
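A brief sketch of the connection via the Lagrangian (a standard argument, assuming the usual KKT conditions apply):

```latex
% Lagrangian of the constrained problem  min J(θ)  s.t.  θᵀθ ≤ K:
\mathcal{L}(\theta, \lambda) = J(\theta) + \lambda \left( \theta^{T}\theta - K \right),
    \qquad \lambda \ge 0.
% For fixed λ, minimizing over θ is the penalized problem
% min J(θ) + λ θᵀθ, since the term −λK is constant in θ.
% Complementary slackness, λ (θ̂ᵀθ̂ − K) = 0, ties the multiplier λ
% to the budget K: a tighter budget (smaller K) corresponds to a larger λ.
```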
