
Advanced Machine Learning

Lecture 2: Linear models


Sandjai Bhulai
Vrije Universiteit Amsterdam

s.bhulai@vu.nl
8 September 2023
Linear models

Advanced Machine Learning


Polynomial curve fitting
▪ 10 points sampled from sin(2πx) + disturbance

\sin x = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \frac{x^9}{9!} - \cdots
Polynomial curve fitting
▪ Polynomial curve
y(x, w) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j

▪ Performance is measured by

E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2
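As a concrete illustration (a minimal NumPy sketch, not part of the slides; the noise scale and random seed are assumptions), the data can be generated from sin(2πx) plus Gaussian noise and an order-M polynomial fitted by minimizing E(w):

```python
import numpy as np

rng = np.random.default_rng(0)

# 10 points from sin(2*pi*x) plus Gaussian noise (noise scale 0.2 is an assumption)
N = 10
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

def fit_polynomial(x, t, M):
    """Minimize E(w) = 0.5 * sum_n (y(x_n, w) - t_n)^2 for an order-M polynomial."""
    Phi = np.vander(x, M + 1, increasing=True)    # columns 1, x, x^2, ..., x^M
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # least-squares minimizer of E(w)
    return w

def predict(x, w):
    return np.vander(x, len(w), increasing=True) @ w

w_star = fit_polynomial(x, t, M=3)
E = 0.5 * np.sum((predict(x, w_star) - t) ** 2)   # training error E(w*)
```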

Polynomial curve fitting: order 0

y(x, w) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j
Polynomial curve fitting: order 1

y(x, w) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j
Polynomial curve fitting: order 3

y(x, w) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j
Polynomial curve fitting: order 9

y(x, w) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j
Overfitting
▪ Root mean square (RMS) error: E_{\mathrm{RMS}} = \sqrt{2 E(w^*) / N}
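A short sketch of the overfitting diagnostic (assuming the same synthetic data generator as above and a 100-point test set): fit orders M = 0, …, 9 and compare training and test E_RMS.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(N, noise=0.2):          # synthetic sin(2*pi*x) data; noise scale is an assumption
    x = np.linspace(0.0, 1.0, N)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=noise, size=N)

def design(x, M):
    return np.vander(x, M + 1, increasing=True)

def rms_error(w, x, t):
    E = 0.5 * np.sum((design(x, len(w) - 1) @ w - t) ** 2)
    return np.sqrt(2 * E / len(t))    # E_RMS = sqrt(2 E(w*) / N)

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)

for M in range(10):
    w, *_ = np.linalg.lstsq(design(x_train, M), t_train, rcond=None)
    print(M, rms_error(w, x_train, t_train), rms_error(w, x_test, t_test))
```

Typically the training error keeps decreasing with M while the test error eventually grows again, which is the overfitting pattern shown in the plot.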

Overfitting

Effect of dataset size
▪ Polynomial of order 9 and N = 15

Effect of dataset size
▪ Polynomial of order 9 and N = 100

Regularization
▪ Penalize large coefficient values

\tilde{E}(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2 + \frac{\lambda}{2} \| w \|^2

▪ λ becomes an additional model parameter that controls the strength of the regularization
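A sketch of the regularized fit (an illustration, not the lecture's code; the augmented least-squares construction below is one of several equivalent ways to minimize Ẽ(w)):

```python
import numpy as np

def fit_regularized(x, t, M, lam):
    """Minimize 0.5*sum_n (y(x_n, w) - t_n)^2 + 0.5*lam*||w||^2."""
    Phi = np.vander(x, M + 1, increasing=True)
    A = np.vstack([Phi, np.sqrt(lam) * np.eye(M + 1)])   # sqrt(lam)*I rows encode the penalty term
    b = np.concatenate([t, np.zeros(M + 1)])
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

# With x, t as in the earlier sketch, the order-9 coefficients shrink as lambda grows:
# w_unreg = fit_regularized(x, t, M=9, lam=0.0)
# w_reg   = fit_regularized(x, t, M=9, lam=np.exp(-18))
```

Appending the rows √λ·I makes ‖Aw − b‖² equal to 2Ẽ(w), so ordinary least squares returns exactly the minimizer of Ẽ(w).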

Regularization
▪ Regularization with ln λ = −18

Regularization
▪ Regularization with ln λ = 0

Regularization
▪ E_RMS versus ln λ

Regularization

A deeper analysis

Advanced Machine Learning


What is the issue?

\sin x = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \frac{x^9}{9!} - \cdots
Linear basis function models
▪ General model is

y(x, w) = \sum_{j=0}^{M-1} w_j \varphi_j(x) = w^\top \varphi(x)

▪ The φ_j are known as basis functions


▪ Typically, φ_0(x) = 1, so that w_0 acts as a bias parameter
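A small sketch of the general model (my own illustration; the helper names are not from the slides): given any list of basis functions, build the design matrix with rows φ(x_n)^⊤ and evaluate y(x, w) = w^⊤φ(x).

```python
import numpy as np

def design_matrix(x, basis_funcs):
    """Row n holds phi(x_n)^T; basis_funcs[0] is typically the constant function (the bias)."""
    return np.column_stack([phi(x) for phi in basis_funcs])

def predict(x, w, basis_funcs):
    return design_matrix(x, basis_funcs) @ w   # y(x, w) = w^T phi(x) at every x

# Polynomial basis of order 3: phi_j(x) = x**j, with phi_0(x) = 1 acting as the bias
poly_basis = [lambda x, j=j: x ** j for j in range(4)]
```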

Linear basis function models
▪ General model is

y(x, w) = \sum_{j=0}^{M-1} w_j \varphi_j(x) = w^\top \varphi(x)

▪ Polynomial basis functions:


φ_j(x) = x^j

▪ These are global functions: a change in one region of the input affects the fit everywhere

Linear basis function models
▪ General model is

y(x, w) = \sum_{j=0}^{M-1} w_j \varphi_j(x) = w^\top \varphi(x)

▪ Gaussian basis functions:


\varphi_j(x) = \exp\left\{ -\frac{(x - \mu_j)^2}{2 s^2} \right\}

▪ These are local functions: each one responds only near its centre

> μ_j controls the location

> s controls the scale
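A sketch of Gaussian basis functions (the centres μ_j spread over [0, 1] and the common scale s = 0.1 are assumptions for the example); the resulting list can be fed to the design_matrix helper sketched earlier.

```python
import numpy as np

def gaussian_basis(mu, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)): local, responds only near mu_j."""
    return lambda x: np.exp(-((x - mu) ** 2) / (2 * s ** 2))

mus = np.linspace(0.0, 1.0, 9)                         # assumed centres
gauss_basis = [lambda x: np.ones_like(x)] + [gaussian_basis(mu, s=0.1) for mu in mus]
```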
Linear basis function models
▪ General model is

y(x, w) = \sum_{j=0}^{M-1} w_j \varphi_j(x) = w^\top \varphi(x)

▪ Sigmoidal basis functions:

\varphi_j(x) = \sigma\!\left( \frac{x - \mu_j}{s} \right), \qquad \text{where } \sigma(a) = \frac{1}{1 + \exp(-a)}
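The corresponding sketch for sigmoidal basis functions, again with assumed centres and scale:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))                    # logistic sigmoid sigma(a)

def sigmoidal_basis(mu, s):
    """phi_j(x) = sigma((x - mu_j) / s)."""
    return lambda x: sigmoid((x - mu) / s)

mus = np.linspace(0.0, 1.0, 9)                         # assumed centres
sigm_basis = [lambda x: np.ones_like(x)] + [sigmoidal_basis(mu, s=0.1) for mu in mus]
```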

Maximum likelihood
▪ Assume observations from a deterministic function with
added Gaussian noise:
t = y(x, w) + \epsilon, \quad \text{where } p(\epsilon \mid \beta) = \mathcal{N}(\epsilon \mid 0, \beta^{-1})

▪ Note that \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \right\}

\beta = 1/\sigma^2

\mathcal{N}(x \mid \mu, \sigma^2) > 0

\int_{-\infty}^{\infty} \mathcal{N}(x \mid \mu, \sigma^2) \, dx = 1
Maximum likelihood
▪ Assume observations from a deterministic function with
added Gaussian noise:
t = y(x, w) + \epsilon, \quad \text{where } p(\epsilon \mid \beta) = \mathcal{N}(\epsilon \mid 0, \beta^{-1})

▪ Note that \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \right\}

\mathbb{E}[x] = \int_{-\infty}^{\infty} x \, \mathcal{N}(x \mid \mu, \sigma^2) \, dx = \mu

\mathbb{E}[x^2] = \int_{-\infty}^{\infty} x^2 \, \mathcal{N}(x \mid \mu, \sigma^2) \, dx = \mu^2 + \sigma^2

\operatorname{var}[x] = \mathbb{E}[x^2] - \mathbb{E}[x]^2 = \sigma^2
Maximum likelihood
▪ Assume observations from a deterministic function with
added Gaussian noise:
t = y(x, w) + \epsilon, \quad \text{where } p(\epsilon \mid \beta) = \mathcal{N}(\epsilon \mid 0, \beta^{-1})

▪ This is the same as saying

p(t \mid x, w, \beta) = \mathcal{N}(t \mid y(x, w), \beta^{-1})

▪ Recall: y(x, w) = \sum_{j=0}^{M-1} w_j \varphi_j(x) = w^\top \varphi(x)

Maximum likelihood
▪ This is the same as saying

p(t \mid x, w, \beta) = \mathcal{N}(t \mid y(x, w), \beta^{-1})

▪ Given observed inputs X = \{x_1, \ldots, x_N\} and targets \mathbf{t} = [t_1, \ldots, t_N]^\top, we obtain the likelihood function:
p(\mathbf{t} \mid X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid w^\top \varphi(x_n), \beta^{-1})
Maximum likelihood
▪ Taking the logarithm, we get
\ln p(\mathbf{t} \mid w, \beta) = \ln \prod_{n=1}^{N} \mathcal{N}(t_n \mid w^\top \varphi(x_n), \beta^{-1}) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_D(w)

where E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^\top \varphi(x_n) \}^2

▪ Recall: \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \right\}
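Since the first two terms do not depend on w, maximizing the log-likelihood with respect to w is equivalent to minimizing the sum-of-squares error E_D(w). Maximizing with respect to β afterwards gives the noise precision

\frac{1}{\beta_{\mathrm{ML}}} = \frac{1}{N} \sum_{n=1}^{N} \{ t_n - w_{\mathrm{ML}}^\top \varphi(x_n) \}^2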
Maximum likelihood
▪ Computing the gradient and setting it to zero yields
\nabla_w \ln p(\mathbf{t} \mid w, \beta) = \beta \sum_{n=1}^{N} \{ t_n - w^\top \varphi(x_n) \} \varphi(x_n)^\top = 0

▪ Solving for w, we get

w_{\mathrm{ML}} = (\Phi^\top \Phi)^{-1} \Phi^\top \mathbf{t}

where (\Phi^\top \Phi)^{-1} \Phi^\top is the Moore-Penrose pseudo-inverse of \Phi, with

\Phi = \begin{pmatrix} \varphi_0(x_1) & \varphi_1(x_1) & \cdots & \varphi_{M-1}(x_1) \\ \varphi_0(x_2) & \varphi_1(x_2) & \cdots & \varphi_{M-1}(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \varphi_0(x_N) & \varphi_1(x_N) & \cdots & \varphi_{M-1}(x_N) \end{pmatrix}
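A direct sketch of this solution (illustrative only; the design matrix could come from any of the basis functions above):

```python
import numpy as np

def maximum_likelihood_weights(Phi, t):
    """w_ML = (Phi^T Phi)^{-1} Phi^T t, computed via the Moore-Penrose pseudo-inverse."""
    return np.linalg.pinv(Phi) @ t

# Numerically it is usually preferable to avoid forming (Phi^T Phi)^{-1} explicitly:
# w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
```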

Interpretation
▪ Consider y = Φw_ML = [φ_1, …, φ_M] w_ML

▪ y ∈ 𝒮 ⊆ 𝒯, where 𝒯 is the N-dimensional space of target vectors and 𝒮 is an M-dimensional subspace

▪ 𝒮 is spanned by φ_1, …, φ_M
▪ w_ML minimizes the distance between t and y; at the minimum, y is the orthogonal projection of t onto 𝒮
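A quick numerical check of this geometric picture (a sketch; Phi and t are assumed to come from one of the earlier examples): at w_ML the residual t − Φw_ML is orthogonal to every column φ_j, which is exactly the statement that y = Φw_ML is the orthogonal projection of t onto 𝒮.

```python
import numpy as np

def residual_orthogonality(Phi, t):
    w_ml = np.linalg.pinv(Phi) @ t
    y = Phi @ w_ml                      # projection of t onto the column space of Phi
    return Phi.T @ (t - y)              # numerically zero in every component
```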

Regularization
▪ Consider the error function

E_D(w) + \lambda E_W(w)


data term + regularization term

▪ With the sum-of-squares error function and a quadratic regularizer, we get

\frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^\top \varphi(x_n) \}^2 + \frac{\lambda}{2} w^\top w

▪ This is minimized by w = (\lambda I + \Phi^\top \Phi)^{-1} \Phi^\top \mathbf{t}
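A sketch of this closed-form solution (np.linalg.solve is an implementation choice, not prescribed by the slides):

```python
import numpy as np

def ridge_weights(Phi, t, lam):
    """w = (lam*I + Phi^T Phi)^{-1} Phi^T t: minimizer of the quadratically regularized error."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```

Adding λI also improves the conditioning of the system when Φ^⊤Φ is close to singular.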

Regularization
▪ With a more general regularizer, we have

\frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^\top \varphi(x_n) \}^2 + \frac{\lambda}{2} \sum_{j=1}^{M} | w_j |^q

(q = 1: lasso; q = 2: quadratic regularizer)

Regularization
▪ Lasso tends to generate sparser solutions than a quadratic
regularizer
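A sketch that illustrates the sparsity effect (scikit-learn's Lasso and Ridge are used for convenience; the data, the regularization strength alpha, and the feature count are placeholders): with comparable regularization, many lasso coefficients are driven exactly to zero, while ridge coefficients are only shrunk.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]                # only a few features actually matter
y = X @ w_true + rng.normal(scale=0.1, size=50)

lasso = Lasso(alpha=0.1).fit(X, y)           # q = 1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)           # q = 2 penalty
print("zero coefficients (lasso):", np.sum(lasso.coef_ == 0))
print("zero coefficients (ridge):", np.sum(ridge.coef_ == 0))
```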
