Lecture 4: Classification
Sandjai Bhulai
Vrije Universiteit Amsterdam
s.bhulai@vu.nl
15 September 2023
Linear models for classification
Classification with linear models
▪ Goal: take an input vector x and assign it to one of K discrete classes
$$y(\mathbf{x}) = f(\mathbf{w}^\top \phi(\mathbf{x})), \qquad f(a) = \begin{cases} +1, & a \ge 0 \\ -1, & a < 0 \end{cases}$$
▪ The perceptron criterion penalizes the set $\mathcal{M}$ of misclassified patterns:
$$E_P(\mathbf{w}) = -\sum_{n \in \mathcal{M}} \mathbf{w}^\top \phi_n t_n$$
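To make the thresholded discriminant and the perceptron criterion concrete, here is a minimal NumPy sketch (not from the slides; the bias-augmenting feature map `phi`, the learning rate `eta`, and the one-pattern-at-a-time update schedule are illustrative assumptions):

```python
import numpy as np

def phi(X):
    # Illustrative feature map: prepend a bias feature, phi(x) = (1, x)
    return np.hstack([np.ones((X.shape[0], 1)), X])

def perceptron_error(w, Phi, t):
    # E_P(w) = -sum over misclassified n of w^T phi_n t_n, with t_n in {-1, +1}
    margins = (Phi @ w) * t
    return -margins[margins < 0].sum()

def perceptron_train(X, t, eta=1.0, max_iter=1000):
    # t holds targets in {-1, +1}
    Phi = phi(X)
    w = np.zeros(Phi.shape[1])
    for _ in range(max_iter):
        preds = np.where(Phi @ w >= 0, 1, -1)
        mis = np.flatnonzero(preds != t)   # indices of misclassified patterns
        if mis.size == 0:
            break                          # all patterns classified correctly
        n = mis[0]
        w += eta * Phi[n] * t[n]           # w <- w + eta * phi_n * t_n
    return w

def predict(w, X):
    # y(x) = f(w^T phi(x)), f(a) = +1 if a >= 0 else -1
    return np.where(phi(X) @ w >= 0, 1, -1)
```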
▪ The posterior probability of class C1 can be written as a logistic sigmoid, $p(C_1 \mid \mathbf{x}) = \sigma(a)$, where we have defined
$$a = \ln \frac{p(\mathbf{x} \mid C_1)\, p(C_1)}{p(\mathbf{x} \mid C_2)\, p(C_2)}$$
▪ The inverse of σ is the logit function
$$a = \ln \left( \frac{\sigma}{1 - \sigma} \right)$$
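A quick numerical check that the logit really inverts the sigmoid; a minimal sketch (function names are mine, not from the slides):

```python
import numpy as np

def sigmoid(a):
    # sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def logit(sig):
    # a = ln(sigma / (1 - sigma)), the inverse of the sigmoid
    return np.log(sig / (1.0 - sig))

a = np.array([-2.0, 0.0, 3.0])
print(sigmoid(a))                         # [0.1192..., 0.5, 0.9525...]
assert np.allclose(logit(sigmoid(a)), a)  # logit undoes the sigmoid
```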
▪ Assume Gaussian class-conditional densities with a shared covariance matrix:
$$p(\mathbf{x} \mid C_k) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) \right\}$$
▪ The posterior then takes the form $p(C_1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + w_0)$, where
$$\mathbf{w} = \Sigma^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$
$$w_0 = -\frac{1}{2} \boldsymbol{\mu}_1^\top \Sigma^{-1} \boldsymbol{\mu}_1 + \frac{1}{2} \boldsymbol{\mu}_2^\top \Sigma^{-1} \boldsymbol{\mu}_2 + \ln \frac{p(C_1)}{p(C_2)}$$
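The two formulas translate directly into code. A minimal NumPy sketch (the names are mine; `prior1` stands for p(C1)):

```python
import numpy as np

def boundary_params(mu1, mu2, Sigma, prior1):
    # w = Sigma^{-1} (mu1 - mu2)
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    # w0 = -1/2 mu1^T Sigma^{-1} mu1 + 1/2 mu2^T Sigma^{-1} mu2 + ln(p(C1)/p(C2))
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(prior1 / (1.0 - prior1)))
    return w, w0

def posterior_C1(x, w, w0):
    # p(C1 | x) = sigma(w^T x + w0)
    return 1.0 / (1.0 + np.exp(-(w @ x + w0)))
```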
Continuous inputs
▪ The quadratic term from the Gaussian vanishes because the covariance matrix is shared, leaving a decision boundary that is linear in x. The priors p(Ck) enter only via the bias parameter w0.
▪ For the general case of K classes, we have
$$a_k(\mathbf{x}) = \mathbf{w}_k^\top \mathbf{x} + w_{k0}$$
where
$$\mathbf{w}_k = \Sigma^{-1} \boldsymbol{\mu}_k$$
$$w_{k0} = -\frac{1}{2} \boldsymbol{\mu}_k^\top \Sigma^{-1} \boldsymbol{\mu}_k + \ln p(C_k)$$
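Stacking the K discriminants gives a one-line classifier: assign x to the class with the largest $a_k(\mathbf{x})$. A minimal sketch (my naming; it assumes the rows of `mus` are the class means μk and `priors` holds the p(Ck)):

```python
import numpy as np

def discriminants(X, mus, Sigma, priors):
    # a_k(x) = w_k^T x + w_k0, with w_k = Sigma^{-1} mu_k and
    # w_k0 = -1/2 mu_k^T Sigma^{-1} mu_k + ln p(C_k)
    Sigma_inv = np.linalg.inv(Sigma)
    W = (Sigma_inv @ mus.T).T                    # shape (K, D), row k is w_k
    w0 = -0.5 * np.sum(mus * W, axis=1) + np.log(priors)
    return X @ W.T + w0                          # shape (N, K)

def classify(X, mus, Sigma, priors):
    # Pick the class with the largest discriminant for each input
    return np.argmax(discriminants(X, mus, Sigma, priors), axis=1)
```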
▪ Given a data set $\{\mathbf{x}_n, t_n\}$, $n = 1, \ldots, N$, with prior $p(C_1) = q$, the likelihood is
$$p(\mathbf{t}, \mathbf{X} \mid q, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \Sigma) = \prod_{n=1}^{N} \left[ q\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_1, \Sigma) \right]^{t_n} \left[ (1 - q)\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_2, \Sigma) \right]^{1 - t_n}$$
Maximum likelihood
▪ The terms of the log-likelihood that depend on q are
$$\sum_{n=1}^{N} \left\{ t_n \ln q + (1 - t_n) \ln(1 - q) \right\}$$
▪ Maximizing with respect to q gives
$$q = \frac{1}{N} \sum_{n=1}^{N} t_n = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2}$$
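The intermediate step (standard calculus, not shown on the slide): with $N_1 = \sum_n t_n$ and $N_2 = N - N_1$, setting the derivative with respect to q to zero gives
$$\frac{\partial}{\partial q} \sum_{n=1}^{N} \left\{ t_n \ln q + (1 - t_n) \ln(1 - q) \right\} = \frac{N_1}{q} - \frac{N_2}{1 - q} = 0 \quad\Longrightarrow\quad q = \frac{N_1}{N_1 + N_2}$$
so the maximum likelihood estimate of the prior is simply the fraction of training points in class C1.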
Maximum likelihood
▪ The terms of the log-likelihood that depend on μ1 are
$$\sum_{n=1}^{N} t_n \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_1, \Sigma) = -\frac{1}{2} \sum_{n=1}^{N} t_n (\mathbf{x}_n - \boldsymbol{\mu}_1)^\top \Sigma^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_1) + \text{const}$$
▪ Setting the derivative with respect to μ1 to zero gives
$$\boldsymbol{\mu}_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n \mathbf{x}_n$$
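Both maximum likelihood estimates are one-liners in NumPy. A minimal sketch (the function name is mine; the estimate for μ2 follows by symmetry and is included for completeness):

```python
import numpy as np

def fit_prior_and_means(X, t):
    # X: (N, D) inputs; t: (N,) targets in {0, 1}, with t_n = 1 for class C1
    N1 = t.sum()
    N2 = len(t) - N1
    q = N1 / (N1 + N2)                             # q = N1 / (N1 + N2)
    mu1 = (t[:, None] * X).sum(axis=0) / N1        # mu1 = (1/N1) sum_n t_n x_n
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2  # symmetric estimate for mu2
    return q, mu1, mu2
```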
Logistic regression
▪ Posterior probability of class C1 written as a logistic sigmoid acting on a linear function of the feature vector φ:
$$p(C_1 \mid \phi) = y(\phi) = \sigma(\mathbf{w}^\top \phi), \qquad \frac{d\sigma(a)}{da} = \sigma(a)(1 - \sigma(a))$$
▪ More compact than maximum likelihood fitting of Gaussians: for an M-dimensional feature space, logistic regression has M adjustable parameters, whereas the Gaussian model uses 2M parameters for the means and M(M + 1)/2 parameters for the shared covariance matrix.
▪ Maximum likelihood:
$$p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}$$
$$E(\mathbf{w}) = -\ln p(\mathbf{t} \mid \mathbf{w}) = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\}$$
$$\nabla E(\mathbf{w}) = \sum_{n=1}^{N} (y_n - t_n) \phi_n$$
▪ Therefore, we have the same form for the gradient as for the sum-of-squares error:
$$\nabla \ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \sum_{n=1}^{N} \left\{ t_n - \mathbf{w}^\top \phi(\mathbf{x}_n) \right\} \phi(\mathbf{x}_n)$$
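The cross-entropy error and its gradient translate directly into code; a minimal NumPy sketch (the `eps` guard against log(0) is my addition, not something from the slides):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, Phi, t, eps=1e-12):
    # Phi: (N, M) design matrix of features phi_n; t: (N,) targets in {0, 1}
    y = sigmoid(Phi @ w)
    # E(w) = -sum_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) }
    E = -np.sum(t * np.log(y + eps) + (1 - t) * np.log(1 - y + eps))
    grad = Phi.T @ (y - t)   # grad E(w) = sum_n (y_n - t_n) phi_n
    return E, grad
```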
Iterative reweighted least squares
▪ Efficient iterative optimization: the Newton-Raphson update $\mathbf{w}^{\text{new}} = \mathbf{w}^{\text{old}} - \mathbf{H}^{-1} \nabla E(\mathbf{w})$
▪ For the sum-of-squares error it reproduces the least-squares solution in a single step:
$$\mathbf{w}^{\text{new}} = (\Phi^\top \Phi)^{-1} \Phi^\top \mathbf{t}$$
▪ For the cross-entropy error it gives iterative reweighted least squares:
$$\mathbf{w}^{\text{new}} = (\Phi^\top R \Phi)^{-1} \Phi^\top R \mathbf{z}$$
where
$$\mathbf{z} = \Phi \mathbf{w}^{\text{old}} - R^{-1} (\mathbf{y} - \mathbf{t})$$
and R is the diagonal weighting matrix with elements $R_{nn} = y_n (1 - y_n)$.
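Putting the update in code, a minimal IRLS sketch (the small floor on the weights r is my safeguard against division by zero, not part of the algorithm as stated):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls(Phi, t, n_iter=20):
    # Phi: (N, M) design matrix; t: (N,) targets in {0, 1}
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        r = np.maximum(y * (1.0 - y), 1e-12)  # diagonal of R, floored for stability
        z = Phi @ w - (y - t) / r             # z = Phi w_old - R^{-1} (y - t)
        RPhi = Phi * r[:, None]               # R Phi without forming the N x N matrix
        # w_new = (Phi^T R Phi)^{-1} Phi^T R z
        w = np.linalg.solve(Phi.T @ RPhi, Phi.T @ (r * z))
    return w
```

Because R depends on w, the weighted least-squares problem is re-solved at every iteration, which is what the name "iterative reweighted least squares" refers to.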