
Advanced Machine Learning

Lecture 4: Classification
Sandjai Bhulai
Vrije Universiteit Amsterdam

s.bhulai@vu.nl
15 September 2023
Linear models for classification

Advanced Machine Learning
Classification with linear models
▪ Goal: take an input vector x and map it onto one of K discrete classes
▪ Consider linear models: the classes can be separated by (D − 1)-dimensional hyperplanes in the D-dimensional input space
▪ Simplest linear regression model: y(x) = w⊤x + w0
▪ Use an activation function f( ⋅ ) to map the output onto discrete classes: y(x) = f(w⊤x + w0)
▪ Due to f( ⋅ ), these models are no longer linear in the parameters

Discriminant functions
▪ The simplest case is the 2-class case: y(x) = w⊤x + w0, where w is a weight vector and w0 is the bias
▪ The decision boundary is given by y(x) = 0
▪ Consider 2 points xa and xb that lie on the decision surface. Because y(xa) = y(xb) = 0, we have w⊤(xa − xb) = 0
▪ Thus, the vector w is orthogonal to every vector lying within the decision surface
▪ If x is on the decision surface, then y(x) = 0, indicating that

  w⊤x / ∥w∥ = −w0 / ∥w∥
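
A minimal NumPy sketch of these geometric facts (the weight vector w, bias w0, and test points are made-up illustrative values): w is orthogonal to any vector lying in the decision surface, and y(x)/∥w∥ gives the signed distance of a point from that surface.

```python
import numpy as np

# Hypothetical 2-D example: w and w0 are chosen arbitrarily for illustration.
w = np.array([2.0, 1.0])
w0 = -3.0

def y(x):
    """Linear discriminant y(x) = w^T x + w0."""
    return w @ x + w0

# Two points on the decision surface y(x) = 0 (solve 2*x1 + x2 - 3 = 0).
xa = np.array([0.0, 3.0])
xb = np.array([1.5, 0.0])
print(y(xa), y(xb))              # both 0: xa and xb lie on the surface
print(w @ (xa - xb))             # 0: w is orthogonal to xa - xb

# Signed distance of an arbitrary point from the decision surface.
x = np.array([2.0, 2.0])
print(y(x) / np.linalg.norm(w))  # = w^T x / ||w||  +  w0 / ||w||
```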



Geometry of linear discriminants

▪ Decision surface is perpendicular to w


▪ Displacement is controlled by w0



Multiple classes
▪ It is generally not a good idea to use multiple 2-class classifiers to do K-class classification
▪ This leads to ambiguous regions

Single K-class classifier
▪ Use a single discriminant comprising K linear functions of the form yk(x) = w⊤k x + wk0
▪ A point x belongs to class Ck if yk(x) > yj(x) for all j ≠ k (see the sketch below)
▪ The decision boundary between Ck and Cj is given by yk(x) = yj(x) and corresponds to a (D − 1)-dimensional hyperplane

  (wk − wj)⊤x + (wk0 − wj0) = 0

▪ Each decision region is singly connected and convex (due to the linearity of the discriminant functions)
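
A small sketch of this decision rule, assuming made-up weight vectors and biases for K = 3 classes: the predicted class of x is simply the index of the largest discriminant value.

```python
import numpy as np

# Hypothetical parameters for K = 3 classes in D = 2 dimensions.
W = np.array([[ 1.0,  0.0],    # w_1
              [-1.0,  1.0],    # w_2
              [ 0.0, -1.0]])   # w_3
w0 = np.array([0.0, 0.5, -0.5])

def classify(x):
    """Assign x to the class C_k with the largest y_k(x) = w_k^T x + w_k0."""
    y = W @ x + w0
    return int(np.argmax(y))

print(classify(np.array([2.0, 0.0])))   # index of the largest discriminant
```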

Perceptron algorithm
▪ Rosenblatt (1962)
▪ Linear model with a step activation function:

  y(x) = f(w⊤φ(x)),  where f(a) = +1 if a ≥ 0 and f(a) = −1 if a < 0

▪ Train using the perceptron criterion (here tn ∈ {−1, 1})

  EP(w) = − ∑n∈ℳ w⊤φn tn

  where ℳ is the set of misclassified patterns
▪ Note that directly minimizing the total number of misclassified patterns will not work: because of the non-linear f( ⋅ ) it is a piecewise constant function of w, so its gradient is zero almost everywhere

Perceptron algorithm
▪ The total error function is piecewise linear
▪ Stochastic gradient descent (see the sketch below):

  w(τ+1) = w(τ) − η ∇EP(w) = w(τ) + η φn tn

▪ The update is not a function of w, so η can be set equal to 1
▪ Perceptron convergence theorem: if an exact solution exists (i.e., the data are linearly separable), the perceptron algorithm will find a solution in a finite number of steps
▪ Attacked by Minsky and Papert in Perceptrons (1969). The attack is valid only for single-layer perceptrons. Consequence: research in neural computation stopped for nearly a decade
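
A minimal sketch of the perceptron learning rule described above, assuming the feature map φ(x) = (1, x) so that the bias is absorbed into w; the toy data set is made up and linearly separable, so the convergence theorem applies.

```python
import numpy as np

def perceptron(X, t, eta=1.0, max_epochs=100):
    """Perceptron learning: w <- w + eta * phi_n * t_n for misclassified patterns.

    X: (N, D) inputs, t: (N,) targets in {-1, +1}.
    Uses phi(x) = (1, x) so the bias is absorbed into w.
    """
    Phi = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a bias feature
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for phi_n, t_n in zip(Phi, t):
            if t_n * (w @ phi_n) <= 0:               # misclassified (or on the boundary)
                w += eta * phi_n * t_n
                mistakes += 1
        if mistakes == 0:                            # converged: all patterns correct
            break
    return w

# Toy, linearly separable data (illustrative only).
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
t = np.array([1, 1, -1, -1])
w = perceptron(X, t)
print(np.sign(np.hstack([np.ones((4, 1)), X]) @ w))  # should reproduce t
```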



Probabilistic generative models
▪ Model the class-conditional densities p(x | Ck)
▪ Posterior probability for class C1:

  p(C1 | x) = p(x | C1)p(C1) / (p(x | C1)p(C1) + p(x | C2)p(C2)) = 1 / (1 + exp(−a)) = σ(a)

  where we have defined a = ln [p(x | C1)p(C1) / (p(x | C2)p(C2))]

▪ σ is the logistic sigmoid function
▪ The inverse of σ is the logit function a = ln(σ / (1 − σ))
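
A tiny numerical sketch of the sigmoid/logit pair, with made-up values standing in for the joint probabilities p(x | Ck)p(Ck).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logit(s):
    return np.log(s / (1.0 - s))          # inverse of the sigmoid

# Posterior from made-up joint probabilities p(x|C1)p(C1) and p(x|C2)p(C2).
joint1, joint2 = 0.06, 0.02
a = np.log(joint1 / joint2)               # log-odds
print(sigmoid(a))                         # 0.75, same as joint1 / (joint1 + joint2)
print(np.isclose(logit(sigmoid(a)), a))   # the logit recovers a
```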

Probabilistic generative models
▪ Generalization to multiple classes (see the softmax sketch below):

  p(Ck | x) = p(x | Ck)p(Ck) / ∑j p(x | Cj)p(Cj) = exp(ak) / ∑j exp(aj)

  where ak = ln(p(x | Ck)p(Ck))

▪ This is known as the softmax function, because it is a smoothed version of the max function
▪ Different representations for the class-conditional densities have different consequences for how classification is done
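
A short sketch of the softmax; the max-subtraction inside it is a standard numerical-stability trick, not something stated on the slide.

```python
import numpy as np

def softmax(a):
    """Softmax of activations a_k = ln(p(x|C_k) p(C_k)); max-subtraction for stability."""
    a = a - np.max(a)
    e = np.exp(a)
    return e / e.sum()

a = np.array([2.0, 1.0, 0.1])
p = softmax(a)
print(p, p.sum())                       # posteriors p(C_k | x), summing to 1
print(np.argmax(p) == np.argmax(a))     # the softmax preserves the position of the max
```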

Continuous inputs
▪ First assume that all classes share the same covariance matrix and that there are only 2 classes
▪ With Gaussian class-conditional densities

  p(x | Ck) = (1 / ((2π)^(D/2) |Σ|^(1/2))) exp{ −(1/2)(x − μk)⊤Σ−1(x − μk) }

  we have

  p(C1 | x) = σ(w⊤x + w0)

  where

  w = Σ−1(μ1 − μ2)
  w0 = −(1/2) μ1⊤Σ−1μ1 + (1/2) μ2⊤Σ−1μ2 + ln( p(C1) / p(C2) )
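
A minimal sketch of these formulas with made-up means, covariance, and priors: w and w0 are computed from μ1, μ2, Σ, and p(Ck), and the posterior follows from the sigmoid.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative parameters: two Gaussian classes with a shared covariance.
mu1 = np.array([1.0, 1.0])
mu2 = np.array([-1.0, -1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
p1, p2 = 0.6, 0.4                       # priors p(C1), p(C2)

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)
w0 = -0.5 * mu1 @ Sigma_inv @ mu1 + 0.5 * mu2 @ Sigma_inv @ mu2 + np.log(p1 / p2)

x = np.array([0.5, 0.0])
print(sigmoid(w @ x + w0))              # posterior p(C1 | x)
```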
Continuous inputs
▪ The quadratic term from the Gaussian vanishes (it cancels because the covariance matrix is shared). The priors p(Ck) only enter via the bias parameter
▪ For the general case of K classes, we have

  ak(x) = w⊤k x + wk0

  where

  wk = Σ−1μk
  wk0 = −(1/2) μk⊤Σ−1μk + ln p(Ck)
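
The analogous sketch for K classes (again with made-up parameters), classifying by the largest ak(x).

```python
import numpy as np

# Illustrative parameters: K = 3 Gaussian classes sharing one covariance matrix.
mus = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]])
priors = np.array([0.5, 0.3, 0.2])
Sigma = np.eye(2)

Sigma_inv = np.linalg.inv(Sigma)
W = mus @ Sigma_inv                                        # rows are w_k = Sigma^{-1} mu_k
w0 = -0.5 * np.einsum('kd,dj,kj->k', mus, Sigma_inv, mus) + np.log(priors)

x = np.array([1.0, 1.0])
a = W @ x + w0                                             # a_k(x) = w_k^T x + w_k0
print(np.argmax(a))                                        # predicted class index
```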


Linear versus quadratic
▪ When the covariance matrix is shared by the classes, the decision boundary is linear
▪ When each class has its own covariance matrix, the decision boundary is quadratic



Maximum likelihood
▪ Since we have a parametric form for the class-conditional densities p(x | Ck), we can determine the values of the parameters and the priors p(Ck)

  p(xn, C1) = p(C1) p(xn | C1) = q 𝒩(xn | μ1, Σ)
  p(xn, C2) = p(C2) p(xn | C2) = (1 − q) 𝒩(xn | μ2, Σ)

▪ Let tn ∈ {0, 1}; the likelihood is then given by

  p(t, X | q, μ1, μ2, Σ) = ∏n=1…N [q 𝒩(xn | μ1, Σ)]^tn [(1 − q) 𝒩(xn | μ2, Σ)]^(1−tn)
Maximum likelihood
▪ The log-likelihood function, keeping only the terms relevant for q, is:

  ∑n=1…N { tn ln q + (1 − tn) ln(1 − q) }

▪ Maximizing with respect to q yields

  q = (1/N) ∑n=1…N tn = N1/N = N1/(N1 + N2)

  where N1 and N2 are the numbers of data points in classes C1 and C2

Maximum likelihood
▪ The log-likelihood function, keeping only the terms relevant for μ1, is:

  ∑n=1…N tn ln 𝒩(xn | μ1, Σ) = −(1/2) ∑n=1…N tn (xn − μ1)⊤Σ−1(xn − μ1) + const

▪ Maximizing with respect to μ1 yields

  μ1 = (1/N1) ∑n=1…N tn xn

  (the analogous result holds for μ2)
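
A small sketch of these maximum-likelihood estimates on a made-up labeled data set; the estimate of μ2 (the obvious analogue of μ1) is included for completeness.

```python
import numpy as np

# Toy labeled data: t_n = 1 for class C1, t_n = 0 for class C2 (illustrative values).
X = np.array([[1.2, 0.8], [0.9, 1.1], [1.4, 1.0], [-1.0, -0.9], [-1.2, -1.1]])
t = np.array([1, 1, 1, 0, 0])

N = len(t)
N1, N2 = t.sum(), N - t.sum()

q = N1 / N                                   # ML estimate of the prior p(C1)
mu1 = (t[:, None] * X).sum(axis=0) / N1      # mean of the points with t_n = 1
mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2

print(q, mu1, mu2)
```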
Logistic regression
▪ The posterior probability of class C1 is written as a logistic sigmoid acting on a linear function of the feature vector φ:

  p(C1 | φ) = y(φ) = σ(w⊤φ),  with dσ(a)/da = σ(a)(1 − σ(a))

▪ More compact than maximum likelihood fitting of Gaussians: for an M-dimensional feature space, logistic regression has M adjustable parameters, whereas the Gaussian model uses 2M parameters for the means and M(M + 1)/2 parameters for the shared covariance matrix
▪ Maximum likelihood: p(t | w) = ∏n=1…N yn^tn (1 − yn)^(1−tn), with yn = σ(w⊤φn)
Logistic regression
▪ The negative log of the likelihood yields the cross-entropy error function

  E(w) = −ln p(t | w) = −∑n=1…N { tn ln yn + (1 − tn) ln(1 − yn) }

▪ Gradient with respect to w (checked numerically in the sketch below):

  ∇E(w) = ∑n=1…N (yn − tn) φn

▪ This has the same form as the gradient of the log likelihood for the linear regression model with sum-of-squares error:

  ∇ ln p(t | w, β) = ∑n=1…N { tn − w⊤φ(xn) } φ(xn)⊤
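
A brief sketch of the cross-entropy error and its gradient on made-up data, with a finite-difference check confirming the formula ∇E(w) = ∑n (yn − tn)φn.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, Phi, t):
    y = sigmoid(Phi @ w)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def gradient(w, Phi, t):
    y = sigmoid(Phi @ w)
    return Phi.T @ (y - t)                  # sum_n (y_n - t_n) phi_n

# Toy design matrix and targets (illustrative only).
Phi = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -2.0]])
t = np.array([1, 0, 1, 0])
w = np.array([0.1, -0.2])

# Finite-difference check of the analytic gradient.
eps = 1e-6
num = np.array([(cross_entropy(w + eps * e, Phi, t) - cross_entropy(w - eps * e, Phi, t)) / (2 * eps)
                for e in np.eye(2)])
print(gradient(w, Phi, t), num)             # the two should agree closely
```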
Iterative reweighted least squares
▪ Efficient iterative optimization: Newton-Raphson

  wnew = wold − H−1 ∇E(w)

  where H is the Hessian matrix (of second derivatives)
▪ For the sum-of-squares error this can be done in one step, because the error function is quadratic
▪ For the cross-entropy error we get a similar set of normal equations for weighted least squares, but with weights that depend on w
▪ This dependency forces us to apply the update iteratively

Iterative reweighted least squares
▪ Apply this to linear regression:

  ∇E(w) = ∑n=1…N (w⊤φn − tn) φn = Φ⊤Φw − Φ⊤t

  H = ∇∇E(w) = ∑n=1…N φn φn⊤ = Φ⊤Φ

▪ The Newton-Raphson update then takes the form

  wnew = wold − H−1 ∇E(w) = wold − (Φ⊤Φ)−1{Φ⊤Φwold − Φ⊤t}
       = (Φ⊤Φ)−1Φ⊤t
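
A short sketch on made-up data showing that a single Newton-Raphson step, started from an arbitrary w, lands exactly on the least-squares solution (Φ⊤Φ)−1Φ⊤t.

```python
import numpy as np

# Toy design matrix and targets (illustrative only).
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
t = np.array([0.1, 1.1, 1.9, 3.2])

w_old = np.array([5.0, -5.0])               # arbitrary starting point
grad = Phi.T @ Phi @ w_old - Phi.T @ t      # gradient of the sum-of-squares error
H = Phi.T @ Phi                             # Hessian
w_new = w_old - np.linalg.solve(H, grad)    # one Newton-Raphson step

print(w_new)
print(np.linalg.lstsq(Phi, t, rcond=None)[0])  # same as the least-squares solution
```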



Iterative reweighted least squares
▪ Apply this to logistic regression:

  ∇E(w) = ∑n=1…N (yn − tn) φn = Φ⊤(y − t)

  H = ∇∇E(w) = ∑n=1…N yn(1 − yn) φn φn⊤ = Φ⊤RΦ

  with R the diagonal matrix with elements Rnn = yn(1 − yn)



Iterative reweighted least squares
▪ The Newton-Raphson update then takes the form

  wnew = wold − H−1 ∇E(w) = wold − (Φ⊤RΦ)−1Φ⊤(y − t)
       = (Φ⊤RΦ)−1{Φ⊤RΦwold − Φ⊤(y − t)}
       = (Φ⊤RΦ)−1Φ⊤Rz

  where z = Φwold − R−1(y − t)
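
Putting the pieces together, a minimal IRLS sketch for logistic regression on made-up, non-separable data (so the maximum-likelihood solution is finite): each iteration recomputes y, R, and z and then solves the resulting weighted least-squares problem.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls(Phi, t, n_iter=10):
    """IRLS for logistic regression: repeat
    w <- (Phi^T R Phi)^{-1} Phi^T R z
    with R_nn = y_n (1 - y_n) and z = Phi w - R^{-1} (y - t)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        r = y * (1 - y)                      # diagonal entries of R
        R = np.diag(r)
        z = Phi @ w - (y - t) / r            # R^{-1}(y - t), applied elementwise
        w = np.linalg.solve(Phi.T @ R @ Phi, Phi.T @ R @ z)
    return w

# Toy, non-separable data (illustrative): phi(x) = (1, x), targets in {0, 1}.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
Phi = np.column_stack([np.ones_like(x), x])
t = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
w = irls(Phi, t)
print(w)                                     # fitted weights
print(sigmoid(Phi @ w).round(2))             # fitted probabilities y_n
```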
