
Advanced Machine Learning

Lecture 4: Classification
Sandjai Bhulai
Vrije Universiteit Amsterdam

s.bhulai@vu.nl
15 September 2023
Linear models for classification

Advanced Machine Learning
Classification with linear models
▪ Goal: take an input vector x and map it onto one of K discrete classes
▪ Consider linear models: the classes can be separated by (D − 1)-dimensional hyperplanes in the D-dimensional input space
▪ Simplest linear regression model: y(x) = w⊤x + w0
▪ Use an activation function f( ⋅ ) to map the output onto discrete classes: y(x) = f(w⊤x + w0)
▪ Due to f( ⋅ ), these models are no longer linear in the parameters

Discriminant functions
▪ The simplest case is the 2-class case: y(x) = w⊤x + w0, where w is a weight vector and w0 is the bias
▪ The decision boundary is given by y(x) = 0
▪ Consider 2 points xa and xb that lie on the decision surface. Because y(xa) = y(xb) = 0, we have w⊤(xa − xb) = 0
▪ Thus, the vector w is orthogonal to every vector lying within the decision surface
▪ If x is on the decision surface, then y(x) = 0, indicating that

  w⊤x / ∥w∥ = −w0 / ∥w∥
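
A minimal NumPy sketch of these geometric facts (the weight vector w, bias w0, and test points are made-up illustrative values): w is orthogonal to any vector lying in the decision surface, and y(x)/∥w∥ gives the signed distance of a point from that surface.

```python
import numpy as np

# Hypothetical 2-D example: w and w0 are chosen arbitrarily for illustration.
w = np.array([2.0, 1.0])
w0 = -3.0

def y(x):
    """Linear discriminant y(x) = w^T x + w0."""
    return w @ x + w0

# Two points on the decision surface y(x) = 0 (solve 2*x1 + x2 - 3 = 0).
xa = np.array([0.0, 3.0])
xb = np.array([1.5, 0.0])
print(y(xa), y(xb))              # both 0: xa and xb lie on the surface
print(w @ (xa - xb))             # 0: w is orthogonal to xa - xb

# Signed distance of an arbitrary point from the decision surface.
x = np.array([2.0, 2.0])
print(y(x) / np.linalg.norm(w))  # = w^T x / ||w||  +  w0 / ||w||
```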



Geometry of linear discriminants

▪ Decision surface is perpendicular to w


▪ Displacement is controlled by w0



Multiple classes
▪ It is generally not a good idea to use multiple 2-class classifiers to do K-class classification
▪ This leads to ambiguous regions

Single K-class classifier
▪ Use a single discriminant comprising K linear functions of the form yk(x) = w⊤k x + wk0
▪ A point x belongs to class Ck if yk(x) > yj(x) for all j ≠ k (see the sketch below)
▪ The decision boundary between Ck and Cj is given by yk(x) = yj(x) and corresponds to a (D − 1)-dimensional hyperplane

  (wk − wj)⊤x + (wk0 − wj0) = 0

▪ Each decision region is singly connected and convex (due to the linearity of the discriminant functions)
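
A small sketch of this decision rule, assuming made-up weight vectors and biases for K = 3 classes: the predicted class of x is simply the index of the largest discriminant value.

```python
import numpy as np

# Hypothetical parameters for K = 3 classes in D = 2 dimensions.
W = np.array([[ 1.0,  0.0],    # w_1
              [-1.0,  1.0],    # w_2
              [ 0.0, -1.0]])   # w_3
w0 = np.array([0.0, 0.5, -0.5])

def classify(x):
    """Assign x to the class C_k with the largest y_k(x) = w_k^T x + w_k0."""
    y = W @ x + w0
    return int(np.argmax(y))

print(classify(np.array([2.0, 0.0])))   # index of the largest discriminant
```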

Perceptron algorithm
▪ Rosenblatt (1962)
▪ Linear model with a step activation function:

  y(x) = f(w⊤φ(x)),  where f(a) = +1 if a ≥ 0 and f(a) = −1 if a < 0

▪ Train using the perceptron criterion (here tn ∈ {−1, 1})

  EP(w) = − ∑n∈ℳ w⊤φn tn

  where ℳ is the set of misclassified patterns
▪ Note that directly minimizing the total number of misclassified patterns will not work: because of the non-linear f( ⋅ ) it is a piecewise constant function of w, so its gradient is zero almost everywhere

Perceptron algorithm
▪ The total error function is piecewise linear
▪ Stochastic gradient descent (see the sketch below):

  w(τ+1) = w(τ) − η ∇EP(w) = w(τ) + η φn tn

▪ The update is not a function of w, so η can be set equal to 1
▪ Perceptron convergence theorem: if an exact solution exists (i.e., the data are linearly separable), the perceptron algorithm will find a solution in a finite number of steps
▪ Attacked by Minsky and Papert in Perceptrons (1969). The attack is valid only for single-layer perceptrons. Consequence: research in neural computation stopped for nearly a decade
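
A minimal sketch of the perceptron learning rule described above, assuming the feature map φ(x) = (1, x) so that the bias is absorbed into w; the toy data set is made up and linearly separable, so the convergence theorem applies.

```python
import numpy as np

def perceptron(X, t, eta=1.0, max_epochs=100):
    """Perceptron learning: w <- w + eta * phi_n * t_n for misclassified patterns.

    X: (N, D) inputs, t: (N,) targets in {-1, +1}.
    Uses phi(x) = (1, x) so the bias is absorbed into w.
    """
    Phi = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a bias feature
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for phi_n, t_n in zip(Phi, t):
            if t_n * (w @ phi_n) <= 0:               # misclassified (or on the boundary)
                w += eta * phi_n * t_n
                mistakes += 1
        if mistakes == 0:                            # converged: all patterns correct
            break
    return w

# Toy, linearly separable data (illustrative only).
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
t = np.array([1, 1, -1, -1])
w = perceptron(X, t)
print(np.sign(np.hstack([np.ones((4, 1)), X]) @ w))  # should reproduce t
```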



Probabilistic generative models
▪ Model the class-conditional densities p(x | Ck)
▪ Posterior probability for class C1:

  p(C1 | x) = p(x | C1)p(C1) / (p(x | C1)p(C1) + p(x | C2)p(C2)) = 1 / (1 + exp(−a)) = σ(a)

  where we have defined a = ln [p(x | C1)p(C1) / (p(x | C2)p(C2))]

▪ σ is the logistic sigmoid function
▪ The inverse of σ is the logit function a = ln(σ / (1 − σ))
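
A tiny numerical sketch of the sigmoid/logit pair, with made-up values standing in for the joint probabilities p(x | Ck)p(Ck).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logit(s):
    return np.log(s / (1.0 - s))          # inverse of the sigmoid

# Posterior from made-up joint probabilities p(x|C1)p(C1) and p(x|C2)p(C2).
joint1, joint2 = 0.06, 0.02
a = np.log(joint1 / joint2)               # log-odds
print(sigmoid(a))                         # 0.75, same as joint1 / (joint1 + joint2)
print(np.isclose(logit(sigmoid(a)), a))   # the logit recovers a
```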

Probabilistic generative models
▪ Generalization to multiple classes (see the softmax sketch below):

  p(Ck | x) = p(x | Ck)p(Ck) / ∑j p(x | Cj)p(Cj) = exp(ak) / ∑j exp(aj)

  where ak = ln(p(x | Ck)p(Ck))

▪ This is known as the softmax function, because it is a smoothed version of the max function
▪ Different representations for the class-conditional densities have different consequences for how classification is done
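
A short sketch of the softmax; the max-subtraction inside it is a standard numerical-stability trick, not something stated on the slide.

```python
import numpy as np

def softmax(a):
    """Softmax of activations a_k = ln(p(x|C_k) p(C_k)); max-subtraction for stability."""
    a = a - np.max(a)
    e = np.exp(a)
    return e / e.sum()

a = np.array([2.0, 1.0, 0.1])
p = softmax(a)
print(p, p.sum())                       # posteriors p(C_k | x), summing to 1
print(np.argmax(p) == np.argmax(a))     # the softmax preserves the position of the max
```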

Continuous inputs
▪ First assume that all classes share the same covariance matrix and that there are only 2 classes
▪ With Gaussian class-conditional densities

  p(x | Ck) = (1 / ((2π)^(D/2) |Σ|^(1/2))) exp{ −(1/2)(x − μk)⊤Σ−1(x − μk) }

  we have

  p(C1 | x) = σ(w⊤x + w0)

  where

  w = Σ−1(μ1 − μ2)
  w0 = −(1/2) μ1⊤Σ−1μ1 + (1/2) μ2⊤Σ−1μ2 + ln( p(C1) / p(C2) )
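
A minimal sketch of these formulas with made-up means, covariance, and priors: w and w0 are computed from μ1, μ2, Σ, and p(Ck), and the posterior follows from the sigmoid.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative parameters: two Gaussian classes with a shared covariance.
mu1 = np.array([1.0, 1.0])
mu2 = np.array([-1.0, -1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
p1, p2 = 0.6, 0.4                       # priors p(C1), p(C2)

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)
w0 = -0.5 * mu1 @ Sigma_inv @ mu1 + 0.5 * mu2 @ Sigma_inv @ mu2 + np.log(p1 / p2)

x = np.array([0.5, 0.0])
print(sigmoid(w @ x + w0))              # posterior p(C1 | x)
```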
Continuous inputs
▪ The quadratic term from the Gaussian vanishes (it cancels because the covariance matrix is shared). The priors p(Ck) only enter via the bias parameter
▪ For the general case of K classes, we have

  ak(x) = w⊤k x + wk0

  where

  wk = Σ−1μk
  wk0 = −(1/2) μk⊤Σ−1μk + ln p(Ck)
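
The analogous sketch for K classes (again with made-up parameters), classifying by the largest ak(x).

```python
import numpy as np

# Illustrative parameters: K = 3 Gaussian classes sharing one covariance matrix.
mus = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]])
priors = np.array([0.5, 0.3, 0.2])
Sigma = np.eye(2)

Sigma_inv = np.linalg.inv(Sigma)
W = mus @ Sigma_inv                                        # rows are w_k = Sigma^{-1} mu_k
w0 = -0.5 * np.einsum('kd,dj,kj->k', mus, Sigma_inv, mus) + np.log(priors)

x = np.array([1.0, 1.0])
a = W @ x + w0                                             # a_k(x) = w_k^T x + w_k0
print(np.argmax(a))                                        # predicted class index
```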


Linear versus quadratic
▪ When the covariance matrix is shared by the classes, the decision boundary is linear
▪ When each class has its own covariance matrix, the decision boundary is quadratic



Maximum likelihood
▪ Since we have a parametric form for the class-conditional densities p(x | Ck), we can determine the values of the parameters and the priors p(Ck)

  p(xn, C1) = p(C1) p(xn | C1) = q 𝒩(xn | μ1, Σ)
  p(xn, C2) = p(C2) p(xn | C2) = (1 − q) 𝒩(xn | μ2, Σ)

▪ Let tn ∈ {0, 1}; the likelihood is then given by

  p(t, X | q, μ1, μ2, Σ) = ∏n=1…N [q 𝒩(xn | μ1, Σ)]^tn [(1 − q) 𝒩(xn | μ2, Σ)]^(1−tn)
Maximum likelihood
▪ The log-likelihood function, keeping only the terms relevant for q, is:

  ∑n=1…N { tn ln q + (1 − tn) ln(1 − q) }

▪ Maximizing with respect to q yields

  q = (1/N) ∑n=1…N tn = N1/N = N1/(N1 + N2)

  where N1 and N2 are the numbers of data points in classes C1 and C2

Maximum likelihood
▪ The log-likelihood function, keeping only the terms relevant for μ1, is:

  ∑n=1…N tn ln 𝒩(xn | μ1, Σ) = −(1/2) ∑n=1…N tn (xn − μ1)⊤Σ−1(xn − μ1) + const

▪ Maximizing with respect to μ1 yields

  μ1 = (1/N1) ∑n=1…N tn xn

  (the analogous result holds for μ2)
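
A small sketch of these maximum-likelihood estimates on a made-up labeled data set; the estimate of μ2 (the obvious analogue of μ1) is included for completeness.

```python
import numpy as np

# Toy labeled data: t_n = 1 for class C1, t_n = 0 for class C2 (illustrative values).
X = np.array([[1.2, 0.8], [0.9, 1.1], [1.4, 1.0], [-1.0, -0.9], [-1.2, -1.1]])
t = np.array([1, 1, 1, 0, 0])

N = len(t)
N1, N2 = t.sum(), N - t.sum()

q = N1 / N                                   # ML estimate of the prior p(C1)
mu1 = (t[:, None] * X).sum(axis=0) / N1      # mean of the points with t_n = 1
mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2

print(q, mu1, mu2)
```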
Logistic regression
▪ The posterior probability of class C1 is written as a logistic sigmoid acting on a linear function of the feature vector φ:

  p(C1 | φ) = y(φ) = σ(w⊤φ),  with dσ(a)/da = σ(a)(1 − σ(a))

▪ More compact than maximum likelihood fitting of Gaussians: for an M-dimensional feature space, logistic regression has M adjustable parameters, whereas the Gaussian model uses 2M parameters for the means and M(M + 1)/2 parameters for the shared covariance matrix
▪ Maximum likelihood: p(t | w) = ∏n=1…N yn^tn (1 − yn)^(1−tn), with yn = σ(w⊤φn)
Logistic regression
▪ The negative log of the likelihood yields the cross-entropy error function

  E(w) = −ln p(t | w) = −∑n=1…N { tn ln yn + (1 − tn) ln(1 − yn) }

▪ Gradient with respect to w (checked numerically in the sketch below):

  ∇E(w) = ∑n=1…N (yn − tn) φn

▪ This has the same form as the gradient of the log likelihood for the linear regression model with sum-of-squares error:

  ∇ ln p(t | w, β) = ∑n=1…N { tn − w⊤φ(xn) } φ(xn)⊤
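
A brief sketch of the cross-entropy error and its gradient on made-up data, with a finite-difference check confirming the formula ∇E(w) = ∑n (yn − tn)φn.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, Phi, t):
    y = sigmoid(Phi @ w)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def gradient(w, Phi, t):
    y = sigmoid(Phi @ w)
    return Phi.T @ (y - t)                  # sum_n (y_n - t_n) phi_n

# Toy design matrix and targets (illustrative only).
Phi = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -2.0]])
t = np.array([1, 0, 1, 0])
w = np.array([0.1, -0.2])

# Finite-difference check of the analytic gradient.
eps = 1e-6
num = np.array([(cross_entropy(w + eps * e, Phi, t) - cross_entropy(w - eps * e, Phi, t)) / (2 * eps)
                for e in np.eye(2)])
print(gradient(w, Phi, t), num)             # the two should agree closely
```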
Iterative reweighted least squares
▪ Efficient iterative optimization: Newton-Raphson

  wnew = wold − H−1 ∇E(w)

  where H is the Hessian matrix (of second derivatives)
▪ For the sum-of-squares error this can be done in one step, because the error function is quadratic
▪ For the cross-entropy error we get a similar set of normal equations for weighted least squares, but with weights that depend on w
▪ This dependency forces us to apply the update iteratively

Iterative reweighted least squares
▪ Apply this to linear regression:

  ∇E(w) = ∑n=1…N (w⊤φn − tn) φn = Φ⊤Φw − Φ⊤t

  H = ∇∇E(w) = ∑n=1…N φn φn⊤ = Φ⊤Φ

▪ The Newton-Raphson update then takes the form

  wnew = wold − H−1 ∇E(w) = wold − (Φ⊤Φ)−1{Φ⊤Φwold − Φ⊤t}
       = (Φ⊤Φ)−1Φ⊤t
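
A short sketch on made-up data showing that a single Newton-Raphson step, started from an arbitrary w, lands exactly on the least-squares solution (Φ⊤Φ)−1Φ⊤t.

```python
import numpy as np

# Toy design matrix and targets (illustrative only).
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
t = np.array([0.1, 1.1, 1.9, 3.2])

w_old = np.array([5.0, -5.0])               # arbitrary starting point
grad = Phi.T @ Phi @ w_old - Phi.T @ t      # gradient of the sum-of-squares error
H = Phi.T @ Phi                             # Hessian
w_new = w_old - np.linalg.solve(H, grad)    # one Newton-Raphson step

print(w_new)
print(np.linalg.lstsq(Phi, t, rcond=None)[0])  # same as the least-squares solution
```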



Iterative reweighted least squares
▪ Apply this to logistic regression:

  ∇E(w) = ∑n=1…N (yn − tn) φn = Φ⊤(y − t)

  H = ∇∇E(w) = ∑n=1…N yn(1 − yn) φn φn⊤ = Φ⊤RΦ

  with R the diagonal matrix with elements Rnn = yn(1 − yn)



Iterative reweighted least squares
▪ The Newton-Raphson update then takes the form

  wnew = wold − H−1 ∇E(w) = wold − (Φ⊤RΦ)−1Φ⊤(y − t)
       = (Φ⊤RΦ)−1{Φ⊤RΦwold − Φ⊤(y − t)}
       = (Φ⊤RΦ)−1Φ⊤Rz

  where z = Φwold − R−1(y − t)
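
Putting the pieces together, a minimal IRLS sketch for logistic regression on made-up, non-separable data (so the maximum-likelihood solution is finite): each iteration recomputes y, R, and z and then solves the resulting weighted least-squares problem.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls(Phi, t, n_iter=10):
    """IRLS for logistic regression: repeat
    w <- (Phi^T R Phi)^{-1} Phi^T R z
    with R_nn = y_n (1 - y_n) and z = Phi w - R^{-1} (y - t)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        r = y * (1 - y)                      # diagonal entries of R
        R = np.diag(r)
        z = Phi @ w - (y - t) / r            # R^{-1}(y - t), applied elementwise
        w = np.linalg.solve(Phi.T @ R @ Phi, Phi.T @ R @ z)
    return w

# Toy, non-separable data (illustrative): phi(x) = (1, x), targets in {0, 1}.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
Phi = np.column_stack([np.ones_like(x), x])
t = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
w = irls(Phi, t)
print(w)                                     # fitted weights
print(sigmoid(Phi @ w).round(2))             # fitted probabilities y_n
```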
