Classification - Part 1
Marc Toussaint
University of Stuttgart
Duy Nguyen-Tuong
Bosch Center for Artificial Intelligence
Summer 2017
Discriminative Function
• Represent a discrete-valued function F : ℝ^n → Y via a discriminative function

    f : \mathbb{R}^n \times Y \to \mathbb{R}

such that

    F : x \mapsto \mathrm{argmax}_y\, f(x, y)

That is, a discriminative function f(x, y) assigns a score to every pair (x, y), and the induced classifier F maps an input x to the output y with the highest score f(x, y).
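A minimal sketch of this idea in Python (the names and the toy score function below are illustrative, not from the lecture): the classifier evaluates f(x, y) for every candidate class and returns the argmax.

import numpy as np

# Hedged sketch: a discriminative function f(x, y) scores every class y for an
# input x; the induced classifier F(x) returns the highest-scoring class.
def classify(f, x, classes):
    scores = [f(x, y) for y in classes]
    return classes[int(np.argmax(scores))]

# Toy example (illustrative): one linear score function per class on a 2D input.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])          # one weight row per class
f = lambda x, y: W[y] @ x
print(classify(f, np.array([0.5, -2.0]), classes=[0, 1, 2]))   # prints 2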
Example Discriminative Function
[Plot: example discriminative functions over a 2D input; contour levels 0.1, 0.5, 0.9]
How could we parameterize a discriminative function?
• Well, linear in features!
    f(x, y) = \sum_{j=1}^k \phi_j(x, y)\, \beta_j = \phi(x, y)^\top \beta

For instance, with three classes and quadratic features of a scalar input x:

    \phi(x, y) = \big(\, 1\,[y=1],\ x\,[y=1],\ x^2\,[y=1],\ 1\,[y=2],\ x\,[y=2],\ x^2\,[y=2],\ 1\,[y=3],\ x\,[y=3],\ x^2\,[y=3] \,\big)^\top
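As an illustration of this block structure, here is a hedged Python sketch (the function and variable names are my own, not from the lecture) that builds φ(x, y) for the 3-class quadratic-feature example above and evaluates f(x, y) = φ(x, y)^T β.

import numpy as np

# Sketch, assuming a scalar input x, classes y in {1, 2, 3}, and quadratic
# per-input features (1, x, x^2); phi(x, y) places them in the block of class y.
def phi_xy(x, y, num_classes=3):
    phi_x = np.array([1.0, x, x**2])
    out = np.zeros(num_classes * phi_x.size)
    out[(y - 1) * phi_x.size : y * phi_x.size] = phi_x
    return out

beta = np.zeros(9)                      # one weight block per class
f = lambda x, y: phi_xy(x, y) @ beta    # f(x, y) = phi(x, y)^T beta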
Logistic Regression
• Input x ∈ ℝ^2, output y ∈ {0, 1}

[Plot: 2D classification example; shown is RBF ridge logistic regression]
A Loss Function for Classification
• Data D = {(x_i, y_i)}_{i=1}^n with x_i ∈ ℝ^d and y_i ∈ {0, 1}

• Maximum likelihood:
We interpret the discriminative function f(x, y) as defining class probabilities

    p(y \mid x) = \frac{e^{f(x,y)}}{\sum_{y'} e^{f(x,y')}}

→ p(y | x) should be high for the correct class, and low otherwise

    L^{\text{neg-log-likelihood}}(\beta) = -\sum_{i=1}^n \log p(y_i \mid x_i)
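The loss above is straightforward to compute. A hedged numpy sketch (the array names are my own assumptions): given an (n, M) array of scores f(x_i, y) and 0-based integer labels, form the class probabilities and sum the negative log-likelihood.

import numpy as np

# Sketch: f_scores has shape (n, M) with entries f(x_i, y); y holds 0-based labels.
def neg_log_likelihood(f_scores, y):
    z = f_scores - f_scores.max(axis=1, keepdims=True)    # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)   # p(y | x_i) for all classes
    return -np.log(p[np.arange(len(y)), y]).sum()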
Logistic Regression: Binary Case
• In the binary case, we have “two functions” f(x, 0) and f(x, 1). W.l.o.g. we may fix f(x, 0) = 0. Therefore we choose features

    \phi(x, y) = \phi(x)\,[y = 1]

which gives

    p(1 \mid x) = \frac{e^{f(x,1)}}{e^{f(x,0)} + e^{f(x,1)}} = \sigma(f(x, 1))

with the logistic sigmoid function \sigma(z) = \frac{e^z}{1 + e^z} = \frac{1}{e^{-z} + 1}.

[Plot: logistic sigmoid \sigma(z) for z ∈ [-10, 10]]

• The regularized negative log-likelihood is

    L^{\text{logistic}}(\beta) = -\sum_{i=1}^n \log p(y_i \mid x_i) + \lambda \|\beta\|^2
                               = -\sum_{i=1}^n \Big[\, y_i \log p(1 \mid x_i) + (1 - y_i) \log[1 - p(1 \mid x_i)] \,\Big] + \lambda \|\beta\|^2
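A hedged numpy sketch of this regularized loss (X, y, lam are assumed inputs: X holds the rows φ(x_i)^T, y the 0/1 labels, lam the regularization weight λ):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Sketch: regularized binary logistic loss with p_i = sigma(phi(x_i)^T beta).
def logistic_loss(beta, X, y, lam):
    p = sigmoid(X @ beta)
    eps = 1e-12                         # guard against log(0)
    nll = -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return nll + lam * beta @ beta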
Optimal Parameters β
• Gradient (see exercises):

    \frac{\partial L^{\text{logistic}}(\beta)}{\partial \beta}^\top = \sum_{i=1}^n (p_i - y_i)\, \phi(x_i) + 2\lambda I \beta = X^\top (p - y) + 2\lambda I \beta

where p_i := p(y = 1 \mid x_i) and X = \begin{pmatrix} \phi(x_1)^\top \\ \vdots \\ \phi(x_n)^\top \end{pmatrix} \in \mathbb{R}^{n \times k}

• \partial L^{\text{logistic}}(\beta) / \partial \beta is non-linear in β (β also enters the computation of p_i) → there is no analytic solution; the parameters are found iteratively, e.g. with Newton's method,

    with Hessian H = \frac{\partial^2 L^{\text{logistic}}(\beta)}{\partial \beta^2} = X^\top W X + 2\lambda I,
    W = \mathrm{diag}(p \circ (1 - p)), that is, diagonal with W_{ii} = p_i (1 - p_i)
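A minimal Newton-iteration sketch using exactly the gradient and Hessian above (the function and variable names are my own; the fixed iteration count is an arbitrary choice):

import numpy as np

def fit_logistic_newton(X, y, lam=1e-2, iters=20):
    n, k = X.shape
    beta = np.zeros(k)
    I = np.eye(k)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))       # p_i = p(y=1 | x_i)
        grad = X.T @ (p - y) + 2 * lam * beta     # X^T (p - y) + 2*lambda*I*beta
        W = p * (1 - p)                           # diagonal of W
        H = X.T @ (X * W[:, None]) + 2 * lam * I  # X^T W X + 2*lambda*I
        beta -= np.linalg.solve(H, grad)          # Newton step
    return beta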
polynomial (cubic) ridge logistic regression:
[Plot: training data and decision boundary for polynomial (cubic) ridge logistic regression on 2D inputs]
RBF ridge logistic regression:
[Plot: training data and decision boundary for RBF ridge logistic regression on 2D inputs]
Recap
    p(y \mid x) \propto e^{f(x,y)}
Logistic Regression: Multi-Class Case
• Data D = {(x_i, y_i)}_{i=1}^n with x_i ∈ ℝ^d and y_i ∈ {1, .., M}

• We choose f(x, y) = \phi(x, y)^\top \beta with

    \phi(x, y) = \begin{pmatrix} \phi(x)\,[y = 1] \\ \phi(x)\,[y = 2] \\ \vdots \\ \phi(x)\,[y = M] \end{pmatrix}

• Class probabilities:

    p(y \mid x) = \frac{e^{f(x,y)}}{\sum_{y'} e^{f(x,y')}}
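A short sketch of this model in numpy (B and phi_x are assumed names, not from the slides): with one weight block β_c per class, the scores f(x, c) = φ(x)^T β_c are turned into class probabilities by the softmax above.

import numpy as np

# Sketch: B has shape (M, k), one row beta_c per class; phi_x is phi(x).
def class_probs(phi_x, B):
    scores = B @ phi_x                  # f(x, c) for c = 1..M
    scores -= scores.max()              # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum()                  # p(y = c | x) for all classes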
Optimal Parameters β
• Gradient:

    \frac{\partial L^{\text{logistic}}(\beta)}{\partial \beta_c}^\top = \sum_{i=1}^n (p_{ic} - y_{ic})\, \phi(x_i) + 2\lambda I \beta_c = X^\top (p_c - y_c) + 2\lambda I \beta_c, \quad p_{ic} = p(y = c \mid x_i)

• Hessian:

    H = \frac{\partial^2 L^{\text{logistic}}(\beta)}{\partial \beta_c\, \partial \beta_d} = X^\top W_{cd} X + 2\,[c = d]\, \lambda I

where W_{cd} is diagonal with W_{cd,ii} = p_{ic}([c = d] - p_{id})
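In matrix form the per-class gradients stack neatly; a hedged numpy sketch (P, Y, B are assumed names: the (n, M) matrix of p_ic, the one-hot (n, M) label matrix, and the (M, k) matrix of per-class weights):

import numpy as np

# Sketch: row c of the result is X^T (p_c - y_c) + 2*lambda*beta_c.
def multiclass_gradient(X, P, Y, B, lam):
    return (P - Y).T @ X + 2 * lam * B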
polynomial (quadratic) ridge 3-class logistic regression:
[Plot: training data and p = 0.5 decision boundaries for polynomial (quadratic) ridge 3-class logistic regression on 2D inputs]
Summary
• Logistic regression belongs to the class of probabilistic classification approaches. The basic idea is to estimate the class probabilities using a logistic function.

• There are also many non-probabilistic approaches. One of the most important classifiers among them is the support vector machine.
Support Vector Machines: Basic Idea
Let’s start with the linear SVM ...
Maximizing the margin between the two classes leads to the constrained problem

    \min_\beta \ \tfrac{1}{2} \|\beta\|^2 \quad \text{s.t.} \quad y_i (x_i^\top \beta + \beta_0) \ge 1
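This is a small quadratic program. A hedged sketch of solving it directly with cvxpy (the use of cvxpy is my own choice, not part of the lecture; labels y_i are assumed to be in {-1, +1}):

import numpy as np
import cvxpy as cp

# Sketch: hard-margin linear SVM primal, min 1/2 ||beta||^2 s.t. y_i (x_i^T beta + beta_0) >= 1.
def fit_linear_svm_primal(X, y):
    n, d = X.shape
    beta = cp.Variable(d)
    beta0 = cp.Variable()
    constraints = [cp.multiply(y, X @ beta + beta0) >= 1]
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(beta)), constraints)
    problem.solve()
    return beta.value, beta0.value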
Dual Representation of the linear SVM
Lagrangian function

    J(\beta, \beta_0, \alpha) = \tfrac{1}{2} \|\beta\|^2 - \sum_{i=1}^N \alpha_i \big[\, y_i (x_i^\top \beta + \beta_0) - 1 \,\big]

Dual representation

    \max_\alpha \ Q(\alpha) = \sum_{i=1}^N \alpha_i - \tfrac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j\, x_i^\top x_j

    \text{s.t.} \quad \sum_{i=1}^N \alpha_i y_i = 0; \quad \alpha_i \ge 0 \ \text{for}\ i = 1, 2, \dots, N
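A hedged sketch of solving this dual numerically with a generic constrained solver (scipy's SLSQP is my own choice, not part of the lecture; labels y_i in {-1, +1}), then recovering β = Σ_i α_i y_i x_i and β_0 from the support vectors:

import numpy as np
from scipy.optimize import minimize

def fit_linear_svm_dual(X, y):
    n = len(y)
    K = (X @ X.T) * np.outer(y, y)                    # y_i y_j x_i^T x_j
    fun = lambda a: 0.5 * a @ K @ a - a.sum()          # minimize -Q(alpha)
    jac = lambda a: K @ a - np.ones(n)
    cons = {'type': 'eq', 'fun': lambda a: a @ y, 'jac': lambda a: y}
    res = minimize(fun, np.zeros(n), jac=jac, method='SLSQP',
                   bounds=[(0, None)] * n, constraints=[cons])
    alpha = res.x
    beta = (alpha * y) @ X                             # beta = sum_i alpha_i y_i x_i
    sv = alpha > 1e-6                                  # support vectors
    beta0 = np.mean(y[sv] - X[sv] @ beta)              # from y_i (x_i^T beta + beta_0) = 1
    return beta, beta0, alpha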