
Machine Learning

Classification - Part 1

Marc Toussaint
University of Stuttgart

Duy Nguyen-Tuong
Bosch Center for Artificial Intelligence

Summer 2017
Discriminative Function
• Represent a discrete-valued function F : R^n → Y via a discriminative function

      f : R^n × Y → R

  such that
      F : x ↦ argmax_y f(x, y)

  That is, a discriminative function f(x, y) maps an input x to an output

      ŷ(x) = argmax_y f(x, y)

• A discriminative function f(x, y) has a high value if y is a correct answer to x, and a low value if y is a wrong answer
• In that way a discriminative function e.g. discriminates correct sequence/image/graph-labellings from wrong ones

2/19
Example Discriminative Function

• Input: x ∈ R^2; output y ∈ {1, 2, 3}
  Displayed are p(y = 1 | x), p(y = 2 | x), p(y = 3 | x)

  [3D plots of the three class probabilities over the input space x ∈ [−2, 3]², with contour lines at 0.1, 0.5 and 0.9]
3/19
How could we parameterize a discriminative function?
• Well, linear in features!

      f(x, y) = Σ_{j=1}^k φ_j(x, y) β_j = φ(x, y)^T β

• Example: Let x ∈ R and y ∈ {1, 2, 3}. Typical features might be

      φ(x, y) = (  1·[y=1],  x·[y=1],  x²·[y=1],
                   1·[y=2],  x·[y=2],  x²·[y=2],
                   1·[y=3],  x·[y=3],  x²·[y=3]  )^T

• Example: Let x, y ∈ {0, 1} be both discrete. Features might be

      φ(x, y) = (  1,  [x=0][y=0],  [x=0][y=1],  [x=1][y=0],  [x=1][y=1]  )^T
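
  As a minimal sketch (not part of the original slides; the helper names are illustrative), these two feature maps and the resulting prediction ŷ(x) = argmax_y φ(x, y)^T β could be coded as:

import numpy as np

def phi_poly(x, y, num_classes=3):
    # polynomial features (1, x, x^2), gated by the class indicator [y = c]
    base = np.array([1.0, x, x**2])
    phi = np.zeros(3 * num_classes)
    phi[3 * (y - 1): 3 * y] = base          # y is assumed to be in {1, ..., num_classes}
    return phi

def phi_discrete(x, y):
    # constant feature plus joint indicators for x, y in {0, 1}
    return np.array([1.0,
                     float(x == 0 and y == 0),
                     float(x == 0 and y == 1),
                     float(x == 1 and y == 0),
                     float(x == 1 and y == 1)])

beta = np.random.randn(9)                   # some weight vector for the first feature map
scores = [phi_poly(0.5, y) @ beta for y in (1, 2, 3)]
y_hat = 1 + int(np.argmax(scores))          # ŷ(x) = argmax_y f(x, y)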
4/19
more intuition...
• Features “connect” input and output. Each φj (x, y) allows f to capture
a certain dependence between x and y
• If both x and y are discrete, a feature φj (x, y) is typically a joint
indicator function (logical function), indicating a certain “event”
• Each weight βj reflects how important/frequent/infrequent the dependence described by φj(x, y) is
• −f (x, y) is also called energy, and the following methods are also
called energy-based modelling, esp. in neural modelling

5/19
Logistic Regression

  [2D plot of the training data and the learned decision boundary]

• Input x ∈ R^2
  Output y ∈ {0, 1}
  Example shows RBF ridge logistic regression

6/19
A Loss Function for Classification
• Data D = {(xi, yi)}_{i=1}^n with xi ∈ R^d and yi ∈ {0, 1}

• Maximum likelihood:
  We interpret the discriminative function f(x, y) as defining class probabilities

      p(y | x) = e^{f(x,y)} / Σ_{y'} e^{f(x,y')}

  → p(y | x) should be high for the correct class, and low otherwise

  For each (xi, yi) we want to maximize the likelihood p(yi | xi):

      L^neg-log-likelihood(β) = − Σ_{i=1}^n log p(yi | xi)
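
  A brief NumPy sketch of these two quantities (added for illustration, not from the original slides), assuming a feature function phi(x, y), a weight vector beta, and a list of classes:

import numpy as np

def class_probs(x, beta, phi, classes):
    # p(y | x) = exp(f(x, y)) / sum_y' exp(f(x, y')), with f(x, y) = phi(x, y)^T beta
    f = np.array([phi(x, y) @ beta for y in classes])
    f -= f.max()                              # subtract max for numerical stability
    e = np.exp(f)
    return e / e.sum()

def neg_log_likelihood(data, beta, phi, classes):
    # L(beta) = - sum_i log p(y_i | x_i)
    nll = 0.0
    for x_i, y_i in data:
        p = class_probs(x_i, beta, phi, classes)
        nll -= np.log(p[classes.index(y_i)])
    return nll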

7/19
Logistic Regression: Binary Case
• In the binary case, we have “two functions” f(x, 0) and f(x, 1). W.l.o.g. we may fix f(x, 0) = 0. Therefore we choose features

      φ(x, y) = φ(x) [y = 1]

  with arbitrary input features φ(x) ∈ R^k


• We have

      f(x, 1) = φ(x)^T β ,    ŷ = argmax_y f(x, y) = { 1  if φ(x)^T β > 0
                                                      { 0  else

• and conditional class probabilities

      p(1 | x) = e^{f(x,1)} / (e^{f(x,0)} + e^{f(x,1)}) = σ(f(x, 1))

  with the logistic sigmoid function σ(z) = e^z / (1 + e^z) = 1 / (e^{−z} + 1).

  [plot of the sigmoid exp(z)/(1+exp(z)) over z ∈ [−10, 10]]

• Given data D = {(xi, yi)}_{i=1}^n, we minimize

      L^logistic(β) = − Σ_{i=1}^n log p(yi | xi) + λ||β||²
                    = − Σ_{i=1}^n [ yi log p(1 | xi) + (1 − yi) log(1 − p(1 | xi)) ] + λ||β||²
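
  A short NumPy sketch of this loss (added; X is assumed to hold the rows φ(xi)^T, y the labels in {0, 1}):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(beta, X, y, lam):
    # L(beta) = - sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ] + lam * ||beta||^2
    p = sigmoid(X @ beta)                     # p_i = sigma(phi(x_i)^T beta)
    eps = 1e-12                               # avoid log(0)
    nll = -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return nll + lam * beta @ beta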
8/19
Optimal Parameters β
• Gradient (see exercises):

      ∂L^logistic(β)/∂β^T = Σ_{i=1}^n (pi − yi) φ(xi) + 2λIβ = X^T(p − y) + 2λIβ

  where pi := p(y = 1 | xi) ,   X = [ φ(x1)^T ; ... ; φ(xn)^T ] ∈ R^{n×k}

• ∂L^logistic(β)/∂β is non-linear in β (β also enters the calculation of the pi) → does not have an analytic solution

• Newton algorithm: iterate

      β ← β − H^{-1} ∂L^logistic(β)/∂β^T

  with Hessian H = ∂²L^logistic(β)/∂β² = X^T W X + 2λI
  W = diag(p ◦ (1 − p)), that is, diagonal with Wii = pi(1 − pi)

9/19
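
  A compact NumPy sketch of this Newton (IRLS) iteration (added for illustration; X, y, lam as on the previous slides):

import numpy as np

def newton_logistic(X, y, lam, iters=20):
    # Newton updates for binary ridge logistic regression
    n, k = X.shape
    beta = np.zeros(k)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))    # p_i = sigma(phi(x_i)^T beta)
        grad = X.T @ (p - y) + 2 * lam * beta  # X^T (p - y) + 2 lam I beta
        W = np.diag(p * (1 - p))               # W_ii = p_i (1 - p_i)
        H = X.T @ W @ X + 2 * lam * np.eye(k)  # Hessian
        beta = beta - np.linalg.solve(H, grad) # beta <- beta - H^{-1} grad
    return beta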
polynomial (cubic) ridge logistic regression:
  [2D plot of the training data and the decision boundary, and a 3D surface plot over x ∈ [−2, 3]² with contour lines at −1, 0, 1]

./x.exe -mode 2 -modelFeatureType 3 -lambda 1e+0

10/19
RBF ridge logistic regression:
  [2D plot of the training data and the decision boundary, and a 3D surface plot over x ∈ [−2, 3]² with contour lines at −1, 0, 1]

./x.exe -mode 2 -modelFeatureType 4 -lambda 1e+0 -rbfBias 0 -rbfWidth .2

11/19
Recap

                    ridge regression                          logistic regression

REPRESENTATION      f(x) = φ(x)^T β                           f(x, y) = φ(x, y)^T β

OBJECTIVE           L^ls(β) =                                 L^nll(β) =
                    Σ_{i=1}^n (yi − f(xi))² + λ||β||²_I       − Σ_{i=1}^n log p(yi | xi) + λ||β||²_I
                                                              with p(y | x) ∝ e^{f(x,y)}

SOLVER              β̂^ridge = (X^T X + λI)^{-1} X^T y         binary case:
                                                              β ← β − (X^T W X + 2λI)^{-1} (X^T(p − y) + 2λIβ)
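
  For the left column, a small NumPy sketch of the closed-form ridge solver (added; X, y as before):

import numpy as np

def ridge_solution(X, y, lam):
    # beta = (X^T X + lam I)^{-1} X^T y
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)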

12/19
Logistic Regression: Multi-Class Case
• Data D = {(xi, yi)}_{i=1}^n with xi ∈ R^d and yi ∈ {1, .., M}
• We choose f(x, y) = φ(x, y)^T β with

      φ(x, y) = ( φ(x) [y = 1],
                  φ(x) [y = 2],
                  ...,
                  φ(x) [y = M] )^T

  where φ(x) are arbitrary features. We have M−1 parameters β


• Conditional class probabilities

      p(y | x) = e^{f(x,y)} / Σ_{y'} e^{f(x,y')}

• Given data D = {(xi, yi)}_{i=1}^n, we minimize

      L^logistic(β) = − Σ_{i=1}^n log p(y = yi | xi) + λ||β||²

13/19
Optimal Parameters β
• Gradient:

      ∂L^logistic(β)/∂βc^T = Σ_{i=1}^n (pic − yic) φ(xi) + 2λIβc = X^T(pc − yc) + 2λIβc ,
      pic = p(y = c | xi)

• Hessian:

      H = ∂²L^logistic(β)/∂βc∂βd = X^T Wcd X + 2[c = d] λI
      Wcd diagonal with Wcd,ii = pic([c = d] − pid)
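
  A NumPy sketch of these quantities (added; X holds the rows φ(xi)^T, Y is the n×M one-hot label matrix, beta is k×M with one column per class):

import numpy as np

def multiclass_grad_hess(X, Y, beta, lam):
    n, k = X.shape
    M = Y.shape[1]
    F = X @ beta                              # f(x_i, c) = phi(x_i)^T beta_c
    F -= F.max(axis=1, keepdims=True)         # numerical stability
    P = np.exp(F)
    P /= P.sum(axis=1, keepdims=True)         # p_ic = p(y = c | x_i)
    grad = X.T @ (P - Y) + 2 * lam * beta     # column c: X^T (p_c - y_c) + 2 lam beta_c
    H = np.zeros((M, M, k, k))                # Hessian block (c, d)
    for c in range(M):
        for d in range(M):
            w = P[:, c] * ((c == d) - P[:, d])                 # W_cd,ii = p_ic([c=d] - p_id)
            H[c, d] = X.T @ (w[:, None] * X) + 2 * lam * (c == d) * np.eye(k)
    return grad, H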

14/19
polynomial (quadratic) ridge 3-class logistic regression:
  [2D plot of the training data with the p = 0.5 decision boundaries, and 3D plots of the class probabilities over x ∈ [−2, 3]² with contour lines at 0.1, 0.5, 0.9]

./x.exe -mode 3 -modelFeatureType 3 -lambda 1e+1

15/19
Summary
• Logistic regression belongs to the class of probabilistic classification approaches. The basic idea is to estimate the class probabilities using a logistic function.
• There are also many non-probabilistic approaches. One of the most important classifiers among these is the support vector machine.

16/19
Support Vector Machines: Basic Idea

• Which classifier would you prefer?


• The optimal classifier is the one with the maximal margin!
• The larger the margin, the more robust the classifier and the better its generalization.

17/19
Let’s start with the linear SVM ...

• Find a linear surface which maximizes the margin


Thus,

      min_β ½ ||β||²   s.t.   yi(xi^T β + β0) ≥ 1

      yi ∈ {−1, 1} ... outputs, i = 1, ..., N


Decision function

      ŷ(x) = sign(x^T β* + β0*)
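
  A quick sketch using an off-the-shelf solver (scikit-learn's SVC, not the lecture's own code; a very large C approximates the hard-margin problem above):

import numpy as np
from sklearn.svm import SVC

# toy data: two linearly separable classes, labels in {-1, +1}
X = np.array([[0.0, 0.0], [0.5, 0.3], [2.0, 2.0], [2.5, 1.8]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel='linear', C=1e6)             # large C ~ hard-margin linear SVM
clf.fit(X, y)

beta, beta0 = clf.coef_[0], clf.intercept_[0]
y_hat = np.sign(X @ beta + beta0)             # decision function sign(x^T beta + beta0)
print(beta, beta0, clf.support_vectors_)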


18/19
Dual Representation of the linear SVM

Lagrangian function

      J(β, β0, α) = ½ ||β||² − Σ_{i=1}^N αi [ yi(xi^T β + β0) − 1 ]

Dual representation

      max_α Q(α) = Σ_{i=1}^N αi − ½ Σ_{i=1}^N Σ_{j=1}^N αi αj yi yj xi^T xj

      s.t. Σ_{i=1}^N αi yi = 0;  αi ≥ 0 for i = 1, 2, ..., N

• The optimization of the dual form is convex


• The input patterns appear as dot products
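
  A small NumPy sketch of the dual objective (added); since the inputs enter only through the Gram matrix of dot products, that matrix could later be replaced by a kernel matrix:

import numpy as np

def dual_objective(alpha, X, y):
    # Q(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i^T x_j
    G = X @ X.T                               # Gram matrix of dot products x_i^T x_j
    return alpha.sum() - 0.5 * alpha @ ((np.outer(y, y) * G) @ alpha)

def is_feasible(alpha, y, tol=1e-9):
    # constraints: sum_i alpha_i y_i = 0 and alpha_i >= 0
    return abs(alpha @ y) < tol and bool(np.all(alpha >= -tol))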

19/19
