
Machine Learning

Classification - Part 1

Marc Toussaint
University of Stuttgart

Duy Nguyen-Tuong
Bosch Center for Artificial Intelligence

Summer 2017
Discriminative Function
• Represent a discrete-valued function F : R^n → Y via a discriminative function

      f : R^n × Y → R

  such that
      F : x ↦ argmax_y f(x, y)

  That is, a discriminative function f(x, y) maps an input x to an output

      ŷ(x) = argmax_y f(x, y)

• A discriminative function f(x, y) has a high value if y is a correct answer to x, and a low value if y is a wrong answer
• In that way a discriminative function e.g. discriminates correct sequence/image/graph-labellings from wrong ones

2/19
Example Discriminative Function

• Input: x ∈ R^2; output y ∈ {1, 2, 3}
  Displayed are p(y = 1 | x), p(y = 2 | x), p(y = 3 | x)

  [3D plots of the three class probabilities over the input space x ∈ [−2, 3]², with contour lines at 0.1, 0.5 and 0.9]
3/19
How could we parameterize a discriminative function?
• Well, linear in features!

      f(x, y) = Σ_{j=1}^k φ_j(x, y) β_j = φ(x, y)^T β

• Example: Let x ∈ R and y ∈ {1, 2, 3}. Typical features might be

      φ(x, y) = (  1·[y=1],  x·[y=1],  x²·[y=1],
                   1·[y=2],  x·[y=2],  x²·[y=2],
                   1·[y=3],  x·[y=3],  x²·[y=3]  )^T

• Example: Let x, y ∈ {0, 1} be both discrete. Features might be

      φ(x, y) = (  1,  [x=0][y=0],  [x=0][y=1],  [x=1][y=0],  [x=1][y=1]  )^T
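
  As a minimal sketch (not part of the original slides; the helper names are illustrative), these two feature maps and the resulting prediction ŷ(x) = argmax_y φ(x, y)^T β could be coded as:

import numpy as np

def phi_poly(x, y, num_classes=3):
    # polynomial features (1, x, x^2), gated by the class indicator [y = c]
    base = np.array([1.0, x, x**2])
    phi = np.zeros(3 * num_classes)
    phi[3 * (y - 1): 3 * y] = base          # y is assumed to be in {1, ..., num_classes}
    return phi

def phi_discrete(x, y):
    # constant feature plus joint indicators for x, y in {0, 1}
    return np.array([1.0,
                     float(x == 0 and y == 0),
                     float(x == 0 and y == 1),
                     float(x == 1 and y == 0),
                     float(x == 1 and y == 1)])

beta = np.random.randn(9)                   # some weight vector for the first feature map
scores = [phi_poly(0.5, y) @ beta for y in (1, 2, 3)]
y_hat = 1 + int(np.argmax(scores))          # ŷ(x) = argmax_y f(x, y)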
4/19
more intuition...
• Features “connect” input and output. Each φj (x, y) allows f to capture
a certain dependence between x and y
• If both x and y are discrete, a feature φj (x, y) is typically a joint
indicator function (logical function), indicating a certain “event”
• Each weight βj reflects how important/frequent/infrequent the dependence described by φj(x, y) is
• −f (x, y) is also called energy, and the following methods are also
called energy-based modelling, esp. in neural modelling

5/19
Logistic Regression

  [2D plot of the training data and the learned decision boundary]

• Input x ∈ R^2
  Output y ∈ {0, 1}
  Example shows RBF ridge logistic regression

6/19
A Loss Function for Classification
• Data D = {(xi, yi)}_{i=1}^n with xi ∈ R^d and yi ∈ {0, 1}

• Maximum likelihood:
  We interpret the discriminative function f(x, y) as defining class probabilities

      p(y | x) = e^{f(x,y)} / Σ_{y'} e^{f(x,y')}

  → p(y | x) should be high for the correct class, and low otherwise

  For each (xi, yi) we want to maximize the likelihood p(yi | xi):

      L^neg-log-likelihood(β) = − Σ_{i=1}^n log p(yi | xi)
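
  A brief NumPy sketch of these two quantities (added for illustration, not from the original slides), assuming a feature function phi(x, y), a weight vector beta, and a list of classes:

import numpy as np

def class_probs(x, beta, phi, classes):
    # p(y | x) = exp(f(x, y)) / sum_y' exp(f(x, y')), with f(x, y) = phi(x, y)^T beta
    f = np.array([phi(x, y) @ beta for y in classes])
    f -= f.max()                              # subtract max for numerical stability
    e = np.exp(f)
    return e / e.sum()

def neg_log_likelihood(data, beta, phi, classes):
    # L(beta) = - sum_i log p(y_i | x_i)
    nll = 0.0
    for x_i, y_i in data:
        p = class_probs(x_i, beta, phi, classes)
        nll -= np.log(p[classes.index(y_i)])
    return nll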

7/19
Logistic Regression: Binary Case
• In the binary case, we have “two functions” f(x, 0) and f(x, 1). W.l.o.g. we may fix f(x, 0) = 0. Therefore we choose features

      φ(x, y) = φ(x) [y = 1]

  with arbitrary input features φ(x) ∈ R^k


• We have

      f(x, 1) = φ(x)^T β ,    ŷ = argmax_y f(x, y) = { 1  if φ(x)^T β > 0
                                                      { 0  else

• and conditional class probabilities

      p(1 | x) = e^{f(x,1)} / (e^{f(x,0)} + e^{f(x,1)}) = σ(f(x, 1))

  with the logistic sigmoid function σ(z) = e^z / (1 + e^z) = 1 / (e^{−z} + 1).

  [plot of the sigmoid exp(z)/(1+exp(z)) over z ∈ [−10, 10]]

• Given data D = {(xi, yi)}_{i=1}^n, we minimize

      L^logistic(β) = − Σ_{i=1}^n log p(yi | xi) + λ||β||²
                    = − Σ_{i=1}^n [ yi log p(1 | xi) + (1 − yi) log(1 − p(1 | xi)) ] + λ||β||²
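
  A short NumPy sketch of this loss (added; X is assumed to hold the rows φ(xi)^T, y the labels in {0, 1}):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(beta, X, y, lam):
    # L(beta) = - sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ] + lam * ||beta||^2
    p = sigmoid(X @ beta)                     # p_i = sigma(phi(x_i)^T beta)
    eps = 1e-12                               # avoid log(0)
    nll = -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return nll + lam * beta @ beta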
8/19
Optimal Parameters β
• Gradient (see exercises):

      ∂L^logistic(β)/∂β^T = Σ_{i=1}^n (pi − yi) φ(xi) + 2λIβ = X^T(p − y) + 2λIβ

  where pi := p(y = 1 | xi) ,   X = [ φ(x1)^T ; ... ; φ(xn)^T ] ∈ R^{n×k}

• ∂L^logistic(β)/∂β is non-linear in β (β also enters the calculation of the pi) → does not have an analytic solution

• Newton algorithm: iterate

      β ← β − H^{-1} ∂L^logistic(β)/∂β^T

  with Hessian H = ∂²L^logistic(β)/∂β² = X^T W X + 2λI
  W = diag(p ◦ (1 − p)), that is, diagonal with Wii = pi(1 − pi)

9/19
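
  A compact NumPy sketch of this Newton (IRLS) iteration (added for illustration; X, y, lam as on the previous slides):

import numpy as np

def newton_logistic(X, y, lam, iters=20):
    # Newton updates for binary ridge logistic regression
    n, k = X.shape
    beta = np.zeros(k)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))    # p_i = sigma(phi(x_i)^T beta)
        grad = X.T @ (p - y) + 2 * lam * beta  # X^T (p - y) + 2 lam I beta
        W = np.diag(p * (1 - p))               # W_ii = p_i (1 - p_i)
        H = X.T @ W @ X + 2 * lam * np.eye(k)  # Hessian
        beta = beta - np.linalg.solve(H, grad) # beta <- beta - H^{-1} grad
    return beta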
polynomial (cubic) ridge logistic regression:
  [2D plot of the training data and the decision boundary, and a 3D surface plot over x ∈ [−2, 3]² with contour lines at −1, 0, 1]

./x.exe -mode 2 -modelFeatureType 3 -lambda 1e+0

10/19
RBF ridge logistic regression:
  [2D plot of the training data and the decision boundary, and a 3D surface plot over x ∈ [−2, 3]² with contour lines at −1, 0, 1]

./x.exe -mode 2 -modelFeatureType 4 -lambda 1e+0 -rbfBias 0 -rbfWidth .2

11/19
Recap

                    ridge regression                          logistic regression

REPRESENTATION      f(x) = φ(x)^T β                           f(x, y) = φ(x, y)^T β

OBJECTIVE           L^ls(β) =                                 L^nll(β) =
                    Σ_{i=1}^n (yi − f(xi))² + λ||β||²_I       − Σ_{i=1}^n log p(yi | xi) + λ||β||²_I
                                                              with p(y | x) ∝ e^{f(x,y)}

SOLVER              β̂^ridge = (X^T X + λI)^{-1} X^T y         binary case:
                                                              β ← β − (X^T W X + 2λI)^{-1} (X^T(p − y) + 2λIβ)
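
  For the left column, a small NumPy sketch of the closed-form ridge solver (added; X, y as before):

import numpy as np

def ridge_solution(X, y, lam):
    # beta = (X^T X + lam I)^{-1} X^T y
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)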

12/19
Logistic Regression: Multi-Class Case
• Data D = {(xi, yi)}_{i=1}^n with xi ∈ R^d and yi ∈ {1, .., M}
• We choose f(x, y) = φ(x, y)^T β with

      φ(x, y) = ( φ(x) [y = 1],
                  φ(x) [y = 2],
                  ...,
                  φ(x) [y = M] )^T

  where φ(x) are arbitrary features. We have M−1 parameters β


• Conditional class probabilities

      p(y | x) = e^{f(x,y)} / Σ_{y'} e^{f(x,y')}

• Given data D = {(xi, yi)}_{i=1}^n, we minimize

      L^logistic(β) = − Σ_{i=1}^n log p(y = yi | xi) + λ||β||²

13/19
Optimal Parameters β
• Gradient:

      ∂L^logistic(β)/∂βc^T = Σ_{i=1}^n (pic − yic) φ(xi) + 2λIβc = X^T(pc − yc) + 2λIβc ,
      pic = p(y = c | xi)

• Hessian:

      H = ∂²L^logistic(β)/∂βc∂βd = X^T Wcd X + 2[c = d] λI
      Wcd diagonal with Wcd,ii = pic([c = d] − pid)
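
  A NumPy sketch of these quantities (added; X holds the rows φ(xi)^T, Y is the n×M one-hot label matrix, beta is k×M with one column per class):

import numpy as np

def multiclass_grad_hess(X, Y, beta, lam):
    n, k = X.shape
    M = Y.shape[1]
    F = X @ beta                              # f(x_i, c) = phi(x_i)^T beta_c
    F -= F.max(axis=1, keepdims=True)         # numerical stability
    P = np.exp(F)
    P /= P.sum(axis=1, keepdims=True)         # p_ic = p(y = c | x_i)
    grad = X.T @ (P - Y) + 2 * lam * beta     # column c: X^T (p_c - y_c) + 2 lam beta_c
    H = np.zeros((M, M, k, k))                # Hessian block (c, d)
    for c in range(M):
        for d in range(M):
            w = P[:, c] * ((c == d) - P[:, d])                 # W_cd,ii = p_ic([c=d] - p_id)
            H[c, d] = X.T @ (w[:, None] * X) + 2 * lam * (c == d) * np.eye(k)
    return grad, H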

14/19
polynomial (quadratic) ridge 3-class logistic regression:
  [2D plot of the training data with the p = 0.5 decision boundaries, and 3D plots of the class probabilities over x ∈ [−2, 3]² with contour lines at 0.1, 0.5, 0.9]

./x.exe -mode 3 -modelFeatureType 3 -lambda 1e+1

15/19
Summary
• Logistic regression belongs to the class of probabilistic classification approaches. The basic idea is to estimate the class probabilities using a logistic function.
• There are also many non-probabilistic approaches. One of the most important classifiers among these is the support vector machine.

16/19
Support Vector Machines: Basic Idea

• Which classifier would you prefer?


• The optimal classifier is the one with the maximal margin!
• The larger the margin, the more robust the classifier and the better its generalization.

17/19
Let’s start with the linear SVM ...

• Find a linear surface which maximizes the margin


Thus,

      min_β ½ ||β||²   s.t.   yi(xi^T β + β0) ≥ 1

      yi ∈ {−1, 1} ... outputs, i = 1, ..., N


Decision function

      ŷ(x) = sign(x^T β* + β0*)
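
  A quick sketch using an off-the-shelf solver (scikit-learn's SVC, not the lecture's own code; a very large C approximates the hard-margin problem above):

import numpy as np
from sklearn.svm import SVC

# toy data: two linearly separable classes, labels in {-1, +1}
X = np.array([[0.0, 0.0], [0.5, 0.3], [2.0, 2.0], [2.5, 1.8]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel='linear', C=1e6)             # large C ~ hard-margin linear SVM
clf.fit(X, y)

beta, beta0 = clf.coef_[0], clf.intercept_[0]
y_hat = np.sign(X @ beta + beta0)             # decision function sign(x^T beta + beta0)
print(beta, beta0, clf.support_vectors_)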


18/19
Dual Representation of the linear SVM

Lagrangian function

      J(β, β0, α) = ½ ||β||² − Σ_{i=1}^N αi [ yi(xi^T β + β0) − 1 ]

Dual representation

      max_α Q(α) = Σ_{i=1}^N αi − ½ Σ_{i=1}^N Σ_{j=1}^N αi αj yi yj xi^T xj

      s.t. Σ_{i=1}^N αi yi = 0;  αi ≥ 0 for i = 1, 2, ..., N

• The optimization of the dual form is convex


• The input patterns appear as dot products
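
  A small NumPy sketch of the dual objective (added); since the inputs enter only through the Gram matrix of dot products, that matrix could later be replaced by a kernel matrix:

import numpy as np

def dual_objective(alpha, X, y):
    # Q(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i^T x_j
    G = X @ X.T                               # Gram matrix of dot products x_i^T x_j
    return alpha.sum() - 0.5 * alpha @ ((np.outer(y, y) * G) @ alpha)

def is_feasible(alpha, y, tol=1e-9):
    # constraints: sum_i alpha_i y_i = 0 and alpha_i >= 0
    return abs(alpha @ y) < tol and bool(np.all(alpha >= -tol))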

19/19
