
MIMA Group


Machine Learning
Chapter 5
Linear Discriminant Functions

Xin-Shun Xu @ SDU School of Computer Science and Technology, Shandong University


Contents

• Introduction
• Linear Discriminant Functions and Decision Surface
• Linear Separability
• Learning
  • Gradient Descent Algorithm
  • Newton's Method


Decision-Making Approaches

• Probabilistic approaches
  • Based on the underlying probability densities of the training sets.
  • For example, the Bayesian decision rule assumes that the underlying probability densities are available.
• Discriminating approaches
  • Assume we know the proper forms of the discriminant functions.
  • Use the samples to estimate the values of the parameters of the classifier.


Discriminant Function

    g_i : R^d → R,   1 ≤ i ≤ c

• A useful way to represent a classifier.
• One function per category.
• Decide ωi if

    g_i(x) > g_j(x)   for all j ≠ i

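Below is a minimal sketch of this decision rule in Python (not from the slides); the helper name classify and the two example discriminants with made-up weights are illustrative assumptions only.

import numpy as np

def classify(x, discriminants):
    """Decide omega_i: pick the category whose discriminant g_i(x) is largest."""
    scores = [g(x) for g in discriminants]
    return int(np.argmax(scores))   # index i such that g_i(x) >= g_j(x) for all j

# Two example linear discriminants g_i(x) = w_i^T x + w_i0 (made-up weights)
g1 = lambda x: np.dot(np.array([1.0, -2.0]), x) + 0.5
g2 = lambda x: np.dot(np.array([-1.0, 1.0]), x) - 0.5

print(classify(np.array([2.0, 1.0]), [g1, g2]))   # prints 0, i.e. decide omega_1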


Linear Discriminant Functions

    g_i(x) = w_i^T x + w_{i0}

where w_i is the weight vector and w_{i0} is the bias (threshold).

• Easy to compute, analyze, and learn.
• Linear classifiers are attractive candidates for an initial, trial classifier.
• Learning proceeds by minimizing a criterion function, e.g., the training error.
  Difficulty: a small training error does not guarantee a small test error.

Linear Discriminant Functions

• Two-category case

    g(x) = w^T x + w_0

    Decide ω1 if g(x) > 0
    Decide ω2 if g(x) < 0

With per-class discriminants

    g_1(x) = w_1^T x + w_{10}
    g_2(x) = w_2^T x + w_{20}

we can use the single function g(x) = g_1(x) − g_2(x). Thus, in the two-category case it suffices to consider only d + 1 parameters (w and w_0) instead of 2(d + 1) parameters.

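As a quick illustration (not part of the slides), the sketch below checks that the single discriminant g(x) = g_1(x) − g_2(x), with w = w_1 − w_2 and w_0 = w_{10} − w_{20}, gives the same decision as comparing g_1 and g_2 directly; all weights are made up.

import numpy as np

w1, w10 = np.array([1.0, -2.0]), 0.5    # parameters of g_1 (made up)
w2, w20 = np.array([-1.0, 1.0]), -0.5   # parameters of g_2 (made up)

w, w0 = w1 - w2, w10 - w20              # the d + 1 parameters that actually matter

x = np.array([2.0, 1.0])
decide_with_two = 1 if w1 @ x + w10 > w2 @ x + w20 else 2   # compare g_1 and g_2
decide_with_one = 1 if w @ x + w0 > 0 else 2                # single discriminant g = g_1 - g_2
assert decide_with_one == decide_with_two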


Linear Discriminant Functions

• Two-category case: implementation

    Decide ω1 if g(x) > 0
    Decide ω2 if g(x) < 0

[Figure: g(x) implemented as a single linear unit; the inputs x1, x2, …, xd are weighted by w1, w2, …, wd and summed together with the bias w0.]

[Figure: the same unit with the bias treated as a weight w0 on a constant augmented input x0 = 1.]

Decision Surface

    g(x) = w^T x + w_0

The decision surface is the hyperplane g(x) = w^T x + w_0 = 0, which can be rewritten as

    w^T ( x + (w_0 / ‖w‖) (w / ‖w‖) ) = 0

    Decide ω1 if g(x) > 0
    Decide ω2 if g(x) < 0

[Figure: the hyperplane g(x) = 0 in the (x1, x2) feature space, with w normal to the surface, separating region R1 (where g(x) > 0) from region R2 (where g(x) < 0).]


Decision Surface

1. A linear discriminant function g(x) = w^T x + w_0 divides the feature space by a hyperplane g(x) = 0.
2. The orientation of the surface is determined by the normal vector w.
3. The location of the surface is determined by the bias w_0.

    Decide ω1 if g(x) > 0;  decide ω2 if g(x) < 0.

[Figure: the hyperplane g(x) = 0 separating regions R1 and R2 in the (x1, x2) feature space, with w normal to the surface.]

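A small sketch (illustrative values, not from the slides) that computes g(x), the signed distance g(x)/‖w‖ of a point from the hyperplane, and the resulting decision:

import numpy as np

w = np.array([3.0, 4.0])    # normal vector: fixes the orientation of the hyperplane (example values)
w0 = -5.0                   # bias: fixes the location of the hyperplane (example value)

def g(x):
    return w @ x + w0

x = np.array([3.0, 1.0])
signed_distance = g(x) / np.linalg.norm(w)   # positive on the omega_1 side, negative on the omega_2 side
decision = 1 if g(x) > 0 else 2
print(signed_distance, decision)             # 1.6, decide omega_1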


Augmented Space

    g(x) = w^T x + w_0        (d-dimensional feature space)

Let

    a = [a_0, a_1, …, a_d]^T = [w_0, w^T]^T     (augmented weight vector)
    y = [y_0, y_1, …, y_d]^T = [1, x^T]^T       (augmented feature vector)

so that both a and y are (d+1)-dimensional and g(x) = a^T y.

Decision surface: a^T y = 0, which passes through the origin of the augmented space.

    Decide ω1 if g(x) > 0, i.e. if a^T y > 0
    Decide ω2 if g(x) < 0, i.e. if a^T y < 0

[Figure: the decision regions R1 and R2 in the original (x1, x2) feature space.]


Augmented Space

    g(x) = w^T x + w_0        a^T y = 0

[Figure: the decision boundary in the original (x1, x2) feature space and the corresponding surface a^T y = 0 in the augmented (y0, y1, y2) space; the augmented surface passes through the origin, with the weight vector a normal to it and regions R1 and R2 on either side.]

Augmented Space

• Decision surface in feature space:

    g(x) = w^T x + w_0 = 0

  It passes through the origin only when w_0 = 0.

• Decision surface in augmented space:

    a^T y = 0

  It always passes through the origin.

    a = [w_0, w^T]^T,   y = [1, x^T]^T

By using this mapping, the problem of finding the weight vector w and the threshold w_0 is reduced to finding a single vector a.

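A small sketch (illustrative, not from the slides) of the augmentation described above: prepending 1 to each sample and w_0 to the weight vector so that g(x) = a^T y.

import numpy as np

def augment(x):
    """Map x in R^d to the augmented vector y = [1, x_1, ..., x_d] in R^(d+1)."""
    return np.concatenate(([1.0], x))

w = np.array([3.0, 4.0])          # example weight vector
w0 = -5.0                         # example threshold
a = np.concatenate(([w0], w))     # augmented weight vector a = [w_0, w]

x = np.array([3.0, 1.0])
y = augment(x)
assert np.isclose(w @ x + w0, a @ y)   # g(x) = a^T y, so only the single vector a must be learned
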
Linear Separability

• Two-category case

[Figure: two scatter plots of ω1 and ω2 samples; in the first the classes can be separated by a line (linearly separable), in the second they cannot (not linearly separable).]



Linear Separability

• Two-category case

Given a set of samples y1, y2, …, yn, some labeled ω1 and some labeled ω2, if there exists a vector a such that

    a^T y_i > 0   if y_i is labeled ω1
    a^T y_i < 0   if y_i is labeled ω2

then the samples are said to be linearly separable.

[Figure: a linearly separable two-class sample set.]


Linear Separability

• Two-category case: normalization

If we withdraw the labels of the samples and replace every sample labeled ω2 by its negative, the problem becomes equivalent to finding a vector a such that

    a^T y_i > 0   for all i.

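A sketch (assumed, not from the slides) of this normalization: negating the ω2 samples so that separability reduces to finding a with a^T y_i > 0 for every i; the toy data and the candidate vector a are made up.

import numpy as np

# Augmented samples y_i (first component 1) and their labels -- made-up toy data
Y = np.array([[1.0,  2.0,  1.0],     # labeled omega_1
              [1.0,  1.5,  0.5],     # labeled omega_1
              [1.0, -1.0, -2.0],     # labeled omega_2
              [1.0, -0.5, -1.5]])    # labeled omega_2
labels = np.array([1, 1, 2, 2])

# "Normalization": replace every sample labeled omega_2 by its negative
Y_norm = np.where((labels == 2)[:, None], -Y, Y)

def is_solution(a, Y_norm):
    """True if a^T y_i > 0 for every normalized sample, i.e. a separates the two classes."""
    return bool(np.all(Y_norm @ a > 0))

print(is_solution(np.array([0.0, 1.0, 1.0]), Y_norm))   # True for this toy data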


Solution Region in Feature Space

Separating plane:  a^T y = a_0 y_0 + a_1 y_1 + ⋯ + a_d y_d = 0

[Figure: the solution region in feature space for the normalized case.]

Solution Region in Weight Space

    a^T y_i > 0          a^T y_i ≥ b

Shrink the solution region by a margin b / ‖y_i‖, with b > 0.

[Figure: the solution region in weight space, without and with the margin b.]

Linear Discriminant Functions

How to learn the weights?


Criterion Function

• To facilitate learning, we usually define a scalar criterion function.
• It usually represents the penalty or cost of a solution.
• Our goal is to minimize its value.
  → Function optimization.

How do we minimize the criterion function?


Gradient Descent Algorithm

• Our goal is to go downhill.

[Figure: the criterion surface J(a) over (a1, a2) and its contour map.]


Gradient Descent Algorithm

• Taylor expansion

    f(x + Δx) = f(x) + ∇f(x)^T Δx + O(Δx^T Δx)

  f : R^d → R      a real-valued d-variate function
  x ∈ R^d          a point in the d-dimensional Euclidean space
  Δx ∈ R^d         a small shift in the d-dimensional Euclidean space
  ∇f(x)            the gradient of f(·) at x
  O(Δx^T Δx)       the big-O order of Δx^T Δx


Gradient Descent Algorithm

• Taylor expansion

    f(x + Δx) = f(x) + ∇f(x)^T Δx + O(Δx^T Δx)

What happens if we set Δx to be negatively proportional to the gradient at x, i.e.,

    Δx = −η ∇f(x)    (η being a small positive scalar)?

Then

    f(x + Δx) = f(x) − η ∇f(x)^T ∇f(x) + O(Δx^T Δx)

The subtracted term η ∇f(x)^T ∇f(x) is non-negative, and the big-O term can be ignored when η is small. Therefore, we have f(x + Δx) ≤ f(x).

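A tiny numerical check of this argument (illustrative only): for a simple smooth f and a small η, taking Δx = −η ∇f(x) does not increase f.

import numpy as np

def f(x):
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 3.0) ** 2   # a simple smooth test function

def grad_f(x):
    return np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 3.0)])

x = np.array([4.0, 2.0])
eta = 0.05                      # small positive learning rate
dx = -eta * grad_f(x)           # step taken opposite to the gradient
assert f(x + dx) <= f(x)        # f decreases: f(x + dx) = f(x) - eta * ||grad f(x)||^2 + O(dx^T dx)
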
Gradient Descent Algorithm

• Basic strategy
  To minimize some function f(·), the general gradient descent technique works in the following iterative way:

  1. Set a learning rate η > 0 and a small threshold θ > 0.
  2. Randomly initialize x_0 ∈ R^d as the starting point; set k = 0.
  3. do  k = k + 1
  4.     x_k = x_{k−1} − η ∇f(x_{k−1})
  5. until ‖η ∇f(x_k)‖ < θ
  6. Return x_k and f(x_k).

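A straightforward sketch of the procedure above in Python (the stopping test follows step 5 as reconstructed; the test function, parameter values, and the max_iter safeguard are illustrative assumptions).

import numpy as np

def gradient_descent(f, grad_f, d, eta=0.1, theta=1e-6, max_iter=10000, rng=None):
    """Minimize f by repeatedly stepping opposite to its gradient.

    Mirrors the slide: set eta > 0 and theta > 0, randomly initialize x_0 in R^d,
    then iterate x_k = x_{k-1} - eta * grad_f(x_{k-1}) until ||eta * grad_f(x_k)|| < theta.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal(d)                 # random starting point x_0
    for _ in range(max_iter):                  # max_iter is a safeguard, not on the slide
        x = x - eta * grad_f(x)
        if np.linalg.norm(eta * grad_f(x)) < theta:
            break
    return x, f(x)

# Example: minimize f(x) = ||x - [1, -3]||^2, whose minimum is at (1, -3)
f = lambda x: np.sum((x - np.array([1.0, -3.0])) ** 2)
grad_f = lambda x: 2.0 * (x - np.array([1.0, -3.0]))
x_min, f_min = gradient_descent(f, grad_f, d=2)
print(x_min, f_min)    # approximately [1, -3] and 0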


Gradient Descent Algorithm

Why the negative gradient direction?

    ∇_x = ( ∂/∂x_1, ∂/∂x_2, …, ∂/∂x_d )^T

    df = (∇_x f)^T dx

• steepest ascent if dx ∝ ∇_x f
• df = (∇_x f)^T dx = 0 if dx ⊥ ∇_x f
• steepest descent if dx ∝ −∇_x f
Gradient Descent Algorithm

How long a step shall we take?

[Figure: contour plot of f(x) over (x1, x2); at the current point the negative gradient −∇_x f indicates the direction to go.]


Gradient Descent Algorithm

If an improper learning rate η(k) is used, the convergence rate may be poor:

1. Too small: slow convergence.
2. Too large: overshooting.

[Figure: the basin of attraction around the minimum x*; iterates x0, x1, x2 and their values f(x0), f(x1) approaching the minimum.]

Furthermore, the best descent direction is not necessarily, and in fact is quite rarely, the direction of steepest descent.


Newton's Method

• Global minimum of a paraboloid

    Paraboloid:  f(x) = c + a^T x + (1/2) x^T Q x

  We can find the global minimum of a paraboloid by setting its gradient to zero:

    ∇_x f(x) = a + Q x = 0    ⟹    x* = −Q^{−1} a

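A short sketch (with made-up values of c, a, and Q) of finding this minimum by solving the linear system a + Q x = 0:

import numpy as np

c = 2.0
a = np.array([1.0, -2.0])
Q = np.array([[4.0, 1.0],      # symmetric positive definite, so the paraboloid has a unique minimum
              [1.0, 3.0]])

def f(x):
    return c + a @ x + 0.5 * x @ Q @ x

x_star = -np.linalg.solve(Q, a)           # x* = -Q^{-1} a, from setting the gradient to zero
print(x_star, f(x_star))
assert np.allclose(a + Q @ x_star, 0.0)   # the gradient vanishes at x*
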
Newton's Method

    f(x_k + Δx) ≈ f(x_k) + g_k^T Δx + (1/2) Δx^T Q_k Δx

Second-order Taylor series expansion: all smooth functions can be approximated by paraboloids in a sufficiently small neighborhood of any point.


Newton's Method

    f(x_k + Δx) ≈ f(x_k) + g_k^T Δx + (1/2) Δx^T Q_k Δx

    g_k = ∇_x f(x) |_{x = x_k}          (gradient of f at x_k)

    Q_k = ∂²f / ∂x ∂x^T |_{x = x_k}     (Hessian matrix of f at x_k)

          ⎡ ∂²f/∂x_1²       …   ∂²f/∂x_1∂x_d ⎤
        = ⎢      ⋮          ⋱        ⋮       ⎥   evaluated at x = x_k
          ⎣ ∂²f/∂x_d∂x_1    …   ∂²f/∂x_d²    ⎦


Newton's Method

    f(x_k + Δx) ≈ f(x_k) + g_k^T Δx + (1/2) Δx^T Q_k Δx

Set

    ∇_{Δx} f = g_k + Q_k Δx = 0

which gives the update

    x_{k+1} = x_k − Q_k^{−1} g_k

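A compact sketch of iterating this Newton update (the test function and its derivatives are made up for illustration; the Hessian here is nonsingular everywhere):

import numpy as np

def newton(grad_f, hess_f, x0, theta=1e-8, max_iter=100):
    """Iterate x_{k+1} = x_k - Q_k^{-1} g_k until the gradient is (nearly) zero."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)                       # g_k: gradient at x_k
        if np.linalg.norm(g) < theta:
            break
        Q = hess_f(x)                       # Q_k: Hessian at x_k (must be nonsingular)
        x = x - np.linalg.solve(Q, g)       # Newton step
    return x

# Example: f(x) = exp(x1) + exp(-x1) + (x2 + 2)^2, minimum at (0, -2)
grad_f = lambda x: np.array([np.exp(x[0]) - np.exp(-x[0]), 2.0 * (x[1] + 2.0)])
hess_f = lambda x: np.array([[np.exp(x[0]) + np.exp(-x[0]), 0.0],
                             [0.0, 2.0]])
print(newton(grad_f, hess_f, x0=[1.0, 3.0]))   # converges to approximately [0, -2]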


Comparison

[Figure: convergence paths of Newton's method and gradient descent on the same criterion surface.]


Comparison

• Newton's method will usually give a greater improvement per step than the simple gradient descent algorithm, even with the optimal value of η_k.
• However, Newton's method is not applicable if the Hessian matrix Q is singular.
• Even when Q is nonsingular, computing Q^{−1} is time consuming: O(d³).
• It often takes less time to set η_k to a constant (smaller than necessary) than to compute the optimal η_k at each step.


Summary

• Discriminant functions
• Linear Discriminant Functions and Decision Surface
  • The general setting: one function for each class
  • The two-category case
  • Minimization of a criterion/objective function
• Linear Separability


Summary

• Learning
  • Gradient descent

      x_k = x_{k−1} − η ∇f(x_{k−1})

  • Newton's method

      x_{k+1} = x_k − Q_k^{−1} g_k


MIMA Group

Any questions?
