
MIMA Group


Machine Learning
Chapter 5
Linear Discriminant Functions

Xin-Shun Xu @ SDU School of Computer Science and Technology, Shandong University


Contents

• Introduction
• Linear Discriminant Functions and Decision Surface
• Linear Separability
• Learning
  • Gradient Descent Algorithm
  • Newton's Method


Decision-Making Approaches

• Probabilistic approaches
  • Based on the underlying probability densities of the training sets.
  • For example, the Bayesian decision rule assumes that the underlying probability densities are available.
• Discriminating approaches
  • Assume we know the proper forms of the discriminant functions.
  • Use the samples to estimate the values of the parameters of the classifier.


Discriminant Function

    g_i : R^d → R,   1 ≤ i ≤ c

• A useful way to represent a classifier.
• One function per category.
• Decide ωi if

    g_i(x) > g_j(x)   for all j ≠ i

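Below is a minimal sketch of this decision rule in Python (not from the slides); the helper name classify and the two example discriminants with made-up weights are illustrative assumptions only.

import numpy as np

def classify(x, discriminants):
    """Decide omega_i: pick the category whose discriminant g_i(x) is largest."""
    scores = [g(x) for g in discriminants]
    return int(np.argmax(scores))   # index i such that g_i(x) >= g_j(x) for all j

# Two example linear discriminants g_i(x) = w_i^T x + w_i0 (made-up weights)
g1 = lambda x: np.dot(np.array([1.0, -2.0]), x) + 0.5
g2 = lambda x: np.dot(np.array([-1.0, 1.0]), x) - 0.5

print(classify(np.array([2.0, 1.0]), [g1, g2]))   # prints 0, i.e. decide omega_1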


Linear Discriminant Functions

    g_i(x) = w_i^T x + w_{i0}

where w_i is the weight vector and w_{i0} is the bias (threshold).

• Easy to compute, analyze, and learn.
• Linear classifiers are attractive candidates for an initial, trial classifier.
• Learning proceeds by minimizing a criterion function, e.g., the training error.
  Difficulty: a small training error does not guarantee a small test error.

Linear Discriminant Functions

• Two-category case

    g(x) = w^T x + w_0

    Decide ω1 if g(x) > 0
    Decide ω2 if g(x) < 0

With per-class discriminants

    g_1(x) = w_1^T x + w_{10}
    g_2(x) = w_2^T x + w_{20}

we can use the single function g(x) = g_1(x) − g_2(x). Thus, in the two-category case it suffices to consider only d + 1 parameters (w and w_0) instead of 2(d + 1) parameters.

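As a quick illustration (not part of the slides), the sketch below checks that the single discriminant g(x) = g_1(x) − g_2(x), with w = w_1 − w_2 and w_0 = w_{10} − w_{20}, gives the same decision as comparing g_1 and g_2 directly; all weights are made up.

import numpy as np

w1, w10 = np.array([1.0, -2.0]), 0.5    # parameters of g_1 (made up)
w2, w20 = np.array([-1.0, 1.0]), -0.5   # parameters of g_2 (made up)

w, w0 = w1 - w2, w10 - w20              # the d + 1 parameters that actually matter

x = np.array([2.0, 1.0])
decide_with_two = 1 if w1 @ x + w10 > w2 @ x + w20 else 2   # compare g_1 and g_2
decide_with_one = 1 if w @ x + w0 > 0 else 2                # single discriminant g = g_1 - g_2
assert decide_with_one == decide_with_two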


Linear Discriminant Functions

• Two-category case: implementation

    Decide ω1 if g(x) > 0
    Decide ω2 if g(x) < 0

[Figure: g(x) implemented as a single linear unit; the inputs x1, x2, …, xd are weighted by w1, w2, …, wd and summed together with the bias w0.]

[Figure: the same unit with the bias treated as a weight w0 on a constant augmented input x0 = 1.]

Decision Surface

    g(x) = w^T x + w_0

The decision surface is the hyperplane g(x) = w^T x + w_0 = 0, which can be rewritten as

    w^T ( x + (w_0 / ‖w‖) (w / ‖w‖) ) = 0

    Decide ω1 if g(x) > 0
    Decide ω2 if g(x) < 0

[Figure: the hyperplane g(x) = 0 in the (x1, x2) feature space, with w normal to the surface, separating region R1 (where g(x) > 0) from region R2 (where g(x) < 0).]


Decision Surface

1. A linear discriminant function g(x) = w^T x + w_0 divides the feature space by a hyperplane g(x) = 0.
2. The orientation of the surface is determined by the normal vector w.
3. The location of the surface is determined by the bias w_0.

    Decide ω1 if g(x) > 0;  decide ω2 if g(x) < 0.

[Figure: the hyperplane g(x) = 0 separating regions R1 and R2 in the (x1, x2) feature space, with w normal to the surface.]

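A small sketch (illustrative values, not from the slides) that computes g(x), the signed distance g(x)/‖w‖ of a point from the hyperplane, and the resulting decision:

import numpy as np

w = np.array([3.0, 4.0])    # normal vector: fixes the orientation of the hyperplane (example values)
w0 = -5.0                   # bias: fixes the location of the hyperplane (example value)

def g(x):
    return w @ x + w0

x = np.array([3.0, 1.0])
signed_distance = g(x) / np.linalg.norm(w)   # positive on the omega_1 side, negative on the omega_2 side
decision = 1 if g(x) > 0 else 2
print(signed_distance, decision)             # 1.6, decide omega_1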


Augmented Space

    g(x) = w^T x + w_0        (d-dimensional feature space)

Let

    a = [a_0, a_1, …, a_d]^T = [w_0, w^T]^T     (augmented weight vector)
    y = [y_0, y_1, …, y_d]^T = [1, x^T]^T       (augmented feature vector)

so that both a and y are (d+1)-dimensional and g(x) = a^T y.

Decision surface: a^T y = 0, which passes through the origin of the augmented space.

    Decide ω1 if g(x) > 0, i.e. if a^T y > 0
    Decide ω2 if g(x) < 0, i.e. if a^T y < 0

[Figure: the decision regions R1 and R2 in the original (x1, x2) feature space.]


Augmented Space

    g(x) = w^T x + w_0        a^T y = 0

[Figure: the decision boundary in the original (x1, x2) feature space and the corresponding surface a^T y = 0 in the augmented (y0, y1, y2) space; the augmented surface passes through the origin, with the weight vector a normal to it and regions R1 and R2 on either side.]

Augmented Space

• Decision surface in feature space:

    g(x) = w^T x + w_0 = 0

  It passes through the origin only when w_0 = 0.

• Decision surface in augmented space:

    a^T y = 0

  It always passes through the origin.

    a = [w_0, w^T]^T,   y = [1, x^T]^T

By using this mapping, the problem of finding the weight vector w and the threshold w_0 is reduced to finding a single vector a.

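A small sketch (illustrative, not from the slides) of the augmentation described above: prepending 1 to each sample and w_0 to the weight vector so that g(x) = a^T y.

import numpy as np

def augment(x):
    """Map x in R^d to the augmented vector y = [1, x_1, ..., x_d] in R^(d+1)."""
    return np.concatenate(([1.0], x))

w = np.array([3.0, 4.0])          # example weight vector
w0 = -5.0                         # example threshold
a = np.concatenate(([w0], w))     # augmented weight vector a = [w_0, w]

x = np.array([3.0, 1.0])
y = augment(x)
assert np.isclose(w @ x + w0, a @ y)   # g(x) = a^T y, so only the single vector a must be learned
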
Linear Separability

• Two-category case

[Figure: two scatter plots of ω1 and ω2 samples; in the first the classes can be separated by a line (linearly separable), in the second they cannot (not linearly separable).]



Linear Separability

• Two-category case

Given a set of samples y1, y2, …, yn, some labeled ω1 and some labeled ω2, if there exists a vector a such that

    a^T y_i > 0   if y_i is labeled ω1
    a^T y_i < 0   if y_i is labeled ω2

then the samples are said to be linearly separable.

[Figure: a linearly separable two-class sample set.]


Linear Separability

• Two-category case: normalization

If we withdraw the labels of the samples and replace every sample labeled ω2 by its negative, the problem becomes equivalent to finding a vector a such that

    a^T y_i > 0   for all i.

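A sketch (assumed, not from the slides) of this normalization: negating the ω2 samples so that separability reduces to finding a with a^T y_i > 0 for every i; the toy data and the candidate vector a are made up.

import numpy as np

# Augmented samples y_i (first component 1) and their labels -- made-up toy data
Y = np.array([[1.0,  2.0,  1.0],     # labeled omega_1
              [1.0,  1.5,  0.5],     # labeled omega_1
              [1.0, -1.0, -2.0],     # labeled omega_2
              [1.0, -0.5, -1.5]])    # labeled omega_2
labels = np.array([1, 1, 2, 2])

# "Normalization": replace every sample labeled omega_2 by its negative
Y_norm = np.where((labels == 2)[:, None], -Y, Y)

def is_solution(a, Y_norm):
    """True if a^T y_i > 0 for every normalized sample, i.e. a separates the two classes."""
    return bool(np.all(Y_norm @ a > 0))

print(is_solution(np.array([0.0, 1.0, 1.0]), Y_norm))   # True for this toy data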


Solution Region in Feature Space

Separating plane:  a^T y = a_0 y_0 + a_1 y_1 + ⋯ + a_d y_d = 0

[Figure: the solution region in feature space for the normalized case.]

Solution Region in Weight Space

    a^T y_i > 0          a^T y_i ≥ b

Shrink the solution region by a margin b / ‖y_i‖, with b > 0.

[Figure: the solution region in weight space, without and with the margin b.]

Linear Discriminant Functions

How to learn the weights?


Criterion Function

• To facilitate learning, we usually define a scalar criterion function.
• It usually represents the penalty or cost of a solution.
• Our goal is to minimize its value.
  → Function optimization.

How do we minimize the criterion function?


Gradient Descent Algorithm

• Our goal is to go downhill.

[Figure: the criterion surface J(a) over (a1, a2) and its contour map.]


Gradient Descent Algorithm

• Taylor expansion

    f(x + Δx) = f(x) + ∇f(x)^T Δx + O(Δx^T Δx)

  f : R^d → R      a real-valued d-variate function
  x ∈ R^d          a point in the d-dimensional Euclidean space
  Δx ∈ R^d         a small shift in the d-dimensional Euclidean space
  ∇f(x)            the gradient of f(·) at x
  O(Δx^T Δx)       the big-O order of Δx^T Δx


Gradient Descent Algorithm

• Taylor expansion

    f(x + Δx) = f(x) + ∇f(x)^T Δx + O(Δx^T Δx)

What happens if we set Δx to be negatively proportional to the gradient at x, i.e.,

    Δx = −η ∇f(x)    (η being a small positive scalar)?

Then

    f(x + Δx) = f(x) − η ∇f(x)^T ∇f(x) + O(Δx^T Δx)

The subtracted term η ∇f(x)^T ∇f(x) is non-negative, and the big-O term can be ignored when η is small. Therefore, we have f(x + Δx) ≤ f(x).

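A tiny numerical check of this argument (illustrative only): for a simple smooth f and a small η, taking Δx = −η ∇f(x) does not increase f.

import numpy as np

def f(x):
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 3.0) ** 2   # a simple smooth test function

def grad_f(x):
    return np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 3.0)])

x = np.array([4.0, 2.0])
eta = 0.05                      # small positive learning rate
dx = -eta * grad_f(x)           # step taken opposite to the gradient
assert f(x + dx) <= f(x)        # f decreases: f(x + dx) = f(x) - eta * ||grad f(x)||^2 + O(dx^T dx)
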
Gradient Descent Algorithm

• Basic strategy
  To minimize some function f(·), the general gradient descent technique works in the following iterative way:

  1. Set a learning rate η > 0 and a small threshold θ > 0.
  2. Randomly initialize x_0 ∈ R^d as the starting point; set k = 0.
  3. do  k = k + 1
  4.     x_k = x_{k−1} − η ∇f(x_{k−1})
  5. until ‖η ∇f(x_k)‖ < θ
  6. Return x_k and f(x_k).

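A straightforward sketch of the procedure above in Python (the stopping test follows step 5 as reconstructed; the test function, parameter values, and the max_iter safeguard are illustrative assumptions).

import numpy as np

def gradient_descent(f, grad_f, d, eta=0.1, theta=1e-6, max_iter=10000, rng=None):
    """Minimize f by repeatedly stepping opposite to its gradient.

    Mirrors the slide: set eta > 0 and theta > 0, randomly initialize x_0 in R^d,
    then iterate x_k = x_{k-1} - eta * grad_f(x_{k-1}) until ||eta * grad_f(x_k)|| < theta.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal(d)                 # random starting point x_0
    for _ in range(max_iter):                  # max_iter is a safeguard, not on the slide
        x = x - eta * grad_f(x)
        if np.linalg.norm(eta * grad_f(x)) < theta:
            break
    return x, f(x)

# Example: minimize f(x) = ||x - [1, -3]||^2, whose minimum is at (1, -3)
f = lambda x: np.sum((x - np.array([1.0, -3.0])) ** 2)
grad_f = lambda x: 2.0 * (x - np.array([1.0, -3.0]))
x_min, f_min = gradient_descent(f, grad_f, d=2)
print(x_min, f_min)    # approximately [1, -3] and 0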


Gradient Descent Algorithm

Why the negative gradient direction?

    ∇_x = ( ∂/∂x_1, ∂/∂x_2, …, ∂/∂x_d )^T

    df = (∇_x f)^T dx

• steepest ascent if dx ∝ ∇_x f
• df = (∇_x f)^T dx = 0 if dx ⊥ ∇_x f
• steepest descent if dx ∝ −∇_x f
Gradient Descent Algorithm

How long a step shall we take?

[Figure: contour plot of f(x) over (x1, x2); at the current point the negative gradient −∇_x f indicates the direction to go.]


Gradient Descent Algorithm

If an improper learning rate η(k) is used, the convergence rate may be poor:

1. Too small: slow convergence.
2. Too large: overshooting.

[Figure: the basin of attraction around the minimum x*; iterates x0, x1, x2 and their values f(x0), f(x1) approaching the minimum.]

Furthermore, the best descent direction is not necessarily, and in fact is quite rarely, the direction of steepest descent.


Newton's Method

• Global minimum of a paraboloid

    Paraboloid:  f(x) = c + a^T x + (1/2) x^T Q x

  We can find the global minimum of a paraboloid by setting its gradient to zero:

    ∇_x f(x) = a + Q x = 0    ⟹    x* = −Q^{−1} a

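A short sketch (with made-up values of c, a, and Q) of finding this minimum by solving the linear system a + Q x = 0:

import numpy as np

c = 2.0
a = np.array([1.0, -2.0])
Q = np.array([[4.0, 1.0],      # symmetric positive definite, so the paraboloid has a unique minimum
              [1.0, 3.0]])

def f(x):
    return c + a @ x + 0.5 * x @ Q @ x

x_star = -np.linalg.solve(Q, a)           # x* = -Q^{-1} a, from setting the gradient to zero
print(x_star, f(x_star))
assert np.allclose(a + Q @ x_star, 0.0)   # the gradient vanishes at x*
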
Newton's Method

    f(x_k + Δx) ≈ f(x_k) + g_k^T Δx + (1/2) Δx^T Q_k Δx

Second-order Taylor series expansion: all smooth functions can be approximated by paraboloids in a sufficiently small neighborhood of any point.


Newton's Method

    f(x_k + Δx) ≈ f(x_k) + g_k^T Δx + (1/2) Δx^T Q_k Δx

    g_k = ∇_x f(x) |_{x = x_k}          (gradient of f at x_k)

    Q_k = ∂²f / ∂x ∂x^T |_{x = x_k}     (Hessian matrix of f at x_k)

          ⎡ ∂²f/∂x_1²       …   ∂²f/∂x_1∂x_d ⎤
        = ⎢      ⋮          ⋱        ⋮       ⎥   evaluated at x = x_k
          ⎣ ∂²f/∂x_d∂x_1    …   ∂²f/∂x_d²    ⎦


Newton's Method

    f(x_k + Δx) ≈ f(x_k) + g_k^T Δx + (1/2) Δx^T Q_k Δx

Set

    ∇_{Δx} f = g_k + Q_k Δx = 0

which gives the update

    x_{k+1} = x_k − Q_k^{−1} g_k

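A compact sketch of iterating this Newton update (the test function and its derivatives are made up for illustration; the Hessian here is nonsingular everywhere):

import numpy as np

def newton(grad_f, hess_f, x0, theta=1e-8, max_iter=100):
    """Iterate x_{k+1} = x_k - Q_k^{-1} g_k until the gradient is (nearly) zero."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)                       # g_k: gradient at x_k
        if np.linalg.norm(g) < theta:
            break
        Q = hess_f(x)                       # Q_k: Hessian at x_k (must be nonsingular)
        x = x - np.linalg.solve(Q, g)       # Newton step
    return x

# Example: f(x) = exp(x1) + exp(-x1) + (x2 + 2)^2, minimum at (0, -2)
grad_f = lambda x: np.array([np.exp(x[0]) - np.exp(-x[0]), 2.0 * (x[1] + 2.0)])
hess_f = lambda x: np.array([[np.exp(x[0]) + np.exp(-x[0]), 0.0],
                             [0.0, 2.0]])
print(newton(grad_f, hess_f, x0=[1.0, 3.0]))   # converges to approximately [0, -2]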


Comparison

[Figure: convergence paths of Newton's method and gradient descent on the same criterion surface.]


Comparison

• Newton's method will usually give a greater improvement per step than the simple gradient descent algorithm, even with the optimal value of η_k.
• However, Newton's method is not applicable if the Hessian matrix Q is singular.
• Even when Q is nonsingular, computing Q^{−1} is time consuming: O(d³).
• It often takes less time to set η_k to a constant (smaller than necessary) than to compute the optimal η_k at each step.


Summary

• Discriminant functions
• Linear Discriminant Functions and Decision Surface
  • The general setting: one function for each class
  • The two-category case
  • Minimization of a criterion/objective function
• Linear Separability


Summary

• Learning
  • Gradient descent

      x_k = x_{k−1} − η ∇f(x_{k−1})

  • Newton's method

      x_{k+1} = x_k − Q_k^{−1} g_k


MIMA Group

Any questions?
