
Hyperplane based Classifiers (1):

The Perceptron Algorithm

CS771: Introduction to Machine Learning


Piyush Rai
Hyperplane
 Separates a D-dimensional space into two half-spaces (positive and negative)
 Defined by a normal vector w (pointing towards the positive half-space)
 Equation of the hyperplane: wᵀx + b = 0
 A nonzero bias b shifts the hyperplane parallel to itself along the direction of w (or the opposite direction, depending on the sign of b); it sits at signed distance −b/‖w‖ from the origin along w
 Assumption: The hyperplane passes through the origin (i.e., b = 0). If not, add a bias term b
 Signed distance of a point x from the hyperplane: (wᵀx + b)/‖w‖; can be positive or negative, depending on which half-space x lies in
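
To make the geometry concrete, here is a minimal NumPy sketch (the weight vector, bias, and points are made up for illustration) that checks which half-space each point falls in and computes its signed distance from the hyperplane:

import numpy as np

w = np.array([2.0, -1.0])     # normal vector of the hyperplane
b = 0.5                       # bias term
X = np.array([[1.0, 1.0],     # a few example points
              [-1.0, 2.0],
              [0.0, 1.0]])

scores = X @ w + b                        # w^T x + b for each point
half_space = np.sign(scores)              # +1: positive half-space, -1: negative half-space
signed_dist = scores / np.linalg.norm(w)  # signed distance from the hyperplane
print(half_space, signed_dist)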
Hyperplane based (binary) classification
 Basic idea: Learn to separate two classes by a hyperplane
 Prediction rule: y∗ = sign(wᵀx∗ + b)  (see the code sketch at the end of this slide)
 For multi-class classification with hyperplanes, there will be multiple hyperplanes (e.g., one for each pair of classes); more on this later

 The hyperplane may be “implied” by the model, or learned directly


 Implied: Prototype-based classification, nearest neighbors, generative classification, etc.
 Directly learned: Logistic regression, Perceptron, Support Vector Machine (SVM), etc.
 The "direct" approach defines a model with params w (and optionally a bias param b)
 The parameters are learned by optimizing a classification loss function (will soon see examples)
 These are also discriminative approaches: the input x is not modeled but treated as fixed (given)
 The hyperplane need not be linear (e.g., can be made nonlinear using kernels; later)
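
A minimal sketch of the prediction rule above (the weights, bias, and test point are illustrative, not learned from any real data):

import numpy as np

def predict(w, b, x):
    # Hyperplane-based binary prediction: y = sign(w^T x + b)
    return 1 if (w @ x + b) > 0 else -1   # ties at exactly 0 broken towards -1 (a convention)

w = np.array([1.5, -2.0])    # weight vector (illustrative values)
b = 0.3                      # bias
x_new = np.array([0.5, 1.0])
print(predict(w, b, x_new))  # -> +1 or -1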
Interlude: Loss Functions for Classification
 In regression (assuming a linear model f(x) = wᵀx), some common loss functions are, e.g., the squared loss (y − wᵀx)² and the absolute loss |y − wᵀx|
 These measure the difference between the true output y and the model's prediction wᵀx
 What about loss functions for classification, where y ∈ {−1, +1}?
 Perhaps the most natural classification loss function would be a "0-1 Loss": Loss = 1 if y ≠ ŷ and Loss = 0 if y = ŷ
 Assuming the labels are +1/−1, it means
      ℓ(y, ŷ) = 1 if y·wᵀx < 0
                0 if y·wᵀx ≥ 0
   which is the same as 1[y·ŷ < 0] or 1[y·wᵀx < 0]
 The 0-1 loss is non-convex, non-differentiable, and NP-Hard to optimize (it also provides no useful gradient info for the most part)
 [Plot: the 0-1 loss as a function of y·wᵀx is a step function, equal to 1 for y·wᵀx < 0 and 0 for y·wᵀx ≥ 0]
Interlude: Loss Functions for Classification
 An ideal loss function for classification should be such that:
 Loss is small/zero if y and ŷ match; large positive y·wᵀx ⇒ small/zero loss
 Loss is large/non-zero if y and ŷ do not match; large negative y·wᵀx ⇒ large/non-zero loss
 "Perceptron" Loss: max(0, −y·wᵀx). Convex and non-differentiable
 Log(istic) Loss: log(1 + exp(−y·wᵀx)). Convex and differentiable. Already saw this in logistic regression (the likelihood resulted in this loss function)
 Hinge Loss: max(0, 1 − y·wᵀx). Convex and non-differentiable
 [Plots: each of the three losses shown as a function of y·wᵀx; a small numerical sketch follows below]
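A small numerical sketch (purely illustrative) that evaluates the 0-1 loss and the three surrogate losses above at a few values of the score s = y·wᵀx:

import numpy as np

s = np.linspace(-2.0, 2.0, 9)            # scores s = y * w^T x

zero_one   = (s < 0).astype(float)       # 0-1 loss: 1 if misclassified, else 0
perceptron = np.maximum(0.0, -s)         # Perceptron loss: max(0, -s)
logistic   = np.log1p(np.exp(-s))        # log(istic) loss: log(1 + exp(-s))
hinge      = np.maximum(0.0, 1.0 - s)    # hinge loss: max(0, 1 - s)

for name, vals in [("0-1", zero_one), ("Perceptron", perceptron),
                   ("logistic", logistic), ("hinge", hinge)]:
    print(f"{name:10s}", np.round(vals, 3))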
Learning by Optimizing Perceptron Loss
 Let's ignore the bias term b for now. So the hyperplane is simply wᵀx = 0
 The Perceptron loss function: L(w) = Σₙ max(0, −yₙ·wᵀxₙ). Let's do SGD (stochastic sub-gradient descent), with one randomly chosen example in each iteration
 Subgradient of the per-example loss ℓₙ(w) = max(0, −yₙ·wᵀxₙ) w.r.t. w: gₙ = −yₙxₙ for yₙ·wᵀxₙ < 0 (a mistake), and gₙ = 0 for yₙ·wᵀxₙ ≥ 0
 Non-zero (sub)gradients only when the model makes a mistake on the current example
 Thus SGD will update w only when there is a mistake (mistake-driven learning); see the sketch below
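
A minimal sketch of this mistake-driven SGD (toy data generated on the spot; learning rate fixed to 1):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                            # toy inputs
y = np.where(X @ np.array([1.0, -1.0]) > 0, 1.0, -1.0)   # toy +1/-1 labels (linearly separable)

w = np.zeros(2)                        # ignore the bias; hyperplane is w^T x = 0
for t in range(1000):
    n = rng.integers(len(X))           # one randomly chosen example per iteration
    if y[n] * (w @ X[n]) <= 0:         # mistake on the current example
        w = w + y[n] * X[n]            # w <- w - eta * g_n, with g_n = -y_n x_n and eta = 1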
The Perceptron Algorithm
 Stochastic sub-gradient descent on the Perceptron loss is also known as the Perceptron algorithm: start from w = 0, repeatedly pick a training example (xₙ, yₙ), and if the mistake condition yₙ·wᵀxₙ ≤ 0 holds, update w ← w + yₙxₙ
 Note: An example may get chosen several times during the entire run
 Updates are "corrective": if yₙ = +1 and wᵀxₙ < 0, then after the update wᵀxₙ will be less negative; likewise, if yₙ = −1 and wᵀxₙ > 0, then after the update wᵀxₙ will be less positive
 If the training data is linearly separable, the Perceptron algorithm will converge in a finite number of iterations (Block & Novikoff theorem)

 An example of an online learning algorithm (processes one training example at a time)
 Assuming w is initialized as w = 0, it is easy to see that the final w has the form w = Σₙ αₙ·yₙxₙ
 Here αₙ is the total number of mistakes made by the algorithm on example n (the meaning of αₙ may be different in other models)
 As we'll see, many other models also have weights of the form w = Σₙ αₙ·yₙxₙ (see the sketch below)
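
A small sketch of the full Perceptron run on toy data (cycling over the data rather than sampling at random, for simplicity), which also tracks the mistake counts αₙ and checks that the final w indeed equals Σₙ αₙ·yₙxₙ:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = np.where(X @ np.array([2.0, 1.0]) > 0, 1.0, -1.0)   # linearly separable toy labels

w = np.zeros(2)
alpha = np.zeros(len(X))                  # alpha[n] = number of mistakes made on example n
for epoch in range(100):
    mistakes = 0
    for n in range(len(X)):
        if y[n] * (w @ X[n]) <= 0:        # mistake condition
            w += y[n] * X[n]              # corrective update
            alpha[n] += 1
            mistakes += 1
    if mistakes == 0:                     # no mistakes in a full pass: converged
        break

print(np.allclose(w, (alpha * y) @ X))    # w == sum_n alpha_n * y_n * x_n -> True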
Perceptron and (lack of) Margins
 Perceptron would learn a hyperplane (one of many possible) that separates the two classes
Basically, it will learn the hyperplane which corresponds to the w that minimizes the Perceptron loss

 Doesn't guarantee any "margin" around the hyperplane
 The hyperplane can get arbitrarily close to some training example(s) on either side
 Kind of an "unsafe" situation to have: ideally we would like the hyperplane to be reasonably far from the closest training examples of either class
 This may not be good for generalization performance

 Can artificially introduce a margin by changing the mistake condition to yₙ·wᵀxₙ ≤ γ, where γ > 0 is some pre-specified margin (see the sketch below)
 Support Vector Machine (SVM) does this directly, by learning the maximum-margin hyperplane
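
A tiny variant of the previous sketch with the margin-based mistake condition (the value of γ here is made up; in practice it is pre-specified by the user):

import numpy as np

def perceptron_with_margin(X, y, gamma=0.5, epochs=100):
    # Update whenever y_n * w^T x_n <= gamma (gamma > 0), not just on outright mistakes
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_n, y_n in zip(X, y):
            if y_n * (w @ x_n) <= gamma:   # margin-based mistake condition
                w += y_n * x_n             # same corrective update as before
    return w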
Coming up next
 Support Vector Machines
