Lec5 Class Margin
Yufei Tao
Department of Computer Science and Engineering
Chinese University of Hong Kong
Definition 2.
Let P be a linearly separable dataset in R^d. The goal of the large margin separation problem is to find a separation plane with the maximum margin.
(Figure: a separation plane and its margin.)
Next, we will discuss two methods to approach the problem. The first one gives the optimal solution, but is quite complicated and (often) computationally expensive. The second method, on the other hand, is much simpler and (often) much faster, but gives an approximate solution that is nearly optimal.
(Figures: two copies of the separation plane are translated in opposite directions until each of them touches a point of P.)
Now, focus on the two copies of the plane in their final positions. If one copy has equation c1 x1 + c2 x2 + ... + cd xd = c_{d+1}, then the other copy must have equation c1 x1 + c2 x2 + ... + cd xd = −c_{d+1}. Here c_{d+1} is a strictly positive value. Let p(x1, ..., xd) be a point in P. We must have (think: why?):

if p is red, then c1 x1 + c2 x2 + ... + cd xd ≥ c_{d+1};
if p is blue, then c1 x1 + c2 x2 + ... + cd xd ≤ −c_{d+1}.

Dividing both sides by c_{d+1}, the two copies become the planes

ℓ1: w1 x1 + w2 x2 + ... + wd xd = 1
ℓ2: w1 x1 + w2 x2 + ... + wd xd = −1

where wi = ci / c_{d+1} for each i ∈ [1, d].

The margin of the original separation plane is exactly half of the distance between ℓ1 and ℓ2:

(Figure: ℓ1, ℓ2, and the margin between them.)
Lemma 3.
Define w = [w1, w2, ..., wd]. The distance between ℓ1 and ℓ2 is

2 / √(w1² + w2² + ... + wd²) = 2 / |w|.
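As a quick numeric illustration of Lemma 3 (a sketch, not part of the slides; the vector w below is an arbitrary choice), the distance between the planes w·x = 1 and w·x = −1 can be measured by projecting onto the unit normal:

```python
import numpy as np

# Numeric check of Lemma 3: the planes l1: w.x = 1 and l2: w.x = -1
# lie at distance 2/|w| from each other.
w = np.array([3.0, 4.0])            # arbitrary normal vector, |w| = 5
p1 = w / np.dot(w, w)               # a point on l1: w.p1 = 1
p2 = -w / np.dot(w, w)              # a point on l2: w.p2 = -1
# distance = projection of (p1 - p2) onto the unit normal w/|w|
dist = np.dot(w / np.linalg.norm(w), p1 - p2)
print(dist, 2 / np.linalg.norm(w))  # both equal 0.4
```

The two printed values coincide, matching the lemma.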
Proof of Lemma 3
Take an arbitrary point p1 on ℓ1, and an arbitrary point p2 on ℓ2. Hence, w·p1 = 1 and w·p2 = −1. It follows that w·(p1 − p2) = 2.

(Figure: ℓ1, ℓ2, the normal vector w, and the points p1, p2.)
The distance between ℓ1 and ℓ2 is the length of the projection of p1 − p2 onto the direction of w, namely:

(w / |w|) · (p1 − p2) = 2 / |w|.

This completes the proof of Lemma 3.

By Lemma 3, maximizing the margin amounts to minimizing |w|, i.e., finding w1, ..., wd that minimize w1² + w2² + ... + wd² subject to the following constraints:

for every red point p ∈ P: w·p ≥ 1;
for every blue point p ∈ P: w·p ≤ −1.
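The large-margin problem can thus be written as minimizing |w|² subject to w·p ≥ 1 for red points and w·p ≤ −1 for blue points. As an illustration (not from the slides), the sketch below feeds this program to a generic solver; the toy dataset and the choice of scipy.optimize.minimize are assumptions:

```python
import numpy as np
from scipy.optimize import minimize

# Toy 2-d dataset (illustrative): red points carry label +1, blue points -1.
red  = np.array([[2.0, 2.0], [3.0, 1.0]])
blue = np.array([[-2.0, -2.0], [-1.0, -3.0]])
points = np.vstack([red, blue])
labels = np.array([1.0, 1.0, -1.0, -1.0])

# Minimize |w|^2 subject to w.p >= 1 (red) and w.p <= -1 (blue);
# both constraints can be encoded uniformly as label * (w.p) - 1 >= 0.
cons = [{"type": "ineq", "fun": lambda w, p=p, y=y: y * np.dot(w, p) - 1.0}
        for p, y in zip(points, labels)]

res = minimize(lambda w: np.dot(w, w), x0=np.array([1.0, 1.0]), constraints=cons)
w = res.x
margin = 1.0 / np.linalg.norm(w)   # by Lemma 3, the margin is half of 2/|w|
print(w, margin)
```

For this symmetric dataset the solver lands on w ≈ [0.25, 0.25], giving a margin of about 2.83.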
Let us first assume that we know a value γ satisfying γ ≤ γ_opt, where γ_opt is the optimal (i.e., maximum possible) margin (we will clarify how to find γ later). Recall that a separation plane has the equation c1 x1 + c2 x2 + ... + cd xd = 0. Define vector c = [c1, c2, ..., cd], and refer to the plane as the plane determined by c. The goal is to find a good c.

Our weapon is once again Perceptron. The difference from before is that we will now correct our c not only when a point falls on the wrong side of the plane determined by c, but also when the point is too close to the plane. Specifically, we say that a point p causes a violation in any of the following situations:

its distance to the plane determined by c is less than or equal to γ/2, regardless of the color;
p is red but c·p < 0;
p is blue but c·p > 0.
Margin Perceptron
The algorithm starts with c = [0, 0, ..., 0], and then runs in iterations. In each iteration, it simply checks whether any point p ∈ P causes a violation. If so, the algorithm adjusts c as follows:

If p is red, then c ← c + p.
If p is blue, then c ← c − p.

As soon as c has been adjusted, the current iteration finishes, and a new iteration starts.

The algorithm finishes if no point causes any violation in the current iteration.
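The description above translates directly into code; the sketch below is an illustrative rendering (the toy dataset, the parameter value, and the safety cap on iterations are assumptions, not part of the slides):

```python
import numpy as np

def margin_perceptron(points, labels, gamma, max_iters=100000):
    """points: (n, d) array; labels: +1 for red, -1 for blue; gamma: the
    parameter of the algorithm (assumed at most the optimal margin)."""
    c = np.zeros(points.shape[1])
    for _ in range(max_iters):
        for p, y in zip(points, labels):
            norm = np.linalg.norm(c)
            # violation: too close to the plane, or on the wrong side of it
            dist = abs(np.dot(c, p)) / norm if norm > 0 else 0.0
            if dist <= gamma / 2 or y * np.dot(c, p) < 0:
                c = c + y * p      # c <- c + p for red, c <- c - p for blue
                break              # the iteration ends after one adjustment
        else:
            return c               # no violation in this iteration: finished
    raise RuntimeError("no convergence; is gamma <= the optimal margin?")

# Toy usage (illustrative data):
red = np.array([[2.0, 2.0], [3.0, 1.0]])
points = np.vstack([red, -red])
labels = np.array([1.0, 1.0, -1.0, -1.0])
c = margin_perceptron(points, labels, gamma=0.5)
print(c)   # every point now lies on the correct side, farther than gamma/2
```

On termination, every point sits on its correct side at distance greater than γ/2 from the plane determined by c.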
Define:

R = max_{p∈P} |p|.

Theorem 4.
Suppose that Margin Perceptron is run with a parameter γ ≤ γ_opt. Then the algorithm terminates within 1 + 8R²/γ_opt² iterations. (The proof can be found in the appendix.)
An Incremental Algorithm

1. run Perceptron to obtain a separation plane π
2. let γ_0 be the margin of π
3. γ_1 ← 4γ_0; i ← 1
4. run Margin Perceptron with parameter γ_i, terminating it manually if it does not finish within 1 + 8R²/γ_i² iterations
5. if manual termination occurred at Line 4, return the plane obtained from the previous call (the plane π if i = 1)
6. otherwise, let γ_{i+1} be 4 times the margin of the plane just obtained
7. i ← i + 1; Go to Line 4.
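As an illustration (not from the slides), such an incremental scheme can be sketched as follows; the toy dataset, the way the starting value γ_0 is bootstrapped, and the exact iteration cap are assumptions of this sketch:

```python
import numpy as np

def margin_perceptron(points, labels, gamma, max_iters):
    """One call to Margin Perceptron; returns c on natural termination,
    or None when the iteration cap forces manual termination."""
    c = np.zeros(points.shape[1])
    for _ in range(max_iters):
        for p, y in zip(points, labels):
            norm = np.linalg.norm(c)
            dist = abs(np.dot(c, p)) / norm if norm > 0 else 0.0
            if dist <= gamma / 2 or y * np.dot(c, p) < 0:
                c = c + y * p
                break
        else:
            return c
    return None                     # manual termination

def smallest_margin(c, points, labels):
    return min(y * np.dot(c, p) for p, y in zip(points, labels)) / np.linalg.norm(c)

def incremental(points, labels):
    R = max(np.linalg.norm(p) for p in points)
    # Bootstrap a first plane and gamma_1 (the tiny bootstrap parameter
    # below is an assumption of this sketch).
    plane = margin_perceptron(points, labels, 1e-6, 10**6)
    gamma = 4 * smallest_margin(plane, points, labels)
    while True:
        cap = int(1 + 8 * R**2 / gamma**2) + 1
        result = margin_perceptron(points, labels, gamma, cap)
        if result is None:          # manual termination: gamma too ambitious
            return plane            # the plane from the previous call
        plane = result
        gamma = 4 * smallest_margin(plane, points, labels)

# Toy usage (illustrative data):
red = np.array([[2.0, 2.0], [3.0, 1.0]])
points = np.vstack([red, -red])
labels = np.array([1.0, 1.0, -1.0, -1.0])
best = incremental(points, labels)
print(best, smallest_margin(best, points, labels))
```

On this toy input the returned plane attains a margin well above a quarter of the optimum, matching the guarantee of Theorem 7 below.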
Lemma 5.
Consider the i-th call to Margin Perceptron. If manual termination occurs at Line 4, then γ_i > γ_opt. Otherwise, γ_{i+1} ≥ 2γ_i.

Proof.
First consider the case of manual termination. If γ_i ≤ γ_opt, then by Theorem 4, we know that Margin Perceptron should have terminated in at most 1 + 8R²/γ_opt² ≤ 1 + 8R²/γ_i² iterations. Contradiction.

Now consider the other case. Since the i-th call terminated with no violation, every point is at distance greater than γ_i/2 from the plane obtained, i.e., the plane's margin exceeds γ_i/2. Hence γ_{i+1}, which is 4 times that margin, is greater than 2γ_i.
Corollary 6.
The algorithm terminates after at most 1 + log2(γ_opt/γ_0) calls to Margin Perceptron.

Proof.
From the previous lemma, we know that γ_i ≥ 2^i · γ_0. Suppose that the algorithm makes k calls to Margin Perceptron. Then, γ_opt ≥ γ_{k−1} ≥ 2^{k−1} · γ_0. Solving k from the inequality gives the corollary.
Theorem 7.
Our incremental algorithm returns a separation plane with margin at least γ_opt/4.

Proof.
Suppose that the algorithm terminates after k calls to Margin Perceptron. Since the k-th call was terminated manually, Lemma 5 gives γ_k > γ_opt. The plane returned is the one obtained from the (k−1)-th call, whose margin equals γ_k/4, which is at least γ_opt/4.
Appendix
Proof of Theorem 4.
Let u·x = 0 be the optimal separation plane with margin γ_opt. Without loss of generality, suppose that |u| = 1. Hence:

γ_opt = min_{p∈P} |p·u|.

Recall that the perceptron algorithm adjusts c in each iteration. Let c_i (i ≥ 1) be the c after the i-th iteration. Also, let c_0 = [0, ..., 0] be the initial c before the first iteration. Also, let k be the total number of adjustments.
Proof (cont.).
We claim that, for any i ≥ 0, c_{i+1}·u ≥ c_i·u + γ_opt. Due to symmetry, we prove this only for the case where c_{i+1} was adjusted from c_i because of the violation of a red point p.

In this case, c_{i+1} = c_i + p; and hence, c_{i+1}·u = c_i·u + p·u. From the definition of γ_opt, we know that p·u ≥ γ_opt. Therefore, c_{i+1}·u ≥ c_i·u + γ_opt.

It follows that

|c_k| ≥ c_k·u ≥ k · γ_opt.    (1)
Proof (cont.).
We also claim that, for any i ≥ 0, |c_{i+1}| ≤ |c_i| + R²/(2|c_i|) + γ_opt/2. Due to symmetry, we will prove this only for the case where c_{i+1} was adjusted from c_i due to the violation of a red point p.

(Figure: at the origin O, the point p is decomposed into p1, perpendicular to c_i, and p2, parallel to c_i.)

As shown above, p = p1 + p2, where p1 is perpendicular to c_i, and p2 is parallel to c_i (and hence, perpendicular to the plane determined by c_i). Therefore, c_{i+1} = c_i + p = c_i + p1 + p2. The claim is true due to:

By the definition of violation, p2 either points to the opposite direction of c_i or has a norm of |p2| ≤ γ/2 ≤ γ_opt/2.

Notice that |p1| ≤ |p| ≤ R. Hence, |c_i + p1|² = |c_i|² + |p1|² ≤ |c_i|² + R² ≤ (|c_i| + R²/(2|c_i|))². It thus follows that |c_i + p1| ≤ |c_i| + R²/(2|c_i|).
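The claim can be spot-checked numerically (an illustration, not part of the slides; the dimension, the values of R and γ, and the sampling scheme are assumptions, with γ standing in for γ_opt):

```python
import numpy as np

rng = np.random.default_rng(0)

# Spot-check of the claim: whenever a red point p causes a violation
# against c, the adjusted vector satisfies
#   |c + p| <= |c| + R^2/(2|c|) + gamma/2.
gamma = 1.0
R = 5.0
checked = 0
for _ in range(1000):
    c = rng.normal(size=3) * rng.uniform(0.5, 10.0)
    p = rng.normal(size=3)
    p = p / np.linalg.norm(p) * rng.uniform(0.0, R)   # ensure |p| <= R
    dist = abs(np.dot(c, p)) / np.linalg.norm(c)
    if dist <= gamma / 2 or np.dot(c, p) < 0:         # p violates as a red point
        lhs = np.linalg.norm(c + p)
        rhs = np.linalg.norm(c) + R**2 / (2 * np.linalg.norm(c)) + gamma / 2
        assert lhs <= rhs + 1e-9, (lhs, rhs)
        checked += 1
print("checked", checked, "violating adjustments")
```

Pairs (c, p) that cause no violation are skipped, since the claim only concerns adjustments.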
Proof (cont.).
The claim on the previous slide implies that, once |c_i| ≥ 2R²/γ_opt, an adjustment increases the norm by at most R²/(2|c_i|) + γ_opt/2 ≤ γ_opt/4 + γ_opt/2 = (3/4)·γ_opt. Therefore:

|c_k| ≤ 2R²/γ_opt + (3k/4)·γ_opt.

Combining this with (1) gives:

k·γ_opt ≤ 2R²/γ_opt + (3k/4)·γ_opt,

which solves to k ≤ 8R²/γ_opt². This yields the iteration bound in Theorem 4.