
Learning Linear Classifiers
Authentication by Secret Questions

SERVER: “Give me your device ID and answer the following questions.”
DEVICE: “Device ID: TS271828182845”

SERVER (challenges)    DEVICE (responses)
1. 10111100            1. 1
2. 00110010            2. 0
3. 10001110            3. 1
4. 00010100            4. 0
5. …                   5. …
Physically Unclonable Functions

Two signal paths on the chip take slightly different times (e.g., 0.50 ms
vs 0.55 ms). These tiny differences are difficult to predict or clone.
Then these could act as the fingerprints for the devices!
Arbiter PUFs

If the top signal reaches the finish line first, the “answer” to this
question is 0; else, if the bottom signal reaches first, the “answer” is 1.

Question: 1011 (switch settings 1 0 1 1) → Answer: 1?
Question: 0110 (switch settings 0 1 1 0) → Answer: 0?
Linear Models

We have a (signed) time difference between the two signals of the form
$\Delta = \mathbf{w}^\top\mathbf{x} + b$, where $\mathbf{x}$ is a feature vector derived from the
challenge bits and $(\mathbf{w}, b)$ encode the delay characteristics of the device.

If $\mathbf{w}^\top\mathbf{x} + b < 0$, the upper signal wins and the answer is 0.
If $\mathbf{w}^\top\mathbf{x} + b > 0$, the lower signal wins and the answer is 1.
Thus, the answer is simply $\frac{1 + \operatorname{sign}(\mathbf{w}^\top\mathbf{x} + b)}{2}$.
This is nothing but a linear classifier!
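The reduction above can be sketched in a few lines of NumPy. Everything here is illustrative: the weight vector, bias, and the bits-to-features map are made-up stand-ins for a real device's delays, not an actual PUF model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical device "fingerprint": a weight vector and bias standing in
# for the chip's tiny manufacturing-induced delay differences.
w = rng.standard_normal(8)
b = rng.standard_normal()

def puf_response(challenge_bits, w, b):
    """Return the arbiter PUF's answer bit for a challenge.

    The challenge bits are mapped to a +/-1 feature vector x (a stand-in
    for the real feature map); the response is just the sign of the
    linear score w^T x + b, i.e. a linear classifier's prediction.
    """
    x = 1.0 - 2.0 * np.asarray(challenge_bits, dtype=float)
    score = float(w @ x + b)
    return 1 if score > 0 else 0

challenge = [1, 0, 1, 1, 1, 1, 0, 0]
bit = puf_response(challenge, w, b)
```

An attacker who can observe enough (challenge, response) pairs can therefore learn $(\mathbf{w}, b)$ with any linear classification method and clone the device in software.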
XOR PUF

In an XOR PUF, several arbiter PUFs receive the same challenge and their
answer bits are XOR-ed together to produce the final response.
XOR: given a bunch of 0/1 bits, the output is 1 if an odd number of the
bits are 1; else, if an even number of the bits (including no bits) are 1,
the output is 0. XOR is basically addition modulo 2.
Cracking the XOR PUF

It turns out that the XOR PUF can also be cracked using a linear model,
although one of a larger dimensionality.
Key insight: if we have a bunch of $\pm 1$ values, their product is $+1$ if
and only if an even number of them are $-1$; else the product is $-1$.
We can crack the individual PUFs using linear models, i.e., for the $i$-th
PUF the response bit is $\frac{1 + \operatorname{sign}(\mathbf{w}_i^\top\mathbf{x})}{2}$.
Remember: a sign value of $+1$ corresponds to bit 1 and $-1$ corresponds to bit 0.
Note: $\prod_i \operatorname{sign}(\mathbf{w}_i^\top\mathbf{x})$ is $+1$ if an even number of the sign values are $-1$.
However, XOR is concerned with the parity of $+1$ bits.
Solution: flip the signs!
Cracking the XOR PUF

The product $\prod_{i=1}^{k}\left(-\operatorname{sign}(\mathbf{w}_i^\top\mathbf{x})\right) = (-1)^k \prod_{i=1}^{k}\operatorname{sign}(\mathbf{w}_i^\top\mathbf{x})$
is $+1$ if an even number of the sign values are $+1$, else it is $-1$.
The extra $(-1)^k$ is there since flipping each sign contributes one factor
of $-1$; here, $k$ is the number of PUFs.
Thus, the output of the XOR PUF is 0 if an even number of the sign values
are $+1$, else the output is 1. This is exactly what we wanted!
All we need to do is find a way to compute $\prod_{i=1}^{k}\operatorname{sign}(\mathbf{w}_i^\top\mathbf{x})$.
Although it does not seem so right away, there is a linear model hidden here.
Observation: $\prod_{i}\operatorname{sign}(\mathbf{w}_i^\top\mathbf{x}) = \operatorname{sign}\left(\prod_{i}\mathbf{w}_i^\top\mathbf{x}\right)$.
Find a way to simplify $\prod_{i}\mathbf{w}_i^\top\mathbf{x}$.
Cracking the XOR PUF

Let’s take a toy example in 2 dims with $k = 2$, where $\mathbf{w}_1 = (a, b)$,
$\mathbf{w}_2 = (c, d)$ and $\mathbf{x} = (x_1, x_2)$. Then
$(\mathbf{w}_1^\top\mathbf{x})(\mathbf{w}_2^\top\mathbf{x}) = ac\,x_1^2 + (ad + bc)\,x_1 x_2 + bd\,x_2^2 = \mathbf{W}^\top\boldsymbol{\phi}(\mathbf{x})$,
where $\mathbf{W} = (ac, ad + bc, bd)$ and $\boldsymbol{\phi}(\mathbf{x}) = (x_1^2, x_1 x_2, x_2^2)$.
Thus, we can just learn a linear model in 3D instead of 2D.
Exercise: extend this intuition to more than 2 classifiers and higher dims.
Try to do optimizations to reduce the dimensionality of $\boldsymbol{\phi}(\mathbf{x})$.
Note: we are not assured that the linear model we learn will be of this
form, i.e., that for some $\mathbf{w}_1, \mathbf{w}_2$ we get $\mathbf{W} = (ac, ad + bc, bd)$.
However, we are assured that a linear model with 0 error does exist.
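The toy identity above can be checked numerically. The coefficients and the test point below are arbitrary illustrative numbers; the point is only that the product of two linear scores equals a single linear score over the monomial features.

```python
import numpy as np

a, b = 1.5, -0.7        # w1 = (a, b) -- illustrative values
c, d = 0.3,  2.0        # w2 = (c, d)
x1, x2 = 0.9, -1.2      # an arbitrary test point x = (x1, x2)

# Product of the two linear scores (whose sign the XOR PUF attack needs)
prod = (a * x1 + b * x2) * (c * x1 + d * x2)

# The same quantity as a *linear* model W^T phi(x) in 3D monomial features
phi = np.array([x1 * x1, x1 * x2, x2 * x2])
W   = np.array([a * c, a * d + b * c, b * d])
lin = W @ phi

same = np.isclose(prod, lin)
```

So a classifier that is nonlinear in $\mathbf{x}$ becomes linear in $\boldsymbol{\phi}(\mathbf{x})$, which is why an ordinary linear learner in the lifted space can crack the XOR PUF.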
The “best” Linear Classifier

It seems infinitely many classifiers perfectly classify the data.
Which one should I choose?
Indeed! Some of these models would be very brittle and might misclassify
test data (i.e., predict the wrong class), even test data which look very
similar to train data.
It is better to not select a model whose decision boundary passes very
close to a training data point.
Large Margin Classifiers

Fact: the distance of the origin from the hyperplane $\{\mathbf{x} : \mathbf{w}^\top\mathbf{x} + b = 0\}$ is $\frac{|b|}{\lVert\mathbf{w}\rVert_2}$.
Fact: the distance of a point $\mathbf{x}$ from this hyperplane is $\frac{|\mathbf{w}^\top\mathbf{x} + b|}{\lVert\mathbf{w}\rVert_2}$.
Given train data $\{(\mathbf{x}^i, y^i)\}$ for a binary classfn problem, where
$\mathbf{x}^i \in \mathbb{R}^d$ and $y^i \in \{-1, +1\}$, we want two things from a classifier.
Demand 1: classify every point correctly – how to ask this politely?
One way: demand that for all $i$, $y^i(\mathbf{w}^\top\mathbf{x}^i + b) > 0$.
Easier way: demand that for all $i$, $y^i(\mathbf{w}^\top\mathbf{x}^i + b) \geq 1$.
Demand 2: not let any data point come close to the boundary.
Demand that $\min_i \frac{y^i(\mathbf{w}^\top\mathbf{x}^i + b)}{\lVert\mathbf{w}\rVert_2}$ (the margin) be as large as possible.
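The two distance facts are easy to verify directly. The hyperplane parameters and the test point below are made-up numbers chosen so the answers come out round.

```python
import numpy as np

w = np.array([3.0, 4.0])   # illustrative hyperplane  w^T x + b = 0
b = -5.0                   # note ||w|| = 5

def dist_to_hyperplane(x, w, b):
    """Distance of point x from the hyperplane {z : w^T z + b = 0},
    i.e. |w^T x + b| / ||w||_2."""
    return abs(w @ x + b) / np.linalg.norm(w)

origin_dist = dist_to_hyperplane(np.zeros(2), w, b)      # |b| / ||w|| = 1.0
point_dist  = dist_to_hyperplane(np.array([1.0, 2.0]), w, b)  # |3+8-5|/5 = 1.2
```

Plugging in the origin recovers the first fact as a special case of the second.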
Support Vector Machines

“Support vector machine” is just a fancy way of saying: please find me a
linear classifier that perfectly classifies the train data while keeping
data points as far away from the hyperplane as possible.
The mathematical way of writing this request is the following:

$\max_{\mathbf{w}, b} \min_i \frac{y^i(\mathbf{w}^\top\mathbf{x}^i + b)}{\lVert\mathbf{w}\rVert_2}$ (objective)
such that $y^i(\mathbf{w}^\top\mathbf{x}^i + b) > 0$ for all $i$ (constraints)

This is known as an optimization problem, with an objective and lots of
constraints. This looks so complicated – how will I ever find a solution
to this optimization problem? Let us simplify it.
Constrained Optimization 101

Constraints are usually specified using math equations. The set of points
that satisfy all the constraints is called the feasible set of the
optimization problem.

HOW WE MUST SPEAK TO A MACHINE:
$\min_x f(x)$ (objective)
s.t. $g(x) \leq 0$, $h(x) = 0$, etc. (constraints)

HOW WE SPEAK TO A HUMAN: “I want to find an unknown $x$ that gives me the
best value according to this function – and btw, not any $x$ would do! It
must satisfy these conditions. All I am saying is: of the values of $x$
that satisfy my conditions, find me the one that gives the best value
according to $f$.”

Example: $\min_x f(x)$ s.t. $x \geq 3$ and $x \leq 6$ – here the feasible
set is the interval $[3, 6]$.
If no point satisfies all the constraints, the feasible set is empty and
the optimization problem has no solution. Otherwise, the optimal (least)
value of $f$ is achieved at some point of the feasible set.
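For a 1-D convex quadratic over an interval, the constrained minimizer has a closed form: clip the unconstrained minimizer onto the feasible set. The objective below ($x^2$, i.e., a quadratic centered at 0) is an assumed example, chosen to match the feasible interval $[3, 6]$ from the slide.

```python
def constrained_min_quadratic(center, lo, hi):
    """Minimize (x - center)^2 subject to lo <= x <= hi.

    For a 1-D convex quadratic, the constrained optimum is simply the
    unconstrained minimizer (x = center) clipped onto [lo, hi]."""
    return min(max(center, lo), hi)

# Feasible set is the interval [3, 6]; with objective x^2 (center 0),
# the unconstrained minimizer 0 is infeasible, so the constrained
# optimum sits on the boundary at x = 3.
x_star = constrained_min_quadratic(0.0, 3.0, 6.0)
```

When the unconstrained minimizer already lies inside the feasible set, the constraint is inactive and clipping changes nothing.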
Back to SVMs

Assume there do exist params that perfectly classify all train data.
Consider one such params $(\mathbf{w}, b)$ which classifies train data perfectly.
Now, as $y^i(\mathbf{w}^\top\mathbf{x}^i + b) > 0$ for all $i$, we have
$y^i(\mathbf{w}^\top\mathbf{x}^i + b) = |\mathbf{w}^\top\mathbf{x}^i + b|$.
Thus, the geometric margin $\min_i \frac{|\mathbf{w}^\top\mathbf{x}^i + b|}{\lVert\mathbf{w}\rVert_2}$ is the same as
$\min_i \frac{y^i(\mathbf{w}^\top\mathbf{x}^i + b)}{\lVert\mathbf{w}\rVert_2}$, since the model has perfect classification!
We will use this useful fact to greatly simplify the optimization problem.
We will remove this assumption later. (What if the train data is
non-linearly separable, i.e., no linear classifier can perfectly classify
it? For example, two overlapping point clouds, as in the figure on the slide.)
Support Vector Machines

Let $\mathbf{x}^m$ be the data point that comes closest to the hyperplane, i.e.,
$m = \arg\min_i\, y^i(\mathbf{w}^\top\mathbf{x}^i + b)$.
Recall that all this discussion holds only for a perfect classifier.
Let $c = y^m(\mathbf{w}^\top\mathbf{x}^m + b)$ and consider $\tilde{\mathbf{w}} = \mathbf{w}/c$, $\tilde{b} = b/c$.
Note this gives us $y^i(\tilde{\mathbf{w}}^\top\mathbf{x}^i + \tilde{b}) \geq 1$ for all $i$, as well as
$y^m(\tilde{\mathbf{w}}^\top\mathbf{x}^m + \tilde{b}) = 1$ (as $c > 0$).
Thus, instead of searching for $(\mathbf{w}, b)$, it is easier to search for $(\tilde{\mathbf{w}}, \tilde{b})$:

$\min_{\tilde{\mathbf{w}}, \tilde{b}} \lVert\tilde{\mathbf{w}}\rVert_2^2$
such that $y^i(\tilde{\mathbf{w}}^\top\mathbf{x}^i + \tilde{b}) \geq 1$ for all $i$
The C-SVM Technique

For linearly separable cases, where we suspect a perfect classifier exists:
$\min_{\mathbf{w}, b} \lVert\mathbf{w}\rVert_2^2$ s.t. $y^i(\mathbf{w}^\top\mathbf{x}^i + b) \geq 1$ for all $i$.
If a linear classifier cannot perfectly classify the data, then find the model using
$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \lVert\mathbf{w}\rVert_2^2 + C \sum_i \xi_i$
s.t. $y^i(\mathbf{w}^\top\mathbf{x}^i + b) \geq 1 - \xi_i$ for all $i$, as well as $\xi_i \geq 0$ for all $i$.
The terms $\xi_i$ are called slack variables (recall the English phrase
“cut me some slack”). They allow some data points to come close to the
hyperplane or be misclassified altogether.
What prevents me from misusing the slack variables to learn a model that
misclassifies every data point? The term $C \sum_i \xi_i$ prevents you from
doing so: if we set $C$ to be a large value (it is a hyper-parameter), then
it will penalize solutions that misuse slack too much. Having the
constraint $\xi_i \geq 0$ prevents us from misusing slack to artificially
inflate the margin.
From C-SVM to Loss Functions

We can further simplify the previous optimization problem.
Note $\xi_i \geq 0$ basically allows us to have $y^i(\mathbf{w}^\top\mathbf{x}^i + b) < 1$ (even $< 0$).
Thus, the amount of slack we want is just $\xi_i = 1 - y^i(\mathbf{w}^\top\mathbf{x}^i + b)$.
However, recall that we must also satisfy $\xi_i \geq 0$.
Another way of saying this: if you already have $y^i(\mathbf{w}^\top\mathbf{x}^i + b) \geq 1$,
then you don’t need any slack, i.e., you should have $\xi_i = 0$ in this case.
Using the notation $[x]_+ = \max\{x, 0\}$, we need only set
$\xi_i = [1 - y^i(\mathbf{w}^\top\mathbf{x}^i + b)]_+$.
The above is nothing but the popular hinge loss function!
Hinge Loss

Captures how well a classifier classified a data point.
Suppose on a data point $(\mathbf{x}, y)$, a model gives a prediction score of
$\hat{y}$ (for a linear model, we have $\hat{y} = \mathbf{w}^\top\mathbf{x} + b$).
We obviously want $y \cdot \hat{y} > 0$ for correct classification, but we
also want $y \cdot \hat{y} \geq 1$ for a large margin – the hinge loss
function $\ell_{\text{hinge}}(y, \hat{y}) = [1 - y \cdot \hat{y}]_+$ captures both.
Note that hinge loss not only penalizes misclassification but also correct
classification if the data point gets too close to the hyperplane!
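The three regimes of the hinge loss (safely classified, correct but inside the margin, misclassified) can be seen with a one-line implementation; the scores below are arbitrary illustrative values.

```python
def hinge(y, score):
    """Hinge loss [1 - y * score]_+ for a label y in {-1, +1} and a
    real-valued prediction score."""
    return max(0.0, 1.0 - y * score)

well_classified = hinge(+1, 2.5)   # margin >= 1: zero loss
inside_margin   = hinge(+1, 0.5)   # correct but too close: some loss
misclassified   = hinge(+1, -1.0)  # wrong side of the hyperplane: large loss
```

Note the middle case: the prediction is correct, yet the loss is nonzero because the point sits inside the margin.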
Final Form of C-SVM

Recall that the C-SVM optimization finds a model by solving
$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \lVert\mathbf{w}\rVert_2^2 + C \sum_i \xi_i$
s.t. $y^i(\mathbf{w}^\top\mathbf{x}^i + b) \geq 1 - \xi_i$ for all $i$,
as well as $\xi_i \geq 0$ for all $i$.
Using the previous discussion, we can rewrite the above very simply as
$\min_{\mathbf{w}, b} \lVert\mathbf{w}\rVert_2^2 + C \sum_i [1 - y^i(\mathbf{w}^\top\mathbf{x}^i + b)]_+$.
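The rewritten, constraint-free objective is straightforward to evaluate. The tiny dataset and parameters below are made up so the value can be checked by hand.

```python
import numpy as np

def csvm_objective(w, b, X, y, C):
    """||w||^2 + C * sum_i [1 - y_i (w^T x_i + b)]_+ : the unconstrained
    hinge-loss form of the C-SVM problem."""
    margins = y * (X @ w + b)                 # y_i (w^T x_i + b) for all i
    slack = np.maximum(0.0, 1.0 - margins)    # optimal slack per point
    return float(w @ w + C * slack.sum())

# Toy data: two well-separated points and one point inside the margin.
X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]])
y = np.array([+1.0, -1.0, +1.0])
obj = csvm_objective(np.array([1.0, 0.0]), 0.0, X, y, C=1.0)
# margins are (2, 2, 0.5) -> slack (0, 0, 0.5) -> objective 1 + 0.5 = 1.5
```

Because the optimal slack is exactly the hinge loss, minimizing this expression is equivalent to the constrained problem above.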
Use Calculus for Optimization

Method 1: first-order optimality condition
Exploits the fact that the gradient must vanish at a local optimum.
Also exploits the fact that for convex functions, local minima are global.
Warning: works only for simple convex functions when there are no constraints.
To do: given a convex function that we wish to minimize, try finding all
the stationary points of the function (set the gradient to zero).
If you find only one, that has to be the global minimum. 
Example (illustrative): for $f(x) = (x - 1)^2$, $f'(x) = 2(x - 1) = 0$ only
at $x = 1$; $f''(x) = 2 > 0$, i.e., $f$ is cvx, i.e., $x = 1$ is the global min.
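The recipe can be mirrored in code. The convex function here is an assumed example for illustration (the slide's original example was not recoverable).

```python
def f(x):
    """An illustrative convex function (made up for this example)."""
    return (x - 1.0) ** 2

def f_prime(x):
    """Its derivative; setting it to zero locates stationary points."""
    return 2.0 * (x - 1.0)

# Solving f'(x) = 0 gives the single stationary point x = 1; since f is
# convex, that lone stationary point must be the global minimum.
x_star = 1.0
```

A quick sanity check: the derivative vanishes at $x^\star$ and no sampled point does better.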
Use Calculus for Optimization

Method 2: perform (sub)gradient descent
Recall that the direction opposite to the gradient offers the steepest descent.

(SUB)GRADIENT DESCENT
1. Given: obj. func. $f$ to minimize
2. Initialize $\mathbf{w}^0$
3. For $t = 1, 2, \ldots$
   1. Obtain a (sub)gradient $\mathbf{g}^t$
   2. Choose a step length $\eta_t$
   3. Update $\mathbf{w}^{t+1} \leftarrow \mathbf{w}^t - \eta_t \cdot \mathbf{g}^t$
4. Repeat until convergence

Questions to keep in mind: How to initialize $\mathbf{w}^0$? How to choose
$\eta_t$ (often called “step length” or “learning rate”)? What is
convergence, and how to decide if we have converged?
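The loop above can be sketched directly. The objective ($f(\mathbf{w}) = \lVert\mathbf{w}\rVert^2$, gradient $2\mathbf{w}$), starting point, step length, and step count are all illustrative choices, with "convergence" simplified to a fixed number of steps.

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, steps=200):
    """Plain gradient descent: repeatedly move opposite the gradient
    with a constant step length, for a fixed number of steps."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - eta * grad(w)          # the update rule from the slide
    return w

# Minimize f(w) = ||w||^2, whose gradient is 2w; the minimum is at 0.
w_final = gradient_descent(lambda w: 2.0 * w, [5.0, -3.0])
```

With this step length each iteration shrinks $\mathbf{w}$ by a factor of $0.8$, so the iterates contract geometrically toward the origin.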
Gradient Descent (GD)

Move opposite to the gradients. Choose the step length carefully, else we
may overshoot the global minimum even with great initialization.
Also, initialization may affect the result: one initialization may make us
converge to a local minimum, while a really nice one leads us to the
global minimum.
With convex fns, all local minima are global minima, so we can afford to
be less careful with initialization. We still need to be careful with step
lengths, otherwise we may overshoot global minima.
Behind the scenes in GD for SVM

$f(\mathbf{w}) = \lVert\mathbf{w}\rVert_2^2 + C \sum_i [1 - y^i \cdot \mathbf{w}^\top\mathbf{x}^i]_+$ (ignore bias for now)
$\mathbf{g} = 2\mathbf{w} + C \sum_i \mathbf{g}^i$, where $\mathbf{g}^i = -y^i \cdot \mathbf{x}^i$ if
$y^i \cdot \mathbf{w}^\top\mathbf{x}^i < 1$, else $\mathbf{g}^i = \mathbf{0}$.
Assume a constant step length $\eta$ for a moment for the sake of understanding.
Small $\eta$: even if $\mathbf{g}$ is large, do not change $\mathbf{w}$ too much!
Large $\eta$: feel free to change $\mathbf{w}$ as much as the gradient dictates.
If $\mathbf{w}^t$ does well on $\mathbf{x}^i$, say $y^i \cdot \langle\mathbf{w}^t, \mathbf{x}^i\rangle \geq 1$,
then $\mathbf{g}^i = \mathbf{0}$ – no change to $\mathbf{w}$ due to that data point.
If $\mathbf{w}^t$ does badly on $\mathbf{x}^i$, say $y^i \cdot \langle\mathbf{w}^t, \mathbf{x}^i\rangle < 1$, then the
update adds $\eta C \cdot y^i \cdot \mathbf{x}^i$ to $\mathbf{w}^t$, so $\mathbf{w}^{t+1}$ may get a much
better margin on $\mathbf{x}^i$ than $\mathbf{w}^t$ did.
So gradient descent, although a mathematical tool from calculus, actually
tries very actively to make the model perform better on all data points.
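A minimal sketch of this subgradient, with made-up data chosen so one point is safely classified and one is not: only the point with margin below 1 contributes to the hinge part.

```python
import numpy as np

def svm_subgradient(w, X, y, C):
    """Subgradient of ||w||^2 + C * sum_i [1 - y_i w^T x_i]_+ (no bias).

    Points with margin y_i w^T x_i >= 1 contribute nothing; each point
    with margin < 1 contributes -y_i x_i to the sum."""
    margins = y * (X @ w)
    active = margins < 1.0                 # points the model still gets "wrong enough"
    g = 2.0 * w - C * (y[active, None] * X[active]).sum(axis=0)
    return g

X = np.array([[3.0, 0.0], [0.2, 0.0]])     # margins 3.0 and 0.2 under w below
y = np.array([+1.0, +1.0])
w = np.array([1.0, 0.0])
g = svm_subgradient(w, X, y, C=1.0)
# Only the second point is active: g = 2w - y_2 x_2 = (1.8, 0)
```

Stepping opposite $\mathbf{g}$ leaves the well-classified point alone and pushes $\mathbf{w}$ toward a larger margin on the poorly classified one, exactly as described above.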
Stochastic Gradient Method

$\mathbf{g} = 2\mathbf{w} + C \sum_{i=1}^{n} \mathbf{g}^i$, where $\mathbf{g}^i = -y^i \cdot \mathbf{x}^i$ if
$y^i \cdot \mathbf{w}^\top\mathbf{x}^i < 1$, else $\mathbf{g}^i = \mathbf{0}$.
Calculating each $\mathbf{g}$ takes $\mathcal{O}(nd)$ time since there are $n$ data points – too expensive in total.
At each time $t$, choose a random data point $i_t$ and use
$\mathbf{g}^t = 2\mathbf{w} + C \cdot n \cdot \mathbf{g}^{i_t}$ – only $\mathcal{O}(d)$ time!!
Warning: we may have to perform several more SGD steps than we had to do
with GD, but each SGD step is much cheaper than a GD step.
We take a random data point to avoid being unlucky (also it is cheap).
Do we really need to spend so much time on just one update? No – SGD gives
a cheaper way to perform the gradient update. Initially, all we need is a
general direction in which to move, especially in the beginning, when we
are far away from the optimum!
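One SGD step then looks as follows. The two-point dataset, $C$, step length, and step count are all made-up illustrative choices; the $n\times$ scaling of the single point's contribution keeps the stochastic gradient an unbiased stand-in for the full one.

```python
import numpy as np

def sgd_step(w, x_i, y_i, C, n, eta):
    """One stochastic step for the bias-free C-SVM objective: the sum of
    n hinge contributions is replaced by n times the contribution of a
    single randomly drawn point, so the step costs O(d), not O(nd)."""
    g = 2.0 * w
    if y_i * (w @ x_i) < 1.0:          # hinge term active for this point
        g = g - C * n * y_i * x_i
    return w - eta * g

rng = np.random.default_rng(0)
X = np.array([[2.0, 0.0], [-2.0, 0.0]])   # a trivially separable toy set
y = np.array([+1.0, -1.0])
w_sgd = np.zeros(2)
for _ in range(100):
    i = int(rng.integers(len(X)))          # pick a random data point
    w_sgd = sgd_step(w_sgd, X[i], y[i], C=1.0, n=len(X), eta=0.01)
```

After a few dozen cheap steps the iterate settles near the margin boundary and classifies both toy points correctly.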
Mini-batch SGD

If data is very diverse, the “stochastic” gradient may vary quite a lot
depending on which random data point is chosen.
This is called variance (more on this later); it can slow down the SGD
process and make it jittery.
One solution: choose more than one random point.
At each step, choose $B$ random data points ($B$ = mini-batch size) without
replacement, say $i_1, \ldots, i_B$, and use
$\mathbf{g}^t = 2\mathbf{w} + C \cdot \frac{n}{B} \sum_{b=1}^{B} \mathbf{g}^{i_b}$.
Takes $\mathcal{O}(Bd)$ time to execute MBSGD – more expensive than SGD.
Notice that if $B = n$ then MBSGD becomes plain GD.
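A sketch of the mini-batch estimate, using made-up random data and an assumed batch size of 5 out of 20 points; the $n/B$ factor rescales the batch sum back to full-dataset scale.

```python
import numpy as np

def minibatch_grad(w, X, y, C, batch_idx):
    """Mini-batch estimate of the C-SVM subgradient (no bias): sum the
    hinge contributions of the B sampled points, then scale by n/B so
    the estimate is on the same scale as the full-data gradient."""
    n, B = len(X), len(batch_idx)
    Xb, yb = X[batch_idx], y[batch_idx]
    active = yb * (Xb @ w) < 1.0               # batch points inside the margin
    g_hinge = -(yb[active, None] * Xb[active]).sum(axis=0)
    return 2.0 * w + C * (n / B) * g_hinge

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2))
y = np.where(X[:, 0] > 0, 1.0, -1.0)

# Sample a mini-batch of B = 5 points without replacement.
batch = rng.choice(len(X), size=5, replace=False)
g_mb = minibatch_grad(np.zeros(2), X, y, C=1.0, batch_idx=batch)
```

Passing `batch_idx=np.arange(len(X))` makes $B = n$, and the function computes the plain full-batch gradient.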
Coordinate Descent

Sometimes we are able to optimize completely along a given variable (even
if constraints are there) – this is called coordinate minimization (CM).
Coordinate descent is similar to GD, except only one coordinate is changed
in a single step, e.g., $w_j \leftarrow w_j - \eta_t \cdot \mathbf{g}_j$, with
$\mathbf{g}_j$ as the $j$-th partial derivative.

COORDINATE DESCENT
1. For $t = 1, 2, \ldots$
   1. Select a coordinate $j_t$
   2. Let $\mathbf{g}_{j_t} = \frac{\partial f}{\partial w_{j_t}}$
   3. Let $w^{t+1}_{j_t} \leftarrow w^t_{j_t} - \eta_t \cdot \mathbf{g}_{j_t}$ and $w^{t+1}_j \leftarrow w^t_j$ for $j \neq j_t$
   4. Repeat until convergence

Ways to choose the coordinate:
CCD: choose coordinates cyclically
SCD: choose a coordinate randomly
Block CD: choose a small set of coordinates to update at each step
Randperm: permute coordinates randomly and choose them in that order;
once the list is over, choose a new random permutation
and start over (very effective)
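Cyclic coordinate minimization can be sketched on an assumed quadratic objective $f(\mathbf{w}) = \tfrac{1}{2}\mathbf{w}^\top A \mathbf{w} - \mathbf{b}^\top\mathbf{w}$ (with $A$ symmetric positive definite), where the 1-D minimization along each coordinate has a closed form; cycling through coordinates this way is the classical Gauss–Seidel iteration.

```python
import numpy as np

def cyclic_coordinate_descent(A, b, sweeps=100):
    """Minimize f(w) = 0.5 w^T A w - b^T w (A symmetric positive
    definite) by *exact* minimization along one coordinate at a time,
    cycling through the coordinates (CCD / Gauss-Seidel)."""
    w = np.zeros(len(b))
    for _ in range(sweeps):
        for j in range(len(b)):
            # Exact 1-D minimizer along coordinate j, others held fixed:
            # set the j-th partial derivative (Aw - b)_j to zero for w_j.
            w[j] = (b[j] - A[j] @ w + A[j, j] * w[j]) / A[j, j]
    return w

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # illustrative SPD matrix
b = np.array([5.0, 5.0])
w_cd = cyclic_coordinate_descent(A, b)    # converges to the solution of Aw = b
```

Since each sweep solves one coordinate exactly given the others, the iterates converge to the stationary point $A\mathbf{w} = \mathbf{b}$, here $\mathbf{w} = (1, 2)$.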
