Lec04 - The Most Confident Classifier


I am confident that one of these days, professors will make their exams
closed-ChatGPT in addition to closed-book.

Ill-posed Problems in ML
There seem to be infinitely many linear classifiers, all of which perfectly
classify the data! Which one should I choose?

This means we are dealing with an “ill-posed problem”
Ill-posed problems can be made “well-posed” by imposing additional
requirements on the solution
An example of an ill-posed problem from high-school math is a set of linear
equations with infinitely many solutions, as in the system below
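The following made-up system (my own example, not from the original slides)
has infinitely many solutions, because the second equation is just twice the
first:

\[
\begin{aligned}
x + y &= 2 \\
2x + 2y &= 4
\end{aligned}
\]

Every pair of the form $(x,\, 2 - x)$ solves it; additionally demanding, say,
the solution with the smallest $x^2 + y^2$ makes the problem well-posed.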
Large-margin Classifiers
Such a linear classifier will have a large geometric margin on all training
points
(Figure: two separating lines, one with a large geometric margin and one with
a small geometric margin)
A large geometric margin means a more confident prediction
Let us try to learn a linear classifier which is the most confident on all
its predictions
Optimization in ML
Optimization: a branch of math that gives us techniques to search for the
“best” object that satisfies certain properties
The object could be a number or a vector, etc.
The properties are encoded mathematically as constraints
A function (known as the objective function) is used to rate the objects
The feasible set may be empty, in which case the optimization problem has no
solution! This is just like a set of linear equations that has no solution
Example: find the number between 3 and 6 whose squared value is the smallest,
i.e., minimize the objective $x^2$ subject to the constraints $x \ge 3$ and
$x \le 6$; the feasible set is the interval $[3, 6]$
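A minimal sketch of this toy problem, assuming SciPy is available (my own
illustration, not part of the slides):

from scipy.optimize import minimize_scalar

# Minimize the objective x^2 over the feasible set [3, 6];
# method="bounded" handles the interval constraint directly.
result = minimize_scalar(lambda x: x**2, bounds=(3, 6), method="bounded")
print(result.x)  # ~3.0: the smallest feasible number also has the smallest square

The solver returns (approximately) 3, the point of the feasible set closest
to the unconstrained minimizer 0.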
Learning the most confident classifier
Given: training data $(x_i, y_i)$ with $x_i \in \mathbb{R}^d$ and
$y_i \in \{-1, +1\}$
Goal: learn a linear classifier $(w, b)$ so that
it classifies data points correctly, i.e., $y_i(w^\top x_i + b) > 0$ for all
$i$, and it has a large geometric margin, i.e.,
$\frac{y_i(w^\top x_i + b)}{\|w\|}$ is large for all $i$
Thus, correctness becomes our constraint and margin our objective
We maximize the minimum margin since we want every data point to have a large
margin (some optimization problems choose to maximize their objective
function, others choose to minimize it)

$\max_{w, b}\ \min_i\ \frac{y_i(w^\top x_i + b)}{\|w\|}$ such that
$y_i(w^\top x_i + b) > 0$ for all $i$

Note: $y_i(w^\top x_i + b) > 0$ is another way of saying that the point $x_i$
is classified correctly, i.e., $\operatorname{sign}(w^\top x_i + b) = y_i$
This optimization problem will fail to give a solution (as its feasible set
will be empty) if no linear classifier can correctly classify all points; we
will deal with such cases later
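A minimal sketch (with made-up toy data and hand-picked candidate classifiers,
not from the slides) of the quantity being maximized: among a few separating
classifiers $(w, b)$, pick the one whose minimum geometric margin is largest.

import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])  # toy points
y = np.array([+1, +1, -1, -1])                                      # toy labels

def min_geometric_margin(w, b):
    # Geometric margin of point i is y_i * (w . x_i + b) / ||w||;
    # the classifier's confidence is its worst (smallest) margin.
    margins = y * (X @ w + b) / np.linalg.norm(w)
    return margins.min()

candidates = [(np.array([1.0, 0.0]), 0.0),
              (np.array([1.0, 1.0]), 0.0),
              (np.array([0.0, 1.0]), 0.5)]
best_w, best_b = max(candidates, key=lambda wb: min_geometric_margin(*wb))
print(best_w, best_b, min_geometric_margin(best_w, best_b))

In a real learner the search runs over all $(w, b)$, not a handful of
candidates, which is exactly what the optimization problem above expresses.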
Simplification
Instead of demanding $y_i(w^\top x_i + b) > 0$, let us instead demand
$y_i(w^\top x_i + b) \ge 1$
Simply need to scale the model: if $y_i(w^\top x_i + b) \ge c$ for all $i$,
for some $c > 0$, then setting $\tilde{w} = w/c$ and $\tilde{b} = b/c$ gives
$y_i(\tilde{w}^\top x_i + \tilde{b}) \ge 1$ for all $i$
Note 1: the scaled classifier makes the same predictions on any point
Note 2: the scaled classifier has the same geometric margin on any point
Turns out, searching for these scaled classifiers is simpler
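A short worked check of the two notes above (my own derivation, under the
scaling $\tilde{w} = w/c$, $\tilde{b} = b/c$ with $c > 0$):

\[
\operatorname{sign}(\tilde{w}^\top x + \tilde{b})
  = \operatorname{sign}\!\left(\tfrac{1}{c}\,(w^\top x + b)\right)
  = \operatorname{sign}(w^\top x + b),
\qquad
\frac{y_i(\tilde{w}^\top x_i + \tilde{b})}{\|\tilde{w}\|}
  = \frac{y_i(w^\top x_i + b)/c}{\|w\|/c}
  = \frac{y_i(w^\top x_i + b)}{\|w\|}.
\]

So scaling changes neither the predictions nor the geometric margins; it only
replaces the awkward strict inequality with the cleaner constraint
$y_i(\tilde{w}^\top x_i + \tilde{b}) \ge 1$.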


Support Vector Machines
Since we are already demanding $y_i(w^\top x_i + b) \ge 1$ for all $i$, we get
$\min_i \frac{y_i(w^\top x_i + b)}{\|w\|} \ge \frac{1}{\|w\|}$, since
$\|w\| > 0$. The problem therefore becomes

$\max_{w, b}\ \frac{1}{\|w\|}$ such that $y_i(w^\top x_i + b) \ge 1$ for all $i$

Maximizing $\frac{1}{\|w\|}$ is the same as minimizing $\|w\|$, which is the
same as minimizing $\|w\|^2$ (since $\|w\| \ge 0$), which is the same as
minimizing $\frac{1}{2}\|w\|^2$

$\min_{w, b}\ \frac{1}{2}\|w\|^2$ such that $y_i(w^\top x_i + b) \ge 1$ for all $i$

The classifier returned by solving this optimization problem is called the
SVM classifier. The squaring and the scaling by $\frac{1}{2}$ help simplify
calculations when solving this opt problem.
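A minimal sketch of this optimization problem on made-up separable toy data,
assuming the CVXPY library is available (any convex-optimization solver would
do; this is not the course's own code):

import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()
# Minimize (1/2)||w||^2 subject to y_i * (w . x_i + b) >= 1 for all i.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()
print(w.value, b.value)

The returned $(w, b)$ is the maximum-margin separator for this toy dataset.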
Handling Non-separable Data
We suspect that although a good linear classifier exists, it may still not
classify every point correctly
For certain points, we may have $y_i(w^\top x_i + b) < 1$, i.e., the
constraint $y_i(w^\top x_i + b) \ge 1$ cannot be satisfied
Problem is, we do not know in advance which points will get misclassified
Slack variables: allow points to get misclassified, or else classified
correctly but with a small geometric margin
What prevents me from misusing slack variables to learn a useless classifier
that misclassifies every data point? Extract a price from the model for using
slack variables.
The C-SVM
$\min_{w, b, \xi}\ \frac{1}{2}\|w\|^2 + C\sum_i \xi_i$
s.t. $y_i(w^\top x_i + b) \ge 1 - \xi_i$ for all $i$,
as well as $\xi_i \ge 0$ for all $i$
The constraints $\xi_i \ge 0$ prevent slack variables from being misused to
artificially inflate the margin
The term $C\sum_i \xi_i$ in the objective prevents the procedure from
returning a model that misuses or overuses slack variables
Setting $C$ to a large value will encourage solutions where slack is mostly $0$
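A minimal sketch of the C-SVM, again assuming CVXPY and made-up toy data; the
slack variables $\xi_i$ appear as explicit optimization variables:

import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0], [0.5, -0.5]])
y = np.array([+1, +1, -1, -1, -1])   # last point sits close to the boundary
C = 10.0                             # large C discourages the use of slack

w, b, xi = cp.Variable(2), cp.Variable(), cp.Variable(len(y))
# Pay (1/2)||w||^2 plus a price of C per unit of slack used.
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()
print(w.value, b.value, xi.value)

Making $C$ smaller lets the solver trade some margin violations (nonzero
$\xi_i$) for a larger margin on the remaining points.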
Exercise
Find the solution to the following optimization problem
Hint 1: for any and if and only if
Hint 2: You will see that the solution does not depend on or the data at all!
The Hinge Loss
Let us introduce the positive-part function $[t]_+ = \max\{t, 0\}$
Looks like the ReLU function used in deep learning
If $y_i(w^\top x_i + b) \ge 1$, then we don’t need slack at all, i.e., $\xi_i = 0$
The amount of slack we need is precisely $\xi_i = [1 - y_i(w^\top x_i + b)]_+$
We can rewrite the C-SVM as an unconstrained optimization problem

$\min_{w, b}\ \frac{1}{2}\|w\|^2 + C\sum_i \left[1 - y_i(w^\top x_i + b)\right]_+$

Note: the second term looks like a loss function
This is the hinge loss function $\ell_{\text{hinge}}(y, \hat{y}) = [1 - y\hat{y}]_+$,
where $\hat{y} = w^\top x + b$ is the classifier's score
Zero loss if the classification is correct with margin at least $1$
Small loss if the classification is correct but with margin less than $1$
Large loss if the classification is incorrect
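A minimal NumPy sketch of the positive-part function, the hinge loss, and the
unconstrained C-SVM objective (my own illustration, with made-up toy data):

import numpy as np

def positive_part(t):
    # [t]_+ = max(t, 0), elementwise; same shape as the ReLU function.
    return np.maximum(t, 0.0)

def c_svm_objective(w, b, X, y, C):
    # 0.5*||w||^2 + C * sum_i [1 - y_i * (w . x_i + b)]_+
    scores = X @ w + b
    hinge = positive_part(1.0 - y * scores)
    return 0.5 * np.dot(w, w) + C * hinge.sum()

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])
print(c_svm_objective(np.array([0.5, 0.5]), 0.0, X, y, C=1.0))  # 0.25: no slack needed

Each point contributes zero to the second term exactly when its margin
constraint $y_i(w^\top x_i + b) \ge 1$ holds, matching the slack formula above.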
Summary
ML models are often learnt by solving optimization problems
The two main ingredients of an optimization problem are the
constraints and the objective function
The objective function may be maximized or minimized
Ill-posed problems may have infinitely many solutions
An opt problem may fail to have a solution if its feasible set is empty
Some opt problems may have no constraints (unconstrained)
Example: the C-SVM problem after we introduced hinge loss
There are several techniques to solve optimization problems
In a later discussion we will find the reason behind the name SVM
Have a blast!
Catch up with you in the next one.
