EECS 498 FA2019 Lecture 03


Lecture 3: Linear Classifiers
Justin Johnson Lecture 3 - 1 September 11, 2019


Reminder: Assignment 1
• http://web.eecs.umich.edu/~justincj/teaching/eecs498/assignment1.html
• Due Sunday September 15, 11:59pm EST
• We have written a homework validation script to check the format of your .zip file before you submit to Canvas: https://github.com/deepvision-class/tools#homework-validation
• This script ensures that your .zip and .ipynb files are properly structured; it does not check correctness
• It is your responsibility to make sure your submitted .zip file passes the validation script
Justin Johnson Lecture 3 - 2 September 11, 2019
Last time: Image Classification
Input: image
Output: Assign the image to one of a fixed set of categories (e.g. cat, bird, deer, dog, truck)

[Figure: example cat image. This image by Nikita is licensed under CC-BY 2.0]

Justin Johnson Lecture 3 - 3 September 11, 2019


Last Time: Challenges of Recognition
Viewpoint, Illumination, Deformation, Occlusion, Clutter, Intraclass Variation

[Figure: example images for each challenge. Image by Umberto Salvagnin licensed under CC-BY 2.0; image by jonsson licensed under CC-BY 2.0; remaining images are CC0 1.0 public domain]

Justin Johnson Lecture 3 - 4 September 11, 2019


Last time: Data-Driven Approach, kNN

[Figures: 1-NN vs 5-NN classifier decision boundaries; dataset splits into train/test and train/validation/test]
Justin Johnson Lecture 3 - 5 September 11, 2019


Today: Linear Classifiers

Justin Johnson Lecture 3 - 6 September 11, 2019


Neural Network

[Figure: a neural network built out of linear classifiers. This image is CC0 1.0 public domain]
Justin Johnson Lecture 3 - 7 September 11, 2019


Recall CIFAR10
50,000 training images, each image is 32x32x3
10,000 test images
Justin Johnson Lecture 3 - 8 September 11, 2019


Parametric Approach

Image (array of 32x32x3 numbers; 3072 numbers total) → f(x,W) → 10 numbers giving class scores
W: parameters or weights
Justin Johnson Lecture 3 - 9 September 11, 2019


Parametric Approach: Linear Classifier
f(x,W) = Wx

Image (array of 32x32x3 numbers; 3072 numbers total) → f(x,W) → 10 numbers giving class scores
W: parameters or weights
Justin Johnson Lecture 3 - 10 September 11, 2019


Parametric Approach: Linear Classifier
f(x,W) = Wx
x: (3072,)   W: (10, 3072)   f(x,W): (10,)

Image (array of 32x32x3 numbers; 3072 numbers total) → f(x,W) → 10 numbers giving class scores
W: parameters or weights
Justin Johnson Lecture 3 - 11 September 11, 2019


Parametric Approach: Linear Classifier
f(x,W) = Wx + b
x: (3072,)   W: (10, 3072)   b: (10,)   f(x,W): (10,)

Image (array of 32x32x3 numbers; 3072 numbers total) → f(x,W) → 10 numbers giving class scores
W: parameters or weights
Justin Johnson Lecture 3 - 12 September 11, 2019


Example for 2x2 image, 3 classes (cat/dog/ship)
f(x,W) = Wx + b
Stretch pixels into column:
Input image (2, 2) with pixel values [[56, 231], [24, 2]] → x = [56, 231, 24, 2]   (4,)
Justin Johnson Lecture 3 - 13 September 11, 2019


Example for 2x2 image, 3 classes (cat/dog/ship)
f(x,W) = Wx + b
Stretch pixels into column: x = [56, 231, 24, 2]   (4,)

W   (3, 4):
[[0.2, -0.5, 0.1,  2.0],
 [1.5,  1.3, 2.1,  0.0],
 [0.0, 0.25, 0.2, -0.3]]

b   (3,): [1.1, 3.2, -1.2]

Wx + b   (3,): [-96.8, 437.9, 61.95]
Justin Johnson Lecture 3 - 14 September 11, 2019
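As a sanity check, here is a minimal NumPy sketch of the worked example above (my own illustration, not part of the slides). With the bias included, the third score comes out to 60.75; the 61.95 printed on the slide is the value of W·x for that row before the bias is added.

```python
import numpy as np

# Weights: one row per class (cat, dog, ship), one column per pixel.
W = np.array([[0.2, -0.5, 0.1,  2.0],
              [1.5,  1.3, 2.1,  0.0],
              [0.0, 0.25, 0.2, -0.3]])   # shape (3, 4)
b = np.array([1.1, 3.2, -1.2])           # shape (3,)

# The 2x2 image stretched into a column of 4 pixel values.
x = np.array([56.0, 231.0, 24.0, 2.0])   # shape (4,)

scores = W @ x + b                       # shape (3,)
print(scores)                            # ≈ [-96.8, 437.9, 60.75] (cat, dog, ship)
```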


Linear Classifier: Algebraic Viewpoint
f(x,W) = Wx + b
Stretch pixels into column: x = [56, 231, 24, 2]   (4,)
W (3, 4) and b (3,) as in the example above; scores Wx + b = [-96.8, 437.9, 61.95]   (3,)
Justin Johnson Lecture 3 - 15 September 11, 2019


Linear Classifier: Bias Trick
Add an extra one to the data vector; the bias is absorbed into the last column of the weight matrix.
Stretch pixels into column and append a 1: x = [56, 231, 24, 2, 1]   (5,)

W   (3, 5), with b as the last column:
[[0.2, -0.5, 0.1,  2.0,  1.1],
 [1.5,  1.3, 2.1,  0.0,  3.2],
 [0.0, 0.25, 0.2, -0.3, -1.2]]

Wx   (3,): [-96.8, 437.9, 61.95]

Justin Johnson Lecture 3 - 16 September 11, 2019
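A quick sketch of the bias trick, reusing the same numbers (an illustration assuming NumPy, not code from the course):

```python
import numpy as np

W = np.array([[0.2, -0.5, 0.1,  2.0],
              [1.5,  1.3, 2.1,  0.0],
              [0.0, 0.25, 0.2, -0.3]])   # (3, 4)
b = np.array([1.1, 3.2, -1.2])           # (3,)
x = np.array([56.0, 231.0, 24.0, 2.0])   # (4,)

# Bias trick: append a constant 1 to x and absorb b as an extra column of W,
# so that a single matrix multiply computes Wx + b.
W_aug = np.hstack([W, b[:, None]])       # (3, 5)
x_aug = np.append(x, 1.0)                # (5,)

assert np.allclose(W_aug @ x_aug, W @ x + b)
```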


Linear Classifier: Predictions are Linear!
f(x, W) = Wx (ignore bias)

f(cx, W) = W(cx) = c * f(x, W)

Justin Johnson Lecture 3 - 17 September 11, 2019


Linear Classifier: Predictions are Linear!
f(x, W) = Wx (ignore bias)

f(cx, W) = W(cx) = c * f(x, W)

Image → scores [-96.8, 437.8, 62.0]
0.5 * Image → scores [-48.4, 218.9, 31.0] = 0.5 * original scores

Justin Johnson Lecture 3 - 18 September 11, 2019
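A one-line numerical check of this scaling property (ignoring the bias, as the slide does); the shapes match the CIFAR-10 setup but the values here are random:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 3072))   # 10 classes, 3072 pixels
x = rng.standard_normal(3072)

# Halving the image halves every class score.
assert np.allclose(W @ (0.5 * x), 0.5 * (W @ x))
```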


Interpreting a Linear Classifier: Algebraic Viewpoint
f(x,W) = Wx + b
Stretch pixels into column: x = [56, 231, 24, 2]   (4,)
W (3, 4), b (3,) as in the example above; scores = [-96.8, 437.9, 61.95]   (3,)

Justin Johnson Lecture 3 - 19 September 11, 2019


Interpreting a Linear Classifier: Algebraic Viewpoint
f(x,W) = Wx + b
Stretch pixels into column: x = [56, 231, 24, 2]
Equivalently, each row of W can be reshaped back into a 2x2 block that is matched against the image:
W rows as 2x2 templates: [[0.2, -0.5], [0.1, 2.0]], [[1.5, 1.3], [2.1, 0.0]], [[0.0, 0.25], [0.2, -0.3]]
b: [1.1, 3.2, -1.2]
Scores: [-96.8, 437.9, 61.95]

Justin Johnson Lecture 3 - 20 September 11, 2019


Interpreting a Linear Classifier
W rows as 2x2 templates: [[0.2, -0.5], [0.1, 2.0]], [[1.5, 1.3], [2.1, 0.0]], [[0.0, 0.25], [0.2, -0.3]]
b: [1.1, 3.2, -1.2]
Scores: [-96.8, 437.9, 61.95]

Justin Johnson Lecture 3 - 21 September 11, 2019


Interpreting a Linear Classifier: Visual Viewpoint
W rows as 2x2 templates: [[0.2, -0.5], [0.1, 2.0]], [[1.5, 1.3], [2.1, 0.0]], [[0.0, 0.25], [0.2, -0.3]]
b: [1.1, 3.2, -1.2]
Scores: [-96.8, 437.9, 61.95]

Justin Johnson Lecture 3 - 22 September 11, 2019


Interpreting a Linear Classifier: Visual Viewpoint
The linear classifier has one “template” per category: each row of W, reshaped to the image size, is matched against the input by an inner product.
W rows as 2x2 templates: [[0.2, -0.5], [0.1, 2.0]], [[1.5, 1.3], [2.1, 0.0]], [[0.0, 0.25], [0.2, -0.3]]
b: [1.1, 3.2, -1.2]
Scores: [-96.8, 437.9, 61.95]

Justin Johnson Lecture 3 - 23 September 11, 2019


Interpreting a Linear Classifier: Visual Viewpoint
The linear classifier has one “template” per category.
A single template cannot capture multiple modes of the data: e.g. the horse template has 2 heads!
W rows as 2x2 templates: [[0.2, -0.5], [0.1, 2.0]], [[1.5, 1.3], [2.1, 0.0]], [[0.0, 0.25], [0.2, -0.3]]
b: [1.1, 3.2, -1.2]
Scores: [-96.8, 437.9, 61.95]

Justin Johnson Lecture 3 - 24 September 11, 2019
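The visual viewpoint is easy to inspect in code: reshape each row of W into an image and display it. Below is a rough sketch for the CIFAR-10 shape, assuming matplotlib is available; the random W is only a stand-in for actual learned weights, which would show up as blurry per-class templates.

```python
import numpy as np
import matplotlib.pyplot as plt

classes = ['plane', 'car', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck']
W = np.random.randn(10, 3072)              # stand-in for trained weights

fig, axes = plt.subplots(1, 10, figsize=(15, 2))
for ax, row, name in zip(axes, W, classes):
    template = row.reshape(32, 32, 3)
    # Rescale weights to [0, 1] so they can be shown as an image.
    template = (template - template.min()) / (template.max() - template.min())
    ax.imshow(template)
    ax.set_title(name, fontsize=8)
    ax.axis('off')
plt.show()
```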


Interpreting a Linear Classifier: Geometric Viewpoint
f(x,W) = Wx + b
[Figure: the airplane, car, and deer classifier scores plotted against the value of pixel (15, 8, 0) of the 32x32x3 input (3072 numbers total); each score is a linear function of the pixel value]

Justin Johnson Lecture 3 - 25 September 11, 2019


Interpreting a Linear Classifier: Geometric Viewpoint
f(x,W) = Wx + b
[Figure: the plane spanned by pixel (11, 11, 0) and pixel (15, 8, 0) of the 32x32x3 input (3072 numbers total); the line where the car score = 0 is drawn, and the car score increases as we move away from it in one direction]

Justin Johnson Lecture 3 - 26 September 11, 2019


Interpreting a Linear Classifier: Geometric Viewpoint
f(x,W) = Wx + b
[Figure: same plot, annotated with "car template on this line", the line where the car score = 0, and the direction in which the car score increases]

Justin Johnson Lecture 3 - 27 September 11, 2019


Interpreting a Linear Classifier: Geometric Viewpoint
f(x,W) = Wx + b
[Figure: same plot with the airplane-score and deer-score lines added alongside the car-score line]
Justin Johnson Lecture 3 - 28 September 11, 2019
Interpreting a Linear Classifier: Geometric Viewpoint
f(x,W) = Wx + b
Hyperplanes carving up a high-dimensional space
[Figure: class-score hyperplanes partitioning pixel space; plot created using Wolfram Cloud]

Justin Johnson Lecture 3 - 29 September 11, 2019


Hard Cases for a Linear Classifier
Case 1: Class 1 is the first and third quadrants; Class 2 is the second and fourth quadrants
Case 2: Class 1 is 1 <= L2 norm <= 2; Class 2 is everything else
Case 3: Class 1 has three modes; Class 2 is everything else

Justin Johnson Lecture 3 - 30 September 11, 2019


Recall: Perceptron couldn’t learn XOR

X   Y   F(X, Y)
0   0   0
0   1   1
1   0   1
1   1   0

[Figure: the four XOR points plotted in the (x, y) plane]

Justin Johnson Lecture 3 - 31 September 11, 2019
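A small brute-force check of this point (my own sketch, not from the lecture): no setting of (w1, w2, b) over a coarse grid classifies all four XOR points correctly with a linear decision rule.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])               # XOR labels

grid = np.linspace(-3, 3, 61)            # coarse grid over w1, w2, b
best = 0
for w1 in grid:
    for w2 in grid:
        for b in grid:
            pred = (X @ np.array([w1, w2]) + b > 0).astype(int)
            best = max(best, int((pred == y).sum()))
print(best)   # 3: no linear rule gets all 4 XOR points right
```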


Linear Classifier: Three Viewpoints
Algebraic Viewpoint: f(x,W) = Wx
Visual Viewpoint: one template per class
Geometric Viewpoint: hyperplanes cutting up space

Justin Johnson Lecture 3 - 32 September 11, 2019


So Far: Defined a linear score function f(x,W) = Wx + b

Given a W, we can compute class scores for an image x.
But how can we actually choose a good W?

[Figure: 10 class scores for each of three example images (cat, car, frog). Cat image by Nikita is licensed under CC-BY 2.0; car image is CC0 1.0 public domain; frog image is in the public domain]

Justin Johnson Lecture 3 - 33 September 11, 2019


Choosing a good W
f(x,W) = Wx + b

TODO:
1. Use a loss function to quantify how good a value of W is
2. Find a W that minimizes the loss function (optimization)
Justin Johnson Lecture 3 - 34 September 11, 2019
Loss Function
A loss function tells how good our current classifier is:
Low loss = good classifier
High loss = bad classifier
(Also called: objective function; cost function)

Justin Johnson Lecture 3 - 35 September 11, 2019


Loss Function
A loss function tells how good our current classifier is:
Low loss = good classifier
High loss = bad classifier
(Also called: objective function; cost function)
The negative of a loss function is sometimes called a reward function, profit function, utility function, fitness function, etc.

Justin Johnson Lecture 3 - 36 September 11, 2019


Loss Function
Given a dataset of examples {(x_i, y_i)}, i = 1..N, where x_i is an image and y_i is an (integer) label.

A loss function tells how good our current classifier is:
Low loss = good classifier; high loss = bad classifier
(Also called: objective function; cost function. The negative of a loss function is sometimes called a reward function, profit function, utility function, fitness function, etc.)

Justin Johnson Lecture 3 - 37 September 11, 2019


Loss Function
Given a dataset of examples {(x_i, y_i)}, i = 1..N, where x_i is an image and y_i is an (integer) label.
The loss for a single example is L_i(f(x_i, W), y_i).

Justin Johnson Lecture 3 - 38 September 11, 2019


Loss Function
Given a dataset of examples {(x_i, y_i)}, i = 1..N, where x_i is an image and y_i is an (integer) label.
The loss for a single example is L_i(f(x_i, W), y_i).
The loss for the dataset is the average of the per-example losses:
L = (1/N) Σ_i L_i(f(x_i, W), y_i)

Justin Johnson Lecture 3 - 39 September 11, 2019


Multiclass SVM Loss
“The score of the correct class should be higher than all the other scores”
[Figure: the loss plotted as a function of the score for the correct class]

Justin Johnson Lecture 3 - 40 September 11, 2019


Multiclass SVM Loss
“The score of the correct class should be higher than all the other scores”
[Figure: loss vs. score for the correct class, with the highest score among the other classes marked on the axis]
Justin Johnson Lecture 3 - 41 September 11, 2019
Multiclass SVM Loss
“The score of the correct class should be higher than all the other scores”
[Figure: “Hinge Loss”: the loss falls linearly and reaches zero once the correct-class score exceeds the highest score among the other classes by a “margin”]
Justin Johnson Lecture 3 - 42 September 11, 2019
Multiclass SVM Loss
Given an example (x_i, y_i), where x_i is the image and y_i is the label.
Let s = f(x_i, W) be the scores.
Then the SVM loss has the form:
L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)
[Figure: hinge loss with a margin of 1]
Justin Johnson Lecture 3 - 43 September 11, 2019
Multiclass SVM Loss
Given an example (x_i, y_i), where x_i is the image and y_i is the label; let s = f(x_i, W) be the scores.
Then the SVM loss has the form: L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)

Scores for three example images (columns: cat image, car image, frog image):
cat    3.2   1.3   2.2
car    5.1   4.9   2.5
frog  -1.7   2.0  -3.1

Justin Johnson Lecture 3 - 44 September 11, 2019


Multiclass SVM Loss
Cat image (correct class: cat):
L_i = max(0, 5.1 - 3.2 + 1) + max(0, -1.7 - 3.2 + 1)
    = max(0, 2.9) + max(0, -3.9)
    = 2.9 + 0
    = 2.9
Loss: 2.9
Justin Johnson Lecture 3 - 45 September 11, 2019
Multiclass SVM Loss
Car image (correct class: car):
L_i = max(0, 1.3 - 4.9 + 1) + max(0, 2.0 - 4.9 + 1)
    = max(0, -2.6) + max(0, -1.9)
    = 0 + 0
    = 0
Losses so far: 2.9, 0
Justin Johnson Lecture 3 - 46 September 11, 2019
Multiclass SVM Loss
Frog image (correct class: frog):
L_i = max(0, 2.2 - (-3.1) + 1) + max(0, 2.5 - (-3.1) + 1)
    = max(0, 6.3) + max(0, 6.6)
    = 6.3 + 6.6
    = 12.9
Losses so far: 2.9, 0, 12.9
Justin Johnson Lecture 3 - 47 September 11, 2019
Multiclass SVM Loss
Per-example losses: 2.9, 0, 12.9
Loss over the dataset is:
L = (2.9 + 0.0 + 12.9) / 3 = 5.27
Justin Johnson Lecture 3 - 48 September 11, 2019
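A short NumPy sketch of the multiclass SVM loss that reproduces the numbers above (an illustration, not the assignment's required implementation):

```python
import numpy as np

def svm_loss(scores, y, margin=1.0):
    """Multiclass SVM (hinge) loss for a single example.
    scores: (C,) vector of class scores; y: index of the correct class."""
    margins = np.maximum(0.0, scores - scores[y] + margin)
    margins[y] = 0.0                       # the j == y_i term is excluded
    return margins.sum()

# Columns of the slide's table: scores for the cat, car, and frog images.
scores = np.array([[3.2,  1.3,  2.2],     # cat class
                   [5.1,  4.9,  2.5],     # car class
                   [-1.7, 2.0, -3.1]])    # frog class
labels = [0, 1, 2]                        # correct class for each image

losses = [svm_loss(scores[:, i], labels[i]) for i in range(3)]
print(losses)            # ≈ [2.9, 0.0, 12.9]
print(np.mean(losses))   # ≈ 5.27
```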
Multiclass SVM Loss
(Scores as above; per-example losses 2.9, 0, 12.9)
Q: What happens to the loss if the scores for the car image change a bit?
Justin Johnson Lecture 3 - 49 September 11, 2019
Multiclass SVM Loss
Q2: What are the min and max possible loss?
Justin Johnson Lecture 3 - 50 September 11, 2019
Multiclass SVM Loss
Q3: If all the scores were random, what loss would we expect?
Justin Johnson Lecture 3 - 51 September 11, 2019
Multiclass SVM Loss
Q4: What would happen if the sum were over all classes (including j = y_i)?
Justin Johnson Lecture 3 - 52 September 11, 2019
Multiclass SVM Loss
Q5: What if the loss used a mean instead of a sum?
Justin Johnson Lecture 3 - 53 September 11, 2019
Multiclass SVM Loss
Q6: What if we used this loss instead?


Justin Johnson Lecture 3 - 54 September 11, 2019
Multiclass SVM Loss

Q: Suppose we found some W with L = 0. Is it unique?

Justin Johnson Lecture 3 - 55 September 11, 2019


Multiclass SVM Loss

Q: Suppose we found some W with L = 0. Is it unique?

No! 2W also has L = 0!

Justin Johnson Lecture 3 - 56 September 11, 2019


Multiclass SVM Loss
Original W (car image):
L_i = max(0, 1.3 - 4.9 + 1) + max(0, 2.0 - 4.9 + 1)
    = max(0, -2.6) + max(0, -1.9)
    = 0 + 0 = 0
Using 2W instead:
L_i = max(0, 2.6 - 9.8 + 1) + max(0, 4.0 - 9.8 + 1)
    = max(0, -6.2) + max(0, -4.8)
    = 0 + 0 = 0
Justin Johnson Lecture 3 - 57 September 11, 2019
Multiclass SVM Loss
How should we choose between W and 2W if they both perform the same on the training data?
Justin Johnson Lecture 3 - 58 September 11, 2019
Regularization: Beyond Training Error
L(W) = (1/N) Σ_i L_i(f(x_i, W), y_i)
Data loss: model predictions should match training data

Justin Johnson Lecture 3 - 59 September 11, 2019


Regularization: Beyond Training Error
L(W) = (1/N) Σ_i L_i(f(x_i, W), y_i) + λ R(W)
Data loss: model predictions should match training data
Regularization: prevent the model from doing too well on training data

Justin Johnson Lecture 3 - 60 September 11, 2019


Regularization: Beyond Training Error
L(W) = (1/N) Σ_i L_i(f(x_i, W), y_i) + λ R(W)
λ = regularization strength (hyperparameter)
Data loss: model predictions should match training data
Regularization: prevent the model from doing too well on training data

Justin Johnson Lecture 3 - 61 September 11, 2019


Regularization: Beyond Training Error
L(W) = (1/N) Σ_i L_i(f(x_i, W), y_i) + λ R(W)
λ = regularization strength (hyperparameter)
Data loss: model predictions should match training data
Regularization: prevent the model from doing too well on training data

Simple examples (see the sketch after this slide):
L2 regularization: R(W) = Σ_k Σ_l W_{k,l}^2
L1 regularization: R(W) = Σ_k Σ_l |W_{k,l}|
Elastic net (L1 + L2): R(W) = Σ_k Σ_l (β W_{k,l}^2 + |W_{k,l}|)

More complex:
Dropout
Batch normalization
Cutout, Mixup, Stochastic depth, etc…

Justin Johnson Lecture 3 - 62 September 11, 2019
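A minimal sketch of the three simple regularizers listed above, assuming the formulas given there; the example W reuses the earlier 2x2 slide values, and lam and beta are hypothetical hyperparameter choices.

```python
import numpy as np

def l2_reg(W):
    return np.sum(W * W)                  # sum of squared weights

def l1_reg(W):
    return np.sum(np.abs(W))              # sum of absolute weights

def elastic_net(W, beta=0.5):             # beta: hypothetical mixing weight
    return np.sum(beta * W * W + np.abs(W))

W = np.array([[0.2, -0.5, 0.1,  2.0],
              [1.5,  1.3, 2.1,  0.0],
              [0.0, 0.25, 0.2, -0.3]])
lam = 0.1                                 # regularization strength

# The full objective would be: data_loss + lam * R(W), e.g. with R = l2_reg.
print(l2_reg(W), l1_reg(W), elastic_net(W))
```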


Regularization: Beyond Training Error
L(W) = (1/N) Σ_i L_i(f(x_i, W), y_i) + λ R(W), where λ = regularization strength (hyperparameter)

Purpose of Regularization:
- Express preferences among models beyond “minimize training error”
- Avoid overfitting: prefer simple models that generalize better
- Improve optimization by adding curvature

Justin Johnson Lecture 3 - 63 September 11, 2019


Regularization: Expressing Preferences
L2 Regularization

Justin Johnson Lecture 3 - 64 September 11, 2019


Regularization: Expressing Preferences
L2 Regularization
L2 regularization likes to “spread out” the weights

Justin Johnson Lecture 3 - 65 September 11, 2019
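The slide's example isn't reproduced in this transcript, so here is an illustrative pair of weight vectors of my own choosing that makes the same point: both produce the same score on the input below, but L2 regularization prefers the one that spreads the weight across all inputs.

```python
import numpy as np

x  = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])       # all weight on one input
w2 = np.array([0.25, 0.25, 0.25, 0.25])   # weight spread out evenly

assert w1 @ x == w2 @ x == 1.0            # identical scores

print(np.sum(w1 ** 2))   # 1.0   (larger L2 penalty)
print(np.sum(w2 ** 2))   # 0.25  (smaller L2 penalty: L2 prefers w2)
```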


Regularization: Prefer Simpler Models

Justin Johnson Lecture 3 - 66 September 11, 2019


Regularization: Prefer Simpler Models
[Figure: training points in the (x, y) plane with two fits, f1 and f2]
The model f1 fits the training data perfectly.
The model f2 has training error, but is simpler.
Justin Johnson Lecture 3 - 67 September 11, 2019
Regularization: Prefer Simpler Models
[Figure: the same two fits, f1 and f2]
f1 is not a linear model; it could be polynomial regression, etc.
Regularization pushes against fitting the data too well, so that we don’t fit noise in the data.
Justin Johnson Lecture 3 - 68 September 11, 2019
Regularization: Prefer Simpler Models
[Figure: the same two fits, f1 and f2]
Regularization is important! You should (usually) use it.
f1 is not a linear model; it could be polynomial regression, etc.
Regularization pushes against fitting the data too well, so that we don’t fit noise in the data.
Justin Johnson Lecture 3 - 69 September 11, 2019
Cross-Entropy Loss (Multinomial Logistic Regression)
Want to interpret raw classifier scores as probabilities

cat    3.2
car    5.1
frog  -1.7

Justin Johnson Lecture 3 - 70 September 11, 2019


Cross-Entropy Loss (Multinomial Logistic Regression)
Want to interpret raw classifier scores as probabilities
Softmax function: P(Y = k | X = x_i) = exp(s_k) / Σ_j exp(s_j), where s = f(x_i; W)

cat    3.2
car    5.1
frog  -1.7

Justin Johnson Lecture 3 - 71 September 11, 2019


Cross-Entropy Loss (Multinomial Logistic Regression)
Want to interpret raw classifier scores as probabilities
Softmax function: P(Y = k | X = x_i) = exp(s_k) / Σ_j exp(s_j)

Unnormalized log-probabilities / logits:
cat    3.2
car    5.1
frog  -1.7

Justin Johnson Lecture 3 - 72 September 11, 2019


Cross-Entropy Loss (Multinomial Logistic Regression)
Want to interpret raw classifier scores as probabilities
Softmax function (probabilities must be >= 0):

        logits   exp → unnormalized probabilities
cat       3.2        24.5
car       5.1       164.0
frog     -1.7         0.18
Justin Johnson Lecture 3 - 73 September 11, 2019
Cross-Entropy Loss (Multinomial Logistic Regression)
Want to interpret raw classifier scores as probabilities
Softmax function (probabilities must be >= 0 and must sum to 1):

        logits   exp → unnormalized probs   normalize → probabilities
cat       3.2        24.5                     0.13
car       5.1       164.0                     0.87
frog     -1.7         0.18                    0.00
Justin Johnson Lecture 3 - 74 September 11, 2019
Cross-Entropy Loss (Multinomial Logistic Regression)
Want to interpret raw classifier scores as probabilities
Softmax function (probabilities must be >= 0 and must sum to 1):

        logits   exp → unnormalized probs   normalize → probabilities
cat       3.2        24.5                     0.13
car       5.1       164.0                     0.87
frog     -1.7         0.18                    0.00

L_i = -log(0.13) = 2.04
Justin Johnson Lecture 3 - 75 September 11, 2019
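A minimal NumPy sketch that reproduces the softmax probabilities and loss above; the shift by max(scores) is a standard numerical-stability trick, not something shown on the slide.

```python
import numpy as np

def cross_entropy_loss(scores, y):
    """Softmax + cross-entropy loss for a single example."""
    shifted = scores - np.max(scores)          # stability: doesn't change probs
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    return -np.log(probs[y]), probs

scores = np.array([3.2, 5.1, -1.7])            # cat, car, frog
loss, probs = cross_entropy_loss(scores, y=0)  # correct class: cat
print(np.round(probs, 2))                      # [0.13 0.87 0.  ]
print(round(loss, 2))                          # 2.04
```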
Cross-Entropy Loss (Multinomial Logistic Regression)
Want to interpret raw classifier scores as probabilities
Softmax probabilities (as above): cat 0.13, car 0.87, frog 0.00; L_i = -log(0.13) = 2.04

Maximum Likelihood Estimation: choose weights to maximize the likelihood of the observed data (see EECS 445 or EECS 545)
Justin Johnson Lecture 3 - 76 September 11, 2019
Cross-Entropy Loss (Multinomial Logistic Regression)
Want to interpret raw classifier scores as probabilities
Softmax probabilities (as above): cat 0.13, car 0.87, frog 0.00
Compare against the correct probabilities: cat 1.00, car 0.00, frog 0.00
Justin Johnson Lecture 3 - 77 September 11, 2019
Cross-Entropy Loss (Multinomial Logistic Regression)
Want to interpret raw classifier scores as probabilities
Softmax probabilities (as above): cat 0.13, car 0.87, frog 0.00
Compare against the correct probabilities (cat 1.00, car 0.00, frog 0.00) using the Kullback–Leibler divergence
Justin Johnson Lecture 3 - 78 September 11, 2019
Cross-Entropy Loss (Multinomial Logistic Regression)
Want to interpret raw classifier scores as probabilities
Softmax probabilities (as above): cat 0.13, car 0.87, frog 0.00
Compare against the correct probabilities (cat 1.00, car 0.00, frog 0.00) using Cross Entropy
Justin Johnson Lecture 3 - 79 September 11, 2019
Cross-Entropy Loss (Multinomial Logistic Regression)
Want to interpret raw classifier scores as probabilities with the softmax function.
Putting it all together, maximize the probability of the correct class:
L_i = -log P(Y = y_i | X = x_i) = -log( exp(s_{y_i}) / Σ_j exp(s_j) )

cat    3.2
car    5.1
frog  -1.7

Justin Johnson Lecture 3 - 80 September 11, 2019


Cross-Entropy Loss (Multinomial Logistic Regression)
L_i = -log( exp(s_{y_i}) / Σ_j exp(s_j) )
Q: What is the min / max possible loss L_i?

Justin Johnson Lecture 3 - 81 September 11, 2019


Cross-Entropy Loss (Multinomial Logistic Regression)
L_i = -log( exp(s_{y_i}) / Σ_j exp(s_j) )
Q: What is the min / max possible loss L_i?
A: Min 0, max +infinity

Justin Johnson Lecture 3 - 82 September 11, 2019


Cross-Entropy Loss (Multinomial Logistic Regression)
L_i = -log( exp(s_{y_i}) / Σ_j exp(s_j) )
Q: If all scores are small random values, what is the loss?

Justin Johnson Lecture 3 - 83 September 11, 2019


Cross-Entropy Loss (Multinomial Logistic Regression)
L_i = -log( exp(s_{y_i}) / Σ_j exp(s_j) )
Q: If all scores are small random values, what is the loss?
A: -log(1/C) = log(C); for CIFAR-10, log(10) ≈ 2.3

Justin Johnson Lecture 3 - 84 September 11, 2019


Cross-Entropy vs SVM Loss
Assume three training examples with scores [10, -2, 3], [10, 9, 9], and [10, -100, -100], where the first class is the correct one.
Q: What is the cross-entropy loss? What is the SVM loss?
A: Cross-entropy loss > 0; SVM loss = 0
Justin Johnson Lecture 3 - 85 September 11, 2019
Cross-Entropy vs SVM Loss
Same three examples: scores [10, -2, 3], [10, 9, 9], [10, -100, -100], first class correct.
Q: What happens to each loss if I slightly change the scores of the last datapoint?
A: Cross-entropy loss will change; SVM loss will stay the same

Justin Johnson Lecture 3 - 87 September 11, 2019




Cross-Entropy vs SVM Loss
Same three examples: scores [10, -2, 3], [10, 9, 9], [10, -100, -100], first class correct.
Q: What happens to each loss if I double the score of the correct class from 10 to 20?
A: Cross-entropy loss will decrease; SVM loss is still 0

Justin Johnson Lecture 3 - 89 September 11, 2019
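For concreteness, a small sketch (my own, reusing the loss definitions from earlier) that evaluates both losses on the three score vectors above, with the first class taken as correct:

```python
import numpy as np

def svm_loss(s, y):
    m = np.maximum(0.0, s - s[y] + 1.0)
    m[y] = 0.0
    return m.sum()

def ce_loss(s, y):
    s = s - np.max(s)
    return -np.log(np.exp(s[y]) / np.sum(np.exp(s)))

for scores in ([10, -2, 3], [10, 9, 9], [10, -100, -100]):
    s = np.array(scores, dtype=float)
    print(svm_loss(s, 0), ce_loss(s, 0))

# SVM loss is 0 in all three cases. Cross-entropy loss is > 0 in exact
# arithmetic and keeps shrinking as the correct class pulls further ahead
# (the last value underflows to 0.0 in float64).
```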




Recap: Three ways to think about linear classifiers
Algebraic Viewpoint: f(x,W) = Wx
Visual Viewpoint: one template per class
Geometric Viewpoint: hyperplanes cutting up space

Justin Johnson Lecture 3 - 91 September 11, 2019


Recap: Loss Functions quantify preferences
- We have some dataset of (x, y)
- We have a score function: s = f(x; W) = Wx (linear classifier)
- We have a loss function:
  Softmax (cross-entropy): L_i = -log( exp(s_{y_i}) / Σ_j exp(s_j) )
  SVM: L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)
  Full loss: L = (1/N) Σ_i L_i + λ R(W)

Justin Johnson Lecture 3 - 92 September 11, 2019


Recap: Loss Functions quantify preferences
- We have some dataset of (x, y), a score function s = f(x; W) = Wx, and a loss function (Softmax, SVM, full loss as above).
Q: How do we find the best W?

Justin Johnson Lecture 3 - 93 September 11, 2019


Next time:
Optimization

Justin Johnson Lecture 3 - 94 September 11, 2019
