EECS 498 FA2019 Lecture 03


Lecture 3: Linear Classifiers
Justin Johnson Lecture 3 - 1 September 11, 2019


Reminder: Assignment 1
• http://web.eecs.umich.edu/~justincj/teaching/eecs498/assignment1.html
• Due Sunday September 15, 11:59pm EST
• We have written a homework validation script to check the format of your .zip file before you submit to Canvas: https://github.com/deepvision-class/tools#homework-validation
• This script ensures that your .zip and .ipynb files are properly structured; it does not check correctness
• It is your responsibility to make sure your submitted .zip file passes the validation script
Justin Johnson Lecture 3 - 2 September 11, 2019
Last time: Image Classification
Input: image
Output: Assign the image to one of a fixed set of categories (e.g. cat, bird, deer, dog, truck)

[Figure: example cat image. This image by Nikita is licensed under CC-BY 2.0]

Justin Johnson Lecture 3 - 3 September 11, 2019


Last Time: Challenges of Recognition
Viewpoint, Illumination, Deformation, Occlusion, Clutter, Intraclass Variation

[Figure: example images for each challenge. Image by Umberto Salvagnin licensed under CC-BY 2.0; image by jonsson licensed under CC-BY 2.0; remaining images are CC0 1.0 public domain]

Justin Johnson Lecture 3 - 4 September 11, 2019


Last time: Data-Driven Approach, kNN

[Figures: 1-NN vs 5-NN classifier decision boundaries; dataset splits into train/test and train/validation/test]
Justin Johnson Lecture 3 - 5 September 11, 2019


Today: Linear Classifiers

Justin Johnson Lecture 3 - 6 September 11, 2019


Neural Network

[Figure: a neural network built out of linear classifiers. This image is CC0 1.0 public domain]
Justin Johnson Lecture 3 - 7 September 11, 2019


Recall CIFAR10
50,000 training images, each image is 32x32x3
10,000 test images
Justin Johnson Lecture 3 - 8 September 11, 2019


Parametric Approach

Image (array of 32x32x3 numbers; 3072 numbers total) → f(x,W) → 10 numbers giving class scores
W: parameters or weights
Justin Johnson Lecture 3 - 9 September 11, 2019


Parametric Approach: Linear Classifier
f(x,W) = Wx

Image (array of 32x32x3 numbers; 3072 numbers total) → f(x,W) → 10 numbers giving class scores
W: parameters or weights
Justin Johnson Lecture 3 - 10 September 11, 2019


Parametric Approach: Linear Classifier
f(x,W) = Wx
x: (3072,)   W: (10, 3072)   f(x,W): (10,)

Image (array of 32x32x3 numbers; 3072 numbers total) → f(x,W) → 10 numbers giving class scores
W: parameters or weights
Justin Johnson Lecture 3 - 11 September 11, 2019


Parametric Approach: Linear Classifier
f(x,W) = Wx + b
x: (3072,)   W: (10, 3072)   b: (10,)   f(x,W): (10,)

Image (array of 32x32x3 numbers; 3072 numbers total) → f(x,W) → 10 numbers giving class scores
W: parameters or weights
Justin Johnson Lecture 3 - 12 September 11, 2019


Example for 2x2 image, 3 classes (cat/dog/ship)
f(x,W) = Wx + b
Stretch pixels into column:
Input image (2, 2) with pixel values [[56, 231], [24, 2]] → x = [56, 231, 24, 2]   (4,)
Justin Johnson Lecture 3 - 13 September 11, 2019


Example for 2x2 image, 3 classes (cat/dog/ship)
f(x,W) = Wx + b
Stretch pixels into column: x = [56, 231, 24, 2]   (4,)

W   (3, 4):
[[0.2, -0.5, 0.1,  2.0],
 [1.5,  1.3, 2.1,  0.0],
 [0.0, 0.25, 0.2, -0.3]]

b   (3,): [1.1, 3.2, -1.2]

Wx + b   (3,): [-96.8, 437.9, 61.95]
Justin Johnson Lecture 3 - 14 September 11, 2019
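As a sanity check, here is a minimal NumPy sketch of the worked example above (my own illustration, not part of the slides). With the bias included, the third score comes out to 60.75; the 61.95 printed on the slide is the value of W·x for that row before the bias is added.

```python
import numpy as np

# Weights: one row per class (cat, dog, ship), one column per pixel.
W = np.array([[0.2, -0.5, 0.1,  2.0],
              [1.5,  1.3, 2.1,  0.0],
              [0.0, 0.25, 0.2, -0.3]])   # shape (3, 4)
b = np.array([1.1, 3.2, -1.2])           # shape (3,)

# The 2x2 image stretched into a column of 4 pixel values.
x = np.array([56.0, 231.0, 24.0, 2.0])   # shape (4,)

scores = W @ x + b                       # shape (3,)
print(scores)                            # ≈ [-96.8, 437.9, 60.75] (cat, dog, ship)
```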


Linear Classifier: Algebraic Viewpoint
f(x,W) = Wx + b
Stretch pixels into column: x = [56, 231, 24, 2]   (4,)
W (3, 4) and b (3,) as in the example above; scores Wx + b = [-96.8, 437.9, 61.95]   (3,)
Justin Johnson Lecture 3 - 15 September 11, 2019


Linear Classifier: Bias Trick
Add an extra one to the data vector; the bias is absorbed into the last column of the weight matrix.
Stretch pixels into column and append a 1: x = [56, 231, 24, 2, 1]   (5,)

W   (3, 5), with b as the last column:
[[0.2, -0.5, 0.1,  2.0,  1.1],
 [1.5,  1.3, 2.1,  0.0,  3.2],
 [0.0, 0.25, 0.2, -0.3, -1.2]]

Wx   (3,): [-96.8, 437.9, 61.95]

Justin Johnson Lecture 3 - 16 September 11, 2019
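A quick sketch of the bias trick, reusing the same numbers (an illustration assuming NumPy, not code from the course):

```python
import numpy as np

W = np.array([[0.2, -0.5, 0.1,  2.0],
              [1.5,  1.3, 2.1,  0.0],
              [0.0, 0.25, 0.2, -0.3]])   # (3, 4)
b = np.array([1.1, 3.2, -1.2])           # (3,)
x = np.array([56.0, 231.0, 24.0, 2.0])   # (4,)

# Bias trick: append a constant 1 to x and absorb b as an extra column of W,
# so that a single matrix multiply computes Wx + b.
W_aug = np.hstack([W, b[:, None]])       # (3, 5)
x_aug = np.append(x, 1.0)                # (5,)

assert np.allclose(W_aug @ x_aug, W @ x + b)
```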


Linear Classifier: Predictions are Linear!
f(x, W) = Wx (ignore bias)

f(cx, W) = W(cx) = c * f(x, W)

Justin Johnson Lecture 3 - 17 September 11, 2019


Linear Classifier: Predictions are Linear!
f(x, W) = Wx (ignore bias)

f(cx, W) = W(cx) = c * f(x, W)

Image → scores [-96.8, 437.8, 62.0]
0.5 * Image → scores [-48.4, 218.9, 31.0] = 0.5 * original scores

Justin Johnson Lecture 3 - 18 September 11, 2019
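A one-line numerical check of this scaling property (ignoring the bias, as the slide does); the shapes match the CIFAR-10 setup but the values here are random:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 3072))   # 10 classes, 3072 pixels
x = rng.standard_normal(3072)

# Halving the image halves every class score.
assert np.allclose(W @ (0.5 * x), 0.5 * (W @ x))
```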


Interpreting a Linear Classifier: Algebraic Viewpoint
f(x,W) = Wx + b
Stretch pixels into column: x = [56, 231, 24, 2]   (4,)
W (3, 4), b (3,) as in the example above; scores = [-96.8, 437.9, 61.95]   (3,)

Justin Johnson Lecture 3 - 19 September 11, 2019


Interpreting a Linear Classifier: Algebraic Viewpoint
f(x,W) = Wx + b
Stretch pixels into column: x = [56, 231, 24, 2]
Equivalently, each row of W can be reshaped back into a 2x2 block that is matched against the image:
W rows as 2x2 templates: [[0.2, -0.5], [0.1, 2.0]], [[1.5, 1.3], [2.1, 0.0]], [[0.0, 0.25], [0.2, -0.3]]
b: [1.1, 3.2, -1.2]
Scores: [-96.8, 437.9, 61.95]

Justin Johnson Lecture 3 - 20 September 11, 2019


Interpreting a Linear Classifier
W rows as 2x2 templates: [[0.2, -0.5], [0.1, 2.0]], [[1.5, 1.3], [2.1, 0.0]], [[0.0, 0.25], [0.2, -0.3]]
b: [1.1, 3.2, -1.2]
Scores: [-96.8, 437.9, 61.95]

Justin Johnson Lecture 3 - 21 September 11, 2019


Interpreting a Linear Classifier: Visual Viewpoint
W rows as 2x2 templates: [[0.2, -0.5], [0.1, 2.0]], [[1.5, 1.3], [2.1, 0.0]], [[0.0, 0.25], [0.2, -0.3]]
b: [1.1, 3.2, -1.2]
Scores: [-96.8, 437.9, 61.95]

Justin Johnson Lecture 3 - 22 September 11, 2019


Interpreting a Linear Classifier: Visual Viewpoint
The linear classifier has one “template” per category: each row of W, reshaped to the image size, is matched against the input by an inner product.
W rows as 2x2 templates: [[0.2, -0.5], [0.1, 2.0]], [[1.5, 1.3], [2.1, 0.0]], [[0.0, 0.25], [0.2, -0.3]]
b: [1.1, 3.2, -1.2]
Scores: [-96.8, 437.9, 61.95]

Justin Johnson Lecture 3 - 23 September 11, 2019


Interpreting a Linear Classifier: Visual Viewpoint
The linear classifier has one “template” per category.
A single template cannot capture multiple modes of the data: e.g. the horse template has 2 heads!
W rows as 2x2 templates: [[0.2, -0.5], [0.1, 2.0]], [[1.5, 1.3], [2.1, 0.0]], [[0.0, 0.25], [0.2, -0.3]]
b: [1.1, 3.2, -1.2]
Scores: [-96.8, 437.9, 61.95]

Justin Johnson Lecture 3 - 24 September 11, 2019
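The visual viewpoint is easy to inspect in code: reshape each row of W into an image and display it. Below is a rough sketch for the CIFAR-10 shape, assuming matplotlib is available; the random W is only a stand-in for actual learned weights, which would show up as blurry per-class templates.

```python
import numpy as np
import matplotlib.pyplot as plt

classes = ['plane', 'car', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck']
W = np.random.randn(10, 3072)              # stand-in for trained weights

fig, axes = plt.subplots(1, 10, figsize=(15, 2))
for ax, row, name in zip(axes, W, classes):
    template = row.reshape(32, 32, 3)
    # Rescale weights to [0, 1] so they can be shown as an image.
    template = (template - template.min()) / (template.max() - template.min())
    ax.imshow(template)
    ax.set_title(name, fontsize=8)
    ax.axis('off')
plt.show()
```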


Interpreting a Linear Classifier: Geometric Viewpoint
f(x,W) = Wx + b
[Figure: the airplane, car, and deer classifier scores plotted against the value of pixel (15, 8, 0) of the 32x32x3 input (3072 numbers total); each score is a linear function of the pixel value]

Justin Johnson Lecture 3 - 25 September 11, 2019


Interpreting a Linear Classifier: Geometric Viewpoint
f(x,W) = Wx + b
[Figure: the plane spanned by pixel (11, 11, 0) and pixel (15, 8, 0) of the 32x32x3 input (3072 numbers total); the line where the car score = 0 is drawn, and the car score increases as we move away from it in one direction]

Justin Johnson Lecture 3 - 26 September 11, 2019


Interpreting a Linear Classifier: Geometric Viewpoint
f(x,W) = Wx + b
[Figure: same plot, annotated with "car template on this line", the line where the car score = 0, and the direction in which the car score increases]

Justin Johnson Lecture 3 - 27 September 11, 2019


Interpreting a Linear Classifier: Geometric Viewpoint
f(x,W) = Wx + b
[Figure: same plot with the airplane-score and deer-score lines added alongside the car-score line]
Justin Johnson Lecture 3 - 28 September 11, 2019
Interpreting a Linear Classifier: Geometric Viewpoint
f(x,W) = Wx + b
Hyperplanes carving up a high-dimensional space
[Figure: class-score hyperplanes partitioning pixel space; plot created using Wolfram Cloud]

Justin Johnson Lecture 3 - 29 September 11, 2019


Hard Cases for a Linear Classifier
Case 1: Class 1 is the first and third quadrants; Class 2 is the second and fourth quadrants
Case 2: Class 1 is 1 <= L2 norm <= 2; Class 2 is everything else
Case 3: Class 1 has three modes; Class 2 is everything else

Justin Johnson Lecture 3 - 30 September 11, 2019


Recall: Perceptron couldn’t learn XOR

X   Y   F(X, Y)
0   0   0
0   1   1
1   0   1
1   1   0

[Figure: the four XOR points plotted in the (x, y) plane]

Justin Johnson Lecture 3 - 31 September 11, 2019
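A small brute-force check of this point (my own sketch, not from the lecture): no setting of (w1, w2, b) over a coarse grid classifies all four XOR points correctly with a linear decision rule.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])               # XOR labels

grid = np.linspace(-3, 3, 61)            # coarse grid over w1, w2, b
best = 0
for w1 in grid:
    for w2 in grid:
        for b in grid:
            pred = (X @ np.array([w1, w2]) + b > 0).astype(int)
            best = max(best, int((pred == y).sum()))
print(best)   # 3: no linear rule gets all 4 XOR points right
```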


Linear Classifier: Three Viewpoints
Algebraic Viewpoint: f(x,W) = Wx
Visual Viewpoint: one template per class
Geometric Viewpoint: hyperplanes cutting up space

Justin Johnson Lecture 3 - 32 September 11, 2019


So Far: Defined a linear score function f(x,W) = Wx + b

Given a W, we can compute class scores for an image x.
But how can we actually choose a good W?

[Figure: 10 class scores for each of three example images (cat, car, frog). Cat image by Nikita is licensed under CC-BY 2.0; car image is CC0 1.0 public domain; frog image is in the public domain]

Justin Johnson Lecture 3 - 33 September 11, 2019


Choosing a good W
f(x,W) = Wx + b

TODO:
1. Use a loss function to quantify how good a value of W is
2. Find a W that minimizes the loss function (optimization)
Justin Johnson Lecture 3 - 34 September 11, 2019
Loss Function
A loss function tells how good our current classifier is:
Low loss = good classifier
High loss = bad classifier
(Also called: objective function; cost function)

Justin Johnson Lecture 3 - 35 September 11, 2019


Loss Function
A loss function tells how good our current classifier is:
Low loss = good classifier
High loss = bad classifier
(Also called: objective function; cost function)
The negative of a loss function is sometimes called a reward function, profit function, utility function, fitness function, etc.

Justin Johnson Lecture 3 - 36 September 11, 2019


Loss Function
Given a dataset of examples {(x_i, y_i)}, i = 1..N, where x_i is an image and y_i is an (integer) label.

A loss function tells how good our current classifier is:
Low loss = good classifier; high loss = bad classifier
(Also called: objective function; cost function. The negative of a loss function is sometimes called a reward function, profit function, utility function, fitness function, etc.)

Justin Johnson Lecture 3 - 37 September 11, 2019


Loss Function
Given a dataset of examples {(x_i, y_i)}, i = 1..N, where x_i is an image and y_i is an (integer) label.
The loss for a single example is L_i(f(x_i, W), y_i).

Justin Johnson Lecture 3 - 38 September 11, 2019


Loss Function
Given a dataset of examples {(x_i, y_i)}, i = 1..N, where x_i is an image and y_i is an (integer) label.
The loss for a single example is L_i(f(x_i, W), y_i).
The loss for the dataset is the average of the per-example losses:
L = (1/N) Σ_i L_i(f(x_i, W), y_i)

Justin Johnson Lecture 3 - 39 September 11, 2019


Multiclass SVM Loss
“The score of the correct class should be higher than all the other scores”
[Figure: the loss plotted as a function of the score for the correct class]

Justin Johnson Lecture 3 - 40 September 11, 2019


Multiclass SVM Loss
“The score of the correct class should be higher than all the other scores”
[Figure: loss vs. score for the correct class, with the highest score among the other classes marked on the axis]
Justin Johnson Lecture 3 - 41 September 11, 2019
Multiclass SVM Loss
“The score of the correct class should be higher than all the other scores”
[Figure: “Hinge Loss”: the loss falls linearly and reaches zero once the correct-class score exceeds the highest score among the other classes by a “margin”]
Justin Johnson Lecture 3 - 42 September 11, 2019
Multiclass SVM Loss
Given an example (x_i, y_i), where x_i is the image and y_i is the label.
Let s = f(x_i, W) be the scores.
Then the SVM loss has the form:
L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)
[Figure: hinge loss with a margin of 1]
Justin Johnson Lecture 3 - 43 September 11, 2019
Multiclass SVM Loss
Given an example (x_i, y_i), where x_i is the image and y_i is the label; let s = f(x_i, W) be the scores.
Then the SVM loss has the form: L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)

Scores for three example images (columns: cat image, car image, frog image):
cat    3.2   1.3   2.2
car    5.1   4.9   2.5
frog  -1.7   2.0  -3.1

Justin Johnson Lecture 3 - 44 September 11, 2019


Multiclass SVM Loss
Cat image (correct class: cat):
L_i = max(0, 5.1 - 3.2 + 1) + max(0, -1.7 - 3.2 + 1)
    = max(0, 2.9) + max(0, -3.9)
    = 2.9 + 0
    = 2.9
Loss: 2.9
Justin Johnson Lecture 3 - 45 September 11, 2019
Multiclass SVM Loss
Car image (correct class: car):
L_i = max(0, 1.3 - 4.9 + 1) + max(0, 2.0 - 4.9 + 1)
    = max(0, -2.6) + max(0, -1.9)
    = 0 + 0
    = 0
Losses so far: 2.9, 0
Justin Johnson Lecture 3 - 46 September 11, 2019
Multiclass SVM Loss
Frog image (correct class: frog):
L_i = max(0, 2.2 - (-3.1) + 1) + max(0, 2.5 - (-3.1) + 1)
    = max(0, 6.3) + max(0, 6.6)
    = 6.3 + 6.6
    = 12.9
Losses so far: 2.9, 0, 12.9
Justin Johnson Lecture 3 - 47 September 11, 2019
Multiclass SVM Loss
Per-example losses: 2.9, 0, 12.9
Loss over the dataset is:
L = (2.9 + 0.0 + 12.9) / 3 = 5.27
Justin Johnson Lecture 3 - 48 September 11, 2019
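A short NumPy sketch of the multiclass SVM loss that reproduces the numbers above (an illustration, not the assignment's required implementation):

```python
import numpy as np

def svm_loss(scores, y, margin=1.0):
    """Multiclass SVM (hinge) loss for a single example.
    scores: (C,) vector of class scores; y: index of the correct class."""
    margins = np.maximum(0.0, scores - scores[y] + margin)
    margins[y] = 0.0                       # the j == y_i term is excluded
    return margins.sum()

# Columns of the slide's table: scores for the cat, car, and frog images.
scores = np.array([[3.2,  1.3,  2.2],     # cat class
                   [5.1,  4.9,  2.5],     # car class
                   [-1.7, 2.0, -3.1]])    # frog class
labels = [0, 1, 2]                        # correct class for each image

losses = [svm_loss(scores[:, i], labels[i]) for i in range(3)]
print(losses)            # ≈ [2.9, 0.0, 12.9]
print(np.mean(losses))   # ≈ 5.27
```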
Multiclass SVM Loss
(Scores as above; per-example losses 2.9, 0, 12.9)
Q: What happens to the loss if the scores for the car image change a bit?
Justin Johnson Lecture 3 - 49 September 11, 2019
Multiclass SVM Loss
Q2: What are the min and max possible loss?
Justin Johnson Lecture 3 - 50 September 11, 2019
Multiclass SVM Loss
Q3: If all the scores were random, what loss would we expect?
Justin Johnson Lecture 3 - 51 September 11, 2019
Multiclass SVM Loss
Q4: What would happen if the sum were over all classes (including j = y_i)?
Justin Johnson Lecture 3 - 52 September 11, 2019
Multiclass SVM Loss
Q5: What if the loss used a mean instead of a sum?
Justin Johnson Lecture 3 - 53 September 11, 2019
Multiclass SVM Loss
Q6: What if we used this loss instead?


Justin Johnson Lecture 3 - 54 September 11, 2019
Multiclass SVM Loss

Q: Suppose we found some W with L = 0. Is it unique?

Justin Johnson Lecture 3 - 55 September 11, 2019


Multiclass SVM Loss

Q: Suppose we found some W with L = 0. Is it unique?

No! 2W also has L = 0!

Justin Johnson Lecture 3 - 56 September 11, 2019


Multiclass SVM Loss
Original W (car image):
L_i = max(0, 1.3 - 4.9 + 1) + max(0, 2.0 - 4.9 + 1)
    = max(0, -2.6) + max(0, -1.9)
    = 0 + 0 = 0
Using 2W instead:
L_i = max(0, 2.6 - 9.8 + 1) + max(0, 4.0 - 9.8 + 1)
    = max(0, -6.2) + max(0, -4.8)
    = 0 + 0 = 0
Justin Johnson Lecture 3 - 57 September 11, 2019
Multiclass SVM Loss
How should we choose between W and 2W if they both perform the same on the training data?
Justin Johnson Lecture 3 - 58 September 11, 2019
Regularization: Beyond Training Error
L(W) = (1/N) Σ_i L_i(f(x_i, W), y_i)
Data loss: model predictions should match training data

Justin Johnson Lecture 3 - 59 September 11, 2019


Regularization: Beyond Training Error
L(W) = (1/N) Σ_i L_i(f(x_i, W), y_i) + λ R(W)
Data loss: model predictions should match training data
Regularization: prevent the model from doing too well on training data

Justin Johnson Lecture 3 - 60 September 11, 2019


Regularization: Beyond Training Error
L(W) = (1/N) Σ_i L_i(f(x_i, W), y_i) + λ R(W)
λ = regularization strength (hyperparameter)
Data loss: model predictions should match training data
Regularization: prevent the model from doing too well on training data

Justin Johnson Lecture 3 - 61 September 11, 2019


Regularization: Beyond Training Error
L(W) = (1/N) Σ_i L_i(f(x_i, W), y_i) + λ R(W)
λ = regularization strength (hyperparameter)
Data loss: model predictions should match training data
Regularization: prevent the model from doing too well on training data

Simple examples (see the sketch after this slide):
L2 regularization: R(W) = Σ_k Σ_l W_{k,l}^2
L1 regularization: R(W) = Σ_k Σ_l |W_{k,l}|
Elastic net (L1 + L2): R(W) = Σ_k Σ_l (β W_{k,l}^2 + |W_{k,l}|)

More complex:
Dropout
Batch normalization
Cutout, Mixup, Stochastic depth, etc…

Justin Johnson Lecture 3 - 62 September 11, 2019
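A minimal sketch of the three simple regularizers listed above, assuming the formulas given there; the example W reuses the earlier 2x2 slide values, and lam and beta are hypothetical hyperparameter choices.

```python
import numpy as np

def l2_reg(W):
    return np.sum(W * W)                  # sum of squared weights

def l1_reg(W):
    return np.sum(np.abs(W))              # sum of absolute weights

def elastic_net(W, beta=0.5):             # beta: hypothetical mixing weight
    return np.sum(beta * W * W + np.abs(W))

W = np.array([[0.2, -0.5, 0.1,  2.0],
              [1.5,  1.3, 2.1,  0.0],
              [0.0, 0.25, 0.2, -0.3]])
lam = 0.1                                 # regularization strength

# The full objective would be: data_loss + lam * R(W), e.g. with R = l2_reg.
print(l2_reg(W), l1_reg(W), elastic_net(W))
```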


Regularization: Beyond Training Error
L(W) = (1/N) Σ_i L_i(f(x_i, W), y_i) + λ R(W), where λ = regularization strength (hyperparameter)

Purpose of Regularization:
- Express preferences among models beyond “minimize training error”
- Avoid overfitting: prefer simple models that generalize better
- Improve optimization by adding curvature

Justin Johnson Lecture 3 - 63 September 11, 2019


Regularization: Expressing Preferences
L2 Regularization

Justin Johnson Lecture 3 - 64 September 11, 2019


Regularization: Expressing Preferences
L2 Regularization
L2 regularization likes to “spread out” the weights

Justin Johnson Lecture 3 - 65 September 11, 2019
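The slide's example isn't reproduced in this transcript, so here is an illustrative pair of weight vectors of my own choosing that makes the same point: both produce the same score on the input below, but L2 regularization prefers the one that spreads the weight across all inputs.

```python
import numpy as np

x  = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])       # all weight on one input
w2 = np.array([0.25, 0.25, 0.25, 0.25])   # weight spread out evenly

assert w1 @ x == w2 @ x == 1.0            # identical scores

print(np.sum(w1 ** 2))   # 1.0   (larger L2 penalty)
print(np.sum(w2 ** 2))   # 0.25  (smaller L2 penalty: L2 prefers w2)
```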


Regularization: Prefer Simpler Models

Justin Johnson Lecture 3 - 66 September 11, 2019


Regularization: Prefer Simpler Models
[Figure: training points in the (x, y) plane with two fits, f1 and f2]
The model f1 fits the training data perfectly.
The model f2 has training error, but is simpler.
Justin Johnson Lecture 3 - 67 September 11, 2019
Regularization: Prefer Simpler Models
[Figure: the same two fits, f1 and f2]
f1 is not a linear model; it could be polynomial regression, etc.
Regularization pushes against fitting the data too well, so that we don’t fit noise in the data.
Justin Johnson Lecture 3 - 68 September 11, 2019
Regularization: Prefer Simpler Models
[Figure: the same two fits, f1 and f2]
Regularization is important! You should (usually) use it.
f1 is not a linear model; it could be polynomial regression, etc.
Regularization pushes against fitting the data too well, so that we don’t fit noise in the data.
Justin Johnson Lecture 3 - 69 September 11, 2019
Cross-Entropy Loss (Multinomial Logistic Regression)
Want to interpret raw classifier scores as probabilities

cat    3.2
car    5.1
frog  -1.7

Justin Johnson Lecture 3 - 70 September 11, 2019


Cross-Entropy Loss (Multinomial Logistic Regression)
Want to interpret raw classifier scores as probabilities
Softmax function: P(Y = k | X = x_i) = exp(s_k) / Σ_j exp(s_j), where s = f(x_i; W)

cat    3.2
car    5.1
frog  -1.7

Justin Johnson Lecture 3 - 71 September 11, 2019


Cross-Entropy Loss (Multinomial Logistic Regression)
Want to interpret raw classifier scores as probabilities
Softmax function: P(Y = k | X = x_i) = exp(s_k) / Σ_j exp(s_j)

Unnormalized log-probabilities / logits:
cat    3.2
car    5.1
frog  -1.7

Justin Johnson Lecture 3 - 72 September 11, 2019


Cross-Entropy Loss (Multinomial Logistic Regression)
Want to interpret raw classifier scores as probabilities
Softmax function (probabilities must be >= 0):

        logits   exp → unnormalized probabilities
cat       3.2        24.5
car       5.1       164.0
frog     -1.7         0.18
Justin Johnson Lecture 3 - 73 September 11, 2019
Cross-Entropy Loss (Multinomial Logistic Regression)
Want to interpret raw classifier scores as probabilities
Softmax function (probabilities must be >= 0 and must sum to 1):

        logits   exp → unnormalized probs   normalize → probabilities
cat       3.2        24.5                     0.13
car       5.1       164.0                     0.87
frog     -1.7         0.18                    0.00
Justin Johnson Lecture 3 - 74 September 11, 2019
Cross-Entropy Loss (Multinomial Logistic Regression)
Want to interpret raw classifier scores as probabilities
Softmax function (probabilities must be >= 0 and must sum to 1):

        logits   exp → unnormalized probs   normalize → probabilities
cat       3.2        24.5                     0.13
car       5.1       164.0                     0.87
frog     -1.7         0.18                    0.00

L_i = -log(0.13) = 2.04
Justin Johnson Lecture 3 - 75 September 11, 2019
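A minimal NumPy sketch that reproduces the softmax probabilities and loss above; the shift by max(scores) is a standard numerical-stability trick, not something shown on the slide.

```python
import numpy as np

def cross_entropy_loss(scores, y):
    """Softmax + cross-entropy loss for a single example."""
    shifted = scores - np.max(scores)          # stability: doesn't change probs
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    return -np.log(probs[y]), probs

scores = np.array([3.2, 5.1, -1.7])            # cat, car, frog
loss, probs = cross_entropy_loss(scores, y=0)  # correct class: cat
print(np.round(probs, 2))                      # [0.13 0.87 0.  ]
print(round(loss, 2))                          # 2.04
```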
Cross-Entropy Loss (Multinomial Logistic Regression)
Want to interpret raw classifier scores as probabilities
Softmax probabilities (as above): cat 0.13, car 0.87, frog 0.00; L_i = -log(0.13) = 2.04

Maximum Likelihood Estimation: choose weights to maximize the likelihood of the observed data (see EECS 445 or EECS 545)
Justin Johnson Lecture 3 - 76 September 11, 2019
Cross-Entropy Loss (Multinomial Logistic Regression)
Want to interpret raw classifier scores as probabilities
Softmax probabilities (as above): cat 0.13, car 0.87, frog 0.00
Compare against the correct probabilities: cat 1.00, car 0.00, frog 0.00
Justin Johnson Lecture 3 - 77 September 11, 2019
Cross-Entropy Loss (Multinomial Logistic Regression)
Want to interpret raw classifier scores as probabilities
Softmax probabilities (as above): cat 0.13, car 0.87, frog 0.00
Compare against the correct probabilities (cat 1.00, car 0.00, frog 0.00) using the Kullback–Leibler divergence
Justin Johnson Lecture 3 - 78 September 11, 2019
Cross-Entropy Loss (Multinomial Logistic Regression)
Want to interpret raw classifier scores as probabilities
Softmax probabilities (as above): cat 0.13, car 0.87, frog 0.00
Compare against the correct probabilities (cat 1.00, car 0.00, frog 0.00) using Cross Entropy
Justin Johnson Lecture 3 - 79 September 11, 2019
Cross-Entropy Loss (Multinomial Logistic Regression)
Want to interpret raw classifier scores as probabilities with the softmax function.
Putting it all together, maximize the probability of the correct class:
L_i = -log P(Y = y_i | X = x_i) = -log( exp(s_{y_i}) / Σ_j exp(s_j) )

cat    3.2
car    5.1
frog  -1.7

Justin Johnson Lecture 3 - 80 September 11, 2019


Cross-Entropy Loss (Multinomial Logistic Regression)
L_i = -log( exp(s_{y_i}) / Σ_j exp(s_j) )
Q: What is the min / max possible loss L_i?

Justin Johnson Lecture 3 - 81 September 11, 2019


Cross-Entropy Loss (Multinomial Logistic Regression)
L_i = -log( exp(s_{y_i}) / Σ_j exp(s_j) )
Q: What is the min / max possible loss L_i?
A: Min 0, max +infinity

Justin Johnson Lecture 3 - 82 September 11, 2019


Cross-Entropy Loss (Multinomial Logistic Regression)
L_i = -log( exp(s_{y_i}) / Σ_j exp(s_j) )
Q: If all scores are small random values, what is the loss?

Justin Johnson Lecture 3 - 83 September 11, 2019


Cross-Entropy Loss (Multinomial Logistic Regression)
L_i = -log( exp(s_{y_i}) / Σ_j exp(s_j) )
Q: If all scores are small random values, what is the loss?
A: -log(1/C) = log(C); for CIFAR-10, log(10) ≈ 2.3

Justin Johnson Lecture 3 - 84 September 11, 2019


Cross-Entropy vs SVM Loss
Assume three training examples with scores [10, -2, 3], [10, 9, 9], and [10, -100, -100], where the first class is the correct one.
Q: What is the cross-entropy loss? What is the SVM loss?
A: Cross-entropy loss > 0; SVM loss = 0
Justin Johnson Lecture 3 - 85 September 11, 2019
Cross-Entropy vs SVM Loss
Same three examples: scores [10, -2, 3], [10, 9, 9], [10, -100, -100], first class correct.
Q: What happens to each loss if I slightly change the scores of the last datapoint?
A: Cross-entropy loss will change; SVM loss will stay the same

Justin Johnson Lecture 3 - 87 September 11, 2019




Cross-Entropy vs SVM Loss
Same three examples: scores [10, -2, 3], [10, 9, 9], [10, -100, -100], first class correct.
Q: What happens to each loss if I double the score of the correct class from 10 to 20?
A: Cross-entropy loss will decrease; SVM loss is still 0

Justin Johnson Lecture 3 - 89 September 11, 2019
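For concreteness, a small sketch (my own, reusing the loss definitions from earlier) that evaluates both losses on the three score vectors above, with the first class taken as correct:

```python
import numpy as np

def svm_loss(s, y):
    m = np.maximum(0.0, s - s[y] + 1.0)
    m[y] = 0.0
    return m.sum()

def ce_loss(s, y):
    s = s - np.max(s)
    return -np.log(np.exp(s[y]) / np.sum(np.exp(s)))

for scores in ([10, -2, 3], [10, 9, 9], [10, -100, -100]):
    s = np.array(scores, dtype=float)
    print(svm_loss(s, 0), ce_loss(s, 0))

# SVM loss is 0 in all three cases. Cross-entropy loss is > 0 in exact
# arithmetic and keeps shrinking as the correct class pulls further ahead
# (the last value underflows to 0.0 in float64).
```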




Recap: Three ways to think about linear classifiers
Algebraic Viewpoint: f(x,W) = Wx
Visual Viewpoint: one template per class
Geometric Viewpoint: hyperplanes cutting up space

Justin Johnson Lecture 3 - 91 September 11, 2019


Recap: Loss Functions quantify preferences
- We have some dataset of (x, y)
- We have a score function: s = f(x; W) = Wx (linear classifier)
- We have a loss function:
  Softmax (cross-entropy): L_i = -log( exp(s_{y_i}) / Σ_j exp(s_j) )
  SVM: L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)
  Full loss: L = (1/N) Σ_i L_i + λ R(W)

Justin Johnson Lecture 3 - 92 September 11, 2019


Recap: Loss Functions quantify preferences
- We have some dataset of (x, y), a score function s = f(x; W) = Wx, and a loss function (Softmax, SVM, full loss as above).
Q: How do we find the best W?

Justin Johnson Lecture 3 - 93 September 11, 2019


Next time:
Optimization

Justin Johnson Lecture 3 - 94 September 11, 2019
