Lecture 4 - Linear Classification
Supervised Learning:
Linear classification
Dr. Rajdip Nayek
Block V, Room 418D
Department of Applied Mechanics
Indian Institute of Technology Delhi
E-mail: rajdipn@am.iitd.ac.in
Introduction
Learning goals
• Be able to specify weights and biases by hand to represent simple Boolean functions
(e.g. AND, OR, NOT)
• Understand why classification error and squared error are problematic cost functions
for classification.
• Know what cross-entropy is and understand why it can be easier to optimize than
squared error (assuming a logistic activation function)
• Be able to derive the gradient descent updates for all of the models and cost functions
mentioned in this lecture
Binary linear classification
• Training set: $N$ pairs of examples, that is, $\left\{\left(\boldsymbol{x}^{(i)}, t^{(i)}\right)\right\}_{i=1}^{N}$
• $t^{(i)}$ can take values in $\{0, 1\}$ (because it is binary valued)
$y = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z \geq 0 \end{cases}$
[Plot: hard-threshold output $y$ versus $z$; $y$ jumps from 0 to 1 at $z = 0$]
Binary linear classification
• This is a special case of the single-neuron processing unit; this model is also called a perceptron
$z = \boldsymbol{w}^T \boldsymbol{x} + b = w_1 x_1 + \cdots + w_K x_K + b$
$y = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z \geq 0 \end{cases}$
[Diagram: perceptron with input nodes $x_1, \ldots, x_K$, weights $w_1, \ldots, w_K$, bias $b$ (a weight on a constant input 1), and output $y$]
Binary linear classification
• The same single-neuron unit can be drawn with a general nonlinear activation function $\sigma(\cdot)$ applied to the pre-activation $z$:
$z = \boldsymbol{w}^T \boldsymbol{x} + b, \qquad y = \sigma(\boldsymbol{w}^T \boldsymbol{x} + b)$
[Diagram: input nodes and weights feed the pre-activation $z$, which passes through a nonlinear activation function $\sigma(z)$ to produce the output $y$]
Ex: Whether to watch a movie or not?
• Consider the task of deciding whether or not to watch a movie
• So what could the bias mean in this case?
Example of AND logic gate
• All computers operate using logic gates; can we linearly classify the outputs of logic gates?
• We can use a perceptron to classify the outputs of several logic gates such as AND, NOT, OR, etc. These are simple functions, often used in logical reasoning, with binary outputs (1/0 or Yes/No); a small code sketch follows the truth table below
• AND logic gate
  x1  x2 | y
   0   0 | 0
   1   0 | 0
   0   1 | 0
   1   1 | 1

• For this 2-D input problem, a line is able to separate the inputs
[Plot: input space with axes $x_1$ and $x_2$ from 0.0 to 1.0; the point $(1, 1)$ can be separated from the other three points by a line]
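To connect this to the learning goal of specifying weights and biases by hand, here is a minimal sketch in NumPy (the particular weight and bias values are just one illustrative choice) of a hard-threshold perceptron implementing the AND, OR, and NOT gates:

```python
import numpy as np

def perceptron(x, w, b):
    """Hard-threshold unit: y = 1 if w^T x + b >= 0, else 0."""
    return int(np.dot(w, x) + b >= 0)

# Hand-picked weights and biases (one possible choice)
AND = dict(w=np.array([1.0, 1.0]), b=-1.5)   # fires only when x1 = x2 = 1
OR  = dict(w=np.array([1.0, 1.0]), b=-0.5)   # fires when at least one input is 1
NOT = dict(w=np.array([-1.0]),     b=0.5)    # fires when the single input is 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, perceptron(np.array([x1, x2]), **AND))
# prints the AND truth table: 0 0 0 / 1 0 0 / 0 1 0 / 1 1 1
```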
Adding nonlinear input features
• Perceptrons cannot learn certain functions that are not linearly separable
• 1D example: [figure showing a 1-D input pattern that is not linearly separable]
XOR problem using nonlinear input features
• XOR is not linearly separable
• Introduce a nonlinear input feature $x_3 = x_1 x_2$ (see the truth table and the sketch below)
  x1  x2  x3 = x1·x2 | y
   0   0       0     | 0
   1   0       0     | 1
   0   1       0     | 1
   1   1       1     | 0

[Plot: in the augmented input space $(x_1, x_2, x_3)$, the four XOR points become linearly separable]
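As a sketch of how the extra feature helps (again in NumPy; the weights below are one hand-picked solution, not the only one), a single perceptron on the augmented input $(x_1, x_2, x_3)$ reproduces the XOR truth table:

```python
import numpy as np

def perceptron(x, w, b):
    # Hard-threshold unit on the augmented input (x1, x2, x3)
    return int(np.dot(w, x) + b >= 0)

# One hand-picked solution: z = x1 + x2 - 2*x3 - 0.5
w, b = np.array([1.0, 1.0, -2.0]), -0.5

for x1 in (0, 1):
    for x2 in (0, 1):
        x3 = x1 * x2                      # nonlinear input feature
        y = perceptron(np.array([x1, x2, x3]), w, b)
        print(x1, x2, y)                  # reproduces the XOR truth table
```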
Recap and what’s next
• We looked at the perceptron for binary linear classification
• Limitations of the perceptron
Loss functions for binary classification
• Data: $N$ examples $\left\{\left(\boldsymbol{x}^{(i)}, t^{(i)}\right)\right\}_{i=1}^{N}$, with $\boldsymbol{x} \in \mathbb{R}^{K}$ and $t \in \{0, 1\}$
• Model: $z = \boldsymbol{w}^T \boldsymbol{x} + b = w_1 x_1 + \cdots + w_K x_K + b, \qquad y = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z \geq 0 \end{cases}$
• 0-1 loss: $\mathcal{L}_{0\text{-}1} = \dfrac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left[t^{(i)} \neq y^{(i)}\right]$, where $\mathbb{I}[t \neq y] = \begin{cases} 0 & \text{if } t = y \\ 1 & \text{if } t \neq y \end{cases}$
[Plot: $\mathbb{I}[t \neq y]$ as a function of $z$]
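A minimal sketch of the 0-1 loss in NumPy (the arrays below are illustrative 0/1 labels):

```python
import numpy as np

def zero_one_loss(t, y):
    """Average number of misclassified examples: (1/N) * sum I[t != y]."""
    return np.mean(t != y)

t = np.array([0, 1, 1, 0])
y = np.array([0, 1, 0, 0])
print(zero_one_loss(t, y))   # 0.25 -> one of four examples is misclassified
```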
Problems with 0-1 loss function
• Let’s look at the derivative of the $\mathcal{L}_{0\text{-}1}$ loss function
• Chain rule: $\dfrac{d\mathcal{L}_{0\text{-}1}}{d\boldsymbol{w}} = \dfrac{d\mathcal{L}_{0\text{-}1}}{dz} \dfrac{dz}{d\boldsymbol{w}}$
• Even if you did calculate $\dfrac{d\mathcal{L}_{0\text{-}1}}{dz}$ on either side of zero, its value would be $0$, since $\mathcal{L}_{0\text{-}1}$ is constant on either side
• So almost all points have zero gradient!!
• Can we get a more useful gradient signal? Yes we can. First, define a logistic (or sigmoid) activation function over $z$ to get a smooth output between 0 and 1:
$y = \sigma(z) = \dfrac{1}{1 + e^{-z}}$
[Plot: sigmoid $y = \sigma(z)$ versus $z$; it rises smoothly from 0 to 1 and equals 0.5 at $z = 0$]
• A linear model with sigmoid nonlinearity and a continuous squared-error loss function:
$z = \boldsymbol{w}^T \boldsymbol{x} + b, \qquad y = \sigma(z), \qquad \mathcal{L}_{SE} = \dfrac{1}{2N} \sum_{i=1}^{N} \left(t^{(i)} - y^{(i)}\right)^2$
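A small NumPy sketch of this model and loss (the single data point is an illustrative placeholder); it also previews the next slide's point that the gradient, obtained by the chain rule, is small when the sigmoid is badly saturated:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_loss_and_grad(w, b, X, t):
    """Squared-error loss (1/2N) * sum (t - y)^2 and its gradients w.r.t. w and b."""
    N = X.shape[0]
    y = sigmoid(X @ w + b)
    loss = np.sum((t - y) ** 2) / (2 * N)
    # Chain rule: dL/dz^(i) = dL/dy^(i) * dy^(i)/dz^(i) = (y - t)/N * y * (1 - y)
    dz = (y - t) / N * y * (1 - y)
    return loss, X.T @ dz, np.sum(dz)

# One badly mis-classified example: t = 1 but z = w*x + b = -4, so y ~ 0.018.
# The returned gradient is small, because dy/dz ~ 0 at the sigmoid's tail.
X, t = np.array([[4.0]]), np.array([1.0])
print(se_loss_and_grad(np.array([-1.0]), 0.0, X, t))
```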
Gradient problems with sigmoid activation
• Gradient using the chain rule:
$\dfrac{d\mathcal{L}_{SE}}{d\boldsymbol{w}} = \sum_{i=1}^{N} \dfrac{d\mathcal{L}_{SE}}{dy^{(i)}} \dfrac{dy^{(i)}}{dz^{(i)}} \dfrac{dz^{(i)}}{d\boldsymbol{w}}$
• The gradient is very small at the two tails of the sigmoid $y = \sigma(z) = \dfrac{1}{1 + e^{-z}}$
• E.g. you predict $z = -4$ for an example whose target is $t = 1$ (a big mistake), but this badly mis-classified example has only a tiny effect on the training algorithm, because the gradient $\dfrac{dy}{dz}$ is nearly zero there
• The sigmoid activation does not have a strong gradient signal at the tails
Problems with squared error loss function
• It does not distinguish bad predictions from extremely bad predictions
• E.g. if $t = 1$, then the predictions $y = 0.01$ and $y = 0.001$ have roughly the same squared-error loss $(t - y)^2$, even though $y = 0.001$ is more wrong
• From the perspective of optimization, the fact that these losses are nearly equivalent is a big problem: the loss function is not very sensitive to the parameters, and hence training the parameters becomes difficult
Cross-entropy loss function
• The problem with squared-error loss is that it treats 𝑦 = 0.01 and 𝑦 = 0.0001 as
nearly equivalent for 𝑡 = 1
• We’d like a loss function which makes these differences look very different
$\mathcal{L}_{CE} = \begin{cases} -\log y & \text{if } t = 1 \\ -\log(1 - y) & \text{if } t = 0 \end{cases}$
[Plot: cross-entropy loss as a function of $y$ for $t = 1$ and $t = 0$]
• For the example, $\mathcal{L}_{CE}(0.01, 1) = 4.6$ and $\mathcal{L}_{CE}(0.0001, 1) = 9.2$, so CE treats the latter as much worse
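A quick numerical check of this comparison (assuming NumPy): the squared-error losses at $y = 0.01$ and $y = 0.0001$ are nearly identical, while the cross-entropy losses differ by a factor of two:

```python
import numpy as np

def se(y, t):
    return (t - y) ** 2

def ce(y, t):
    return -t * np.log(y) - (1 - t) * np.log(1 - y)

for y in (0.01, 0.0001):
    print(f"y={y}: SE={se(y, 1):.4f}, CE={ce(y, 1):.2f}")
# y=0.01:   SE=0.9801, CE=4.61
# y=0.0001: SE=0.9998, CE=9.21
```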
• Model and loss (combining both cases of the targets into one expression):
$z = \boldsymbol{w}^T \boldsymbol{x} + b, \qquad y = \sigma(z) = \dfrac{1}{1 + e^{-z}}$
$\mathcal{L}_{CE} = -\dfrac{1}{N} \sum_{i=1}^{N} \left[ t^{(i)} \log y^{(i)} + \left(1 - t^{(i)}\right) \log\left(1 - y^{(i)}\right) \right]$
• Gradient using the chain rule:
$\dfrac{d\mathcal{L}_{CE}}{d\boldsymbol{w}} = \sum_{i=1}^{N} \dfrac{d\mathcal{L}_{CE}}{dy^{(i)}} \dfrac{dy^{(i)}}{dz^{(i)}} \dfrac{dz^{(i)}}{d\boldsymbol{w}}, \qquad \dfrac{d\mathcal{L}_{CE}}{dy^{(i)}} = \dfrac{1}{N}\left(-\dfrac{t^{(i)}}{y^{(i)}} + \dfrac{1 - t^{(i)}}{1 - y^{(i)}}\right), \qquad \dfrac{dz^{(i)}}{d\boldsymbol{w}} = \boldsymbol{x}^{(i)}$
• These pieces give the gradient descent updates for $\boldsymbol{w}$
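Putting the pieces together, here is a minimal NumPy sketch of the resulting gradient descent updates (the toy data, learning rate, and step count are placeholders). It uses the standard simplification that, for the sigmoid plus cross-entropy, the chain-rule factors multiply out to $d\mathcal{L}_{CE}/dz^{(i)} = (y^{(i)} - t^{(i)})/N$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, t, lr=0.5, steps=2000):
    """Gradient descent on the average cross-entropy loss."""
    N, K = X.shape
    w, b = np.zeros(K), 0.0
    for _ in range(steps):
        y = sigmoid(X @ w + b)
        # For sigmoid + cross-entropy the chain rule simplifies to dL/dz = (y - t)/N
        dz = (y - t) / N
        w -= lr * (X.T @ dz)          # dz/dw = x^(i)
        b -= lr * np.sum(dz)          # dz/db = 1
    return w, b

# Toy usage: learn the OR function (linearly separable)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 1, 1, 1], dtype=float)
w, b = fit_logistic(X, t)
print((sigmoid(X @ w + b) >= 0.5).astype(int))   # expected: [0 1 1 1]
```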
Multiclass classification
• So far we’ve talked about binary classification, but most classification problems
involve more than two categories
• Fortunately, this doesn’t require any new ideas. Everything works pretty much the same way as in the binary case
Multiclass classification
• Each target is now encoded as a vector with one entry per class and a single 1 in the entry for the correct class (a one-hot encoding):
$\boldsymbol{t} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \; \begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \; \ldots, \; \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}$
Multiclass classification
• Now there are 𝐾 input dimensions and 𝐿 output dimensions, so we need 𝐾 × 𝐿
weights, which we arrange as a weight matrix 𝐖
• Component-wise: $z_l = \sum_{k=1}^{K} w_{lk} x_k + b_l$
• Vectorized: $\boldsymbol{z} = \mathbf{W}\boldsymbol{x} + \boldsymbol{b}$
• The outputs are obtained by passing $\boldsymbol{z}$ through a softmax function:
$y_l = \mathrm{softmax}(z_1, \cdots, z_L)_l = \dfrac{e^{z_l}}{\sum_j e^{z_j}}$
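A small NumPy sketch of the softmax; subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(z):
    """softmax(z)_l = exp(z_l) / sum_j exp(z_j), computed stably."""
    z = z - np.max(z)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
y = softmax(z)
print(y, y.sum())              # probabilities over the L classes; sums to 1.0
```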
Multiclass classification
• Compute linear predictions:
• Component-wise: $z_l = \sum_{k=1}^{K} w_{lk} x_k + b_l$
• Vectorized: $\boldsymbol{z} = \mathbf{W}\boldsymbol{x} + \boldsymbol{b}$
• Apply the softmax to get $\boldsymbol{y}$, then compute the cross-entropy loss:
$\mathcal{L}_{CE} = -\sum_{i=1}^{N} \sum_{l=1}^{L} t_l^{(i)} \log y_l^{(i)} = -\sum_{i=1}^{N} \boldsymbol{t}^{(i)T} \log \boldsymbol{y}^{(i)}$
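A minimal NumPy sketch of this multiclass forward pass and loss (the toy data and shapes are illustrative; targets are assumed to be one-hot rows):

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # row-wise stability shift
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def multiclass_ce(X, T, W, b):
    """L_CE = -sum_i t^(i)^T log y^(i), with z = W x + b and y = softmax(z)."""
    Z = X @ W.T + b                        # N x L pre-activations
    Y = softmax(Z)
    return -np.sum(T * np.log(Y))

# Toy usage: N = 2 examples, K = 3 inputs, L = 2 classes
X = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]])
T = np.array([[1, 0], [0, 1]])             # one-hot targets
W = np.zeros((2, 3)); b = np.zeros(2)
print(multiclass_ce(X, T, W, b))           # 2 * log(2) at the uniform start
```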