
APL 745

Supervised Learning:
Linear classification
Dr. Rajdip Nayek
Block V, Room 418D
Department of Applied Mechanics
Indian Institute of Technology Delhi

E-mail: rajdipn@am.iitd.ac.in
Introduction

• In regression, the goal was to predict a scalar-valued output from a set of input features. The question there was “how much?”

• Here we will focus on classification, where the goal is to predict a category. Here the question is “which one?”

• Some examples of classification

• Whether an email is spam or not spam?

• Is this customer likely to sign up for a subscription or not?

• Does this image depict a donkey, a dog, a cat, or a camel?

• Which movie are you most likely to watch next?

Learning goals

• Know what is meant by binary linear classification

• Be able to specify weights and biases by hand to represent simple Boolean functions
(e.g. AND, OR, NOT)

• Be aware of the limitations of linear classifiers

• Understand why classification error and squared error are problematic cost functions
for classification.

Learning goals

• Know what cross-entropy is and understand why it can be easier to optimize than
squared error (assuming a logistic activation function)

• Be able to derive the gradient descent updates for all of the models and cost functions
mentioned in this lecture

• Understand how multi-class classification works

Binary linear classification

• Classification: Predict a class label (or a discrete-valued target)

• Binary: Predict a binary output 𝑦 ∈ {0, 1} or {No, Yes}


- Distinguish between two categories

• Linear: The classification is done using a linear function of the input features 𝒙 ∈ ℝ^𝐾

• Training set: 𝑁 pairs of examples, that is, {(𝒙^(𝑖), 𝑡^(𝑖))}_{𝑖=1}^𝑁
• 𝑡^(𝑖) can take values in {0, 1} (because it is binary valued)

Goal: To correctly classify all the 𝑁 training examples (and hopefully the ones in the test set as well)
Binary linear classification
• In linear classification, we choose a linear model, which determines the output
predictions 𝑦 from the input features 𝒙
𝑧 = 𝒘^𝑇𝒙 + 𝑏 = 𝑤_1𝑥_1 + ⋯ + 𝑤_𝐾𝑥_𝐾 + 𝑏

• For classification, we apply a threshold on 𝑧 to get 𝑦

𝑦 = 0 if 𝑧 < 0, and 𝑦 = 1 if 𝑧 ≥ 0

[Figure: the hard-threshold output 𝑦 as a function of 𝑧; 𝑦 jumps from 0 to 1 at 𝑧 = 0]
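A minimal sketch of this thresholded linear model in code (the weight and input values below are illustrative, not from the slides):

```python
import numpy as np

def predict(x, w, b):
    """Hard-threshold linear classifier: y = 1 if w.x + b >= 0, else 0."""
    z = np.dot(w, x) + b
    return 1 if z >= 0 else 0

# Illustrative weights and bias for a 2-feature input
w = np.array([1.0, -2.0])
b = 0.5
print(predict(np.array([2.0, 0.5]), w, b))  # z = 2.0 - 1.0 + 0.5 = 1.5 -> 1
print(predict(np.array([0.0, 1.0]), w, b))  # z = -2.0 + 0.5 = -1.5 -> 0
```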
Binary linear classification

• This is basically a special case of the single neuron processing unit; this is also
called a perceptron
𝑧 = 𝒘^𝑇𝒙 + 𝑏 = 𝑤_1𝑥_1 + ⋯ + 𝑤_𝐾𝑥_𝐾 + 𝑏

𝑦 = 0 if 𝑧 < 0, and 𝑦 = 1 if 𝑧 ≥ 0

[Figure: a single-neuron unit with input nodes 𝑥_1, …, 𝑥_𝐾, weights 𝑤_1, …, 𝑤_𝐾, a bias 𝑏 attached to a constant input 1, and output 𝑦]
Binary linear classification

• The same single-neuron unit can also be drawn with a nonlinear activation function 𝜎 applied to 𝑧, giving 𝑦 = 𝜎(𝒘^𝑇𝒙 + 𝑏)

[Figure: the single-neuron unit with input nodes, weights, bias, and a nonlinear activation function 𝜎(𝑧) producing the output 𝑦]
Ex: Whether to watch a movie or not?
• Consider the task of deciding whether to watch a movie or not

• Suppose there are three decision inputs, 𝒙 ∈ ℝ^3 (e.g. 𝑥_1: the actor is Matt Damon, 𝑥_2: the genre is thriller, 𝑥_3: the director is Christopher Nolan)

[Figure: a single-neuron unit with decision inputs 𝑥_1, 𝑥_2, 𝑥_3, weights 𝑤_1, 𝑤_2, 𝑤_3, bias 𝑏, and output 𝑦 = “watch a movie or not?”]

• Based on your past viewing experience, you may give a high weight to 𝑥_3 compared to the other inputs (because you like Nolan’s movies)

• In such a case, you may still watch the movie even if it does not have Matt Damon as an actor and the genre is not thriller, i.e. 𝑥_1 = 0, 𝑥_2 = 0, 𝑥_3 = 1
Ex: Whether to watch a movie or not?
• So what could the bias mean in this case?

• Suppose you are a movie buff (you like to watch any movie); then your bias 𝑏 = 0

• Alternatively, if you only watch movies where the actor is Matt Damon, the genre is thriller, and the director is Christopher Nolan, then your bias 𝑏 = −3

• So the weights and bias depend on data (user history in this case), and that is why we want to learn the weights and bias for prediction
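A small sketch of how the bias shifts the decision in this toy example (the unit weights, feature values, and function name are illustrative assumptions, not given on the slides):

```python
import numpy as np

def watch_movie(x, w, b):
    """Threshold unit: watch (1) if w.x + b >= 0, else skip (0)."""
    return 1 if np.dot(w, x) + b >= 0 else 0

w = np.array([1.0, 1.0, 1.0])     # equal weights on the three decision inputs
x = np.array([0.0, 0.0, 1.0])     # not Matt Damon, not a thriller, but a Nolan film

print(watch_movie(x, w, b=0.0))   # movie buff (b = 0):   z = 1 >= 0 -> watch
print(watch_movie(x, w, b=-3.0))  # picky viewer (b = -3): z = -2 < 0 -> skip
```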
Example of AND logic gate
• All computers operate using logic gates; can we linearly classify the outputs of logic gates?

• We can use a perceptron to classify outputs from several logic gates such as AND, NOT,
OR, etc. These are simple functions which are often used in logical reasoning with binary
outputs (1-0 or Yes/No)
• AND logic gate

  Input 𝑥1   Input 𝑥2   Output 𝑦
      0          0          0
      1          0          0
      0          1          0
      1          1          1

[Figure: input space (𝑥1, 𝑥2); only the point (1, 1) belongs to class 1 and can be separated from the other three points by a line]

• Linear classification implies that the training examples can be separated by a hyperplane (in 2-D, it is a line)
Example of OR logic gate
  Input 𝑥1   Input 𝑥2   Output 𝑦
      0          0          0
      1          0          1
      0          1          1
      1          1          1

[Figure: input space (𝑥1, 𝑥2); only the point (0, 0) belongs to class 0 and can be separated from the rest by a line]

• For this 2-D input problem, a line is able to separate the inputs

• Hence, OR gate can also be classified using a perceptron


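A sketch of hand-chosen weights and biases that realize AND, OR, and NOT with a threshold unit (these particular values are one valid choice among many, not the only ones):

```python
def step(z):
    """Hard threshold: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def AND(x1, x2):
    return step(1.0 * x1 + 1.0 * x2 - 1.5)   # fires only when both inputs are 1

def OR(x1, x2):
    return step(1.0 * x1 + 1.0 * x2 - 0.5)   # fires when at least one input is 1

def NOT(x1):
    return step(-1.0 * x1 + 0.5)             # flips a single binary input

for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b))     # reproduces the AND and OR truth tables
print(NOT(0), NOT(1))                        # 1 0
```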
How about the XOR gate?
• Can a line separate the two classes of outputs?

  Input 𝑥1   Input 𝑥2   Output 𝑦
      0          0          0
      1          0          1
      0          1          1
      1          1          0

[Figure: input space (𝑥1, 𝑥2); the points (1, 0) and (0, 1) belong to class 1 while (0, 0) and (1, 1) belong to class 0, and no single line separates the two classes]

• XOR is not linearly separable
Adding nonlinear input features
• Perceptrons cannot learn functions which are not linearly separable

• 1-D example: [Figure: a 1-D dataset that is not linearly separable]

• Including nonlinear features will make it separable: the problem now becomes 2-D, and a line can separate the classes
XOR problem using nonlinear input features
• XOR is not linearly separable
• Introduce a nonlinear input feature 𝑥3 = 𝑥1𝑥2

  Input 𝑥1   Input 𝑥2   Input 𝑥3 = 𝑥1𝑥2   Output 𝑦
      0          0             0              0
      1          0             0              1
      0          1             0              1
      1          1             1              0

[Figure: the 3-D input space (𝑥1, 𝑥2, 𝑥3), in which the two classes can be separated by a plane]

• A plane can separate the training data
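A quick sketch verifying that one choice of weights separates XOR once the product feature is added (the particular weights 𝒘 = [1, 1, −2] and bias 𝑏 = −0.5 are an illustrative assumption, not from the slides):

```python
import numpy as np

w = np.array([1.0, 1.0, -2.0])   # weights on x1, x2, and x3 = x1*x2
b = -0.5

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    x = np.array([x1, x2, x1 * x2])          # augmented input with the nonlinear feature
    y = 1 if np.dot(w, x) + b >= 0 else 0    # thresholded linear model in 3-D
    print(x1, x2, "->", y)                   # reproduces XOR: 0, 1, 1, 0
```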
Recap and what’s next
• We looked at perceptron for binary linear classification

• Limitations of perceptron

• It can only distinguish between two categories (i.e. 0/1)

• No guarantees if the data is not linearly separable

• Need to add nonlinear features to address limitations

• Now the question is how do we optimize the parameters of the model

• What loss function to use? Recall that we had defined a loss function in regression

• Also, how do we do multi-class classification instead of binary classification?
Loss functions for binary classification
• Data: 𝑁 examples {(𝒙^(𝑖), 𝑡^(𝑖))}_{𝑖=1}^𝑁, with 𝒙 ∈ ℝ^𝐾 and 𝑦 ∈ {0, 1}

• Model: 𝑧 = 𝒘^𝑇𝒙 + 𝑏 = 𝑤_1𝑥_1 + ⋯ + 𝑤_𝐾𝑥_𝐾 + 𝑏, with 𝑦 = 0 if 𝑧 < 0 and 𝑦 = 1 if 𝑧 ≥ 0

• Loss function: A seemingly obvious loss function is 0-1 loss

ℒ_{0−1} = (1/𝑁) Σ_{𝑖=1}^𝑁 𝕀[𝑡^(𝑖) ≠ 𝑦^(𝑖)]

where 𝕀[𝑡 ≠ 𝑦] = 0 if 𝑡 = 𝑦, and 1 if 𝑡 ≠ 𝑦

• The ℒ_{0−1} loss function calculates the fraction of misclassified examples

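A small sketch of the 0-1 loss for a batch of examples (the array names and data are illustrative):

```python
import numpy as np

def zero_one_loss(t, y):
    """Fraction of misclassified examples: mean of the indicator t != y."""
    return np.mean(t != y)

t = np.array([1, 0, 1, 1])   # targets
y = np.array([1, 1, 0, 1])   # hard predictions from the threshold unit
print(zero_one_loss(t, y))   # 2 mistakes out of 4 -> 0.5
```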
Problems with 0-1 loss function
• Let’s look at the derivative of the ℒ0−1 loss function

• Chain rule: 𝑑ℒ_{0−1}/𝑑𝒘 = (𝑑ℒ_{0−1}/𝑑𝑧)(𝑑𝑧/𝑑𝒘)

• We can’t find a derivative or gradient of a discontinuous function

• Even if you did calculate 𝑑ℒ_{0−1}/𝑑𝑧 on either side of zero, its value is 0, since ℒ_{0−1} is constant on either side
• So almost all points have zero gradient!!

• This also means that a small change in the weights would not change the loss function

So a 0-1 loss function is useless for practical optimization


A squared loss function with sigmoid neuron
• Can we instead work with a continuous squared loss function (just like in regression)?

• Yes we can. First, define a logistic (or sigmoid) activation function over 𝑧 to squash it into (0, 1):

𝑦 = 𝜎(𝑧) = 1 / (1 + 𝑒^(−𝑧))

[Figure: the sigmoid 𝑦 = 𝜎(𝑧), which rises smoothly from 0 to 1 as 𝑧 goes from −4 to 4]

• A linear model with sigmoid nonlinearity and a continuous squared error loss function:
𝑧 = 𝒘^𝑇𝒙 + 𝑏
𝑦 = 𝜎(𝑧)
ℒ_SE = (1/2𝑁) Σ_{𝑖=1}^𝑁 (𝑡^(𝑖) − 𝑦^(𝑖))²
Gradient problems with sigmoid activation
• Gradient using chain rule:

𝑑ℒ_SE/𝑑𝒘 = Σ_{𝑖=1}^𝑁 (𝑑ℒ_SE/𝑑𝑦^(𝑖)) (𝑑𝑦^(𝑖)/𝑑𝑧^(𝑖)) (𝑑𝑧^(𝑖)/𝑑𝒘)

[Figure: the sigmoid 𝑦 = 𝜎(𝑧) = 1/(1 + 𝑒^(−𝑧)); its slope is nearly zero at the two tails]

• The gradient is very small at the two tails

• E.g. suppose you predict 𝑧 = −4 for an example whose target is 𝑡 = 1 (a big mistake); this badly misclassified example will have only a tiny effect on the training algorithm, because the gradient 𝑑𝑦/𝑑𝑧 around 𝑧 = −4 is very small
Sigmoid activation does not have a strong gradient signal at the tails
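A tiny numerical check of this vanishing-gradient effect, using the identity 𝑑𝑦/𝑑𝑧 = 𝜎(𝑧)(1 − 𝜎(𝑧)) (the specific values of 𝑧 are just for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [0.0, -4.0]:
    y = sigmoid(z)
    dy_dz = y * (1.0 - y)   # derivative of the sigmoid
    print(f"z = {z:5.1f}  sigma(z) = {y:.4f}  d sigma/dz = {dy_dz:.4f}")
# At z = 0 the slope is 0.25, but at z = -4 it is only about 0.018:
# a badly misclassified example near the tail contributes almost no gradient.
```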
Problems with squared error loss function

• Squared error loss in a classification setting is not very good

• It does not distinguish bad predictions from extremely bad predictions

• E.g. If 𝑡 = 1, then the predictions 𝑦 = 0.01 and 𝑦 = 0.001 have roughly the same squared-error loss (𝑡 − 𝑦)², even though 𝑦 = 0.001 is more wrong


• From the perspective of optimization, the fact that the losses are nearly equivalent is a big problem, as the loss function is not very sensitive to the parameters, and hence training the parameters becomes difficult

Squared loss functions are not good for classification due to poor sensitivity to parameter changes

Can we somehow use a logarithm over the errors?
Cross-entropy loss function

• The problem with squared-error loss is that it treats 𝑦 = 0.01 and 𝑦 = 0.0001 as
nearly equivalent for 𝑡 = 1

• We’d like a loss function which makes these differences look very different

• One such loss function is cross-entropy (CE) for a single example

ℒ_CE = −log 𝑦 if 𝑡 = 1, and ℒ_CE = −log(1 − 𝑦) if 𝑡 = 0

[Figure: the cross-entropy loss as a function of 𝑦, one curve for 𝑡 = 1 and one for 𝑡 = 0]

• For the example, ℒ_CE(0.01, 1) = 4.6 and ℒ_CE(0.0001, 1) = 9.2, so CE treats the latter much worse

• A compact way to write the formula for cross-entropy over all examples:

ℒ_CE = (1/𝑁) Σ_{𝑖=1}^𝑁 [−𝑡^(𝑖) log 𝑦^(𝑖) − (1 − 𝑡^(𝑖)) log(1 − 𝑦^(𝑖))]
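A small sketch comparing the two losses on the example above (values chosen to mirror the slide; natural log assumed):

```python
import numpy as np

t = 1.0
for y in [0.01, 0.0001]:
    se = (t - y) ** 2                                # squared error
    ce = -t * np.log(y) - (1 - t) * np.log(1 - y)    # cross-entropy
    print(f"y = {y:8.4f}  squared error = {se:.4f}  cross-entropy = {ce:.2f}")
# Squared error barely changes (0.9801 vs 0.9998), but cross-entropy
# roughly doubles (4.61 vs 9.21), so the worse prediction is penalized much more.
```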
Logistic regression

• Logistic regression combines the sigmoid activation function with the cross-entropy loss function

• Logistic regression is a CLASSIFICATION algorithm (not regression)

Model and loss:

𝑧 = 𝒘^𝑇𝒙 + 𝑏
𝑦 = 𝜎(𝑧) = 1 / (1 + 𝑒^(−𝑧))
ℒ_CE = (1/𝑁) Σ_{𝑖=1}^𝑁 [−𝑡^(𝑖) log 𝑦^(𝑖) − (1 − 𝑡^(𝑖)) log(1 − 𝑦^(𝑖))]

Gradient via the chain rule:

𝑑ℒ_CE/𝑑𝒘 = Σ_{𝑖=1}^𝑁 (𝑑ℒ_CE/𝑑𝑦^(𝑖)) (𝑑𝑦^(𝑖)/𝑑𝑧^(𝑖)) (𝑑𝑧^(𝑖)/𝑑𝒘)

with 𝑑ℒ_CE/𝑑𝑦^(𝑖) = (1/𝑁) [−𝑡^(𝑖)/𝑦^(𝑖) + (1 − 𝑡^(𝑖))/(1 − 𝑦^(𝑖))],  𝑑𝑦^(𝑖)/𝑑𝑧^(𝑖) = 𝑦^(𝑖)(1 − 𝑦^(𝑖)),  𝑑𝑧^(𝑖)/𝑑𝒘 = 𝒙^(𝑖)

Gradient descent updates for 𝒘:  𝒘 ← 𝒘 − 𝛼 𝑑ℒ_CE(𝒘)/𝑑𝒘

Similarly, gradient descent updates for 𝑏
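A minimal sketch of these updates in code, assuming a small synthetic dataset and full-batch gradient descent (the data, learning rate, and iteration count are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy linearly separable data: N examples, K = 2 features, binary targets
X = np.array([[0.5, 1.2], [1.0, 0.8], [-0.7, -0.5], [-1.2, 0.3]])
t = np.array([1.0, 1.0, 0.0, 0.0])

w = np.zeros(X.shape[1])
b = 0.0
alpha = 0.5                                   # learning rate

for _ in range(200):
    y = sigmoid(X @ w + b)                    # predictions for all examples
    # Gradient of the averaged cross-entropy; the chain-rule factors simplify to (y - t)
    grad_w = X.T @ (y - t) / len(t)
    grad_b = np.mean(y - t)
    w -= alpha * grad_w                       # w <- w - alpha * dL/dw
    b -= alpha * grad_b                       # b <- b - alpha * dL/db

print(w, b, np.round(sigmoid(X @ w + b), 2))  # predictions move toward the targets
```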
Multi-class classification

Multiclass classification
• So far we’ve talked about binary classification, but most classification problems
involve more than two categories

• Outputs form a discrete set {1, 2, ⋯ , 𝐿}, that is, there are 𝐿 output classes

• Fortunately, this doesn’t require any new ideas. Everything works pretty much the same as in the binary case

• How to represent the outputs?

• It is often more convenient to represent them as one-hot vectors

e.g. class 1 → (1, 0, 0, …, 0)ᵀ, class 2 → (0, 1, 0, …, 0)ᵀ, …, class 𝐿 → (0, 0, 0, …, 1)ᵀ
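A short sketch of one-hot encoding a vector of class labels (the label values and helper name are illustrative):

```python
import numpy as np

def one_hot(labels, num_classes):
    """Map integer class labels (0-indexed) to one-hot row vectors."""
    T = np.zeros((len(labels), num_classes))
    T[np.arange(len(labels)), labels] = 1.0
    return T

labels = np.array([0, 2, 1])   # three examples, L = 3 classes
print(one_hot(labels, 3))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```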
Multiclass classification
• Now there are 𝐾 input dimensions and 𝐿 output dimensions, so we need 𝐾 × 𝐿
weights, which we arrange as a weight matrix 𝐖

• Also, we have an 𝐿-dimensional vector 𝒃 of biases (one per output class)

• Compute linear predictions:

• Component-wise: 𝑧_𝑙 = Σ_{𝑘=1}^𝐾 𝑤_{𝑙𝑘} 𝑥_𝑘 + 𝑏_𝑙

• Vectorized: 𝒛 = 𝐖𝒙 + 𝒃


• Next the activation function:

• The softmax function is a multivariable generalization of the logistic function:

𝑦_𝑙 = softmax(𝑧_1, ⋯ , 𝑧_𝐿)_𝑙 = 𝑒^(𝑧_𝑙) / Σ_𝑗 𝑒^(𝑧_𝑗)

▪ Outputs are positive and sum to 1
▪ Can be interpreted as a probability distribution over the 𝐿 classes

• Finally, we can use a generalized cross-entropy loss function:

ℒ_CE = −Σ_{𝑖=1}^𝑁 Σ_{𝑙=1}^𝐿 𝑡_𝑙^(𝑖) log 𝑦_𝑙^(𝑖) = −Σ_{𝑖=1}^𝑁 (𝒕^(𝑖))^𝑇 log 𝒚^(𝑖)
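A compact sketch of the full multi-class forward pass and loss (the shapes, variable names, and toy numbers are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the max is a standard numerical-stability trick."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

K, L, N = 4, 3, 2                 # input dim, number of classes, number of examples
rng = np.random.default_rng(0)
W = rng.normal(size=(L, K))       # weight matrix (L x K)
b = np.zeros(L)                   # one bias per class
X = rng.normal(size=(N, K))       # N examples as rows
T = np.array([[1, 0, 0],          # one-hot targets
              [0, 0, 1]])

Z = X @ W.T + b                   # linear predictions, z = Wx + b for each example
Y = softmax(Z)                    # predicted class probabilities
loss = -np.sum(T * np.log(Y))     # generalized cross-entropy over all examples
print(Y.sum(axis=1), loss)        # rows of Y sum to 1
```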
