
APL 745

Supervised Learning:
Linear classification
Dr. Rajdip Nayek
Block V, Room 418D
Department of Applied Mechanics
Indian Institute of Technology Delhi

E-mail: rajdipn@am.iitd.ac.in
Introduction

• In regression, the goal was to predict a scalar-valued output from a set of input features. The question there was “how much?”

• Here we will focus on classification, where the goal is to predict a category. Here the question is “which one?”

• Some examples of classification

• Whether an email is spam or not spam?

• Is this customer likely to sign up for a subscription or not?

• Does this image depict a donkey, a dog, a cat, or a camel?

• Which movie are you most likely to watch next?

Learning goals

• Know what is meant by binary linear classification

• Be able to specify weights and biases by hand to represent simple Boolean functions
(e.g. AND, OR, NOT)

• Be aware of the limitations of linear classifiers

• Understand why classification error and squared error are problematic cost functions
for classification.

Learning goals

• Know what cross-entropy is and understand why it can be easier to optimize than
squared error (assuming a logistic activation function)

• Be able to derive the gradient descent updates for all of the models and cost functions
mentioned in this lecture

• Understand how multi-class classification works

Binary linear classification

• Classification: Predict a class label (or a discrete-valued target)

• Binary: Predict a binary output 𝑦 ∈ {0, 1} or {No, Yes}


- Distinguish between two categories

• Linear: The classification is done using a linear function of the input features 𝒙 ∈ ℝ^𝐾

• Training set: 𝑁 pairs of examples, that is, {(𝒙^(𝑖), 𝑡^(𝑖))}_{𝑖=1}^𝑁
• 𝑡^(𝑖) can take values in {0, 1} (because it is binary valued)

Goal: To correctly classify all the 𝑁 training examples (and hopefully the ones in the test set as well)
Binary linear classification
• In linear classification, we choose a linear model, which determines the output
predictions 𝑦 from the input features 𝒙
𝑧 = 𝒘^𝑇𝒙 + 𝑏 = 𝑤_1𝑥_1 + ⋯ + 𝑤_𝐾𝑥_𝐾 + 𝑏

• For classification, we apply a threshold on 𝑧 to get 𝑦

𝑦 = 0 if 𝑧 < 0, and 𝑦 = 1 if 𝑧 ≥ 0

[Figure: the hard-threshold output 𝑦 as a function of 𝑧; 𝑦 jumps from 0 to 1 at 𝑧 = 0]
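A minimal sketch of this thresholded linear model in code (the weight and input values below are illustrative, not from the slides):

```python
import numpy as np

def predict(x, w, b):
    """Hard-threshold linear classifier: y = 1 if w.x + b >= 0, else 0."""
    z = np.dot(w, x) + b
    return 1 if z >= 0 else 0

# Illustrative weights and bias for a 2-feature input
w = np.array([1.0, -2.0])
b = 0.5
print(predict(np.array([2.0, 0.5]), w, b))  # z = 2.0 - 1.0 + 0.5 = 1.5 -> 1
print(predict(np.array([0.0, 1.0]), w, b))  # z = -2.0 + 0.5 = -1.5 -> 0
```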
Binary linear classification

• This is basically a special case of the single neuron processing unit; this is also
called a perceptron
𝑧 = 𝒘^𝑇𝒙 + 𝑏 = 𝑤_1𝑥_1 + ⋯ + 𝑤_𝐾𝑥_𝐾 + 𝑏

𝑦 = 0 if 𝑧 < 0, and 𝑦 = 1 if 𝑧 ≥ 0

[Figure: a single-neuron unit with input nodes 𝑥_1, …, 𝑥_𝐾, weights 𝑤_1, …, 𝑤_𝐾, a bias 𝑏 attached to a constant input 1, and output 𝑦]
Binary linear classification

• The same single-neuron unit can also be drawn with a nonlinear activation function 𝜎 applied to 𝑧, giving 𝑦 = 𝜎(𝒘^𝑇𝒙 + 𝑏)

[Figure: the single-neuron unit with input nodes, weights, bias, and a nonlinear activation function 𝜎(𝑧) producing the output 𝑦]
Ex: Whether to watch a movie or not?
• Consider the task of deciding whether to watch a movie or not

• Suppose there are three decision inputs, 𝒙 ∈ ℝ^3 (e.g. 𝑥_1: the actor is Matt Damon, 𝑥_2: the genre is thriller, 𝑥_3: the director is Christopher Nolan)

[Figure: a single-neuron unit with decision inputs 𝑥_1, 𝑥_2, 𝑥_3, weights 𝑤_1, 𝑤_2, 𝑤_3, bias 𝑏, and output 𝑦 = “watch a movie or not?”]

• Based on your past viewing experience, you may give a high weight to 𝑥_3 compared to the other inputs (because you like Nolan’s movies)

• In such a case, you may still watch the movie even if it does not have Matt Damon as an actor and the genre is not thriller, i.e. 𝑥_1 = 0, 𝑥_2 = 0, 𝑥_3 = 1
Ex: Whether to watch a movie or not?
• So what could the bias mean in this case?

• Suppose you are a movie buff (you like to watch any movie); then your bias 𝑏 = 0

• Alternatively, if you only watch movies where the actor is Matt Damon, the genre is thriller, and the director is Christopher Nolan, then your bias 𝑏 = −3

• So the weights and bias depend on data (user history in this case), and that is why we want to learn the weights and bias for prediction
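A small sketch of how the bias shifts the decision in this toy example (the unit weights, feature values, and function name are illustrative assumptions, not given on the slides):

```python
import numpy as np

def watch_movie(x, w, b):
    """Threshold unit: watch (1) if w.x + b >= 0, else skip (0)."""
    return 1 if np.dot(w, x) + b >= 0 else 0

w = np.array([1.0, 1.0, 1.0])     # equal weights on the three decision inputs
x = np.array([0.0, 0.0, 1.0])     # not Matt Damon, not a thriller, but a Nolan film

print(watch_movie(x, w, b=0.0))   # movie buff (b = 0):   z = 1 >= 0 -> watch
print(watch_movie(x, w, b=-3.0))  # picky viewer (b = -3): z = -2 < 0 -> skip
```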
Example of AND logic gate
• All computers operate using logic gates; can we linearly classify the outputs of logic gates?

• We can use a perceptron to classify outputs from several logic gates such as AND, NOT,
OR, etc. These are simple functions which are often used in logical reasoning with binary
outputs (1-0 or Yes/No)
• AND logic gate

  Input 𝑥1   Input 𝑥2   Output 𝑦
      0          0          0
      1          0          0
      0          1          0
      1          1          1

[Figure: input space (𝑥1, 𝑥2); only the point (1, 1) belongs to class 1 and can be separated from the other three points by a line]

• Linear classification implies that the training examples can be separated by a hyperplane (in 2-D, it is a line)
Example of OR logic gate
  Input 𝑥1   Input 𝑥2   Output 𝑦
      0          0          0
      1          0          1
      0          1          1
      1          1          1

[Figure: input space (𝑥1, 𝑥2); only the point (0, 0) belongs to class 0 and can be separated from the rest by a line]

• For this 2-D input problem, a line is able to separate the inputs

• Hence, OR gate can also be classified using a perceptron


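A sketch of hand-chosen weights and biases that realize AND, OR, and NOT with a threshold unit (these particular values are one valid choice among many, not the only ones):

```python
def step(z):
    """Hard threshold: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def AND(x1, x2):
    return step(1.0 * x1 + 1.0 * x2 - 1.5)   # fires only when both inputs are 1

def OR(x1, x2):
    return step(1.0 * x1 + 1.0 * x2 - 0.5)   # fires when at least one input is 1

def NOT(x1):
    return step(-1.0 * x1 + 0.5)             # flips a single binary input

for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b))     # reproduces the AND and OR truth tables
print(NOT(0), NOT(1))                        # 1 0
```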
How about the XOR gate?
• Can a line separate the two classes of outputs?

  Input 𝑥1   Input 𝑥2   Output 𝑦
      0          0          0
      1          0          1
      0          1          1
      1          1          0

[Figure: input space (𝑥1, 𝑥2); the points (1, 0) and (0, 1) belong to class 1 while (0, 0) and (1, 1) belong to class 0, and no single line separates the two classes]

• XOR is not linearly separable
Adding nonlinear input features
• Perceptrons cannot learn functions which are not linearly separable

• 1-D example: [Figure: a 1-D dataset that is not linearly separable]

• Including nonlinear features will make it separable: the problem now becomes 2-D, and a line can separate the classes
XOR problem using nonlinear input features
• XOR is not linearly separable
• Introduce a nonlinear input feature 𝑥3 = 𝑥1𝑥2

  Input 𝑥1   Input 𝑥2   Input 𝑥3 = 𝑥1𝑥2   Output 𝑦
      0          0             0              0
      1          0             0              1
      0          1             0              1
      1          1             1              0

[Figure: the 3-D input space (𝑥1, 𝑥2, 𝑥3), in which the two classes can be separated by a plane]

• A plane can separate the training data
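A quick sketch verifying that one choice of weights separates XOR once the product feature is added (the particular weights 𝒘 = [1, 1, −2] and bias 𝑏 = −0.5 are an illustrative assumption, not from the slides):

```python
import numpy as np

w = np.array([1.0, 1.0, -2.0])   # weights on x1, x2, and x3 = x1*x2
b = -0.5

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    x = np.array([x1, x2, x1 * x2])          # augmented input with the nonlinear feature
    y = 1 if np.dot(w, x) + b >= 0 else 0    # thresholded linear model in 3-D
    print(x1, x2, "->", y)                   # reproduces XOR: 0, 1, 1, 0
```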
Recap and what’s next
• We looked at perceptron for binary linear classification

• Limitations of perceptron

• It can only distinguish between two categories (i.e. 0/1)

• No guarantees if the data is not linearly separable

• Need to add nonlinear features to address limitations

• Now the question is how do we optimize the parameters of the model

• What loss function to use? Recall that we had defined a loss function in regression

• Also, how do we do multi-class classification instead of binary classification?
Loss functions for binary classification
• Data: 𝑁 examples {(𝒙^(𝑖), 𝑡^(𝑖))}_{𝑖=1}^𝑁, with 𝒙 ∈ ℝ^𝐾 and 𝑦 ∈ {0, 1}

• Model: 𝑧 = 𝒘^𝑇𝒙 + 𝑏 = 𝑤_1𝑥_1 + ⋯ + 𝑤_𝐾𝑥_𝐾 + 𝑏, with 𝑦 = 0 if 𝑧 < 0 and 𝑦 = 1 if 𝑧 ≥ 0

• Loss function: A seemingly obvious loss function is 0-1 loss

ℒ_{0−1} = (1/𝑁) Σ_{𝑖=1}^𝑁 𝕀[𝑡^(𝑖) ≠ 𝑦^(𝑖)]

where 𝕀[𝑡 ≠ 𝑦] = 0 if 𝑡 = 𝑦, and 1 if 𝑡 ≠ 𝑦

• The ℒ_{0−1} loss function calculates the fraction of misclassified examples

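A small sketch of the 0-1 loss for a batch of examples (the array names and data are illustrative):

```python
import numpy as np

def zero_one_loss(t, y):
    """Fraction of misclassified examples: mean of the indicator t != y."""
    return np.mean(t != y)

t = np.array([1, 0, 1, 1])   # targets
y = np.array([1, 1, 0, 1])   # hard predictions from the threshold unit
print(zero_one_loss(t, y))   # 2 mistakes out of 4 -> 0.5
```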
Problems with 0-1 loss function
• Let’s look at the derivative of the ℒ0−1 loss function

• Chain rule: 𝑑ℒ_{0−1}/𝑑𝒘 = (𝑑ℒ_{0−1}/𝑑𝑧)(𝑑𝑧/𝑑𝒘)

• We can’t find a derivative or gradient of a discontinuous function

• Even if you did calculate 𝑑ℒ_{0−1}/𝑑𝑧 on either side of zero, its value is 0, since ℒ_{0−1} is constant on either side
• So almost all points have zero gradient!!

• This also means that a small change in the weights would not change the loss function

So a 0-1 loss function is useless for practical optimization


A squared loss function with sigmoid neuron
• Can we instead work with a continuous squared loss function (just like in regression)?

• Yes we can. First, define a logistic (or sigmoid) activation function over 𝑧 to squash it into (0, 1):

𝑦 = 𝜎(𝑧) = 1 / (1 + 𝑒^(−𝑧))

[Figure: the sigmoid 𝑦 = 𝜎(𝑧), which rises smoothly from 0 to 1 as 𝑧 goes from −4 to 4]

• A linear model with sigmoid nonlinearity and a continuous squared error loss function:
𝑧 = 𝒘^𝑇𝒙 + 𝑏
𝑦 = 𝜎(𝑧)
ℒ_SE = (1/2𝑁) Σ_{𝑖=1}^𝑁 (𝑡^(𝑖) − 𝑦^(𝑖))²
Gradient problems with sigmoid activation
• Gradient using chain rule:

𝑑ℒ_SE/𝑑𝒘 = Σ_{𝑖=1}^𝑁 (𝑑ℒ_SE/𝑑𝑦^(𝑖)) (𝑑𝑦^(𝑖)/𝑑𝑧^(𝑖)) (𝑑𝑧^(𝑖)/𝑑𝒘)

[Figure: the sigmoid 𝑦 = 𝜎(𝑧) = 1/(1 + 𝑒^(−𝑧)); its slope is nearly zero at the two tails]

• The gradient is very small at the two tails

• E.g. suppose you predict 𝑧 = −4 for an example whose target is 𝑡 = 1 (a big mistake); this badly misclassified example will have only a tiny effect on the training algorithm, because the gradient 𝑑𝑦/𝑑𝑧 around 𝑧 = −4 is very small
Sigmoid activation does not have a strong gradient signal at the tails
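A tiny numerical check of this vanishing-gradient effect, using the identity 𝑑𝑦/𝑑𝑧 = 𝜎(𝑧)(1 − 𝜎(𝑧)) (the specific values of 𝑧 are just for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [0.0, -4.0]:
    y = sigmoid(z)
    dy_dz = y * (1.0 - y)   # derivative of the sigmoid
    print(f"z = {z:5.1f}  sigma(z) = {y:.4f}  d sigma/dz = {dy_dz:.4f}")
# At z = 0 the slope is 0.25, but at z = -4 it is only about 0.018:
# a badly misclassified example near the tail contributes almost no gradient.
```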
Problems with squared error loss function

• Squared error loss in a classification setting is not very good

• It does not distinguish bad predictions from extremely bad predictions

• E.g. If 𝑡 = 1, then the predictions 𝑦 = 0.01 and 𝑦 = 0.001 have roughly the same squared-error loss (𝑡 − 𝑦)², even though 𝑦 = 0.001 is more wrong


• From the perspective of optimization, the fact that the losses are nearly equivalent is a big problem, as the loss function is not very sensitive to the parameters, and hence training the parameters becomes difficult

Squared loss functions are not good for classification due to poor sensitivity to parameter changes

Can we somehow use a logarithm over the errors?
Cross-entropy loss function

• The problem with squared-error loss is that it treats 𝑦 = 0.01 and 𝑦 = 0.0001 as
nearly equivalent for 𝑡 = 1

• We’d like a loss function which makes these differences look very different

• One such loss function is cross-entropy (CE) for a single example

ℒ_CE = −log 𝑦 if 𝑡 = 1, and ℒ_CE = −log(1 − 𝑦) if 𝑡 = 0

[Figure: the cross-entropy loss as a function of 𝑦, one curve for 𝑡 = 1 and one for 𝑡 = 0]

• For the example, ℒ_CE(0.01, 1) = 4.6 and ℒ_CE(0.0001, 1) = 9.2, so CE treats the latter much worse

• A compact way to write the formula for cross-entropy over all examples:

ℒ_CE = (1/𝑁) Σ_{𝑖=1}^𝑁 [−𝑡^(𝑖) log 𝑦^(𝑖) − (1 − 𝑡^(𝑖)) log(1 − 𝑦^(𝑖))]
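A small sketch comparing the two losses on the example above (values chosen to mirror the slide; natural log assumed):

```python
import numpy as np

t = 1.0
for y in [0.01, 0.0001]:
    se = (t - y) ** 2                                # squared error
    ce = -t * np.log(y) - (1 - t) * np.log(1 - y)    # cross-entropy
    print(f"y = {y:8.4f}  squared error = {se:.4f}  cross-entropy = {ce:.2f}")
# Squared error barely changes (0.9801 vs 0.9998), but cross-entropy
# roughly doubles (4.61 vs 9.21), so the worse prediction is penalized much more.
```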
Logistic regression

• Logistic regression combines the sigmoid activation function with the cross-entropy loss function

• Logistic regression is a CLASSIFICATION algorithm (not regression)

Model and loss:

𝑧 = 𝒘^𝑇𝒙 + 𝑏
𝑦 = 𝜎(𝑧) = 1 / (1 + 𝑒^(−𝑧))
ℒ_CE = (1/𝑁) Σ_{𝑖=1}^𝑁 [−𝑡^(𝑖) log 𝑦^(𝑖) − (1 − 𝑡^(𝑖)) log(1 − 𝑦^(𝑖))]

Gradient via the chain rule:

𝑑ℒ_CE/𝑑𝒘 = Σ_{𝑖=1}^𝑁 (𝑑ℒ_CE/𝑑𝑦^(𝑖)) (𝑑𝑦^(𝑖)/𝑑𝑧^(𝑖)) (𝑑𝑧^(𝑖)/𝑑𝒘)

with 𝑑ℒ_CE/𝑑𝑦^(𝑖) = (1/𝑁) [−𝑡^(𝑖)/𝑦^(𝑖) + (1 − 𝑡^(𝑖))/(1 − 𝑦^(𝑖))],  𝑑𝑦^(𝑖)/𝑑𝑧^(𝑖) = 𝑦^(𝑖)(1 − 𝑦^(𝑖)),  𝑑𝑧^(𝑖)/𝑑𝒘 = 𝒙^(𝑖)

Gradient descent updates for 𝒘:  𝒘 ← 𝒘 − 𝛼 𝑑ℒ_CE(𝒘)/𝑑𝒘

Similarly, gradient descent updates for 𝑏
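A minimal sketch of these updates in code, assuming a small synthetic dataset and full-batch gradient descent (the data, learning rate, and iteration count are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy linearly separable data: N examples, K = 2 features, binary targets
X = np.array([[0.5, 1.2], [1.0, 0.8], [-0.7, -0.5], [-1.2, 0.3]])
t = np.array([1.0, 1.0, 0.0, 0.0])

w = np.zeros(X.shape[1])
b = 0.0
alpha = 0.5                                   # learning rate

for _ in range(200):
    y = sigmoid(X @ w + b)                    # predictions for all examples
    # Gradient of the averaged cross-entropy; the chain-rule factors simplify to (y - t)
    grad_w = X.T @ (y - t) / len(t)
    grad_b = np.mean(y - t)
    w -= alpha * grad_w                       # w <- w - alpha * dL/dw
    b -= alpha * grad_b                       # b <- b - alpha * dL/db

print(w, b, np.round(sigmoid(X @ w + b), 2))  # predictions move toward the targets
```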
Multi-class classification

Multiclass classification
• So far we’ve talked about binary classification, but most classification problems
involve more than two categories

• Outputs form a discrete set {1, 2, ⋯ , 𝐿}, that is, there are 𝐿 output classes

• Fortunately, this doesn’t require any new ideas. Everything works pretty much the same as in the binary case

• How to represent the outputs?

• It is often more convenient to represent them as one-hot vectors

e.g. class 1 → (1, 0, 0, …, 0)ᵀ, class 2 → (0, 1, 0, …, 0)ᵀ, …, class 𝐿 → (0, 0, 0, …, 1)ᵀ
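A short sketch of one-hot encoding a vector of class labels (the label values and helper name are illustrative):

```python
import numpy as np

def one_hot(labels, num_classes):
    """Map integer class labels (0-indexed) to one-hot row vectors."""
    T = np.zeros((len(labels), num_classes))
    T[np.arange(len(labels)), labels] = 1.0
    return T

labels = np.array([0, 2, 1])   # three examples, L = 3 classes
print(one_hot(labels, 3))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```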
Multiclass classification
• Now there are 𝐾 input dimensions and 𝐿 output dimensions, so we need 𝐾 × 𝐿
weights, which we arrange as a weight matrix 𝐖

• Also, we have an 𝐿-dimensional vector 𝒃 of biases (one per output class)

• Compute linear predictions:

• Component-wise: 𝑧_𝑙 = Σ_{𝑘=1}^𝐾 𝑤_{𝑙𝑘} 𝑥_𝑘 + 𝑏_𝑙

• Vectorized: 𝒛 = 𝐖𝒙 + 𝒃


• Next the activation function:

• The softmax function is a multivariable generalization of the logistic function:

𝑦_𝑙 = softmax(𝑧_1, ⋯ , 𝑧_𝐿)_𝑙 = 𝑒^(𝑧_𝑙) / Σ_𝑗 𝑒^(𝑧_𝑗)

▪ Outputs are positive and sum to 1
▪ Can be interpreted as a probability distribution over the 𝐿 classes

• Finally, we can use a generalized cross-entropy loss function:

ℒ_CE = −Σ_{𝑖=1}^𝑁 Σ_{𝑙=1}^𝐿 𝑡_𝑙^(𝑖) log 𝑦_𝑙^(𝑖) = −Σ_{𝑖=1}^𝑁 (𝒕^(𝑖))^𝑇 log 𝒚^(𝑖)
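A compact sketch of the full multi-class forward pass and loss (the shapes, variable names, and toy numbers are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the max is a standard numerical-stability trick."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

K, L, N = 4, 3, 2                 # input dim, number of classes, number of examples
rng = np.random.default_rng(0)
W = rng.normal(size=(L, K))       # weight matrix (L x K)
b = np.zeros(L)                   # one bias per class
X = rng.normal(size=(N, K))       # N examples as rows
T = np.array([[1, 0, 0],          # one-hot targets
              [0, 0, 1]])

Z = X @ W.T + b                   # linear predictions, z = Wx + b for each example
Y = softmax(Z)                    # predicted class probabilities
loss = -np.sum(T * np.log(Y))     # generalized cross-entropy over all examples
print(Y.sum(axis=1), loss)        # rows of Y sum to 1
```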
