Lecture 3 - Linear Regression


APL 745

Supervised Learning:
Linear regression
Dr. Rajdip Nayek
Block V, Room 418D
Department of Applied Mechanics
Indian Institute of Technology Delhi

E-mail: rajdipn@am.iitd.ac.in
Learning goals

• Understand the difference between regression and classification

• Formulate the linear regression task mathematically as an optimization problem

• Think about the data points and the model parameters as vectors

• Solve the optimization problem using two different strategies: deriving a closed-form
solution, and applying gradient descent

• Write the algorithm in terms of linear algebra, so that we can think about it more
easily

• Analyze the generalization performance of an algorithm, in particular the problems of overfitting and underfitting

• Make a linear algorithm more powerful using nonlinear basis functions, or features

Supervised learning
• Given a set of examples in the form (𝒙, 𝑡)
• E.g., 𝒙 is a 𝐾-dimensional input and 𝑡 is a scalar output

• Objective: To find a relation or function between the inputs and outputs


[Figure: the function 𝑓(𝒙) maps an input 𝒙 (domain) to an output 𝑡 (range)]

• In practice, the underlying true relation 𝑓: 𝒙 → 𝑡 is never known

• So we hypothesize a model ℎ(𝒙) to approximate 𝑓(𝒙)
Supervised learning
• Two types of supervised learning

• Classification: range (output space) of 𝑓(𝒙) is categorical

• Regression: range (output space) of 𝑓(𝒙) is continuous

• In both cases, input space can be discrete or continuous

An example of classification
• Problem: Will you enjoy an outdoor sport based on the weather?

• Data:

  Pollution   Humidity   Wind     Sky      Enjoy sport
  Mild        Normal     Strong   Sunny    YES
  High        High       Weak     Cloudy   NO
  Moderate    High       Strong   Sunny    YES
  Mild        Normal     Weak     Sunny    YES

• 𝒙 = (𝑥1, 𝑥2, 𝑥3, 𝑥4) is the feature vector which is input to the function 𝑓(𝒙)


• The outputs are binary in this case → {Yes, No}
• Find the rule/relation between the inputs and outputs
• Again, the function 𝑓(𝒙) is unknown, and we approximate using model ℎ(𝒙)
• Possible models ℎ(𝒙):
  • ℎ1(𝒙): Sky = Sunny → Enjoy Sport = Yes
  • ℎ2(𝒙): Sky = Sunny and Pollution = Mild → Enjoy Sport = Yes
An example of regression

• Problem: Predict the price of a house based on its floor area


• "𝑥 = Floor area” is the input
• The “𝑡 = price of house” is the output which lies in ℝ+
• Data:

  Floor area   Price of house
  1200         40 L
  2400         60 L
  3000         85 L
  4000         100 L
  5000         62 L
  6000         180 L

[Figure: scatter plot of house price 𝑡 versus floor area 𝑥]
More examples of classification/regression
  Problem                  Input domain             Output range   Classification/Regression
  Spam detection           Text (set of words)
  Stock price prediction   Time-series of prices
  Speech recognition       Audio signal
More examples of classification/regression
  Problem              Input domain                       Output range   Classification/Regression
  Digit recognition    Images of digits
  Housing valuation    House features
  Weather prediction   Sensor data (images, wind speed)
Key components of any ML algorithm

1. The data that we learn from

2. A model with parameters to learn from the data

3. A loss function that quantifies how well (or badly) the model is doing

4. An optimization algorithm to adjust the model's parameters to optimize the objective (loss) function
Linear regression

• $\mathbf{x}^{(i)}$ is a $K$-dimensional input vector and $t^{(i)}$ is a scalar output (also called the target)

• Given several pairs $\{(\mathbf{x}^{(i)}, t^{(i)})\}_{i=1}^{N}$

• Fit a relation expressing the scalar output $t$ as a function of the vector input $\mathbf{x} \in \mathbb{R}^K$

• A simple approach is to choose a linear model, such that the output is a linear function of the input vector
Linear regression: Problem setup

• Find a relation between the vector $\mathbf{x}$ and the scalar $t$, given data $\{(\mathbf{x}^{(i)}, t^{(i)})\}_{i=1}^{N}$

• A linear model with $K$ input features:

$$y = w_1 x_1 + w_2 x_2 + \cdots + w_K x_K + b = \mathbf{w}^T \mathbf{x} + b$$

• $y$ is the model prediction

• $\mathbf{w}$ is the weight vector

• $b$ is the bias (i.e., the intercept)

• $\mathbf{w}$ and $b$ together are the parameters of the linear model

[Figure: for a scalar input $x$, the model is a straight line in the $x$–$y$ plane]
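As a minimal sketch (the numbers and array names below are illustrative, not from the lecture), the prediction of this linear model is just an inner product plus the bias:

```python
import numpy as np

# Illustrative weights, bias, and one K-dimensional input (K = 3 here)
w = np.array([0.5, -1.2, 2.0])   # weight vector w
b = 0.7                          # bias (intercept) b
x = np.array([1.0, 3.0, -2.0])   # one input example x

y = w @ x + b                    # prediction y = w^T x + b
```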
Linear regression: Problem setup

• How good is the model fit with respect to the data pairs $\{(\mathbf{x}^{(i)}, t^{(i)})\}_{i=1}^{N}$?

• To assess model fit, we define errors and accumulate them using a loss function

• Squared error for the $i$-th example: $\left(t^{(i)} - y^{(i)}\right)^2$

• Loss function: take the average of all the errors

$$\mathcal{L}(\mathbf{w}, b) = \frac{1}{2N}\sum_{i=1}^{N}\left(t^{(i)} - y^{(i)}\right)^2 = \frac{1}{2N}\sum_{i=1}^{N}\left(t^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)} - b\right)^2$$

• The $\tfrac{1}{2}$ factor is for convenience, and the $\tfrac{1}{N}$ factor is for taking the average over the full dataset

Goal: To choose $\mathbf{w}$ and $b$ such that the loss $\mathcal{L}(\mathbf{w}, b)$ is minimized
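A small sketch of this loss, computed example by example (the function and argument names are illustrative, not from the lecture):

```python
import numpy as np

def loss(w, b, X, t):
    """L(w, b) = (1/2N) * sum_i (t_i - w^T x_i - b)^2, looping over the N examples."""
    N = len(t)
    total = 0.0
    for i in range(N):
        y_i = w @ X[i] + b               # model prediction for the i-th example
        total += (t[i] - y_i) ** 2       # squared error for the i-th example
    return total / (2 * N)
```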
Linear regression: Problem setup summary

• Find a relation between the vector $\mathbf{x}$ and the scalar $t$, given data $\{(\mathbf{x}^{(i)}, t^{(i)})\}_{i=1}^{N}$

• A linear model with $K$ input features:

$$y = w_1 x_1 + w_2 x_2 + \cdots + w_K x_K + b = \mathbf{w}^T \mathbf{x} + b$$
Vectorization
• We can organize all the data into a matrix 𝐗 with one row per example, and all the
output predictions into a vector 𝒚
$$\mathbf{X} = \begin{bmatrix} (\mathbf{x}^{(1)})^T \\ \vdots \\ (\mathbf{x}^{(N)})^T \end{bmatrix} = \begin{bmatrix} 1 & 3 & -2 & 0.1 \\ \vdots & \vdots & \vdots & \vdots \\ 9 & 2 & 0 & -0.2 \end{bmatrix} \in \mathbb{R}^{N \times K}$$

(each row of $\mathbf{X}$ is one example; each column is one feature across all examples)

• Compute the predicted output for the entire dataset


$$\mathbf{X}\mathbf{w} + \mathbf{1}b = \begin{bmatrix} (\mathbf{x}^{(1)})^T\mathbf{w} + b \\ \vdots \\ (\mathbf{x}^{(N)})^T\mathbf{w} + b \end{bmatrix} = \begin{bmatrix} \mathbf{w}^T\mathbf{x}^{(1)} + b \\ \vdots \\ \mathbf{w}^T\mathbf{x}^{(N)} + b \end{bmatrix} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{bmatrix} = \mathbf{y}$$

• Compute the loss function

$$\mathcal{L}(\mathbf{w}, b) = \frac{1}{2N}\lVert \mathbf{t} - \mathbf{y} \rVert_2^2 = \frac{1}{2N}(\mathbf{t} - \mathbf{y})^T(\mathbf{t} - \mathbf{y})$$
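The same computation in vectorized form, as a sketch with made-up numbers ($N = 4$ examples, $K = 3$ features):

```python
import numpy as np

X = np.array([[1.0, 3.0, -2.0],
              [0.5, 2.0,  0.1],
              [9.0, 2.0,  0.0],
              [4.0, 1.0, -0.2]])          # data matrix X, one row per example
t = np.array([1.2, 0.3, 5.0, 2.1])        # targets t
w = np.array([0.5, -1.0, 2.0])            # weights w
b = 0.1                                   # bias b

N = X.shape[0]
y = X @ w + b                             # predictions y = Xw + 1b for the whole dataset
L = (t - y) @ (t - y) / (2 * N)           # loss (1/2N) * ||t - y||^2
```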
Solving the optimization problem
• We would like to minimize the loss function!

• Recall from calculus class: the minimum of a smooth function occurs at a critical point, i.e., a point where the gradient is zero

[Figure: loss curve ℒ(𝑤) with its minimum at a critical point]

• Two strategies for optimization:

• Direct solution: derive a formula that sets the gradient to 0. This works only in
a handful of cases (e.g. linear regression).

• Iterative methods (e.g., gradient descent): repeatedly apply an update rule which slightly improves the current solution. This is what we'll do throughout the course
Direct solution
• Augment the bias term into the weights

$$\mathbf{y} = \mathbf{X}\mathbf{w} + \mathbf{1}b = \bar{\mathbf{X}}\bar{\mathbf{w}}, \qquad \text{where } \bar{\mathbf{X}} = [\mathbf{X} \;\; \mathbf{1}], \quad \bar{\mathbf{w}} = \begin{bmatrix} \mathbf{w} \\ b \end{bmatrix}$$

• Set the gradient of the loss function with respect to the weights and bias to zero

• Gradient: partial derivatives of a multivariate function with respect to its arguments

$$\frac{d\mathbf{y}}{d\bar{\mathbf{w}}} = \frac{d(\bar{\mathbf{X}}\bar{\mathbf{w}})}{d\bar{\mathbf{w}}} = \bar{\mathbf{X}}^T \quad \text{(denominator layout)}$$

$$\mathcal{L} = \frac{1}{2N}(\mathbf{t} - \mathbf{y})^T(\mathbf{t} - \mathbf{y}) \quad \Rightarrow \quad \frac{d\mathcal{L}}{d\mathbf{y}} = -\frac{1}{N}(\mathbf{t} - \mathbf{y})$$
Direct solution
• Use the chain rule for derivatives, with $\dfrac{d\mathcal{L}}{d\mathbf{y}} = -\dfrac{1}{N}(\mathbf{t} - \mathbf{y})$ and $\dfrac{d\mathbf{y}}{d\bar{\mathbf{w}}} = \bar{\mathbf{X}}^T$

• Set the gradient to zero:

$$\frac{d\mathcal{L}}{d\bar{\mathbf{w}}} = \frac{d\mathbf{y}}{d\bar{\mathbf{w}}}\,\frac{d\mathcal{L}}{d\mathbf{y}} = \mathbf{0} \;\;\Rightarrow\;\; -\frac{1}{N}\,\bar{\mathbf{X}}^T(\mathbf{t} - \mathbf{y}) = \mathbf{0}$$

$$\bar{\mathbf{X}}^T(\mathbf{t} - \mathbf{y}) = \mathbf{0} \;\;\Rightarrow\;\; \bar{\mathbf{X}}^T\mathbf{t} - \bar{\mathbf{X}}^T\bar{\mathbf{X}}\,\bar{\mathbf{w}} = \mathbf{0} \;\;\Rightarrow\;\; \bar{\mathbf{X}}^T\bar{\mathbf{X}}\,\bar{\mathbf{w}} = \bar{\mathbf{X}}^T\mathbf{t} \;\;\Rightarrow\;\; \bar{\mathbf{w}} = \left(\bar{\mathbf{X}}^T\bar{\mathbf{X}}\right)^{-1}\bar{\mathbf{X}}^T\mathbf{t}$$
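A sketch of the direct solution via the normal equations (solving the linear system rather than forming the inverse explicitly; `fit_direct` is an illustrative name, not from the lecture):

```python
import numpy as np

def fit_direct(X, t):
    """Closed-form least-squares fit: w_bar = (X_bar^T X_bar)^{-1} X_bar^T t, returned as (w, b)."""
    N = X.shape[0]
    X_bar = np.hstack([X, np.ones((N, 1))])                # augment a column of ones for the bias
    w_bar = np.linalg.solve(X_bar.T @ X_bar, X_bar.T @ t)  # solve the normal equations
    return w_bar[:-1], w_bar[-1]                           # split back into weights w and bias b
```

In practice, `np.linalg.lstsq(X_bar, t, rcond=None)` is often preferred for numerical stability when $\bar{\mathbf{X}}^T\bar{\mathbf{X}}$ is ill-conditioned.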
Iterative solution using gradient descent
• Gradient descent is an iterative algorithm, which means we apply an update
repeatedly until some convergence criterion is met (e.g. loss function stops changing
much)

• We initialize the weights to something reasonable (e.g. random weights) and


repeatedly adjust them in the direction of steepest descent

[Figure: the gradient direction is the direction of steepest ascent, i.e., the direction in which the function increases]

$$\bar{\mathbf{w}} \leftarrow \text{initialize}$$

$$\bar{\mathbf{w}} \leftarrow \bar{\mathbf{w}} - \alpha\,\frac{d\mathcal{L}}{d\bar{\mathbf{w}}} = \bar{\mathbf{w}} + \frac{\alpha}{N}\,\bar{\mathbf{X}}^T(\mathbf{t} - \mathbf{y}) = \bar{\mathbf{w}} + \frac{\alpha}{N}\,\bar{\mathbf{X}}^T(\mathbf{t} - \bar{\mathbf{X}}\bar{\mathbf{w}})$$

• $\alpha$ is the learning rate. Larger values of $\alpha$ imply greater changes in $\bar{\mathbf{w}}$
• Typical values are 0.01 or 0.001

• Gradient descent can get stuck in local optima; we will see more on this later
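A sketch of the gradient-descent loop on the augmented weights (the learning rate, iteration count, and function name are illustrative defaults, not from the lecture):

```python
import numpy as np

def fit_gradient_descent(X, t, alpha=0.01, num_iters=1000):
    """Repeatedly apply w_bar <- w_bar + (alpha/N) X_bar^T (t - X_bar w_bar)."""
    N = X.shape[0]
    X_bar = np.hstack([X, np.ones((N, 1))])          # augment a column of ones for the bias
    w_bar = np.zeros(X_bar.shape[1])                 # initialize (random values also work)
    for _ in range(num_iters):
        y = X_bar @ w_bar                            # current predictions
        w_bar += (alpha / N) * (X_bar.T @ (t - y))   # step along the negative gradient
    return w_bar[:-1], w_bar[-1]                     # split back into weights w and bias b
```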
Iterative vs direct solution
• By setting the gradients to zero, we compute the direct (or exact) solution. With
gradient descent, we approach it gradually

• Why then would we ever prefer gradient descent (GD)?

• GD can be applied to a much broader class of models, which may not have a closed-form solution but whose gradients can be computed

• GD is easier to implement than direct solutions these days, especially with automatic differentiation software like Autograd

• For regression, the direct solution requires a matrix inverse, which can be computationally very costly for a large number of features

$$\bar{\mathbf{w}} = \left(\bar{\mathbf{X}}^T\bar{\mathbf{X}}\right)^{-1}\bar{\mathbf{X}}^T\mathbf{t}$$

• GD is much more efficient for regression with a large number of features
Nonlinear feature maps
• We can convert linear models into nonlinear models using nonlinear feature maps
$$y = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) + b$$

• E.g., if $\boldsymbol{\phi}(x) = (x, x^2, x^3)^T$, then $y$ is a polynomial in $x$ (no longer linear in $x$):

$$y = w_1 x + w_2 x^2 + w_3 x^3 + b$$

• This introduction of nonlinear features does not require changing the algorithm – just pretend that $\boldsymbol{\phi}(\mathbf{x})$ is the input vector (see the sketch below)

• These nonlinear features allow fitting nonlinear models, but it is hard to manually choose good nonlinear features

[Figure: polynomial fit of $y$ versus $x$]
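A sketch of such a polynomial feature map for a scalar input (`poly_features` is an illustrative name); the resulting feature matrix can be fed to the same linear-regression code as before:

```python
import numpy as np

def poly_features(x, degree=3):
    """Map scalar inputs x to a feature matrix with columns x, x^2, ..., x^degree."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)              # column of scalar inputs
    return np.hstack([x ** d for d in range(1, degree + 1)])

x = np.array([0.5, 1.0, 2.0])
Phi = poly_features(x, degree=3)    # shape (3, 3): columns are x, x^2, x^3
```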
Generalization

• We don’t just want a learning algorithm to make correct predictions on the training examples

• We would like the algorithm to generalize to examples it hasn’t seen before

• The average error on new examples is known as the generalization error

• We would like the generalization error to be as small as possible

• So we divide the total set of examples into training set and test set

[Diagram: the total of 𝑁 examples is split into a training set (e.g., 70%), which gives the training error, and a test set (e.g., 30%), which estimates the generalization error]
Generalization
• Underfitting: the model is too simple (e.g., a degree-1 polynomial) and does not fit the data

• Overfitting: the model is too complex (e.g., a degree-10 polynomial); it fits the training data perfectly but does not generalize

• Balanced: a degree-3 polynomial provides a reasonable compromise

[Figure: three fits of 𝑦 versus 𝑥 – degree 1 (underfit), degree 10 (overfit), degree 3 (balanced)]

• Hyperparameters are parameters that we can't include in the training procedure itself, but which we need to set using some other means. The degree of the polynomial can be considered a hyperparameter.
Generalization
• The training set is used to train the model (using a preset hyperparameter)

• The validation set is used to estimate the generalization error of each hyperparameter setting

• The test set is used at the very end, to estimate the generalization error of the final
model, once all hyperparameters have been chosen

  Training set | Validation set | Test set

  Train with degree 1:  error = 2.2,  error = 2.1
  Train with degree 3:  error = 1.3,  error = 1.1;  test error = 1.6
  Train with degree 10: error = 7.4,  error = 6.2

  (Degree 3 gives the lowest validation error and is chosen as the final model)
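A sketch of choosing the polynomial degree with a validation set, reusing the `poly_features` and `fit_direct` helpers sketched earlier (all names and the candidate degrees are illustrative):

```python
import numpy as np

def select_degree(x_train, t_train, x_val, t_val, degrees=(1, 3, 10)):
    """Train one model per candidate degree and keep the one with the lowest validation error."""
    best_degree, best_err = None, np.inf
    for d in degrees:
        w, b = fit_direct(poly_features(x_train, d), t_train)   # train on the training set only
        y_val = poly_features(x_val, d) @ w + b                 # predict on the validation set
        err = np.mean((t_val - y_val) ** 2) / 2                 # validation error for this degree
        if err < best_err:
            best_degree, best_err = d, err
    return best_degree, best_err
```

The test set would then be used once, on the selected degree, to report the final generalization error.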
