Lecture 3 - Linear Regression


APL 745

Supervised Learning:
Linear regression
Dr. Rajdip Nayek
Block V, Room 418D
Department of Applied Mechanics
Indian Institute of Technology Delhi

E-mail: rajdipn@am.iitd.ac.in
Learning goals

• Understand the difference between regression and classification

• Formulate the linear regression task mathematically as an optimization problem

• Think about the data points and the model parameters as vectors

• Solve the optimization problem using two different strategies: deriving a closed-form
solution, and applying gradient descent

• Write the algorithm in terms of linear algebra, so that we can think about it more
easily

• Analyze the generalization performance of an algorithm, in particular the problems of overfitting and underfitting

• Make a linear algorithm more powerful using nonlinear basis functions, or features

Supervised learning
• Given a set of examples in the form (𝒙, 𝑡)
• E.g., 𝒙 is a 𝐾-dimensional input and 𝑡 is a scalar output

• Objective: To find a relation or function between the inputs and outputs


[Figure: the function 𝑓(𝒙) maps an input 𝒙 (domain) to an output 𝑡 (range)]

• In practice, the underlying true relation 𝑓: 𝒙 → 𝑡 is never known

• So we hypothesize a model ℎ(𝒙) to approximate 𝑓(𝒙)
Supervised learning
• Two types of supervised learning

• Classification: range (output space) of 𝑓(𝒙) is categorical

• Regression: range (output space) of 𝑓(𝒙) is continuous

• In both cases, input space can be discrete or continuous

An example of classification
• Problem: Will you enjoy an outdoor sport based on the weather?

• Data:

  Pollution   Humidity   Wind     Sky      Enjoy sport
  Mild        Normal     Strong   Sunny    YES
  High        High       Weak     Cloudy   NO
  Moderate    High       Strong   Sunny    YES
  Mild        Normal     Weak     Sunny    YES

• 𝒙 = (𝑥1, 𝑥2, 𝑥3, 𝑥4) is the feature vector which is input to the function 𝑓(𝒙)


• The outputs are binary in this case → {Yes, No}
• Find the rule/relation between the inputs and outputs
• Again, the function 𝑓(𝒙) is unknown, and we approximate using model ℎ(𝒙)
• Possible models ℎ(𝒙):
  • ℎ1(𝒙): Sky = Sunny → Enjoy Sport = Yes
  • ℎ2(𝒙): Sky = Sunny and Pollution = Mild → Enjoy Sport = Yes
An example of regression

• Problem: Predict the price of a house based on its floor area


• "𝑥 = Floor area” is the input
• The “𝑡 = price of house” is the output which lies in ℝ+
• Data:

  Floor area   Price of house
  1200         40 L
  2400         60 L
  3000         85 L
  4000         100 L
  5000         62 L
  6000         180 L

[Figure: scatter plot of house price 𝑡 versus floor area 𝑥]
More examples of classification/regression
  Problem                  Input domain             Output range   Classification/Regression
  Spam detection           Text (set of words)
  Stock price prediction   Time-series of prices
  Speech recognition       Audio signal
More examples of classification/regression
  Problem              Input domain                       Output range   Classification/Regression
  Digit recognition    Images of digits
  Housing valuation    House features
  Weather prediction   Sensor data (images, wind speed)
Key components of any ML algorithm

1. The data that we learn from

2. A model with parameters to learn from the data

3. A loss function that quantifies how well (or badly) the model is doing

4. An optimization algorithm to adjust the model's parameters to optimize the objective (loss) function
Linear regression

• $\mathbf{x}^{(i)}$ is a $K$-dimensional input vector and $t^{(i)}$ is a scalar output (also called the target)

• Given several pairs $\{(\mathbf{x}^{(i)}, t^{(i)})\}_{i=1}^{N}$

• Fit a relation expressing the scalar output $t$ as a function of the vector input $\mathbf{x} \in \mathbb{R}^K$

• A simple approach is to choose a linear model, such that the output is a linear function of the input vector
Linear regression: Problem setup

• Find a relation between the vector $\mathbf{x}$ and the scalar $t$, given data $\{(\mathbf{x}^{(i)}, t^{(i)})\}_{i=1}^{N}$

• A linear model with $K$ input features:

$$y = w_1 x_1 + w_2 x_2 + \cdots + w_K x_K + b = \mathbf{w}^T \mathbf{x} + b$$

• $y$ is the model prediction

• $\mathbf{w}$ is the weight vector

• $b$ is the bias (i.e., the intercept)

• $\mathbf{w}$ and $b$ together are the parameters of the linear model

[Figure: for a scalar input $x$, the model is a straight line in the $x$–$y$ plane]
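As a minimal sketch (the numbers and array names below are illustrative, not from the lecture), the prediction of this linear model is just an inner product plus the bias:

```python
import numpy as np

# Illustrative weights, bias, and one K-dimensional input (K = 3 here)
w = np.array([0.5, -1.2, 2.0])   # weight vector w
b = 0.7                          # bias (intercept) b
x = np.array([1.0, 3.0, -2.0])   # one input example x

y = w @ x + b                    # prediction y = w^T x + b
```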
Linear regression: Problem setup

• How good is the model fit with respect to the data pairs $\{(\mathbf{x}^{(i)}, t^{(i)})\}_{i=1}^{N}$?

• To assess model fit, we define errors and accumulate them using a loss function

• Squared error for the $i$-th example: $\left(t^{(i)} - y^{(i)}\right)^2$

• Loss function: take the average of all the errors

$$\mathcal{L}(\mathbf{w}, b) = \frac{1}{2N}\sum_{i=1}^{N}\left(t^{(i)} - y^{(i)}\right)^2 = \frac{1}{2N}\sum_{i=1}^{N}\left(t^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)} - b\right)^2$$

• The $\tfrac{1}{2}$ factor is for convenience, and the $\tfrac{1}{N}$ factor is for taking the average over the full dataset

Goal: To choose $\mathbf{w}$ and $b$ such that the loss $\mathcal{L}(\mathbf{w}, b)$ is minimized
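A small sketch of this loss, computed example by example (the function and argument names are illustrative, not from the lecture):

```python
import numpy as np

def loss(w, b, X, t):
    """L(w, b) = (1/2N) * sum_i (t_i - w^T x_i - b)^2, looping over the N examples."""
    N = len(t)
    total = 0.0
    for i in range(N):
        y_i = w @ X[i] + b               # model prediction for the i-th example
        total += (t[i] - y_i) ** 2       # squared error for the i-th example
    return total / (2 * N)
```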
Linear regression: Problem setup summary

• Find a relation between the vector $\mathbf{x}$ and the scalar $t$, given data $\{(\mathbf{x}^{(i)}, t^{(i)})\}_{i=1}^{N}$

• A linear model with $K$ input features:

$$y = w_1 x_1 + w_2 x_2 + \cdots + w_K x_K + b = \mathbf{w}^T \mathbf{x} + b$$
Vectorization
• We can organize all the data into a matrix 𝐗 with one row per example, and all the
output predictions into a vector 𝒚
$$\mathbf{X} = \begin{bmatrix} (\mathbf{x}^{(1)})^T \\ \vdots \\ (\mathbf{x}^{(N)})^T \end{bmatrix} = \begin{bmatrix} 1 & 3 & -2 & 0.1 \\ \vdots & \vdots & \vdots & \vdots \\ 9 & 2 & 0 & -0.2 \end{bmatrix} \in \mathbb{R}^{N \times K}$$

(each row of $\mathbf{X}$ is one example; each column is one feature across all examples)

• Compute the predicted output for the entire dataset


$$\mathbf{X}\mathbf{w} + \mathbf{1}b = \begin{bmatrix} (\mathbf{x}^{(1)})^T\mathbf{w} + b \\ \vdots \\ (\mathbf{x}^{(N)})^T\mathbf{w} + b \end{bmatrix} = \begin{bmatrix} \mathbf{w}^T\mathbf{x}^{(1)} + b \\ \vdots \\ \mathbf{w}^T\mathbf{x}^{(N)} + b \end{bmatrix} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{bmatrix} = \mathbf{y}$$

• Compute the loss function

$$\mathcal{L}(\mathbf{w}, b) = \frac{1}{2N}\lVert \mathbf{t} - \mathbf{y} \rVert_2^2 = \frac{1}{2N}(\mathbf{t} - \mathbf{y})^T(\mathbf{t} - \mathbf{y})$$
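The same computation in vectorized form, as a sketch with made-up numbers ($N = 4$ examples, $K = 3$ features):

```python
import numpy as np

X = np.array([[1.0, 3.0, -2.0],
              [0.5, 2.0,  0.1],
              [9.0, 2.0,  0.0],
              [4.0, 1.0, -0.2]])          # data matrix X, one row per example
t = np.array([1.2, 0.3, 5.0, 2.1])        # targets t
w = np.array([0.5, -1.0, 2.0])            # weights w
b = 0.1                                   # bias b

N = X.shape[0]
y = X @ w + b                             # predictions y = Xw + 1b for the whole dataset
L = (t - y) @ (t - y) / (2 * N)           # loss (1/2N) * ||t - y||^2
```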
Solving the optimization problem
• We would like to minimize the loss function!

• Recall from calculus class: the minimum of a smooth function occurs at a critical point, i.e., a point where the gradient is zero

[Figure: loss curve ℒ(𝑤) with its minimum at a critical point]

• Two strategies for optimization:

• Direct solution: derive a formula that sets the gradient to 0. This works only in
a handful of cases (e.g. linear regression).

• Iterative methods (e.g., gradient descent): repeatedly apply an update rule which slightly improves the current solution. This is what we'll do throughout the course
Direct solution
• Augment the bias term into the weights

$$\mathbf{y} = \mathbf{X}\mathbf{w} + \mathbf{1}b = \bar{\mathbf{X}}\bar{\mathbf{w}}, \qquad \text{where } \bar{\mathbf{X}} = [\mathbf{X} \;\; \mathbf{1}], \quad \bar{\mathbf{w}} = \begin{bmatrix} \mathbf{w} \\ b \end{bmatrix}$$

• Set the gradient of the loss function with respect to the weights and bias to zero

• Gradient: partial derivatives of a multivariate function with respect to its arguments

$$\frac{d\mathbf{y}}{d\bar{\mathbf{w}}} = \frac{d(\bar{\mathbf{X}}\bar{\mathbf{w}})}{d\bar{\mathbf{w}}} = \bar{\mathbf{X}}^T \quad \text{(denominator layout)}$$

$$\mathcal{L} = \frac{1}{2N}(\mathbf{t} - \mathbf{y})^T(\mathbf{t} - \mathbf{y}) \quad \Rightarrow \quad \frac{d\mathcal{L}}{d\mathbf{y}} = -\frac{1}{N}(\mathbf{t} - \mathbf{y})$$
Direct solution
• Use the chain rule for derivatives, with $\dfrac{d\mathcal{L}}{d\mathbf{y}} = -\dfrac{1}{N}(\mathbf{t} - \mathbf{y})$ and $\dfrac{d\mathbf{y}}{d\bar{\mathbf{w}}} = \bar{\mathbf{X}}^T$

• Set the gradient to zero:

$$\frac{d\mathcal{L}}{d\bar{\mathbf{w}}} = \frac{d\mathbf{y}}{d\bar{\mathbf{w}}}\,\frac{d\mathcal{L}}{d\mathbf{y}} = \mathbf{0} \;\;\Rightarrow\;\; -\frac{1}{N}\,\bar{\mathbf{X}}^T(\mathbf{t} - \mathbf{y}) = \mathbf{0}$$

$$\bar{\mathbf{X}}^T(\mathbf{t} - \mathbf{y}) = \mathbf{0} \;\;\Rightarrow\;\; \bar{\mathbf{X}}^T\mathbf{t} - \bar{\mathbf{X}}^T\bar{\mathbf{X}}\,\bar{\mathbf{w}} = \mathbf{0} \;\;\Rightarrow\;\; \bar{\mathbf{X}}^T\bar{\mathbf{X}}\,\bar{\mathbf{w}} = \bar{\mathbf{X}}^T\mathbf{t} \;\;\Rightarrow\;\; \bar{\mathbf{w}} = \left(\bar{\mathbf{X}}^T\bar{\mathbf{X}}\right)^{-1}\bar{\mathbf{X}}^T\mathbf{t}$$
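A sketch of the direct solution via the normal equations (solving the linear system rather than forming the inverse explicitly; `fit_direct` is an illustrative name, not from the lecture):

```python
import numpy as np

def fit_direct(X, t):
    """Closed-form least-squares fit: w_bar = (X_bar^T X_bar)^{-1} X_bar^T t, returned as (w, b)."""
    N = X.shape[0]
    X_bar = np.hstack([X, np.ones((N, 1))])                # augment a column of ones for the bias
    w_bar = np.linalg.solve(X_bar.T @ X_bar, X_bar.T @ t)  # solve the normal equations
    return w_bar[:-1], w_bar[-1]                           # split back into weights w and bias b
```

In practice, `np.linalg.lstsq(X_bar, t, rcond=None)` is often preferred for numerical stability when $\bar{\mathbf{X}}^T\bar{\mathbf{X}}$ is ill-conditioned.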
Iterative solution using gradient descent
• Gradient descent is an iterative algorithm, which means we apply an update
repeatedly until some convergence criterion is met (e.g. loss function stops changing
much)

• We initialize the weights to something reasonable (e.g. random weights) and


repeatedly adjust them in the direction of steepest descent

[Figure: the gradient direction is the direction of steepest ascent, i.e., the direction in which the function increases]

$$\bar{\mathbf{w}} \leftarrow \text{initialize}$$

$$\bar{\mathbf{w}} \leftarrow \bar{\mathbf{w}} - \alpha\,\frac{d\mathcal{L}}{d\bar{\mathbf{w}}} = \bar{\mathbf{w}} + \frac{\alpha}{N}\,\bar{\mathbf{X}}^T(\mathbf{t} - \mathbf{y}) = \bar{\mathbf{w}} + \frac{\alpha}{N}\,\bar{\mathbf{X}}^T(\mathbf{t} - \bar{\mathbf{X}}\bar{\mathbf{w}})$$

• $\alpha$ is the learning rate. Larger values of $\alpha$ imply greater changes in $\bar{\mathbf{w}}$
• Typical values are 0.01 or 0.001

• Gradient descent can get stuck in local optima; we will see more on this later
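A sketch of the gradient-descent loop on the augmented weights (the learning rate, iteration count, and function name are illustrative defaults, not from the lecture):

```python
import numpy as np

def fit_gradient_descent(X, t, alpha=0.01, num_iters=1000):
    """Repeatedly apply w_bar <- w_bar + (alpha/N) X_bar^T (t - X_bar w_bar)."""
    N = X.shape[0]
    X_bar = np.hstack([X, np.ones((N, 1))])          # augment a column of ones for the bias
    w_bar = np.zeros(X_bar.shape[1])                 # initialize (random values also work)
    for _ in range(num_iters):
        y = X_bar @ w_bar                            # current predictions
        w_bar += (alpha / N) * (X_bar.T @ (t - y))   # step along the negative gradient
    return w_bar[:-1], w_bar[-1]                     # split back into weights w and bias b
```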
Iterative vs direct solution
• By setting the gradients to zero, we compute the direct (or exact) solution. With
gradient descent, we approach it gradually

• Why then would we ever prefer gradient descent (GD)?

• GD can be applied to a much broader class of models, which may not have a closed-form solution but whose gradients can be computed

• GD is easier to implement than direct solutions these days, especially with automatic differentiation software like Autograd

• For regression, the direct solution requires a matrix inverse, which can be computationally very costly for a large number of features

$$\bar{\mathbf{w}} = \left(\bar{\mathbf{X}}^T\bar{\mathbf{X}}\right)^{-1}\bar{\mathbf{X}}^T\mathbf{t}$$

• GD is much more efficient for regression with a large number of features
Nonlinear feature maps
• We can convert linear models into nonlinear models using nonlinear feature maps
$$y = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) + b$$

• E.g., if $\boldsymbol{\phi}(x) = (x, x^2, x^3)^T$, then $y$ is a polynomial in $x$ (no longer linear in $x$):

$$y = w_1 x + w_2 x^2 + w_3 x^3 + b$$

• This introduction of nonlinear features does not require changing the algorithm – just pretend that $\boldsymbol{\phi}(\mathbf{x})$ is the input vector (see the sketch below)

• These nonlinear features allow fitting nonlinear models, but it is hard to manually choose good nonlinear features

[Figure: polynomial fit of $y$ versus $x$]
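A sketch of such a polynomial feature map for a scalar input (`poly_features` is an illustrative name); the resulting feature matrix can be fed to the same linear-regression code as before:

```python
import numpy as np

def poly_features(x, degree=3):
    """Map scalar inputs x to a feature matrix with columns x, x^2, ..., x^degree."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)              # column of scalar inputs
    return np.hstack([x ** d for d in range(1, degree + 1)])

x = np.array([0.5, 1.0, 2.0])
Phi = poly_features(x, degree=3)    # shape (3, 3): columns are x, x^2, x^3
```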
Generalization

• We don’t just want a learning algorithm to make correct predictions on the training examples

• We would like the algorithm to generalize to examples it hasn’t seen before

• The average error on new examples is known as the generalization error

• We would like the generalization error to be as small as possible

• So we divide the total set of examples into training set and test set

[Diagram: the total of 𝑁 examples is split into a training set (e.g., 70%), which gives the training error, and a test set (e.g., 30%), which estimates the generalization error]
Generalization
• Underfitting: the model is too simple (e.g., a degree-1 polynomial) and does not fit the data

• Overfitting: the model is too complex (e.g., a degree-10 polynomial); it fits the training data perfectly but does not generalize

• Balanced: a degree-3 polynomial provides a reasonable compromise

[Figure: three fits of 𝑦 versus 𝑥 – degree 1 (underfit), degree 10 (overfit), degree 3 (balanced)]

• Hyperparameters are parameters that we can't include in the training procedure itself, but which we need to set using some other means. The degree of the polynomial can be considered a hyperparameter.
Generalization
• The training set is used to train the model (using a preset hyperparameter)

• The validation set is used to estimate the generalization error of each hyperparameter setting

• The test set is used at the very end, to estimate the generalization error of the final
model, once all hyperparameters have been chosen

  Training set | Validation set | Test set

  Train with degree 1:  error = 2.2,  error = 2.1
  Train with degree 3:  error = 1.3,  error = 1.1;  test error = 1.6
  Train with degree 10: error = 7.4,  error = 6.2

  (Degree 3 gives the lowest validation error and is chosen as the final model)
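A sketch of choosing the polynomial degree with a validation set, reusing the `poly_features` and `fit_direct` helpers sketched earlier (all names and the candidate degrees are illustrative):

```python
import numpy as np

def select_degree(x_train, t_train, x_val, t_val, degrees=(1, 3, 10)):
    """Train one model per candidate degree and keep the one with the lowest validation error."""
    best_degree, best_err = None, np.inf
    for d in degrees:
        w, b = fit_direct(poly_features(x_train, d), t_train)   # train on the training set only
        y_val = poly_features(x_val, d) @ w + b                 # predict on the validation set
        err = np.mean((t_val - y_val) ** 2) / 2                 # validation error for this degree
        if err < best_err:
            best_degree, best_err = d, err
    return best_degree, best_err
```

The test set would then be used once, on the selected degree, to report the final generalization error.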
