
CIS 4526: Foundations of Machine Learning

Linear Regression
(modified from Sanja Fidler)

Instructor: Kai Zhang


CIS @ Temple University, Fall 2020
Regression Problems

• Curve Fitting

• Time Series Forecast


Regression Problem

• What do all these problems have in common?
  – Input: d-dimensional samples/vectors
  – Output: continuous target value
• How to make predictions?
  – A model, a function that represents the relationship between the input x and the output y
  – A loss (or cost, or objective) function, which tells us how well our model approximates the training examples
  – Optimization, a way of finding the parameters of our model that minimize the loss function
Simple 1-D example

(Figure: a model $y(x)$ fit to 1-D training data.)
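To make the three ingredients above concrete, here is a minimal numpy sketch for the 1-D case; the synthetic data, function names, and the use of an off-the-shelf least-squares solver are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# Illustrative synthetic 1-D data: y is roughly linear in x, plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 5.0, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=50)

# Model: y_hat(x) = w0 + w1 * x
def predict(w, x):
    return w[0] + w[1] * x

# Loss: mean squared error over the training examples
def loss(w, x, y):
    return np.mean((predict(w, x) - y) ** 2)

# Optimization: here, an off-the-shelf least-squares solver
X = np.column_stack([np.ones_like(x), x])      # augmented inputs [1, x]
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print("fitted w:", w, " in-sample loss:", loss(w, x, y))
```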
Model Selection

• Model Complexity
– In what form should we parameterize the prediction
function?
– How complex should the model be?
• Example: linear, quadratic, or degree-d polynomial? (1-d case)
• Common Belief
– Simple models
• less flexible, but may be easy to solve (such as a linear model)
– Complex models
• more powerful, but difficult to solve, and prone to overfitting
• We will start by building simple, linear models


Higher-dimensional Linear Regression

• Circles are training examples


• Line/Plane represents the model/hypothesis
• Red lines represent in-sample error

Linear Model

• Given d-dimensional training samples $(\mathbf{x}_n, y_n)$, $n = 1, \ldots, N$
• Linear model: $y(\mathbf{x}) = w_0 + w_1 x_1 + \cdots + w_d x_d$
  – Equivalent form: $y(\mathbf{x}_n) = \mathbf{w}^\top \mathbf{x}_n$, with $\mathbf{w} = [w_0, w_1, \ldots, w_d]^\top$ and $\mathbf{x}_n = [1, x_{n1}, \ldots, x_{nd}]^\top$
  – So we usually apply ``augmented'', (d+1)-dimensional data
• More convenient to derive closed-form solutions
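A small sketch of the augmentation step, assuming the raw inputs are stored as the rows of an N-by-d numpy array; the array contents and names below are illustrative.

```python
import numpy as np

def augment(X_raw):
    """Prepend a constant-1 column so the bias w0 becomes part of w."""
    return np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

X_raw = np.array([[0.5, 1.2],
                  [2.0, 0.3],
                  [1.1, 1.9]])      # N = 3 samples, d = 2 features
X = augment(X_raw)                  # each row is now [1, x_n1, x_n2]
w = np.array([0.5, 2.0, -1.0])      # w = [w0, w1, w2]
y_hat = X @ w                       # prediction w' x_n for every sample
```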
Training (In-Sample) Error, Ein

Model: $y(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}$, with $\mathbf{w} \in \mathbb{R}^{(d+1) \times 1}$

Loss function: $E_{in}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} (\mathbf{w}^\top \mathbf{x}_n - y_n)^2 = \frac{1}{N} \lVert X\mathbf{w} - \mathbf{y} \rVert^2$

Input data matrix: $X \in \mathbb{R}^{N \times (d+1)}$ (one augmented sample per row); output: $\mathbf{y} \in \mathbb{R}^{N \times 1}$
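A minimal sketch of computing the in-sample error from the matrix form above, assuming X and y are numpy arrays of the shapes stated on this slide.

```python
import numpy as np

def in_sample_error(w, X, y):
    """E_in(w) = (1/N) * ||X w - y||^2, with X of shape (N, d+1)."""
    residual = X @ w - y
    return residual @ residual / len(y)
```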
Minimizing Ein by closed-form solution

In order to minimize the function
$E_{in}(\mathbf{w}) = \frac{1}{N} \lVert X\mathbf{w} - \mathbf{y} \rVert^2 = \frac{1}{N} (X\mathbf{w} - \mathbf{y})^\top (X\mathbf{w} - \mathbf{y})$,
we need to set its gradient to 0:
$\nabla E_{in}(\mathbf{w}) = \frac{2}{N} X^\top (X\mathbf{w} - \mathbf{y}) = 0$
and then solve the resulting equation (usually easier): $X^\top X \mathbf{w} = X^\top \mathbf{y}$.

Since $X^\top X$ is a square matrix, we can use its inverse:
$\mathbf{w} = (X^\top X)^{-1} X^\top \mathbf{y} = X^{+} \mathbf{y}$, where $X^{+} = (X^\top X)^{-1} X^\top$ is the pseudo-inverse.

Why? We actually want to solve $X\mathbf{w} = \mathbf{y}$, but $X$ is rectangular (no inverse defined). The pseudo-inverse is the counterpart/generalization of the square-matrix inverse in solving linear systems, so $\mathbf{w} = X^{+}\mathbf{y}$ applies here, as if $X$ could be inverted.
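A sketch of the closed-form fit; solving the normal equations with a linear solver rather than forming the explicit inverse is a common implementation choice, not something the slide prescribes.

```python
import numpy as np

def fit_closed_form(X, y):
    """Solve the normal equations X'X w = X'y for w."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Mathematically equivalent, but forming the explicit inverse is less stable:
# w = np.linalg.inv(X.T @ X) @ X.T @ y
```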
Pseudo Inverse by SVD

• A non-square m-by-n matrix A typically has no inverse; there are two definitions of its pseudo-inverse $A^{+}$:
  – (1) Mathematically flavored definition: the Moore-Penrose generalized inverse is the n-by-m matrix $A^{+}$ satisfying $A A^{+} A = A$ and $A^{+} A A^{+} = A^{+}$ (with $A A^{+}$ and $A^{+} A$ symmetric)
  – (2) Machine-learning flavored definition: $A^{+}\mathbf{b}$ is the least-squares solution of the linear system $A\mathbf{x} = \mathbf{b}$, i.e. the solution of $A^\top A \mathbf{x} = A^\top \mathbf{b}$
• The pseudo-inverse can be computed by SVD.
  – Using the SVD $A = U \Sigma V^\top$, then $A^{+} = V \Sigma^{-1} U^\top$ (assume that A is of full column rank)
  – Properties of U, V, and $\Sigma$: $U^\top U = I$, $V^\top V = V V^\top = I$, and $\Sigma$ is diagonal with positive entries (invertible)
  – How to prove that the SVD-based $V \Sigma^{-1} U^\top$ is the pseudo-inverse of A (by using the properties of U, V, and $\Sigma$ above)?
    • either verify the defining conditions, e.g. $A^{+} A = V \Sigma^{-1} U^\top U \Sigma V^\top = I$,
    • or show that $\mathbf{x} = A^{+}\mathbf{b}$ is the solution of $A^\top A \mathbf{x} = A^\top \mathbf{b}$
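A sketch of the SVD-based pseudo-inverse, checked against numpy's built-in pinv; the full-column-rank assumption from the slide is taken for granted here.

```python
import numpy as np

def pinv_via_svd(A):
    """A^+ = V Sigma^{-1} U' from the thin SVD A = U Sigma V'."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt.T @ np.diag(1.0 / s) @ U.T

A = np.random.randn(6, 3)                        # rectangular, full column rank (a.s.)
A_plus = pinv_via_svd(A)
print(np.allclose(A_plus, np.linalg.pinv(A)))    # matches numpy's pseudo-inverse
print(np.allclose(A_plus @ A, np.eye(3)))        # A^+ A = I under full column rank
```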
Linear Regression Algorithm

• Construct the augmented data matrix $X$ and the target vector $\mathbf{y}$ from the training samples
• Compute the pseudo-inverse $X^{+}$ and return $\mathbf{w} = X^{+}\mathbf{y}$

For numerical stability, we can replace the pseudo-inverse $(X^\top X)^{-1} X^\top$ by its SVD-based form $V \Sigma^{-1} U^\top$.
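Putting the pieces together, a minimal sketch of the algorithm; numpy's pinv computes the pseudo-inverse via the SVD internally, which matches the numerically stable route mentioned above.

```python
import numpy as np

def linear_regression(X_raw, y):
    """Fit the augmented linear model y ~ w' [1, x]."""
    X = np.hstack([np.ones((len(y), 1)), X_raw])   # augment the data with a 1-column
    return np.linalg.pinv(X) @ y                   # w = X^+ y
```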
Augmented Linear Model

• Can we obtain both (1) a closed-form solution and (2) the capacity of modelling nonlinear shapes?
• Nonlinear Data Augmentation
  – If we want to use the following model (e.g. a polynomial in a 1-D input $x$):
    $y(x) = w_0 + w_1 x + w_2 x^2 + \cdots + w_d x^d$
  – Then the regression can still be written as $X\mathbf{w} \approx \mathbf{y}$, where each row of $X$ is the augmented sample $[1, \; x_n, \; x_n^2, \; \ldots, \; x_n^d]$, so the same closed-form solution applies
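A sketch of the nonlinear augmentation for a 1-D input: map each scalar $x_n$ to the row $[1, x_n, x_n^2, \ldots, x_n^d]$ and reuse the closed-form fit. The degree, the sine target, and the helper name are illustrative choices.

```python
import numpy as np

def polynomial_features(x, degree):
    """Map scalar inputs x (shape (N,)) to rows [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

x = np.linspace(-1.0, 1.0, 30)
y = np.sin(np.pi * x)                        # a nonlinear target, for illustration
X_aug = polynomial_features(x, degree=5)     # shape (30, 6)
w = np.linalg.pinv(X_aug) @ y                # the closed form still applies
```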
Minimizing Loss by Gradient Descent

• Option 1: compute the gradient and set it to 0 (matrix form, more compact, closed-form):
  $\nabla E_{in}(\mathbf{w}) = \frac{2}{N} X^\top (X\mathbf{w} - \mathbf{y}) = 0 \;\Rightarrow\; \mathbf{w} = (X^\top X)^{-1} X^\top \mathbf{y}$
• Option 2: compute the gradient and iteratively ``hill climb'' downhill (gradient descent):
  $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \, \nabla E_{in}(\mathbf{w}^{(t)})$

Question: will they lead to the same solution?
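A sketch comparing the two routes on the same synthetic data; the learning rate, iteration count, and tolerance are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + 0.1 * rng.normal(size=100)

# Option 1: set the gradient to zero (closed form)
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Option 2: iterative gradient descent on E_in(w) = (1/N) ||Xw - y||^2
w, eta = np.zeros(3), 0.1
for _ in range(2000):
    grad = (2.0 / len(y)) * X.T @ (X @ w - y)
    w -= eta * grad

print(np.allclose(w, w_closed, atol=1e-6))   # both reach the same minimizer
```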


Convexity/Optimality
• Gradient-based methods vs closed-form solutions
  – A gradient-based solution is easy to derive, but it needs a step length to be chosen and many iterations
  – A closed-form solution is not always achievable and needs more math, but it may reach the optimal solution in a single step
• For the objective in linear regression
  – The two methods theoretically find the same solution
  – Because the loss function is quadratic and convex
• For general optimization problems
  – The gradient method is more popular/feasible
    • But it is prone to local optima
  – A closed-form solution is usually difficult
    • You can be lucky, though, if you can derive a fixed-point iteration or transform the problem into an SVD problem
Stochastic Gradient Descent

• The gradient is a sum of N terms (given N samples)
  – Compute the gradient (summation form): $\nabla E_{in}(\mathbf{w}) = \frac{2}{N} \sum_{n=1}^{N} (\mathbf{w}^\top \mathbf{x}_n - y_n) \, \mathbf{x}_n$
  – Can be quite expensive for a large sample set
• Stochastic Gradient Descent:
  – one sample at a time: $\mathbf{w} \leftarrow \mathbf{w} - \eta \cdot 2(\mathbf{w}^\top \mathbf{x}_n - y_n)\,\mathbf{x}_n$, where the sample index n can be randomly chosen
• Mini-batch Stochastic Gradient:
  – one can choose a random subset of samples (indexed by B): $\mathbf{w} \leftarrow \mathbf{w} - \eta \, \frac{2}{|B|} \sum_{n \in B} (\mathbf{w}^\top \mathbf{x}_n - y_n)\,\mathbf{x}_n$
  – |B| = 32, 64, …
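A sketch of the mini-batch variant under the same squared loss; the batch size, learning rate, epoch count, and function name are illustrative assumptions.

```python
import numpy as np

def minibatch_sgd(X, y, eta=0.01, batch_size=32, epochs=50, seed=0):
    """Mini-batch SGD for E_in(w) = (1/N) ||Xw - y||^2."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(N)                 # reshuffle the samples each epoch
        for start in range(0, N, batch_size):
            B = order[start:start + batch_size]    # random subset of sample indices
            grad = (2.0 / len(B)) * X[B].T @ (X[B] @ w - y[B])
            w -= eta * grad
    return w
```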


(Figure: loss-function landscape comparing gradient descent and stochastic gradient descent. SGD may drive the iterations out of a local optimum, but it leads to a fluctuating objective function.)