
CIS 4526: Foundations of Machine Learning

Linear Regression
(modified from Sanja Fidler)

Instructor: Kai Zhang


CIS @ Temple University, Fall 2020
Regression Problems

• Curve Fitting

• Time Series Forecast


Regression Problem

• What do all these problems have in common?
  – Input: d-dimensional samples/vectors
  – Output: continuous target value
• How to make predictions?
  – A model, a function that represents the relationship between the input x and the output y
  – A loss (or cost, or objective) function, which tells us how well our model approximates the training examples
  – Optimization, a way of finding the parameters of our model that minimize the loss function
Simple 1-D example

(Figure: a model $y(x)$ fit to 1-D training data.)
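To make the three ingredients above concrete, here is a minimal numpy sketch for the 1-D case; the synthetic data, function names, and the use of an off-the-shelf least-squares solver are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# Illustrative synthetic 1-D data: y is roughly linear in x, plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 5.0, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=50)

# Model: y_hat(x) = w0 + w1 * x
def predict(w, x):
    return w[0] + w[1] * x

# Loss: mean squared error over the training examples
def loss(w, x, y):
    return np.mean((predict(w, x) - y) ** 2)

# Optimization: here, an off-the-shelf least-squares solver
X = np.column_stack([np.ones_like(x), x])      # augmented inputs [1, x]
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print("fitted w:", w, " in-sample loss:", loss(w, x, y))
```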
Model Selection

• Model Complexity
– In what form should we parameterize the prediction
function?
– How complex should the model be?
• Example: linear, quadratic, or degree-d polynomial? (1-d case)
• Common Belief
– Simple models
• less flexible, but may be easy to solve (such as a linear model)
– Complex models
• more powerful, but difficult to solve, and prone to overfitting
• We will start by building simple, linear models


Higher-dimensional Linear Regression

• Circles are training examples


• Line/Plane represents the model/hypothesis
• Red lines represent in-sample error

Linear Model

• Given d-dimensional training samples $(\mathbf{x}_n, y_n)$, $n = 1, \ldots, N$
• Linear model: $y(\mathbf{x}) = w_0 + w_1 x_1 + \cdots + w_d x_d$
  – Equivalent form: $y(\mathbf{x}_n) = \mathbf{w}^\top \mathbf{x}_n$, with $\mathbf{w} = [w_0, w_1, \ldots, w_d]^\top$ and $\mathbf{x}_n = [1, x_{n1}, \ldots, x_{nd}]^\top$
  – So we usually apply ``augmented'', (d+1)-dimensional data
• More convenient to derive closed-form solutions
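A small sketch of the augmentation step, assuming the raw inputs are stored as the rows of an N-by-d numpy array; the array contents and names below are illustrative.

```python
import numpy as np

def augment(X_raw):
    """Prepend a constant-1 column so the bias w0 becomes part of w."""
    return np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

X_raw = np.array([[0.5, 1.2],
                  [2.0, 0.3],
                  [1.1, 1.9]])      # N = 3 samples, d = 2 features
X = augment(X_raw)                  # each row is now [1, x_n1, x_n2]
w = np.array([0.5, 2.0, -1.0])      # w = [w0, w1, w2]
y_hat = X @ w                       # prediction w' x_n for every sample
```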
Training (In-Sample) Error, Ein

Model: $y(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}$, with $\mathbf{w} \in \mathbb{R}^{(d+1) \times 1}$

Loss function: $E_{in}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} (\mathbf{w}^\top \mathbf{x}_n - y_n)^2 = \frac{1}{N} \lVert X\mathbf{w} - \mathbf{y} \rVert^2$

Input data matrix: $X \in \mathbb{R}^{N \times (d+1)}$ (one augmented sample per row); output: $\mathbf{y} \in \mathbb{R}^{N \times 1}$
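A minimal sketch of computing the in-sample error from the matrix form above, assuming X and y are numpy arrays of the shapes stated on this slide.

```python
import numpy as np

def in_sample_error(w, X, y):
    """E_in(w) = (1/N) * ||X w - y||^2, with X of shape (N, d+1)."""
    residual = X @ w - y
    return residual @ residual / len(y)
```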
Minimizing Ein by closed-form solution

In order to minimize the function
$E_{in}(\mathbf{w}) = \frac{1}{N} \lVert X\mathbf{w} - \mathbf{y} \rVert^2 = \frac{1}{N} (X\mathbf{w} - \mathbf{y})^\top (X\mathbf{w} - \mathbf{y})$,
we need to set its gradient to 0:
$\nabla E_{in}(\mathbf{w}) = \frac{2}{N} X^\top (X\mathbf{w} - \mathbf{y}) = 0$
and then solve the resulting equation (usually easier): $X^\top X \mathbf{w} = X^\top \mathbf{y}$.

Since $X^\top X$ is a square matrix, we can use its inverse:
$\mathbf{w} = (X^\top X)^{-1} X^\top \mathbf{y} = X^{+} \mathbf{y}$, where $X^{+} = (X^\top X)^{-1} X^\top$ is the pseudo-inverse.

Why? We actually want to solve $X\mathbf{w} = \mathbf{y}$, but $X$ is rectangular (no inverse defined). The pseudo-inverse is the counterpart/generalization of the square-matrix inverse in solving linear systems, so $\mathbf{w} = X^{+}\mathbf{y}$ applies here, as if $X$ could be inverted.
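A sketch of the closed-form fit; solving the normal equations with a linear solver rather than forming the explicit inverse is a common implementation choice, not something the slide prescribes.

```python
import numpy as np

def fit_closed_form(X, y):
    """Solve the normal equations X'X w = X'y for w."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Mathematically equivalent, but forming the explicit inverse is less stable:
# w = np.linalg.inv(X.T @ X) @ X.T @ y
```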
Pseudo Inverse by SVD

• A non-square m-by-n matrix A typically has no inverse; there are two definitions of its pseudo-inverse $A^{+}$:
  – (1) Mathematically flavored definition: the Moore-Penrose generalized inverse is the n-by-m matrix $A^{+}$ satisfying $A A^{+} A = A$ and $A^{+} A A^{+} = A^{+}$ (with $A A^{+}$ and $A^{+} A$ symmetric)
  – (2) Machine-learning flavored definition: $A^{+}\mathbf{b}$ is the least-squares solution of the linear system $A\mathbf{x} = \mathbf{b}$, i.e. the solution of $A^\top A \mathbf{x} = A^\top \mathbf{b}$
• The pseudo-inverse can be computed by SVD.
  – Using the SVD $A = U \Sigma V^\top$, then $A^{+} = V \Sigma^{-1} U^\top$ (assume that A is of full column rank)
  – Properties of U, V, and $\Sigma$: $U^\top U = I$, $V^\top V = V V^\top = I$, and $\Sigma$ is diagonal with positive entries (invertible)
  – How to prove that the SVD-based $V \Sigma^{-1} U^\top$ is the pseudo-inverse of A (by using the properties of U, V, and $\Sigma$ above)?
    • either verify the defining conditions, e.g. $A^{+} A = V \Sigma^{-1} U^\top U \Sigma V^\top = I$,
    • or show that $\mathbf{x} = A^{+}\mathbf{b}$ is the solution of $A^\top A \mathbf{x} = A^\top \mathbf{b}$
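A sketch of the SVD-based pseudo-inverse, checked against numpy's built-in pinv; the full-column-rank assumption from the slide is taken for granted here.

```python
import numpy as np

def pinv_via_svd(A):
    """A^+ = V Sigma^{-1} U' from the thin SVD A = U Sigma V'."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt.T @ np.diag(1.0 / s) @ U.T

A = np.random.randn(6, 3)                        # rectangular, full column rank (a.s.)
A_plus = pinv_via_svd(A)
print(np.allclose(A_plus, np.linalg.pinv(A)))    # matches numpy's pseudo-inverse
print(np.allclose(A_plus @ A, np.eye(3)))        # A^+ A = I under full column rank
```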
Linear Regression Algorithm

• Construct the augmented data matrix $X$ and the target vector $\mathbf{y}$ from the training samples
• Compute the pseudo-inverse $X^{+}$ and return $\mathbf{w} = X^{+}\mathbf{y}$

For numerical stability, we can replace the pseudo-inverse $(X^\top X)^{-1} X^\top$ by its SVD-based form $V \Sigma^{-1} U^\top$.
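Putting the pieces together, a minimal sketch of the algorithm; numpy's pinv computes the pseudo-inverse via the SVD internally, which matches the numerically stable route mentioned above.

```python
import numpy as np

def linear_regression(X_raw, y):
    """Fit the augmented linear model y ~ w' [1, x]."""
    X = np.hstack([np.ones((len(y), 1)), X_raw])   # augment the data with a 1-column
    return np.linalg.pinv(X) @ y                   # w = X^+ y
```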
Augmented Linear Model

• Can we obtain both (1) a closed-form solution and (2) the capacity of modelling nonlinear shapes?
• Nonlinear Data Augmentation
  – If we want to use the following model (e.g. a polynomial in a 1-D input $x$):
    $y(x) = w_0 + w_1 x + w_2 x^2 + \cdots + w_d x^d$
  – Then the regression can still be written as $X\mathbf{w} \approx \mathbf{y}$, where each row of $X$ is the augmented sample $[1, \; x_n, \; x_n^2, \; \ldots, \; x_n^d]$, so the same closed-form solution applies
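A sketch of the nonlinear augmentation for a 1-D input: map each scalar $x_n$ to the row $[1, x_n, x_n^2, \ldots, x_n^d]$ and reuse the closed-form fit. The degree, the sine target, and the helper name are illustrative choices.

```python
import numpy as np

def polynomial_features(x, degree):
    """Map scalar inputs x (shape (N,)) to rows [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

x = np.linspace(-1.0, 1.0, 30)
y = np.sin(np.pi * x)                        # a nonlinear target, for illustration
X_aug = polynomial_features(x, degree=5)     # shape (30, 6)
w = np.linalg.pinv(X_aug) @ y                # the closed form still applies
```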
Minimizing Loss by Gradient Descent

• Option 1: compute the gradient and set it to 0 (matrix form, more compact, closed-form):
  $\nabla E_{in}(\mathbf{w}) = \frac{2}{N} X^\top (X\mathbf{w} - \mathbf{y}) = 0 \;\Rightarrow\; \mathbf{w} = (X^\top X)^{-1} X^\top \mathbf{y}$
• Option 2: compute the gradient and iteratively ``hill climb'' downhill (gradient descent):
  $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \, \nabla E_{in}(\mathbf{w}^{(t)})$

Question: will they lead to the same solution?
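A sketch comparing the two routes on the same synthetic data; the learning rate, iteration count, and tolerance are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + 0.1 * rng.normal(size=100)

# Option 1: set the gradient to zero (closed form)
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Option 2: iterative gradient descent on E_in(w) = (1/N) ||Xw - y||^2
w, eta = np.zeros(3), 0.1
for _ in range(2000):
    grad = (2.0 / len(y)) * X.T @ (X @ w - y)
    w -= eta * grad

print(np.allclose(w, w_closed, atol=1e-6))   # both reach the same minimizer
```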


Convexity/Optimality
• Gradient-based methods vs closed-form solutions
  – A gradient-based solution is easy to derive, but it needs a step length to be chosen and many iterations
  – A closed-form solution is not always achievable and needs more math, but it may reach the optimal solution in a single step
• For the objective in linear regression
  – The two methods theoretically find the same solution
  – Because the loss function is quadratic and convex
• For general optimization problems
  – The gradient method is more popular/feasible
    • But it is prone to local optima
  – A closed-form solution is usually difficult
    • You can be lucky, though, if you can derive a fixed-point iteration or transform the problem into an SVD problem
Stochastic Gradient Descent

• The gradient is a sum of N terms (given N samples)
  – Compute the gradient (summation form): $\nabla E_{in}(\mathbf{w}) = \frac{2}{N} \sum_{n=1}^{N} (\mathbf{w}^\top \mathbf{x}_n - y_n) \, \mathbf{x}_n$
  – Can be quite expensive for a large sample set
• Stochastic Gradient Descent:
  – one sample at a time: $\mathbf{w} \leftarrow \mathbf{w} - \eta \cdot 2(\mathbf{w}^\top \mathbf{x}_n - y_n)\,\mathbf{x}_n$, where the sample index n can be randomly chosen
• Mini-batch Stochastic Gradient:
  – one can choose a random subset of samples (indexed by B): $\mathbf{w} \leftarrow \mathbf{w} - \eta \, \frac{2}{|B|} \sum_{n \in B} (\mathbf{w}^\top \mathbf{x}_n - y_n)\,\mathbf{x}_n$
  – |B| = 32, 64, …
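A sketch of the mini-batch variant under the same squared loss; the batch size, learning rate, epoch count, and function name are illustrative assumptions.

```python
import numpy as np

def minibatch_sgd(X, y, eta=0.01, batch_size=32, epochs=50, seed=0):
    """Mini-batch SGD for E_in(w) = (1/N) ||Xw - y||^2."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(N)                 # reshuffle the samples each epoch
        for start in range(0, N, batch_size):
            B = order[start:start + batch_size]    # random subset of sample indices
            grad = (2.0 / len(B)) * X[B].T @ (X[B] @ w - y[B])
            w -= eta * grad
    return w
```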


(Figure: loss-function landscape comparing gradient descent and stochastic gradient descent. SGD may drive the iterations out of a local optimum, but it leads to a fluctuating objective function.)