
Regression

- Linear
- Non Linear



Introduction
- The goal is to estimate real-valued predictions from the given features
- A prediction model is tuned based on the training pairs $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$
- Example: predicting house price from 3 attributes

Size (m²)    Age (years)    Region    Price (10⁶ T)
100          2              5         500
80           25             3         250
…            …              …         …


Regression
• From our point of view, regression is a learning problem; to solve it we need:

• Training data: a set of data of the given type, here $(x, y)$ pairs
• Hypothesis space: a set of mapping functions from the feature vector to the target
• Learning algorithm: optimization of a cost function
• Evaluation criterion: estimates how well the learned model generalizes to unseen examples


Regression
• We begin with the class of linear functions:

• In many cases it gives good results

• It is often easy to extend linear models to generalized linear models and thereby cover more complex regression functions


Linear regression: hypothesis space
• $f: \mathbb{R}^d \to \mathbb{R}$,
$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_d x_d$, where
$W = [w_0, w_1, \dots, w_d]^T$ is the parameter vector that needs to be set.
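As a concrete illustration, a minimal NumPy sketch of evaluating this hypothesis (the helper name and the numbers are illustrative, not from the slides):

```python
import numpy as np

def predict(x, W):
    """Evaluate y = w0 + w1*x1 + ... + wd*xd for a single feature vector x."""
    x_aug = np.concatenate(([1.0], x))  # prepend 1 so that w0 acts as the bias
    return x_aug @ W

# Example with d = 3 features and arbitrary parameters
W = np.array([10.0, 4.0, -2.0, 1.5])   # [w0, w1, w2, w3]
x = np.array([100.0, 2.0, 5.0])
print(predict(x, W))                    # 10 + 4*100 - 2*2 + 1.5*5 = 413.5
```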



Linear Regression
• The learning algorithm needs a defined error (cost) function that is to be minimized.
• The cost function is defined as the sum of squared differences between the real and the predicted values over the training data (the SSD criterion).
• If the model's prediction for $x^{(i)}$ (the $i$-th training sample) is $f(x^{(i)}, w)$, then:
Cost function $= \sum_{i=1}^{n} (y^{(i)} - f(x^{(i)}, w))^2$.
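A minimal sketch of this cost for a whole training set, assuming the bias-augmented convention used above (names are illustrative):

```python
import numpy as np

def ssd_cost(X, y, W):
    """SSD cost: sum over i of (y_i - f(x_i, W))^2 for a linear model."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # add a column of ones for w0
    residuals = y - X_aug @ W                         # y_i - f(x_i, W)
    return float(np.sum(residuals ** 2))
```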



Linear regression: univariate and multivariate examples
• Univariate: the input data include just one feature
Cost function $= \sum_{i=1}^{n} (y^{(i)} - (w_0 + w_1 x^{(i)}))^2$

• Multivariate: the input data are d-dimensional vectors

Cost function $= \sum_{i=1}^{n} (y^{(i)} - (w_0 + w_1 x_1^{(i)} + w_2 x_2^{(i)} + \dots + w_d x_d^{(i)}))^2$


Linear regression: univariate and multivariate examples
• In the linear case, the cost function is a quadratic function of the model parameters, and its surface in the (d+1)-dimensional space is convex. Therefore it has a global optimum.

[Figure: house price vs. house size with a fitted regression line]


Cost function: univariate example
Linear regression: $J(w_0, w_1)$ is a function of $w_0$ and $w_1$.


Cost function optimization: univariate
• $J(W) = \sum_{i=1}^{n} (y^{(i)} - (w_0 + w_1 x^{(i)}))^2$
• Necessary conditions for the “optimal” parameter values:

$\frac{\partial J}{\partial w_0} = \sum_{i=1}^{n} 2\,(y^{(i)} - (w_0 + w_1 x^{(i)}))\,(-1) = 0$

$\frac{\partial J}{\partial w_1} = \sum_{i=1}^{n} 2\,(y^{(i)} - (w_0 + w_1 x^{(i)}))\,(-x^{(i)}) = 0$

• A system of two linear equations in two unknown parameters
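Solving this two-equation system gives the standard closed form for univariate least squares (stated here for reference; it follows directly from the two conditions above):

$$w_1 = \frac{\sum_{i=1}^{n} (x^{(i)} - \bar{x})(y^{(i)} - \bar{y})}{\sum_{i=1}^{n} (x^{(i)} - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x},$$

where $\bar{x}$ and $\bar{y}$ are the sample means of the inputs and targets.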



Cost function optimization: multi-variate
• $J(W) = \sum_{i=1}^{n} (y^{(i)} - (w_0 + w_1 x_1^{(i)} + w_2 x_2^{(i)} + \dots + w_d x_d^{(i)}))^2 = \sum_{i=1}^{n} (y^{(i)} - W^T x^{(i)})^2$
• Cost function in matrix form: $J(W) = \| \mathbf{y} - XW \|^2$, where $X$ is the $n \times (d+1)$ design matrix whose $i$-th row is $[1, x_1^{(i)}, \dots, x_d^{(i)}]$ and $\mathbf{y} = [y^{(1)}, \dots, y^{(n)}]^T$.


Minimizing Cost Function
• Optimal weight vector:
$W^* = \arg\min_{W} J(W)$
$J(W) = \| \mathbf{y} - XW \|^2$
$\nabla_W J(W) = -2 X^T (\mathbf{y} - XW)$

$\nabla_W J(W) = 0 \;\Rightarrow\; (X^T X) W = X^T \mathbf{y}$
$W = (X^T X)^{-1} X^T \mathbf{y}$


Minimizing Cost Function
$W = (X^T X)^{-1} X^T \mathbf{y}$

$X^\dagger = (X^T X)^{-1} X^T$

$W = X^\dagger \mathbf{y}$

$X^\dagger$ is the pseudoinverse of $X$
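A minimal NumPy sketch of this closed-form fit (illustrative; np.linalg.pinv computes the pseudoinverse, which is numerically safer than forming $(X^T X)^{-1}$ explicitly):

```python
import numpy as np

def fit_linear_regression(X, y):
    """Closed-form least squares: W = pseudoinverse(X_aug) @ y."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # bias column of ones
    return np.linalg.pinv(X_aug) @ y                  # W = X^dagger y

# Usage on synthetic data generated from y = 3 + 2*x1 - x2 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3 + 2 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=50)
print(fit_linear_regression(X, y))  # approximately [3, 2, -1]
```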



Another approach for optimizing the sum of squared errors: gradient descent
• An iterative approach for solving the following optimization problem:

$J(W) = \sum_{i=1}^{n} (y^{(i)} - W^T x^{(i)})^2$

$W^{t+1} = W^t - \gamma_t \nabla J(W^t) \;\Rightarrow\; W^{t+1} = W^t + \eta \sum_{i=1}^{n} (y^{(i)} - (W^t)^T x^{(i)})\, x^{(i)}$ (a code sketch of this procedure follows after the steps below)

- Steps:
• Start from an initial $W^0$
• Repeat
  • Update $W^t$ to $W^{t+1}$ in order to reduce $J$
  • $t \leftarrow t + 1$
• until we hopefully end up at a minimum
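A minimal sketch of this loop (illustrative; the fixed step size, iteration count, and zero initialization are assumptions and may need tuning to the data scale):

```python
import numpy as np

def gradient_descent(X, y, eta=1e-3, n_iters=5000):
    """Batch gradient descent on J(W) = sum_i (y_i - W^T x_i)^2."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # bias column
    W = np.zeros(X_aug.shape[1])                      # start from W^0 = 0
    for _ in range(n_iters):
        residuals = y - X_aug @ W                     # y^i - W^T x^i
        grad = -2.0 * X_aug.T @ residuals             # gradient of J at W^t
        W = W - eta * grad                            # W^{t+1} = W^t - eta * grad
    return W
```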



Review: Gradient descent disadvantages
• Local minima problem

• However, when $J$ is convex, a local minimum is also the global minimum $\Rightarrow$ gradient descent can converge to the global solution.

• If $\eta$ is too small, gradient descent can be slow.
• If $\eta$ is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge.


Review: Problem of gradient descent with non-convex cost functions

[Figures: gradient descent on a non-convex cost surface]


$J(\boldsymbol{\theta}) = \sum_{i=1}^{n} (y^{(i)} - \boldsymbol{\theta}^T x^{(i)})^2$

$f(x; \boldsymbol{\theta}) = \theta_0 + \theta_1 x$ (a function of the two parameters $(\theta_0, \theta_1)$)

[Figure sequence: several slides repeat this cost function and univariate model, each originally accompanied by a plot]


Stochastic gradient descent
• Batch techniques process the entire training set in one go,
  • so they can be computationally costly for large data sets.

$J(W) = \sum_{i=1}^{n} (y^{(i)} - W^T x^{(i)})^2$

However, in the stochastic gradient version, instead of summing over all data points, only a few samples (or even a single one) are processed in each step of the iterative training procedure (see the sketch below):
$J(W) = (y^{(i)} - W^T x^{(i)})^2$, $\quad \nabla_W J(W) = 2\,(y^{(i)} - W^T x^{(i)})\,(-x^{(i)})$,
$W^{t+1} = W^t - \gamma_t \nabla J(W^t) = W^t + \eta_t\, x^{(i)} (y^{(i)} - (W^t)^T x^{(i)})$
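A minimal sketch of this update rule (illustrative; the constant step size and the number of passes over the data are assumptions):

```python
import numpy as np

def sgd(X, y, eta=0.01, n_epochs=50, seed=0):
    """Stochastic gradient descent: one randomly chosen sample per update."""
    rng = np.random.default_rng(seed)
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # bias column
    W = np.zeros(X_aug.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(X_aug.shape[0]):     # visit samples in random order
            error = y[i] - X_aug[i] @ W               # y^i - W^T x^i
            W = W + eta * error * X_aug[i]            # W^{t+1} = W^t + eta * x^i * error
    return W
```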



Stochastic gradient descent
• In this case, a randomly chosen training sample is processed at each step and the weight vector is updated. Over the repeated iterations, each sample may be processed several times.
• Sequential learning is also appropriate for real-time applications, where
  • data observations arrive in a continuous stream and predictions must be made before all of the data have been seen.


Non Linear Regression

• How can linear regression be extended to the non-linear case?

• Transform the training data into a new space using appropriate basis functions
• Learn a linear regression on the new feature vectors obtained from the basis functions
• Rewrite the model in terms of the original input features, using the relation between the new and the original features, to obtain the input-output relation


Non Linear Regression
• Example: $m$-th order polynomial regression ($f: \mathbb{R} \to \mathbb{R}$)

$y = f(x) = w_0 + w_1 x + w_2 x^2 + \dots + w_m x^m$
We define the new variables with the basis (transform) functions:
$z_i = \phi_i(x) = x^i$
Therefore:
$y = f(x) = w_0 + w_1 \phi_1(x) + \dots + w_m \phi_m(x)$
or: $y = g(z) = w_0 + w_1 z_1 + \dots + w_m z_m$, with $z = [z_1, z_2, \dots, z_m]^T$
We can find the linear relation between $y$ and $z$, and then, by replacing each $z_i$ with $\phi_i(x)$, obtain the non-linear relation between $y$ and $x$ expressed through the $\phi_i(x)$ functions.
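A minimal sketch of this example, fitting the polynomial by ordinary linear regression on the expanded features $z_i = x^i$ (illustrative helper names; np.vander builds the columns $[1, x, x^2, \dots, x^m]$):

```python
import numpy as np

def fit_polynomial(x, y, m):
    """Fit y = w0 + w1*x + ... + wm*x^m by least squares on z_i = x^i."""
    Z = np.vander(x, N=m + 1, increasing=True)        # columns [1, x, x^2, ..., x^m]
    W, *_ = np.linalg.lstsq(Z, y, rcond=None)         # linear regression in z-space
    return W

# Usage on synthetic data drawn from a cubic with a little noise
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 30)
y = 1 - 2 * x + 0.5 * x**3 + 0.05 * rng.normal(size=x.size)
print(fit_polynomial(x, y, m=3))  # approximately [1, -2, 0, 0.5]
```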



Non Linear Regression
• Example (continued)

• The $i$-th sample $(x^{(i)}, y^{(i)})$ is mapped into the new (transformed) space as:

$z^{(i)} = [z_1^{(i)}, \dots, z_m^{(i)}]^T = \phi(x^{(i)}) = [\phi_1(x^{(i)}), \dots, \phi_m(x^{(i)})]^T$

The solution in the new space, based on linear regression, is $W = (Z^T Z)^{-1} Z^T \mathbf{y}$, where:

$\mathbf{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}, \quad
Z = \begin{bmatrix}
1 & z_1^{(1)} & z_2^{(1)} & \cdots & z_m^{(1)} \\
1 & z_1^{(2)} & z_2^{(2)} & \cdots & z_m^{(2)} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & z_1^{(n)} & z_2^{(n)} & \cdots & z_m^{(n)}
\end{bmatrix}, \quad
W = \begin{bmatrix} w_0 \\ \vdots \\ w_m \end{bmatrix}$

Non Linear Regression
• If we replace $z^{(i)}$ by its equivalent $\phi(x^{(i)})$, the matrix $Z$ is replaced by the matrix $\Phi$:

$\Phi = \begin{bmatrix}
1 & \phi_1(x^{(1)}) & \phi_2(x^{(1)}) & \cdots & \phi_m(x^{(1)}) \\
1 & \phi_1(x^{(2)}) & \phi_2(x^{(2)}) & \cdots & \phi_m(x^{(2)}) \\
\vdots & \vdots & \vdots & & \vdots \\
1 & \phi_1(x^{(n)}) & \phi_2(x^{(n)}) & \cdots & \phi_m(x^{(n)})
\end{bmatrix}$

After finding the unknown parameters $W = (Z^T Z)^{-1} Z^T \mathbf{y} = (\Phi^T \Phi)^{-1} \Phi^T \mathbf{y}$, the final solution is:
$y = g(z) = w_0 + w_1 z_1 + \dots + w_m z_m$
$\;\; = f(x) = w_0 + w_1 \phi_1(x) + \dots + w_m \phi_m(x)$
$\;\; = w_0 + w_1 x + w_2 x^2 + \dots + w_m x^m$

Non Linear Regression
• In other examples, the same principles and equations are used. The only difference is in the transfer (basis) functions, which should be defined appropriately for the task.
• If the input samples belong to a d-dimensional space and are transferred to a new m-dimensional space,
$x = [x_1, x_2, \dots, x_d]^T \;\longrightarrow\; z = [z_1, z_2, \dots, z_m]^T$,
then a set of m transfer functions is needed to perform the mapping (a sketch follows below):
$z = \phi(x) \equiv \begin{cases} z_1 = \phi_1(x) \\ z_2 = \phi_2(x) \\ \;\;\vdots \\ z_m = \phi_m(x) \end{cases}$
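A minimal sketch of this general mapping (illustrative helper names; any list of m functions $\phi_j$ from a d-vector to a scalar can be plugged in):

```python
import numpy as np

def basis_design_matrix(X, basis_functions):
    """Rows are [1, phi_1(x), ..., phi_m(x)] for each input sample x."""
    n = X.shape[0]
    Phi = np.ones((n, len(basis_functions) + 1))      # first column of ones for w0
    for j, phi in enumerate(basis_functions, start=1):
        Phi[:, j] = [phi(x) for x in X]
    return Phi

def fit_basis_regression(X, y, basis_functions):
    """Linear regression in the transformed space: W = (Phi^T Phi)^{-1} Phi^T y."""
    Phi = basis_design_matrix(X, basis_functions)
    W, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return W
```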



Basis Functions
In fact, a weighted sum of appropriate basis functions can approximate many prototype functions. For this purpose, a sufficient number m of weighted basis functions should be used, and the weights and parameters of those m basis functions should be tuned.


Non Linear Regression: Examples

[Example figures]


Basis Function Examples
• Two frequently used transfer functions are introduced in this section.

• Gaussian: $z_j = \phi_j(x) = \exp\!\left( \frac{-1}{2\sigma_j^2} \| x - c_j \|^2 \right)$

• Gaussian bases are most appropriate for estimating target functions whose significant values are concentrated in local areas
• The parameter $c_j$ controls the center of the function, and the parameter $\sigma_j^2$ controls the size of the region the function affects
• The training data represent the target function
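A minimal sketch of a Gaussian basis function factory that can be combined with the basis-regression sketch above (the center and width values are illustrative):

```python
import numpy as np

def gaussian_basis(c, sigma):
    """Return phi_j(x) = exp(-||x - c||^2 / (2 * sigma^2))."""
    c = np.asarray(c, dtype=float)
    return lambda x: float(np.exp(-np.sum((np.asarray(x, dtype=float) - c) ** 2)
                                  / (2.0 * sigma ** 2)))

# Example: five Gaussian bases with centers spread over [0, 1] and a shared width
bases = [gaussian_basis([c], sigma=0.2) for c in np.linspace(0.0, 1.0, 5)]
print([round(phi([0.3]), 3) for phi in bases])  # each basis responds most near its center
```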



Basis Function Examples
• Sigmoid function:

$\phi_j(x) = \sigma\!\left( \frac{\| x - c_j \|}{\sigma_j} \right)$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$

This function grows from zero to one, so it can simulate some probabilistic functions such as cumulative distribution functions (CDFs). Around zero it is approximately linear, while far from the center (zero) it is strongly non-linear with a low gradient.
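A minimal sketch, using the one-dimensional form $\sigma((x - c_j)/\sigma_j)$, which ranges from 0 to 1 as described; the slides' expression uses the distance $\|x - c_j\|$ for vector inputs (names and values are illustrative):

```python
import numpy as np

def sigmoid_basis(c, s):
    """Return phi_j(x) = 1 / (1 + exp(-(x - c) / s)) for scalar inputs."""
    return lambda x: 1.0 / (1.0 + np.exp(-(x - c) / s))

# Example: a sigmoid basis centered at 0.5 with scale 0.1
phi = sigmoid_basis(c=0.5, s=0.1)
print(phi(0.0), phi(0.5), phi(1.0))  # near 0 at the left, 0.5 at the center, near 1 at the right
```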

