
Regression

- Linear
- Non Linear



Introduction
- The goal is to estimate real-valued predictions from the given features
- A prediction model is tuned based on the training pairs $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$
- Example: predicting house price from 3 attributes

Size (m²)    Age (years)    Region    Price (10⁶ T)
100          2              5         500
80           25             3         250
…            …              …         …


Regression
• From our point of view, regression is a learning problem; to solve it we need:

• Training data: a set of data of the given type, here $(x, y)$ pairs
• Hypothesis space: a set of mapping functions from the feature vector to the target
• Learning algorithm: optimization of a cost function
• Evaluation criterion: estimates how well the learned model generalizes to unseen examples


Regression
• We begin with the class of linear functions:

• In many cases it gives good results

• It is often easy to extend linear models to generalized linear models and thereby cover more complex regression functions


Linear regression: hypothesis space
• $f: \mathbb{R}^d \to \mathbb{R}$,
$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_d x_d$, where
$W = [w_0, w_1, \dots, w_d]^T$ is the parameter vector that needs to be set.
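As a concrete illustration, a minimal NumPy sketch of evaluating this hypothesis (the helper name and the numbers are illustrative, not from the slides):

```python
import numpy as np

def predict(x, W):
    """Evaluate y = w0 + w1*x1 + ... + wd*xd for a single feature vector x."""
    x_aug = np.concatenate(([1.0], x))  # prepend 1 so that w0 acts as the bias
    return x_aug @ W

# Example with d = 3 features and arbitrary parameters
W = np.array([10.0, 4.0, -2.0, 1.5])   # [w0, w1, w2, w3]
x = np.array([100.0, 2.0, 5.0])
print(predict(x, W))                    # 10 + 4*100 - 2*2 + 1.5*5 = 413.5
```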



Linear Regression
• The learning algorithm needs a defined error (cost) function that is to be minimized.
• The cost function is defined as the sum of squared differences between the real and the predicted values over the training data (the SSD criterion).
• If the model's prediction for $x^{(i)}$ (the $i$-th training sample) is $f(x^{(i)}, w)$, then:
Cost function $= \sum_{i=1}^{n} (y^{(i)} - f(x^{(i)}, w))^2$.
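A minimal sketch of this cost for a whole training set, assuming the bias-augmented convention used above (names are illustrative):

```python
import numpy as np

def ssd_cost(X, y, W):
    """SSD cost: sum over i of (y_i - f(x_i, W))^2 for a linear model."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # add a column of ones for w0
    residuals = y - X_aug @ W                         # y_i - f(x_i, W)
    return float(np.sum(residuals ** 2))
```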



Linear regression: univariate and multivariate examples
• Univariate: the input data include just one feature
Cost function $= \sum_{i=1}^{n} (y^{(i)} - (w_0 + w_1 x^{(i)}))^2$

• Multivariate: the input data are d-dimensional vectors

Cost function $= \sum_{i=1}^{n} (y^{(i)} - (w_0 + w_1 x_1^{(i)} + w_2 x_2^{(i)} + \dots + w_d x_d^{(i)}))^2$


Linear regression: univariate and multivariate examples
• In the linear case, the cost function is a quadratic function of the model parameters, and its surface in the (d+1)-dimensional space is convex. Therefore it has a global optimum.

[Figure: house price vs. house size with a fitted regression line]


Cost function: univariate example
Linear regression: $J(w_0, w_1)$ is a function of $w_0$ and $w_1$.


Cost function optimization: univariate
• $J(W) = \sum_{i=1}^{n} (y^{(i)} - (w_0 + w_1 x^{(i)}))^2$
• Necessary conditions for the “optimal” parameter values:

$\frac{\partial J}{\partial w_0} = \sum_{i=1}^{n} 2\,(y^{(i)} - (w_0 + w_1 x^{(i)}))\,(-1) = 0$

$\frac{\partial J}{\partial w_1} = \sum_{i=1}^{n} 2\,(y^{(i)} - (w_0 + w_1 x^{(i)}))\,(-x^{(i)}) = 0$

• A system of two linear equations in two unknown parameters
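Solving this two-equation system gives the standard closed form for univariate least squares (stated here for reference; it follows directly from the two conditions above):

$$w_1 = \frac{\sum_{i=1}^{n} (x^{(i)} - \bar{x})(y^{(i)} - \bar{y})}{\sum_{i=1}^{n} (x^{(i)} - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x},$$

where $\bar{x}$ and $\bar{y}$ are the sample means of the inputs and targets.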



Cost function optimization: multi-variate
• $J(W) = \sum_{i=1}^{n} (y^{(i)} - (w_0 + w_1 x_1^{(i)} + w_2 x_2^{(i)} + \dots + w_d x_d^{(i)}))^2 = \sum_{i=1}^{n} (y^{(i)} - W^T x^{(i)})^2$
• Cost function in matrix form: $J(W) = \| \mathbf{y} - XW \|^2$, where $X$ is the $n \times (d+1)$ design matrix whose $i$-th row is $[1, x_1^{(i)}, \dots, x_d^{(i)}]$ and $\mathbf{y} = [y^{(1)}, \dots, y^{(n)}]^T$.


Minimizing Cost Function
• Optimal weight vector:
$W^* = \arg\min_{W} J(W)$
$J(W) = \| \mathbf{y} - XW \|^2$
$\nabla_W J(W) = -2 X^T (\mathbf{y} - XW)$

$\nabla_W J(W) = 0 \;\Rightarrow\; (X^T X) W = X^T \mathbf{y}$
$W = (X^T X)^{-1} X^T \mathbf{y}$


Minimizing Cost Function
$W = (X^T X)^{-1} X^T \mathbf{y}$

$X^\dagger = (X^T X)^{-1} X^T$

$W = X^\dagger \mathbf{y}$

$X^\dagger$ is the pseudoinverse of $X$
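A minimal NumPy sketch of this closed-form fit (illustrative; np.linalg.pinv computes the pseudoinverse, which is numerically safer than forming $(X^T X)^{-1}$ explicitly):

```python
import numpy as np

def fit_linear_regression(X, y):
    """Closed-form least squares: W = pseudoinverse(X_aug) @ y."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # bias column of ones
    return np.linalg.pinv(X_aug) @ y                  # W = X^dagger y

# Usage on synthetic data generated from y = 3 + 2*x1 - x2 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3 + 2 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=50)
print(fit_linear_regression(X, y))  # approximately [3, 2, -1]
```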



Another approach for optimizing the sum of squared errors: gradient descent
• An iterative approach for solving the following optimization problem:

$J(W) = \sum_{i=1}^{n} (y^{(i)} - W^T x^{(i)})^2$

$W^{t+1} = W^t - \gamma_t \nabla J(W^t) \;\Rightarrow\; W^{t+1} = W^t + \eta \sum_{i=1}^{n} (y^{(i)} - (W^t)^T x^{(i)})\, x^{(i)}$ (a code sketch of this procedure follows after the steps below)

- Steps:
• Start from an initial $W^0$
• Repeat
  • Update $W^t$ to $W^{t+1}$ in order to reduce $J$
  • $t \leftarrow t + 1$
• until we hopefully end up at a minimum
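A minimal sketch of this loop (illustrative; the fixed step size, iteration count, and zero initialization are assumptions and may need tuning to the data scale):

```python
import numpy as np

def gradient_descent(X, y, eta=1e-3, n_iters=5000):
    """Batch gradient descent on J(W) = sum_i (y_i - W^T x_i)^2."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # bias column
    W = np.zeros(X_aug.shape[1])                      # start from W^0 = 0
    for _ in range(n_iters):
        residuals = y - X_aug @ W                     # y^i - W^T x^i
        grad = -2.0 * X_aug.T @ residuals             # gradient of J at W^t
        W = W - eta * grad                            # W^{t+1} = W^t - eta * grad
    return W
```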



Review: Gradient descent disadvantages
• Local minima problem

• However, when $J$ is convex, a local minimum is also the global minimum $\Rightarrow$ gradient descent can converge to the global solution.

• If $\eta$ is too small, gradient descent can be slow.
• If $\eta$ is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge.


Review: Problem of gradient descent with non-convex cost functions

[Figures: gradient descent on a non-convex cost surface]


$J(\boldsymbol{\theta}) = \sum_{i=1}^{n} (y^{(i)} - \boldsymbol{\theta}^T x^{(i)})^2$

$f(x; \boldsymbol{\theta}) = \theta_0 + \theta_1 x$ (a function of the two parameters $(\theta_0, \theta_1)$)

[Figure sequence: several slides repeat this cost function and univariate model, each originally accompanied by a plot]


Stochastic gradient descent
• Batch techniques process the entire training set in one go,
  • so they can be computationally costly for large data sets.

$J(W) = \sum_{i=1}^{n} (y^{(i)} - W^T x^{(i)})^2$

However, in the stochastic gradient version, instead of summing over all data points, only a few samples (or even a single one) are processed in each step of the iterative training procedure (see the sketch below):
$J(W) = (y^{(i)} - W^T x^{(i)})^2$, $\quad \nabla_W J(W) = 2\,(y^{(i)} - W^T x^{(i)})\,(-x^{(i)})$,
$W^{t+1} = W^t - \gamma_t \nabla J(W^t) = W^t + \eta_t\, x^{(i)} (y^{(i)} - (W^t)^T x^{(i)})$
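A minimal sketch of this update rule (illustrative; the constant step size and the number of passes over the data are assumptions):

```python
import numpy as np

def sgd(X, y, eta=0.01, n_epochs=50, seed=0):
    """Stochastic gradient descent: one randomly chosen sample per update."""
    rng = np.random.default_rng(seed)
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # bias column
    W = np.zeros(X_aug.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(X_aug.shape[0]):     # visit samples in random order
            error = y[i] - X_aug[i] @ W               # y^i - W^T x^i
            W = W + eta * error * X_aug[i]            # W^{t+1} = W^t + eta * x^i * error
    return W
```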



Stochastic gradient descent
• In this case, a randomly chosen training sample is processed at each step and the weight vector is updated. Over the repeated iterations, each sample may be processed several times.
• Sequential learning is also appropriate for real-time applications, where
  • data observations arrive in a continuous stream and predictions must be made before all of the data have been seen.


Non Linear Regression

• How can linear regression be extended to the non-linear case?

• Transform the training data into a new space using appropriate basis functions
• Learn a linear regression on the new feature vectors obtained from the basis functions
• Rewrite the model in terms of the original input features, using the relation between the new and the original features, to obtain the input-output relation


Non Linear Regression
• Example: $m$-th order polynomial regression ($f: \mathbb{R} \to \mathbb{R}$)

$y = f(x) = w_0 + w_1 x + w_2 x^2 + \dots + w_m x^m$
We define the new variables with the basis (transform) functions:
$z_i = \phi_i(x) = x^i$
Therefore:
$y = f(x) = w_0 + w_1 \phi_1(x) + \dots + w_m \phi_m(x)$
or: $y = g(z) = w_0 + w_1 z_1 + \dots + w_m z_m$, with $z = [z_1, z_2, \dots, z_m]^T$
We can find the linear relation between $y$ and $z$, and then, by replacing each $z_i$ with $\phi_i(x)$, obtain the non-linear relation between $y$ and $x$ expressed through the $\phi_i(x)$ functions.
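A minimal sketch of this example, fitting the polynomial by ordinary linear regression on the expanded features $z_i = x^i$ (illustrative helper names; np.vander builds the columns $[1, x, x^2, \dots, x^m]$):

```python
import numpy as np

def fit_polynomial(x, y, m):
    """Fit y = w0 + w1*x + ... + wm*x^m by least squares on z_i = x^i."""
    Z = np.vander(x, N=m + 1, increasing=True)        # columns [1, x, x^2, ..., x^m]
    W, *_ = np.linalg.lstsq(Z, y, rcond=None)         # linear regression in z-space
    return W

# Usage on synthetic data drawn from a cubic with a little noise
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 30)
y = 1 - 2 * x + 0.5 * x**3 + 0.05 * rng.normal(size=x.size)
print(fit_polynomial(x, y, m=3))  # approximately [1, -2, 0, 0.5]
```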



Non Linear Regression
• Example (continued)

• The $i$-th sample $(x^{(i)}, y^{(i)})$ is mapped into the new (transformed) space as:

$z^{(i)} = [z_1^{(i)}, \dots, z_m^{(i)}]^T = \phi(x^{(i)}) = [\phi_1(x^{(i)}), \dots, \phi_m(x^{(i)})]^T$

The solution in the new space, based on linear regression, is $W = (Z^T Z)^{-1} Z^T \mathbf{y}$, where:

$\mathbf{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}, \quad
Z = \begin{bmatrix}
1 & z_1^{(1)} & z_2^{(1)} & \cdots & z_m^{(1)} \\
1 & z_1^{(2)} & z_2^{(2)} & \cdots & z_m^{(2)} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & z_1^{(n)} & z_2^{(n)} & \cdots & z_m^{(n)}
\end{bmatrix}, \quad
W = \begin{bmatrix} w_0 \\ \vdots \\ w_m \end{bmatrix}$

Non Linear Regression
• If we replace $z^{(i)}$ by its equivalent $\phi(x^{(i)})$, the matrix $Z$ is replaced by the matrix $\Phi$:

$\Phi = \begin{bmatrix}
1 & \phi_1(x^{(1)}) & \phi_2(x^{(1)}) & \cdots & \phi_m(x^{(1)}) \\
1 & \phi_1(x^{(2)}) & \phi_2(x^{(2)}) & \cdots & \phi_m(x^{(2)}) \\
\vdots & \vdots & \vdots & & \vdots \\
1 & \phi_1(x^{(n)}) & \phi_2(x^{(n)}) & \cdots & \phi_m(x^{(n)})
\end{bmatrix}$

After finding the unknown parameters $W = (Z^T Z)^{-1} Z^T \mathbf{y} = (\Phi^T \Phi)^{-1} \Phi^T \mathbf{y}$, the final solution is:
$y = g(z) = w_0 + w_1 z_1 + \dots + w_m z_m$
$\;\; = f(x) = w_0 + w_1 \phi_1(x) + \dots + w_m \phi_m(x)$
$\;\; = w_0 + w_1 x + w_2 x^2 + \dots + w_m x^m$

Non Linear Regression
• In other examples, the same principles and equations are used. The only difference is in the transfer (basis) functions, which should be defined appropriately for the task.
• If the input samples belong to a d-dimensional space and are transferred to a new m-dimensional space,
$x = [x_1, x_2, \dots, x_d]^T \;\longrightarrow\; z = [z_1, z_2, \dots, z_m]^T$,
then a set of m transfer functions is needed to perform the mapping (a sketch follows below):
$z = \phi(x) \equiv \begin{cases} z_1 = \phi_1(x) \\ z_2 = \phi_2(x) \\ \;\;\vdots \\ z_m = \phi_m(x) \end{cases}$
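A minimal sketch of this general mapping (illustrative helper names; any list of m functions $\phi_j$ from a d-vector to a scalar can be plugged in):

```python
import numpy as np

def basis_design_matrix(X, basis_functions):
    """Rows are [1, phi_1(x), ..., phi_m(x)] for each input sample x."""
    n = X.shape[0]
    Phi = np.ones((n, len(basis_functions) + 1))      # first column of ones for w0
    for j, phi in enumerate(basis_functions, start=1):
        Phi[:, j] = [phi(x) for x in X]
    return Phi

def fit_basis_regression(X, y, basis_functions):
    """Linear regression in the transformed space: W = (Phi^T Phi)^{-1} Phi^T y."""
    Phi = basis_design_matrix(X, basis_functions)
    W, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return W
```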



Basis Functions
In fact, a weighted sum of appropriate basis functions can approximate many prototype functions. For this purpose, a sufficient number m of weighted basis functions should be used, and the weights and parameters of those m basis functions should be tuned.


Non Linear Regression: Examples

[Example figures]


Basis Function Examples
• Two frequently used transfer functions are introduced in this section.

• Gaussian: $z_j = \phi_j(x) = \exp\!\left( \frac{-1}{2\sigma_j^2} \| x - c_j \|^2 \right)$

• Gaussian bases are most appropriate for estimating target functions whose significant values are concentrated in local areas
• The parameter $c_j$ controls the center of the function, and the parameter $\sigma_j^2$ controls the size of the region the function affects
• The training data represent the target function
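A minimal sketch of a Gaussian basis function factory that can be combined with the basis-regression sketch above (the center and width values are illustrative):

```python
import numpy as np

def gaussian_basis(c, sigma):
    """Return phi_j(x) = exp(-||x - c||^2 / (2 * sigma^2))."""
    c = np.asarray(c, dtype=float)
    return lambda x: float(np.exp(-np.sum((np.asarray(x, dtype=float) - c) ** 2)
                                  / (2.0 * sigma ** 2)))

# Example: five Gaussian bases with centers spread over [0, 1] and a shared width
bases = [gaussian_basis([c], sigma=0.2) for c in np.linspace(0.0, 1.0, 5)]
print([round(phi([0.3]), 3) for phi in bases])  # each basis responds most near its center
```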



Basis Function Examples
• Sigmoid function:

$\phi_j(x) = \sigma\!\left( \frac{\| x - c_j \|}{\sigma_j} \right)$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$

This function grows from zero to one, so it can simulate some probabilistic functions such as cumulative distribution functions (CDFs). Around zero it is approximately linear, while far from the center (zero) it is strongly non-linear with a low gradient.
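A minimal sketch, using the one-dimensional form $\sigma((x - c_j)/\sigma_j)$, which ranges from 0 to 1 as described; the slides' expression uses the distance $\|x - c_j\|$ for vector inputs (names and values are illustrative):

```python
import numpy as np

def sigmoid_basis(c, s):
    """Return phi_j(x) = 1 / (1 + exp(-(x - c) / s)) for scalar inputs."""
    return lambda x: 1.0 / (1.0 + np.exp(-(x - c) / s))

# Example: a sigmoid basis centered at 0.5 with scale 0.1
phi = sigmoid_basis(c=0.5, s=0.1)
print(phi(0.0), phi(0.5), phi(1.0))  # near 0 at the left, 0.5 at the center, near 1 at the right
```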

