03 Regression



Random Signal Processing: Regression Models

May 1, 2022

Outline
– Definition
– Linear regression
– Multiple linear regression
– Regularization


Building relationships between interacting data

[Diagram: input ξ → black-box dependency → output η]

η = f(ξ)

Dependency f(·): a relationship (function) between two sets, X of the inputs ξ and Y of the outputs η, that interact with each other.
f(·) is a real-valued black-box function ruled by uncertainty.
Input: independent variables → predictors, X ∈ ℝ^{s×M}
Output: dependent variables → responses, Y ∈ ℝ^{r×M}
Function of dependency with statistical meaning:

f : X ∈ ℝ^{s×M} ↦ Y ∈ ℝ^{r×M}


Assumptions underlying the regression models

$$
X = \begin{bmatrix} x_{11} & \dots & x_{1M} \\ \dots & \dots & \dots \\ x_{s1} & \dots & x_{sM} \end{bmatrix},\qquad
Y = \begin{bmatrix} y_{11} & \dots & y_{1M} \\ \dots & \dots & \dots \\ y_{r1} & \dots & y_{rM} \end{bmatrix}
$$

s = 1: single model; s ≥ 2: multiple.
– Independent and identically distributed (i.i.d.) observations
– A constant variance of the residuals (homoscedasticity)

Issues to be solved for building a statistical dependency


– What class of uncertainty measures will be used to evaluate the relationship (if any)?
– Is there a relationship associating the whole set of involved variables, or is only a subset of them contributing? How can the contribution of each variable to the model be assessed?
– How can the effectiveness of the relationship model be evaluated? Does the built model supply an explainable association in determining the state of the object under consideration?


Experimental [Statistical] Dependency: Formulation


Let d(ỹ, y) be a metric in ℝ^q, so that there is a function f : ℝ^p ↦ ℝ^q for which the conditional mean reaches its minimum value, resulting in the minimizing solution, termed the regression of y on x, defined as follows:

$$
\tilde{\mathbf{y}} = \mathbb{E}\{\mathbf{y}=Y \mid \mathbf{x}=X\} \tag{1}
$$
$$
\text{s.t.: } d(\tilde{\mathbf{y}}, \mathbf{y}) - \epsilon = 0,\qquad \epsilon \in \mathbb{R}^{+}
$$

where x ∈ ℝ^p and y ∈ ℝ^q are the random vectors with the corresponding observation sets {x_n} ⊂ X and {y_n} ⊂ Y, and ỹ = f(x = X) is the assessment of the dependence.
The assumed relationship results in the regression model:

$$
\mathbf{y} = f(\mathbf{x}=X) + \varepsilon(\mathbf{x}=X) \tag{2}
$$
$$
\text{s.t.: } \mathbb{E}\{\varepsilon \mid \mathbf{x}=X\} = 0
$$

Function f is rarely known exactly. There are two approaches to include a pairwise relationship in the model of Eq. (2):
– an approximating function is assumed, relying on certain empirical evidence;
– f(·) is learned from an observation set using data-driven approaches.


Linear Regression Model


A linear function is assumed:
– better generalization ability;
– highly interpretable models.
Measured noisy values:

$$
\begin{aligned}
\mathbf{y} &= f(\mathbf{x};\,\boldsymbol{\theta}) + \sigma\varepsilon, && \sigma \in \mathbb{R}^{+},\ \ \boldsymbol{\theta} = [\theta_k : k \in K]\\
&= \theta_1\mathbf{x} + \theta_0 + \sigma\varepsilon, && \boldsymbol{\theta} = [\theta_0, \theta_1],\ \ K = 2
\end{aligned}
$$
y – dependent variable
x – independent variable
θ_1 x + θ_0 – linear component: θ_1 slope, θ_0 intercept
σε – random component (error)
Visual inspection for linearity and homoscedasticity. notebook: 03aVisualCheck
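A minimal sketch of the kind of visual check referred to above (the synthetic data and variable names are assumptions, not the contents of the 03aVisualCheck notebook): plot y against x to judge linearity, and plot the residuals against the fitted values to judge homoscedasticity.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)                    # hypothetical predictor
y = 1.5 * x + 2.0 + rng.normal(0, 1.0, 200)    # hypothetical linear target with noise

# Fit a simple line and compute the residuals
theta1, theta0 = np.polyfit(x, y, deg=1)
y_hat = theta1 * x + theta0
residuals = y - y_hat

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(x, y, s=10)
ax1.plot(np.sort(x), theta1 * np.sort(x) + theta0, "r")
ax1.set(xlabel="x", ylabel="y", title="Linearity check")

ax2.scatter(y_hat, residuals, s=10)
ax2.axhline(0, color="r")
ax2.set(xlabel="fitted values", ylabel="residuals", title="Homoscedasticity check")
plt.tight_layout()
plt.show()
```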


Parametric Estimation of 𝜃 - Moments method

Dependence without uncertainty:


$$
\theta_1 = \frac{y[t] - y[t']}{x[t] - x[t']},\qquad \theta_0 = y[t'] - \theta_1\,x[t']
$$

Estimation by statistical descriptors of pairwise observations:

$$
\tilde{\theta}_k = \mathbb{E}\{\theta_k(t - t')\}\ \rightarrow\ \text{a first assumption}
$$

notebook: 03bMomentsEst
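A minimal sketch of one reading of the formula above (the data and implementation details are assumptions, not taken from the 03bMomentsEst notebook): estimate θ₁ as the average of slopes computed over random pairs of observations, then recover θ₀ from the sample means.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 300)
y = 0.8 * x + 3.0 + rng.normal(0, 0.5, 300)    # hypothetical noisy line

# Average pairwise slopes: theta1 ≈ E{ (y[t] - y[t']) / (x[t] - x[t']) }
idx_t = rng.integers(0, len(x), 5000)
idx_tp = rng.integers(0, len(x), 5000)
valid = np.abs(x[idx_t] - x[idx_tp]) > 1e-6    # avoid division by near-zero gaps
slopes = (y[idx_t] - y[idx_tp])[valid] / (x[idx_t] - x[idx_tp])[valid]

theta1 = np.mean(slopes)
theta0 = np.mean(y) - theta1 * np.mean(x)      # intercept from the sample means
print(theta1, theta0)                          # roughly 0.8 and 3.0
```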


Gaussian Regression Models

$$
\mathbf{y} = f(\mathbf{x}=X) + \varepsilon(\mathbf{x}=X),\qquad \text{s.t.: } \mathbb{E}\{\varepsilon \mid \mathbf{x}=X\} = 0
$$
Assume: $\mathbb{E}\{\mathbf{y}=Y \mid \mathbf{x}=X\} \sim p(\mathbf{x}=X;\,\boldsymbol{\theta})$

$$
p(Y) \Rightarrow p(X;\,\tilde{\boldsymbol{\theta}}) + p(\varepsilon(X)),\qquad \text{s.t.: } \tilde{\boldsymbol{\theta}} = \min_{\boldsymbol{\theta}}\{\varepsilon(X)\}
$$

Assume: $p(Y) = \mathcal{N}_y\{m_{1y}, \sigma_y\}$, $p(X) = \mathcal{N}_x\{m_{1x}, \sigma_x\}$
Then, it can be shown that:

$$
p(Y, X) = \frac{1}{2\pi\sqrt{\sigma_y^2\sigma_x^2 - \sigma_{yx}^2}}
\exp\!\left\{-\frac{\sigma_x^2\,\bar{y}^2 - 2\sigma_{yx}\,\bar{y}\,\bar{x} + \sigma_y^2\,\bar{x}^2}{2\,(\sigma_y^2\sigma_x^2 - \sigma_{yx}^2)}\right\}
= \frac{1}{2\pi\sqrt{1-\rho^2}}
\exp\!\left\{-\frac{\bar{x}^2 - 2\rho\,\bar{x}\,\bar{y} + \bar{y}^2}{2\,(1-\rho^2)}\right\},
$$

where $\bar{\xi} = \xi - m_{1\xi}$ (additionally scaled by $\sigma_\xi$ in the second expression), $\rho = \sigma_{yx}/\sqrt{\sigma_y^2\sigma_x^2}$, $\rho \in [-1, 1]$.


Conditional Gaussian pdf


Since p(Y | X) = p(Y, X)/p(X), it can be proved that the conditional density function defined over jointly Gaussian variables is also Gaussian:

$$
p(Y \mid X) = \mathcal{N}_{Y|X}\!\left((m_{1X}, m_{1Y}),\;
\begin{bmatrix} \sigma_X^2 & \rho\,\sigma_X\sigma_Y \\ \rho\,\sigma_X\sigma_Y & \sigma_Y^2 \end{bmatrix}\right)
$$

Furthermore, if X and Y (with the respective parameters m_{1X}, σ_X², m_{1Y}, σ_Y²) are jointly Gaussian random variables, then it can be proved that:

$$
\mathbb{E}\{Y \mid X = x\} = m_{1Y} + \rho\,\sigma_Y\,\frac{x - m_{1X}}{\sigma_X},\qquad
\operatorname{var}(Y \mid X = x) = (1 - \rho^2)\,\sigma_Y^2.
$$

• If X and Y are Gaussian variables, 𝔼{Y | X = x} (the regression of Y on X!) depends linearly on the value x ∈ X. Therefore, the dependency f(·) is linear.
• The assessment of the regression between X and Y reduces to ρ.
• The conditional variance of Y does not depend on X.
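A minimal numerical check of these two identities (all numbers below are arbitrary assumptions for illustration): sample a bivariate Gaussian, condition on a narrow slice of X, and compare the empirical conditional mean and variance of Y with the closed-form expressions.

```python
import numpy as np

rng = np.random.default_rng(2)
m_X, m_Y, s_X, s_Y, rho = 1.0, -2.0, 2.0, 0.5, 0.7     # assumed parameters

cov = np.array([[s_X**2, rho * s_X * s_Y],
                [rho * s_X * s_Y, s_Y**2]])
X, Y = rng.multivariate_normal([m_X, m_Y], cov, size=500_000).T

x0 = 2.0                                   # condition on X ≈ x0
mask = np.abs(X - x0) < 0.05
emp_mean, emp_var = Y[mask].mean(), Y[mask].var()

th_mean = m_Y + rho * s_Y * (x0 - m_X) / s_X           # E{Y | X = x0}
th_var = (1 - rho**2) * s_Y**2                         # var(Y | X = x0)
print(emp_mean, th_mean)    # both ≈ -1.825
print(emp_var, th_var)      # both ≈ 0.1275
```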


Derivation of 𝔼{Y | X} (written here for 𝔼[X | Y = y]; the symmetric case follows by exchanging X and Y):

$$
\mathbb{E}[X \mid Y = y]
= \frac{\displaystyle\int_{-\infty}^{\infty} x\,\exp\!\left\{-\tfrac{1}{2(1-\rho^2)}\!\left[\tfrac{(x-\mu_X)^2}{\sigma_X^2} + \tfrac{(y-\mu_Y)^2}{\sigma_Y^2} - \tfrac{2\rho(x-\mu_X)(y-\mu_Y)}{\sigma_Y\sigma_X}\right]\right\}\mathrm{d}x}
{\displaystyle\int_{-\infty}^{\infty} \exp\!\left\{-\tfrac{1}{2(1-\rho^2)}\!\left[\tfrac{(x-\mu_X)^2}{\sigma_X^2} + \tfrac{(y-\mu_Y)^2}{\sigma_Y^2} - \tfrac{2\rho(x-\mu_X)(y-\mu_Y)}{\sigma_Y\sigma_X}\right]\right\}\mathrm{d}x}
$$

$$
= \frac{\displaystyle\int_{-\infty}^{\infty} x\,\exp\!\left\{-\tfrac{1}{2(1-\rho^2)}\!\left[\tfrac{(x-\mu_X)^2}{\sigma_X^2} - \tfrac{2\rho(x-\mu_X)(y-\mu_Y)}{\sigma_X\sigma_Y} + \tfrac{\rho^2(y-\mu_Y)^2}{\sigma_Y^2}\right]\right\}\mathrm{d}x}
{\displaystyle\int_{-\infty}^{\infty} \exp\!\left\{-\tfrac{1}{2(1-\rho^2)}\!\left[\tfrac{(x-\mu_X)^2}{\sigma_X^2} - \tfrac{2\rho(x-\mu_X)(y-\mu_Y)}{\sigma_X\sigma_Y} + \tfrac{\rho^2(y-\mu_Y)^2}{\sigma_Y^2}\right]\right\}\mathrm{d}x}
$$

(the remaining factor $\exp\{-(y-\mu_Y)^2/(2\sigma_Y^2)\}$ does not depend on $x$ and cancels between numerator and denominator). Completing the square,

$$
= \frac{\displaystyle\int_{-\infty}^{\infty} x\,\exp\!\left\{-\tfrac{1}{2(1-\rho^2)}\!\left[\tfrac{x-\mu_X}{\sigma_X} - \rho\,\tfrac{y-\mu_Y}{\sigma_Y}\right]^2\right\}\mathrm{d}x}
{\displaystyle\int_{-\infty}^{\infty} \exp\!\left\{-\tfrac{1}{2(1-\rho^2)}\!\left[\tfrac{x-\mu_X}{\sigma_X} - \rho\,\tfrac{y-\mu_Y}{\sigma_Y}\right]^2\right\}\mathrm{d}x}
$$

and, after the change of variable $x \to x + \mu_X + \sigma_X\rho\,\tfrac{y-\mu_Y}{\sigma_Y}$,

$$
= \frac{\displaystyle\int_{-\infty}^{\infty}\left(x + \mu_X + \sigma_X\rho\,\tfrac{y-\mu_Y}{\sigma_Y}\right)\exp\!\left\{-\tfrac{x^2}{2(1-\rho^2)\sigma_X^2}\right\}\mathrm{d}x}
{\displaystyle\int_{-\infty}^{\infty} \exp\!\left\{-\tfrac{x^2}{2(1-\rho^2)\sigma_X^2}\right\}\mathrm{d}x}
= \mu_X + \sigma_X\rho\,\frac{y-\mu_Y}{\sigma_Y},
$$

since it holds that $\int_{-\infty}^{\infty} x\,\exp\!\left\{-\tfrac{x^2}{2(1-\rho^2)\sigma_X^2}\right\}\mathrm{d}x = 0$.


MLE Estimation of θ

$$
p(x) = \mathcal{N}_x(\mu_x, \sigma_x^2)\ \Rightarrow\ p(y) = p(x)\,\bigl|\mathrm{d}x/\mathrm{d}y\bigr| = \mathcal{N}_x(\mu_x, \sigma_x^2)/|\theta_1|,
$$
where $m_{1y} = \theta_1 m_{1x} + \theta_0$, $\sigma_y = |\theta_1|\,\sigma_x$
$$
\Rightarrow\ p(y) = \mathcal{N}_y(\mu_y, \sigma_y^2) = \mathcal{N}_y(\boldsymbol{\theta}^\top\mathbf{x},\, \sigma_y^2)
$$

$$
\begin{aligned}
\hat{\boldsymbol{\theta}}_{MLE} &= \arg\max_{\boldsymbol{\theta}}\ \mathcal{N}_y(\hat{\mathbf{y}} \mid \boldsymbol{\theta}^\top\mathbf{x},\, \sigma_y^2)
\ \sim\ \arg\max_{\boldsymbol{\theta}}\ \ln \mathcal{N}_y(\hat{\mathbf{y}} \mid \boldsymbol{\theta}^\top\mathbf{x},\, \sigma_y^2)\\
&= \arg\max_{\boldsymbol{\theta}}\ \left\{-\tfrac{1}{2}\,\|\boldsymbol{\theta}^\top\mathbf{x} - \hat{\mathbf{y}}\|_2^2\right\}
= \arg\min_{\boldsymbol{\theta}}\ \tfrac{1}{2}\,\|\boldsymbol{\theta}^\top\mathbf{x} - \hat{\mathbf{y}}\|_2^2,
\end{aligned}
$$

Hence, the MLE parameters match those of the OLS estimate.


What if the random variables are not Gaussian?
$$
f(\boldsymbol{\Theta}, X) = \sum_{i=0,\dots} \theta_i X_i,\qquad X_0 = 1
$$
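A minimal numerical illustration of the MLE/OLS equivalence stated above (the data and settings are assumptions made for the sketch): maximizing the Gaussian log-likelihood over θ with a generic optimizer returns the same coefficients as the least-squares solver.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(200), rng.uniform(0, 5, 200)])  # design matrix [1, x]
theta_true = np.array([2.0, -1.3])
y = X @ theta_true + rng.normal(0, 0.4, 200)

# Negative Gaussian log-likelihood (up to constants, with a fixed sigma)
def neg_log_lik(theta, sigma=0.4):
    r = y - X @ theta
    return 0.5 * np.sum(r**2) / sigma**2

theta_mle = minimize(neg_log_lik, x0=np.zeros(2)).x
theta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_mle, theta_ols)    # essentially identical
```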


Nonparametric Estimation of θ
Ordinary Least Squares (OLS) estimator: assume a linear statistical model:

$$
\mathbf{y} = X\boldsymbol{\theta} + \boldsymbol{\varepsilon},\qquad \text{s.t.: } \mathbb{E}\{\boldsymbol{\varepsilon} \mid \mathbf{x}=X\} \to 0
$$
$$
\tilde{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}}\ \mathbb{E}\,\|\mathbf{y} - X\boldsymbol{\theta}\|_2^2\qquad (\ell_2\text{-norm})
$$

Solving for ε̃² yields:

$$
\begin{aligned}
\boldsymbol{\varepsilon}^\top\boldsymbol{\varepsilon} &= (\mathbf{y} - X\hat{\boldsymbol{\theta}})^\top(\mathbf{y} - X\hat{\boldsymbol{\theta}})\\
&= \mathbf{y}^\top\mathbf{y} - \hat{\boldsymbol{\theta}}^\top X^\top\mathbf{y} - \mathbf{y}^\top X\hat{\boldsymbol{\theta}} + \hat{\boldsymbol{\theta}}^\top X^\top X\hat{\boldsymbol{\theta}}\\
&= \mathbf{y}^\top\mathbf{y} - 2\,\hat{\boldsymbol{\theta}}^\top X^\top\mathbf{y} + \hat{\boldsymbol{\theta}}^\top X^\top X\hat{\boldsymbol{\theta}}
\end{aligned}
$$

Taking the partial derivative of ε̃² and setting it equal to 0:

$$
\frac{\partial\,\boldsymbol{\varepsilon}^\top\boldsymbol{\varepsilon}}{\partial\hat{\boldsymbol{\theta}}} = -2X^\top\mathbf{y} + 2X^\top X\hat{\boldsymbol{\theta}} = 0
\ \Rightarrow\ X^\top X\hat{\boldsymbol{\theta}} = X^\top\mathbf{y}
$$

The OLS estimator is unbiased and has minimum variance among linear unbiased estimators (Gauss-Markov theorem).
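A minimal sketch of solving the normal equations above (the data are an assumption); in practice, np.linalg.lstsq or solve is preferred over explicitly inverting X⊤X.

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])  # intercept + 2 predictors
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0, 0.1, 100)

# Normal equations: X^T X theta = X^T y
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferable equivalent
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_normal, theta_lstsq)   # both ≈ [1.0, 2.0, -0.5]
```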


Residual Sum of Squares


$$
\tilde{\theta}_1 = \frac{M\sum(xy) - \bigl(\sum x\bigr)\bigl(\sum y\bigr)}{M\sum x^2 - \bigl(\sum x\bigr)^2},\qquad
\tilde{\theta}_0 = \frac{\bigl(\sum y\bigr)\bigl(\sum x^2\bigr) - \bigl(\sum x\bigr)\bigl(\sum xy\bigr)}{M\sum x^2 - \bigl(\sum x\bigr)^2},\qquad M\ \text{is the sample size.}
$$

Issue: least-squares algorithms are sensitive to outliers. Robust estimation approaches can be used, as sketched below. notebook: 03gRobustLSE
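A minimal comparison of an ordinary and a robust fit on outlier-contaminated data (the data and the choice of HuberRegressor are assumptions made for illustration, not necessarily what the 03gRobustLSE notebook uses):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 100).reshape(-1, 1)
y = 2.0 * x.ravel() + 1.0 + rng.normal(0, 0.5, 100)
y[:5] += 40.0                                    # inject a few gross outliers

ols = LinearRegression().fit(x, y)
huber = HuberRegressor().fit(x, y)
print(ols.coef_, ols.intercept_)                 # pulled away by the outliers
print(huber.coef_, huber.intercept_)             # close to slope 2, intercept 1
```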


Multiple Linear Regression

$$
X\boldsymbol{\theta} = \mathbf{y}:\qquad
X = \begin{bmatrix} x_{11} & \dots & x_{1M} \\ \dots & \dots & \dots \\ x_{M'1} & \dots & x_{M'M} \end{bmatrix},\quad
\mathbf{y} = \begin{bmatrix} y_1 & \dots & y_M \end{bmatrix},\quad
\boldsymbol{\theta} = \begin{bmatrix} \theta_1 & \dots & \theta_M \end{bmatrix}
$$

Assumptions: the errors ε_m have nothing to do with one another, the rows x_m are measured without error, and all the error occurs in the vertical direction.
X^⊤X is singular if M' = M; does X^⊤X become non-singular if M' ≠ M?
For the first case, let us assume s < M.

$$
X\boldsymbol{\theta} = \mathbf{y}
$$
Pre-multiplying by X^⊤ and then by (X^⊤X)^{-1}: (X^⊤X)^{-1}(X^⊤X)θ̂ = (X^⊤X)^{-1}X^⊤y.
By the SVD decomposition X = UΣV^⊤:
$$
U\Sigma V^\top\hat{\boldsymbol{\theta}} = \mathbf{y}\ \Rightarrow\
V\Sigma^{-1}U^\top U\Sigma V^\top\hat{\boldsymbol{\theta}} = V\Sigma^{-1}U^\top\mathbf{y}\ \Rightarrow\
\hat{\boldsymbol{\theta}} = V\Sigma^{-1}U^\top\mathbf{y}
$$
$$
\hat{\boldsymbol{\theta}} \sim X^{\dagger}\mathbf{y},\qquad X^{\dagger}\ \text{the Moore-Penrose pseudoinverse}
$$
notebook: 03ePenrose, notebook: 03fMultiReg
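A minimal sketch of the pseudoinverse solution (random data assumed): np.linalg.pinv computes X† via the SVD, and the result matches the least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 4))                  # 50 observations, 4 predictors
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(0, 0.1, 50)

theta_pinv = np.linalg.pinv(X) @ y            # Moore-Penrose pseudoinverse (via SVD)
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta_pinv, theta_lstsq))   # True
```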


Shrunken Ordinary Least Squares


– Ill-conditioned X: because the LS estimates depend upon (X^⊤X)^{-1}, there are problems in computing θ if X^⊤X is singular or nearly singular. In those cases, small changes to the elements of X can lead to large variations in (X^⊤X)^{-1}.
– Too many predictors: it is not unusual for the number of input variables to greatly exceed the number of observations, and fitting the full model without input selection (penalization) results in large prediction intervals, so the LS regression estimator may not uniquely exist.
Assume X ∈ ℝ^{N×P} and y ∈ ℝ^N are centered, so that there is no need for a constant term in the regression. Then,

$$
\hat{\boldsymbol{\theta}} = (X^\top X)^{-1}X^\top\mathbf{y} \sim (X^\top X + \lambda I_P)^{-1}X^\top\mathbf{y},\qquad \lambda \in \mathbb{R}^{+}
$$
$$
\min_{\forall\theta_p}\ \sum_{n\in N}\Bigl(y_n - \sum_{p\in P} x_{np}\theta_p\Bigr)^2 + \lambda\sum_{p\in P}\theta_p^2
$$
$$
\min_{\forall\theta_p}\ \sum_{n\in N}\Bigl(y_n - \sum_{p\in P} x_{np}\theta_p\Bigr)^2,\qquad \text{s.t.: } \sum_{p\in P}\theta_p^2 < \varepsilon \in \mathbb{R}^{+}
$$
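A minimal sketch of the shrinkage estimator above (synthetic, nearly collinear data assumed): the λ-regularized closed form stays stable where the plain normal equations are ill-conditioned, and it agrees with scikit-learn's Ridge.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(7)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + 1e-3 * rng.normal(size=200)])   # nearly collinear columns
y = X @ np.array([1.0, 1.0]) + rng.normal(0, 0.1, 200)

lam = 1.0
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
theta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(theta_ridge, theta_sklearn)    # closed form matches sklearn's Ridge
```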


136 Ridge Regression


Therefore, the ridge [shrinkage] coefficients minimize a penalized residual sum of squares. The regularization [complexity] parameter λ controls the amount of shrinkage: the larger the value of λ, the greater the amount of shrinkage, and thus the coefficients become more robust to collinearity.
How is an optimal value of λ found? A cross-validation strategy is used: λ_opt is the value that yields the smallest cross-validation prediction error (a sketch follows at the end of this slide).
Properties of the ridge estimator:
– θ̂_ridge is a biased estimator of θ, but it has a low variance.
– It is better than the least-squares method when there are too many θ parameters.
– It offers a solution in high-dimensional settings, when the number of variables is greater than the number of observations.
– It addresses the issue of high correlation (collinearity) between the independent variables.
notebook: 03gRidge, notebook: 03hRegularizationFull
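A minimal sketch of selecting λ by cross-validation (the candidate grid and data are assumptions, not the notebooks' contents):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(8)
X = rng.normal(size=(100, 20))                        # many predictors, few observations
theta = np.zeros(20)
theta[:3] = [2.0, -1.0, 0.5]                          # only a few informative coefficients
y = X @ theta + rng.normal(0, 1.0, 100)

lambdas = np.logspace(-3, 3, 25)                      # candidate regularization values
model = RidgeCV(alphas=lambdas, cv=5).fit(X, y)
print(model.alpha_)                                   # lambda_opt chosen by 5-fold CV
```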


Lab: Multiple Linear Regression Models


1. Compute the heat matrix after transforming the input data by the following rules: a) x̃ = ln(x); b) z = (x − mean(x))/var(x); c)
2. For either prediction matrix X, X̃, calculate the variations in the parameter set after perturbing the target y by N_η(0, σ), with σ = [0.1, Δσ = 0.1, 1].
3. Compare different model estimators in terms of the prediction error (rmse) achieved, fixing σ = [0.1, Δσ = 0.1, 1]. notebook: 03fMultiReg
4. Compute 1) and 2). Show the variations in the regularization parameter λ for the Ridge, Lasso, and elastic net algorithms.
5. Compare the performance of the multiple regression estimators.


Evaluation of linear regression models


Estimation of the loss between predicted and actual values:

$$
\begin{aligned}
\text{Coefficient of determination } (R^2) &= 1 - \frac{D[y \mid x]}{D[y]} = 1 - \frac{\hat{\sigma}^2}{\hat{\sigma}_y^2} = 1 - \frac{SS_{res}/n}{SS_{tot}/n}
= 1 - \frac{\sum_{i=1}^{n} e_i^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}\\
&= 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n\,\hat{\sigma}_y^2},\qquad e_i = y_i - \hat{y}_i,\ \ \bar{y} = \operatorname{mean}(y)
\end{aligned}
$$

Mean Squared Error (MSE) = 𝔼{‖y − ỹ‖²}
Mean Absolute Error (MAE) = 𝔼{‖y − ỹ‖₁}

The R-squared value is the percentage of variation in the dependent variable that is explained by the independent predictors; the higher the R-squared value, the better the model.
Disadvantages. i) MAE and MSE measures can be highly affected by outliers. ii) The value of R² increases if more variables are added to the model, irrespective of whether the variables contribute to the model or not; this issue is a disadvantage of using R². notebook: 3iEvalRegression
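A minimal sketch of computing these three scores with scikit-learn (toy arrays assumed, not the notebook's data):

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

print(r2_score(y_true, y_pred))              # coefficient of determination
print(mean_squared_error(y_true, y_pred))    # MSE
print(mean_absolute_error(y_true, y_pred))   # MAE
```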


Nonlinear Regression


• Linear approximation:
  Issue: the learnt model will not be able to handle a non-linear relationship between data and target, since linear models assume the relationship between data and target to be linear.
• Intrinsically linear models: using a proper transformation, they can be converted into linear regression models, for instance using a logarithmic mapping (a sketch follows after this list):

$$
y = b\exp(ax)\,u(x):\quad \ln y = \ln b + a x\ \sim\ y' = a x + b';\quad x > 0
$$
$$
y = \sum_{\forall n} a_n x^n:\quad \ln(a_n x^n) = \ln(a_n) + n\ln(x),
\ \sim\ \sum_{\forall n\in N} a_n + \operatorname{antilog}(n)\ln(x),\ N \le 3
\ \sim\ 148\ln(x) + \sum_{\forall n\in N} a_n + \operatorname{abs}(\min(y)),\ N = 3,
$$
  Issue: properly fixing the scales of the projection is challenging. How can one determine whether a model meets the imposed assumptions?
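A minimal sketch of the intrinsically linear case above (the parameters a, b and the data are assumptions): fit ln(y) against x with a straight line and recover the original coefficients.

```python
import numpy as np

rng = np.random.default_rng(9)
a, b = 0.5, 2.0                                  # assumed true parameters
x = np.linspace(0.1, 5, 100)
y = b * np.exp(a * x) * np.exp(rng.normal(0, 0.05, 100))   # multiplicative noise

# ln y = ln b + a x is linear in x
slope, intercept = np.polyfit(x, np.log(y), deg=1)
a_hat, b_hat = slope, np.exp(intercept)
print(a_hat, b_hat)                              # ≈ 0.5 and ≈ 2.0
```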


• Piecewise linearization: a model that can natively deal with non-linearity. DecisionTreeRegressor, KBinsDiscretizerRegressor
• Polynomial feature expansion: a richer set of features is built by including expert knowledge, which can be used directly by a simple linear model. PolynomialFeatures
• Locally-based decision functions: a kernel is used to obtain a locally-based decision function instead of a global linear one. Nystroem, SVR, Gaussian
notebook: 03jNonLinearRegression

• Fitting previously given nonlinear models directly: the Michaelis-Menten model ( y = ax/(1 + bx) ), humped curves, S-shaped functions, and logistic regression ( y = 1/(1 + exp(−Σ_{∀m} θ_m x_m)) ) to model categorical responses with two (binary) or several (multinomial) outcomes.
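A minimal sketch of fitting the Michaelis-Menten form directly (the true values of a and b are assumptions made for the illustration):

```python
import numpy as np
from scipy.optimize import curve_fit

def michaelis_menten(x, a, b):
    return a * x / (1.0 + b * x)

rng = np.random.default_rng(10)
x = np.linspace(0.1, 10, 80)
y = michaelis_menten(x, 4.0, 0.6) + rng.normal(0, 0.05, 80)   # assumed a = 4.0, b = 0.6

(a_hat, b_hat), _ = curve_fit(michaelis_menten, x, y, p0=[1.0, 1.0])
print(a_hat, b_hat)                                           # ≈ 4.0 and ≈ 0.6
```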
