03 Regression
Random Signal Processing
May 1, 2022
Outline
– Definition
– Linear Regression
– Multilinear Regression
– Regularization
Dependency: η = f(ξ)

f(·): relationship (function) between two interacting sets of observations, X of ξ and Y of η.
f(·) is a real-valued black-box function ruled by uncertainty.
Input: independent variables (predictors), X ∈ ℝ^{s×M}.
Output: dependent variables (responses), Y ∈ ℝ^{r×M}.
Function of dependency with statistical meaning:

$$f : X \in \mathbb{R}^{s\times M} \mapsto Y \in \mathbb{R}^{r\times M}$$
$$X = \begin{bmatrix} x_{11} & \dots & x_{1M}\\ \vdots & \ddots & \vdots\\ x_{s1} & \dots & x_{sM} \end{bmatrix}, \qquad Y = \begin{bmatrix} y_{11} & \dots & y_{1M}\\ \vdots & \ddots & \vdots\\ y_{r1} & \dots & y_{rM} \end{bmatrix}$$

s = 1: single model; s ≥ 2: multiple model.
– Independent and identically distributed (i.i.d.) observations.
– Constant variance of the residuals (homoscedasticity).
The estimate of the dependence is the conditional expectation:

$$\tilde{\mathbf{y}} = \mathbb{E}\{\mathbf{y} = Y \mid \mathbf{x} = X\} \quad (1)$$
$$\text{s.t.: } d(\tilde{\mathbf{y}}, \mathbf{y}) - \epsilon = 0, \; \epsilon \in \mathbb{R}^{+},$$

where x ∈ ℝ^p and y ∈ ℝ^q are the random vectors with the corresponding observation sets {x_n} ⊂ X and {y_n} ⊂ Y, and ỹ = f(x = X) is the assessment of the dependence.
The assumed relationship results in the regression model:

$$\mathbf{y} = f(\mathbf{x} = X) + \varepsilon(\mathbf{x} = X), \quad \text{s.t.: } \mathbb{E}\{\varepsilon \mid \mathbf{x} = X\} = 0 \quad (2)$$

Function f is rarely known exactly. There are two approaches to include a pairwise relationship in the model of Eq. (2):
– An approximating function is assumed, relying on certain empirical evidence, or
– f(·) is learned from an observation set using data-driven approaches.
The linear regression model:

$$\mathbf{y} = f(\mathbf{x}; \theta) + \sigma\varepsilon, \quad \sigma \in \mathbb{R}^{+}, \; \theta = [\theta_k : k \in K]$$
$$\phantom{\mathbf{y}} = \theta_1\mathbf{x} + \theta_0 + \sigma\varepsilon, \quad \theta = [\theta_0, \theta_1], \; K = 2$$

y - dependent variable
x - independent variable
θ₁x + θ₀ - linear component: θ₁ slope, θ₀ intercept
σε - random component (error)
Visual inspection for linearity and homoscedasticity, as sketched below. notebook: 03aVisualCheck
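The notebook itself is not reproduced here; the following is a minimal sketch of such a visual check, where the synthetic data and all variable names are illustrative assumptions:

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: linear trend with constant-variance (homoscedastic) noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.size)

# Fit theta1*x + theta0 and compute the residuals.
theta1, theta0 = np.polyfit(x, y, deg=1)
residuals = y - (theta1 * x + theta0)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(x, y, s=10)  # linearity: points should follow the fitted line
ax1.plot(np.sort(x), theta1 * np.sort(x) + theta0, color="red")
ax1.set(title="Linearity check", xlabel="x", ylabel="y")
ax2.scatter(x, residuals, s=10)  # homoscedasticity: spread constant across x
ax2.axhline(0.0, color="red")
ax2.set(title="Residuals vs. x", xlabel="x", ylabel="residual")
plt.show()
```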
Recall the regression model and add distributional assumptions:

y = f(x = X) + ε(x = X), s.t.: 𝔼{ε | x = X} = 0
Assume: 𝔼{y = Y | x = X} ∼ p(x = X; θ), so that

$$p(Y) \Rightarrow p(X; \tilde{\theta}) + p(\varepsilon(X)), \quad \text{s.t.: } \tilde{\theta} = \arg\min_{\theta}\{\varepsilon(X)\}$$

Assume: p(Y) = 𝒩_y{m_{1y}, σ_y}, p(X) = 𝒩_x{m_{1x}, σ_x}. Then, it can be shown that:

$$p(Y, X) = \frac{1}{2\pi\sqrt{\sigma_y^2\sigma_x^2 - \sigma_{yx}^2}}\,\exp\left\{-\frac{\sigma_x^2\,\bar{y}^2 - 2\sigma_{yx}\,\bar{y}\,\bar{x} + \sigma_y^2\,\bar{x}^2}{2(\sigma_y^2\sigma_x^2 - \sigma_{yx}^2)}\right\}$$
$$= \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}}\,\exp\left\{-\frac{1}{2(1-\rho^2)}\left[\frac{\bar{x}^2}{\sigma_x^2} - 2\rho\,\frac{\bar{x}\bar{y}}{\sigma_x\sigma_y} + \frac{\bar{y}^2}{\sigma_y^2}\right]\right\},$$

where $\bar{\xi} = \xi - m_{1\xi}$, $\rho = \sigma_{yx}\big/\sqrt{\sigma_y^2\sigma_x^2}$, $\rho \in [-1, 1]$.
Furthermore, if X and Y (with the respective parameters m_{1X}, σ_X², m_{1Y}, σ_Y²) are jointly Gaussian random variables,

$$p(Y, X) = \mathcal{N}\!\left(\begin{bmatrix} m_{1X}\\ m_{1Y}\end{bmatrix}, \begin{bmatrix}\sigma_X^2 & \rho\sigma_X\sigma_Y\\ \rho\sigma_X\sigma_Y & \sigma_Y^2\end{bmatrix}\right),$$

then it can be proved that:

$$\mathbb{E}\{Y \mid X = x\} = m_{1Y} + \rho\sigma_Y\,\frac{x - m_{1X}}{\sigma_X}, \qquad \operatorname{var}(Y \mid X = x) = (1-\rho^2)\,\sigma_Y^2.$$
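A quick Monte Carlo check of these two formulas; the parameter values and the conditioning point are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
m_x, m_y, s_x, s_y, rho = 1.0, -2.0, 2.0, 0.5, 0.8

# Draw jointly Gaussian (X, Y) with the covariance matrix above.
cov = [[s_x**2, rho * s_x * s_y], [rho * s_x * s_y, s_y**2]]
xy = rng.multivariate_normal([m_x, m_y], cov, size=500_000)

# Condition on X near x0 and compare against the closed-form expressions.
x0 = 2.5
sel = np.abs(xy[:, 0] - x0) < 0.05
print(xy[sel, 1].mean(), m_y + rho * s_y * (x0 - m_x) / s_x)  # E{Y | X = x0}
print(xy[sel, 1].var(), (1 - rho**2) * s_y**2)                # var(Y | X = x0)
```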
Derivation of 𝔼{X | Y} (the expression for 𝔼{Y | X} follows by symmetry):

$$\mathbb{E}[X \mid Y = y] = \frac{\displaystyle\int_{-\infty}^{\infty} x \exp\left\{-\tfrac{1}{2(1-\rho^2)}\left[\tfrac{(x-\mu_X)^2}{\sigma_X^2} + \tfrac{(y-\mu_Y)^2}{\sigma_Y^2} - \tfrac{2\rho(x-\mu_X)(y-\mu_Y)}{\sigma_Y\sigma_X}\right]\right\}\mathrm{d}x}{\displaystyle\int_{-\infty}^{\infty} \exp\left\{-\tfrac{1}{2(1-\rho^2)}\left[\tfrac{(x-\mu_X)^2}{\sigma_X^2} + \tfrac{(y-\mu_Y)^2}{\sigma_Y^2} - \tfrac{2\rho(x-\mu_X)(y-\mu_Y)}{\sigma_Y\sigma_X}\right]\right\}\mathrm{d}x}$$

Factoring out $\exp\{-(y-\mu_Y)^2/(2\sigma_Y^2)\}$, which is constant in $x$ and cancels between numerator and denominator, and then completing the square:

$$= \frac{\displaystyle\int_{-\infty}^{\infty} x \exp\left\{-\tfrac{1}{2(1-\rho^2)}\left[\tfrac{(x-\mu_X)^2}{\sigma_X^2} - \tfrac{2\rho(x-\mu_X)(y-\mu_Y)}{\sigma_X\sigma_Y} + \tfrac{\rho^2(y-\mu_Y)^2}{\sigma_Y^2}\right]\right\}\mathrm{d}x}{\displaystyle\int_{-\infty}^{\infty} \exp\left\{-\tfrac{1}{2(1-\rho^2)}\left[\tfrac{(x-\mu_X)^2}{\sigma_X^2} - \tfrac{2\rho(x-\mu_X)(y-\mu_Y)}{\sigma_X\sigma_Y} + \tfrac{\rho^2(y-\mu_Y)^2}{\sigma_Y^2}\right]\right\}\mathrm{d}x}$$
$$= \frac{\displaystyle\int_{-\infty}^{\infty} x \exp\left\{-\tfrac{1}{2(1-\rho^2)}\left[\tfrac{x-\mu_X}{\sigma_X} - \rho\,\tfrac{y-\mu_Y}{\sigma_Y}\right]^2\right\}\mathrm{d}x}{\displaystyle\int_{-\infty}^{\infty} \exp\left\{-\tfrac{1}{2(1-\rho^2)}\left[\tfrac{x-\mu_X}{\sigma_X} - \rho\,\tfrac{y-\mu_Y}{\sigma_Y}\right]^2\right\}\mathrm{d}x}$$

Substituting $\tilde{x} = x - \mu_X - \sigma_X\rho\,\tfrac{y-\mu_Y}{\sigma_Y}$:

$$= \frac{\displaystyle\int_{-\infty}^{\infty}\left(\tilde{x} + \mu_X + \sigma_X\rho\,\tfrac{y-\mu_Y}{\sigma_Y}\right)\exp\left\{-\tfrac{\tilde{x}^2}{2(1-\rho^2)\sigma_X^2}\right\}\mathrm{d}\tilde{x}}{\displaystyle\int_{-\infty}^{\infty}\exp\left\{-\tfrac{\tilde{x}^2}{2(1-\rho^2)\sigma_X^2}\right\}\mathrm{d}\tilde{x}} = \mu_X + \sigma_X\rho\,\frac{y-\mu_Y}{\sigma_Y},$$

since it holds that $\int_{-\infty}^{\infty}\tilde{x}\exp\left\{-\tfrac{\tilde{x}^2}{2(1-\rho^2)\sigma_X^2}\right\}\mathrm{d}\tilde{x} = 0$ (odd integrand).
MLE Estimation of θ. Under the Gaussian noise assumption above, maximizing the likelihood of the observations with respect to θ is equivalent to minimizing the sum of squared residuals.
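The slide's derivation is not recoverable from the source; as a hedged sketch, assume the simple model y = θ₁x + θ₀ + σε with Gaussian ε and maximize the likelihood numerically, which recovers the OLS fit:

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative synthetic data (true slope 3, intercept -1).
rng = np.random.default_rng(2)
x = rng.uniform(0, 5, 100)
y = 3.0 * x - 1.0 + rng.normal(scale=0.5, size=x.size)

# Negative log-likelihood of y_n ~ N(theta1*x_n + theta0, sigma^2),
# with sigma parameterized in log-space to keep it positive.
def nll(params):
    theta0, theta1, log_sigma = params
    sigma = np.exp(log_sigma)
    resid = y - (theta1 * x + theta0)
    return 0.5 * np.sum(resid**2) / sigma**2 + x.size * np.log(sigma)

theta_mle = minimize(nll, x0=[0.0, 1.0, 0.0]).x
theta_ols = np.polyfit(x, y, deg=1)          # returns [slope, intercept]
print("MLE:", theta_mle[1], theta_mle[0])    # slope, intercept
print("OLS:", theta_ols[0], theta_ols[1])    # slope, intercept (should match)
```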
Nonparametric Estimation of θ
Ordinary Least Squares (OLS) Estimator. Assume a linear statistical model:

$$\mathbf{y} = X\theta + \varepsilon, \quad \text{s.t.: } \mathbb{E}\{\varepsilon \mid \mathbf{x} = X\} = 0$$
$$\tilde{\theta} = \arg\min_{\theta}\, \mathbb{E}\{\|\mathbf{y} - X\theta\|^{2}\} \quad (\ell_2\text{-norm})$$
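A minimal numeric sketch of the OLS estimator through the normal equations; the design matrix and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(50), rng.uniform(0, 5, 50)])  # columns: [1, x]
theta_true = np.array([-1.0, 3.0])                         # intercept, slope
y = X @ theta_true + rng.normal(scale=0.5, size=50)

# theta = (X^T X)^{-1} X^T y, solved without forming the explicit inverse.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)  # close to [-1, 3]
```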
Issue: least-squares algorithms are sensitive to outliers. Robust estimation approaches can be used instead, as sketched below. notebook: 03gRobustLSE
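The estimators used in notebook 03gRobustLSE are not shown in the source; one common robust alternative, sketched here on illustrative data, is scikit-learn's HuberRegressor, which down-weights large residuals:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=100)
y[:5] += 40.0  # inject a few gross outliers

X = x.reshape(-1, 1)
print(LinearRegression().fit(X, y).coef_)  # slope dragged away from 2
print(HuberRegressor().fit(X, y).coef_)    # slope stays close to 2
```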
$$X\theta = \mathbf{y}: \quad X = \begin{bmatrix} x_{11} & \dots & x_{1M}\\ \vdots & \ddots & \vdots\\ x_{M'1} & \dots & x_{M'M} \end{bmatrix}, \; \mathbf{y} = [y_1 \dots y_{M'}]^{\top}, \; \theta = [\theta_1 \dots \theta_M]^{\top}$$

Assumptions: the errors ε_m are mutually independent, the rows x_m are measured without error, and all the error occurs in the vertical direction.
X⊤X is non-singular only when X has full column rank, which requires M' ≥ M; with fewer observations than parameters (M' < M), it is necessarily singular.
First case, assume M' ≥ M with full column rank:
Xθ = y
Pre-multiply by X⊤ and solve the normal equations: θ̂ = (X⊤X)⁻¹X⊤y.
By SVD decomposition, X = UΣV⊤, so UΣV⊤θ = y; then
VΣ⁻¹U⊤UΣV⊤θ̂ = VΣ⁻¹U⊤y,
θ̂ = VΣ⁻¹U⊤y

$$\hat{\theta} = X^{\dagger}\mathbf{y}, \quad X^{\dagger} \text{ the Moore–Penrose pseudoinverse (a sketch follows)}$$

notebooks: 03ePenrose, 03fMultiReg
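A sketch of the SVD route to θ̂ = X†y on illustrative data, cross-checked against NumPy's built-in pseudoinverse:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 4))  # M' = 30 observations, M = 4 parameters
theta_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ theta_true + rng.normal(scale=0.1, size=30)

# Thin SVD: X = U Sigma V^T, hence theta = V Sigma^{-1} U^T y.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
theta_svd = Vt.T @ np.diag(1.0 / s) @ U.T @ y

theta_pinv = np.linalg.pinv(X) @ y          # Moore-Penrose pseudoinverse
print(np.allclose(theta_svd, theta_pinv))   # True
print(theta_svd)                            # close to theta_true
```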
Regularization. The OLS estimate is stabilized by penalizing the parameter norm (ridge regression):

$$\hat{\theta} = (X^{\top}X)^{-1}X^{\top}\mathbf{y} \;\sim\; (X^{\top}X + \lambda I_P)^{-1}X^{\top}\mathbf{y}, \quad \lambda \in \mathbb{R}^{+}$$
$$\min_{\forall\theta_p} \sum_{n \in N}\Big(y_n - \sum_{p \in P} x_{np}\theta_p\Big)^{2} + \lambda\sum_{p \in P}\theta_p^2$$
$$\min_{\forall\theta_p} \sum_{n \in N}\Big(y_n - \sum_{p \in P} x_{np}\theta_p\Big)^{2}, \quad \text{s.t.: } \sum_{p \in P}\theta_p^2 < \varepsilon \in \mathbb{R}^{+}$$
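A minimal closed-form ridge sketch; λ and the toy data are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(20, 10))  # few observations relative to parameters
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=20)

# Ridge estimate: (X^T X + lambda * I_P)^{-1} X^T y.
lam = 1.0
P = X.shape[1]
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ y)
print(theta_ridge)  # coefficients shrunk toward zero relative to plain OLS
```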
Coefficient of determination:

$$R^2 = 1 - \frac{D[y \mid x]}{D[y]} = 1 - \frac{\hat{\sigma}^2}{\hat{\sigma}_y^2} = 1 - \frac{SS_{res}/n}{SS_{tot}/n} = 1 - \frac{\sum_{i=1}^{n} e_i^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n\,\hat{\sigma}_y^2},$$

where ŷᵢ is the model prediction and ȳ = mean(y).

Mean Squared Error (MSE) = 𝔼{‖y − ỹ‖²}
Mean Absolute Error (MAE) = 𝔼{‖y − ỹ‖₁}

The R² value is the percentage of variation in the dependent variable explained by the independent predictors. The higher the R² value, the better the model.
Disadvantages: i) MAE and MSE can be highly affected by outliers. ii) The value of R² increases whenever more variables are added to the model, irrespective of whether they contribute to the model or not; this issue is a disadvantage of using R². notebook: 3iEvalRegression
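A short sketch computing the three metrics with scikit-learn; the toy data and fit are illustrative, not taken from the course notebook:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=100)
theta1, theta0 = np.polyfit(x, y, deg=1)
y_hat = theta1 * x + theta0

print("R2 :", r2_score(y, y_hat))             # share of explained variance
print("MSE:", mean_squared_error(y, y_hat))   # sensitive to outliers
print("MAE:", mean_absolute_error(y, y_hat))  # more robust than MSE
```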
Handling non-linear relationships with linear-model machinery:
• Piecewise linearization. A model that can natively deal with non-linearity: DecisionTreeRegressor, or KBinsDiscretizer feeding a linear regressor.
• Polynomial feature expansion. A richer set of features is built by including expert knowledge, which can be directly used by a simple linear model: PolynomialFeatures (see the sketch after this list).
• Locally-based decision functions. A kernel is used to obtain a locally-based decision function instead of a global linear decision function: Nystroem, SVR, Gaussian.
notebook: 03jNonLinearRegression
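A minimal sketch of the polynomial-expansion option; the data and the degree are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(8)
x = rng.uniform(-3, 3, 200).reshape(-1, 1)
y = 0.5 * x.ravel()**2 - x.ravel() + rng.normal(scale=0.3, size=200)

# Expand x into [1, x, x^2], then fit an ordinary linear model on the features.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.score(x, y))  # R^2 near 1, unlike a plain linear fit on x alone
```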