
12/24-704: Probability and Estimation Methods for Engineering Systems

Lec. 22: Multivariate Linear Regression

Instructor: Matteo Pozzi

Example of Linear Regression

[Figure: two scatter plots of strength [MPa] vs. density [kg/m^3] ($n = 40$); the right panel adds the fitted line.]

$n = 40$; $R^2 = 19\%$

Fitted line: $\hat{f} = \hat{\theta}_0 + \hat{\theta}_1 x$, with
$\hat{\theta}_0 = -274.4$ MPa and $\hat{\theta}_1 = 0.1368$ MPa·m³/kg.

$\alpha = 5\% \Rightarrow 1 - \alpha = 95\%$ confidence:

$CI_n(\theta_0) = [\hat{\theta}_0 - t_{\alpha/2,n-2}\,\widehat{\mathrm{se}}_0 \,;\; \hat{\theta}_0 + t_{\alpha/2,n-2}\,\widehat{\mathrm{se}}_0] = [-500.9 \,;\; -47.9]$ MPa

$CI_n(\theta_1) = [\hat{\theta}_1 - t_{\alpha/2,n-2}\,\widehat{\mathrm{se}}_1 \,;\; \hat{\theta}_1 + t_{\alpha/2,n-2}\,\widehat{\mathrm{se}}_1] = [0.0442 \,;\; 0.2295]$ MPa·m³/kg

$t_{\alpha/2,n-2}$: quantile of the t-distribution with $n-2$ degrees of freedom.

$0 \notin CI_n(\theta_1) \Rightarrow$ reject $H_0: \theta_1 = 0$, with significance $\alpha = 5\%$.
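As a minimal sketch of how such interval estimates can be computed (synthetic arrays `density` and `strength` stand in for the lecture's dataset, which is not reproduced here):

```python
import numpy as np
from scipy import stats

# hypothetical data standing in for the lecture's n = 40 points
rng = np.random.default_rng(0)
density = rng.uniform(2400, 2500, size=40)                         # x [kg/m^3]
strength = -274.4 + 0.1368 * density + rng.normal(0, 5, size=40)   # y [MPa]

n = len(density)
X = np.column_stack([np.ones(n), density])           # design matrix [1, x]
theta_hat, rss, *_ = np.linalg.lstsq(X, strength, rcond=None)

sigma2_hat = rss[0] / (n - 2)                        # unbiased noise variance
cov_theta = sigma2_hat * np.linalg.inv(X.T @ X)      # parameter covariance
se = np.sqrt(np.diag(cov_theta))                     # standard errors

t = stats.t.ppf(1 - 0.05 / 2, df=n - 2)              # t-quantile for alpha = 5%
for j in range(2):
    print(f"CI(theta_{j}) = [{theta_hat[j] - t*se[j]:.4f}, {theta_hat[j] + t*se[j]:.4f}]")
```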

Further topics on Regression

After performing regression, we assess how much of the uncertainty of the output is
explained by conditioning on the input.
This is quantified by the Coefficient of Determination $R^2$.

Linear regression can be extended to multiple inputs (and one output):
this is multiple regression.
We can assess the importance of each input in predicting the output.

Again, MultiVariate Normal RVs: conditional variance

$Y \mid X = x \sim \mathcal{N}(\mu_{Y|x}, \sigma^2_{Y|X})$

The conditional distribution is normal, with:
- Cond. mean $\mu_{Y|x}$ linearly varying with $x$:
  $\mathbb{E}_{Y|x}[Y|x] = \mu_{Y|x} = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(x - \mu_X)$
- Cond. variance $\sigma^2_{Y|X}$ invariant with respect to $x$:
  $\mathbb{V}_{Y|x}[Y|x] = \sigma^2_{Y|X} = (1 - \rho^2)\,\sigma_Y^2$

Marginal variance: $\mathbb{V}_Y[Y] = \sigma_Y^2$

[Figure: scatter of $(x, y)$ samples with the conditional-mean line and the 95% conf. int. band.]

Contributions to total variance

Law of total variance: $\sigma_Y^2 = \mathbb{E}_X[\sigma^2_{Y|x}] + \mathbb{V}_X[\mu_{Y|x}]$

For any pair of RVs, we can define marginal and conditional moments, and the law says that
the total uncertainty about $Y$ is the sum of the explained and the unexplained uncertainties.

[Figure: conditional density $p_{Y|X}(y|x)$, with bands $\mu_{Y|x} \pm 2\sigma_{Y|x}$ lying inside the marginal bands $\mu_Y \pm 2\sigma_Y$.]

For the MVN, $\sigma^2_{Y|x}$ does not change with $x$, so $\mathbb{E}_X[\sigma^2_{Y|x}] = \sigma^2_{Y|X}$.
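A quick Monte Carlo sketch of this decomposition for a standard bivariate normal (the correlation $\rho = 0.8$ and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
rho, n = 0.8, 200_000                         # arbitrary correlation and sample size
x = rng.normal(size=n)
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)   # standard bivariate normal

explained = np.var(rho * x)                   # V_X[mu_{Y|x}], here mu_{Y|x} = rho * x
unexplained = 1 - rho**2                      # sigma^2_{Y|X}, constant for the MVN
print(np.var(y), explained + unexplained)     # both close to sigma_Y^2 = 1
```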

Univariate ANOVA

Law of total variance: $\sigma_Y^2 = \mathbb{E}_X[\sigma^2_{Y|x}] + \mathbb{V}_X[\mu_{Y|x}]$

Prior uncertainty of $Y$ = left unexplained + explained by observing $x$:
$\sigma_Y^2 = \sigma^2_{Y|X} + \mathbb{V}_X[f(x)] = (1 - \rho^2)\sigma_Y^2 + \rho^2\sigma_Y^2$

[Figure: the two contributions $\sigma^2_{Y|X}$ and $\mathbb{V}_X[f(x)]$, as fractions of $\sigma_Y^2$, plotted against $\rho \in [0, 1]$.]

Sample estimators:
$\sigma_Y^2 \cong \hat{V}_{Y,n} = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{Y}_n)^2$
$\mathbb{V}_X[f(x)] \cong \hat{V}_{f,n} = \frac{1}{n-1}\sum_{i=1}^{n}(\hat{f}_i - \bar{Y}_n)^2$

Using $\hat{\theta}_0 = \bar{Y}_n - \hat{\theta}_1\bar{X}_n$:
$\sum_{i=1}^{n}(\hat{f}_i - \bar{Y}_n)^2 = \sum_{i=1}^{n}(\hat{\theta}_0 + \hat{\theta}_1 x_i - \bar{Y}_n)^2 = \sum_{i=1}^{n}(\bar{Y}_n - \hat{\theta}_1\bar{X}_n + \hat{\theta}_1 x_i - \bar{Y}_n)^2 = \hat{\theta}_1^2\sum_{i=1}^{n}(x_i - \bar{X}_n)^2$

$\mathbb{V}_X[f(x)] \cong \hat{\theta}_1^2\,\hat{V}_{X,n} = \hat{C}^2_{X,Y,n}/\hat{V}_{X,n} = \hat{\rho}^2\,\hat{V}_{Y,n}$
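A numerical check of this chain of identities (arrays `x`, `y` and the fitted line are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 100)                          # hypothetical inputs
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, 100)         # hypothetical outputs

th1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # slope = C_XY / V_X
th0 = y.mean() - th1 * x.mean()
f_hat = th0 + th1 * x

V_f = np.sum((f_hat - y.mean())**2) / (len(x) - 1)
rho2 = np.corrcoef(x, y)[0, 1]**2
# all three expressions agree:
print(V_f, th1**2 * np.var(x, ddof=1), rho2 * np.var(y, ddof=1))
```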

Univariate ANOVA, cont.

Sample estimators:
$\sigma_Y^2 \cong \hat{V}_{Y,n} = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{Y}_n)^2$   (prior uncertainty of $Y$)
$\mathbb{V}_X[f(x)] \cong \hat{V}_{f,n} = \hat{\rho}^2\,\hat{V}_{Y,n}$   (explained by observing $x$)
$\sigma^2_{Y|X} = \sigma_\varepsilon^2 \cong \hat{\sigma}_\varepsilon^2 = \hat{V}_{\varepsilon,n} = \frac{1}{n-1}\sum_{i=1}^{n} r_i^2$   (left unexplained)

Law of total variance: $\hat{V}_{Y,n} = \hat{V}_{f,n} + \hat{V}_{\varepsilon,n}$
(prior uncertainty of $Y$ = explained by observing $x$ + left unexplained)

Coefficient of determination: $R^2 = \hat{V}_{f,n}/\hat{V}_{Y,n} = \hat{\rho}^2$

$R^2$ estimates the (square of the) correlation between $X$ and $Y$,
and it describes how much of $Y$ can be explained by $X$.

$0 \le R^2 \le 1$. Limit cases: $R^2 = 0 \Rightarrow X$ explains nothing about $Y$;
$R^2 = 1 \Rightarrow X$ explains everything about $Y$.
Coefficient of determination

Classical definition:
$R^2 = \frac{\hat{V}_{f,n}}{\hat{V}_{Y,n}} = 1 - \frac{\hat{V}_{\varepsilon,n}}{\hat{V}_{Y,n}} = 1 - \frac{\sum_{i=1}^{n} r_i^2}{\sum_{i=1}^{n}(y_i - \bar{Y}_n)^2} = \frac{\sum_{i=1}^{n}(\hat{f}_i - \bar{Y}_n)^2}{\sum_{i=1}^{n}(y_i - \bar{Y}_n)^2}$
[e.g. Excel uses this]

Adjusted definition, based on the unbiased noise estimator:
$\sigma^2_{Y|X} = \sigma_\varepsilon^2 \cong \hat{V}'_{\varepsilon,n} = \frac{1}{n-m-1}\sum_{i=1}^{n} r_i^2$   [unbiased]

$\bar{R}^2 = 1 - \frac{\hat{V}'_{\varepsilon,n}}{\hat{V}_{Y,n}} = 1 - \frac{\sum_{i=1}^{n} r_i^2 / (n-m-1)}{\sum_{i=1}^{n}(y_i - \bar{Y}_n)^2 / (n-1)}$

$\hat{V}'_{\varepsilon,n} > \hat{V}_{\varepsilon,n} \Rightarrow \bar{R}^2 < R^2$:
the adjusted coeff. of determination $\bar{R}^2$ is less than the classical one $R^2$.
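A small sketch comparing the two definitions (the degree-of-freedom counts follow the formulas above; the data arrays passed in are hypothetical):

```python
import numpy as np

def r2_classical_adjusted(y, f_hat, m):
    """Classical R^2 and adjusted R^2 for a fit with m features (+ intercept)."""
    n = len(y)
    rss = np.sum((y - f_hat)**2)            # sum of squared residuals
    tss = np.sum((y - y.mean())**2)         # total sum of squares
    r2 = 1 - rss / tss
    r2_adj = 1 - (rss / (n - m - 1)) / (tss / (n - 1))
    return r2, r2_adj
```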

Multiple features

Outcome variable $y$ (aka dependent variable); $m$ features $\tilde{x}_1, \dots, \tilde{x}_m$ (aka independent variables, predictors, covariates).

Dataset of $n$ joint samples, with a constant column $\tilde{x}_0 = 1$ and one parameter $\theta_j$ per column:

    y   | x~0  x~1  x~2  ...  x~m
   y_1  |  1   x11  x12  ...  x1m   -> x_1
   y_2  |  1   x21  x22  ...  x2m   -> x_2
   ...  |  1   ...  ...  ...  ...
   y_n  |  1   xn1  xn2  ...  xnm   -> x_n
        | th0  th1  th2  ...  thm

How to predict $y$ as a function of the $m$ features $\tilde{x}_1, \tilde{x}_2, \dots, \tilde{x}_m$?
Parametric form, with $m+1$ parameters $\theta_0, \theta_1, \dots, \theta_m$.

Linear function: $f_i(\mathbf{x}_i, \boldsymbol{\theta}) = \mathbf{x}_i\boldsymbol{\theta} = \sum_{j=0}^{m} x_{ij}\theta_j = \theta_0 + \sum_{j=1}^{m} x_{ij}\theta_j$

$\forall i \in \{1, 2, \dots, n\}$: $\mathbf{x}_i = [1 \;\; x_{i1} \;\; x_{i2} \;\; \cdots \;\; x_{im}]$ (row vector)

Multivariate linear regression: basics

Matrix of regressors: $\mathbf{X} = [\mathbf{x}_1; \mathbf{x}_2; \dots; \mathbf{x}_n]$, size $n \times (m+1)$ (one row per sample).
Vector of outputs: $\mathbf{y} = [y_1; y_2; \dots; y_n]$, size $n \times 1$.
Vector of predictions: $\mathbf{f} = [f_1; f_2; \dots; f_n]$, size $n \times 1$.
Vector of parameters: $\boldsymbol{\theta} = [\theta_0; \theta_1; \dots; \theta_m]$, size $(m+1) \times 1$.

Linear predictions: $\mathbf{f}(\mathbf{X}, \boldsymbol{\theta}) = \mathbf{X}\boldsymbol{\theta}$
Independent noise: $\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma_\varepsilon^2\mathbf{I})$
Basic equation: $\mathbf{y} = \mathbf{f} + \boldsymbol{\varepsilon} = \mathbf{X}\boldsymbol{\theta} + \boldsymbol{\varepsilon}$

Residuals: $\mathbf{r} = \mathbf{y} - \mathbf{X}\boldsymbol{\theta} = [r_1; r_2; \dots; r_n]$, size $n \times 1$.
Multivariate linear regression: least squares

Residual sum of squares: it is a quadratic form of $\boldsymbol{\theta}$:

$\mathrm{rss}_n(\boldsymbol{\theta}) = \sum_{i=1}^{n} r_i^2 = \mathbf{r}^T\mathbf{r} = (\mathbf{y} - \mathbf{X}\boldsymbol{\theta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\theta}) = \mathbf{y}^T\mathbf{y} - 2\mathbf{y}^T\mathbf{X}\boldsymbol{\theta} + \boldsymbol{\theta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\theta}$

To minimize $\mathrm{rss}_n$, we compute its gradient and set it to zero:

$\nabla\mathrm{rss}(\boldsymbol{\theta}) = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\theta}$
$\nabla\mathrm{rss}(\boldsymbol{\theta}) = \mathbf{0} \;\Leftrightarrow\; \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\theta}} = \mathbf{X}^T\mathbf{y}$
$\exists\,(\mathbf{X}^T\mathbf{X})^{-1}: \hat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{X}^+\mathbf{y}$

Matrix $\mathbf{X}^+ = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$, of size $(m+1) \times n$, is the pseudo-inverse of $\mathbf{X}$.
From formula $\hat{\boldsymbol{\theta}} = \mathbf{X}^+\mathbf{y}$, we conclude that $\hat{\boldsymbol{\theta}}$ is a linear estimator.
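A minimal NumPy sketch of this estimator (assuming `X` and `y` are given; in practice `np.linalg.lstsq` or `np.linalg.pinv` is preferred over forming $(\mathbf{X}^T\mathbf{X})^{-1}$ explicitly):

```python
import numpy as np

def ols_fit(X, y):
    """Least-squares parameters theta_hat = X^+ y for a design matrix X (n x (m+1))."""
    # numerically stable solve of the normal equations X^T X theta = X^T y
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta_hat

# equivalent, following the slide literally (fine when X^T X is well conditioned):
# theta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
# theta_hat = np.linalg.pinv(X) @ y
```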

MLE for multiple regression with normal noise

LH function (conditional): $\mathbf{Y}|\mathbf{X}, \boldsymbol{\theta}, \sigma_\varepsilon^2 \sim \mathcal{N}(\mathbf{f}(\mathbf{X},\boldsymbol{\theta}), \sigma_\varepsilon^2\mathbf{I}) = \mathcal{N}(\mathbf{X}\boldsymbol{\theta}, \sigma_\varepsilon^2\mathbf{I})$

$p_{\mathbf{Y}|\mathbf{X},\boldsymbol{\theta},\sigma_\varepsilon^2}(\mathbf{y}|\mathbf{X},\boldsymbol{\theta},\sigma_\varepsilon^2) = \mathcal{N}(\mathbf{y};\,\mathbf{X}\boldsymbol{\theta}, \sigma_\varepsilon^2\mathbf{I}) = \mathcal{N}(\mathbf{X}\boldsymbol{\theta};\,\mathbf{y}, \sigma_\varepsilon^2\mathbf{I})$

Log-LH function (up to an additive constant):
$\log p_{\mathbf{Y}|\mathbf{X},\boldsymbol{\theta},\sigma_\varepsilon^2} = -\frac{n}{2}\log\sigma_\varepsilon^2 - \frac{1}{2\sigma_\varepsilon^2}(\mathbf{y} - \mathbf{X}\boldsymbol{\theta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\theta})$
$l_n(\boldsymbol{\theta}, \sigma_\varepsilon^2) = -\frac{n}{2}\log\sigma_\varepsilon^2 - \frac{1}{2\sigma_\varepsilon^2}\mathrm{rss}_n(\boldsymbol{\theta})$

To maximize the Log-LH, we minimize $\mathrm{rss}_n$:
$\hat{\boldsymbol{\theta}} = \mathbf{X}^+\mathbf{y};\quad \widehat{\mathrm{rss}}_n \triangleq \mathrm{rss}_n(\hat{\boldsymbol{\theta}});\quad \hat{\sigma}_\varepsilon^2 = \frac{\widehat{\mathrm{rss}}_n}{n-m-1}$

Unbiased estimator, after $m+1$ calibrated parameters.
[Actually, this is the formula for the unbiased estimate of $\sigma_\varepsilon^2$,
while "pure" MLE estimates $\sigma_\varepsilon^2$ as $\widehat{\mathrm{rss}}_n/n$.]
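A short sketch contrasting the two noise-variance estimates (reusing the hypothetical `ols_fit` output from above):

```python
import numpy as np

def noise_variance(X, y, theta_hat):
    """Unbiased and pure-MLE estimates of sigma_eps^2."""
    n, mp1 = X.shape                        # mp1 = m + 1 columns
    rss = np.sum((y - X @ theta_hat)**2)
    return rss / (n - mp1), rss / n         # (unbiased, MLE)
```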

Uncertainty in the estimation of regression parameters

Estimation: $\hat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$, with $\mathbf{Y} = \mathbf{X}\boldsymbol{\theta} + \boldsymbol{\varepsilon}$, $\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma_\varepsilon^2\mathbf{I})$:

$\hat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\boldsymbol{\theta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\varepsilon} = \boldsymbol{\theta} + \mathbf{X}^+\boldsymbol{\varepsilon}$   (error term)

$\mathbb{E}[\boldsymbol{\varepsilon}] = \mathbf{0} \Rightarrow \mathbb{E}[\hat{\boldsymbol{\theta}}|\boldsymbol{\theta}, \mathbf{X}, \sigma_\varepsilon^2] = \boldsymbol{\theta}$: unbiased est. (as the error is zero mean)

$\mathbb{V}[\hat{\boldsymbol{\theta}}|\boldsymbol{\theta}, \mathbf{X}, \sigma_\varepsilon^2] = \sigma_\varepsilon^2\,\mathbf{X}^+(\mathbf{X}^+)^T = \sigma_\varepsilon^2(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} = \sigma_\varepsilon^2(\mathbf{X}^T\mathbf{X})^{-1} \triangleq \boldsymbol{\Sigma}_\Theta \cong \hat{\sigma}_\varepsilon^2(\mathbf{X}^T\mathbf{X})^{-1} \triangleq \hat{\boldsymbol{\Sigma}}_\Theta$

[Figure: graphical models of the estimation, $(\mathbf{X}, \boldsymbol{\theta}, \sigma_\varepsilon^2) \to \mathbf{Y} \to (\hat{\boldsymbol{\theta}}, \hat{\sigma}_\varepsilon^2)$.]

The matrix of uncertainty of $\hat{\boldsymbol{\theta}}$ is proportional to $\sigma_\varepsilon^2$,
while $(\mathbf{X}^T\mathbf{X})^{-1}$ tends to the zero matrix for large $n$.

$\hat{\boldsymbol{\theta}}|\mathbf{X}, \boldsymbol{\theta}, \sigma_\varepsilon^2 \sim \mathcal{N}(\boldsymbol{\theta}, \boldsymbol{\Sigma}_\Theta)$: normally distributed knowing $\sigma_\varepsilon^2$.
$\hat{\boldsymbol{\theta}}|\mathbf{X}, \boldsymbol{\theta} \approx \mathcal{N}(\boldsymbol{\theta}, \hat{\boldsymbol{\Sigma}}_\Theta)$: we can estimate $\sigma_\varepsilon^2 \leftarrow \hat{\sigma}_\varepsilon^2$ from data.
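A sketch of the estimated parameter covariance (the helper names continue the hypothetical examples above):

```python
import numpy as np

def param_covariance(X, y, theta_hat):
    """Estimated covariance of theta_hat: sigma_hat^2 * (X^T X)^{-1}."""
    n, mp1 = X.shape
    sigma2_hat = np.sum((y - X @ theta_hat)**2) / (n - mp1)   # unbiased noise variance
    return sigma2_hat * np.linalg.inv(X.T @ X)
```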

Uncertainty in the estimation of noise level

Stand. error estimating $\theta_j$: $\widehat{\mathrm{se}}_j \triangleq \widehat{\mathrm{se}}(\Theta_j) = \sqrt{[\hat{\boldsymbol{\Sigma}}_\Theta]_{j+1,j+1}}$

Conf. interval for $\theta_j$: $CI_j = [\hat{\theta}_j - z_{\alpha/2}\,\widehat{\mathrm{se}}_j \,;\; \hat{\theta}_j + z_{\alpha/2}\,\widehat{\mathrm{se}}_j]$
(for low $n$, we should use $t_{\alpha/2,n-m-1}$ instead of $z_{\alpha/2}$)

Estimator of the noise level: $(n-m-1)\dfrac{\hat{\sigma}_\varepsilon^2}{\sigma_\varepsilon^2} \sim \chi^2_{n-m-1}$

We can define confidence intervals and test hypotheses on the noise level.
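A sketch of both interval constructions, using SciPy's `t` and `chi2` quantiles (`X`, `y` hypothetical as before):

```python
import numpy as np
from scipy import stats

def intervals(X, y, alpha=0.05):
    n, mp1 = X.shape
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    dof = n - mp1
    sigma2_hat = np.sum((y - X @ theta_hat)**2) / dof
    se = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(X.T @ X)))

    t = stats.t.ppf(1 - alpha / 2, dof)                     # CI for each theta_j
    ci_theta = np.column_stack([theta_hat - t * se, theta_hat + t * se])

    lo, hi = stats.chi2.ppf([alpha / 2, 1 - alpha / 2], dof)
    ci_sigma2 = (dof * sigma2_hat / hi, dof * sigma2_hat / lo)   # CI for sigma_eps^2
    return ci_theta, ci_sigma2
```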

Example of Multivariate Regression

Dataset of criminal activity, $n = 47$ datapoints, $m = 10$ regressors.

  j  name              theta_hat_j  se_hat_j    t_j      P_j
  0  (Intercept)         -589.4      167.6     -3.59    0.0012 **
  1  Age                    1.041      0.446    2.331   0.0255 *
  2  South State           11.29      13.24     0.853   0.3994
  3  Education              1.178      0.681    1.728   0.0926
  4  Expenditures           0.964      0.249    3.861   0.0005 ***
  5  Labor                  0.106      0.153    0.692   0.4935
  6  Numb. Males            0.303      0.222    1.363   0.1813
  7  Population             0.090      0.138    0.652   0.5185
  8  Unempl. (14-24)       -0.682      0.480   -1.418   0.1648
  9  Unempl. (25-39)        2.150      0.950    2.262   0.0299 *
 10  Wealth                -0.083      0.091   -0.913   0.3672

$t_j = \hat{\theta}_j/\widehat{\mathrm{se}}_j$;  $\mathcal{P}_j = 2\Phi(-|t_j|)$.
If $\mathcal{P}_j$ is small, we are confident that $\theta_j$ should not be zero.
We use $\Phi$ under the normal approx., and the Student's t CDF for small $n$.
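A sketch of how such a table can be produced (the dataset is not reproduced; the Student's t CDF is used since $n = 47$ is modest):

```python
import numpy as np
from scipy import stats

def coefficient_table(X, y):
    n, mp1 = X.shape
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    dof = n - mp1
    sigma2_hat = np.sum((y - X @ theta_hat)**2) / dof
    se = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(X.T @ X)))
    t_stat = theta_hat / se
    p_val = 2 * stats.t.cdf(-np.abs(t_stat), dof)   # use stats.norm for large n
    return theta_hat, se, t_stat, p_val
```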
Prediction of new values of $f$ and $y$, given $\mathbf{x}$

How to predict $y_*$, the outcome at features $\mathbf{x}_*$?
True value: $f_* = f(\mathbf{x}_*, \boldsymbol{\theta}) = \mathbf{x}_*\boldsymbol{\theta}$; estimator: $\hat{f}_* = \mathbf{x}_*\hat{\boldsymbol{\theta}}$
$y_* = f_* + \varepsilon_* = \mathbf{x}_*\boldsymbol{\theta} + \varepsilon_*$; $\hat{y}_* = \hat{f}_*$

Uncertainty in the estimator: $\hat{f}_*|\mathbf{x}_*, \mathbf{X}, \boldsymbol{\theta} \approx \mathcal{N}(f_*, \mathbf{x}_*\hat{\boldsymbol{\Sigma}}_\Theta\mathbf{x}_*^T)$

Standard error: $\widehat{\mathrm{se}}^2(\hat{f}_*) = \mathbf{x}_*\hat{\boldsymbol{\Sigma}}_\Theta\mathbf{x}_*^T = \hat{\sigma}_\varepsilon^2\,\mathbf{x}_*(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_*^T$

Conf. Int.: $CI(f_*) = [\hat{f}_* - z_{\alpha/2}\,\widehat{\mathrm{se}}(f_*) \,;\; \hat{f}_* + z_{\alpha/2}\,\widehat{\mathrm{se}}(f_*)]$

Including noise in $y_*$: $\widehat{\mathrm{se}}^2(y_*) = \hat{\sigma}_\varepsilon^2\,(\mathbf{x}_*(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_*^T + 1)$

$CI(y_*) = [\hat{y}_* - z_{\alpha/2}\,\widehat{\mathrm{se}}(y_*) \,;\; \hat{y}_* + z_{\alpha/2}\,\widehat{\mathrm{se}}(y_*)]$

For low $n$, we should use $t_{\alpha/2,n-m-1}$ instead of $z_{\alpha/2}$.
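A sketch of both intervals at a new point `x_star` (a row vector including the leading 1; all names are hypothetical):

```python
import numpy as np
from scipy import stats

def prediction_intervals(X, y, x_star, alpha=0.05):
    n, mp1 = X.shape
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    dof = n - mp1
    sigma2_hat = np.sum((y - X @ theta_hat)**2) / dof
    f_star = x_star @ theta_hat
    q = x_star @ np.linalg.inv(X.T @ X) @ x_star       # x* (X^T X)^{-1} x*^T
    t = stats.t.ppf(1 - alpha / 2, dof)
    se_f = np.sqrt(sigma2_hat * q)                     # interval for the regr. function
    se_y = np.sqrt(sigma2_hat * (q + 1))               # interval for a new noisy y*
    return ((f_star - t * se_f, f_star + t * se_f),
            (f_star - t * se_y, f_star + t * se_y))
```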


Geometry of least squares

Basic equation: $\mathbf{y} = \mathbf{f} + \boldsymbol{\varepsilon}$
Prediction: $\mathbf{f} = \mathbf{X}\boldsymbol{\theta} = \sum_{j=0}^{m}\tilde{\mathbf{x}}_j\theta_j$, a combination of the columns $\tilde{\mathbf{x}}_0, \tilde{\mathbf{x}}_1, \dots, \tilde{\mathbf{x}}_m$ of the dataset matrix.

Dataset $\mathbf{y} \in \mathbb{R}^n$ is a point in $n$-dimensional space.
Prediction vector $\mathbf{f}$ spans a subspace of dimension $m+1$.
In that subspace, point $\hat{\mathbf{f}} = \mathbf{X}\hat{\boldsymbol{\theta}}$ is the closest to $\mathbf{y}$.

$\mathbf{y} = \hat{\mathbf{f}} + \mathbf{r}$ with $\mathbf{r} \perp \hat{\mathbf{f}}$:
the residual is orthogonal to the prediction $\hat{\mathbf{f}}$.

[Figure: $\mathbf{y}$ projected onto the subspace spanned by $\tilde{\mathbf{x}}_1, \tilde{\mathbf{x}}_2$, with residual $\mathbf{r}$ orthogonal to it.]

Proof of orthogonality

We can derive the MLE estimator for $\boldsymbol{\theta}$ from the condition that the residual
vector is orthogonal to the prediction vector:

$\mathbf{f} = \mathbf{X}\boldsymbol{\theta}$   (prediction)
$\boldsymbol{\delta} = \mathbf{y} - \mathbf{f} = \mathbf{y} - \mathbf{X}\boldsymbol{\theta}$   (residual, funct. of $\boldsymbol{\theta}$)

$\mathbf{f} \perp \boldsymbol{\delta} \;\Leftrightarrow\; 0 = \mathbf{f}^T\boldsymbol{\delta} = (\mathbf{X}\hat{\boldsymbol{\theta}})^T(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\theta}})$
$\Leftrightarrow\; 0 = \hat{\boldsymbol{\theta}}^T(\mathbf{X}^T\mathbf{y} - \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\theta}})$
$\Longleftarrow\; \mathbf{0} = \mathbf{X}^T\mathbf{y} - \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\theta}}$   (the bi-directional inference $\Leftrightarrow$ is not certain here)
$\Leftrightarrow\; \hat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{X}^+\mathbf{y}$

The pseudo-inverse matrix can be derived from the orthogonality condition.

MLE $\hat{\boldsymbol{\theta}}$ is the value such that the corresponding prediction $\hat{\mathbf{f}} = \mathbf{X}\hat{\boldsymbol{\theta}}$ is orthogonal to the
corresponding residual $\mathbf{r} = \mathbf{y} - \hat{\mathbf{f}}$. Also, $\mathbf{r}$ is the shortest possible residual vector:
$\|\mathbf{r}\| = \min_{\boldsymbol{\theta}}\|\boldsymbol{\delta}\|$
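A quick numerical check of the orthogonality property (a hypothetical random design):

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])   # n = 50, m = 2 features
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0, 0.3, 50)

theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
f_hat = X @ theta_hat
r = y - f_hat
print(f_hat @ r)     # ~ 0: residual orthogonal to the prediction
print(X.T @ r)       # ~ 0: residual orthogonal to every column of X
```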

Regression, depending on the feature set, I

Time series: strain data collected by a sensor.
We can "explain" them using multiple inputs.

Features: $X_0$: constant, $X_1$: linear time trend, $X_2$: temperature, $X_3$: load.

We assume measures are generated as follows:
$f(\mathbf{X}, \boldsymbol{\theta}) = X_0\theta_0 + X_1\theta_1 + \dots + X_m\theta_m$,  $Y = f + \varepsilon$.
We identify parameters $\hat{\boldsymbol{\theta}}, \hat{\sigma}_\varepsilon^2$ from data analysis.
The outcome depends on the set $S$ of features included.

[Figure: strain time series over $t \in [0, 15]$, with the candidate inputs plotted below.]

Regression, depending on the feature set, II

(Same strain time series and candidate features as in part I.)

With only $X_0$ (the constant), the data are modeled as normally distributed, independent of
time: the fluctuations are interpreted as high noise.

$\hat{\sigma}_\varepsilon = 5.99$

            95% CI_n       true
 theta_0    5.8    7.7     -2.5

[Figure: constant fit with its confidence band over the time series.]

Regression, depending on the feature set, III

(Same strain time series and candidate features as in part I.)

With $X_0$ and $X_1$, the regression function is a straight line.

$\hat{\sigma}_\varepsilon = 2.34$

            95% CI_n       true
 theta_0   -3.4   17.6     -2.5
 theta_1   -2.0   20.2      4.0

$\rho(\hat{\theta}_0, \hat{\theta}_1) = -86\%$.

[Figure: straight-line fit with its confidence band over the time series.]

Regression, depending on the feature set, IV

(Same strain time series and candidate features as in part I.)

With $X_0$, $X_1$ and $X_2$, temperature explains part of the fluctuation.

$\hat{\sigma}_\varepsilon = 2.25$

            95% CI_n       true
 theta_0   -3.5   -2.0     -2.5
 theta_1   17.8   20.3      4.0
 theta_2   0.46    1.5      1.1

$\rho(\hat{\theta}_0, \hat{\theta}_1) = -86\%$, $\rho(\hat{\theta}_0, \hat{\theta}_2) = -4.3\%$, $\rho(\hat{\theta}_1, \hat{\theta}_2) = 5.0\%$.

[Figure: fit including temperature, with its confidence band over the time series.]

Regression, depending on the feature set, V

With $X_0$, $X_1$, $X_2$, $X_3$, the model is consistent
with the generative one, so the noise level is low.

The linear time trend $X_1$ is similar to the load $X_3$,
hence the correlation $\rho(\hat{\theta}_1, \hat{\theta}_3)$ is high in magnitude.
This is an example of the co-linearity issue:
it is hard to identify $\hat{\theta}_1$ and $\hat{\theta}_3$ separately
(only their sum can be identified).

The actual regr. funct. $f$ is mostly inside the conf. int. for $f$.

$\hat{\sigma}_\varepsilon = 1.89$

            95% CI_n       true
 theta_0   -2.9   -1.7     -2.5
 theta_1  -0.80    7.3      4.0
 theta_2   0.47    1.3      1.1
 theta_3    9.6   16.0     12.5

$\rho(\hat{\theta}_0, \hat{\theta}_1) = -39\%$, $\rho(\hat{\theta}_0, \hat{\theta}_2) = -5.0\%$, $\rho(\hat{\theta}_0, \hat{\theta}_3) = 17.7\%$,
$\rho(\hat{\theta}_1, \hat{\theta}_2) = 5.0\%$, $\rho(\hat{\theta}_1, \hat{\theta}_3) = -97\%$, $\rho(\hat{\theta}_2, \hat{\theta}_3) = -3.7\%$.

[Figure: full-model fit, with the actual regression function mostly inside the confidence band.]
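A sketch of this feature-selection experiment (synthetic signals stand in for the lecture's sensor data; all names and numbers below are illustrative, not the lecture's):

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.linspace(0, 15, 200)
temp = np.sin(2 * np.pi * t / 5)                        # stand-in temperature signal
load = t / 15 + 0.02 * rng.normal(size=t.size)          # nearly co-linear with the trend
y = -2.5 + 4.0 * (t / 15) + 1.1 * temp + 12.5 * load + rng.normal(0, 1.9, t.size)

cols = {"X0": np.ones_like(t), "X1": t / 15, "X2": temp, "X3": load}
for S in (["X0"], ["X0", "X1"], ["X0", "X1", "X2"], ["X0", "X1", "X2", "X3"]):
    X = np.column_stack([cols[k] for k in S])
    th, *_ = np.linalg.lstsq(X, y, rcond=None)
    s = np.sqrt(np.sum((y - X @ th)**2) / (len(y) - X.shape[1]))
    print(S, f"sigma_hat = {s:.2f}")                    # drops as features are added
```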

Building a base of functions for regression

Multiple regression can integrate the linear effects of
multiple distinct inputs (e.g.: temperature and humidity
for predicting human comfort).

However, new inputs can also be generated by non-linear
transformations of a set of original inputs.
E.g., we can define: $X_1 = X$, $X_2 = X^2$, $X_3 = X^3$, ...
→ polynomial fitting.

Polynomial regression function:
$f(\mathbf{x}, \boldsymbol{\theta}) = \theta_0 + \theta_1 X + \theta_2 X^2 + \dots + \theta_m X^m$
$Y = f + \varepsilon$;  $m$: polynomial degree.

What is important in linear regression is that the relation
between $\boldsymbol{\theta}$ and $\mathbf{f}$ is linear
(NOT that between the original $\mathbf{x}$ and $\mathbf{f}$).

[Figure: example dataset over $x \in [0, 2]$ with polynomial fits of increasing degree.]
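A sketch of polynomial fitting via the same least-squares machinery, using a Vandermonde design matrix (`x`, `y` are hypothetical arrays):

```python
import numpy as np

def poly_fit(x, y, m):
    """Fit f(x) = theta_0 + theta_1 x + ... + theta_m x^m by least squares."""
    X = np.vander(x, m + 1, increasing=True)   # columns 1, x, x^2, ..., x^m
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta_hat
```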

Building a base of functions for regression, II

Dataset generated from a parabolic regression function.

Actual regression function:
$f(x) = \theta_0 + \theta_1 x + \theta_2 x^2$,  $\varepsilon \sim \mathcal{N}(0, 1.5^2)$

            true
 theta_0      10
 theta_1      38
 theta_2     -15

[Figure: the generated dataset over $x \in [0, 2]$ with the true parabola.]

Building a base of functions for regression, III

Dataset generated from a parabolic regression function.

If the assumed polynomial is a constant ($m = 0$),
estimating $\theta_0$ is estimating $\mu_Y$.

Assumed regression function: $f(x) = \theta_0$

            95% CI_n       true
 theta_0   31.1   32.3       10
 theta_1                     38
 theta_2                    -15

[Figure: constant fit over the dataset.]

Building a base of functions for regression, IV

Dataset generated from a parabolic regression function.

If the assumed polynomial is linear ($m = 1$),
the dataset is still underfitted.

Assumed regression function: $f(x) = \theta_0 + \theta_1 x$

            95% CI_n       true
 theta_0   22.7   25.3       10
 theta_1    6.4    9.0       38
 theta_2                    -15

[Figure: straight-line fit over the dataset.]

Building a base of functions for regression, V

Dataset generated from a parabolic regression function.

If the assumed polynomial is parabolic ($m = 2$),
the assumption is correct, and the actual parameters are
in the estimated conf. intervals.

Assumed regression function: $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2$

            95% CI_n       true
 theta_0    7.4   15.3       10
 theta_1   27.2   44.0       38
 theta_2  -18.2   -9.9      -15

[Figure: parabolic fit over the dataset.]

Building a base of functions for regression, VI

Dataset generated from a parabolic regression function.

If the assumed polynomial is cubic ($m = 3$),
the cubic coefficient is uncertain,
and so the prediction is also highly uncertain.

Assumed regression function: $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$

            95% CI_n       true
 theta_0   -9.5   20.0       10
 theta_1    7.6  103.8       38
 theta_2  -84.6   14.6      -15
 theta_3   -9.4   23.3        0

[Figure: cubic fit with a widening confidence band.]

Building a base of functions for regression, VII

Dataset generated from a parabolic regression function.

If the assumed polynomial is of order four ($m = 4$),
the coefficient of power four is also uncertain,
and so the prediction is even more uncertain.

Assumed regression function: $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$

            95% CI_n       true
 theta_0  -32.7   85.7       10
 theta_1 -302.2  223.0       38
 theta_2 -301.3  539.5      -15
 theta_3 -388.1  189.2        0
 theta_4  -45.4   98.5        0

[Figure: quartic fit with a very wide confidence band.]

Building a base of functions for regression, VIII

Dataset generated from a parabolic regression function.

If the assumed polynomial is of order twelve ($m = 12$),
the optimal coefficients cannot be identified, because of
numerical issues related to co-linearity.

Assumed regression function: $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_{12} x^{12}$

[Figure: degree-12 fit oscillating through the data.]
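A sketch of why the $m = 12$ fit breaks down: the condition number of $\mathbf{X}^T\mathbf{X}$ explodes with the degree (the $x$-grid below is hypothetical):

```python
import numpy as np

x = np.linspace(0, 2, 50)                      # hypothetical input grid
for m in (2, 4, 8, 12):
    X = np.vander(x, m + 1, increasing=True)
    cond = np.linalg.cond(X.T @ X)             # conditioning of the normal equations
    print(f"m = {m:2d}: cond(X^T X) = {cond:.1e}")
# the near-singular normal equations make the coefficients unidentifiable
```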

Summary

Multiple regression is computationally simple, as it is related to linear algebra:

linear system: $\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\theta}} = \mathbf{X}^T\mathbf{y}$
$\mathbf{A}\hat{\boldsymbol{\theta}} = \mathbf{b}$ ⇒ computing $\hat{\boldsymbol{\theta}}$ is solving a linear system.

Uncertainty of the estimation and the predictions can also be computed.

If the set of inputs (i.e. of features) can be selected, the selection process must find
a trade-off between under- and over-fitting, i.e. selecting the right complexity.

Remarks:
1. What is the "meaning" of regression parameter $\theta_i$?
It represents how the regression function $f$ changes under the variation $x_i \to x_i + 1$:
$\Delta f = (\dots + \theta_i(x_i + 1) + \dots) - (\dots + \theta_i x_i + \dots) = \theta_i$

2. What null hypothesis are we testing, in standard multiple regression?
$\mathcal{P}_j$ is the p-value for the null hypothesis that $\theta_j$ is zero (with no restriction on the other
parameters). So we have $m+1$ tests.

References and readings

Baron, chapter 11.3
Wasserman, chapter 13.5
Kottegoda & Rosso, chapter 6.2

https://en.wikipedia.org/wiki/Linear_regression

Proof of error formulas for straight-line regression

We derive the formulas for the cov. matrix in the two-parameter estimation of a straight line,
starting from the general formula in vector-matrix notation: $\boldsymbol{\Sigma}_\Theta = \sigma_\varepsilon^2(\mathbf{X}^T\mathbf{X})^{-1}$

Vector-matrix notation:
$\mathbf{y} = \theta_0\mathbf{x} + \theta_1\mathbf{1} = [\mathbf{x}\;\;\mathbf{1}]\,\boldsymbol{\theta} = \mathbf{X}\boldsymbol{\theta}$, with $\mathbf{X} = [\mathbf{x}\;\;\mathbf{1}]$

$\mathbf{X}^T\mathbf{X} = \begin{bmatrix}\mathbf{x}^T\\ \mathbf{1}^T\end{bmatrix}[\mathbf{x}\;\;\mathbf{1}] = \begin{bmatrix}\mathbf{x}^T\mathbf{x} & \mathbf{x}^T\mathbf{1}\\ \mathbf{1}^T\mathbf{x} & \mathbf{1}^T\mathbf{1}\end{bmatrix} = \begin{bmatrix}n\overline{X^2_n} & n\bar{X}_n\\ n\bar{X}_n & n\end{bmatrix} = n\begin{bmatrix}\overline{X^2_n} & \bar{X}_n\\ \bar{X}_n & 1\end{bmatrix}$

Matrix inversion:
$\begin{bmatrix}a & b\\ c & d\end{bmatrix}^{-1} = \frac{1}{ad - bc}\begin{bmatrix}d & -b\\ -c & a\end{bmatrix} \;\Rightarrow\; \begin{bmatrix}\overline{X^2_n} & \bar{X}_n\\ \bar{X}_n & 1\end{bmatrix}^{-1} = \frac{1}{\overline{X^2_n} - \bar{X}_n^2}\begin{bmatrix}1 & -\bar{X}_n\\ -\bar{X}_n & \overline{X^2_n}\end{bmatrix}$

where $\overline{X^2_n} - \bar{X}_n^2 = \hat{V}_{X,n}$.

Hence: $\boldsymbol{\Sigma}_\Theta = \sigma_\varepsilon^2(\mathbf{X}^T\mathbf{X})^{-1} = \dfrac{\sigma_\varepsilon^2}{n\,\hat{V}_{X,n}}\begin{bmatrix}1 & -\bar{X}_n\\ -\bar{X}_n & \overline{X^2_n}\end{bmatrix}$, as reported in a past lecture.
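A numerical check that the closed form matches the general formula (hypothetical data; note the derivation above implies the $1/n$ divisor for $\hat{V}_{X,n}$):

```python
import numpy as np

rng = np.random.default_rng(5)
n, sigma2 = 30, 1.0
x = rng.uniform(0, 10, n)

X = np.column_stack([x, np.ones(n)])            # X = [x 1], theta = [slope, intercept]
general = sigma2 * np.linalg.inv(X.T @ X)

V_X = np.mean(x**2) - x.mean()**2               # 1/n divisor, per the derivation
closed = sigma2 / (n * V_X) * np.array([[1, -x.mean()],
                                        [-x.mean(), np.mean(x**2)]])
print(np.allclose(general, closed))             # True
```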
