
12/24-704: Probability and Estimation Methods for Engineering Systems

Lec. 22: Multivariate Linear Regression

Instructor: Matteo Pozzi

Example of Linear Regression

[Figure: two scatter plots of strength [MPa] vs. density [kg/m^3] ($n = 40$); the right panel adds the fitted line.]

$n = 40$; $R^2 = 19\%$

Fitted line: $\hat{f} = \hat{\theta}_0 + \hat{\theta}_1 x$, with
$\hat{\theta}_0 = -274.4$ MPa and $\hat{\theta}_1 = 0.1368$ MPa·m³/kg.

$\alpha = 5\% \Rightarrow 1 - \alpha = 95\%$ confidence:

$CI_n(\theta_0) = [\hat{\theta}_0 - t_{\alpha/2,n-2}\,\widehat{\mathrm{se}}_0 \,;\; \hat{\theta}_0 + t_{\alpha/2,n-2}\,\widehat{\mathrm{se}}_0] = [-500.9 \,;\; -47.9]$ MPa

$CI_n(\theta_1) = [\hat{\theta}_1 - t_{\alpha/2,n-2}\,\widehat{\mathrm{se}}_1 \,;\; \hat{\theta}_1 + t_{\alpha/2,n-2}\,\widehat{\mathrm{se}}_1] = [0.0442 \,;\; 0.2295]$ MPa·m³/kg

$t_{\alpha/2,n-2}$: quantile of the t-distribution with $n-2$ degrees of freedom.

$0 \notin CI_n(\theta_1) \Rightarrow$ reject $H_0: \theta_1 = 0$, with significance $\alpha = 5\%$.
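As a minimal sketch of how such interval estimates can be computed (synthetic arrays `density` and `strength` stand in for the lecture's dataset, which is not reproduced here):

```python
import numpy as np
from scipy import stats

# hypothetical data standing in for the lecture's n = 40 points
rng = np.random.default_rng(0)
density = rng.uniform(2400, 2500, size=40)                         # x [kg/m^3]
strength = -274.4 + 0.1368 * density + rng.normal(0, 5, size=40)   # y [MPa]

n = len(density)
X = np.column_stack([np.ones(n), density])           # design matrix [1, x]
theta_hat, rss, *_ = np.linalg.lstsq(X, strength, rcond=None)

sigma2_hat = rss[0] / (n - 2)                        # unbiased noise variance
cov_theta = sigma2_hat * np.linalg.inv(X.T @ X)      # parameter covariance
se = np.sqrt(np.diag(cov_theta))                     # standard errors

t = stats.t.ppf(1 - 0.05 / 2, df=n - 2)              # t-quantile for alpha = 5%
for j in range(2):
    print(f"CI(theta_{j}) = [{theta_hat[j] - t*se[j]:.4f}, {theta_hat[j] + t*se[j]:.4f}]")
```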

Further topics on Regression

After performing regression, we assess how much of the uncertainty of the output is
explained by conditioning on the input.
This is quantified by the Coefficient of Determination $R^2$.

Linear regression can be extended to multiple inputs (and one output):
this is multiple regression.
We can assess the importance of each input in predicting the output.

Again, MultiVariate Normal RVs: conditional variance

$Y \mid X = x \sim \mathcal{N}(\mu_{Y|x}, \sigma^2_{Y|X})$

The conditional distribution is normal, with:
- Cond. mean $\mu_{Y|x}$ linearly varying with $x$:
  $\mathbb{E}_{Y|x}[Y|x] = \mu_{Y|x} = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(x - \mu_X)$
- Cond. variance $\sigma^2_{Y|X}$ invariant with respect to $x$:
  $\mathbb{V}_{Y|x}[Y|x] = \sigma^2_{Y|X} = (1 - \rho^2)\,\sigma_Y^2$

Marginal variance: $\mathbb{V}_Y[Y] = \sigma_Y^2$

[Figure: scatter of $(x, y)$ samples with the conditional-mean line and the 95% conf. int. band.]

Contributions to total variance

Law of total variance: $\sigma_Y^2 = \mathbb{E}_X[\sigma^2_{Y|x}] + \mathbb{V}_X[\mu_{Y|x}]$

For any pair of RVs, we can define marginal and conditional moments, and the law says that
the total uncertainty about $Y$ is the sum of the explained and the unexplained uncertainties.

[Figure: conditional density $p_{Y|X}(y|x)$, with bands $\mu_{Y|x} \pm 2\sigma_{Y|x}$ lying inside the marginal bands $\mu_Y \pm 2\sigma_Y$.]

For the MVN, $\sigma^2_{Y|x}$ does not change with $x$, so $\mathbb{E}_X[\sigma^2_{Y|x}] = \sigma^2_{Y|X}$.
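A quick Monte Carlo sketch of this decomposition for a standard bivariate normal (the correlation $\rho = 0.8$ and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
rho, n = 0.8, 200_000                         # arbitrary correlation and sample size
x = rng.normal(size=n)
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)   # standard bivariate normal

explained = np.var(rho * x)                   # V_X[mu_{Y|x}], here mu_{Y|x} = rho * x
unexplained = 1 - rho**2                      # sigma^2_{Y|X}, constant for the MVN
print(np.var(y), explained + unexplained)     # both close to sigma_Y^2 = 1
```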

Univariate ANOVA

Law of total variance: $\sigma_Y^2 = \mathbb{E}_X[\sigma^2_{Y|x}] + \mathbb{V}_X[\mu_{Y|x}]$

Prior uncertainty of $Y$ = left unexplained + explained by observing $x$:
$\sigma_Y^2 = \sigma^2_{Y|X} + \mathbb{V}_X[f(x)] = (1 - \rho^2)\sigma_Y^2 + \rho^2\sigma_Y^2$

[Figure: the two contributions $\sigma^2_{Y|X}$ and $\mathbb{V}_X[f(x)]$, as fractions of $\sigma_Y^2$, plotted against $\rho \in [0, 1]$.]

Sample estimators:
$\sigma_Y^2 \cong \hat{V}_{Y,n} = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{Y}_n)^2$
$\mathbb{V}_X[f(x)] \cong \hat{V}_{f,n} = \frac{1}{n-1}\sum_{i=1}^{n}(\hat{f}_i - \bar{Y}_n)^2$

Using $\hat{\theta}_0 = \bar{Y}_n - \hat{\theta}_1\bar{X}_n$:
$\sum_{i=1}^{n}(\hat{f}_i - \bar{Y}_n)^2 = \sum_{i=1}^{n}(\hat{\theta}_0 + \hat{\theta}_1 x_i - \bar{Y}_n)^2 = \sum_{i=1}^{n}(\bar{Y}_n - \hat{\theta}_1\bar{X}_n + \hat{\theta}_1 x_i - \bar{Y}_n)^2 = \hat{\theta}_1^2\sum_{i=1}^{n}(x_i - \bar{X}_n)^2$

$\mathbb{V}_X[f(x)] \cong \hat{\theta}_1^2\,\hat{V}_{X,n} = \hat{C}^2_{X,Y,n}/\hat{V}_{X,n} = \hat{\rho}^2\,\hat{V}_{Y,n}$
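A numerical check of this chain of identities (arrays `x`, `y` and the fitted line are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 100)                          # hypothetical inputs
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, 100)         # hypothetical outputs

th1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # slope = C_XY / V_X
th0 = y.mean() - th1 * x.mean()
f_hat = th0 + th1 * x

V_f = np.sum((f_hat - y.mean())**2) / (len(x) - 1)
rho2 = np.corrcoef(x, y)[0, 1]**2
# all three expressions agree:
print(V_f, th1**2 * np.var(x, ddof=1), rho2 * np.var(y, ddof=1))
```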

Univariate ANOVA, cont.

Sample estimators:
$\sigma_Y^2 \cong \hat{V}_{Y,n} = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{Y}_n)^2$   (prior uncertainty of $Y$)
$\mathbb{V}_X[f(x)] \cong \hat{V}_{f,n} = \hat{\rho}^2\,\hat{V}_{Y,n}$   (explained by observing $x$)
$\sigma^2_{Y|X} = \sigma_\varepsilon^2 \cong \hat{\sigma}_\varepsilon^2 = \hat{V}_{\varepsilon,n} = \frac{1}{n-1}\sum_{i=1}^{n} r_i^2$   (left unexplained)

Law of total variance: $\hat{V}_{Y,n} = \hat{V}_{f,n} + \hat{V}_{\varepsilon,n}$
(prior uncertainty of $Y$ = explained by observing $x$ + left unexplained)

Coefficient of determination: $R^2 = \hat{V}_{f,n}/\hat{V}_{Y,n} = \hat{\rho}^2$

$R^2$ estimates the (square of the) correlation between $X$ and $Y$,
and it describes how much of $Y$ can be explained by $X$.

$0 \le R^2 \le 1$. Limit cases: $R^2 = 0 \Rightarrow X$ explains nothing about $Y$;
$R^2 = 1 \Rightarrow X$ explains everything about $Y$.
Coefficient of determination

Classical definition:
$R^2 = \frac{\hat{V}_{f,n}}{\hat{V}_{Y,n}} = 1 - \frac{\hat{V}_{\varepsilon,n}}{\hat{V}_{Y,n}} = 1 - \frac{\sum_{i=1}^{n} r_i^2}{\sum_{i=1}^{n}(y_i - \bar{Y}_n)^2} = \frac{\sum_{i=1}^{n}(\hat{f}_i - \bar{Y}_n)^2}{\sum_{i=1}^{n}(y_i - \bar{Y}_n)^2}$
[e.g. Excel uses this]

Adjusted definition, based on the unbiased noise estimator:
$\sigma^2_{Y|X} = \sigma_\varepsilon^2 \cong \hat{V}'_{\varepsilon,n} = \frac{1}{n-m-1}\sum_{i=1}^{n} r_i^2$   [unbiased]

$\bar{R}^2 = 1 - \frac{\hat{V}'_{\varepsilon,n}}{\hat{V}_{Y,n}} = 1 - \frac{\sum_{i=1}^{n} r_i^2 / (n-m-1)}{\sum_{i=1}^{n}(y_i - \bar{Y}_n)^2 / (n-1)}$

$\hat{V}'_{\varepsilon,n} > \hat{V}_{\varepsilon,n} \Rightarrow \bar{R}^2 < R^2$:
the adjusted coeff. of determination $\bar{R}^2$ is less than the classical one $R^2$.
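A small sketch comparing the two definitions (the degree-of-freedom counts follow the formulas above; the data arrays passed in are hypothetical):

```python
import numpy as np

def r2_classical_adjusted(y, f_hat, m):
    """Classical R^2 and adjusted R^2 for a fit with m features (+ intercept)."""
    n = len(y)
    rss = np.sum((y - f_hat)**2)            # sum of squared residuals
    tss = np.sum((y - y.mean())**2)         # total sum of squares
    r2 = 1 - rss / tss
    r2_adj = 1 - (rss / (n - m - 1)) / (tss / (n - 1))
    return r2, r2_adj
```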

Multiple features

Outcome variable $y$ (aka dependent variable); $m$ features $\tilde{x}_1, \dots, \tilde{x}_m$ (aka independent variables, predictors, covariates).

Dataset of $n$ joint samples, with a constant column $\tilde{x}_0 = 1$ and one parameter $\theta_j$ per column:

    y   | x~0  x~1  x~2  ...  x~m
   y_1  |  1   x11  x12  ...  x1m   -> x_1
   y_2  |  1   x21  x22  ...  x2m   -> x_2
   ...  |  1   ...  ...  ...  ...
   y_n  |  1   xn1  xn2  ...  xnm   -> x_n
        | th0  th1  th2  ...  thm

How to predict $y$ as a function of the $m$ features $\tilde{x}_1, \tilde{x}_2, \dots, \tilde{x}_m$?
Parametric form, with $m+1$ parameters $\theta_0, \theta_1, \dots, \theta_m$.

Linear function: $f_i(\mathbf{x}_i, \boldsymbol{\theta}) = \mathbf{x}_i\boldsymbol{\theta} = \sum_{j=0}^{m} x_{ij}\theta_j = \theta_0 + \sum_{j=1}^{m} x_{ij}\theta_j$

$\forall i \in \{1, 2, \dots, n\}$: $\mathbf{x}_i = [1 \;\; x_{i1} \;\; x_{i2} \;\; \cdots \;\; x_{im}]$ (row vector)

Multivariate linear regression: basics

Matrix of regressors: $\mathbf{X} = [\mathbf{x}_1; \mathbf{x}_2; \dots; \mathbf{x}_n]$, size $n \times (m+1)$ (one row per sample).
Vector of outputs: $\mathbf{y} = [y_1; y_2; \dots; y_n]$, size $n \times 1$.
Vector of predictions: $\mathbf{f} = [f_1; f_2; \dots; f_n]$, size $n \times 1$.
Vector of parameters: $\boldsymbol{\theta} = [\theta_0; \theta_1; \dots; \theta_m]$, size $(m+1) \times 1$.

Linear predictions: $\mathbf{f}(\mathbf{X}, \boldsymbol{\theta}) = \mathbf{X}\boldsymbol{\theta}$
Independent noise: $\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma_\varepsilon^2\mathbf{I})$
Basic equation: $\mathbf{y} = \mathbf{f} + \boldsymbol{\varepsilon} = \mathbf{X}\boldsymbol{\theta} + \boldsymbol{\varepsilon}$

Residuals: $\mathbf{r} = \mathbf{y} - \mathbf{X}\boldsymbol{\theta} = [r_1; r_2; \dots; r_n]$, size $n \times 1$.
Multivariate linear regression: least squares

Residual sum of squares: it is a quadratic form of $\boldsymbol{\theta}$:

$\mathrm{rss}_n(\boldsymbol{\theta}) = \sum_{i=1}^{n} r_i^2 = \mathbf{r}^T\mathbf{r} = (\mathbf{y} - \mathbf{X}\boldsymbol{\theta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\theta}) = \mathbf{y}^T\mathbf{y} - 2\mathbf{y}^T\mathbf{X}\boldsymbol{\theta} + \boldsymbol{\theta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\theta}$

To minimize $\mathrm{rss}_n$, we compute its gradient and set it to zero:

$\nabla\mathrm{rss}(\boldsymbol{\theta}) = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\theta}$
$\nabla\mathrm{rss}(\boldsymbol{\theta}) = \mathbf{0} \;\Leftrightarrow\; \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\theta}} = \mathbf{X}^T\mathbf{y}$
$\exists\,(\mathbf{X}^T\mathbf{X})^{-1}: \hat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{X}^+\mathbf{y}$

Matrix $\mathbf{X}^+ = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$, of size $(m+1) \times n$, is the pseudo-inverse of $\mathbf{X}$.
From formula $\hat{\boldsymbol{\theta}} = \mathbf{X}^+\mathbf{y}$, we conclude that $\hat{\boldsymbol{\theta}}$ is a linear estimator.
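A minimal NumPy sketch of this estimator (assuming `X` and `y` are given; in practice `np.linalg.lstsq` or `np.linalg.pinv` is preferred over forming $(\mathbf{X}^T\mathbf{X})^{-1}$ explicitly):

```python
import numpy as np

def ols_fit(X, y):
    """Least-squares parameters theta_hat = X^+ y for a design matrix X (n x (m+1))."""
    # numerically stable solve of the normal equations X^T X theta = X^T y
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta_hat

# equivalent, following the slide literally (fine when X^T X is well conditioned):
# theta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
# theta_hat = np.linalg.pinv(X) @ y
```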

MLE for multiple regression with normal noise

LH function (conditional): $\mathbf{Y}|\mathbf{X}, \boldsymbol{\theta}, \sigma_\varepsilon^2 \sim \mathcal{N}(\mathbf{f}(\mathbf{X},\boldsymbol{\theta}), \sigma_\varepsilon^2\mathbf{I}) = \mathcal{N}(\mathbf{X}\boldsymbol{\theta}, \sigma_\varepsilon^2\mathbf{I})$

$p_{\mathbf{Y}|\mathbf{X},\boldsymbol{\theta},\sigma_\varepsilon^2}(\mathbf{y}|\mathbf{X},\boldsymbol{\theta},\sigma_\varepsilon^2) = \mathcal{N}(\mathbf{y};\,\mathbf{X}\boldsymbol{\theta}, \sigma_\varepsilon^2\mathbf{I}) = \mathcal{N}(\mathbf{X}\boldsymbol{\theta};\,\mathbf{y}, \sigma_\varepsilon^2\mathbf{I})$

Log-LH function (up to an additive constant):
$\log p_{\mathbf{Y}|\mathbf{X},\boldsymbol{\theta},\sigma_\varepsilon^2} = -\frac{n}{2}\log\sigma_\varepsilon^2 - \frac{1}{2\sigma_\varepsilon^2}(\mathbf{y} - \mathbf{X}\boldsymbol{\theta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\theta})$
$l_n(\boldsymbol{\theta}, \sigma_\varepsilon^2) = -\frac{n}{2}\log\sigma_\varepsilon^2 - \frac{1}{2\sigma_\varepsilon^2}\mathrm{rss}_n(\boldsymbol{\theta})$

To maximize the Log-LH, we minimize $\mathrm{rss}_n$:
$\hat{\boldsymbol{\theta}} = \mathbf{X}^+\mathbf{y};\quad \widehat{\mathrm{rss}}_n \triangleq \mathrm{rss}_n(\hat{\boldsymbol{\theta}});\quad \hat{\sigma}_\varepsilon^2 = \frac{\widehat{\mathrm{rss}}_n}{n-m-1}$

Unbiased estimator, after $m+1$ calibrated parameters.
[Actually, this is the formula for the unbiased estimate of $\sigma_\varepsilon^2$,
while "pure" MLE estimates $\sigma_\varepsilon^2$ as $\widehat{\mathrm{rss}}_n/n$.]
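A short sketch contrasting the two noise-variance estimates (reusing the hypothetical `ols_fit` output from above):

```python
import numpy as np

def noise_variance(X, y, theta_hat):
    """Unbiased and pure-MLE estimates of sigma_eps^2."""
    n, mp1 = X.shape                        # mp1 = m + 1 columns
    rss = np.sum((y - X @ theta_hat)**2)
    return rss / (n - mp1), rss / n         # (unbiased, MLE)
```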

Uncertainty in the estimation of regression parameters

Estimation: $\hat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$, with $\mathbf{Y} = \mathbf{X}\boldsymbol{\theta} + \boldsymbol{\varepsilon}$, $\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma_\varepsilon^2\mathbf{I})$:

$\hat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\boldsymbol{\theta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\varepsilon} = \boldsymbol{\theta} + \mathbf{X}^+\boldsymbol{\varepsilon}$   (error term)

$\mathbb{E}[\boldsymbol{\varepsilon}] = \mathbf{0} \Rightarrow \mathbb{E}[\hat{\boldsymbol{\theta}}|\boldsymbol{\theta}, \mathbf{X}, \sigma_\varepsilon^2] = \boldsymbol{\theta}$: unbiased est. (as the error is zero mean)

$\mathbb{V}[\hat{\boldsymbol{\theta}}|\boldsymbol{\theta}, \mathbf{X}, \sigma_\varepsilon^2] = \sigma_\varepsilon^2\,\mathbf{X}^+(\mathbf{X}^+)^T = \sigma_\varepsilon^2(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} = \sigma_\varepsilon^2(\mathbf{X}^T\mathbf{X})^{-1} \triangleq \boldsymbol{\Sigma}_\Theta \cong \hat{\sigma}_\varepsilon^2(\mathbf{X}^T\mathbf{X})^{-1} \triangleq \hat{\boldsymbol{\Sigma}}_\Theta$

[Figure: graphical models of the estimation, $(\mathbf{X}, \boldsymbol{\theta}, \sigma_\varepsilon^2) \to \mathbf{Y} \to (\hat{\boldsymbol{\theta}}, \hat{\sigma}_\varepsilon^2)$.]

The matrix of uncertainty of $\hat{\boldsymbol{\theta}}$ is proportional to $\sigma_\varepsilon^2$,
while $(\mathbf{X}^T\mathbf{X})^{-1}$ tends to the zero matrix for large $n$.

$\hat{\boldsymbol{\theta}}|\mathbf{X}, \boldsymbol{\theta}, \sigma_\varepsilon^2 \sim \mathcal{N}(\boldsymbol{\theta}, \boldsymbol{\Sigma}_\Theta)$: normally distributed knowing $\sigma_\varepsilon^2$.
$\hat{\boldsymbol{\theta}}|\mathbf{X}, \boldsymbol{\theta} \approx \mathcal{N}(\boldsymbol{\theta}, \hat{\boldsymbol{\Sigma}}_\Theta)$: we can estimate $\sigma_\varepsilon^2 \leftarrow \hat{\sigma}_\varepsilon^2$ from data.
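A sketch of the estimated parameter covariance (the helper names continue the hypothetical examples above):

```python
import numpy as np

def param_covariance(X, y, theta_hat):
    """Estimated covariance of theta_hat: sigma_hat^2 * (X^T X)^{-1}."""
    n, mp1 = X.shape
    sigma2_hat = np.sum((y - X @ theta_hat)**2) / (n - mp1)   # unbiased noise variance
    return sigma2_hat * np.linalg.inv(X.T @ X)
```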

Uncertainty in the estimation of noise level

Stand. error estimating $\theta_j$: $\widehat{\mathrm{se}}_j \triangleq \widehat{\mathrm{se}}(\Theta_j) = \sqrt{[\hat{\boldsymbol{\Sigma}}_\Theta]_{j+1,j+1}}$

Conf. interval for $\theta_j$: $CI_j = [\hat{\theta}_j - z_{\alpha/2}\,\widehat{\mathrm{se}}_j \,;\; \hat{\theta}_j + z_{\alpha/2}\,\widehat{\mathrm{se}}_j]$
(for low $n$, we should use $t_{\alpha/2,n-m-1}$ instead of $z_{\alpha/2}$)

Estimator of the noise level: $(n-m-1)\dfrac{\hat{\sigma}_\varepsilon^2}{\sigma_\varepsilon^2} \sim \chi^2_{n-m-1}$

We can define confidence intervals and test hypotheses on the noise level.
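A sketch of both interval constructions, using SciPy's `t` and `chi2` quantiles (`X`, `y` hypothetical as before):

```python
import numpy as np
from scipy import stats

def intervals(X, y, alpha=0.05):
    n, mp1 = X.shape
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    dof = n - mp1
    sigma2_hat = np.sum((y - X @ theta_hat)**2) / dof
    se = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(X.T @ X)))

    t = stats.t.ppf(1 - alpha / 2, dof)                     # CI for each theta_j
    ci_theta = np.column_stack([theta_hat - t * se, theta_hat + t * se])

    lo, hi = stats.chi2.ppf([alpha / 2, 1 - alpha / 2], dof)
    ci_sigma2 = (dof * sigma2_hat / hi, dof * sigma2_hat / lo)   # CI for sigma_eps^2
    return ci_theta, ci_sigma2
```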

Example of Multivariate Regression

Dataset of criminal activity, $n = 47$ datapoints, $m = 10$ regressors.

  j  name              theta_hat_j  se_hat_j    t_j      P_j
  0  (Intercept)         -589.4      167.6     -3.59    0.0012 **
  1  Age                    1.041      0.446    2.331   0.0255 *
  2  South State           11.29      13.24     0.853   0.3994
  3  Education              1.178      0.681    1.728   0.0926
  4  Expenditures           0.964      0.249    3.861   0.0005 ***
  5  Labor                  0.106      0.153    0.692   0.4935
  6  Numb. Males            0.303      0.222    1.363   0.1813
  7  Population             0.090      0.138    0.652   0.5185
  8  Unempl. (14-24)       -0.682      0.480   -1.418   0.1648
  9  Unempl. (25-39)        2.150      0.950    2.262   0.0299 *
 10  Wealth                -0.083      0.091   -0.913   0.3672

$t_j = \hat{\theta}_j/\widehat{\mathrm{se}}_j$;  $\mathcal{P}_j = 2\Phi(-|t_j|)$.
If $\mathcal{P}_j$ is small, we are confident that $\theta_j$ should not be zero.
We use $\Phi$ under the normal approx., and the Student's t CDF for small $n$.
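A sketch of how such a table can be produced (the dataset is not reproduced; the Student's t CDF is used since $n = 47$ is modest):

```python
import numpy as np
from scipy import stats

def coefficient_table(X, y):
    n, mp1 = X.shape
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    dof = n - mp1
    sigma2_hat = np.sum((y - X @ theta_hat)**2) / dof
    se = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(X.T @ X)))
    t_stat = theta_hat / se
    p_val = 2 * stats.t.cdf(-np.abs(t_stat), dof)   # use stats.norm for large n
    return theta_hat, se, t_stat, p_val
```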
Prediction of new values of $f$ and $y$, given $\mathbf{x}$

How to predict $y_*$, the outcome at features $\mathbf{x}_*$?
True value: $f_* = f(\mathbf{x}_*, \boldsymbol{\theta}) = \mathbf{x}_*\boldsymbol{\theta}$; estimator: $\hat{f}_* = \mathbf{x}_*\hat{\boldsymbol{\theta}}$
$y_* = f_* + \varepsilon_* = \mathbf{x}_*\boldsymbol{\theta} + \varepsilon_*$; $\hat{y}_* = \hat{f}_*$

Uncertainty in the estimator: $\hat{f}_*|\mathbf{x}_*, \mathbf{X}, \boldsymbol{\theta} \approx \mathcal{N}(f_*, \mathbf{x}_*\hat{\boldsymbol{\Sigma}}_\Theta\mathbf{x}_*^T)$

Standard error: $\widehat{\mathrm{se}}^2(\hat{f}_*) = \mathbf{x}_*\hat{\boldsymbol{\Sigma}}_\Theta\mathbf{x}_*^T = \hat{\sigma}_\varepsilon^2\,\mathbf{x}_*(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_*^T$

Conf. Int.: $CI(f_*) = [\hat{f}_* - z_{\alpha/2}\,\widehat{\mathrm{se}}(f_*) \,;\; \hat{f}_* + z_{\alpha/2}\,\widehat{\mathrm{se}}(f_*)]$

Including noise in $y_*$: $\widehat{\mathrm{se}}^2(y_*) = \hat{\sigma}_\varepsilon^2\,(\mathbf{x}_*(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_*^T + 1)$

$CI(y_*) = [\hat{y}_* - z_{\alpha/2}\,\widehat{\mathrm{se}}(y_*) \,;\; \hat{y}_* + z_{\alpha/2}\,\widehat{\mathrm{se}}(y_*)]$

For low $n$, we should use $t_{\alpha/2,n-m-1}$ instead of $z_{\alpha/2}$.
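A sketch of both intervals at a new point `x_star` (a row vector including the leading 1; all names are hypothetical):

```python
import numpy as np
from scipy import stats

def prediction_intervals(X, y, x_star, alpha=0.05):
    n, mp1 = X.shape
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    dof = n - mp1
    sigma2_hat = np.sum((y - X @ theta_hat)**2) / dof
    f_star = x_star @ theta_hat
    q = x_star @ np.linalg.inv(X.T @ X) @ x_star       # x* (X^T X)^{-1} x*^T
    t = stats.t.ppf(1 - alpha / 2, dof)
    se_f = np.sqrt(sigma2_hat * q)                     # interval for the regr. function
    se_y = np.sqrt(sigma2_hat * (q + 1))               # interval for a new noisy y*
    return ((f_star - t * se_f, f_star + t * se_f),
            (f_star - t * se_y, f_star + t * se_y))
```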


Geometry of least squares

Basic equation: $\mathbf{y} = \mathbf{f} + \boldsymbol{\varepsilon}$
Prediction: $\mathbf{f} = \mathbf{X}\boldsymbol{\theta} = \sum_{j=0}^{m}\tilde{\mathbf{x}}_j\theta_j$, a combination of the columns $\tilde{\mathbf{x}}_0, \tilde{\mathbf{x}}_1, \dots, \tilde{\mathbf{x}}_m$ of the dataset matrix.

Dataset $\mathbf{y} \in \mathbb{R}^n$ is a point in $n$-dimensional space.
Prediction vector $\mathbf{f}$ spans a subspace of dimension $m+1$.
In that subspace, point $\hat{\mathbf{f}} = \mathbf{X}\hat{\boldsymbol{\theta}}$ is the closest to $\mathbf{y}$.

$\mathbf{y} = \hat{\mathbf{f}} + \mathbf{r}$ with $\mathbf{r} \perp \hat{\mathbf{f}}$:
the residual is orthogonal to the prediction $\hat{\mathbf{f}}$.

[Figure: $\mathbf{y}$ projected onto the subspace spanned by $\tilde{\mathbf{x}}_1, \tilde{\mathbf{x}}_2$, with residual $\mathbf{r}$ orthogonal to it.]

Proof of orthogonality

We can derive the MLE estimator for $\boldsymbol{\theta}$ from the condition that the residual
vector is orthogonal to the prediction vector:

$\mathbf{f} = \mathbf{X}\boldsymbol{\theta}$   (prediction)
$\boldsymbol{\delta} = \mathbf{y} - \mathbf{f} = \mathbf{y} - \mathbf{X}\boldsymbol{\theta}$   (residual, funct. of $\boldsymbol{\theta}$)

$\mathbf{f} \perp \boldsymbol{\delta} \;\Leftrightarrow\; 0 = \mathbf{f}^T\boldsymbol{\delta} = (\mathbf{X}\hat{\boldsymbol{\theta}})^T(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\theta}})$
$\Leftrightarrow\; 0 = \hat{\boldsymbol{\theta}}^T(\mathbf{X}^T\mathbf{y} - \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\theta}})$
$\Longleftarrow\; \mathbf{0} = \mathbf{X}^T\mathbf{y} - \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\theta}}$   (the bi-directional inference $\Leftrightarrow$ is not certain here)
$\Leftrightarrow\; \hat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{X}^+\mathbf{y}$

The pseudo-inverse matrix can be derived from the orthogonality condition.

MLE $\hat{\boldsymbol{\theta}}$ is the value such that the corresponding prediction $\hat{\mathbf{f}} = \mathbf{X}\hat{\boldsymbol{\theta}}$ is orthogonal to the
corresponding residual $\mathbf{r} = \mathbf{y} - \hat{\mathbf{f}}$. Also, $\mathbf{r}$ is the shortest possible residual vector:
$\|\mathbf{r}\| = \min_{\boldsymbol{\theta}}\|\boldsymbol{\delta}\|$
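A quick numerical check of the orthogonality property (a hypothetical random design):

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])   # n = 50, m = 2 features
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0, 0.3, 50)

theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
f_hat = X @ theta_hat
r = y - f_hat
print(f_hat @ r)     # ~ 0: residual orthogonal to the prediction
print(X.T @ r)       # ~ 0: residual orthogonal to every column of X
```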

Regression, depending on the feature set, I

Time series: strain data collected by a sensor.
We can "explain" them using multiple inputs.

Features: $X_0$: constant, $X_1$: linear time trend, $X_2$: temperature, $X_3$: load.

We assume measures are generated as follows:
$f(\mathbf{X}, \boldsymbol{\theta}) = X_0\theta_0 + X_1\theta_1 + \dots + X_m\theta_m$,  $Y = f + \varepsilon$.
We identify parameters $\hat{\boldsymbol{\theta}}, \hat{\sigma}_\varepsilon^2$ from data analysis.
The outcome depends on the set $S$ of features included.

[Figure: strain time series over $t \in [0, 15]$, with the candidate inputs plotted below.]

Regression, depending on the feature set, II

(Same strain time series and candidate features as in part I.)

With only $X_0$ (the constant), the data are modeled as normally distributed, independent of
time: the fluctuations are interpreted as high noise.

$\hat{\sigma}_\varepsilon = 5.99$

            95% CI_n       true
 theta_0    5.8    7.7     -2.5

[Figure: constant fit with its confidence band over the time series.]

Regression, depending on the feature set, III

(Same strain time series and candidate features as in part I.)

With $X_0$ and $X_1$, the regression function is a straight line.

$\hat{\sigma}_\varepsilon = 2.34$

            95% CI_n       true
 theta_0   -3.4   17.6     -2.5
 theta_1   -2.0   20.2      4.0

$\rho(\hat{\theta}_0, \hat{\theta}_1) = -86\%$.

[Figure: straight-line fit with its confidence band over the time series.]

Regression, depending on the feature set, IV

(Same strain time series and candidate features as in part I.)

With $X_0$, $X_1$ and $X_2$, temperature explains part of the fluctuation.

$\hat{\sigma}_\varepsilon = 2.25$

            95% CI_n       true
 theta_0   -3.5   -2.0     -2.5
 theta_1   17.8   20.3      4.0
 theta_2   0.46    1.5      1.1

$\rho(\hat{\theta}_0, \hat{\theta}_1) = -86\%$, $\rho(\hat{\theta}_0, \hat{\theta}_2) = -4.3\%$, $\rho(\hat{\theta}_1, \hat{\theta}_2) = 5.0\%$.

[Figure: fit including temperature, with its confidence band over the time series.]

Regression, depending on the feature set, V

With $X_0$, $X_1$, $X_2$, $X_3$, the model is consistent
with the generative one, so the noise level is low.

The linear time trend $X_1$ is similar to the load $X_3$,
hence the correlation $\rho(\hat{\theta}_1, \hat{\theta}_3)$ is high in magnitude.
This is an example of the co-linearity issue:
it is hard to identify $\hat{\theta}_1$ and $\hat{\theta}_3$ separately
(only their sum can be identified).

The actual regr. funct. $f$ is mostly inside the conf. int. for $f$.

$\hat{\sigma}_\varepsilon = 1.89$

            95% CI_n       true
 theta_0   -2.9   -1.7     -2.5
 theta_1  -0.80    7.3      4.0
 theta_2   0.47    1.3      1.1
 theta_3    9.6   16.0     12.5

$\rho(\hat{\theta}_0, \hat{\theta}_1) = -39\%$, $\rho(\hat{\theta}_0, \hat{\theta}_2) = -5.0\%$, $\rho(\hat{\theta}_0, \hat{\theta}_3) = 17.7\%$,
$\rho(\hat{\theta}_1, \hat{\theta}_2) = 5.0\%$, $\rho(\hat{\theta}_1, \hat{\theta}_3) = -97\%$, $\rho(\hat{\theta}_2, \hat{\theta}_3) = -3.7\%$.

[Figure: full-model fit, with the actual regression function mostly inside the confidence band.]
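A sketch of this feature-selection experiment (synthetic signals stand in for the lecture's sensor data; all names and numbers below are illustrative, not the lecture's):

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.linspace(0, 15, 200)
temp = np.sin(2 * np.pi * t / 5)                        # stand-in temperature signal
load = t / 15 + 0.02 * rng.normal(size=t.size)          # nearly co-linear with the trend
y = -2.5 + 4.0 * (t / 15) + 1.1 * temp + 12.5 * load + rng.normal(0, 1.9, t.size)

cols = {"X0": np.ones_like(t), "X1": t / 15, "X2": temp, "X3": load}
for S in (["X0"], ["X0", "X1"], ["X0", "X1", "X2"], ["X0", "X1", "X2", "X3"]):
    X = np.column_stack([cols[k] for k in S])
    th, *_ = np.linalg.lstsq(X, y, rcond=None)
    s = np.sqrt(np.sum((y - X @ th)**2) / (len(y) - X.shape[1]))
    print(S, f"sigma_hat = {s:.2f}")                    # drops as features are added
```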

Building a base of functions for regression

Multiple regression can integrate the linear effects of
multiple distinct inputs (e.g.: temperature and humidity
for predicting human comfort).

However, new inputs can also be generated by non-linear
transformations of a set of original inputs.
E.g., we can define: $X_1 = X$, $X_2 = X^2$, $X_3 = X^3$, ...
→ polynomial fitting.

Polynomial regression function:
$f(\mathbf{x}, \boldsymbol{\theta}) = \theta_0 + \theta_1 X + \theta_2 X^2 + \dots + \theta_m X^m$
$Y = f + \varepsilon$;  $m$: polynomial degree.

What is important in linear regression is that the relation
between $\boldsymbol{\theta}$ and $\mathbf{f}$ is linear
(NOT that between the original $\mathbf{x}$ and $\mathbf{f}$).

[Figure: example dataset over $x \in [0, 2]$ with polynomial fits of increasing degree.]
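A sketch of polynomial fitting via the same least-squares machinery, using a Vandermonde design matrix (`x`, `y` are hypothetical arrays):

```python
import numpy as np

def poly_fit(x, y, m):
    """Fit f(x) = theta_0 + theta_1 x + ... + theta_m x^m by least squares."""
    X = np.vander(x, m + 1, increasing=True)   # columns 1, x, x^2, ..., x^m
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta_hat
```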

Building a base of functions for regression, II

Dataset generated from a parabolic regression function.

Actual regression function:
$f(x) = \theta_0 + \theta_1 x + \theta_2 x^2$,  $\varepsilon \sim \mathcal{N}(0, 1.5^2)$

            true
 theta_0      10
 theta_1      38
 theta_2     -15

[Figure: the generated dataset over $x \in [0, 2]$ with the true parabola.]

Building a base of functions for regression, III

Dataset generated from a parabolic regression function.

If the assumed polynomial is a constant ($m = 0$),
estimating $\theta_0$ is estimating $\mu_Y$.

Assumed regression function: $f(x) = \theta_0$

            95% CI_n       true
 theta_0   31.1   32.3       10
 theta_1                     38
 theta_2                    -15

[Figure: constant fit over the dataset.]

Building a base of functions for regression, IV

Dataset generated from a parabolic regression function.

If the assumed polynomial is linear ($m = 1$),
the dataset is still underfitted.

Assumed regression function: $f(x) = \theta_0 + \theta_1 x$

            95% CI_n       true
 theta_0   22.7   25.3       10
 theta_1    6.4    9.0       38
 theta_2                    -15

[Figure: straight-line fit over the dataset.]

Building a base of functions for regression, V

Dataset generated from a parabolic regression function.

If the assumed polynomial is parabolic ($m = 2$),
the assumption is correct, and the actual parameters are
in the estimated conf. intervals.

Assumed regression function: $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2$

            95% CI_n       true
 theta_0    7.4   15.3       10
 theta_1   27.2   44.0       38
 theta_2  -18.2   -9.9      -15

[Figure: parabolic fit over the dataset.]

Building a base of functions for regression, VI

Dataset generated from a parabolic regression function.

If the assumed polynomial is cubic ($m = 3$),
the cubic coefficient is uncertain,
and so the prediction is also highly uncertain.

Assumed regression function: $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$

            95% CI_n       true
 theta_0   -9.5   20.0       10
 theta_1    7.6  103.8       38
 theta_2  -84.6   14.6      -15
 theta_3   -9.4   23.3        0

[Figure: cubic fit with a widening confidence band.]

Building a base of functions for regression, VII

Dataset generated from a parabolic regression function.

If the assumed polynomial is of order four ($m = 4$),
the coefficient of power four is also uncertain,
and so the prediction is even more uncertain.

Assumed regression function: $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$

            95% CI_n       true
 theta_0  -32.7   85.7       10
 theta_1 -302.2  223.0       38
 theta_2 -301.3  539.5      -15
 theta_3 -388.1  189.2        0
 theta_4  -45.4   98.5        0

[Figure: quartic fit with a very wide confidence band.]

Building a base of functions for regression, VIII

Dataset generated from a parabolic regression function.

If the assumed polynomial is of order twelve ($m = 12$),
the optimal coefficients cannot be identified, because of
numerical issues related to co-linearity.

Assumed regression function: $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_{12} x^{12}$

[Figure: degree-12 fit oscillating through the data.]
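A sketch of why the $m = 12$ fit breaks down: the condition number of $\mathbf{X}^T\mathbf{X}$ explodes with the degree (the $x$-grid below is hypothetical):

```python
import numpy as np

x = np.linspace(0, 2, 50)                      # hypothetical input grid
for m in (2, 4, 8, 12):
    X = np.vander(x, m + 1, increasing=True)
    cond = np.linalg.cond(X.T @ X)             # conditioning of the normal equations
    print(f"m = {m:2d}: cond(X^T X) = {cond:.1e}")
# the near-singular normal equations make the coefficients unidentifiable
```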

Summary

Multiple regression is computationally simple, as it is related to linear algebra:

linear system: $\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\theta}} = \mathbf{X}^T\mathbf{y}$
$\mathbf{A}\hat{\boldsymbol{\theta}} = \mathbf{b}$ ⇒ computing $\hat{\boldsymbol{\theta}}$ is solving a linear system.

Uncertainty of the estimation and the predictions can also be computed.

If the set of inputs (i.e. of features) can be selected, the selection process must find
a trade-off between under- and over-fitting, i.e. selecting the right complexity.

Remarks:
1. What is the "meaning" of regression parameter $\theta_i$?
It represents how the regression function $f$ changes under the variation $x_i \to x_i + 1$:
$\Delta f = (\dots + \theta_i(x_i + 1) + \dots) - (\dots + \theta_i x_i + \dots) = \theta_i$

2. What null hypothesis are we testing, in standard multiple regression?
$\mathcal{P}_j$ is the p-value for the null hypothesis that $\theta_j$ is zero (with no restriction on the other
parameters). So we have $m+1$ tests.

References and readings

Baron, chapter 11.3
Wasserman, chapter 13.5
Kottegoda & Rosso, chapter 6.2

https://en.wikipedia.org/wiki/Linear_regression

Proof of error formulas for straight-line regression

We derive the formulas for the cov. matrix in the two-parameter estimation of a straight line,
starting from the general formula in vector-matrix notation: $\boldsymbol{\Sigma}_\Theta = \sigma_\varepsilon^2(\mathbf{X}^T\mathbf{X})^{-1}$

Vector-matrix notation:
$\mathbf{y} = \theta_0\mathbf{x} + \theta_1\mathbf{1} = [\mathbf{x}\;\;\mathbf{1}]\,\boldsymbol{\theta} = \mathbf{X}\boldsymbol{\theta}$, with $\mathbf{X} = [\mathbf{x}\;\;\mathbf{1}]$

$\mathbf{X}^T\mathbf{X} = \begin{bmatrix}\mathbf{x}^T\\ \mathbf{1}^T\end{bmatrix}[\mathbf{x}\;\;\mathbf{1}] = \begin{bmatrix}\mathbf{x}^T\mathbf{x} & \mathbf{x}^T\mathbf{1}\\ \mathbf{1}^T\mathbf{x} & \mathbf{1}^T\mathbf{1}\end{bmatrix} = \begin{bmatrix}n\overline{X^2_n} & n\bar{X}_n\\ n\bar{X}_n & n\end{bmatrix} = n\begin{bmatrix}\overline{X^2_n} & \bar{X}_n\\ \bar{X}_n & 1\end{bmatrix}$

Matrix inversion:
$\begin{bmatrix}a & b\\ c & d\end{bmatrix}^{-1} = \frac{1}{ad - bc}\begin{bmatrix}d & -b\\ -c & a\end{bmatrix} \;\Rightarrow\; \begin{bmatrix}\overline{X^2_n} & \bar{X}_n\\ \bar{X}_n & 1\end{bmatrix}^{-1} = \frac{1}{\overline{X^2_n} - \bar{X}_n^2}\begin{bmatrix}1 & -\bar{X}_n\\ -\bar{X}_n & \overline{X^2_n}\end{bmatrix}$

where $\overline{X^2_n} - \bar{X}_n^2 = \hat{V}_{X,n}$.

Hence: $\boldsymbol{\Sigma}_\Theta = \sigma_\varepsilon^2(\mathbf{X}^T\mathbf{X})^{-1} = \dfrac{\sigma_\varepsilon^2}{n\,\hat{V}_{X,n}}\begin{bmatrix}1 & -\bar{X}_n\\ -\bar{X}_n & \overline{X^2_n}\end{bmatrix}$, as reported in a past lecture.
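A numerical check that the closed form matches the general formula (hypothetical data; note the derivation above implies the $1/n$ divisor for $\hat{V}_{X,n}$):

```python
import numpy as np

rng = np.random.default_rng(5)
n, sigma2 = 30, 1.0
x = rng.uniform(0, 10, n)

X = np.column_stack([x, np.ones(n)])            # X = [x 1], theta = [slope, intercept]
general = sigma2 * np.linalg.inv(X.T @ X)

V_X = np.mean(x**2) - x.mean()**2               # 1/n divisor, per the derivation
closed = sigma2 / (n * V_X) * np.array([[1, -x.mean()],
                                        [-x.mean(), np.mean(x**2)]])
print(np.allclose(general, closed))             # True
```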
