MSCI570 - Lecture 8 - Advanced Regression Analysis 2022 Part 2
Example:
Year   Sales    TV ads   Internet ads   Mailings
t-1    57000    0        0              0
t      124000   10000    1500           2500
During the first year we had no marketing budget. In the second year we
started TV ads, Internet ads and mailings all at once. Which one caused the
increase in sales?
[Figure: two scatter plots of Y against X2]
How does the interrelationship between X1 and X2 affect the model?
Problems due to
multicollinearity
• The estimates of each effect (explanatory variable) are very imprecise,
making inference impossible.
• We may end up using the wrong model, including the wrong variables.
Avoiding
multicollinearity
Solutions to multicollinearity:
• Drop some variables from the model, although this may reduce the
predictive power of the model.
Agenda
• Manually/judgementally
• Based on all possible subsets
• Using stepwise (forward/backward) regression
• Using information criteria
• Using validation sets and cross-validation (multiple validation sets)
• etc.
It is not uncommon to be missing important variables from our dataset.
How can we identify these and include them?
Best subsets
selection
https://app.wooclap.com/events/DZMJNF
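As a sketch of how best-subsets selection works (hypothetical data; `aic_ols` is a helper defined here, not a library routine), every subset of the candidate predictors is fitted and the one with the lowest AIC wins:

```python
import itertools
import numpy as np

def aic_ols(X, y):
    """AIC of a Gaussian OLS fit; X already includes the intercept column."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + 2 * k

rng = np.random.default_rng(0)
n = 100
X_full = rng.normal(size=(n, 4))               # four candidate predictors
y = 1.5 * X_full[:, 0] - 2.0 * X_full[:, 2] + rng.normal(size=n)

# Enumerate all 2^4 subsets and keep the one with the lowest AIC.
best_score, best_subset = None, None
for r in range(X_full.shape[1] + 1):
    for subset in itertools.combinations(range(X_full.shape[1]), r):
        X = np.column_stack([np.ones(n)] + [X_full[:, j] for j in subset])
        score = aic_ols(X, y)
        if best_score is None or score < best_score:
            best_score, best_subset = score, subset

print("best subset by AIC:", best_subset)
```

Note the cost: with p candidate variables this fits 2^p models, which is why greedy stepwise shortcuts are often used instead.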
Forward selection
of variables
Forward selection
• starts by including the variable with the greatest explanatory power (most
significant t-value or lowest AIC).
• Next, the variables not yet included in the model are evaluated, and the
one that adds most to the model’s explanatory power is added to the model.
• This step is repeated until no remaining variable improves the model.
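A minimal sketch of this procedure (hypothetical data; `forward_select` and `aic_ols` are names introduced here, not library functions), adding at each step the variable that most improves the AIC and stopping when no candidate helps:

```python
import numpy as np

def aic_ols(X, y):
    """AIC of a Gaussian OLS fit; X already includes the intercept column."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + 2 * k

def forward_select(X_full, y):
    n, p = X_full.shape
    selected, remaining = [], list(range(p))
    current = aic_ols(np.ones((n, 1)), y)          # intercept-only model
    while remaining:
        # Score every not-yet-included variable when added to the model.
        scores = {j: aic_ols(np.column_stack(
                      [np.ones(n)] + [X_full[:, i] for i in selected + [j]]), y)
                  for j in remaining}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= current:
            break                                  # no candidate improves AIC
        current = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected

rng = np.random.default_rng(0)
X_full = rng.normal(size=(100, 4))
y = 1.5 * X_full[:, 0] - 2.0 * X_full[:, 2] + rng.normal(size=100)
chosen = forward_select(X_full, y)
print("forward selection chose:", chosen)
```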
Backward selection
of variables
Backward selection
• starts by including all the variables in the model,
• then deletes the variable with the least explanatory power (least
significant t-value or highest AIC).
• This step is repeated for the variables remaining in the model until no
further deletion improves it.
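A matching sketch for the backward direction (same hypothetical data as above; `backward_select` and `aic_ols` are names introduced here), dropping at each step the variable whose removal most improves the AIC:

```python
import numpy as np

def aic_ols(X, y):
    """AIC of a Gaussian OLS fit; X already includes the intercept column."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + 2 * k

def backward_select(X_full, y):
    n, p = X_full.shape
    cols = list(range(p))

    def fit(subset):
        return aic_ols(np.column_stack(
            [np.ones(n)] + [X_full[:, i] for i in subset]), y)

    current = fit(cols)                       # start from the full model
    while cols:
        # Score the model after deleting each remaining variable in turn.
        scores = {j: fit([i for i in cols if i != j]) for j in cols}
        j_drop = min(scores, key=scores.get)
        if scores[j_drop] >= current:
            break                             # every deletion makes AIC worse
        current = scores[j_drop]
        cols.remove(j_drop)
    return cols

rng = np.random.default_rng(0)
X_full = rng.normal(size=(100, 4))
y = 1.5 * X_full[:, 0] - 2.0 * X_full[:, 2] + rng.normal(size=100)
kept = backward_select(X_full, y)
print("backward selection kept:", kept)
```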
Example for forward and
backward selection
Notes on forward and
backward selection
• If forward and backward selection give different models, then: i) forward
may have found the ‘best’ model, ii) backward may have found the ‘best’
model, iii) neither may have found the ‘best’ model.
[Figure: two time series, Sales and Crime, plotted against Time, both
trending upward; correlation: 0.88]
However, this correlation is not due to explanatory power but due to common
structure (a trend; the same effect would occur with seasonality).
Time series differences
To remove this artefact we can calculate the ‘first differences’ of the time
series, removing the (stochastic) trend:
Δy_t = y_t − y_{t−1}
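A small sketch of the effect (hypothetical series generated with a shared linear trend): the correlation in levels is spurious and largely vanishes after first differencing with `np.diff`:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(40)

# Two otherwise unrelated series that share an upward linear trend.
sales = 50 + 5 * t + rng.normal(scale=5, size=40)
crime = 100 + 3 * t + rng.normal(scale=5, size=40)

# In levels, the common trend produces a large spurious correlation...
corr_levels = np.corrcoef(sales, crime)[0, 1]

# ...which largely disappears after taking first differences y_t - y_{t-1}.
corr_diff = np.corrcoef(np.diff(sales), np.diff(crime))[0, 1]

print(f"correlation in levels: {corr_levels:.2f}, "
      f"after differencing: {corr_diff:.2f}")
```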
[Figure: the differenced series, Netflix sales (×10^6) and US property
crime, plotted against Time; the common trend is gone]
Logarithmic
transformation
Logarithmic transformation is also quite common and useful.
S = b0 · A^b1, which becomes linear after taking logarithms:
log S = log b0 + b1 log A
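For instance, the multiplicative model above can be estimated by ordinary least squares after taking logs. A sketch with hypothetical data (the values of b0 and b1 are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data from S = b0 * A^b1 with multiplicative noise.
b0_true, b1_true = 3.0, 0.7
A = rng.uniform(1, 100, size=200)
S = b0_true * A**b1_true * np.exp(rng.normal(scale=0.1, size=200))

# Taking logs linearises the model: log S = log b0 + b1 * log A.
X = np.column_stack([np.ones_like(A), np.log(A)])
coef, *_ = np.linalg.lstsq(X, np.log(S), rcond=None)
b0_hat, b1_hat = np.exp(coef[0]), coef[1]
print(f"estimated b0 = {b0_hat:.2f}, b1 = {b1_hat:.2f}")
```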
• Model coefficients are based on data available up to the end of the
training set.
• Predictors: use only what was known at the end of the training set (e.g.
future days of the week and holiday seasons are known, but other variables
may need to be forecast).
Ex-post evaluation
What variables to
use for evaluation?
https://app.wooclap.com/events/DZMJNF
Ex-ante evaluation
Implementation
• Keep applying the regression model in the test set, using rolling origins,
one period at a time, using only the data that would be known up to that time.
• Re-estimate the regression model at the end of each new period in the test set.
• Care is needed when there is a group of dummy variables: omit one dummy
variable, but no more.
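The rolling-origin scheme above can be sketched as follows (hypothetical data; `rolling_origin_mae` is a name introduced here, not a library function): at each origin the model is re-estimated on everything known up to that point and a one-step-ahead forecast is scored:

```python
import numpy as np

def rolling_origin_mae(X, y, n_train):
    """One-step-ahead ex-ante evaluation: at each origin in the test set,
    re-estimate OLS on all data up to that point, forecast the next period."""
    errors = []
    for origin in range(n_train, len(y)):
        Xtr = np.column_stack([np.ones(origin), X[:origin]])
        beta, *_ = np.linalg.lstsq(Xtr, y[:origin], rcond=None)
        forecast = np.concatenate([[1.0], X[origin]]) @ beta
        errors.append(abs(y[origin] - forecast))
    return float(np.mean(errors))

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.3, size=60)
print("rolling-origin MAE:", round(rolling_origin_mae(X, y, n_train=40), 3))
```

Here the predictor values at each forecast origin are treated as known; in practice they may themselves need to be forecast, as noted above.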