
Agenda

• Lags and autoregression


• Dummy variables
• Multicollinearity
• Variable selection
• Differences
• Nonlinearity
• Evaluating forecast accuracy
Multicollinearity

Multicollinearity: explanatory variables are themselves interdependent.

Example:
Year   Sales     TV ads   Internet ads   Mailings
t-1    57,000    0        0              0
t      124,000   10,000   1,500          2,500

In the first year we had no marketing budget. In the second year we started TV ads, Internet ads and mailings all at the same time. Which one caused the increase in sales?

In this case it is impossible to identify which explanatory variable caused the
increase in sales, because all the explanatory variables are perfectly correlated.
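A minimal sketch of this point, using the toy numbers from the slide (pandas assumed): with only these two observations every pair of spending columns is perfectly correlated, so OLS cannot attribute the increase in sales to any single channel.

```python
import pandas as pd

# Toy data from the slide: two years of sales and marketing spend
df = pd.DataFrame({
    "Sales":        [57000, 124000],
    "TV ads":       [0, 10000],
    "Internet ads": [0, 1500],
    "Mailings":     [0, 2500],
}, index=["t-1", "t"])

# Pairwise correlations between the explanatory variables are all exactly 1.0,
# i.e. the predictors are perfectly collinear and their separate effects
# on Sales cannot be identified.
print(df[["TV ads", "Internet ads", "Mailings"]].corr())
```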
Relationships between
variables

[Diagram: the simple case, in which X1 and X2 have separate influences on Y, versus the more usual case, in which X2 also influences X1.]

How does the interrelationship between X1 and X2 affect the model?
Problems due to
multicollinearity

Multicollinearity is a problem because:

• The estimated effect of each explanatory variable is very imprecise, making reliable inference impossible.

• We can no longer rely on the estimated coefficients, their t-statistics or their p-values.

• The estimated effects may have the wrong sign.

• We may end up using the wrong model, including the wrong variables.
Avoiding
multicollinearity

Solutions to multicollinearity:

• Combine correlated variables into a single composite variable.

• Drop some variables from the model, although this may reduce its predictive power.

• Use judgement to decide which variables should be included in the model.

• Collect more data → this may break the multicollinearity.
Agenda

• Lags and autoregression


• Dummy variables
• Multicollinearity
• Variable selection
• Differences
• Nonlinearity
• Evaluating forecast accuracy
Variable selection

The selection of the appropriate variables is crucial for regression modelling.


This can be done:

• Manually/judgementally
• Based on all possible subsets
• Using stepwise (forward/backward) regression
• Using information criteria
• Using validation sets and cross-validation (multiple validation sets)
• etc.
It is not uncommon for important variables to be missing from our dataset.
How will we identify and include them?

Best subsets
selection

All possible subsets

• Consider every possible subset, of every possible size.

• For example, if there are 20 possible predictors:

• Consider all subsets with a single predictor (20 of these)

• Consider all subsets with two predictors (190 of these)

• …and so on, up to the single subset containing all 20 predictors

• Only feasible with a small number of predictors (a brute-force sketch follows below).
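A minimal brute-force sketch of best-subsets selection, assuming pandas and statsmodels (the data frame, target name and candidate column names are placeholders): every non-empty subset of the candidates is fitted by OLS and the subset with the lowest AIC is kept. The number of subsets grows as 2^k − 1, which is why this is only practical for small k.

```python
from itertools import combinations
import statsmodels.api as sm

def best_subset(df, target, candidates):
    """Fit OLS on every non-empty subset of `candidates`; return (subset, AIC) with lowest AIC."""
    best = None
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            X = sm.add_constant(df[list(subset)])
            fit = sm.OLS(df[target], X).fit()
            if best is None or fit.aic < best[1]:
                best = (subset, fit.aic)
    return best

# With 20 candidate predictors there are 2**20 - 1 subsets to fit,
# which is why best-subsets selection quickly becomes infeasible.
```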


How many
subsets?

https://app.wooclap.com/events/DZMJNF

Forward selection
of variables

Forward selection

• starts by including the variable with the greatest explanatory power (e.g. the most significant t-value or the largest improvement in AIC).

• Next, the variables not yet included in the model are evaluated, and the one that adds most to the model's explanatory power is added.

• This step is then repeated until no remaining variable improves the model (a sketch follows below).
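A minimal sketch of forward selection by AIC, assuming pandas, numpy and statsmodels (the data frame, target name and candidate column names are placeholders):

```python
import numpy as np
import statsmodels.api as sm

def forward_select(df, target, candidates):
    """Greedily add the candidate that lowers AIC most; stop when no addition helps."""
    def aic_of(cols):
        # Intercept-only model when no predictors have been selected yet
        X = sm.add_constant(df[cols]) if cols else np.ones((len(df), 1))
        return sm.OLS(df[target], X).fit().aic

    selected, remaining = [], list(candidates)
    current_aic = aic_of(selected)
    while remaining:
        best_aic, best_var = min((aic_of(selected + [c]), c) for c in remaining)
        if best_aic >= current_aic:
            break                      # no remaining variable improves the model
        selected.append(best_var)
        remaining.remove(best_var)
        current_aic = best_aic
    return selected
```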

Backward selection
of variables

Backward selection

• starts by including all the variables in the model,

• then deletes the variable with the least explanatory power (e.g. the least significant t-value, or the variable whose removal most improves the AIC),

• and repeats this step for the variables remaining in the model (a sketch follows below).
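A matching sketch of backward elimination by AIC, under the same assumptions as the forward-selection sketch above:

```python
import statsmodels.api as sm

def backward_select(df, target, candidates):
    """Start from the full model; drop the predictor whose removal lowers AIC most."""
    aic_of = lambda cols: sm.OLS(df[target], sm.add_constant(df[cols])).fit().aic
    selected = list(candidates)
    current_aic = aic_of(selected)
    while len(selected) > 1:
        best_aic, worst_var = min(
            (aic_of([c for c in selected if c != drop]), drop) for drop in selected
        )
        if best_aic >= current_aic:
            break                      # every removal makes the model worse
        selected.remove(worst_var)
        current_aic = best_aic
    return selected
```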

Example for forward and
backward selection

Notes on forward and
backward selection

Imperfect even at finding the best fit (training data)

• Neither approach considers all possible subsets.

• If the two approaches select different models, then: i) forward may have found the 'best' model, ii) backward may have found the 'best' model, or iii) neither may have found it.

• The selected models may be compared using AIC or adjusted R-squared.

Imperfect at finding the best forecast (test data)

• Separate evaluation is needed to test forecast accuracy.
Agenda

• Lags and autoregression


• Dummy variables
• Multicollinearity
• Variable selection
• Differences
• Nonlinearity
• Evaluating forecast accuracy
Netflix sales vs. property
crime
[Figure: time series plots of Netflix sales (×10^6) and US property crime, each against time. The two series are strongly correlated (correlation: 0.88).]

Should we forecast Netflix sales using the number of property crimes in the US as a predictor? (The variables are strongly correlated.)

However, the correlation is not due to explanatory power but to common structure (here a trend; the same effect would occur with seasonality).
Time series differences

To remove this artefact we can calculate the 'first differences' of the time series, removing the (stochastic) trend:

y'_t = y_t − y_{t−1}
[Figure: the first-differenced Netflix sales and US property crime series, each plotted against time.]

The differenced series no longer exhibit any correlation. Therefore, property crime is not a causal driver of Netflix sales (a sketch of this check follows below).
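A minimal sketch of this check with pandas (the two series are placeholders for the Netflix sales and property-crime data): compare the correlation of the raw series with the correlation of their first differences.

```python
import pandas as pd

def correlation_before_after_differencing(sales: pd.Series, crime: pd.Series):
    """Correlation of the raw series vs. correlation of their first differences."""
    raw_corr = sales.corr(crime)
    diff_corr = sales.diff().corr(crime.diff())   # .diff() computes y_t - y_{t-1}
    return raw_corr, diff_corr

# A spurious, trend-driven correlation is typically high for the raw series
# but close to zero once both series have been differenced.
```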
Agenda

• Lags and autoregression


• Dummy variables
• Multicollinearity
• Variable selection
• Differences
• Nonlinearity
• Evaluating forecast accuracy
Nonlinear models

Some variable relationships are not linear.

We can model this using polynomials:

Sales_t = b_0 + b_1 Adv_t + b_2 Adv_t^2 + …
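A minimal sketch of fitting the quadratic term, assuming numpy and statsmodels with hypothetical `adv` (advertising) and `sales` arrays:

```python
import numpy as np
import statsmodels.api as sm

def fit_quadratic(adv: np.ndarray, sales: np.ndarray):
    """OLS fit of Sales_t = b0 + b1*Adv_t + b2*Adv_t^2."""
    X = sm.add_constant(np.column_stack([adv, adv ** 2]))
    return sm.OLS(sales, X).fit()

# model = fit_quadratic(adv, sales)
# print(model.params)   # b0, b1, b2
```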
Model 1 (linear) vs.
Model 2 (quadratic)

Logarithmic
transformation
Logarithmic transformation is also quite common and useful.

S = b_0 A^{b_1}

log S = log(b_0) + b_1 log(A)

Consider the shape of the nonlinearity and the reliability of each approach.
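A minimal sketch of estimating the log-log form by OLS (numpy and statsmodels assumed; `A` and `S` are hypothetical positive-valued arrays): b1 is the slope of log S on log A, and b0 is recovered by exponentiating the intercept.

```python
import numpy as np
import statsmodels.api as sm

def fit_log_log(A: np.ndarray, S: np.ndarray):
    """OLS fit of log(S) = log(b0) + b1*log(A); returns (b0, b1)."""
    fit = sm.OLS(np.log(S), sm.add_constant(np.log(A))).fit()
    intercept, b1 = fit.params
    return np.exp(intercept), b1
```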


Agenda

• Lags and autoregression


• Dummy variables
• Multicollinearity
• Variable selection
• Differences
• Nonlinearity
• Evaluating forecast accuracy
Ex-ante evaluation

• Model coefficients are estimated using only data available up to the end of the training set.

• Predictors: use only what was known at the end of the training set (e.g. future days of the week and holiday periods are known in advance, but other predictor variables may need to be forecast).

• This gives genuine forecast errors.

• Therefore, we have some indication of how accurate the model will be in the future.
Ex-post evaluation

• Uses later information on the predictors (but not on the target variable to be predicted).

• i.e. it uses information that would not have been known at the end of the training set.

• This does NOT give genuine forecast errors.

• However, comparing ex-post with ex-ante forecast accuracy can show whether forecast errors have arisen from a poor prediction model or from poor forecasts of the predictors.
What variables to
use for evaluation?

https://app.wooclap.com/events/DZMJNF

Ex-ante evaluation
Implementation

Regression not re-estimated in the test set

• Identify the regression model in the training set and estimate its parameters.

• Keep applying the regression model in the test set, using rolling origins, one period at a time, using only the data that would have been known up to that time.

Regression re-estimated in the test set

• Re-estimate the regression model at the end of each new period in the test set.

• Then apply it as above.
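A minimal sketch of the rolling-origin scheme, assuming pandas and statsmodels, a hypothetical DataFrame `df` ordered in time, a target column `y` and predictor columns `xcols`; the `reestimate` flag switches between the two variants above.

```python
import pandas as pd
import statsmodels.api as sm

def rolling_origin_errors(df, y, xcols, n_test, reestimate=False):
    """One-step-ahead errors over the last n_test periods of df (rows in time order)."""
    errors, n, fit = [], len(df), None
    for origin in range(n - n_test, n):
        train = df.iloc[:origin]                       # only data known at the forecast origin
        if fit is None or reestimate:                  # re-fit each period only if requested
            fit = sm.OLS(train[y], sm.add_constant(train[xcols])).fit()
        x_new = df[xcols].iloc[origin]                 # in true ex-ante use these values may themselves be forecasts
        forecast = fit.params['const'] + fit.params[xcols] @ x_new
        errors.append(df[y].iloc[origin] - forecast)
    return pd.Series(errors, index=df.index[n - n_test:])
```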


Summary

• Initial analysis using graphical plots, adjusted R-squared, AIC and the F-test.

• Best-subsets selection or (for a larger number of variables) forward and backward selection procedures to identify candidate models (using t-tests).

• Care is needed when there is a group of dummy variables: omit one dummy variable, but no more.

• Check the favoured models for multicollinearity and check the regression model assumptions.

• Compare the best models on ex-ante forecast accuracy.
