
Agenda

• Lags and autoregression


• Dummy variables
• Multicollinearity
• Variable selection
• Differences
• Nonlinearity
• Evaluating forecast accuracy
Multicollinearity

Multicollinearity: explanatory variables are themselves interdependent.

Example:
Year   Sales     TV ads   Internet ads   Mailings
t-1    57,000    0        0              0
t      124,000   10,000   1,500          2,500

In the first year we had no marketing budget. In the second year we started TV ads, Internet ads and mailings all at the same time. Which one caused the increase in sales?

In this case it is impossible to identify which explanatory variable caused the
increase in sales, because all the explanatory variables are perfectly correlated.
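A minimal sketch of this point, using the toy numbers from the slide (pandas assumed): with only these two observations every pair of spending columns is perfectly correlated, so OLS cannot attribute the increase in sales to any single channel.

```python
import pandas as pd

# Toy data from the slide: two years of sales and marketing spend
df = pd.DataFrame({
    "Sales":        [57000, 124000],
    "TV ads":       [0, 10000],
    "Internet ads": [0, 1500],
    "Mailings":     [0, 2500],
}, index=["t-1", "t"])

# Pairwise correlations between the explanatory variables are all exactly 1.0,
# i.e. the predictors are perfectly collinear and their separate effects
# on Sales cannot be identified.
print(df[["TV ads", "Internet ads", "Mailings"]].corr())
```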
Relationships between
variables

[Diagram: the simple case, in which X1 and X2 have separate influences on Y, versus the more usual case, in which X2 also influences X1.]

How does the interrelationship between X1 and X2 affect the model?
Problems due to
multicollinearity

Multicollinearity is a problem because:

• The estimated effect of each explanatory variable is very imprecise, making reliable inference impossible.

• We can no longer rely on the estimated coefficients, their t-statistics or their p-values.

• The estimated effects may have the wrong sign.

• We may end up using the wrong model, including the wrong variables.
Avoiding
multicollinearity

Solutions to multicollinearity:

• Combine correlated variables into a single composite variable.

• Drop some variables from the model, although this may reduce its predictive power.

• Use judgement to decide which variables should be included in the model.

• Collect more data → this may break the multicollinearity.
Agenda

• Lags and autoregression


• Dummy variables
• Multicollinearity
• Variable selection
• Differences
• Nonlinearity
• Evaluating forecast accuracy
Variable selection

The selection of the appropriate variables is crucial for regression modelling.


This can be done:

• Manually/judgementally
• Based on all possible subsets
• Using stepwise (forward/backward) regression
• Using information criteria
• Using validation sets and cross-validation (multiple validation sets)
• etc.
It is not uncommon for important variables to be missing from our dataset.
How will we identify and include them?

Best subsets
selection

All possible subsets

• Consider every possible subset, of every possible size.

• For example, if there are 20 possible predictors:

• Consider all subsets with a single predictor (20 of these)

• Consider all subsets with two predictors (190 of these)

• …and so on, up to the single subset containing all 20 predictors

• Only feasible with a small number of predictors (a brute-force sketch follows below).
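A minimal brute-force sketch of best-subsets selection, assuming pandas and statsmodels (the data frame, target name and candidate column names are placeholders): every non-empty subset of the candidates is fitted by OLS and the subset with the lowest AIC is kept. The number of subsets grows as 2^k − 1, which is why this is only practical for small k.

```python
from itertools import combinations
import statsmodels.api as sm

def best_subset(df, target, candidates):
    """Fit OLS on every non-empty subset of `candidates`; return (subset, AIC) with lowest AIC."""
    best = None
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            X = sm.add_constant(df[list(subset)])
            fit = sm.OLS(df[target], X).fit()
            if best is None or fit.aic < best[1]:
                best = (subset, fit.aic)
    return best

# With 20 candidate predictors there are 2**20 - 1 subsets to fit,
# which is why best-subsets selection quickly becomes infeasible.
```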


How many
subsets?

https://app.wooclap.com/events/DZMJNF

Forward selection
of variables

Forward selection

• starts by including the variable with the greatest explanatory power (e.g. the most significant t-value or the largest improvement in AIC).

• Next, the variables not yet included in the model are evaluated, and the one that adds most to the model's explanatory power is added.

• This step is then repeated until no remaining variable improves the model (a sketch follows below).
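A minimal sketch of forward selection by AIC, assuming pandas, numpy and statsmodels (the data frame, target name and candidate column names are placeholders):

```python
import numpy as np
import statsmodels.api as sm

def forward_select(df, target, candidates):
    """Greedily add the candidate that lowers AIC most; stop when no addition helps."""
    def aic_of(cols):
        # Intercept-only model when no predictors have been selected yet
        X = sm.add_constant(df[cols]) if cols else np.ones((len(df), 1))
        return sm.OLS(df[target], X).fit().aic

    selected, remaining = [], list(candidates)
    current_aic = aic_of(selected)
    while remaining:
        best_aic, best_var = min((aic_of(selected + [c]), c) for c in remaining)
        if best_aic >= current_aic:
            break                      # no remaining variable improves the model
        selected.append(best_var)
        remaining.remove(best_var)
        current_aic = best_aic
    return selected
```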

Backward selection
of variables

Backward selection

• starts by including all the variables in the model,

• then deletes the variable with the least explanatory power (e.g. the least significant t-value, or the variable whose removal most improves the AIC),

• and repeats this step for the variables remaining in the model (a sketch follows below).
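A matching sketch of backward elimination by AIC, under the same assumptions as the forward-selection sketch above:

```python
import statsmodels.api as sm

def backward_select(df, target, candidates):
    """Start from the full model; drop the predictor whose removal lowers AIC most."""
    aic_of = lambda cols: sm.OLS(df[target], sm.add_constant(df[cols])).fit().aic
    selected = list(candidates)
    current_aic = aic_of(selected)
    while len(selected) > 1:
        best_aic, worst_var = min(
            (aic_of([c for c in selected if c != drop]), drop) for drop in selected
        )
        if best_aic >= current_aic:
            break                      # every removal makes the model worse
        selected.remove(worst_var)
        current_aic = best_aic
    return selected
```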

Example for forward and
backward selection

Notes on forward and
backward selection

Imperfect even at finding the best fit (training data)

• Neither approach considers all possible subsets.

• If the two approaches select different models, then: i) forward may have found the 'best' model, ii) backward may have found the 'best' model, or iii) neither may have found it.

• The selected models may be compared using AIC or adjusted R-squared.

Imperfect at finding the best forecast (test data)

• Separate evaluation is needed to test forecast accuracy.
Agenda

• Lags and autoregression


• Dummy variables
• Multicollinearity
• Variable selection
• Differences
• Nonlinearity
• Evaluating forecast accuracy
Netflix sales vs. property
crime
[Figure: time series plots of Netflix sales (×10^6) and US property crime, each against time. The two series are strongly correlated (correlation: 0.88).]

Should we forecast Netflix sales using the number of property crimes in the US as a predictor? (The variables are strongly correlated.)

However, the correlation is not due to explanatory power but to common structure (here a trend; the same effect would occur with seasonality).
Time series differences

To remove this artefact we can calculate the 'first differences' of the time series, removing the (stochastic) trend:

y'_t = y_t − y_{t−1}
[Figure: the first-differenced Netflix sales and US property crime series, each plotted against time.]

The differenced series no longer exhibit any correlation. Therefore, property crime is not a causal driver of Netflix sales (a sketch of this check follows below).
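A minimal sketch of this check with pandas (the two series are placeholders for the Netflix sales and property-crime data): compare the correlation of the raw series with the correlation of their first differences.

```python
import pandas as pd

def correlation_before_after_differencing(sales: pd.Series, crime: pd.Series):
    """Correlation of the raw series vs. correlation of their first differences."""
    raw_corr = sales.corr(crime)
    diff_corr = sales.diff().corr(crime.diff())   # .diff() computes y_t - y_{t-1}
    return raw_corr, diff_corr

# A spurious, trend-driven correlation is typically high for the raw series
# but close to zero once both series have been differenced.
```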
Agenda

• Lags and autoregression


• Dummy variables
• Multicollinearity
• Variable selection
• Differences
• Nonlinearity
• Evaluating forecast accuracy
Nonlinear models

Some variable relationships are not linear.

We can model this using polynomials:

Sales_t = b_0 + b_1 Adv_t + b_2 Adv_t^2 + …
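A minimal sketch of fitting the quadratic term, assuming numpy and statsmodels with hypothetical `adv` (advertising) and `sales` arrays:

```python
import numpy as np
import statsmodels.api as sm

def fit_quadratic(adv: np.ndarray, sales: np.ndarray):
    """OLS fit of Sales_t = b0 + b1*Adv_t + b2*Adv_t^2."""
    X = sm.add_constant(np.column_stack([adv, adv ** 2]))
    return sm.OLS(sales, X).fit()

# model = fit_quadratic(adv, sales)
# print(model.params)   # b0, b1, b2
```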
Model 1 (linear) vs.
Model 2 (quadratic)

Logarithmic
transformation
Logarithmic transformation is also quite common and useful.

S = b_0 A^{b_1}

log S = log(b_0) + b_1 log(A)

Consider the shape of the nonlinearity and the reliability of each approach.
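A minimal sketch of estimating the log-log form by OLS (numpy and statsmodels assumed; `A` and `S` are hypothetical positive-valued arrays): b1 is the slope of log S on log A, and b0 is recovered by exponentiating the intercept.

```python
import numpy as np
import statsmodels.api as sm

def fit_log_log(A: np.ndarray, S: np.ndarray):
    """OLS fit of log(S) = log(b0) + b1*log(A); returns (b0, b1)."""
    fit = sm.OLS(np.log(S), sm.add_constant(np.log(A))).fit()
    intercept, b1 = fit.params
    return np.exp(intercept), b1
```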


Agenda

• Lags and autoregression


• Dummy variables
• Multicollinearity
• Variable selection
• Differences
• Nonlinearity
• Evaluating forecast accuracy
Ex-ante evaluation

• Model coefficients are estimated using only data available up to the end of the training set.

• Predictors: use only what was known at the end of the training set (e.g. future days of the week and holiday periods are known in advance, but other predictor variables may need to be forecast).

• This gives genuine forecast errors.

• Therefore, we have some indication of how accurate the model will be in the future.
Ex-post evaluation

• Uses later information on the predictors (but not on the target variable to be predicted).

• i.e. it uses information that would not have been known at the end of the training set.

• This does NOT give genuine forecast errors.

• However, comparing ex-post with ex-ante forecast accuracy can show whether forecast errors have arisen from a poor prediction model or from poor forecasts of the predictors.
What variables to
use for evaluation?

https://app.wooclap.com/events/DZMJNF

Ex-ante evaluation
Implementation

Regression not re-estimated in the test set

• Identify the regression model in the training set and estimate its parameters.

• Keep applying the regression model in the test set, using rolling origins, one period at a time, using only the data that would have been known up to that time.

Regression re-estimated in the test set

• Re-estimate the regression model at the end of each new period in the test set.

• Then apply it as above.
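A minimal sketch of the rolling-origin scheme, assuming pandas and statsmodels, a hypothetical DataFrame `df` ordered in time, a target column `y` and predictor columns `xcols`; the `reestimate` flag switches between the two variants above.

```python
import pandas as pd
import statsmodels.api as sm

def rolling_origin_errors(df, y, xcols, n_test, reestimate=False):
    """One-step-ahead errors over the last n_test periods of df (rows in time order)."""
    errors, n, fit = [], len(df), None
    for origin in range(n - n_test, n):
        train = df.iloc[:origin]                       # only data known at the forecast origin
        if fit is None or reestimate:                  # re-fit each period only if requested
            fit = sm.OLS(train[y], sm.add_constant(train[xcols])).fit()
        x_new = df[xcols].iloc[origin]                 # in true ex-ante use these values may themselves be forecasts
        forecast = fit.params['const'] + fit.params[xcols] @ x_new
        errors.append(df[y].iloc[origin] - forecast)
    return pd.Series(errors, index=df.index[n - n_test:])
```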


Summary

• Initial analysis using graphical plots, adjusted R-squared, AIC and the F-test.

• Best-subsets selection or (for a larger number of variables) forward and backward selection procedures to identify candidate models (using t-tests).

• Care is needed when there is a group of dummy variables: omit one dummy variable, but no more.

• Check the favoured models for multicollinearity and check the regression model assumptions.

• Compare the best models on ex-ante forecast accuracy.
