
Part IV

Time series

James A. Duffy

Quantitative Economics, TT24

IV.1
The road behind: cross-sectional data

• Basic model: random sampling from a (‘large’) population


• generates i.i.d. data
• X1 has the same distribution as X2 , X3 , . . .
• X1 is independent of X2 , X3 , . . .
• i subscript on Xi merely distinguishes observations
• Ordering not important

IV.2
The road ahead: time series data

• Why consider time series data?


• fundamental to empirical macroeconomics and finance
• we observe one history of the UK economy: a single time series of
GDP, inflation, interest rates, asset prices, etc.
• don’t observe a random sample of economies identical to the UK
• We’d like to:
• estimate macroeconomic relationships / models (hard!)
• empirically test certain implications of macro theory (slightly less hard)
• forecast macro variables: conceptually easiest, so we’ll begin here . . .

IV.3
What is a time series?

• Collection of random variables Y1 , Y2 , . . . , YT


• formed from observations on the same entity at dates t = 1, . . . , T
• denoted {Yt}, t = 1, . . . , T, or just {Yt}
• Assume consecutive, evenly-spaced observations
• macro data typically observed at monthly, quarterly, or annual
frequencies
• Time plots (Yt against t):
• of time series make sense, and convey useful information
• of cross-sectional data (Yi against i) do not

IV.4
Time series: examples

UK CPI inflation: annualised % change from previous quarter


[Time plot, 1960–2020; y-axis: per cent per annum]

IV.5
Time series: examples
USD/GBP exchange rate; yield on 10-year UK government bonds
[Two time plots, 1960–2020; y-axes: USD per GBP; per cent]

USD/GBP exchange rate (1900–2020); yield on UK government consols (1750–2020)
[Two time plots; y-axes: USD per GBP; per cent]

IV.5
Time series: examples
UK unemployment rate; UK quarterly real GDP
[Two time plots, 1960–2020; y-axes: per cent of labour force; billions of GBP, in 2015 prices]

UK real GDP growth rate (to 2019Q4); UK real GDP growth rate (to 2020Q4)
[Two time plots, 1960–2020; y-axes: per cent per annum]

IV.5
What makes time series data ‘different’?

• Ordering matters:
• ‘arrow of time’: causation runs from the past to the present
• Yt may depend on Yt−1 , Yt−2 , . . . ; but Yt−1 does not depend on Yt
• [terminology: Yt−k is the ‘kth lag’ of Yt ]
• Dependence between observations:
• Yt is almost always correlated with Yt−1 , and with Yt−2 , etc.
• graphically evident in series with ‘smooth’ time plots
• Non-identical distributions:
• Yt may have different mean, variance, etc. from Ys
• trends and breaks; changes in behaviour over time
• Consequences?
• many potential pitfalls when applying statistical methods for time series
• just diving in and running regressions – can lead to seriously misleading
inferences

IV.6
Forecasting

• We observe {Yt}, t = 1, . . . , T (and perhaps {Xt} too)
• final observation(s) in period T
• want to predict – to forecast – YT +1 , YT +2 , etc., using that data
• Conceptually straightforward:
• only needs a good ‘descriptive’ model
• of how Yt can be ‘best predicted’ using Yt−1 , Yt−2 , . . . (and perhaps
Xt−1 , Xt−2 , . . .)
• a natural application of regression analysis

IV.7
Roadmap

• Descriptive time series modelling (Sec. 13)


• ‘stability’ of the (economic) processes generating {Yt }: (weak)
stationarity
• AR(p) models: fundamental class of descriptive models
• Forecasting with stationary data (Sec. 14)
• evaluating / choosing between competing models
• incorporating multiple predictors
• Handling nonstationary data (Sec. 15)
• structural breaks: changes in the data generating process
• ‘random wandering’ behaviour and unit roots
• Cointegration (Sec. 16)
• pitfalls of regressions with nonstationary data
• what can we learn from nonstationary series that ‘move together’?

IV.8
Section 13

Descriptive time series modelling

IV.9
Subsection 13.i

‘Stability’ and stationarity

IV.10
‘Stable’ time series

• When plotted, only some series look ‘stable’:


• they tend to revert back to a fixed mean
• have a fixed variability around that mean
• Many time series are not ‘stable’: e.g. due to
• ‘random wandering’: no fixed mean
• trending: in the mean or variance
• apparently stable only for certain epochs: interrupted by ‘breaks’ in
mean or variance

IV.11
‘Stable’ time series

• When plotted, only some series look ‘stable’:


• they tend to revert back to a fixed mean
• have a fixed variability around that mean

UK real GDP growth rate (to 2019Q4); UK unemployment rate: change from previous quarter
[Two time plots, 1960–2020; y-axes: per cent per annum; per cent of labour force]

IV.11
‘Stable’ time series

• Many time series are not ‘stable’: e.g. due to


• wandering ‘randomly’: no fixed mean
• trending: in the mean or variance
• apparently stable only for certain epochs: interrupted by ‘breaks’ in
mean or variance

USD/GBP exchange rate; UK quarterly real GDP
[Two time plots, 1960–2020; y-axes: USD per GBP; billions of GBP, in 2015 prices]

IV.11
Time series: examples

UK CPI inflation: annualised % change from previous quarter


[Time plot, 1960–2020; y-axis: per cent per annum]

IV.11
Why do we want ‘stable’ time series?

• Usual descriptive statistics make sense:


• sample means / variances computed over different epochs, still
estimate the same population quantity
• what could it mean to compute the ‘mean’ of a trending series?
• Modelling and forecasting (much) more straightforward
• ‘stable’ series: described by a model with ‘stable’ (i.e. constant)
parameters
• more plausible that model applies beyond the end of our time series:
has a better claim to ‘external validity’
• i.e. forecasting with the model is possible!

IV.12
Stationarity: heuristics

• How to express ‘stability’ precisely? ‘Stationarity’ . . .


1. Constant mean and variance
• each Yt is a r.v. with mean EYt and variance var(Yt )
• require: EYt = EYs and var(Yt ) = var(Ys ), for all s, t
• Is that enough?
• Yt is serially correlated (or autocorrelated): corr(Yt, Ys) ≠ 0, for s ≠ t
• [terminology: ‘auto’-correlated = correlated with own lags, with ‘itself’]
• these serial correlations also need to be ‘stable’

IV.13
Stationarity: heuristics

2. Constant (i.e. time-invariant) autocovariances: cov(Yt , Yt−h ) =: γh


• allowed to depend on h
• but not allowed to depend on t
• Implications:
• cov(Y1 , Y2 ) = cov(Y2 , Y3 ) = cov(Y3 , Y4 ) = γ1 , etc.
• in general, cov(Y1, Y2) ≠ cov(Y1, Y3)
• hth serial correlation (or autocorrelation) only depends on h:

ρh := corr(Yt, Yt−h) = cov(Yt, Yt−h) / [var(Yt) var(Yt−h)]^{1/2} = cov(Yt, Yt−h) / var(Yt)

using assumed constancy of the variance

IV.14
Subsection 13.ii

Stationarity: weak and strict

IV.15
Weak stationarity: definition

{Yt } is weakly stationary if its


• mean: µ := EYt ;
• variance: σ²Y := var(Yt); and
• autocovariances: γh := cov(Yt , Yt−h ), h ∈ Z
do not depend on t (they are time invariant)

• {Xt } and {Yt } are jointly weakly stationary if


• the above holds for each series; and
• cov(Yt , Xt−h ) does not depend on t, for all h ∈ Z
• Formalises ‘stability’ of the process generating {Yt }

IV.16
Weak stationarity: consequences
• Descriptive statistics have something fixed to estimate!
• sample means and variances consistent for µ, σ²Y
• sample autocovariance:

γ̂h := ĉov(Yt, Yt−h) := (1/T) Σ_{t=h+1}^{T} (Yt − ȲT)(Yt−h − ȲT) →p cov(Yt, Yt−h) = γh

• sample autocorrelation:

ρ̂h := ĉorr(Yt, Yt−h) := ĉov(Yt, Yt−h) / v̂ar(Yt) →p cov(Yt, Yt−h) / var(Yt) = ρh

• consistency follows from LLNs for stationary, ‘weakly dependent’ processes
• Note the slight peculiarities in the definitions:
• γ̂h divides the sum by T, rather than T − h or T − h − 1
• ρ̂h divides by v̂ar(Yt), because var(Yt) = var(Yt−h)
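• A minimal R sketch of these estimators (the AR(1) series here is a simulated stand-in, not course data), checked against R's built-in acf(), which uses the same divide-by-T convention:

# Sketch: sample autocovariance / autocorrelation as defined above
set.seed(1)
y <- arima.sim(model = list(ar = 0.5), n = 200)   # simulated stationary AR(1)
n <- length(y)                                    # sample size T
gamma_hat <- function(h) sum((y[(h + 1):n] - mean(y)) * (y[1:(n - h)] - mean(y))) / n
rho_hat   <- function(h) gamma_hat(h) / gamma_hat(0)   # gamma_hat(0) = v-hat(Yt)
rho_hat(1)               # close to 0.5 for this series
acf(y, lag.max = 6)      # built-in sample ACF: same divide-by-T convention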

IV.17
Weak stationarity: persistence

• Population autocorrelation function (ACF): h ↦ ρh


• the extent of correlation of Yt with Yt−h , as h varies
• measures the persistence of the time series Yt
• consistently estimated by the sample ACF, h ↦ ρ̂h
• Only necessary to consider h ≥ 0, since

γh = cov(Yt , Yt−h ) = cov(Yt−h , Yt ) = cov(Yt , Yt+h ) = γ−h

• implies ρh = ρ−h
• note that ρ0 = cov(Yt , Yt )/ var(Yt ) = 1

IV.18
Sample ACF: examples
UK unemployment rate: change from previous qtr (1960Q1−2016Q4)

[Bar plot of the sample ACF ρ̂h, h = 0, . . . , 6]

• Red dashed lines: interval that contains ρ̂h with 95% probability,
assuming an i.i.d. (i.e. uncorrelated) series
IV.19
Sample ACF: examples
UK real GDP qtrly growth rate (1982Q1−2019Q4)

[Bar plot of the sample ACF ρ̂h, h = 0, . . . , 6]

• Red dashed lines: interval that contains ρ̂h with 95% probability,
assuming an i.i.d. (i.e. uncorrelated) series
IV.20
Weak stationarity: persistence

• Sample ACF also suggestive about (non-)stationarity


• Stationary processes are (very commonly) only weakly persistent
• may have substantial serial correlation at short lags
• but ρh decays ‘rapidly’ as h → ∞: correlation is ‘short lived’
• reflected in tendency to rapidly revert to their means
• Strong persistence (slowly decaying ρ̂h ):
• often suggestive of non-stationarity
• not a necessary connection, but a very useful heuristic

IV.21
Stationarity and persistence

Unemployment rate; unemployment: quarterly change
[Time plots (1960–2010s) with sample ACFs (h = 0, . . . , 6) below; y-axes: per cent of labour force]
IV.22
Stationarity and persistence

Real GDP: logarithm; real GDP: qtrly growth rate
[Time plots, 1990–2020, with sample ACFs (h = 0, . . . , 6) below; y-axes: log £b GBP, 2015 prices; per cent per annum]
IV.23
Strict stationarity

• Weak stationarity not the only possible formalisation of ‘stability’


• time-invariance of first and second moments (mean, var, cov) only
• what about the other aspects of the distribution of Yt ?

{Yt } is strictly (or strongly) stationary if, for every k ≥ 0,

(Yt , Yt+1 , . . . , Yt+k ) =d (Ys , Ys+1 , . . . , Ys+k )

where ‘=d ’ means ‘has the same (joint) distribution as’

• {Xt } and {Yt } are jointly strictly stationary if, for every k ≥ 0,

(Xt , Yt , . . . , Xt+k , Yt+k ) =d (Xs , Ys , . . . , Xs+k , Ys+k )

IV.24
Stationarity

• Strict stationarity implies weak stationarity: e.g.

(Yt, Yt+1) =d (Ys, Ys+1) ⟹ EYt = EYs and cov(Yt, Yt+1) = cov(Ys, Ys+1)

• converse does not hold


• Both express idea that {Yt } is ‘probabilistically stable’ through time
• Henceforth: ‘stationarity’ (without qualifier) = ‘weak stationarity’
• why? because weak stationarity is easier to check!
• note: what S&W call ‘stationarity’ is in fact ‘strict stationarity’
• ‘nonstationary’ time series = not weakly stationary ( =⇒ not strictly
stationary, either!)
• Note: stationarity is a mathematical ideal, which we only expect to
hold ‘approximately’ in the real world

IV.25
How to tell if a time series is stationary?

• ‘Eyeball econometrics’: plot the data! Does the series have:


• a constant mean and variance?
• a tendency to revert (rapidly) to its mean?
• at least, perhaps only certain epochs?
• Formal analysis?
• plot the sample ACF (ρ̂h ): does it tend to zero rapidly?
• look at means and variances over different sub-samples
• formal, model-based tests for certain kinds of non-stationarity:
discussed in Section 15
• If you have a stationary series, you’re in luck!
• can use statistical methods to describe, model and forecast the series

IV.26
Subsection 13.iii

Nonstationary time series

IV.27
What can we do with nonstationary time series?

• Many economic time series are inherently nonstationary:


• deterministic trends: many aggregates exhibit exponential growth
• ‘stochastic’ trends: randomly wandering with no tendency to revert to
a fixed mean
• ‘breaks’: changes in behaviour induced by underlying structural changes
• ‘Breaks’: partly dealt with by analysing subsamples: e.g.
• £/$ exchange rates: before / after the end of Bretton Woods (or the
ERM, post-92)
• inflation: before / during the ‘inflation targeting’ era, roughly post-1990
• choices partly informed by empirical analysis, partly by economic theory
• Trends: usually dealt with by transforming the data, to obtain a
stationary process

IV.28
Transforming to stationarity

• Deterministic detrending: Yt − δ0 − δ1 t
• where δ0 and δ1 are estimated (by regression)
• Differencing:
• first difference: ∆Yt := Yt − Yt−1
• seasonal difference (quarterly data): ∆4 Yt = Yt − Yt−4
• Logarithms and growth rates:
• difference of logs: ∆ log Yt := log Yt − log Yt−1
• % growth rate: 100 × ∆ log Yt
• annualised % growth rate: 400 × ∆ log Yt , for quarterly data
• Why log differences, and not % changes?
• log differences are ‘symmetric’ in ± changes
• but % changes are not: ↓ 50% followed by ↑ 50% is a 25% decline!
• what I label a ‘% change’ in figures is always a difference of logs [as per
the usual practice in (macro)economics]
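• A minimal R sketch of these transformations, assuming a quarterly ts object y of (positive) levels; the series here is a simulated placeholder:

# Sketch: common transformations to stationarity for a quarterly ts 'y'
set.seed(2)
y <- ts(cumprod(1 + rnorm(200, 0.005, 0.01)), start = c(1970, 1), frequency = 4)
dy    <- diff(y)                     # first difference
d4y   <- diff(y, lag = 4)            # seasonal difference (quarterly data)
g     <- 100 * diff(log(y))          # % growth rate (difference of logs)
g_ann <- 400 * diff(log(y))          # annualised % growth rate
detr  <- residuals(lm(y ~ time(y)))  # deterministic detrending: Yt - d0hat - d1hat*t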

IV.28
Transforming to stationarity: US real GDP

US annual real GDP; log of US annual real GDP (with fitted trend δ̂0 + δ̂1 t)
[Two time plots, 1940–2020, with sample ACFs (h = 0, . . . , 12) below]
IV.29
Transforming to stationarity: US real GDP

Detrended log US annual real GDP; change in log US annual real GDP
[Two time plots, 1940–2020, with sample ACFs (h = 0, . . . , 12) below]
IV.30
Transforming to stationarity

• The ‘folklore’ in empirical macroeconomics: trending and randomly


wandering series can be differenced to stationarity
• You might have to:
• take logarithms before differencing: particularly for series where
exponential growth / decay makes sense
• difference more than once
• But be careful here:
• ‘overdifferencing’ can also cause problems: so don’t go overboard!
• later (Sec. 15): testing for ‘how much differencing you should do’

IV.31
Differencing twice to get to stationarity?

Log US CPI (1982–84 = 100); US inflation: annual change in CPI
[Two time plots, 1950–2020, with sample ACFs (h = 0, . . . , 12) below]
IV.32
Differencing twice to get to stationarity?

US inflation: annual change in CPI; change in US inflation
[Two time plots, 1950–2020, with sample ACFs (h = 0, . . . , 12) below]
IV.33
Subsection 13.iv

Autoregressive modelling: AR(1) models

IV.34
Modelling stationary time series

• We’ve seen: macro time series that are plausibly stationary have
• (roughly) constant means and variances
• sample ACFs that decay ‘rapidly’ in h
• Next step: develop simple descriptive models that can
• match the pattern of serial correlation we see in these series
• ‘repackage’ this in a way useful for prediction – for forecasting!
• Autoregressive models of order p: AR(p)
• essential building block for more elaborate models
• [terminology: ‘auto’-regressive: regressed on own lags]

IV.35
AR(1) model

• Suppose Yt evolves according to

Yt = β0 + β1 Yt−1 + ut

• for t ∈ {1, 2, . . .}: we observe up to t = T


• Y0 , the initial value, is also a random variable
• {ut } is the driving innovation or shock sequence. Assumed:
• weakly (or strictly) stationary, with Eut = 0 and σ²u := Eu²t
• unforecastable in the sense that

0 = E[ut | Yt−1 , Yt−2 , . . .] =: E[ut | Yt−1 ]

where Yt := (Yt , Yt−1 , . . . , Y1 , Y0 )


• implies {ut} is serially uncorrelated: cov(ut, ut−h) = 0 for h ≥ 1
• sometimes assume {ut } is i.i.d. (with mean zero)

IV.36
AR(1) model

Yt = β0 + β1 Yt−1 + ut E[ut | Yt−1 ] = 0

• Intended as a descriptive model, not a causal one:


• implies the best (nonlinear) predictor of Yt is

E[Yt | Yt−1 ] = E[β0 + β1 Yt−1 + ut | Yt−1 ] = β0 + β1 Yt−1

• (β0 , β1 ) consistently estimable by OLS regression of Yt on Yt−1


• β1 is not the ‘causal effect’ of Yt−1 , it is just a population linear
regression coefficient!
• Could the model be (descriptively) false? Two key assumptions:
1. cond. exp. of Yt given Yt−1 = (Yt−1 , Yt−2 , . . .) depends only on Yt−1
2. conditional expectation is linear
• AR(p) models with p ≥ 2: relax the first assumption
• we’ll return to linearity later

IV.37
AR(1) model: is it fit for purpose?

1. Does an AR(1) model generate a stationary time series?


2. Can it match key features of real-world stationary time series?
• tendency to revert (rapidly) to its mean
• nonzero but (rapidly) decaying ACF

IV.38
When is an AR(1) process stationary?

Yt = β0 + β1 Yt−1 + ut E[ut | Yt−1 ] = 0

• Stationarity depends:
• crucially on β1 : regulates how persistent {Yt } is
• ‘technically’ on Y0 : how the process is ‘initialised’
• To derive requirements on β1 and Y0 : suppose {Yt } is stationary . . .
1. µ := EYt must satisfy

µ = EYt = E[β0 + β1 Yt−1 + ut] = β0 + β1 EYt−1 = β0 + β1 µ  ⟹  µ = β0 / (1 − β1)

2. σ²Y := var(Yt) must satisfy

σ²Y = var(Yt) = var(β0 + β1 Yt−1 + ut) = β1² var(Yt−1) + σ²u = β1² σ²Y + σ²u  ⟹  σ²Y = σ²u / (1 − β1²)

IV.39
When is an AR(1) process stationary?

Yt = β0 + β1 Yt−1 + ut E[ut | Yt−1 ] = 0


• If {Yt} is stationary:

EYt = µ = β0 / (1 − β1)        var(Yt) = σ²Y = σ²u / (1 − β1²)

• implies |β1| ≥ 1 is inconsistent with stationarity
• ‘solution’ for σ²Y is either ∞ or negative in such cases
• {Yt} is (weakly) stationary if β1 ∈ (−1, 1) and Y0 has

EY0 = β0 / (1 − β1)        var(Y0) = σ²u / (1 − β1²)

• Preceding shows EYt and var(Yt) are time invariant: what about cov(Yt, Yt−h)?
IV.40
When is an AR(1) process stationary?

Yt = β0 + β1 Yt−1 + ut E[ut | Yt−1 ] = 0

• Finally, check the autocovariances:


• for h ≥ 1:

cov(Yt, Yt+h) = cov(Yt, β0 + β1 Yt+h−1 + ut+h) = β1 cov(Yt, Yt+h−1)

• gives a recursion: working back to h = 1 yields

cov(Yt, Yt+h) = β1 cov(Yt, Yt+h−1) = β1² cov(Yt, Yt+h−2) = · · · = β1^h cov(Yt, Yt) = β1^h var(Yt) = β1^h σ²Y

• depends only on h, not t


• So these are time-invariant: hence {Yt } is stationary!

IV.41
Properties of a stationary AR(1) process

Yt = β0 + β1 Yt−1 + ut E[ut | Yt−1 ] = 0 β1 ∈ (−1, 1)

• Population ACF:

ρh := cov(Yt, Yt−h) / var(Yt) = β1^h σ²Y / σ²Y = β1^h

• nonzero if β1 ≠ 0: {Yt} is serially correlated
• decays rapidly (geometrically) in h: but more slowly as β1 → 1
• Mean reversion: using µ = β0 / (1 − β1):

Yt = (1 − β1)µ + β1 Yt−1 + ut

• if ut = 0 and β1 ∈ (0, 1): Yt will lie between µ and Yt−1
• non-stochastic dynamics continually pull Yt back towards µ
• generally rapid mean reversion, but slows down as β1 → 1
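• A quick R check of ρh = β1^h (base R only; with a long simulated sample the sample ACF should sit close to the population ACF):

# Sketch: sample vs population ACF for a stationary AR(1), rho_h = beta1^h
set.seed(42)
beta1 <- 0.75
y <- arima.sim(model = list(ar = beta1), n = 10000)       # long sample
sample_acf <- acf(y, lag.max = 6, plot = FALSE)$acf[-1]   # drop rho_0 = 1
cbind(h = 1:6, sample = round(sample_acf, 3), population = round(beta1^(1:6), 3))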

IV.42
Population ACFs for AR(1) models
[Four bar plots of ρh = β1^h, h = 0, . . . , 6, for β1 = 0.25, 0.5, 0.75, 0.95]
• Plotted as for quarterly data (x-axis = lag length in years)


IV.43
Typical trajectories for AR(1) processes
[Four simulated sample paths, for β1 = 0.25, 0.5, 0.75, 0.95]

• Simulations with T = 200, plotted at quarterly frequency; β0 = 0
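• Trajectories like these are easy to reproduce; a sketch in base R, drawing Y0 from the stationary distribution so the whole path is stationary:

# Sketch: simulate a stationary AR(1) path (T = 200, beta0 = 0, quarterly dates)
set.seed(7)
sim_ar1 <- function(beta1, n = 200, sd_u = 1) {
  y <- numeric(n)
  y[1] <- rnorm(1, 0, sd_u / sqrt(1 - beta1^2))   # Y0 ~ stationary distribution
  for (t in 2:n) y[t] <- beta1 * y[t - 1] + rnorm(1, 0, sd_u)
  ts(y, start = c(1970, 1), frequency = 4)
}
plot(sim_ar1(0.95))   # more persistent, slower mean reversion than beta1 = 0.25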


IV.44
AR(1) processes with β1 < 0
• Less practically relevant to macro time series

[Sample paths and population ACFs (h = 0, . . . , 6) for β1 = −0.75 and β1 = −0.95: the ACF alternates in sign]

IV.45


Subsection 13.v

Forecasting with an AR(1) model

IV.46
Forecasting: the general problem
• Observe Yt for t = 1, . . . , T; want to forecast YT+h for h = 1, 2, . . .
• What is the best possible forecast we could make (in theory) of YT+1, using the available past history {Yt}, t = 1, . . . , T?
• The mean squared forecast error (MSFE) minimising forecast is the m*(YT) = m*(YT, YT−1, . . . , Y1) that solves

min_{m(·)} E[YT+1 − m(YT)]²

• Solution is the conditional expectation:

YT+1|T := E[YT+1 | YT] = E[YT+1 | YT, YT−1, . . .]

• More generally: the optimal h-step-ahead (or horizon h) forecast is

YT+h|T := E[YT+h | YT]

• also termed the MSFE-minimising forecast
IV.47
Forecasting: with an AR(1) model
• To operationalise: need to estimate YT +1|T := E[YT +1 | YT ]. How?
• If we assume {Yt } is an AR(1) process:

Yt = β0 + β1 Yt−1 + ut E[ut | Yt−1 ]= 0

• then YT +1|T has a parametric form:

YT +1|T = E[YT +1 | YT ] = E[β0 + β1 YT + uT +1 | YT ]


= β0 + β1 E[YT | YT ] + E[uT +1 | YT ]
= β0 + β1 YT

• all we need to do is estimate (β0 , β1 ) by OLS!


• feasible (sample) counterpart of YT +1|T is

ŶT +1|T := β̂0 + β̂1 YT

• Note: ŶT +1|T is not an (in-sample) OLS fitted value, it is an


out-of-sample prediction
IV.48
Forecasting: with an AR(1) model
• Optimal forecasts at longer horizons: obtained via the LIE:

YT +h|T = E[YT +h | YT ] = E[β0 + β1 YT +h−1 + uT +h | YT ]


= β0 + β1 E[YT +h−1 | YT ]
= β0 + β1 YT +h−1|T

• gives a recursion, starting from YT |T = YT


• e.g. optimal h = 2-step-ahead forecast is

YT+2|T = β0 + β1 YT+1|T = β0 + β1(β0 + β1 YT) = (β0 + β1 β0) + β1² YT

• if |β1 | < 1: as h → ∞, YT +h|T → µ = β0 /(1 − β1 ) (see notes); the


best ‘long run’ forecast of a stationary process is its mean
• Feasible (i.e. estimated) counterpart:

ŶT +h|T = β̂0 + β̂1 ŶT +h−1|T

• a recursion starting from ŶT |T = YT
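• The recursion is two lines of code; a sketch, using the (β̂0, β̂1) from the GDP example on the next slide to show the convergence towards µ̂ = β̂0/(1 − β̂1):

# Sketch: recursive h-step-ahead AR(1) forecasts from (b0, b1) and the last obs yT
ar1_forecast <- function(b0, b1, yT, h) {
  f <- numeric(h)
  f[1] <- b0 + b1 * yT                                   # 1-step-ahead
  if (h > 1) for (j in 2:h) f[j] <- b0 + b1 * f[j - 1]   # the recursion
  f
}
round(ar1_forecast(b0 = 1.14, b1 = 0.52, yT = 0.80, h = 8), 2)
# tends to b0 / (1 - b1) = 1.14 / 0.48 = 2.375, the estimated long-run mean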


IV.49
Forecasting: GDP growth

• Suppose we are in 2016Q1: want to forecast quarterly UK GDP


growth rate (grgdp := 400∆ log GDP) for the next year
1. Estimate an AR(1) model: sample 1982Q1 to 2016Q1: R’s dynlm()
command yields
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.1432 0.2583 4.43 2.0e-05 ***
L(grgdp, 1) 0.5197 0.0735 7.07 7.4e-11 ***

• i.e. β̂0 = 1.14, β̂1 = 0.52


2. Use the model to make forecasts (out-of-sample predictions):
• final observation on grgdp: YT=2016Q1 = 0.80
• 2016Q2: ŶT +1|T = β̂0 + β̂1 YT = 1.14 + 0.52 × 0.80 = 1.56
• 2016Q3: ŶT +2|T = β̂0 + β̂1 ŶT +1|T = 1.14 + 0.52 × 1.56 = 1.95
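• The same two steps in R — a minimal sketch, assuming grgdp is a quarterly ts object and the dynlm package (which produced the output above) is installed:

# Sketch: estimate the AR(1) by OLS on 1982Q1-2016Q1, then forecast 2016Q2
library(dynlm)
fit <- dynlm(grgdp ~ L(grgdp, 1), start = c(1982, 1), end = c(2016, 1))
b <- coef(fit)                                  # b[1] ~ 1.14, b[2] ~ 0.52
yT <- window(grgdp, start = c(2016, 1), end = c(2016, 1))
unname(b[1] + b[2] * yT)                        # Y-hat_{T+1|T} ~ 1.56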

IV.50
Forecasting: GDP growth

• How do our forecasts of grgdp compare with its subsequent evolution?

            YT+h   ŶT+h|T   êT+h|T
2016 Q1     0.80    0.80     0.00
2016 Q2     1.81    1.56     0.25
2016 Q3     1.22    1.95    −0.73
2016 Q4     2.38    2.16     0.22
2017 Q1     1.98    2.26    −0.29
2017 Q2     1.17    2.32    −1.15
2017 Q3     1.69    2.35    −0.66

[Time plot: UK GDP growth rate (2016Q1–2017Q3), actual vs. forecast; y-axis: qrtly % change]

IV.51
Forecast errors: in an AR(1) model
• Error made by the:
• (infeasible) optimal forecast:

eT +1|T := YT +1 − YT +1|T = (β0 + β1 YT + uT +1 ) − (β0 + β1 YT )


= uT +1

• (feasible) OLS-based forecast:

êT +1|T := YT +1 − ŶT +1|T = (β0 + β1 YT + uT +1 ) − (β̂0 + β̂1 YT )


= {(β0 − β̂0 ) + (β1 − β̂1 )YT } + uT +1

• Forecast errors generally have two components:


1. estimation error: β0 − β̂0 and β1 − β̂1
2. unforecastable component of the model: uT +1
• orthogonal because E[uT +1 | YT ] = 0
• second dominates in ‘large’ sample sizes, but usually both matter
• êT +1|T is not an OLS residual: it is an out-of-sample prediction error

IV.52
Forecast performance
• How can we measure the performance of our forecasts?
• use the same criterion we are trying to minimise: the MSFE!
• for the theoretically optimal forecast:

MSFE(YT+1|T) = E(YT+1 − YT+1|T)² = E e²T+1|T = E u²T+1 = σ²u

• for the OLS-based forecast:

MSFE(ŶT+1|T) = E(YT+1 − ŶT+1|T)² = E ê²T+1|T
             = E[{(β0 − β̂0) + (β1 − β̂1)YT} + uT+1]²
             = E{(β0 − β̂0) + (β1 − β̂1)YT}² + σ²u

• Main implications:
• MSFE(ŶT+1|T) is always larger than MSFE(YT+1|T)
• the residual variance σ̂²u isn’t a good estimate of MSFE(ŶT+1|T)
• we’ll discuss how to estimate MSFE(ŶT+1|T) in Section 14
IV.53
Subsection 13.vi

Autoregressive modelling: AR(p) models

IV.54
AR(p) models

Yt = β0 + β1 Yt−1 + β2 Yt−2 + · · · + βp Yt−p + ut E[ut | Yt−1 ] = 0

• Relaxes a key assumption of the AR(1) model:


• cond. exp. of Yt given Yt−1 now depends on p ≥ 1 of its own lags
• allows for Yt to have ‘richer’ dynamics
• Also generate stationary time series:
• but under more complex restrictions on the βi’s
• a key requirement is Σ_{i=1}^{p} βi < 1 (see Section 15)

• Properties similar to stationary AR(1) models:


• nonzero but (approximately) geometrically decaying ACF
• but can exhibit more complex patterns
• exhibit rapid mean reversion: but slows as Σ_{i=1}^{p} βi → 1

IV.55
Possible trajectories for AR(2) processes
[Four simulated sample paths, for (β1, β2) = (0.15, 0.45), (0.45, 0.15), (1, −0.5), (0.5, 0.45)]

• Simulations with T = 200, plotted at quarterly frequency; β0 = 0


IV.56
Population ACFs for AR(2) models
[Four population ACFs (h = 0, . . . , 6), for the same (β1, β2) pairs as above]

• Plotted as for quarterly data (x-axis = lag length in years)


IV.57
AR(p) models: forecasting

Yt = β0 + β1 Yt−1 + β2 Yt−2 + ut E[ut | Yt−1 ] = 0

• Very similar to AR(1) models: here consider p = 2 for simplicity


• Optimal (MSFE-minimising) 1-step-ahead forecast is

YT +1|T = E[YT +1 | YT ] = E[β0 + β1 YT + β2 YT −1 + uT +1 | YT ]


= β0 + β1 YT + β2 YT −1

• feasible counterpart: use OLS estimates

ŶT+1|T = β̂0 + β̂1 YT + β̂2 YT−1

• Optimal h-step-ahead forecast: obtained via recursion

YT +h|T = β0 + β1 YT +h−1|T + β2 YT +h−2|T

where Yt|T = Yt for t ≤ T


IV.58
AR(p) models: forecasting GDP growth
• Return to the problem of forecasting quarterly UK GDP growth rate
(grgdp := 400∆ log GDP) from 2016Q1 onwards
1. Estimate an AR(2) model, using data from 1982Q1 to 2016Q1:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.9606 0.2743 3.50 0.00063 ***
L(grgdp, 1:2)1 0.4386 0.0850 5.16 8.6e-07 ***
L(grgdp, 1:2)2 0.1571 0.0848 1.85 0.06605 .

• i.e. β̂0 = 0.96, β̂1 = 0.44, β̂2 = 0.16


2. Use the model to make forecasts (out-of-sample predictions):
• final observations on grgdp: YT =2016Q1 = 0.80, YT −1=2015Q4 = 2.62

2016Q2: ŶT +1|T = β̂0 + β̂1 YT + β̂2 YT −1


= 0.96 + 0.44 × 0.80 + 0.16 × 2.62 = 1.72
2016Q3: ŶT +2|T = β̂0 + β̂1 ŶT +1|T + β̂2 YT
= 0.96 + 0.44 × 1.72 + 0.16 × 0.80 = 1.84

IV.59
Why (linear) AR(p) models?

Yt = β0 + β1 Yt−1 + β2 Yt−2 + · · · + βp Yt−p + ut E[ut | Yt−1 ] = 0

• Linearity is (in principle) restrictive: why not consider e.g.


Yt = β0 + β1 Yt−1 + β2 Y²t−1 + ut

1. Difficult to analyse: e.g. conditions for stationarity are much more


complicated
2. Short time series: limits complexity of what we can estimate
• macro series often around 150–250 quarterly observations
• better to use extra parameters to add in lags, than to add in
nonlinearities
• regression will ‘linearly approximate’ omitted nonlinearities

IV.60
Section 14

Forecasting

IV.61
Roadmap

• We want to elaborate further on the following two ‘steps’ involved in


forecasting
1. Finding a stationary (-looking) transform (or subsample) of the
original series (Sec. 15)
• finding and testing for ‘breaks’
• handling trends and random wandering (unit roots)
2. Using an estimated model to make forecasts ŶT +h|T (Sec. 14)
• forecast evaluation: estimating MSFE(ŶT +h|T )
• model selection: what should the p in AR(p) be?
• incorporating additional covariates: do they help improve our forecasts?
(‘Granger causality’)

IV.62
Subsection 14.i

Forecast evaluation

IV.63
Estimating the MSFE

• ‘Ideal’ measure of forecast precision is the MSFE

MSFE(ŶT+h|T) = E(YT+h − ŶT+h|T)² = E ê²T+h|T


• How can we estimate it?
• we don’t observe Yt after t = T , so we can’t compute êT +h|T
• For specific models, can obtain analytic expressions for E ê²T+h|T:
• e.g. with h = 1, for the AR(1) model:

E ê²T+1|T = E{(β̂0 − β0) + (β̂1 − β1)YT}² + σ²u

• could estimate σ²u using the residual variance
• first term requires estimation of the (co)variance of (β̂0, β̂1)

IV.64
Pseudo out-of-sample forecasting

MSFE(ŶT+h|T) = E(YT+h − ŶT+h|T)² = E ê²T+h|T

• More flexible approach: use pseudo out-of-sample (OOS) forecasting


• works irrespective of the type of model used
• at all horizons: for simplicity, we’ll focus on h = 1
• Basic idea: ‘replicate’ forecasting problem within the sample period
• estimate model only using data from t = 1 up to s < T
• generate a ‘pseudo OOS’ forecast Ŷs+1|s for Ys+1 , with error

ês+1|s = Ys+1 − Ŷs+1|s


• can compute ês+1|s , because Ys+1 is observed!
• Why couldn’t we use the residuals for this? e.g. for an AR(1) model:

ût := Yt − β̂0 − β̂1 Yt−1

• ût is an in-sample prediction error: the β̂l use information after t − 1, up to period T! (they even depend on Yt!)
• OLS residuals do not faithfully ‘replicate’ the OOS forecasting problem
IV.65
Pseudo out-of-sample forecasting

• Repeat this for each s ∈ {T − P, . . . , T − 1}
• choose P to be around 0.1T or 0.2T: the last 10% to 20% of the sample
• gives a series of pseudo OOS forecast errors {ês+1|s}, s = T − P, . . . , T − 1
• Estimate MSFE(ŶT+1|T) = E ê²T+1|T as

ς̂²1 := M̂SFE(ŶT+1|T) = (1/P) Σ_{s=T−P}^{T−1} ê²s+1|s

• ς̂1 is the estimated root MSFE (or RMSFE)
• an analogous procedure yields ς̂²h := M̂SFE(ŶT+h|T)
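• A sketch of the h = 1 procedure for an AR(1) forecasting model, in base R; y is the full series and P the evaluation window:

# Sketch: pseudo out-of-sample RMSFE for one-step AR(1) forecasts
pseudo_oos_rmsfe <- function(y, P) {
  n <- length(y)                         # sample size T
  e <- numeric(P)
  for (i in 1:P) {
    s <- n - P + i - 1                   # estimation sample ends at s < T
    fit <- lm(y[2:s] ~ y[1:(s - 1)])     # AR(1) fitted on t = 1, ..., s only
    e[i] <- y[s + 1] - sum(coef(fit) * c(1, y[s]))   # pseudo OOS error e-hat_{s+1|s}
  }
  sqrt(mean(e^2))                        # the estimated RMSFE
}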

IV.66
Example: forecasting GDP growth

• Forecasting grgdp, using UK data from 1960Q1–2018Q4 (59 years)
• set P = 64 quarters, to use the final 16 years (2003Q1–2018Q4) for pseudo OOS forecasting
• use the estimated RMSFE to evaluate AR(p) forecast performance as a function of p and horizon h

        p:     1      2      3      4      5      6      7      8
h = 1      2.363  2.249  2.304  2.264  2.256  2.275  2.281  2.386
h = 2      2.444  2.319  2.346  2.312  2.303  2.327  2.340  2.433
h = 3      2.454  2.458  2.465  2.441  2.432  2.464  2.473  2.576
h = 4      2.455  2.461  2.494  2.478  2.470  2.500  2.503  2.596

• Compares with the sample sd of grgdp, 2.436, over the OOS period

IV.67
Subsection 14.ii

Model selection

IV.68
Which model should we forecast with?

• A problem of (forecast) model selection


• Within the AR(p) framework: a problem of choosing

p ∈ {0, 1, . . . , pmax }

• we’ll concentrate on this case for concreteness


• methods also apply to selection within and between other models
• What should inform our choice here?

IV.69
Fundamental bias–variance trade-off

Yt = β0 + β1 Yt−1 + β2 Yt−2 + · · · + βp Yt−p + ut p ∈ {0, 1, . . . , pmax }

• Advantages to choosing a larger p?


• more flexible model, potentially better description of the dynamics of Yt
• more parameters over which to optimise our forecast rule
• less ‘bias’ in the approximation of the optimal forecast E[YT +1 | YT ]
• Disadvantages of a larger p?
• more parameters to estimate, using the same amount of data
• more ‘variance’ in the estimation of the model parameters

IV.70
Model selection: formal approaches

1. Directly compare estimated forecast performance: M̂SFE
• forecasting specific
• may be unreliable if a small P is used to estimate M̂SFE
2. Stepwise ‘testing down’ from a larger to smaller model
• applicable to only ‘nested’ models
3. Information criteria: penalise fit by number of model parameters
• very widely applicable, even to ‘non-nested’ models

IV.71
Stepwise ‘testing down’ procedure

• Suppose ‘true’ model for {Yt } is AR(p0 ), where p0 < pmax


• entails βi = 0 for i ≥ p0 + 1
• Idea: start with a model with p = pmax lags
1. perform t test of H0 : βp = 0: say at 5% level
2. if accept H0 : reduce p to p − 1, and return to step 1
• stop at first rejection of H0 : this is your chosen p
• Can we use the usual critical values here?
• yes, provided data is stationary (and ‘weakly dependent’)
• β̂l ∼ᵃ N[βl, (1/T) ω²βl], exactly as in the cross-sectional case
• implies t and F statistics have the familiar limiting distributions
• Disadvantages?
• β̂i could be significant, for a large i, just by chance
• stopping rule prevents consideration of more parsimonious models

IV.72
Information criteria

• Derived from a mathematical formulation of the bias–variance


trade-off
• can be formulated in more than one way: different solutions, so
multiple ICs exist
• relevant not just in a forecasting context: also useful e.g. for selecting
between (cross-sectional) models with nonlinearities
• General form: for m = # model parameters (including intercept)

Akaike IC:   AICm := log(SSRm / T) + m (2 / T)
Bayesian IC: BICm := log(SSRm / T) + m (log T / T)

• ‘best’ model minimises one of these criteria
• depends on model fit (SSR), and a penalty that grows with m
• note that AR(p) model has m = p + 1 parameters
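• A sketch computing AICm and BICm exactly as defined above for AR(p), p = 0, . . . , pmax, in base R; all models are fitted on a common sample so that the SSRs are comparable:

# Sketch: AIC/BIC for AR(p) models fitted by OLS on a numeric series y
ic_ar <- function(y, pmax = 8) {
  Z <- embed(y, pmax + 1)          # col 1 = Y_t; cols 2, ..., pmax+1 = lags 1, ..., pmax
  n <- nrow(Z)                     # common sample size
  t(sapply(0:pmax, function(p) {
    fit <- if (p > 0) lm(Z[, 1] ~ Z[, 2:(p + 1)]) else lm(Z[, 1] ~ 1)
    ssr <- sum(residuals(fit)^2)
    m <- p + 1                     # parameters, including the intercept
    c(p = p, AIC = log(ssr / n) + m * 2 / n, BIC = log(ssr / n) + m * log(n) / n)
  }))
}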

IV.73
Information criteria

AICm = log(SSRm / T) + m (2 / T)        BICm = log(SSRm / T) + m (log T / T)
• BIC penalises larger models more than AIC does
• always chooses a (weakly) smaller model (if T ≥ 8)
• penalty grows larger with the sample size
• What if {Yt } is really an AR(p0 ) process, with p0 < pmax ?
• BIC is consistent: chooses p = p0 with high prob., in large samples
• AIC is conservative: may still choose p (slightly) larger than p0
• both generally lead to ‘reasonable’ choices
• How do they compare with regression diagnostics?
• R² always favours larger models: always chooses p = pmax!
• SER / R̄² penalise additional parameters too little; in fact (see notes):

2 log SERm ≃ log(SSRm / T) + m (1 / T)

and so tend to choose models with weaker forecasting performance

IV.74
Model selection: example
• Estimate AR(p) models for GDP growth and the change in the
unemployment rate (both 1986Q1–2018Q4):

       GDP growth rate                  Change in unemployment rate
p    SER*    AIC    BIC    |t|        SER*     AIC     BIC    |t|
0   1.788  1.796  1.818    —        −2.741  −2.733  −2.711    —
1   1.331  1.346  1.390  5.685      −3.397  −3.382  −3.338  8.672
2   1.313  1.335  1.401  1.600      −3.415  −3.392  −3.327  1.636
3   1.318  1.348  1.435  0.445      −3.407  −3.377  −3.290  0.030
4   1.322  1.359  1.468  0.798      −3.412  −3.374  −3.265  1.027
5   1.330  1.374  1.505  0.126      −3.404  −3.359  −3.228  0.049
6   1.326  1.378  1.530  1.011      −3.399  −3.347  −3.194  0.592
7   1.326  1.385  1.560  0.848      −3.392  −3.333  −3.158  0.417
8   1.333  1.399  1.595  0.388      −3.438  −3.373  −3.176  2.175

• SER∗ := 2 log SER, for comparability with AIC and BIC

IV.75
Why did we not allow ‘gaps’?

Yt = β0 + β1 Yt−1 + β2 Yt−2 + · · · + βp Yt−p + ut E[ut | Yt−1 ] = 0

• AR(p) model includes all lags Yt−1 up to Yt−p


• e.g. AR(4) has Yt−1 , Yt−2 , Yt−3 and Yt−4 as regressors
• why not also consider a model with only (say) Yt−1 and Yt−4 ?
• Potential problems?
• enormous proliferation of models: 2^pmax possible models!
• estimation / comparison may be computationally infeasible
• increases probability of selecting a ‘bad’ model by chance: all measures
of performance are subject to random variation
• Good practice is only to have ‘gaps’ where this is a priori sensible
• e.g. model for monthly data: with first 12 monthly lags, then lags at
24, 36, 48 months to capture longer-lived seasonal dependence

IV.76
Subsection 14.iii

Additional predictors:
forecasting and Granger causality

IV.77
Beyond the AR(p) model

• Forecasts from AR(p) models are an important benchmark


• But could we do better?
• by adding lags of other (economically relevant) variables to the model
• in addition to, not instead of lags of {Yt }: leaving these out would be a
very bad idea!
• For example:
• decreases in the term spread (long minus short rates) tends to
anticipate recessions: could this help forecast GDP?
• unemployment rates tend to lag contractions in output: could GDP
could help forecast unemployment?

IV.78
The ADL(p,q) model
• Take an AR(p) model and add in q lags of another predictor {Xt }:

Yt = β0 + β1 Yt−1 + β2 Yt−2 + · · · + βp Yt−p


+ δ1 Xt−1 + δ2 Xt−2 + · · · + δq Xt−q + ut

• an autoregressive distributed lag (ADL) model of order (p, q)


• {Xt } should also be a (weakly) stationary process: or more precisely,
{Yt , Xt } should be jointly stationary
• Assumptions on {ut }
• unforecastable by lags of both Yt and Xt :

0 = E[ut | Yt−1, Yt−2, . . . , Xt−1, Xt−2, . . .] =: E[ut | Yt−1, Xt−1]

• implies OR holds: parameters consistently estimated by OLS


• standard inferences continue to apply so long as {Yt } and {Xt } are
stationary (and ‘weakly dependent’): t and F tests can be conducted
‘as per usual’

IV.79
Forecasting with an ADL(p,q) model
Yt = β0 + Σ_{i=1}^{p} βi Yt−i + Σ_{i=1}^{q} δi Xt−i + ut        E[ut | Yt−1, Xt−1] = 0

• Optimal (MSFE-minimising) forecast of YT+1 is

YT+1|T := E[YT+1 | YT, XT]
        = E[β0 + Σ_{i=1}^{p} βi YT+1−i + Σ_{i=1}^{q} δi XT+1−i + uT+1 | YT, XT]
        = β0 + Σ_{i=1}^{p} βi YT+1−i + Σ_{i=1}^{q} δi XT+1−i

• Estimate by ‘plugging in’ the OLS estimates:

ŶT+1|T = β̂0 + Σ_{i=1}^{p} β̂i YT+1−i + Σ_{i=1}^{q} δ̂i XT+1−i

IV.80
Forecasting with an ADL(p,q) model

Yt = β0 + Σ_{i=1}^{p} βi Yt−i + Σ_{i=1}^{q} δi Xt−i + ut        E[ut | Yt−1, Xt−1] = 0

• Machinery of forecast evaluation and model selection also applies here


• note ADL(p,q) model has m = p + q + 1 parameters
• Usual to force p = q to reduce set of models to consider
• sequential testing then involves testing H0 : βp = δp = 0, at each step

IV.81
Forecasting with an ADL(p,q) model
• What about h-step ahead forecasting?
• suppose p = q = 1 for simplicity, i.e.

Yt = β0 + β1 Yt−1 + δ1 Xt−1 + ut E[ut | Yt−1 , Xt−1 ] = 0

• if h = 2: then

YT +2|T = E[YT +2 | YT , XT ]
= E[β0 + β1 YT +1 + δ1 XT +1 + uT +2 | YT , XT ]
= β0 + β1 E[YT +1 | YT , XT ] + δ1 E[XT +1 | YT , XT ]
= β0 + β1 YT +1|T + δ1 XT +1|T

• but where do we get XT +1|T from?


• We’d also have to specify an equation for forecasting XT +1 !
• usually taken to be ‘symmetric’ with that for Yt , e.g.

Xt = γ0 + γ1 Yt−1 + θ1 Xt−1 + vt E[vt | Yt−1 , Xt−1 ] = 0

• leads to ‘vector autoregressive’ models: beyond the scope of this course

IV.82
Forecasting with an ADL(p,q) model: example
• Suppose we are in 2016Q1:
• want to forecast quarterly UK GDP growth rate
(grgdp := 400∆ log GDP) for the next quarter
• additionally including the term spread (term; difference between the
10-year and 3-month bond rate)

1. Estimate an ADL(1,1) model (1986Q1–2016Q1):


Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.8396 0.2453 3.42 0.00086 ***
L(grgdp, 1) 0.6032 0.0722 8.35 1.5e-13 ***
L(term, 1) 0.1799 0.1083 1.66 0.09939 .

• i.e. β̂0 = 0.83, β̂1 = 0.60, δ̂1 = 0.18


2. Use the model to make forecasts (out-of-sample predictions):
• final observations: YT =2016Q1 = 0.80, XT =2016Q1 = 1.02
• forecast for 2016Q2:

ŶT +1|T = β̂0 + β̂1 YT + δ̂1 XT = 0.83 + 0.60 × 0.80 + 0.18 × 1.02 = 1.49
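• The same steps in R — a minimal sketch, assuming quarterly ts objects grgdp and term and the dynlm package:

# Sketch: ADL(1,1) estimation (1986Q1-2016Q1) and one-step forecast for 2016Q2
library(dynlm)
fit <- dynlm(grgdp ~ L(grgdp, 1) + L(term, 1), start = c(1986, 1), end = c(2016, 1))
b <- coef(fit)                            # (beta0, beta1, delta1) estimates
unname(b[1] + b[2] * 0.80 + b[3] * 1.02)  # YT = 0.80, XT = 1.02 -> ~ 1.49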

IV.83
Does {Xt } help to forecast {Yt }?
• Referred to in econom(ics/etrics) as the problem of Granger causality
• We say Xt Granger causes Yt if:
• informally: lags of Xt improve forecasts of Yt made on the basis of its own lags
• formally: lags of Xt reduce the optimal forecast’s MSFE:

E{YT +1 − E[YT +1 | YT , XT ]}2 < E{YT +1 − E[YT +1 | YT ]}2

• or equivalently:

E[YT+1 | YT, XT] ≠ E[YT+1 | YT]

• concerns marginal predictability, not ‘causality’ in the usual sense


• Of interest for two reasons:
1. for forecasters: prefer more parsimonious models, so want to know if
we can exclude Xt from the model
2. for economists: some economic theories imply that one variable should
/ should not help to forecast another

IV.84
Testing for Granger causality
Yt = β0 + Σ_{i=1}^{p} βi Yt−i + Σ_{i=1}^{p} δi Xt−i + ut        E[ut | Yt−1, Xt−1] = 0

• Setting is always an ADL(p,q) model with p = q


• must have same number of lags of both variables
• otherwise extra predictability could come from having more lags!

1. Choose p: via a model selection procedure (sequential testing or IC)


2. Perform an F test of

H0 : δ1 = δ2 = · · · = δp = 0
• a null of Granger non-causality
• a test of p linear restrictions: F ∼ᵃ Fp,∞
• unrestricted model: ADL(p, p), with k = 2p slope coefficients
• restricted model: AR(p)
• If reject H0 : conclude {Xt } Granger causes {Yt }
• Good practice to report F from several p’s, as a robustness check
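• A sketch of the test as an F comparison of nested models, assuming quarterly ts objects dur (a name assumed here for ∆ur) and grgdp, plus the dynlm package; a heteroskedasticity-robust version would replace anova() with lmtest::waldtest(..., vcov = sandwich::vcovHC):

# Sketch: Granger causality F test, H0: delta_1 = ... = delta_p = 0
library(dynlm)
p <- 4
adl <- dynlm(dur ~ L(dur, 1:p) + L(grgdp, 1:p))   # unrestricted ADL(p, p)
ar  <- dynlm(dur ~ L(dur, 1:p))                   # restricted AR(p)
anova(ar, adl)                                    # F test of the p restrictions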
IV.85
Granger causality: examples

• Sometimes GC tests are motivated by the desire to improve forecasts


• Suppose e.g. we are trying to forecast unemployment
• labour markets are well known to ‘lag’ output fluctuations
• firms tend to delay laying off workers during a downturn, e.g. because
of the costs of re-hiring / training new employees
• Suggests an ADL model relating the unemployment rate (UR) to its
own lags, and GDP
• because of evident nonstationarities, best phrased in terms of change in
UR, and growth rate of GDP
• transformed series more plausibly stationary: F statistic will have its
usual distribution

IV.86
Does output Granger-cause unemployment?
• Table reports estimates for ADL(p,p) models:

∆urt = β0 + Σ_{i=1}^{p} βi ∆urt−i + Σ_{i=1}^{p} δi grgdpt−i + ut

(Sample: UK, 1986Q1–2018Q4)

p    AIC     BIC      F    p-val
0   −1.96   −1.94     —      —
1   −2.31   −2.25   63.07   0.00
2   −2.59   −2.48   15.32   0.00
3   −2.65   −2.49   11.23   0.00
4   −2.95   −2.76    8.96   0.00
5   −3.01   −2.77    5.34   0.00
6   −2.99   −2.71    4.97   0.00
7   −2.97   −2.64    4.18   0.00
8   −2.97   −2.60    3.84   0.00

• F statistic is for the null of Granger non-causality, H0: δ1 = · · · = δp = 0, with p-value P{F > Fp,∞}
• Clearly reject the null; conclude:
• output helps to predict unemployment
• i.e. output Granger causes unemployment

IV.87
Does unemployment Granger-cause output?
• Could also run in reverse: can unemployment help forecast output?

grgdpt = β0 + Σ_{i=1}^{p} βi grgdpt−i + Σ_{i=1}^{p} δi ∆urt−i + ut

(Sample: UK, 1986Q1–2018Q4)

p    AIC    BIC     F    p-val
0   1.80   1.82    —      —
1   1.31   1.38   5.64   0.02
2   1.33   1.44   1.79   0.17
3   1.36   1.51   1.17   0.33
4   1.37   1.57   1.36   0.25
5   1.37   1.61   1.66   0.15
6   1.37   1.65   2.09   0.06
7   1.39   1.72   1.97   0.06
8   1.39   1.76   2.23   0.03

• Very weak / fragile evidence of Granger causality in this direction
• Unemployment only very marginally helpful in predicting output; arguably does not Granger-cause output
• Findings consistent with our priors about lags in firm hiring and firing

IV.88
Granger causality: stock prices and dividends

• Sometimes Granger causality tests are of interest because they tell us


something about the predictions of economic theory
• E.g. ‘efficient markets hypothesis’ implies stock prices should
efficiently incorporate all publicly available information
• excess returns of stocks (over a riskless asset) should be
unforecastable: nothing should Granger-cause excess returns
• discussed in S&W: large body of evidence consistent with at best very
weak predictability of excess returns
• Another aspect of this theory: stock prices should reflect the present
value of expected future dividends
• stocks with higher expected future dividends are more attractive to
investors, so should be priced more highly today
• stock prices should reflect information about future dividends, beyond
that contained in past dividends
• implies stock prices should Granger-cause dividends: even though
‘actual’ causation runs in the opposite direction!

IV.89
Stock prices and dividends: for the S&P 500
S&P 500: index and dividends (levels); index and dividend growth rates
[Four time plots, 1940–2020]

IV.90
Granger causality: stock prices and dividends
• Stock prices (pr) and dividends (div) are highly nonstationary, so estimate the model in growth rates

∆ log divt = β0 + Σ_{i=1}^{p} βi ∆ log divt−i + Σ_{i=1}^{p} δi ∆ log prt−i + ut

(Sample: US, 1930Q1–2018Q4; pr and div are from the S&P 500)

p    AIC    BIC     F    p-val
0   2.23   2.24    —      —
1   1.16   1.19   2.28   0.13
2   1.05   1.11   6.43   0.00
3   1.02   1.10   5.97   0.00
4   1.00   1.10   5.53   0.00
5   1.01   1.13   4.67   0.00
6   0.99   1.13   4.29   0.00
7   0.99   1.15   3.55   0.00
8   0.99   1.18   3.36   0.00

• Strong evidence of Granger causality: reject the null of non-causality whenever p ≥ 2
• Overwhelmingly reject at the model chosen by AIC (p = 6) or BIC (p = 3)
• Consistent with predictions of the theory

IV.91
Section 15

Nonstationary time series

IV.92
Stationarity: revisited

• So far: we’ve maintained {Yt } and {Xt } are (jointly) stationary, at


least after suitably transforming them
• Objective now: to model and forecast nonstationary time series
1. ‘Breaks’ and parameter instability (Sec. 15.i)
• modelling and testing for changes in the process generating {Yt }
• selecting the epoch on which to estimate a forecasting model
2. Deterministic and stochastic trends (Sec. 15.ii–15.iv)
• how many times should we difference a series to stationarity?
• how should we handle trends when forecasting?

IV.93
Subsection 15.i

Breaks and parameter instability

IV.94
Parameter instability

Yt = β0 + β1 Yt−1 + ut        E[ut | Yt−1] = 0        Eu²t = σ²u        |β1| < 1

• How could nonstationarity be accommodated?


• perhaps different β0, β1 or σ²u apply during different epochs?
• reflecting underlying structural economic changes?
• Example: behaviour of UK real GDP growth apparently changed
during the late 1980s / early 1990s. Why might that have happened?
• legacy of Thatcherism; ‘de-industrialisation’ ?
• liberalisation of financial markets and globalisation?
• decline of trade unions and ‘deregulation’ of labour markets?
• ‘Eyeball’ evidence of parameter instability: but are differences
statistically significant?

IV.95
Example: UK real GDP growth
Real GDP: qtrly growth rate
[Time plot, 1960–2020; y-axis: per cent per annum]

             mean    sd
1960–1989    2.75   4.73
1990–2018    1.97   2.33

Real GDP growth: 1960Q1–1989Q4 and 1990Q1–2018Q4
[Sample ACFs (h = 0, . . . , 6) for the two subsamples]

IV.96
‘Breaks’ and parameter instability
• ‘Break’: abrupt change in model parameters on (or very near) a
particular date; modelled using breakpoint dummies
• Suppose we think things changed on (or around) 1989Q4; we could
specify:

Yt = β0 + β1 Yt−1 + γ0 Dt (τ ) + γ1 Dt (τ ) · Yt−1 + ut

• where, for τ corresponding to 1989Q4,

Dt(τ) := 1{t ≤ τ} = { 1 if t ≤ τ; 0 if t > τ }

• Implications?
• intercept is: β0 + γ0 until 1989Q4, β0 thereafter
• coef. on Yt−1 is: β1 + γ1 until 1989Q4, β1 thereafter
• equivalent to estimating separate AR(1) models on 1960–1989 and
1990–2018 subsamples
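• A sketch in R, assuming a quarterly ts grgdp, plus the lmtest and sandwich packages for a heteroskedasticity-robust F (1989Q4 corresponds to decimal time 1989.75):

# Sketch: Chow break test at tau = 1989Q4 via nested dynlm models
library(dynlm); library(lmtest); library(sandwich)
dum <- ts(as.numeric(time(grgdp) <= 1989.75),
          start = start(grgdp), frequency = frequency(grgdp))
unres <- dynlm(grgdp ~ L(grgdp, 1) * dum)   # lets intercept and slope break
res   <- dynlm(grgdp ~ L(grgdp, 1))         # no-break null model
waldtest(res, unres, vcov = vcovHC(unres, type = "HC1"))  # robust Chow F test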

IV.97
Chow test for a break

Yt = β0 + β1 Yt−1 + γ0 Dt (τ ) + γ1 Dt (τ ) · Yt−1 + ut Dt (τ ) := 1{t ≤ τ }

• Chow breakpoint test:


• test H0 : γ0 = γ1 = 0 via the usual F test
• null of no break: implies Yt is stationary
• q = 2 linear restrictions: critical values taken from F2,∞ distribution
• Do we need to take account of how Eu²t = σ²u might have changed?
• if our object is just to fit an AR(p) model, this can be ignored
• AR(p) model still consistently estimates E[Yt | Yt−1 ]
• inferences not affected by changing variance of errors

IV.98
Testing for a break in UK GDP: at 1989Q4

Yt = β0 + β1 Yt−1 + γ0 Dt (τ ) + γ1 Dt (τ ) · Yt−1 + ut Dt (τ ) := 1{t ≤ τ }

• Is there a break in AR(1) model for real GDP growth?


Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.717 0.440 1.63 0.10467
L(grgdp, 1) 0.637 0.144 4.41 1.6e-05 ***
dum 2.107 0.584 3.61 0.00038 ***
L(grgdp, 1):dum -0.664 0.160 -4.14 4.8e-05 ***

• γ̂0 = 2.11, γ̂1 = −0.66


• F test (hetero-robust) of H0 : γ0 = γ1 = 0 yields

F = 9.41 =⇒ p = P{F2,∞ > 9.41} = 0.00012

• Easily reject the null: strong evidence of a break in the AR(1) model
describing GDP growth: at (or around) 1989Q4

IV.99
What if the break date is unknown?

Yt = β0 + β1 Yt−1 + γ0 Dt (τ ) + γ1 Dt (τ ) · Yt−1 + ut Dt (τ ) := 1{t ≤ τ }

• We found evidence of a break at 1989Q4:


• seems reasonable ex post, but would we think to look there a priori?
• how can we detect (and date) a breakpoint without just relying on our
eyeballs?
• Quandt likelihood ratio (QLR) test:
• allow for breaks at τ ∈ {τ0, τ0 + 1, . . . , τ1}, for τ0 ≃ πT, τ1 ≃ (1 − π)T
• for each τ : compute the Chow break test statistic F (τ )
• QLR statistic is the maximum over all these τ ’s:

QLR := max{F (τ0 ), F (τ0 + 1), . . . , F (τ1 )}

IV.100
What if the break date is unknown?

QLR := max_{τ0 ≤ τ ≤ τ1} F(τ)        [τ0, τ1] ≃ [πT, (1 − π)T]

• What is π?
• ‘trimming parameter’, to ensure enough pre- and post-break
observations
• π = 15% is a reasonable choice, in most macro applications
• QLR involves many, many tests:
• critical values larger than for the F test (see Table 14.5 in S&W)
• depends on: q = # parameters allowed to break; trimming π
• e.g. if q = 2 and π = 0.15, c0.05 = 5.86 (v. 3.00 for F test)
• Break date estimator? The maximiser τ̂ of these F statistics:

τ̂ := argmax_{τ0 ≤ τ ≤ τ1} F(τ)
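• A sketch of the QLR computation with 15% trimming, reusing the Chow-test idea above (assuming grgdp and the dynlm package; non-robust F for brevity):

# Sketch: QLR (sup-F) statistic and break-date estimator for the AR(1) model
library(dynlm)
tt <- time(grgdp)
taus <- tt[floor(0.15 * length(tt)):ceiling(0.85 * length(tt))]
Fstat <- sapply(taus, function(tau) {
  dum <- ts(as.numeric(tt <= tau), start = start(grgdp), frequency = frequency(grgdp))
  anova(dynlm(grgdp ~ L(grgdp, 1)),
        dynlm(grgdp ~ L(grgdp, 1) * dum))$F[2]   # Chow F statistic at this tau
})
c(QLR = max(Fstat), tau_hat = taus[which.max(Fstat)])  # compare with QLR critical values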

IV.101
QLR test for a break in UK real GDP growth
[Top: time plot of real GDP quarterly growth rate (per cent per annum), 1960–2020. Bottom: F statistics for a break in the AR(1) model, UK real GDP (1960Q1–2018Q4), with 1% and 5% critical values]

• Estimated break date: τ̂ = 1987Q1


IV.102
QLR test for a break in UK real GDP growth
[Top: time plot of real GDP quarterly growth rate, 1990–2020. Bottom: F statistics for a break in the AR(1) model, UK real GDP (1987Q2–2018Q4), with 1% and 5% critical values]

• Less evidence for instability in 1987Q2–2018Q4 subsample


IV.103
Implications?

• If we want to forecast UK GDP now, we should only estimate our


model on data after 1987Q2
• What if we just ignore the evidence for a break, and fit an AR(1)
model over 1960–2018 anyway?
1. OLS will be consistent for some ‘average’ of the pre- and post-break
parameters
2. Forecasting performance may be worse
• Breaks are a leading cause of forecast failure: particularly those near
the end of a sample

IV.104
Breaks: summary

• ‘Breaks’ a convenient device for handling nonstationarity due to


parameter instability
• should always be checked for prior to forecasting
• use the QLR test, rather than try to ‘guess’ the break point
• if a break found, fit model to most recent stable epoch
• Chow / QLR tests extend straightforwardly to AR(p) and ADL models
• not necessary to require all parameters to change
• only the # restrictions q matters for distribution of test statistics
• Breaks also of intrinsic interest, aside from forecasting
• evidence of underlying structural economic changes
• trying to explain why breaks have occurred: helps improve
understanding of how the economy has developed over time

IV.105
Subsection 15.ii

Unit roots and stochastic trends

IV.106
Trends / ‘random wandering’

• Major source of nonstationarity in economic time series


• Previously dealt with by ‘differencing to stationarity’: then model the
differences
• But actually: such series can be described by an AR(p) model (with
stable parameters)!

IV.107
Examples of trends / ‘random wandering’

Yield on 10-year UK government bonds; UK quarterly CPI inflation
[Two time plots, 1960–2020 (per cent per annum), with slowly decaying sample ACFs (h = 0, . . . , 6) below]
IV.108
Examples of trends / ‘random wandering’

UK unemployment rate; real GDP: logarithm
[Two time plots, 1960–2020, with slowly decaying sample ACFs (h = 0, . . . , 6) below; y-axes: per cent of labour force; log £b GBP, 2015 prices]
IV.109
AR models for nonstationary time series

Claim: if ∆Yt follows an AR(p − 1) model, then:

• Yt follows an AR(p) model with Σ_{i=1}^{p} βi = 1 (not stationary!)
• the converse is also true

• Suppose e.g. ∆Yt is stationary AR(1):

∆Yt = β0 + γ1 ∆Yt−1 + ut        E[ut | Yt−1] = 0        |γ1| < 1

• rewrite using ∆Yt := Yt − Yt−1:

Yt − Yt−1 = β0 + γ1(Yt−1 − Yt−2) + ut
⟹ Yt = β0 + (γ1 + 1)Yt−1 − γ1 Yt−2 + ut
⟹ Yt = β0 + β1 Yt−1 + β2 Yt−2 + ut

• yields implied AR(2) model for Yt, with

β1 + β2 = (γ1 + 1) + (−γ1) = 1

IV.110
Unit root processes

{∆Yt} is AR(p − 1) ⟺ {Yt} is AR(p) with Σ_{i=1}^{p} βi = 1

• Hitherto: have differenced trending / randomly wandering series {Yt },


before fitting a (stationary) AR(p) model to {∆Yt }
• But preceding implies we could equivalently describe {Yt } directly
using an AR(p + 1) model!
• Nice in theory, but is this empirically useful?
• can AR(p) models really generate trajectories like those of trending /
randomly wandering series?
• not all AR(p) models: not those with stationary parameters!
• but perhaps AR(p) models with Σ_{i=1}^{p} βi = 1: termed unit root AR models / processes
• So what are the properties of unit root AR models?

IV.111
Unit roots: AR(1) case
• Unit root AR(1) process has β1 = 1:

Yt = β0 + Yt−1 + ut

• recursive substitution gives:

Yt = β0 + (β0 + Yt−2 + ut−1 ) + ut


= 2β0 + Yt−2 + (ut−1 + ut )
= 3β0 + Yt−3 + (ut−2 + ut−1 + ut )

• continuing back to the ‘dawn of time’:


t
X
Yt = β0 t + us + Y0 =: β0 t + Ut + Y0
s=1

• Yt decomposes as the sum of
1. a deterministic linear trend: β0 t
2. a random walk: Ut = Σ_{s=1}^{t} us
3. an initial value: Y0
IV.112
Unit roots: AR(1) case

Yt = β0 t + Σ_{s=1}^{t} us + Y0 =: β0 t + Ut + Y0

1. β0 t: deterministic function of time


2. Ut wanders randomly:
• highly persistent: large and slowly decaying sample ACF
• Ut+1 is equal to its previous value, plus an (unforecastable) innovation:

Ut+1 = Σ_{s=1}^{t+1} us = Σ_{s=1}^{t} us + ut+1 = Ut + ut+1

• manifestly nonstationary: var(Ut) = t σ²u → ∞ as t → ∞

• Combination of both components (or Ut alone) generates trajectories


similar to those observed in many macro time series
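• The decomposition makes simulation trivial; a sketch in base R generating the two trajectory types shown on the next slides:

# Sketch: unit root AR(1) paths via Yt = beta0*t + cumsum(u) + Y0
set.seed(3)
n <- 240
u <- rnorm(n)
rw    <- ts(cumsum(u), start = c(1960, 1), frequency = 4)                 # beta0 = 0
drift <- ts(0.3 * (1:n) + cumsum(u), start = c(1960, 1), frequency = 4)   # beta0 = 0.3
plot(cbind(driftless = rw, with.drift = drift), main = "Unit root AR(1) paths")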

IV.113
Trajectories of unit root AR(1): driftless
[Four simulated random-walk paths, β0 = 0]
IV.114
Trajectories of unit root AR(1): with drift
[Four simulated paths with drift: β0 = 0.2, 0.3, 0.4, 0.5]

IV.115
Unit roots: AR(p) case

• Unit root AR(p) process:

Yt = β0 + Σ_{i=1}^{p} βi Yt−i + ut        Σ_{i=1}^{p} βi = 1        E[ut | Yt−1] = 0

• Similar behaviour to unit root AR(1); can decompose (see notes):

Yt = β0 t + Vt + Y0

1. β0 t: deterministic trend
2. Vt := Σ_{s=1}^{t} vs, for {vt} stationary: a stochastic trend
• slight generalisation from AR(1): {vt} may be serially correlated


• AR(p) for Yt allows ∆Yt to be serially correlated (if p ≥ 2): improves
descriptive accuracy of the model

IV.116
Trajectories of unit root AR(p) processes
[Four simulated unit root AR(4) paths: two with β0 = 0, two with β0 > 0]

IV.117
Deterministic and stochastic trends

1. Deterministic trend: β0 t
• generates linear growth
2. Stochastic trend: Vt := Σ_{s=1}^{t} vs, for {vt} stationary and lrvar(vt) > 0
• generates (and synonymous with) ‘random wandering’ behaviour
• special case: a random walk if {vt } is serially uncorrelated

• Technical aside: what is lrvar(vt )? (see notes)


• the ‘long run variance’ of vt; defined for γv,h := cov(vt, vt−h) as

lrvar(vt) := Σ_{h=−∞}^{∞} γv,h = γv,0 + 2 Σ_{h=1}^{∞} γv,h

• can be zero if there is ‘too much’ negative serial correlation in {vt }


• but guaranteed to be positive if vt is stationary AR(p)
• lrvar(vt ) > 0 ensures var(Vt ) → ∞
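A natural sample analogue truncates the sum of autocovariances at some lag H. A minimal base R sketch (the helper name and truncation rule are illustrative assumptions; practical estimators use kernel weights):

## Truncated estimate of lrvar(v_t): gamma_0 + 2 * sum_{h=1}^H gamma_h
lrvar_hat <- function(v, H = floor(length(v)^(1/3))) {
  g <- acf(v, lag.max = H, type = "covariance", plot = FALSE)$acf
  g[1] + 2 * sum(g[-1])
}
lrvar_hat(rnorm(500))   # for i.i.d. data, close to the ordinary variance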

IV.118
Unit root processes: do they match real series?

Unit root (AR) process:

• {Yt} follows an AR(p) model with Σ_{i=1}^p βi = 1
• Decomposes into the sum of a deterministic trend (if β0 ≠ 0), a stochastic trend, and an initial value

Yt = β0 t + Σ_{s=1}^t vs + Y0

• Combination of these components present in the (log) levels of many macro time series
• ‘randomly wandering’ series: behave similarly to the model with β0 = 0
• most series with a drift (β0 ≠ 0) also have a stochastic trend component: evident after linear detrending
• Unit root AR(p) model describes many macro time series well
IV.119
Unit root processes: do they match real series?

[Figure: UK unemployment rate (per cent of labour force) alongside a simulated unit root AR(4) with β0 = 0; both wander similarly, and both have large, slowly decaying sample ACFs ρ̂h over lags h = 0–6]
IV.120
Unit root processes: do they match real series?

[Figure: two further simulated unit root AR(4) trajectories (β0 = 0) with their sample ACFs, for comparison with the unemployment series above]
IV.120
Unit root processes: do they match real series?
[Figure: log US annual real GDP (2012 prices) and log UK quarterly real GDP (2015 prices), each with fitted linear trend δ̂0 + δ̂1 t; the lower panels show the linearly detrended series, which still wander persistently]
IV.121
Subsection 15.iii

Testing for unit roots

IV.122
Handling unit roots in time series

• Previously, if {Yt} ‘looked like’ it had a stochastic trend:
• differenced it to obtain a (more) stationary-‘looking’ process
• fitted an AR model to {∆Yt}
• Now: we know we don’t need to do this
• equivalent AR models exist for both {Yt} and {∆Yt}:

{∆Yt} is AR(p − 1) ⇐⇒ {Yt} is AR(p) with Σ_{i=1}^p βi = 1

• parameters of both consistently estimated by OLS (but are not numerically equivalent)
• So should we fit an AR (or ADL) model to a unit root process {Yt} directly? Not necessarily . . .

IV.123
Pitfalls of unit roots

1. Biases in estimated coefficients
• OLS consistent, but substantial finite-sample biases
• e.g. estimate of β1 in AR(1) model: large downward bias near β1 = 1
• can negatively affect forecast accuracy
2. Non-standard limiting distributions
• CLTs do not apply to unit root processes
• regression estimates (of some parameters) not asymptotically normal
• inference is non-standard: complicates e.g. Granger causality tests
• ‘spurious regressions’ are possible (Sec. 16.i)

IV.124
Pitfalls of unit roots
• Conclusion? If we know {Yt} has a unit root, it is still preferable to fit an AR (or ADL) model to {∆Yt}
• estimating the AR model for {∆Yt} implicitly imposes a unit root in Yt: reduces estimation error / biases
• forecasts for Yt can be recovered as per

ŶT+1|T := YT + ∆ŶT+1|T

• But if {Yt} is stationary, we should model it directly
• ‘overdifferencing’ (working with ∆Yt) creates other problems: imposes a unit root in {Yt} when there isn’t one!
• We should perform a formal statistical test for a unit root in {Yt} before forecasting; also important for:
1. detecting possible instances of ‘spurious regression’ (Sec. 16.i)
2. identifying long-run equilibrium relationships between co-moving series (‘cointegration’; Sec. 16.ii)
3. sometimes of intrinsic interest: e.g. do ‘shocks’ to GDP have permanent effects?

IV.125
Testing for a unit root: AR(1) case

Yt = β0 + β1 Yt−1 + ut,  E[ut | Yt−1] = 0

• Want to test the null of a unit root, against a stationary alternative:

H0 : β1 = 1  v.  H1 : β1 < 1

• Why the one-sided alternative?
• β1 < 1: consistent with stationarity (provided β1 > −1; not tested)
• β1 > 1: implies ‘explosive’ behaviour, which is usually implausible
• one-sided test interprets evidence for β1 > 1 as evidence against stationarity (correctly)
• one-sided test means a smaller critical value: a more powerful test!

IV.126
Explosive AR(1) processes: β1 > 1
[Figure: simulated explosive AR(1) trajectories with β0 = 0 and β1 = 1.01, 1.02, 1.03, 1.04, plotted over 1960–2020: even small departures above unity generate rapidly diverging paths]
IV.127
Testing for a unit root: AR(1) case

Yt = β0 + β1 Yt−1 + ut,  E[ut | Yt−1] = 0

• Equivalent formulation: subtract Yt−1 from both sides

∆Yt = β0 + (β1 − 1)Yt−1 + ut =: β0 + δYt−1 + ut

• δ := β1 − 1; OLS estimate is δ̂ = β̂1 − 1
• hypotheses equivalently stated as

H0 : δ = 0  v.  H1 : δ < 0

• Can we use the usual t statistic? ‘Almost’
• in large samples, t is not approximately N[0, 1], but has the Dickey–Fuller (constant only) distribution:

t := δ̂ / se(δ̂) →d DFcn

• convention is to use homoskedasticity-only s.e.’s: makes no difference to the limiting distribution in this case (a sketch of the regression follows below)
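A minimal base R sketch of this regression, assuming Y holds the observed series (all names are illustrative):

## Dickey–Fuller regression, constant only: regress ∆Y_t on Y_{t-1}
dY   <- diff(Y)                  # ∆Y_t for t = 2, ..., T
Ylag <- Y[-length(Y)]            # Y_{t-1}, aligned with dY
t_DF <- coef(summary(lm(dY ~ Ylag)))["Ylag", "t value"]
t_DF < -2.86                     # TRUE => reject H0 (unit root) at the 5% level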
IV.128
The Dickey–Fuller distribution: constant only
[Figure: density of the DFcn distribution, which is shifted to the left of the N[0,1] density]

Left-tail critical value      10%      5%      1%
N[0, 1]                      −1.28    −1.64   −2.33
DFcn : constant only         −2.57    −2.86   −3.43

IV.129
Example: is there a unit root in unemployment?
∆Yt = β0 + δYt−1 + ut

• Regress ∆Yt on Yt−1, where Yt = unempt (1960Q1–2016Q4); a sketch of the estimation call follows below

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    0.05818     0.04050     1.44      0.15
L(unemp, 1)   -0.00769     0.00595    -1.29      0.20

• Implied t statistic for H0 : δ = 0:

t = δ̂ / se(δ̂) = −0.00769 / 0.00595 = −1.29

• Decision rule for α = 0.05 test: reject if

t < c0.05 = −2.86

• Conclusion: do not reject H0 :
• conclude unemployment has a unit root
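Output in this format comes from the dynlm package, whose formula functions d() and L() construct differences and lags of a ts object. A hedged sketch of the call (assuming unemp is a quarterly ts object):

## Hypothetical call producing a regression of this form
library(dynlm)
summary(dynlm(d(unemp) ~ L(unemp, 1)))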
IV.130
Testing for a unit root: AR(p) case

∆Yt = β0 + δYt−1 + ut

• AR(1) model is very restrictive
• implies e.g. that ∆Yt is serially uncorrelated (under H0)
• may give misleading conclusions about the presence of a unit root
• How to test for a unit root in the more general AR(p) setting?
• consider adding a lag of ∆Yt to the model above:

∆Yt = β0 + δYt−1 + γ1 ∆Yt−1 + ut

• how does this help us . . . ?

IV.131
Testing for a unit root: AR(p) case

∆Yt = β0 + δYt−1 + γ1 ∆Yt−1 + ut

• Equivalent AR(2) model for Yt :

Yt − Yt−1 = β0 + δYt−1 + γ1 (Yt−1 − Yt−2) + ut
=⇒ Yt = β0 + (δ + γ1 + 1)Yt−1 − γ1 Yt−2 + ut
       = β0 + β1 Yt−1 + β2 Yt−2 + ut

• implies

β1 + β2 = (δ + γ1 + 1) + (−γ1) = δ + 1,  which = 1 if δ = 0, and < 1 if δ < 0

• Unit root corresponds to δ = 0, stationarity to δ < 0
• carries over to AR(p) models with p ≥ 3 (see notes)

IV.132
ADF test: constant only
• ‘Augmented’ Dickey–Fuller (ADF) test for a unit root

∆Yt = β0 + δYt−1 + Σ_{i=1}^p γi ∆Yt−i + ut

• model ‘augmented’ by lags of ∆Yt
• critical values from the ‘Dickey–Fuller distribution’

1. Hypotheses:

H0 : δ = 0  v.  H1 : δ < 0

• null: {∆Yt} is AR(p), so {Yt} has a unit root
• alternative: {Yt} is stationary AR(p + 1)
2. Test statistic: homoskedasticity-only t statistic:

t := δ̂ / se(δ̂) →d DFcn

3. Decision rule: reject if t < cα
4. Critical values: from tabulation of DFcn (a sketch of the regression follows below)
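A minimal base R sketch of the ADF regression for p ≥ 1 (the helper name is an illustrative assumption; packages such as urca automate this):

## ADF regression (constant only): ∆Y_t on Y_{t-1} and p lags of ∆Y_t
adf_t <- function(Y, p) {
  stopifnot(p >= 1)
  dY <- diff(Y)
  Z  <- embed(dY, p + 1)                 # col 1: ∆Y_t; cols 2..p+1: its lags
  Ylag <- Y[(p + 1):(length(Y) - 1)]     # Y_{t-1}, aligned with the rows of Z
  fit  <- lm(Z[, 1] ~ Ylag + Z[, -1])
  coef(summary(fit))["Ylag", "t value"]  # compare with DF critical values
}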
IV.133
Example: is there a unit root in unemployment?

∆Yt = β0 + δYt−1 + γ1 ∆Yt−1 + γ2 ∆Yt−2 + γ3 ∆Yt−3 + ut

• Repeat the test allowing for Yt = unempt to follow an AR(4) model

                       Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)             0.05496     0.02371     2.32    0.021 *
L(unemp, 1)            -0.00844     0.00350    -2.41    0.017 *
L(diff(unemp), 1:3)1    0.79332     0.06635    11.96   <2e-16 ***
L(diff(unemp), 1:3)2    0.02936     0.08501     0.35    0.730
L(diff(unemp), 1:3)3   -0.00340     0.06664    -0.05    0.959

• Implied t statistic for H0 : δ = 0:

t = δ̂ / se(δ̂) = −2.41 ≮ −2.86 = c0.05

• Still do not reject H0 (at α = 0.05): so unemployment has a unit root
• note: use of normal critical values (c0.05 = −1.64) would incorrectly lead to a rejection of H0 here

IV.134
ADF test: a limitation
∆Yt = β0 + δYt−1 + Σ_{i=1}^p γi ∆Yt−i + ut

• Under the null (δ = 0), {Yt} is a unit root process, so

Yt = β0 t + Vt + Y0

• has a deterministic trend (if β0 ≠ 0)
• but under the alternative, {Yt} is stationary: with no trend!
• Could lead to misleading conclusions, when applied to series with a linear trend, e.g. log GDP
• may fail to reject the null, because only the null allows for a trend
• we need to allow for a linear trend under the alternative
• accommodated by augmenting the model by a linear trend, αt
• E.g. an (otherwise) stationary AR(1) alternative (β1 < 1):

Yt = β0 + αt + β1 Yt−1 + ut

• Yt is stationary ‘around’ a linear trend: a trend stationary process
• linearly detrending Yt would yield a stationary process
IV.135
ADF test: constant and trend

∆Yt = β0 + αt + δYt−1 + Σ_{i=1}^p γi ∆Yt−i + ut

1. Hypotheses:

H0 : δ = 0  v.  H1 : δ < 0

• null: {∆Yt} is AR(p), {Yt} has a unit root
• alternative: {Yt} is trend stationary AR(p + 1)
2. Test statistic: homoskedasticity-only t statistic:

t := δ̂ / se(δ̂) →d DFtr

3. Decision rule: reject if t < cα
4. Critical values: from tabulation of DFtr

IV.136
The Dickey–Fuller distributions
[Figure: densities of the DFcn (constant only) and DFtr (constant and trend) distributions, both shifted left of the N[0,1] density]

Left-tail critical value        10%      5%      1%
N[0, 1]                        −1.28    −1.64   −2.33
DFcn : constant only           −2.57    −2.86   −3.43
DFtr : constant and trend      −3.21    −3.41   −3.96
IV.137
ADF test: practicalities

∆Yt = β0 + αt + δYt−1 + Σ_{i=1}^p γi ∆Yt−i + ut

1. Type: (a) constant only, or (b) constant and trend?
• use constant only unless some deterministic drift is plausible: gives a more powerful test
• clear a priori for some series, inferred from time plots for others
2. Lag order determination:
• δ̂ has a non-standard distribution, but the γ̂i ’s are asymptotically normal!
• standard procedures can be applied: ICs and stepwise testing
• best practice is to report ADF statistics for the selected p, and for some others (to verify robustness)

IV.138
Example: is there a unit root in (log) real GDP?
∆Yt = β0 + αt + δYt−1 + Σ_{i=1}^p γi ∆Yt−i + ut

• Since log(GDPt) has a prominent drift, include αt in the ADF regression

        1960Q1–2018Q4            1990Q1–2018Q4
p     AIC     BIC    tδ=0      AIC      BIC     tδ=0
0   −9.34   −9.29   −1.56   −10.26   −10.19   −0.38
1   −9.34   −9.28   −1.74   −10.77   −10.68   −1.40
2   −9.36   −9.29   −2.07   −10.76   −10.64   −1.52
3   −9.37   −9.29   −2.35   −10.76   −10.62   −1.25
4   −9.37   −9.26   −2.20   −10.75   −10.58   −1.36

• Do not reject the null of a unit root at the 10% level (c0.10 = −3.21)
• Highly robust to the number of included lags (even out to p = 12!)

IV.139
Subsection 15.iv

Orders of integration

IV.140
Orders of integration

∆Yt = β0 + δYt−1 + Σ_{i=1}^p γi ∆Yt−i + ut

• Suppose we accept H0 : δ = 0:
• conclude {Yt} has a unit root
• does this imply {∆Yt} is stationary?
• {∆Yt} follows an AR(p) model

∆Yt = β0 + Σ_{i=1}^p γi ∆Yt−i + ut

• but it is possible that Σ_{i=1}^p γi = 1: {∆Yt} may itself have a unit root!
• could it be ‘unit roots all the way down’?

IV.141
Orders of integration

Order of integration of {Yt}: the smallest d ∈ {0, 1, 2, . . .} such that {∆d Yt} is stationary; denoted Yt ∼ I(d)
• ∆2 Yt := ∆(∆Yt) = ∆Yt − ∆Yt−1 , etc.

• Estimate d via sequential ADF tests:
1. test for a UR in {Yt}: if reject, then d = 0; if accept . . .
2. test for a UR in {∆Yt}: if reject, then d = 1; if accept . . .
3. test for a UR in {∆2 Yt}, etc.
• The first d s.t. a UR in {∆d Yt} is rejected = the (estimated) order of integration (a sketch follows below)
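A sketch of the sequential procedure, reusing the adf_t() helper sketched earlier (the fixed lag order, critical value, and names are illustrative assumptions):

## Sequential ADF tests (constant only) to estimate the order of integration
order_of_integration <- function(Y, p = 1, crit = -2.86, d_max = 2) {
  for (d in 0:d_max) {
    if (adf_t(Y, p) < crit) return(d)   # reject unit root: Y ~ I(d)
    Y <- diff(Y)                        # otherwise difference once, re-test
  }
  NA                                    # no rejection up to d_max: inconclusive
}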

IV.142
Orders of integration

• Most economic time series are (approximately) I(d) for d ∈ {0, 1}; a very small number are I(2)
• I(0): stationary processes
• I(1): have stochastic trends, but their differences are stationary
• I(2): have to be differenced twice to obtain a stationary process
• I(d), d ≥ 1, series are termed integrated processes
• I(2): are randomly wandering, but even more persistent (‘smoother’) than I(1) processes

IV.143
Example: is the (log) price level I (2)?
• Previously, we noted it might be necessary to difference US log CPI (1950–2019) twice to obtain a stationary series
• Does this bear up to formal unit root tests?

        log CPIt                ∆ log CPIt             ∆2 log CPIt
        constant + trend        constant only          constant only
p     AIC     BIC    tδ=0     AIC     BIC    tδ=0     AIC     BIC     tδ=0
0   −7.22   −7.12    0.04   −8.09   −8.03   −3.27   −7.94   −7.88    −9.58
1   −8.07   −7.94   −1.30   −8.08   −7.98   −3.43   −8.21   −8.11   −11.53
2   −8.05   −7.89   −1.05   −8.24   −8.11   −2.04   −8.19   −8.06    −7.46
3   −8.26   −8.06   −2.09   −8.21   −8.05   −1.91   −8.17   −8.01    −6.24
4   −8.24   −8.01   −2.24   −8.19   −8.00   −1.72   −8.14   −7.95    −4.79

• Using the AIC/BIC-selected model in each case:
• log CPIt : −2.09 ≮ −3.21 = c0.10 : do not reject a unit root (at 10%)
• ∆ log CPIt : −2.04 ≮ −2.57 = c0.10 : do not reject a unit root (at 10%)
• ∆2 log CPIt : −11.53 < −3.43 = c0.01 : reject a unit root (at 1%)
• Conclusion: log CPIt is indeed I(2)

IV.144
Differencing twice to get to stationarity?

[Figure: log US CPI (1982–84 = 100) and US inflation (annual change in CPI), 1950–2020, each with its sample ACF ρ̂h for lags h = 0–12: both series are highly persistent]
IV.145
Differencing twice to get to stationarity?

[Figure: US inflation (annual change in CPI) and the change in US inflation, 1950–2020, each with its sample ACF: only the twice-differenced price level has a rapidly decaying ACF]
IV.146
Beyond the AR model

• Thus far, have considered I(d) within the AR framework:
• I(d), d ≥ 1, processes were unit root AR processes
• I(0) processes were stationary AR processes
• More generally, the definition of I(d) allows
• I(0) to be any stationary process, not necessarily AR, e.g.

Yt = εt + θ1 εt−1 + θ2 εt−2 ,  εt ∼ i.i.d.

cannot be written as a (finite-order) AR process, but is I(0)
• I(1) to be any process that can be differenced to stationarity; equivalently, the cumulation of an I(0) process
• Unit root tests still effectively distinguish between I(0) and I(1) processes, even if they aren’t exactly AR processes

IV.147
Subsection 15.v

Forecasting: a ‘recipe’ for nonstationary data

IV.148
Handling nonstationarities

• Finally: distil what we’ve covered into a ‘recipe’ for forecasting with nonstationary data
1. Plot the series and its sample ACF
• Does the series pass the ‘eyeball test’ for stationarity?
• or is a stochastic trend possibly present? =⇒ Step 2
• or is it stationary but for possible breaks? =⇒ Step 3

IV.149
Handling stochastic trends

2. Determine the order of integration of the series
• ADF tests for Yt , ∆Yt , etc.,
• until a rejection obtains for some ∆d Yt
• conclude that {Yt} ∼ I(d)
• usually: d = 0 (already stationary) or d = 1 (difference once to stationarity)
• Repeat for any other predictors {Xt} ∼ I(dX)
• Fit AR(p) [or ADL] model to ∆d Yt [and ∆dX Xt , etc.]

IV.150
Handling breaks

3. Check stability of the forecasting model
• QLR test for breaks in the fitted AR / ADL model
• If none present, you may proceed to forecast!
• Otherwise, either:
• re-estimate the model with breakpoint dummies; or
• fit the model on the most recent stable epoch
• Forecast using the (re-)estimated model for ∆d Yt
• if necessary, re-cumulate forecasts to obtain forecasts for Yt (see the sketch below)
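The re-cumulation step is a one-liner. A sketch, where dY_fc holds forecasts of ∆YT+1 , . . . , ∆YT+H and Y_T is the last observed level (both names are illustrative):

## Recover level forecasts from forecasts of the differences
Y_fc <- Y_T + cumsum(dY_fc)   # Ŷ_{T+h|T} = Y_T + cumulated forecast changes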

IV.151
Section 16

Cointegration

IV.152
Regressions with integrated processes

• Up to now: if Yt and Xt are I(1):
• we ‘difference them to stationarity’;
• fit a model to ∆Yt and ∆Xt
• What if we don’t do this, and run regressions on Yt and Xt directly?
• can anything go seriously wrong?
• is there anything new and useful we can learn?
The answer to both questions is yes . . .

IV.153
Subsection 16.i

Spurious regression

IV.154
Regression: with i.i.d. or stationary processes

Q. Are X and Y ‘related’, either in a causal or a predictive sense?

• In the cross section, can answer by regression of Y on X
• If X ⊥⊥ Y , OLS consistently estimates a zero coefficient on X
• can perform formal tests of ‘unrelatedness’ via the t test
• This carries over to stationary time series
• if {Yt} and {Xt} are stationary and independent, OLS consistently estimates a zero coefficient on Xt (or Xt−1 , etc.)

Yt = β0 + β1 Xt + ut ,  cov(Xt , ut) = 0

• because no lags of Yt are included (unlike in an AR / ADL model), ut may be serially correlated
• inferences have to be adjusted: should compute heteroskedasticity and autocorrelation consistent (HAC) standard errors
• But when stochastic trends are present, regression can go seriously awry . . .

IV.155
Regression: with stochastic trends
• Suppose {Xt} and {Yt} are independent random walks

Xt = Σ_{s=1}^t εxs ,  Yt = Σ_{s=1}^t εys

• mean zero, i.i.d. innovations {εxs} ⊥⊥ {εys}
• I(1) processes, with no deterministic drift (β0 = 0)
• Surely, a regression of Yt on Xt will consistently estimate zero?

Yt = β̂0 + β̂1 Xt + ût

• we would expect (for some ξ not necessarily normal):

β̂1 →p 0,  t(0) = (β̂1 − 0) / se(β̂1) →d ξ

• but what actually happens is quite the opposite!

IV.156
Regression: with stochastic trends

• OLS regression has a systematic tendency to find a ‘statistically significant’ relationship between {Xt} and {Yt}!
• β̂1 never converges to zero or any other constant; it remains a random variable even as T → ∞!
• t statistic diverges (in magnitude): suggesting a significant relationship, whatever critical value / significance level is chosen

P{|t(0)| > c} → 1

• R2 does not converge to zero, but also remains random as T → ∞: may be quite close to unity
• But {Xt} and {Yt} are independent: this apparent significance is entirely spurious!

IV.157
Distribution of the t statistic
• Simulation:
• generate 1000 pairs of independent random walks {Xt}Tt=1 and {Yt}Tt=1
• regress {Yt} on {Xt}, and compute the t statistic for H0 : β1 = 0 (a sketch follows below)

[Figure: simulated densities of the t statistic for T = 50, 100, 200, spreading far beyond the ±2.58 normal critical values, and fanning out further as T grows]
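A minimal base R sketch of this simulation (seed and sample size are illustrative):

## t statistics from regressions of independent random walks on each other
set.seed(42)
t_stats <- replicate(1000, {
  X <- cumsum(rnorm(100))               # two independent random walks, T = 100
  Y <- cumsum(rnorm(100))
  coef(summary(lm(Y ~ X)))["X", "t value"]
})
mean(abs(t_stats) > 2.58)               # rejection rate far above the nominal 1%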


IV.158
Spurious regression

Spurious regression: defined as the systematic tendency to find statistically significant regression relationships between unrelated I(1) series

• One of the leading examples of why you have to know something about time series data, before running regressions on it!
• Possible to construct all sorts of ‘crazy’ examples of variables apparently ‘related’ to each other
• Yule (1926): ‘nonsense correlations’
• gave as an instance the strong positive correlation between mortality and the proportion of marriages in the Church of England (1865–1913)
• investigated the phenomenon via simulations (using numbered cards!)
• recognised the role of persistence: invented the autocorrelation function to quantify this

IV.159
Spurious regression: example

• Does the (log of) US industrial production (IP) help to account for unemployment in the UK?

              β̂1      se     tβ1=0     R2
1960–2016    2.57    0.22    11.70   0.16
1960–1989    9.18    0.39    23.38   0.60
1990–2016   −7.54    0.40   −18.64   0.52

• In the two subperiods, IP ‘explains’ more than 50% of the variation in unemployment!
• with a relationship that has almost exactly the opposite coefficient!
• that is highly statistically significant in all cases!

IV.160
Spurious regression: example
[Figure: UK unemployment rate (per cent) and log of US industrial production (index), 1960–2020]
IV.161
Spurious regression: and stochastic trends

• So why does this happen?
• Not merely: ‘both series grew / declined during the sample period’
• I(1) processes can have a deterministic drift, but spurious regression happens even when they don’t!
• the problem is really due to stochastic trends
• Stochastic trends exhibit long swings of increase and decline
• even unrelated I(1) series will tend to have periods in which they move in the same direction
• co-movement happens regularly enough, by chance, that a ‘statistically significant’ relationship emerges
• Mathematically explained by Phillips (1986), 60 years after Yule!

IV.162
Spurious regression: diagnosis

• How can a spurious regression be diagnosed?
1. Check the order of integration: are Xt , Yt ∼ I(1)?
2. Analyse the residuals

ût = Yt − β̂0 − β̂1 Xt

• if Xt , Yt ∼ I(1) and unrelated, then ût will inherit a stochastic trend
• ût will be highly serially correlated, and look like an I(1) process
• possible to perform an ADF test on ût , though critical values have to be adjusted . . .

IV.163
Spurious regression: residuals
[Figure: UK unemployment rate, log of US industrial production, and the residual from regressing the former on the latter, 1990–2020: the residual itself wanders like an I(1) process]
IV.164
Subsection 16.ii

Cointegration and long-run equilibria

IV.165
Regressions with integrated processes

• So can it ever be appropriate to regress one I(1) series on another?
• Yes! In fact, this can tell us a lot about the long-run behaviour of these series – which we cannot learn from their first differences
• All I(1) processes have stochastic trends
• but some share common stochastic (and deterministic) trends
• they tend to move together over time
• You might say they are integrated (of order 1) ‘together’: they are ‘co’-integrated . . .

IV.166
Cointegration: definition and interpretation

Xt and Yt are cointegrated if Xt , Yt ∼ I(1), and there exists a θ, termed the cointegrating coefficient, such that Yt − θXt ∼ I(0).
• arises because two individually integrated processes share a common stochastic (and deterministic) trend
• they are ‘co’-integrated – integrated ‘together’

• Xt and Yt wander randomly, with no tendency to revert to a fixed level
• but Yt − θXt ∼ I(0) is mean-reverting
• why? because Yt and Xt tend to move together, so as to keep Yt − θXt close to its mean!
• θ parametrises the long-run equilibrium relationship between Yt and Xt
• θ is often of intrinsic economic interest, rather than merely ‘descriptive’

IV.167
Example: I (1) processes that tend to co-move
[Figure: UK 10-year and 3-month Treasury bond yields (annualised), 1960–2010, and the term spread (10-year minus 3-month yield): the yields wander together, while the spread reverts to its mean]
IV.168
Example: I (1) processes that tend to co-move
[Figure: UK, 1870–2011: log real output per worker (£1985 prices) and log hourly real wages (1985 = 1), trending upward together over nearly 150 years]
IV.169
Cointegration: mathematical illustration
• Suppose vt , wyt and wxt are mean zero and I(0), and

Yt = Σ_{s=1}^t (µ + vs) + wyt ,  Xt = (1/θ) Σ_{s=1}^t (µ + vs) + wxt

• Yt , Xt ∼ I(1): both have stochastic trends, and

∆Yt = µ + vt + wyt − wy,t−1 ∼ I(0)

i.e. their first differences are stationary
• However, there is a (unique) linear combination that is I(0):

Yt − θXt = [Σ_{s=1}^t (µ + vs) + wyt] − θ[(1/θ) Σ_{s=1}^t (µ + vs) + wxt] = wyt − θwxt ∼ I(0)

• Taking Yt − θXt eliminates the common trends (a simulation sketch follows below)
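A base R sketch of this construction (parameter values are illustrative): the two series share a single stochastic trend, which the combination Yt − θXt removes.

## Simulate a cointegrated pair sharing one stochastic trend
set.seed(7)
n <- 500; theta <- 2; mu <- 0.1
trend <- cumsum(mu + rnorm(n))    # common stochastic (and deterministic) trend
Y <- trend + rnorm(n)             # w_yt: stationary noise around the trend
X <- trend / theta + rnorm(n)     # w_xt: stationary noise around the trend
plot.ts(Y - theta * X)            # equilibrium error: I(0), mean-reverting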

IV.170
Cointegration: implications for OLS
• Does OLS regression of Yt on Xt make sense if there is cointegration?
• OLS chooses (α̂, θ̂) to solve

min(a,c) Σ_{t=1}^T (Yt − a − cXt)²

• Stochastic trends have a large (and growing) variance
• the OLS criterion will be minimised by a θ̂ that (approximately) eliminates the common stochastic trend
• θ̂ does not equal θ exactly, but will be consistent for θ

ξ̂t := Yt − θ̂Xt ≈ Yt − θXt ∼ I(0)

• Conclusion: if Xt and Yt are cointegrated, θ̂ →p θ
• but because Xt ∼ I(1), θ̂ has a non-standard limiting distribution
• other estimators are more efficient than OLS (and asymptotically normal!): beyond the scope of this course

IV.171
Contrast with spurious regression
• What if Yt , Xt ∼ I(1) but do not have a common stochastic trend?

Yt = Σ_{s=1}^t vys + wyt ,  Xt = Σ_{s=1}^t vxs + wxt

• there is no linear combination that eliminates the trends, even if vys and vxs are (imperfectly) correlated
• for every γ,

Yt − γXt = Σ_{s=1}^t (vys − γvxs) + (wyt − γwxt) ∼ I(1)

• Implication?
• residuals from a spurious regression will appear I(1):

ût = Yt − β̂0 − β̂1 Xt ∼ I(1)

• whereas those from a valid cointegrating regression will be I(0)
• we can use an ADF test to distinguish cointegration from spurious regression!
IV.172
Testing for cointegration

1. Perform ADF tests to verify that Xt , Yt ∼ I(1)
2. If we know the cointegrating coefficient θ
• perform an ADF test on ξt := Yt − θXt
• the null of a unit root, ξt ∼ I(1), corresponds to a null of no cointegration
• if we reject the null: conclude that ξt ∼ I(0), so Xt , Yt are cointegrated
3. If we don’t know the cointegrating coefficient θ
• estimate θ by OLS, compute ξ̂t := Yt − θ̂Xt
• perform an ADF test on ξ̂t , using adjusted (Engle–Granger) critical values [S&W, Table 16.2]
• if we reject the null that ξ̂t ∼ I(1): conclude Xt , Yt are cointegrated (a sketch follows below)

• The ADF test is a very crude way of assessing cointegration: better methods are available (beyond the scope of this course)
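A base R sketch of the two-step procedure for unknown θ, reusing the adf_t() helper sketched earlier; the −3.41 cutoff is the adjusted 5% Engle–Granger critical value used in the examples that follow:

## Step 1: estimate the cointegrating regression by OLS
eg_fit <- lm(Y ~ X)
xi_hat <- resid(eg_fit)          # candidate equilibrium error
## Step 2: ADF test (constant only) on the residuals, with EG critical values
adf_t(xi_hat, p = 1) < -3.41     # TRUE => reject 'no cointegration' at 5%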

IV.173
Example: real wages and output per worker
• Yt = log real wages; Xt = log output per worker
1. Perform ADF tests on both series
• constant + trend on levels; constant only on differences
• with lag order selected by AIC in each case

          Yt      ∆Yt      Xt      ∆Xt
tDF     −1.49   −6.39   −1.36   −7.56
c0.05   −3.41   −2.86   −3.41   −2.86

• conclusion: both Yt and Xt are I(1)
3. Regress Yt on Xt to estimate the cointegrating coefficient:

Yt = −10.6 + 1.13Xt + ξ̂t

• perform an ADF (constant only) test on the OLS residuals {ξ̂t}
• tDF = −3.81, if lag order chosen by AIC
• EG critical value is c0.05 = −3.41: reject a unit root in {ξ̂t}!

• Conclusion: real wages and output per worker are cointegrated

IV.174
Example: UK unemp. and US ind. production

• Yt = UK unemployment, Xt = log US industrial production (1990–2016)
1. ADF tests (not reported) indicate Xt , Yt ∼ I(1)
3. Regress Yt on Xt to estimate the cointegrating coefficient:

Yt = 40.43 − 7.54Xt + ξ̂t

• perform an ADF (constant only) test on the OLS residuals {ξ̂t}
• tDF = −2.26, if lag order chosen by AIC
• EG critical value is c0.05 = −3.41: do not reject a unit root in {ξ̂t}!

• Conclusion: UK unemployment and US industrial production are not cointegrated – this regression is spurious!

IV.175
Example: the term spread

• Yt = 10-year UK Treasury bond yield, Xt = 3-month yield (annualised)
• Expectations theory of the term structure: suggests that these should be cointegrated with a coefficient of unity
1. ADF tests (not reported) indicate Xt , Yt ∼ I(1)
2. Since θ = 1 is known, compute the spread

ξt := Yt − θXt = Yt − Xt

• perform an ADF (constant only) test on {ξt}
• tDF = −3.86, if lag order chosen by AIC
• ADF critical values can be used here: c0.01 = −3.43: reject the null of a unit root in {ξt}
• Conclusion: 10-year and 3-month interest rates are cointegrated, with a unit coefficient
• evidence is consistent with the expectations theory

IV.176
Cointegration: extensions

• Cointegration is not limited to pairs of variables
• it is possible that Xt , Yt , Zt ∼ I(1) have (one or two) common stochastic trends
• then Yt − θX Xt − θZ Zt ∼ I(0)
• Handled similarly to the bivariate case:
• now regress Yt on Xt and Zt : consistent for (θX , θZ)
• perform an ADF test on ξ̂t := Yt − θ̂X Xt − θ̂Z Zt
• The I(0)/I(1) distinction is a crude dichotomy:
• no series are ever ‘exactly I(1)’, nor are their equilibrium errors ‘exactly I(0)’
• but these are tolerably good approximations to the behaviour of, and relationships between, many macro time series

IV.177
Lessons learned

• If Xt , Yt ∼ I(1):
• regression of Yt on Xt consistently estimates the cointegrating relationship, if there is one
• otherwise, estimates can spuriously indicate a significant relationship
• you have to check the order of integration before running time series regressions: often the ‘eyeball ADF test’ is sufficient
• and then check the order of integration of your residuals!
• Spurious regression only arises when regressions are ‘unbalanced’
• when the stochastic trend in Yt does not appear on the r.h.s.
• won’t arise in a regression with lagged Yt , e.g. in

Yt = β0 + β1 Yt−1 + γ2 Xt + ut

since Yt and Yt−1 trivially have a common stochastic trend
• won’t arise if Yt ∼ I(0), even if Xt ∼ I(1): OLS consistently estimates a zero coefficient on Xt in this case

IV.178