Emetnotes 1
Joshua Chan
Research School of Economics,
Australian National University
Contents (excerpt)
1 A MATLAB Primer
1.5 Graphics
2 Principles of Forecasting
4.2 Likelihood
6.2 Estimation
A MATLAB Primer
The official MATLAB documentation is available at
http://www.mathworks.com/help/techdoc/.
In addition, the command help function_name gives information about the function
function_name. In general, whenever you need to look up information about a specific
function or topic, just press F1 for the MATLAB Help. Alternatively, select Help ->
Product Help in the toolbar of the MATLAB command window.
The most fundamental objects in MATLAB are, not surprisingly, matrices. For instance,
to create the 1 × 3 row vector a, enter into the MATLAB command window:
>> a = [1 2 3]
a =
1 2 3
To create a matrix with more than one row, use semi-colons to separate the rows. For
example,
A = [1 2 3; 4 5 6; 7 8 9];
creates the 3 × 3 matrix A. It is worth noting that MATLAB is case sensitive for variable
names and built-in functions. That means MATLAB treats a and A as different objects.
To display the i-th element in a vector x, just type x(i). For example,
>> a(2)
ans =
2
refers to the second element in a. Similarly, one can access a particular element in a matrix
X by specifying its row and column number (row first, followed by column). For instance,
>> A(2,3)
ans =
6
displays the (2, 3)-entry of the matrix A. Alternatively, one can also refer to the elements
of a matrix with a single subscript, e.g., A(i). This is because MATLAB stores matrices
as a single column — composed of all of the columns from the matrix, each appended
to the previous. Thus, A(8) and A(2,3) refer to the same element. To display multiple
elements in the matrix, one can use expressions involving colons. For example,
>> A(1,1:2)
ans =
1 2
displays the first and second element in the first row, whereas
>> A(:,2)
ans =
2
5
8
To perform numerical computations, we will need some basic matrix operations. In MAT-
LAB, the following matrix operations are available: addition +, subtraction -, multiplication *,
exponentiation ^, transpose ', and the left and right matrix divisions \ and /.
For example,
>> a’
ans =
1
2
3
and
>> a*A
ans =
30 36 42
give respectively the transpose of a and the product of a and A. Other operations are
obvious except for the matrix divisions \ and /. If B is an invertible square matrix and
b is a compatible vector, then x = B\b is the solution of Bx = b and x = b/B is the
solution of xB = b. In other words, B\b gives the same result (in principle) as B^(-1)b,
though they compute their results in different ways. Specifically, the former solves the
linear system Bx = b for x by Gaussian elimination, whereas the latter first computes
the inverse B^(-1) and then multiplies it by b. As such, the second method is in general
slower, as computing the inverse of a matrix is time-consuming (and less accurate).
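As a quick numerical check of this equivalence, one can compare the two approaches on a small made-up system (the matrix B and vector b below are illustrative, not taken from the text):

```matlab
% Solve Bx = b in two ways and compare the results;
% B and b are made-up examples
B = [4 1; 2 3];          % an invertible 2-by-2 matrix
b = [1; 2];
x1 = B \ b;              % Gaussian elimination (preferred)
x2 = inv(B) * b;         % explicit inverse (slower, less accurate)
disp(norm(x1 - x2))      % the difference is negligible
```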
It is important to note that although addition and subtraction are element-wise operations,
the other matrix operations listed above are not — they are matrix operations. For
example, A^2 gives the square of the matrix A, not a matrix whose entries are the squares
of those in A. One can make the operations *, \, /, and ^ operate element-wise by
preceding them with a full stop. For example,
>> A^2
ans =
30 36 42
66 81 96
102 126 150
>> A.^2
ans =
1 4 9
16 25 36
49 64 81
In this section we list some useful built-in functions that will be used in later chapters.
Should you want to learn more about a specific function, say, eye, just type help eye in
the command window. The function diag extracts the main diagonal of a matrix and,
conversely, creates a diagonal matrix from a vector. For example,
>> diag(A)
ans =
1
5
9
>> diag(a)
ans =
1 0 0
0 2 0
0 0 3
If x is a vector, sum(x) returns the sum of the elements in x. For a matrix X, sum(X)
returns a row vector consisting of sums of each column, while sum(X,2) returns a column
vector of sums of each row. For example,
>> sum(A)
ans =
12 15 18
>> sum(A,2)
ans =
6
15
24
compute the sums of each column and row, respectively. For a positive definite matrix X,
chol(X) returns the Cholesky factorization C such that C′ C = X. For example,
>> B = [ 1 2 3; 0 4 5; 0 0 6];
>> chol(B’*B)
ans =
1 2 3
0 4 5
0 0 6
Exercise 1.4. Find the determinant of the matrix B by computing the product of the
diagonal elements. Check your result using det.
MATLAB has the usual control flow statements such as if-then-else, while and for,
and they are used in the same way as in other languages. For instance, the general form
of a simple if statement is
if condition
statements
end
The statements will be executed if the condition is true. Multiple branching is done by
using elseif and else. For example, the following codes simulate rolling a four-sided
die:
u = rand;
if u <= .25
disp(’1’);
elseif u <= .5
disp(’2’);
elseif u <= .75
disp(’3’);
else
disp(’4’);
end
Exercise 1.5. Write a MATLAB program to simulate flipping a loaded coin where the
probability of getting a head is 0.8.
while condition
statements
end
The statements will be repeatedly executed while the condition remains true. To illustrate
the while loop syntax, suppose we wish to generate a standard normal random variable
left truncated at 0 (i.e., positive). We can do that using the following simple while loop:
u = randn;
while u <= 0
u = randn;
end
Exercise 1.6. Luke and Leia are playing a coin-flipping game with a supposedly fair coin.
Each time it comes up head, Luke pays Leia one dollar; if it is tail, Leia pays Luke one
dollar. At the beginning of the game Luke has ten dollars and Leia five. The game is over
when either one is broke. Write a MATLAB program to simulate one such game.
Another useful control flow statement is the for loop, whose general form is
for counter = range
statements
end
Unlike a while loop, the for loop executes the statements for a pre-fixed number of times.
As an example, the following codes generate five draws from the truncated standard normal
distribution.
x = zeros(5,1);
for i = 1:5
    u = randn;
    while u <= 0
        u = randn;
    end
    x(i) = u;
end
Exercise 1.7. Consider the coin-flipping game between Luke and Leia in Exercise 1.6.
Simulate that game 1000 times and record the winner. What is the number of times Luke
win?
In previous sections we have introduced some built-in functions in MATLAB. For instance,
sqrt is a function that takes an argument and returns an output (its square root). Later on
we will need to create our own functions that take one or more input arguments, operate
on them, and return the results. One way to create new functions is through function
handles. For example,
f = @(x) x.^2 + 5*x - 10;
creates the function f(x) = x^2 + 5x - 10 with the function handle f. To evaluate f(10),
we can type feval(f,10) or more simply,
>> f(10)
ans =
140
Function handles can be passed to other functions as inputs. For instance, if we want to
find the minimum point of f(x) in the interval (-10, 10), we can use the built-in function
fminbnd:
>> [xmin, fmin] = fminbnd(f, -10, 10)
xmin =
-2.5000
fmin =
-16.2500
Note that fminbnd takes three inputs (a function handle, and the two end-points of the
interval) and returns two outputs (the minimizer and the minimal value). In our example,
the minimizer of f(x) in (-10, 10) is -2.5, where its value is -16.25. Creating a function
that takes more than one input is just as easy. For example, g = @(x,y) x.^2 + y.^2;
creates a function g of two variables.
Exercise 1.8. Create a function in MATLAB that can be used to evaluate the standard
normal density
φ(x) = (2π)^(-1/2) e^(-x^2/2).
For more complex functions that involve multiple lines and intermediate variables, we need
the command function. For example, the following codes take a column vector of data
and compute its mean and standard deviation:
% function STAT
% INPUT: a column vector of data x
% OUTPUT: mean and standard deviation of x.
function [meanx, stdevx] = stat(x)
n = length(x);
meanx = sum(x)/n;
stdevx = sqrt(sum(x.^2)/n - meanx.^2);
It is important to note that all the codes must be written and saved in a separate m-file.
Also, the name of the file should coincide with the name of the function; in this case the
file must be called stat.m. After saving the file, it can be used the same way as other
built-in functions, for example:
meanx =
0.0530
stdx =
0.9902
Exercise 1.9. Write a function in MATLAB that takes two series of the same length, say,
x = (x1 , . . . , xT )′ and y = (y1 , . . . , yT )′ , and computes their sample correlation coefficient
rxy defined as
r_xy = Σ_{t=1}^T (x_t − x̄)(y_t − ȳ) / √( Σ_{t=1}^T (x_t − x̄)² · Σ_{t=1}^T (y_t − ȳ)² ).
1.5 Graphics
MATLAB has several high-level graphical routines and very extensive plotting capabilities.
It allows users to create various graphical objects including two- and three-dimensional
graphs. In addition, one can also have a title on a graph, add a legend, change the font
and font size, label the axes, etc. For more information, in the command window click on
Help and next select Demos. Then choose Graphics followed by 2-D Plots.
Figure 1.1: Plot of the graph y = sin(x) from x = 0 to x = 2π.
In MATLAB the most basic function used to create 2-D graphs is plot. For example, to
make a graph of y = sin(x) on the interval from x = 0 to x = 2π, we use the following
codes:
x = 0:.01:2*pi;
y = sin(x);
plot(x,y);
xlabel(’x’);
ylabel(’y’);
The graph produced is reported in Figure 1.1. Note that the command x = 0:.01:2*pi;
creates a vector with components ranging from 0 to 2π in steps of 0.01. Another useful
command to create a grid is linspace (use help linspace to learn more about this
function).
Exercise 1.12. Plot the standard normal density function in Exercise 1.8 from −3 to 3.
Another useful function is hist, which allows us to plot histograms. For example,
hist(randn(1000,1),50);
creates a histogram where the 1000 standard normal draws are put into 50 equally spaced
bins.
Figure 1.2: Histogram of 1000 standard normal draws put into 50 equally spaced bins.
Exercise 1.13. Generate 1000 standard normal draws and use the function histc to
count the number of draws that fall into the 5 bins: (−∞, −2), [−2, −1), [−1, 1), [1, 2) and
[2, ∞).
It is often desirable to plot several graphs in the same figure window. For this purpose
we need the function subplot. The function subplot(i,j,k) takes three arguments: the
first two tell MATLAB that an i × j array of plots will be created, and the third is the
running index that indicates the k-th subplot is currently generated. Suppose we wish
to plot the functions y = sin(0.5x2 ) and y = sin(2x) in the same figure window. A little
modification of the above codes will accomplish this goal:
x = 0:.01:2*pi;
y1 = sin(x.^2/2);
y2 = sin(2*x);
subplot(1,2,1); plot(x,y1); title(’y = sin(0.5x^2)’);
subplot(1,2,2); plot(x,y2); title(’y = sin(2x)’);
Figure 1.3: Plots of the graphs y = sin(0.5x2 ) and y = sin(2x) from 0 to 2π.
In addition, one can also easily produce 3-D graphical objects in MATLAB. To illustrate
various useful routines, suppose we want to plot the density function of the bivariate
normal distribution (see Section 4.1) given by
f(x, y; ρ) = 1/(2π√(1 − ρ²)) · exp( −(x² − 2ρxy + y²) / (2(1 − ρ²)) ).
As in plotting a 2-D graph, we first need to build a grid, and this can be done with the
function meshgrid. After computing the values of the function at each point on the grid,
we can plot the 3-D graph using mesh. For example, we use the following codes
rho = .6;
[x y] = meshgrid(-2:.1:2, -2:.1:2); % to build a 2D grid
z = 1/(2*pi*sqrt(1-rho^2)) * exp(-(x.^2 -2*rho*x.*y + y.^2)/(2*(1-rho^2)));
mesh(x,y,z);
to plot the bivariate normal density function with ρ = 0.6 in Figure 1.4.
Figure 1.4: The density function of the bivariate normal distribution with ρ = 0.6.
A contour plot of the same density can be created with the function contour:
contour(x,y,z);
Figure 1.5: A contour plot of the bivariate normal density function with ρ = 0.6.
Exercise 1.14. Plot the bivariate t density function (see Section 4.1) given by
g(x, y; ν) = 1/(2π) · (1 + (x² + y²)/ν)^(−(ν/2 + 1)).
Create also a contour plot.
A sparse matrix is simply a matrix that contains a large proportion of zeros. As such,
computations involving sparse matrices can typically be done much faster than those
involving full matrices. In addition, as most of the elements in a sparse matrix are zeros,
storage cost of a sparse matrix is also often small. In later chapters we will deal with some
large but sparse matrices. So it would be useful to first discuss how to handle them in
MATLAB.
Perhaps the most useful function for creating sparse matrices is sparse. For example,
suppose the matrix
W = [ 1 0 0 0 0
      0 1 0 0 0
      0 0 2 0 0
      0 0 0 3 1 ]
is stored as a full matrix in MATLAB. The command sparse(W) converts W to sparse
form by squeezing out any zero elements:
>> sparse(W)
ans =
(1,1) 1
(2,2) 1
(3,3) 2
(4,4) 3
(4,5) 1
Notice that only the non-zero elements in W are stored. In general, we can create a matrix
S by the command S = sparse(i,j,s,m,n), which uses vectors i, j, and s to generate
an m × n sparse matrix such that S(i(k), j(k)) = s(k). For example, to create the matrix
W above, we first need to build a vector s that stores all the non-zero elements:
s = [1 1 2 3 1]’;
Next, we create a vector i that stores the row position for each element in s. For example,
the first element in s should be in the first row, the second element in second row, and so
on. We then do the same thing for the column positions and store them in the vector j:
i = [1 2 3 4 4]’;
j = [1 2 3 4 5]’;
Finally,
W = sparse(i,j,s,4,5);
There are several useful built-in functions for creating special sparse matrices. For exam-
ple,
I = speye(100);
creates the 100 × 100 sparse identity matrix. Of course we can accomplish the same goal
by using
I = sparse(1:100,1:100,ones(1,100));
though the latter is clumsier. Another useful function is spdiags, the sparse version of
diag, that can be used to extract and create sparse diagonal matrices. Use help spdiags
to learn more about this function.
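As an illustration of spdiags (this example is ours, not from the text), the following builds a tridiagonal sparse matrix from its three diagonals:

```matlab
% Construct a 5-by-5 tridiagonal sparse matrix with 2 on the main
% diagonal and -1 on the first sub- and super-diagonals
n = 5;
e = ones(n,1);
T = spdiags([-e 2*e -e], -1:1, n, n);
full(T)    % convert to full form to inspect the result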
Exercise 1.16. Create the matrix H in Exercise 1.15 by using only the function spdiags.
As mentioned earlier, one main advantage of working with sparse rather than full matrices
is that computations involving sparse matrices are usually much quicker. For instance,
it takes about 2.7 seconds to obtain the Cholesky decomposition of the full 5000 × 5000
identity matrix:
whereas the same operation takes only 0.15 second for a sparse 5000×5000 identity matrix:
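The comparison can be reproduced along the following lines; the exact timings are machine-dependent, and the 2.7 and 0.15 second figures quoted above are those in the text:

```matlab
% Time the Cholesky decomposition of a full versus a sparse identity
n = 5000;
tic; chol(eye(n));   t_full = toc;     % full 5000-by-5000 identity
tic; chol(speye(n)); t_sparse = toc;   % sparse 5000-by-5000 identity
fprintf('full: %.2fs, sparse: %.2fs\n', t_full, t_sparse)
```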
MATLAB provides various built-in optimization routines. In this section we discuss some
of them that are useful in later chapters. Note that all the optimization routines in
MATLAB are framed in terms of minimization. In order to perform maximization, some
minor changes to the objective function are required. More precisely, suppose we want to
maximize the function f(x) and find a maximizer x_max = argmax_x f(x). Instead of the
original maximization problem, consider minimizing −f(x) and noting that
max_x f(x) = −min_x (−f(x)) and argmax_x f(x) = argmin_x (−f(x)).
Hence, without loss of generality, we will focus on minimization routines. One basic
minimization function is fminbnd, which finds the minimum of a single-variable function
on a fixed interval. To illustrate its usage, suppose we wish to minimize the function
f (x) = sin x2 over the interval [0, 3] (see Figure 1.6).
Figure 1.6: Plot of the function f(x) = sin(x²) on the interval [0, 3].
After creating the function handle
f = @(x) sin(x.^2);
we pass f to fminbnd, which takes three inputs — the function handle, lower and upper
bounds of the interval — and gives two outputs — the minimizer and value of the function
evaluated at the minimizer:
>> [xmin, fmin] = fminbnd(f, 0, 3)
xmin =
2.1708
fmin =
-1.0000
The function fminbnd can only be used to minimize a univariate function on a closed
interval. For multivariate minimization, one very useful function is fminsearch, which finds
the unconstrained minimum of a function of several variables. fminsearch takes two in-
puts, namely, the function handle and a starting value. Like fminbnd, fminsearch gives
two outputs: the minimizer and value of the function evaluated at the minimizer. As an
example, suppose we wish to maximize the bivariate normal pdf
f(x1, x2; ρ) = 1/(2π√(1 − ρ²)) · exp( −(x1² − 2ρx1x2 + x2²) / (2(1 − ρ²)) )
with respect to x = (x1 , x2 ) while fixing ρ. To this end, first define g(x1 , x2 ) = −f (x1 , x2 ; ρ):
rho = .6;
g = @(x) -1/(2*pi*sqrt(1-rho^2)) ...
*exp(-(x(1).^2 -2*rho*x(1).*x(2) +x(2).^2)/(2*(1-rho^2)));
Note that the variable x is a 2 × 1 vector. Then, we pass g to fminsearch with starting
values, say, [1 -1]:
>> xmode = fminsearch(g, [1 -1]);
The returned minimizer is numerically (0, 0). That is, the mode of f(x1, x2; ρ) is
x = (0, 0), and f(0, 0) = 0.1989.
A good place to learn more about the major functionality in MATLAB is the MATLAB
Getting Started Guide available at
http://www.mathworks.com/help/pdf_doc/matlab/getstart.pdf.
Principles of Forecasting
One simple forecasting rule is simply to use the last observed value yT as a prediction
for yT +h . This rule of thumb is trivial to implement and might be reasonable in some
situations. However, one obvious drawback is that it does not take advantage of all
available information — it ignores all the data except the last observed value yT — as
such, the point forecast ŷT+h = yT is likely to be poor in most applications. A better rule
of thumb might be to construct forecasts as weighted averages of past observations, where
higher weights are given to more recent observations than distant ones. Seemingly this is
more reasonable, and includes the first forecasting rule as a special case, where the weight
for the last observation is one and all other weights are zero. In fact, this is the main
idea behind a class of popular forecasting methods called exponential smoothing. Notice
that neither forecasting rule requires the user to specify a statistical model — the data
generating process for the observations y1 , . . . , yT is not specified — and we don’t know
how the data arise or what features they have. For this reason, these forecasting rules
are sometimes described as ad hoc forecasting methods. The main advantages of these ad
hoc rules of thumb are that they are seemingly reasonable, and are typically very easy to
implement. The main drawback, however, is that without a formal model it is difficult to
analyze the statistical properties of the forecasting rules.
The other approach of forecasting is model-based — before we can make a forecast for
yT+h, we must first build a statistical model that summarizes the dynamic patterns in the
historical data, and estimate the model parameters using past data. This model-based
approach forces us to think carefully about how the data are generated, and what key
features of the data we want to model. The idea is that if we have a good model that fits
observed data well, then this model might give reasonable forecasts provided there is
something to be learned from the past data. The main advantage of this approach is that it
gives us a framework to model relevant features of the data important in producing good
forecasts. In addition, since we have a statistical model, we can analyze the statistical
properties of the point forecast or even go beyond a mere point forecast. The price of the
model-based approach, on the other hand, is that it requires some background in statistical
inference, and the implementation is often computationally intensive.
Before we go into the details of both approaches — ad hoc methods are discussed in
Chapter 3 and model-based forecasting is covered in Chapters 4–6 — we will discuss in
this chapter various fundamental concepts in forecasting, including forecast horizon, point
vs density forecasts, direct vs iterated forecasts, loss function, and methods for comparing
non-nested forecast models. In what follows, we will first have an overview of several
popular forecasting models. Note that throughout this chapter we assume the model
parameters are known, and the estimation problem is discussed in later chapters.
T = 50;
y1 = zeros(T,1); y2 = zeros(T,1);
rho1 = .5; rho2 = 1;
sig = 1;
for t=1:T
if t == 1
y1(t) = 1; y2(t) = 1;
else
y1(t) = rho1*y1(t-1) + sig*randn;
y2(t) = rho2*y2(t-1) + sig*randn;
end
end
Figure 2.1: Two realizations of the AR(1) processes with ρ = 0.5 (left panel) and ρ = 1
(right panel).
We use the above codes to generate two realizations for each AR(1) process, and the results
are plotted in Figure 2.1. Note that the two AR(1) processes behave quite differently. In
particular, the AR(1) process with ρ = 0.5 (left panel) tends to fluctuate around 0, whereas
the one with ρ = 1 has an apparent trend (right panel).
Exercise 2.1. Suppose X ∼ N(0, In ), where In is the n × n identity matrix. If we let
Y = µ + CX, show that Y ∼ N(µ, Σ), where Σ = CC′ .
Exercise 2.2. Write a MATLAB program to generate data from the following AR(2)
process with a drift:
yt = µ + ρ1 yt−1 + ρ2 yt−2 + εt , εt ∼ N(0, σ 2 ),
for t = 1, 2, . . . , T , where the process is initialized with y−1 = y0 = µ. Plot one realization
of the AR(2) process with T = 50, µ = 5, ρ1 = 0.5, ρ2 = 0.1 and σ 2 = 2.
Example 2.2 (MA(1) process). Suppose ε1 , ε2 , . . . are independent and identically
distributed (iid) N(0, σ²) random variables and ε0 = 0. If we generate yt according to the
equation
yt = θεt−1 + εt, (2.2)
then the resulting series is called a moving average process of order one, or an MA(1)
process.
By construction, the data generated from the AR(1) and MA(1) models exhibit autocorre-
lation (and quite distinct autocorrelation patterns). In addition, the random walk process
is also able to produce data with a stochastic trend. In the following example, we describe
a class of simple models that can generate both trends and cycles.
Example 2.3 (Models with Trend and Cycle). Typical features of time-series data
include trends and cycles. In this example we study a class of simple models that can
generate both features. Suppose yt is generated via the following equation:
yt = mt + ct + εt , εt ∼ N(0, σ 2 ), (2.3)
Figure 2.2: The trend-cycle models with mt = 0 and ct = 4cos(πt/5) + 2sin(πt/5) (left
panel); and mt = 0.5t and ct = 4cos(πt/5) + 2sin(πt/5) (right panel).
A popular choice for the trend component mt is the linear function
mt = a0 + a1 t,
where a0 is the level and a1 is the slope of the trend. A popular choice for ct is the
following periodic function:
ct = b1 cos(ωt) + b2 sin(ωt),
where b1 and b2 control the magnitude of the cycle component whereas ω determines the
frequency with which ct completes a cycle. We use the following codes to generate two
realizations from (2.3), and the series are plotted in Figure 2.2.
T = 50; sig = 1;
t = (1:T)’; % construct the vector [1 2 3 ...]’
a0 = 0; a1 = 0; b1 = 4; b2 = 2; omega = 2*pi/10;
y1 = a0 + a1*t + b1*cos(omega*t) + b2*sin(omega*t) + sig*randn(T,1);
y2 = 0.5*t + b1*cos(omega*t) + b2*sin(omega*t) + sig*randn(T,1); % linear trend mt = 0.5t
For the trend-cycle model on the left, the trend component is set to be zero: mt = 0, and
the time series only exhibits cyclical behavior, whereas the time series on the right has a
linear trend mt = 0.5t. Moreover, in both examples, it takes 10 (2π/ω) periods for ct to
complete one cycle.
Exercise 2.3. Write a MATLAB program to generate data from the trend-cycle model
in Example 2.3 with a quadratic trend component
mt = a0 + a1 t + a2 t2 ,
ct = b1 cos(ωt) + b2 sin(ωt).
In addition to the time series of interest, yt , often we observe other relevant variables. For
instance, in forecasting the international visitor arrivals in Australia, we might also have
other seemingly relevant variables such as AUD/USD exchange rate, oil price (fuel costs),
GDP growth in major tourist markets, etc. By utilizing this additional information, one
might be able to produce better forecasts. One of the popular ways to incorporate these
additional variables is via the linear regression model discussed in the following example.
Example 2.4 (Linear regression model). Let xt−1 = (xt−1,1, . . . , xt−1,k) be a 1 × k
vector of relevant variables in predicting yt. The variables xt−1 are often called regressors,
exogenous variables, or explanatory variables. For instance, if yt is the unemploy-
ment rate at time t, xt−1 might contain inflation rate, GDP growth rate and unemployment
rate at time t − 1. The linear regression model specifies the linear relationship between yt
and the regressors as follows:
yt = xt−1 β + εt, εt ∼ N(0, σ²), (2.4)
or equivalently,
y = Xβ + ε, ε ∼ N(0, σ 2 IT ), (2.5)
where IT is the T × T identity matrix, and
y = (y1, y2, . . . , yT)′, ε = (ε1, ε2, . . . , εT)′, and X is the T × k matrix whose t-th row is
xt−1 = (xt−1,1, xt−1,2, . . . , xt−1,k).
By a simple change of variable, (2.5) implies that y follows a multivariate normal distri-
bution: y ∼ N(Xβ, σ 2 IT ). In particular, the expected value of y is Xβ.
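To make the result y ∼ N(Xβ, σ²IT) concrete, one can simulate a draw of y from (2.5) directly; all parameter values below are made up for illustration:

```matlab
% Simulate one draw from the linear regression model (2.5);
% beta and sig are hypothetical values
T = 100;
beta = [1; 0.5]; sig = 2;        % made-up parameter values
X = [ones(T,1) randn(T,1)];      % a constant and one random regressor
y = X*beta + sig*randn(T,1);     % a draw of y ~ N(X*beta, sig^2*I_T)
```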
Exercise 2.4. Write the trend-cycle model in Example 2.3 as a linear regression model
by defining X and β appropriately.
The forecast horizon is simply the number of periods between today and the date of the
forecast. For example, if we have data on quarterly inflation rates and we want to forecast
the inflation rate one year from now, then the forecast horizon is four (four quarters), and
we speak of a four-step-ahead forecast. Should we have monthly data, a forecast one year
from now is a 12-step-ahead forecast. In general, given a particular data frequency, an
h-step-ahead forecast makes a statement about the future observable that is h periods
ahead into the future. Consideration of the forecast horizon is important for two main
reasons. First, forecast horizon affects the forecast, and how the forecast is made. For
instance, two different approaches for producing an h-step-ahead forecast with h > 1 are
available, namely, the iterated forecast and the direct forecast. We will illustrate these two
approaches via the following examples.
Example 2.5 (h-step-ahead forecasts for the AR(1) process). Consider the AR(1)
process in Example 2.1:
yt = ρyt−1 + εt , εt ∼ N(0, σ 2 ), (2.6)
and suppose we wish to make a one-step-ahead forecast, denoted as ŷT+1. Given the model
parameters θ = (ρ, σ 2 )′ and the last observed data point yT , one could use the conditional
expectation of yT +1 given θ and all the information up to time T as a point forecast.
Specifically, let It denote the information set at time t that summarizes all the available
information up to time t. Then the point forecast for yT+1 is ŷT+1 = E(yT+1 | IT, θ).
Later we will show that this conditional expectation of yT +1 is optimal in a well-defined
sense. In our AR(1) setting, the information set It is simply all the observed data up to
time t, i.e., It = {y0, y1, . . . , yt}. Moreover, using (2.6) with t = T + 1, we have
yT+1 = ρyT + εT+1, εT+1 ∼ N(0, σ²),
so that taking the conditional expectation gives ŷT+1 = E(yT+1 | IT, θ) = ρyT.
Now, let’s consider how we can produce a two-step-ahead forecast. Using (2.6) once again
with t = T +2, one might think that the conditional expectation E(yT +2 | IT +1 , θ) = ρyT +1
could be used as a point forecast for yT +2 . However, this approach is not feasible as it
requires the information at time t = T + 1. Particularly, we need to know yT +1 in order
to compute E(yT +2 | IT +1 , θ), but of course we only have information up to time t = T .
One feasible approach is to iterate (2.6) to get
yT +2 = ρyT +1 + εT +2 = ρ(ρyT + εT +1 ) + εT +2
= ρ2 yT + ρεT +1 + εT +2 ,
where εT +1 and εT +2 are independent N(0, σ 2 ). By taking the expectation of the above ex-
pression given the model parameters θ and the information set IT , we can use E(yT +2 | IT , θ) =
ρ2 yT as a 2-step-ahead forecast. By exactly the same reasoning, an h-step-ahead iterated
forecast is given by E(yT +h | IT , θ) = ρh yT .
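The iterated forecasts ρ^h yT for h = 1, . . . , H are straightforward to compute in MATLAB; the parameter values below are hypothetical, chosen only for illustration:

```matlab
% Iterated h-step-ahead forecasts for the AR(1) model;
% rho, yT, and H are made-up values
rho = 0.8; yT = 1.5; H = 4;
yhat = rho.^(1:H)' * yT;    % column of forecasts for h = 1,...,H
```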
For one-step-ahead forecasts, the iterated and direct forecasts coincide. But for longer
forecast horizons, the two approaches generally give distinct results. In empirical ap-
plications, it is often found that iterated forecasts perform better than direct forecasts,
especially when the forecast horizon is long. However, there are situations where com-
puting iterated forecasts is not feasible, and we will discuss one such case in the next
example.
Example 2.6 (h-step-ahead forecasts for linear regression model). For the linear
regression model in Example 2.4
yt = xt−1 β + εt , εt ∼ N(0, σ 2 ),
the iterated forecasting approach is not feasible for h > 1. This is because in predicting
yT +h , we need the variables xT +h−1 that are not in the information set IT available at
time t = T . Moreover, the regression model does not specify a recursive relation between
xt and xt−1 . Instead, we could follow the direct forecasting approach and consider the
following linear regression model:
yt = xt−h β̃ + εt, εt ∼ N(0, σ²).
We can then use the conditional expectation E(yT+h | IT, θ̃) = xT β̃, where θ̃ = (β̃′, σ²)′,
as an h-step-ahead forecast.
Exercise 2.5. Suppose we wish to make an iterated two-step-ahead forecast using the
AR(2) model in Exercise 2.2. Compute the conditional expectation E(yT +2 | IT , θ), where
IT is the information set at time t = T and θ = (µ, ρ1 , ρ2 , σ 2 )′ .
The second reason why forecast horizon is important is that it affects the types of models
used for making forecasts. For example, in forecasting GDP of a growing economy in
the near future (e.g. next quarter), short-term fluctuations over the business cycle are
important for making an accurate forecast. On the other hand, should we want to make
a long horizon forecast over, say, the next decade, short-term fluctuations become less
important. Instead, it is far more essential to model accurately the long-term trend, as
the next example illustrates.
Consider again the trend-cycle model
yt = mt + ct + εt, εt ∼ N(0, σ²),
where mt is the trend component and
ct = b1 cos(ωt) + b2 sin(ωt)
is the cycle component. It is easy to see that the cycle component ct is bounded in t. In fact,
we have
|ct| = |b1 cos(ωt) + b2 sin(ωt)| ≤ |b1 cos(ωt)| + |b2 sin(ωt)| ≤ |b1| + |b2|.
In other words, no matter how large t becomes, the value of |ct | is at most |b1 | + |b2 |. On
the other hand, the trend component, by construction, is increasing (or decreasing). For
example, if we specify a linear trend
mt = a0 + a1 t,
then the value of |mt | is unbounded. Since |ct | is bounded while |mt | is not, this means
that |Eyt /mt | approaches 1 when t is sufficiently large. Since the trend component dom-
inates over the long run, it is paramount to model it correctly for long horizon forecasts.
To appreciate this point, suppose instead of the linear trend above, we use a different
specification for mt , say, a quadratic trend
mt = a0 + a1 t + a2 t2 .
Then we would in general have a very different forecast compared to the original linear
specification, as a quadratic trend grows (or declines) much faster than a linear one over
a long horizon.
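To see how quickly the two trend specifications diverge over a long horizon, one can plot them side by side; the coefficients below are made up for illustration:

```matlab
% Compare a hypothetical linear and quadratic trend over 100 periods
t = (1:100)';
m_lin  = 1 + 0.5*t;               % made-up linear trend
m_quad = 1 + 0.5*t + 0.05*t.^2;   % made-up quadratic trend
plot(t, [m_lin m_quad]);
legend('linear trend', 'quadratic trend');
```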
In discussing the forecast horizon in Section 2.2, we focus on one particular type of forecast,
namely, point forecast. That is, we produce a single number as our best guess for the
quantity we are interested in predicting. To be more precise, we use the conditional
expectation E(yT +h | IT , θ) as our h-step-ahead forecast for the future observable yT +h ,
where IT is the information set at time t = T and θ is the set of model parameters.
However, in many situations a point forecast is not informative enough to aid decision
making. As a simple example, consider the problem of forecasting the GDP growth rate
next quarter. We might give a point forecast of 1% as our best guess. But how much
confidence do we place on this number? Or put differently, what is the variability of the
point forecast? Suppose the growth rate can be as low as -5% in the worst scenario and
5% in the best turn of events. Specifically, let’s say with probability 0.95 the growth rate
falls within the interval (−5%, 5%), and we call this an interval forecast. Even though
the point forecast represents our best guess, the additional information manifested in the
interval forecast is obviously valuable as it helps our decision making (see the discussion
on loss function in the next section).
Can we do better than an interval forecast? Or to ask the question in a different way:
what is the largest amount of information we can realistically obtain regarding the quantity
yT +h ? Since the future observable yT +h is a random variable, we can characterize all the
information about yT +h by a probability density function, and we call this a density
forecast. Now the question is, how can we obtain a suitable density function for yT +h ?
Suppose for the moment that the values of the model parameters θ are known (we will
relax this seemingly unrealistic assumption in later chapters). Since at time t = T all the
available information is summarized by the information set IT , the conditional density
f (yT +h | IT , θ) represents the best available information about yT +h at time t = T , and
we will use this as a density forecast for yT +h .
Example 2.8 (Interval and density forecasts for the AR(1) process). Consider
again the AR(1) process in Example 2.1:

$$ y_t = \rho y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \sigma^2). \tag{2.8} $$

Given the time-series history IT = {y0 , y1 , . . . , yT } and the model parameters θ = (ρ, σ 2 )′ ,
recall that we showed in Example 2.5 that the conditional expectation E(yT +1 | IT , θ) is
simply ρyT , and we used this as a point forecast for yT +1 . Of course, we have much more
information than a mere point forecast. Once again we use (2.8) with t = T + 1 to obtain
yT +1 = ρyT + εT +1 , εT +1 ∼ N(0, σ 2 ).
By a simple change of variable (see Exercise 2.1), it follows that (yT +1 | yT , θ) ∼ N(ρyT , σ 2 ).
That is, a density forecast for yT +1 is the normal density f (yT +1 | IT , θ) with mean ρyT
and variance σ 2 . From this one can easily derive an interval forecast. For instance, a 95%
interval forecast is simply: (ρyT − 1.96σ, ρyT + 1.96σ).
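Example 2.8 can be checked by simulation. The sketch below (in Python for convenience; the parameter values ρ = 0.8, σ = 1 and last observation yT = 2 are arbitrary) computes the point and 95% interval forecasts, and confirms by Monte Carlo that the interval covers yT+1 about 95% of the time:

```python
import numpy as np

rng = np.random.default_rng(0)
rho, sigma, yT = 0.8, 1.0, 2.0   # arbitrary parameter values and last observation

point = rho * yT                                       # E(y_{T+1} | I_T, theta)
lo, hi = point - 1.96 * sigma, point + 1.96 * sigma    # 95% interval forecast

# Monte Carlo check: draw y_{T+1} = rho*y_T + eps_{T+1}, eps ~ N(0, sigma^2)
draws = rho * yT + sigma * rng.standard_normal(100_000)
coverage = np.mean((draws > lo) & (draws < hi))
print(point, (lo, hi), coverage)   # coverage should be close to 0.95
```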
Exercise 2.6. Under the AR(1) model in (2.8), derive a two-step-ahead 95% interval
forecast for yT +2 .
Exercise 2.7. Instead of the normal AR(1) model, let’s consider the following Student-t
variant:
yt = ρyt−1 + εt , εt ∼ t(ν, 0, σ 2 ).
Derive a one-step-ahead point forecast and a 95% interval forecast for yT +1 .
Example 2.9 (Density forecast for the MA(1) process). Suppose we wish to obtain
a density forecast for yT +1 under the MA(1) process in Example 2.2:
yt = θεt−1 + εt , (2.9)
where ε1 , . . . , εT are iid N(0, σ 2 ) random variables and ε0 = 0. To this end, we want to
compute the conditional density f (yT +1 | IT , θ), where IT = {y1 , . . . , yT } and θ = (θ, σ 2 )′ .
We first introduce the following useful notation: for a time series z1 , z2 , . . . , zt , we let
z1:t = (z1 , z2 , . . . , zt )′ . For example, z1:T denotes the whole sample from t = 1 to t = T .
Now, note that y1:T = (y1 , . . . , yT )′ is a linear combination of normal random variables,
and therefore y1:T follows a multivariate normal distribution. More specifically, write (2.9)
in the following matrix form:
$$ \underbrace{\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_T \end{pmatrix}}_{y_{1:T}} = \underbrace{\begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ \theta & 1 & 0 & \cdots & 0 \\ 0 & \theta & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & \theta & 1 \end{pmatrix}}_{\Gamma(\theta)} \underbrace{\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \vdots \\ \varepsilon_T \end{pmatrix}}_{\varepsilon_{1:T}}. $$
Now to compute the conditional density f (yT +1 | IT , θ), note that the information set IT
is nothing other than the history of the time series up to time t = T , i.e., y1:T . Hence,
f (yT +1 | IT , θ) is simply the conditional density of yT +1 given y1:T and θ, and it can
be obtained in two steps. First, we show that the joint density of (y1:T , yT +1 ) is normal.
Then, by using some standard results for the normal density, we can derive the conditional
density of yT+1 given y1:T . Now, write

$$ \begin{pmatrix} y_{1:T} \\ y_{T+1} \end{pmatrix} = \begin{pmatrix} \Gamma(\theta) & 0 \\ b(\theta)' & 1 \end{pmatrix} \begin{pmatrix} \varepsilon_{1:T} \\ \varepsilon_{T+1} \end{pmatrix}, $$

where b(θ) is a T × 1 vector of zeros except for the last element, which is θ. In particular, we have Γ(θ)b(θ) = b(θ), Γ(θ)⁻¹b(θ) = b(θ), and b(θ)′b(θ) = θ². Once again the joint density of (y1:T , yT+1 ) is normal with mean vector 0 and covariance matrix

$$ \sigma^2 \begin{pmatrix} \Gamma(\theta) & 0 \\ b(\theta)' & 1 \end{pmatrix} \begin{pmatrix} \Gamma(\theta)' & b(\theta) \\ 0' & 1 \end{pmatrix} = \sigma^2 \begin{pmatrix} \Gamma(\theta)\Gamma(\theta)' & \Gamma(\theta)b(\theta) \\ b(\theta)'\Gamma(\theta)' & 1 + \theta^2 \end{pmatrix}. $$
Since the joint density is normal, the conditional density of yT+1 given y1:T is also normal, and we only need to determine the conditional mean and conditional variance. By standard results for the normal density, we have

$$ \hat y_{T+1} \equiv \mathrm{E}(y_{T+1} \mid y_{1:T}, \theta) = b(\theta)'\Gamma(\theta)'\left(\Gamma(\theta)\Gamma(\theta)'\right)^{-1} y_{1:T} = b(\theta)'\Gamma(\theta)^{-1} y_{1:T}, $$
$$ \mathrm{Var}(y_{T+1} \mid y_{1:T}, \theta) = \sigma^2(1 + \theta^2) - \sigma^2 b(\theta)'b(\theta) = \sigma^2. $$

In other words, the conditional density f (yT+1 | IT , θ) is normal with mean ŷT+1 and variance σ². Note that since ε1:T = Γ(θ)⁻¹y1:T , the conditional mean is simply ŷT+1 = θεT , as one would expect for an MA(1) process.
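The identities Γ(θ)b(θ) = b(θ), Γ(θ)⁻¹b(θ) = b(θ) and b(θ)′b(θ) = θ² stated above can be verified numerically, along with the normal-conditioning mean b(θ)′Γ(θ)⁻¹y1:T , which reduces to θεT (a Python sketch; T and θ are arbitrary):

```python
import numpy as np

T, theta = 5, 0.6                                        # arbitrary dimension and MA coefficient
G = np.eye(T) + theta * np.diag(np.ones(T - 1), k=-1)    # Gamma(theta): ones on the diagonal,
                                                         # theta on the first subdiagonal
b = np.zeros(T); b[-1] = theta                           # b(theta)

# The three identities used in the example
assert np.allclose(G @ b, b)                     # Gamma(theta) b(theta)      = b(theta)
assert np.allclose(np.linalg.solve(G, b), b)     # Gamma(theta)^{-1} b(theta) = b(theta)
assert np.isclose(b @ b, theta ** 2)             # b(theta)' b(theta)         = theta^2

# The conditional mean b(theta)' Gamma(theta)^{-1} y_{1:T} reduces to theta * eps_T
rng = np.random.default_rng(1)
eps = rng.standard_normal(T)
y = G @ eps                        # y_{1:T} = Gamma(theta) eps_{1:T}
yhat = b @ np.linalg.solve(G, y)
assert np.isclose(yhat, theta * eps[-1])
```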
Exercise 2.8. Consider the MA(1) model with a nonzero mean:

$$ y_t = \mu + \theta \varepsilon_{t-1} + \varepsilon_t, $$

where ε1 , . . . , εT are iid N(0, σ 2 ) random variables and ε0 = 0. Derive a density forecast for yT+1 .
Exercise 2.9. For the MA(1) model in (2.9), derive a two-step-ahead density forecast for
yT +2 .
It is important to emphasize that so far all the forecasts produced are conditioned on the
model parameters θ. For example, we use the conditional expectation E(yT +h | IT , θ) as
an h-step-ahead point forecast, where the condition set contains θ. Of course in a real
world application one never knows the values of the model parameters, and they have to
be estimated from the data. We will defer the estimation problem to later chapters, where
we will also discuss how one handles parameter uncertainty.
Forecasts are not made in a vacuum; they are made to guide decision making under
uncertainty. As such, in making a forecast, one needs to take into account the associated
loss if the forecast is wrong. The loss or economic cost associated with a forecast error
is summarized by the loss function. Consider, for example, the problem of predicting
whether or not it will rain tomorrow. There are only two possible outcomes: rainy or
sunny. If we think that it will rain tomorrow, we will bring an umbrella; otherwise we
will not. If it rains and we bring an umbrella, or when it is sunny and we do not have
an umbrella, then our loss is zero. On the other hand, if it rains and we do not have an
umbrella, we will get wet and incur a cost (disutility) of, say, 10 units. If it is sunny and
we bring an umbrella, though not as bad as getting wet, it is still a hassle and we put a
disutility of, say, 2 units on this forecast error.
Table 2.1: Losses associated with each action and outcome.

                        Rainy   Sunny
    Bring an umbrella     0       2
    No umbrella          10       0
Obviously, our action (bring an umbrella or not) will depend not only on the weather forecast, but also on the costs of the forecast errors, i.e., the loss function. To see this, suppose we take it as given that the probability of rain tomorrow is 0.2 (so the probability that it will be sunny is 0.8). If we bring an umbrella, our expected cost is 1.6 (0 × 0.2 + 2 × 0.8), while the expected cost of not bringing an umbrella is 2 (10 × 0.2 + 0 × 0.8). Thus if we are minimizing our expected cost, we will bring an umbrella when we leave home tomorrow, even though we think that it is much more likely to be sunny than rainy.
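The expected-cost comparison can be written out in a few lines (a Python sketch; the probabilities and losses are those of the hypothetical umbrella example in Table 2.1):

```python
p_rain = 0.2
# (loss if rainy, loss if sunny), as in Table 2.1
loss = {"bring umbrella": (0, 2), "no umbrella": (10, 0)}

# Expected cost of each action under the forecast probability of rain
expected_cost = {action: p_rain * rainy + (1 - p_rain) * sunny
                 for action, (rainy, sunny) in loss.items()}
best_action = min(expected_cost, key=expected_cost.get)
print(expected_cost, best_action)
```

The expected cost of bringing an umbrella (0 × 0.2 + 2 × 0.8) is below that of not bringing one (10 × 0.2 + 0 × 0.8), so the expected-cost-minimizing action is to bring the umbrella.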
The loss function in our hypothetical example in Table 2.1 is an instance of an asymmetric
loss structure, where the losses are different under the two bad outcomes. On the other
hand, should the losses incurred be the same, we say that the loss function is symmetric.
In general, under a symmetric loss function, an error above the actual value causes the same loss as an error of the same magnitude below the actual value. Two widely used symmetric loss functions are the squared or quadratic loss L(ŷ, y) = (ŷ − y)², and the absolute loss L(ŷ, y) = |ŷ − y|, where ŷ is the forecast and y the actual future value. The main difference between the two loss functions is that the quadratic loss penalizes large forecast errors much more heavily than the absolute loss, as illustrated in Figure 2.3. In addition, the absolute loss function is not differentiable at 0, whereas the quadratic loss function is differentiable on the whole real line.
Figure 2.3: Two popular symmetric loss functions: quadratic loss (left panel) and absolute
loss (right panel).
In Section 2.3 we presented various forecast objects, including point and density forecasts.
As discussed there, a density forecast essentially summarizes all that can be learned about
the future observable yT +h given the current information IT . But often it is difficult to
convey the information content in the density function f (yT +h | IT , θ). A natural question
is, given the density forecast f (yT +h | IT , θ), can we produce a point forecast ybT +h that is
optimal in a well-defined sense? To address this question, we will first need to define what
it means to be optimal. Since the usual reason of producing forecasts is to guide decision
making, it seems reasonable to define an optimal point forecast as the one that minimizes
the expected loss with respect to a given loss function.
Example 2.10 (Optimal point forecast under quadratic loss). Given the density
forecast f (yT+h | IT , θ) and the quadratic loss function L(ŷ, y) = (ŷ − y)², what is the point forecast that minimizes the expected loss? To find out, first note that the expected loss can be written as

$$ \mathrm{E}\left[ L(\hat y_{T+h}, y_{T+h}) \mid I_T, \theta \right] = \int (\hat y_{T+h} - y_{T+h})^2 f(y_{T+h} \mid I_T, \theta) \, \mathrm{d}y_{T+h}, $$

where the expectation is taken with respect to the density f (yT+h | IT , θ). Now, we want to minimize this expected loss with respect to ŷT+h . We assume that we can differentiate under the integral sign. (We can do that, for instance, if f (yT+h | IT , θ) is a member of the exponential family. If you don't know what the exponential family is, don't worry about it.) We set the first-order condition to 0 to find the critical point:

$$ 2 \int (\hat y_{T+h} - y_{T+h}) f(y_{T+h} \mid I_T, \theta) \, \mathrm{d}y_{T+h} = 0, $$
$$ \hat y_{T+h} \int f(y_{T+h} \mid I_T, \theta) \, \mathrm{d}y_{T+h} = \int y_{T+h} f(y_{T+h} \mid I_T, \theta) \, \mathrm{d}y_{T+h}, $$
$$ \hat y_{T+h} = \mathrm{E}\left[ y_{T+h} \mid I_T, \theta \right], $$

where we used the fact that ∫ f (yT+h | IT , θ) dyT+h = 1 as f (yT+h | IT , θ) is a density function. In other words, the optimal point forecast under quadratic loss is the conditional expectation E[yT+h | IT , θ].
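This result can be confirmed numerically: approximate the density forecast by a large sample of draws, evaluate the expected quadratic loss over a grid of candidate point forecasts, and check that the minimizer is (close to) the mean (a Python sketch; the normal density used as the stand-in forecast is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
draws = rng.normal(1.5, 2.0, 200_000)    # stand-in for f(y_{T+h} | I_T, theta)

grid = np.linspace(-2.0, 5.0, 701)       # candidate point forecasts
expected_loss = np.array([np.mean((g - draws) ** 2) for g in grid])
best = grid[np.argmin(expected_loss)]
print(best, draws.mean())                # the minimizer is close to the mean 1.5
```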
Exercise 2.10. For a continuous density f (yT+h | IT , θ), show that the optimal point forecast under the absolute loss is the median mT+h , i.e.,

$$ \int_{-\infty}^{m_{T+h}} f(y_{T+h} \mid I_T, \theta) \, \mathrm{d}y_{T+h} = \frac{1}{2}. $$

(Hint: use the Fundamental Theorem of Calculus and the fact that $\int_{-\infty}^{\infty} = \int_{-\infty}^{\hat y_{T+h}} + \int_{\hat y_{T+h}}^{\infty}$.)
In most applications there are several seemingly reasonable models that we might use to compute forecasts, and a priori it is difficult to choose among these competing models. For example, to forecast future GDP we might consider models with a linear or a quadratic deterministic trend, or models with stochastic trends. But how do we select which model to use to produce forecasts? This brings us to the problem of model selection,
which is of great importance in the context of forecasting. Broadly speaking, there are
two approaches to choosing the best model. Under the in-sample fitting approach, we
use the full sample to estimate the model parameters, and then compute some measures
for assessing model fit. Typically these measures are based on the residuals of the fitted model, and two popular ones are the mean absolute error (MAE) and the mean squared error (MSE):

$$ \mathrm{MAE} = \frac{1}{T} \sum_{t=1}^{T} |y_t - \hat y_t|, \qquad \mathrm{MSE} = \frac{1}{T} \sum_{t=1}^{T} (y_t - \hat y_t)^2, $$

where ŷt is the fitted value for yt under the model. We then choose the model that has the lowest MAE or MSE.
it probably will produce reasonable forecasts in the future. One main criticism of focusing
on MAE or MSE as a model selection criterion is that they are prone to the problem of
over-fitting — the selected model is excessively complex relative to the amount of data
available such that it describes the idiosyncrasies in the data rather than the underlying
relationship.
Since both MAE and MSE only depend on how well the model fits historical data, but
ignore model complexity, both criteria tend to favor complex models. In fact, we show
in Example 2.11 that if we have two nested specifications and estimate the parameters
via the ordinary least squares (OLS) method (more on this in Chapter 3), then the MSE
for the more complex model is no larger than that of the simpler one. The fact that the
more complex model is always favored — simply by the virtue of having more parameters
and being more complex — is disconcerting. This would not be a problem if we had an infinite amount of data. In practice, however, we only have a fixed number of
observations. With a finite sample, if we have a more complex model, the estimates for
the model parameters become more imprecise. Any forecasts based on these inaccurately
estimated model parameters tend to be poor. This is the main reason why overfitted
models generally have poor predictive performance.
Example 2.11 (Potential over-fitting problem). Suppose we have two trend specifi-
cations, namely, the linear trend and the quadratic trend, where the point estimates for
yt under the two specifications are respectively
$$ \hat y_t^{\,1} = a_0 + a_1 t, $$
$$ \hat y_t^{\,2} = b_0 + b_1 t + b_2 t^2. $$
Observe that the linear trend specification is nested within the quadratic one, i.e., the
latter becomes the former if one sets b2 = 0. We want to show that the MSE under the
quadratic trend is always no larger than the MSE under the linear specification, if the
parameters are estimated via the ordinary least squares (OLS) method (see Chapter 3 for
details). To this end, define the function
$$ f(c_0, c_1, c_2) = \frac{1}{T} \sum_{t=1}^{T} (y_t - c_0 - c_1 t - c_2 t^2)^2. $$
The parameters (a0 , a1 ) and (b0 , b1 , b2 ) are estimated via the OLS method. That is,

$$ (\hat a_0, \hat a_1) = \operatorname*{argmin}_{a_0, a_1} \sum_{t=1}^{T} (y_t - \hat y_t^{\,1})^2 = \operatorname*{argmin}_{a_0, a_1} f(a_0, a_1, 0), $$
$$ (\hat b_0, \hat b_1, \hat b_2) = \operatorname*{argmin}_{b_0, b_1, b_2} \sum_{t=1}^{T} (y_t - \hat y_t^{\,2})^2 = \operatorname*{argmin}_{b_0, b_1, b_2} f(b_0, b_1, b_2). $$

Since minimizing f over all of (b0 , b1 , b2 ) searches over a strictly larger set than minimizing f (a0 , a1 , 0) over (a0 , a1 ) alone, we must have f (b̂0 , b̂1 , b̂2 ) ≤ f (â0 , â1 , 0). That is, the MSE under the quadratic trend is no larger than the MSE under the linear trend.
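The nesting argument is easy to check numerically: on any data set, OLS under the quadratic trend cannot produce a larger MSE than OLS under the linear trend. A Python sketch on simulated data (the series itself is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
T = 60
t = np.arange(1.0, T + 1)
y = 1.0 + 0.1 * t + rng.standard_normal(T)        # arbitrary simulated series

def ols_mse(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS fit
    return np.mean((y - X @ beta) ** 2)

mse_linear = ols_mse(np.column_stack([np.ones(T), t]), y)
mse_quadratic = ols_mse(np.column_stack([np.ones(T), t, t ** 2]), y)
assert mse_quadratic <= mse_linear + 1e-12  # more complex nested model never fits worse
print(mse_linear, mse_quadratic)
```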
In view of the over-fitting problem, it seems reasonable to penalize models with more
parameters — in this way, if two models fit the data more or less equally well, the simpler
model with fewer parameters would then be favored. Two popular selection criteria that
explicitly take into account of model complexity are Akaike information criterion
(AIC) and Bayesian information criterion (BIC), which are defined as follows:
AIC = MSE × T + 2k
BIC = MSE × T + k log T,
where k is the number of model parameters and T the sample size. Given a set of competing
models, the preferred model is the one with the minimum AIC/BIC value. As is obvious
from the definition, both criteria do not only reward goodness-of-fit as measured by the
mean squared error, but they also include a penalty term that is an increasing function of
the number of model parameters. Consequently, the two criteria discourage over-fitting,
and are generally better measures for selecting a forecast model than MSE or MAE. It is
also worth noting that the penalty term in BIC is generally larger than that in AIC. As
such, BIC favors more parsimonious models. Despite their difference in the penalty, AIC
and BIC often agree on the best model. If they are in conflict, it is recommended that the more parsimonious model be selected — in our context this means using the model with the minimum BIC.
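As a concrete illustration of how the criteria trade off fit against complexity, suppose three trend specifications produce the in-sample MSEs below (a Python sketch; the sample size and MSE values are invented for this illustration, not the GDP results of Example 2.12):

```python
import math

T = 252                                   # hypothetical sample size
# (MSE, number of parameters k) -- invented values for illustration
models = {"linear": (0.500, 2), "quadratic": (0.045, 3), "cubic": (0.039, 4)}

def aic(mse, k): return mse * T + 2 * k
def bic(mse, k): return mse * T + k * math.log(T)

for name, (mse, k) in models.items():
    print(f"{name:10s} AIC = {aic(mse, k):7.2f}  BIC = {bic(mse, k):7.2f}")

best_aic = min(models, key=lambda m: aic(*models[m]))
best_bic = min(models, key=lambda m: bic(*models[m]))
print(best_aic, best_bic)
```

Here the cubic trend has the smallest MSE, yet both criteria select the quadratic trend: its small loss in fit is outweighed by the penalty on the extra parameter, and the BIC margin is larger than the AIC margin.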
Instead of using a particular in-sample fitting criterion to select the best model, an al-
ternative approach considers performing a pseudo out-of-sample forecasting exercise.
Since we are not interested in fitting the historical data per se — the goal is to use the
model that best fits past data to make (hopefully good) forecasts — apparently a better
criterion for choosing a good forecast model is to select the one, well, that makes good
forecasts. Of course, we would not know which model does best in forecasting ex ante.
The idea of a pseudo out-of-sample forecasting exercise is to simulate the experience of a
real-time forecaster: we use only a portion of the dataset, say, from time t = 1 to t = T0 to
estimate the model parameters and make the forecast ybT0 +h|T0 , where we emphasize the
forecast depends only on data up to time T0 . We compare the forecast with the actual
observed value yT0 +h . Then we move forward to date T0 + 1 and repeat this exercise
recursively. Finally, we compute some summary of the forecast errors in this pseudo out-
of-sample forecasting exercise, and pick the model that has the minimal forecast error.
Similar to the MAE and MSE, two popular measures of forecast error are the mean absolute forecast error (MAFE)

$$ \mathrm{MAFE} = \frac{1}{T - h - T_0 + 1} \sum_{t=T_0}^{T-h} |y_{t+h} - \hat y_{t+h|t}|, $$

and the mean squared forecast error (MSFE)

$$ \mathrm{MSFE} = \frac{1}{T - h - T_0 + 1} \sum_{t=T_0}^{T-h} (y_{t+h} - \hat y_{t+h|t})^2, $$

where yt+h is the actual observed value at time t + h and ŷt+h|t is the h-step-ahead forecast given the information up to time t. In contrast to the MAE and MSE, which are measures of in-sample fit, the MAFE and MSFE measure out-of-sample forecast performance.
Example 2.12 (Forecasting U.S. GDP). In Figure 2.4 we plot the U.S. seasonally
adjusted real Gross Domestic Product (GDP) from 1947Q1 to 2009Q4 (in trillions of 2005
U.S. Dollars). There is a clear trend in the U.S. GDP data, and it seems to grow faster
than a linear one. In this example we compare three different trend specifications, namely,
linear, quadratic and cubic trends:
$$ \hat y_t^{\,1} = a_0 + a_1 t, $$
$$ \hat y_t^{\,2} = b_0 + b_1 t + b_2 t^2, $$
$$ \hat y_t^{\,3} = c_0 + c_1 t + c_2 t^2 + c_3 t^3. $$
Figure 2.4: U.S. seasonally adjusted real GDP from 1947Q1 to 2009Q4 (in trillions of 2005
U.S. dollars).
Note that the cubic trend specification includes the quadratic one as a special case, which
in turn nests the linear trend. We defer the problem of estimation to Chapter 3; here
we only present the estimation and forecasting results. We compute the MSE, AIC and
BIC for each specification, and the three measures of in-sampling fitting are reported in
Table 2.2.
Table 2.2: MSE, AIC and BIC under each of the three trend specifications.
Consistent with the graph in Figure 2.4, the linear specification does not fit the data well. In fact, simply adding a quadratic term decreases the MSE by over 90%. However, although the cubic specification fits the data better than the quadratic one, as expected (if this is not what you expected, see the discussion in Example 2.11), the improvement in fit is marginal (a further 13% reduction in MSE). Does this reduction in MSE justify the additional complexity of the cubic specification? Both AIC and BIC suggest that it does not. In fact, both criteria indicate that the best specification among the three is the quadratic one, once goodness-of-fit is balanced against model complexity. It is also worth noting that BIC penalizes model complexity more heavily than AIC, so the former favors the quadratic specification more strongly than the latter.
Figure 2.5: Fitted values of U.S. GDP under the linear trend (left panel), the quadratic
trend (middle panel) and the cubic trend (right panel) specifications.
Another way to assess goodness-of-fit is to look at the fitted values ŷt¹ , ŷt² and ŷt³ under the three specifications. From Figure 2.5 it is obvious that the linear trend does not fit the observed U.S. GDP well. Specifically, it consistently underestimates the observed values from 1960 to the late 1990s, while overestimating them before and after that period. On the other hand, both the quadratic and cubic trends fit the data well, and the fitted values under the two specifications are almost indistinguishable.
According to both AIC and BIC, the quadratic specification is the best among the three.
But how does it perform in the pseudo out-of-sample forecasting exercise? To address this question, we compute the MSFE for each of the three specifications with h = 4
(one-year-ahead forecast) and h = 20 (five-year-ahead forecast). The recursive forecasting
exercises start from the 40-th period (T0 = 40, 10 years of data), and the results are
reported in Table 2.3. Again, the linear specification is the worst among the three: its MSFE is more than 10 times larger than that of the quadratic trend for the one-year-ahead forecast and 7 times larger for the five-year-ahead forecast. And consistent with
the in-sample-fitting result, the quadratic trend does better than the cubic specification
in both the one- and five-year-ahead pseudo out-of-sample forecasting exercises. This
example shows that the model that best fits historical data — the cubic specification has
the minimum MSE — might not necessarily have the best forecasting performance.
Figure 2.6: Out-of-sample forecasts under the quadratic trend and the cubic trend speci-
fications with h = 4 (left panel) and h = 20 (right panel).
Lastly, we report in Figure 2.6 the pseudo out-of-sample forecasts under the quadratic and the cubic trend specifications. Although the out-of-sample forecasting performance of the two specifications is similar for the one-year-ahead forecasts, the five-year-ahead forecasts under the cubic trend specification are much more volatile, resulting in a much larger MSFE compared to the quadratic trend.
Christoffersen and Diebold (1997) study the optimal prediction problem under general
asymmetric loss functions and characterize the optimal predictor. A recent paper that
compares direct and iterated forecasts can be found in Marcellino, Stock, and Watson
(2006).
Ad hoc forecasting methods are a collection of seemingly reasonable rules of thumb that
are typically easy to implement and do not require the user to specify a statistical model
for the data generating process. They are sometimes described as a “black box” fore-
casting approach where the statistical properties of the forecasts are often difficult to
analyze. When statistical statements can be made about the forecasts, they typically
involve asymptotic arguments (the law of large numbers, the central limit theorem, etc.).
In this chapter we will discuss several popular ad hoc forecasting methods, including the
Ordinary Least Squares (OLS) and the exponential smoothing methods.
Let yt be the variable of interest and let xt denote a 1 × k vector of relevant variables
for determining yt . Recall that the variables xt are often called regressors, exogenous
variables, or explanatory variables. Instead of specifying a (conditional) distribution
for yt as in the linear regression model (see Example 2.4), we only specify the (conditional) expectation E[yt | xt , β]. More specifically, we assume that

$$ \mathrm{E}[y_t \mid x_t, \beta] = x_t \beta \tag{3.1} $$

for some k × 1 parameter vector β. Of course, the assumption in the linear regression
model, particularly the assumption that the disturbance term εt is N(0, σ 2 )-distributed
(conditioned on xt ), implies (3.1). But assuming (3.1) alone does not commit us to a
particular distributional assumption. As such, (3.1) does not uniquely specify a model
in the sense that even if we assume (3.1), we still don’t know how one can generate the
data. In addition, it is impossible to derive the conditional density f (yT +h | IT , θ) without
further distributional assumptions.
Now, assuming only (3.1), we would like to derive a sensible estimator — a procedure
to obtain a guess on β given the data. What is a good way to obtain an estimator for
β? Assumption (3.1) says that "on average" yt equals ŷt = xt β. Hence, the error (yt − xt β) should be small on average. Thus, one seemingly reasonable criterion for finding an estimator for β is simply to locate the β that minimizes the sum of squared errors:

$$ \hat\beta = \operatorname*{argmin}_{\beta} \sum_{t=1}^{T} (y_t - x_t \beta)^2. \tag{3.2} $$

We call the estimator β̂ thus obtained the Ordinary Least Squares (OLS) estimator. The next question is, can we solve (3.2) analytically? That is, can we find an explicit formula for β̂ as a function of xt and yt ? Notice that (3.2) in general is a multi-dimensional
minimization problem as the variable β = (β1 , . . . , βk )′ is a k × 1 vector. One way to
proceed is to first differentiate the sum of squared errors with respect to β1 to obtain a
first order condition. Then differentiate the sum again with respect to β2 to obtain another
equation, and so forth. At the end we obtain k simultaneous equations and we solve for the
k unknowns, namely, β = (β1 , . . . , βk )′ . In principle this approach is straightforward, but
the implementation is typically time-consuming and the resulting formula for β is messy.
Instead, we rewrite the problem in matrix notations, and use matrix differentiation to
obtain the system of k simultaneous equations, which can then be solved easily. That is
why we need to digress and discuss matrix differentiation first.
That is, Jψ is the matrix of all first-order partial derivatives of the vector-valued function ψ with respect to the vector x.
Exercise 3.1. Let A be an n × n matrix and u an n × 1 vector. Further let v = Au and
w = u′ A. Show that the k-th elements of v and w are respectively
$$ v_k = \sum_{j=1}^{n} a_{kj} u_j, \qquad w_k = \sum_{j=1}^{n} a_{jk} u_j. $$
To find the (i, j)-th element of the m × n matrix Jacobian matrix Jψ , we differentiate the
i-th element of ψ with respect to xj :
$$ \frac{\partial \psi_i(x)}{\partial x_j} = \frac{\partial}{\partial x_j} \sum_{k=1}^{n} a_{ik} x_k = a_{ij}. $$
In other words, the (i, j)-th element of Jψ is aij , the (i, j)-th element of A. Therefore,
the claim follows.
Exercise 3.3. Let z be a 1 × n vector and let ϕ(z) denote a 1 × m vector-valued function.
The Jacobian matrix of the transformation ϕ is the n × m matrix defined as:
$$ J_\varphi(z) \equiv \frac{\partial \varphi(z)}{\partial z} \equiv \begin{pmatrix} \dfrac{\partial \varphi_1(z)}{\partial z_1} & \cdots & \dfrac{\partial \varphi_m(z)}{\partial z_1} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial \varphi_1(z)}{\partial z_n} & \cdots & \dfrac{\partial \varphi_m(z)}{\partial z_n} \end{pmatrix}. $$
First observe that the quadratic form ψ(x) = x′ Ax is a scalar, and therefore the Jacobian
Jψ is a 1 × n vector. By Exercise 3.2, we have
$$ \psi(x) = \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} x_i x_j. $$

Differentiating with respect to xk gives

$$ \frac{\partial \psi(x)}{\partial x_k} = \sum_{j=1}^{n} a_{kj} x_j + \sum_{i=1}^{n} a_{ik} x_i. $$

By Exercise 3.1, the first term on the right-hand side is equal to the k-th element of Ax, or equivalently the k-th element of x′A′, whereas the second term equals the k-th element of x′A. Hence Jψ (x) = x′A′ + x′A = x′(A + A′).
With this digression on matrix multiplication and differentiation, we now have all the
results needed to derive the formula for the OLS estimator.
To compute the OLS estimator, we first write the sum of squared errors in matrix notation. To this end, stack the observations over t into y and X as in Example 2.4. Then (3.2) can be rewritten as

$$ \hat\beta = \operatorname*{argmin}_{\beta} \, (y - X\beta)'(y - X\beta). $$
Now, differentiate the sum of squared errors with respect to β. By setting the k first-order
derivatives to zero, we obtain the following system of k linear equations:
$$ \frac{\partial}{\partial \beta} (y - X\beta)'(y - X\beta) = \frac{\partial}{\partial \beta} \left( y'y - 2y'X\beta + \beta'X'X\beta \right) = -2y'X + 2\beta'X'X = 0. \tag{3.5} $$
We further assume that this system of linear equations is linearly independent, i.e., none of the equations can be derived algebraically from the others. In particular, we assume that the matrix X has full column rank — the k columns of X are all linearly independent. It then follows from linear algebra that β can be solved for uniquely. In fact, the solution to (3.5) is

$$ \hat\beta = (X'X)^{-1}X'y. \tag{3.6} $$
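As a quick sanity check, the closed-form expression (3.6) should agree with a generic least-squares solver (a Python sketch on simulated data; the regressors and coefficients are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
T = 100
X = np.column_stack([np.ones(T), rng.standard_normal((T, 2))])  # T x k regressor matrix
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(T)     # arbitrary data

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)        # (X'X)^{-1} X'y, as in (3.6)
beta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)    # library least-squares solution
assert np.allclose(beta_hat, beta_ref)
```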
To check that β̂ is indeed the minimum, we compute the second-order derivative of the sum of squared errors. Note that the first-order derivative in (3.5) is a 1 × k row vector, and we differentiate it with respect to the row vector β′. By Exercise 3.3, we have

$$ \frac{\partial^2}{\partial \beta \, \partial \beta'} (y - X\beta)'(y - X\beta) = \frac{\partial}{\partial \beta'} \left( -2y'X + 2\beta'X'X \right) = 2X'X. $$

Since X′X is positive definite, the critical point in (3.6) is in fact the minimum.
We now return to Example 2.12, where we compared three different trend specifications for forecasting U.S. GDP. There we did not provide the computational details of the OLS estimates and the various in-sample fitting and out-of-sample forecasting measures.
Example 3.3 (Forecasting U.S. GDP: computation details). In Example 2.12 we
computed the MSE, AIC and BIC for each of the three trend specifications, namely, the
linear, quadratic and cubic trends. We first consider the problem of computing the OLS
estimate using the full sample; later we’ll discuss the details of the recursive forecasting
exercise. Now, by defining the matrix X appropriately, the OLS estimates for each of
the specification can be easily obtained via (3.6). For instance, in the quadratic trend
specification, we have an intercept, a time index t and its square, i.e., xt = (1, t, t2 ) and
hence,
$$ X = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ \vdots & \vdots & \vdots \\ 1 & T & T^2 \end{pmatrix}. $$
Given the OLS estimate β̂, it follows from our assumption (3.1) that the fitted value is simply ŷt = xt β̂. The following MATLAB code computes the in-sample fitting measures for the quadratic trend specification.

load USGDP.csv;
y = USGDP; T = size(y,1);
t = (1:T)'; % construct a column vector (1, 2, 3, ... T)'
X = [ones(T,1) t t.^2]; % quadratic trend regressors
beta = (X'*X)\(X'*y); % OLS estimate (X'X)^{-1}X'y
e = y - X*beta; MSE = e'*e/T; % residuals and MSE
AIC = MSE*T + 2*3; BIC = MSE*T + 3*log(T); % k = 3 parameters
To compute the MSFE in the recursive forecasting exercise, we first need to obtain ŷt+h|t , the h-step-ahead forecast given the data up to time t for each specification. In our case it is ŷt+h|t = xt+h β̂, where the OLS estimate β̂ is computed using only data up to time t. The following MATLAB code computes the MSFEs for each of the three trend specifications.
T0 = 40;
h = 4; % h-step-ahead forecast
syhat = zeros(T-h-T0+1,3); % storage for the forecasts
ytph = y(T0+h:end); % observed y_{t+h}
for t = T0:T-h
    yt = y(1:t);
    s = (1:t)';
    nt = length(s); % sample size
    %% linear trend
    Xt = [ones(nt,1) s];
    beta1 = (Xt'*Xt)\(Xt'*yt);
    syhat(t-T0+1,1) = [1 t+h]*beta1;
    %% quadratic trend
    Xt = [ones(nt,1) s s.^2];
    beta2 = (Xt'*Xt)\(Xt'*yt);
    syhat(t-T0+1,2) = [1 t+h (t+h)^2]*beta2;
    %% cubic trend
    Xt = [ones(nt,1) s s.^2 s.^3];
    beta3 = (Xt'*Xt)\(Xt'*yt);
    syhat(t-T0+1,3) = [1 t+h (t+h)^2 (t+h)^3]*beta3;
end
MSFE = mean((repmat(ytph,1,3) - syhat).^2); % MSFE for each specification
In this section we briefly discuss various properties of the OLS estimator for β in the linear
regression setting:
yt = xt β + εt , (3.7)
where the error terms εt ’s are independent and identically distributed (iid). A more
thorough treatment of the statistical properties of the OLS estimator can be found in any
standard econometrics textbook. Recall that we assume the conditional expectation of
the error term εt given β and the regressors xt to be zero. That is,
E[εt | xt , β] = 0. (3.8)
In addition, we further assume that
Var[εt | xt , β] = σ 2 . (3.9)
For later reference, we stack the linear regression in (3.7) over t = 1, . . . , T and write
y = Xβ + ε.
We know that, by definition, the OLS estimator

$$ \hat\beta = (X'X)^{-1}X'y $$

minimizes the sum of squared errors of the regression. What else can we say about the estimator β̂? We'll first show that the OLS estimator is unbiased, i.e.,

$$ \mathrm{E}[\hat\beta \mid X, \beta] = \beta. $$
In words, this means that the OLS estimator "on average" equals the unknown parameter vector β, and for obvious reasons this is a desirable property. To show unbiasedness, first note that by (3.8) and the fact that the expectation operator is linear, we have E[y | X, β] = Xβ. Hence,

$$ \mathrm{E}[\hat\beta \mid X, \beta] = \mathrm{E}\left[ (X'X)^{-1}X'y \mid X, \beta \right] = (X'X)^{-1}X' \, \mathrm{E}[y \mid X, \beta] = (X'X)^{-1}X'X\beta = \beta. $$
With assumption (3.9) (on the variance of εt ), we can also derive the covariance matrix of β̂. Specifically, (3.9) implies that the covariance matrix of ε is Σε = σ²IT . Therefore, the covariance matrix of β̂, denoted Σβ̂ , is given by

$$ \Sigma_{\hat\beta} = (X'X)^{-1}X' \, \Sigma_\varepsilon \left( (X'X)^{-1}X' \right)' = (X'X)^{-1}X' \sigma^2 I_T X (X'X)^{-1} = \sigma^2 (X'X)^{-1}X'X(X'X)^{-1} = \sigma^2 (X'X)^{-1}. $$
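Both properties, unbiasedness and the covariance formula σ²(X′X)⁻¹, can be verified by Monte Carlo (a Python sketch; the design matrix and parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
T, sigma = 50, 0.5
X = np.column_stack([np.ones(T), np.arange(1.0, T + 1)])  # arbitrary design matrix
beta = np.array([1.0, 0.2])                               # arbitrary true parameters

R = 20_000
est = np.empty((R, 2))
for r in range(R):
    y = X @ beta + sigma * rng.standard_normal(T)
    est[r] = np.linalg.solve(X.T @ X, X.T @ y)            # OLS estimate for this sample

# Unbiasedness: the average estimate is close to the true beta
assert np.allclose(est.mean(axis=0), beta, atol=1e-2)
# Covariance: the sampling covariance is close to sigma^2 (X'X)^{-1}
Sigma = sigma ** 2 * np.linalg.inv(X.T @ X)
assert np.allclose(np.cov(est.T), Sigma, rtol=0.1, atol=1e-5)
```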
So far we have shown that the OLS estimator β̂ is unbiased and its covariance matrix is Σβ̂ = σ²(X′X)⁻¹. It turns out that β̂ is also the optimal linear unbiased estimator for β in the sense that it has the minimum variance among all linear unbiased estimators. Indeed, this fact has a name: the Gauss-Markov Theorem.
Theorem 3.1 (Gauss-Markov). In the linear regression (3.7) with assumptions (3.8) and (3.9), the OLS estimator β̂ has the minimum variance in the class of linear unbiased estimators for the parameter vector β.
Again, more details about the proof of the Gauss-Markov Theorem can be found in any
standard econometrics textbook.
In Examples 2.12 and 3.3 we discussed modeling data that exhibit trending behavior. We now turn our attention to seasonality. A seasonal pattern is a regular within-year pattern that repeats itself every year. Seasonality arises from, among other things, preferences, technology and social institutions. Seasonal weather patterns are perhaps the most obvious example. Any consumption or production that involves the weather, such as energy (heating), agricultural products, and cough syrup, tends to be seasonal as well. Another major source of seasonal variation is the calendar. For example, retail sales, at least in Western countries, tend to peak during the Christmas season, whereas Chinese consumers usually do their holiday shopping just before the Chinese New Year.
One of the easiest ways to model seasonal variation is to introduce seasonal dummies.
For example, suppose yt is observed quarterly and does not show trending behavior. One
might specify yt as fluctuating around the level µ:
yt = µ + εt , (3.10)
where the expected value of the error term εt is zero but otherwise unrestricted. Note
that under the specification (3.10), the conditional expectation of yt is µ, no matter which
quarter we are in. Put differently, the simple specification in (3.10) does not allow for
seasonal variation. Now, let’s say a closer look at the historical data reveals that yt
exhibits seasonal patterns. Particularly, the peaks seem to occur regularly at the fourth
quarter (Christmas season). Thus, other things equal, we expect a larger yt if it is the
fourth quarter. To allow for this seasonal variation, we consider
yt = µ + αDt + εt ,
where Dt is a dummy variable that equals 1 if it is the fourth quarter, and 0 otherwise.
What the dummy variable does is to allow the conditional expectations of yt to differ
between the fourth quarter and other quarters. More explicitly,
E[yt | µ, α] = µ       if Dt = 0,
E[yt | µ, α] = µ + α   if Dt = 1.
By the same logic, one can include more dummy variables to allow for more complex sea-
sonal pattern. However, caution must be taken not to include too many dummy variables
due to the problem of perfect multicollinearity, which occurs when there is an exact
linear relation among the regressors. Perfect multicollinearity is a problem because if we
have an exact linear relation among the regressors — i.e., one variable can be written as
a linear combination of the others — then the matrix X will not have full column rank.
Consequently, X′ X is not invertible and the OLS estimate can’t be computed.
Returning to the previous example, let Dit be a dummy variable for the i-th quarter,
i.e., Dit = 1 if time t is the i-th quarter and Dit = 0 otherwise. Consider the specification

yt = µ + α1 D1t + α2 D2t + α3 D3t + α4 D4t + εt .

Then the regressors are xt = (1, D1t , D2t , D3t , D4t )′ , and we have (why?)

D1t + D2t + D3t + D4t = 1,
an exact linear relation among the regressors. One easy fix to this problem of perfect
multicollinearity is to remove the intercept, or one of the dummy variables. For instance,
if we consider
yt = α1 D1t + α2 D2t + α3 D3t + α4 D4t + εt ,

then the regressors xt = (D1t , D2t , D3t , D4t )′ are linearly independent. In addition, we have
E[yt | θ] = αi if Dit = 1, where θ = (α1 , α2 , α3 , α4 )′ .
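To see the rank problem concretely, here is a small sketch (in Python, for illustration only; the two-year quarterly setup is a made-up example) that checks the column rank of the design matrix with and without the intercept:

```python
import numpy as np

# Quarter indicator for two full years of quarterly data: 1,2,3,4,1,2,3,4
Q = np.tile(np.arange(1, 5), 2)
D = np.stack([(Q == i).astype(float) for i in range(1, 5)], axis=1)  # T x 4 dummies
T = len(Q)

X_full = np.column_stack([np.ones(T), D])   # intercept + all four dummies: 5 columns
X_drop = D                                  # all four dummies, no intercept: 4 columns

# Since D1t + D2t + D3t + D4t = 1, X_full is rank deficient and X'X is singular
print(np.linalg.matrix_rank(X_full))        # 4, not 5
print(np.linalg.matrix_rank(X_drop))        # 4: full column rank, so OLS is computable
```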
Example 3.4 (Forecasting U.S. retail e-commerce sales). We plot the quarterly
U.S. retail e-commerce sales from 1999Q4 to 2011Q1 (not seasonally adjusted; in billions
U.S. dollars) in Figure 3.1. Two prominent features of the series are trend and seasonality.
Despite the downturn of retail sales around the global financial crisis in late 2008, the time
series overall exhibits a linear upward trend. In addition, there is a clear seasonal pattern
— the sales figures jump up in the fourth quarter, presumably due to Cyber Monday and
Christmas sales.
Figure 3.1: U.S. retail e-commerce sales from 1999Q4 to 2011Q1 (not seasonally adjusted;
in billions U.S. dollars).
In this example we focus on modeling seasonality and thus assume a linear trend for all
specifications; it is straightforward to consider quadratic or other deterministic trends.
Let Dit be a dummy variable for the i-th quarter, i.e., Dit = 1 if time t is the i-th quarter
and Dit = 0 otherwise. Now, consider the following three different specifications:
S1 : ŷt1 = a0 + a1 t + α4 D4t
S2 : ŷt2 = a0 + a1 t + α1 D1t + α4 D4t
S3 : ŷt3 = a1 t + α1 D1t + α2 D2t + α3 D3t + α4 D4t
In the first specification, we include only a dummy variable for the fourth quarter. Implic-
itly, we assume that the sales in the first three quarters are the same (after controlling for
the linear trend), while allowing only the sales in the fourth quarter to be different. In the
second specification, we include dummy variables for both the first and fourth quarters.
As such, we further allow the sales in the first quarter to be different. In the last speci-
fication, the sales in each quarter can potentially be different from the rest. The dataset
USecomm.csv has two columns, where the first column contains the retail e-commerce sales
values, and the second is a quarter indicator that takes values from 1 to 4. The following
MATLAB code computes the in-sample fit measures for the third seasonality
specification.
load 'USecomm.csv';
y = USecomm(:,1); Q = USecomm(:,2);
T = length(y); t = (1:T)’;
%% construct 4 dummy variables
D1 = (Q == 1); D2 = (Q == 2); D3 = (Q == 3); D4 = (Q == 4);
%% 3rd spec: linear trend + all dummies
X = [t D1 D2 D3 D4]; % drop the intercept to avoid multicollinearity
beta = (X’*X)\(X’*y);
yhat = X*beta;
MSE3 = mean((y-yhat).^2);      % in-sample mean squared error
AIC = T*MSE3 + 5*2;            % AIC = T*MSE + 2k, with k = 5 parameters
BIC = T*MSE3 + 5*log(T);       % BIC = T*MSE + k*log(T)
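Since USecomm.csv is not reproduced here, the same computation can be sketched in Python on a made-up quarterly series (the trend slope and quarter effects below are arbitrary illustrative values), using the definitions AIC = T·MSE + 2k and BIC = T·MSE + k·log T from the code above:

```python
import numpy as np

# Synthetic quarterly series with an exact linear trend plus quarter effects
T = 48
t = np.arange(1, T + 1)
Q = (np.arange(T) % 4) + 1                       # quarter indicator 1,...,4
alpha_true = np.array([1.0, 0.5, 0.8, 4.0])      # hypothetical quarter effects
y = 0.9 * t + alpha_true[Q - 1]

# Third specification: linear trend plus all four dummies, no intercept
D = np.stack([(Q == i).astype(float) for i in range(1, 5)], axis=1)
X = np.column_stack([t, D])
beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta
MSE = np.mean(resid ** 2)
k = X.shape[1]
AIC = T * MSE + 2 * k                  # AIC as defined in the notes
BIC = T * MSE + k * np.log(T)
print(round(MSE, 10), round(AIC, 4))   # the model fits exactly, so MSE ~ 0, AIC ~ 2k
```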
Table 3.1: MSE, AIC and BIC under each of the three seasonality specifications.

                            S1        S2        S3
MSE                      5.408     5.352     5.346
AIC                     254.79    254.20    255.90
BIC                     260.28    261.51    265.05
Number of parameters         3         4         5
We report in Table 3.1 the MSE, AIC and BIC under each of the three specifications. Once
again, since the three specifications are nested, the MSEs are decreasing in the complexity
of the specification. However, allowing for more complex seasonal variation by including
additional dummy variables beyond the one for the fourth quarter does little to improve
the overall goodness-of-fit. In fact, the AIC criterion (marginally) prefers the second specification ŷt2 ,
whereas the BIC criterion favors the first. We plot the fitted lines under the first (S1 )
and second (S2 ) specifications in Figure 3.2, along with the actual values of the retail e-
commerce sales. Note that both fitted lines look identical. It is also interesting to observe
that the seasonal pattern is more pronounced after about 2006 — before 2006 the fitted
lines always overestimate the increase in the sales in the fourth quarter, whereas after 2006
the increases are often underestimated. To fit the data better, it seems that we would
need specifications that allow for time-varying parameters.
2000 2002 2004 2006 2008 2010
Figure 3.2: Fitted lines under specifications S1 and S2 , along with the actual U.S. retail
e-commerce sales from 1999Q4 to 2011Q1.
MSFEs under each of the three seasonality specifications:

             S1       S2       S3
h = 1     11.04    11.14    11.30
h = 2     12.47    12.49    12.53
Figure 3.3: Out-of-sample forecasts under specifications S1 and S2 with h = 1 (left panel)
and h = 2 (right panel).
The following MATLAB code implements this recursive forecasting exercise:

T0 = 20;
h = 2; % h-step-ahead forecast
syhat = zeros(T-h-T0+1,3);
ytph = y(T0+h:end); % observed y_{t+h}
for t = T0:T-h
yt = y(1:t);
s = (1:t)’;
nt = length(s); % sample size
D1t = D1(1:t); D2t = D2(1:t); D3t = D3(1:t); D4t = D4(1:t);
%% S1
Xt = [ones(nt,1) s D4t];
beta1 = (Xt’*Xt)\(Xt’*yt);
yhat1 = [1 t+h D4(t+h)]*beta1;
%% S2
Xt = [ones(nt,1) s D1t D4t];
beta2 = (Xt’*Xt)\(Xt’*yt);
yhat2 = [1 t+h D1(t+h) D4(t+h)]*beta2;
%% S3
Xt = [s D1t D2t D3t D4t];
beta3 = (Xt’*Xt)\(Xt’*yt);
yhat3 = [t+h D1(t+h) D2(t+h) D3(t+h) D4(t+h)]*beta3;
%% store the forecasts
syhat(t-T0+1,:) = [yhat1 yhat2 yhat3];
end
MSFE1 = mean((ytph-syhat(:,1)).^2);
MSFE2 = mean((ytph-syhat(:,2)).^2);
MSFE3 = mean((ytph-syhat(:,3)).^2);
It is also worth noting that the absolute forecast performance of all the specifications is
quite poor — the root mean squared forecast error (the square root of the MSFE) is about
3.3 billion U.S. dollars, or about 10% of the average retail e-commerce sales in the sample. A closer look at
Figure 3.3 also reveals that the forecasts consistently underestimate the actual values in
the first half of the sample, and overestimate the sales just after the global financial crisis.
The latter point might be expected as all three specifications considered can only allow
for a trend and seasonal variation, but not cyclical behavior (business cycle).
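The recursive (expanding-window) exercise above generalizes to any OLS specification. As an illustrative Python sketch (the helper name `expanding_window_msfe` and the exactly linear test series are made up for this example), a linear-trend specification forecasts an exactly linear series perfectly, so its MSFE is zero:

```python
import numpy as np

def expanding_window_msfe(y, h, T0, make_X):
    """Recursive h-step-ahead MSFE: at each origin t, fit OLS on data up to t,
    forecast y_{t+h}, then average the squared forecast errors."""
    T = len(y)
    fcasts, actuals = [], []
    for t in range(T0, T - h + 1):           # forecast origins t = T0, ..., T-h
        Xt = make_X(np.arange(1, t + 1))     # regressors for the estimation sample
        beta = np.linalg.lstsq(Xt, y[:t], rcond=None)[0]
        xf = make_X(np.array([t + h]))       # regressors at the forecast date
        fcasts.append(float((xf @ beta)[0]))
        actuals.append(y[t + h - 1])         # observed y_{t+h}
    err = np.array(actuals) - np.array(fcasts)
    return np.mean(err ** 2)

# On an exactly linear series, a linear-trend model forecasts perfectly
y = 2.0 + 0.5 * np.arange(1, 41)
make_X = lambda s: np.column_stack([np.ones(len(s)), s])   # intercept + trend
print(round(expanding_window_msfe(y, h=2, T0=10, make_X=make_X), 12))  # 0.0
```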
The basic idea of smoothing methods is to construct forecasts as weighted averages of past
observations, where more weight is given to recent observations than to distant ones. By
computing weighted averages as forecasts, we are “smoothing” the observed time series.
The “exponential” part of the name comes from the fact that the weights not only decrease
as we go back in time, but do so exponentially.
Consider the local level model

yt = µt + εt ,          (3.11)
µt = µt−1 + ηt ,        (3.12)

where the εt ’s are iid N(0, σε²), the ηt ’s are iid N(0, ση²), and εt and ηs are independent
for every s and t. If we observed µ1 , . . . , µT , we could compute
E(yT +h | µ1 , . . . , µT ) = E(µT +h | µ1 , . . . , µT ) = µT
as our h-step-ahead point forecast for yT +h . But of course we do not observe the current
level µT , and therefore we cannot use it as our point forecast. In fact, the local level
model (3.11)–(3.12) is an example of a state-space model, and estimation of such models
is quite involved. We will discuss likelihood-based estimation in Chapter 6; for now we
present an easy smoothing method to obtain reasonable forecasts.
Let’s say we start with a reasonable forecast for yt given the historical information available
at time t − 1. We call this forecast “the level” Lt−1 , i.e., ŷt | t−1 = Lt−1 . After observing
yt , we update the level by

Lt = Lt−1 + α(yt − Lt−1 )          (3.13)

for some smoothing constant 0 ≤ α ≤ 1.
The reason for doing this is that we want to adjust the current level Lt to correct for
the forecast error (yt − Lt−1 ) — if we underestimate yt , i.e., yt > Lt−1 , we make the
next forecast Lt larger; if we overestimate yt , we make Lt smaller — and the size of the
adjustment depends on α. By simple algebra, we can write (3.13) equivalently as
Lt = αyt + (1 − α)Lt−1 .
Hence, the new level Lt can be viewed as a weighted average of the old level Lt−1 and the
new observation yt . The parameter α is called the smoothing parameter. If α is large,
we give more weight to the new information. In the extreme case of α = 1, we do not
learn from history at all and just use the most recent observation yt as our forecast. If α
is small, we give more weight to the historical information, and in the extreme case where
α = 0, we do not learn from new information at all, and our forecast never changes. The
updating formula for this simple exponential smoothing method is quite straightforward.
The only thing we need is a starting point or initial value L1 . One popular choice is simply
L1 = y1 , or one could use the average of the first few observations.
Algorithm 3.2 (Simple exponential smoothing). Set L1 = y1 . For t = 2, . . . , T , compute

Lt = αyt + (1 − α)Lt−1 .
The “smoothing” part of the method is now obvious. Let’s discuss the “exponential” part.
To this end, suppose 0 < α < 1 and let’s write the first few levels in terms of observations
y1 , y2 , . . . as follows:
L1 = y 1 ,
L2 = αy2 + (1 − α)L1 = αy2 + (1 − α)y1 ,
L3 = αy3 + (1 − α)L2 = αy3 + α(1 − α)y2 + (1 − α)2 y1 .
In fact, one can show by mathematical induction that
Lt = αyt + α(1 − α)yt−1 + α(1 − α)2 yt−2 + · · · + (1 − α)t−1 y1 . (3.14)
In other words, the simple exponential smoothing in Algorithm 3.2 takes a weighted aver-
age of all previous observations to forecast the next observation, where more weight is
given to more recent observations. In addition, the weights decline exponentially as we
go further back in history. In fact, the larger α is, the smaller the weights on distant
observations.
Exercise 3.5. Prove that
Lt = αyt + α(1 − α)yt−1 + α(1 − α)2 yt−2 + · · · + (1 − α)t−1 y1 .
To make a one-step-ahead forecast for yT +1 , we can simply use LT . How about forecasting
yT +2 ? To forecast yT +2 with the exponential smoothing method, we could replace yT +1
with the forecast ŷT +1 | T in the updating formula:

ŷT +2 | T = αŷT +1 | T + (1 − α)LT = αLT + (1 − α)LT = LT .
More generally, the h-step-ahead forecast under simple exponential smoothing is LT ,
i.e., ŷT +h | T = LT .
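The recursion and the exponential-weights representation (3.14) are easy to check numerically. Here is a short Python sketch (the function name and the five data points are illustrative inventions):

```python
import numpy as np

def simple_exp_smoothing(y, alpha):
    """Recursive levels: L_1 = y_1, then L_t = alpha*y_t + (1-alpha)*L_{t-1}."""
    L = [y[0]]
    for yt in y[1:]:
        L.append(alpha * yt + (1 - alpha) * L[-1])
    return np.array(L)

y = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
alpha = 0.3
L = simple_exp_smoothing(y, alpha)

# Closed form (3.14): weights on y_t, y_{t-1}, ..., y_2 are alpha*(1-alpha)^j,
# and the weight on y_1 is (1-alpha)^{t-1}
t = len(y)
weights = np.array([alpha * (1 - alpha) ** j for j in range(t - 1)]
                   + [(1 - alpha) ** (t - 1)])
closed_form = weights @ y[::-1]
print(np.isclose(L[-1], closed_form))   # True
# The h-step-ahead forecast is flat at the last level L_T, for any h
```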
The local level model (3.11)–(3.12) only wanders randomly up and down, and does not
have a deterministic trend. Consequently, the simple exponential smoothing in Algorithm 3.2
is suitable for data that do not seem to have a deterministic trend. Now imagine that we
have not only a slowly evolving local level, but also a trend with a slowly evolving local
slope:
yt = µt + λt t + εt , εt ∼ N(0, σε2 )
µt = µt−1 + ηt , ηt ∼ N(0, ση2 )
λt = λt−1 + νt , νt ∼ N(0, σν2 ),
where the error terms are independent of each other at all leads and lags. To forecast yt ,
the Holt-Winters smoothing method suggests a two-component approach. Specifically,
the forecast ŷt | t−1 consists of two components, a “level” Lt−1 as before, and a “slope” bt−1 :

ŷt | t−1 = Lt−1 + bt−1 .
The updating formula for Lt is the same as before, except that the relevant one-step-ahead
forecast is now ŷt | t−1 = Lt−1 + bt−1 :

Lt = αyt + (1 − α)(Lt−1 + bt−1 ).

The updating formula for bt is again a weighted average of new and old information, namely,
the current change in level (Lt − Lt−1 ) and the previous slope bt−1 :

bt = β(Lt − Lt−1 ) + (1 − β)bt−1 ,

for a second smoothing parameter 0 ≤ β ≤ 1. Lastly, to initialize the smoother we set
L1 = y1 and b1 = y2 − y1 .
The one-step-ahead forecast is

ŷT +1 | T = LT + bT .

For the two-step-ahead forecast, we update the level and slope using ŷT +1 | T in place of
the unobserved yT +1 :

LT +1 | T = αŷT +1 | T + (1 − α)(LT + bT ) = LT + bT ,
bT +1 | T = β(LT +1 | T − LT ) + (1 − β)bT = bT ,

so that

ŷT +2 | T = LT +1 | T + bT +1 | T = LT + 2bT .

More generally,

ŷT +h | T = LT + hbT ,

for h > 0.
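Putting the updating and forecasting formulas together, here is a minimal Python sketch of this two-component (Holt) smoother, with the initialization L1 = y1 and b1 = y2 − y1 as in the notes (the function name and the exactly linear test series are made up). On an exactly linear series the smoother tracks the trend, so the h-step forecast LT + h·bT is exact:

```python
def holt_forecast(y, alpha, beta, h):
    """Holt-Winters without seasonality: L_1 = y_1, b_1 = y_2 - y_1, then
    update the level and slope for each new observation."""
    L, b = y[0], y[1] - y[0]
    for yt in y[1:]:
        newL = alpha * yt + (1 - alpha) * (L + b)   # level update
        b = beta * (newL - L) + (1 - beta) * b      # slope update
        L = newL
    return L + h * b          # h-step-ahead forecast L_T + h*b_T

# On y_t = 5 + 1.5 t the smoother recovers the slope 1.5 exactly
y = [5.0 + 1.5 * t for t in range(1, 21)]
print(holt_forecast(y, alpha=0.5, beta=0.5, h=4))   # 5 + 1.5*24 = 41.0
```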
T0 = 40;
h = 4; % h-step-ahead forecast
syhat = zeros(T-h-T0+1,1);
ytph = y(T0+h:end); % observed y_{t+h}
alpha = .5; beta = .5; % set values for smoothing parameters
Lt = y(1); bt = y(2) - y(1);
for t = 2:T-h
newLt = alpha*y(t) + (1-alpha)*(Lt+bt);
newbt = beta*(newLt-Lt) + (1-beta)*bt;
yhat = newLt + h*newbt;
Lt = newLt; bt = newbt; % update Lt and bt
if t>= T0 % store the forecasts for t >= T0
syhat(t-T0+1,:) = yhat;
end
end
MSFE = mean((ytph-syhat).^2);
We compute the MSFE for each set of smoothing parameters with h = 4 (one-year-
ahead forecast) and h = 20 (five-year-ahead forecast). In order to compare with the results
in Example 2.12, we also start the recursive forecasting exercises from the 40-th period
(T0 = 40). We report the MSFEs for the Holt-Winters smoothing methods in Table 3.3,
along with the MSFEs under the quadratic trend specification as a comparison. It is
interesting to note that for short-horizon forecasts (h = 4), the performance of the Holt-
Winters smoothing method is much better, with forecast errors only about half of those
under the quadratic trend specification. However, its performance is much worse for long-horizon
forecasts (h = 20).
To accommodate data with seasonality, we simply add an extra component to the point
forecast ŷt | t−1 :

ŷt | t−1 = Lt−1 + bt−1 + St−s ,
where s is the periodicity of seasonality, i.e., s = 4 for quarterly data and s = 12 for
monthly data. The level, slope and seasonal components are then updated period by
period, with a third smoothing parameter γ governing the seasonal component. To
initialize the smoother, we set

Ls = (y1 + · · · + ys )/s,
bs = 0,
Si = yi /Ls for i = 1, . . . , s.

The one-step-ahead forecast is then

ŷT +1 | T = LT + bT + ST +1−s .
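Algorithm 3.4 itself is not reproduced above, so the following Python sketch shows one common variant with an additive seasonal component (an assumption on our part: the notes initialize Si = yi /Ls multiplicatively, while here we use the additive analogue Si = yi − Ls so that the updates are internally consistent; the function name and the periodic test series are made up). On a purely periodic series the forecast reproduces the pattern exactly:

```python
def holt_winters_additive(y, s, alpha, beta, gamma, h):
    """Additive-seasonal Holt-Winters sketch: initialize with the first s
    observations, then update level, slope and seasonal each period."""
    L = sum(y[:s]) / s                     # L_s = average of first s observations
    b = 0.0                                # b_s = 0
    S = [yi - L for yi in y[:s]]           # additive seasonal factors S_1, ..., S_s
    for t in range(s, len(y)):
        newL = alpha * (y[t] - S[t - s]) + (1 - alpha) * (L + b)
        b = beta * (newL - L) + (1 - beta) * b
        S.append(gamma * (y[t] - newL) + (1 - gamma) * S[t - s])
        L = newL
    # h-step-ahead forecast: level + h*slope + matching seasonal factor
    return L + h * b + S[len(y) - s + ((h - 1) % s)]

# A purely periodic series with s = 2: level 15, seasonal swings of -5/+5
y = [10.0, 20.0] * 4
print(holt_winters_additive(y, s=2, alpha=0.5, beta=0.5, gamma=0.5, h=1))  # 10.0
```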
Example 3.6 (Forecasting U.S. retail e-commerce sales with Holt-Winters smooth-
ing). In Example 3.4 we compared the forecast performance of three seasonality speci-
fications for forecasting U.S. retail e-commerce sales. We found that although the three
specifications gave very similar forecasts, their absolute forecast performance was quite
bad. In this example we revisit the pseudo out-of-sample forecasting exercise, and see if
the Holt-Winters smoothing method with seasonality gives better forecasts.
MSFEs under the S1 specification and the Holt-Winters smoothing method with two sets
of smoothing parameters:

             S1     α = β = γ = 0.5     α = β = γ = 0.8
h = 1     11.04                5.55               10.38
h = 2     12.47                8.64               21.53
Figure 3.4: Holt-Winters smoothing method forecasts with h = 1 (left panel) and h = 2
(right panel).
In Figure 3.4 we also plot the out-of-sample forecasts under the Holt-Winters smoothing
method. Comparing with the forecasts in Figure 3.3, the Holt-Winters smoothing method
with α = β = γ = 0.5 seems to better trace the actual sales values than the seasonality
specifications.
The following MATLAB code implements Algorithm 3.4 to compute the forecasts:
T0 = 20;
h = 1; % h-step-ahead forecast
s = 4;
syhat = zeros(T-h-T0+1,1);