
EMET3007/EMET8012

Business and Economic Forecasting

Joshua Chan
Research School of Economics,
Australian National University

August 12, 2012


Contents

1 A MATLAB Primer
  1.1 Matrices and Matrix Operations
  1.2 Some Useful Built-in Functions
  1.3 Flow of Control
  1.4 Function handles and function files
  1.5 Graphics
  1.6 Handling Sparse Matrices
  1.7 Optimization Routines
  1.8 Further Readings and References

2 Principles of Forecasting
  2.1 Data Generating Process
  2.2 Forecast Horizon
  2.3 Point, Interval and Density Forecast
  2.4 Loss Function
  2.5 Evaluation of Forecast Models
  2.6 Further Readings and References

3 Ad Hoc Forecasting Methods
  3.1 The Ordinary Least Squares
    3.1.1 Digression: Matrix Differentiation
    3.1.2 OLS Computation
    3.1.3 Properties of the OLS Estimator
  3.2 Application: Modeling Seasonality
  3.3 Exponential Smoothing
    3.3.1 Simple Additive Smoothing
    3.3.2 Holt-Winters Smoothing
    3.3.3 Holt-Winters Smoothing with Seasonality
  3.4 Further Readings and References

4 Elements of Mathematical Statistics
  4.1 Common Probability Distributions
  4.2 Likelihood
  4.3 The Maximum Likelihood Estimator and Its Properties
  4.4 Maximum Likelihood Computation
    4.4.1 Grid Search
    4.4.2 Maximization Based on Functional Evaluations
    4.4.3 Concentrating the Likelihood Function
    4.4.4 Newton-Raphson Method
  4.5 Further Readings and References

5 Autoregressive Moving Average Models
  5.1 Covariance Stationarity
  5.2 Wold Representation Theorem
  5.3 Autoregressive Models
    5.3.1 Moving Average Models

6 State Space Models
  6.1 Unobserved Component Model
  6.2 Estimation
  6.3 Forecasting
  6.4 Further Readings and References


Chapter 1

A MATLAB Primer

MATLAB, a portmanteau of MATrix LABoratory, is an interactive matrix-based program


for numerical computations. It is a very easy to use high-level language that requires
minimal programming skills. The purpose of this chapter is to introduce you to some
basic MATLAB functions that are used in later chapters. For more detailed information
and full documentation, please visit the official documentation site

http://www.mathworks.com/help/techdoc/.

In addition, typing help followed by a function name gives information about that function.
In general, whenever you need to look up information about a specific function
or topic, just press F1 for the MATLAB Help. Alternatively, select Help -> Product
Help in the toolbar in the MATLAB command window.

1.1 Matrices and Matrix Operations

The most fundamental objects in MATLAB are, not surprisingly, matrices. For instance,
to create the 1 × 3 row vector a, enter into the MATLAB command window:

>> a = [1 2 3]

MATLAB should return

a =
1 2 3


To create a matrix with more than one row, use semi-colons to separate the rows. For
example,

A = [1 2 3; 4 5 6; 7 8 9];

creates the 3 × 3 matrix A. It is worth noting that MATLAB is case sensitive for variable
names and built-in functions. That means MATLAB treats a and A as different objects.
To display the i-th element in a vector x, just type x(i). For example,

>> a(2)

ans =

     2

refers to the second element in a. Similarly, one can access a particular element in a matrix
X by specifying its row and column number (row first followed by column). For instance,

>> A(2,3)

ans =

     6

displays the (2, 3)-entry of the matrix A. Alternatively, one can also refer to the elements
of a matrix with a single subscript, e.g., A(i). This is because MATLAB stores matrices
as a single column — composed of all of the columns from the matrix, each appended
to the previous. Thus, A(8) and A(2,3) refer to the same element. To display multiple
elements in the matrix, one can use expressions involving colons. For example,

>> A(1,1:2)

ans =

1 2

displays the first and second element in the first row, whereas

>> A(:,2)

ans =


2
5
8

displays all the elements in the second column.


Exercise 1.1. Investigate the colon operator by typing help colon. What do you get if
you type A(1:2:end)? Do you understand why?

To perform numerical computation, we will need some basic matrix operations. In MAT-
LAB, the following matrix operations are available:

+   addition
-   subtraction
*   multiplication
\   left division
/   right division
^   power
'   transpose

For example,

>> a’

ans =

1
2
3

and

>> a*A

ans =

30 36 42

give respectively the transpose of a and the product of a and A. Other operations are
obvious except for the matrix divisions \ and /. If B is an invertible square matrix and
b is a compatible vector, then x = B\b is the solution of Bx = b and x = b/B is the
solution of xB = b. In other words, B\b gives the same result (in principle) as B⁻¹b,
though they compute their results in different ways. Specifically, the former solves the
linear system Bx = b for x by Gaussian elimination, whereas the latter first computes


the inverse B⁻¹, then multiplies it by b. As such, the second method is in general slower
as computing the inverse of a matrix is time-consuming (and inaccurate).
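
As a quick check of this point, the following sketch (the small invertible matrix here is arbitrary) solves the same system both ways and confirms that the two answers agree up to rounding error:

B = [2 1 0; 1 3 1; 0 1 2];   % an arbitrary small invertible matrix
b = [1; 2; 3];
x1 = B\b;                    % solve Bx = b by Gaussian elimination
x2 = inv(B)*b;               % compute the inverse first, then multiply by b
max(abs(x1 - x2))            % difference is zero up to rounding error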

It is important to note that although addition and subtraction are element-wise operations,
the other matrix operations listed above are not — they are matrix operations. For
example, A^2 gives the square of the matrix A, not a matrix whose entries are the squares
of those in A. One can make the operations *, \, /, and ^ operate element-wise by
preceding them with a full stop. For example,

>> A^2

ans =

30 36 42
66 81 96
102 126 150

gives the square of the matrix A. On the other hand

>> A.^2

ans =

1 4 9
16 25 36
49 64 81

computes the squares element-wise.


Exercise 1.2. Solve the following linear system for x:
   
    [ 1  0  2 ]       [  7 ]
    [ 3  1  0 ] x  =  [  5 ] .
    [ 0  2  3 ]       [ 13 ]

1.2 Some Useful Built-in Functions

In this section we list some useful built-in functions which will be used in later chapters.
Should you want to learn more about a specific function, say, eye, just type help eye in
the command window.

Matrix building functions:


eye create an identity matrix


zeros create a matrix of zeros
ones create a matrix of ones
diag create a diagonal matrix or extract the diagonal from a matrix
rand uniformly distributed pseudorandom numbers in (0,1)
randn standard-normally distributed pseudorandom numbers.

For example, eye(n) creates an n × n identity matrix, and ones(n,m) produces an n × m


matrix of ones. Given the 3 × 3 matrix A,

>> diag(A)

ans =

1
5
9

extracts the diagonal of the matrix A. But for the 1 × 3 vector a,

>> diag(a)

ans =

1 0 0
0 2 0
0 0 3

builds a diagonal matrix whose main diagonal is a.


Exercise 1.3. Investigate the function diag. Use only diag to construct the matrix
 
    [ 1  4  0 ]
    [ 4  2  5 ] .
    [ 0  5  3 ]

Vector and matrix functions:

exp exponential log natural log


sqrt square root abs absolute value
sin sine cos cosine
sum sum prod product
max maximum min minimum
chol Cholesky factorization inv inverse
det determinant size size


If x is a vector, sum(x) returns the sum of the elements in x. For a matrix X, sum(X)
returns a row vector consisting of sums of each column, while sum(X,2) returns a column
vector of sums of each row. For example,

>> sum(A)

ans =

12 15 18

>> sum(A,2)

ans =

6
15
24

compute the sums of each column and row, respectively. For a positive definite matrix X,
chol(X) returns the Cholesky factorization C such that C′ C = X. For example,

>> B = [ 1 2 3; 0 4 5; 0 0 6];
>> chol(B’*B)

ans =

1 2 3
0 4 5
0 0 6

gives the Cholesky factor of B′ B, which is, of course, B.

Exercise 1.4. Find the determinant of the matrix B by computing the product of the
diagonal elements. Check your result using det.

1.3 Flow of Control

MATLAB has the usual control flow statements such as if-then-else, while and for,
and they are used in the same way as in other languages. For instance, the general form
of a simple if statement is


if condition
statements
end

The statements will be executed if the condition is true. Multiple branching is done by
using elseif and else. For example, the following code simulates rolling a four-sided
die:

u = rand;
if u <= .25
disp(’1’);
elseif u <= .5
disp(’2’);
elseif u <= .75
disp(’3’);
else
disp(’4’);
end

Exercise 1.5. Write a MATLAB program to simulate flipping a loaded coin where the
probability of getting a head is 0.8.

The general form of a while loop is

while condition
statements
end

The statements will be repeatedly executed while the condition remains true. To illustrate
the while loop syntax, suppose we wish to generate a standard normal random variable
left truncated at 0 (i.e., positive). We can do that using the following simple while loop:

u = randn;
while u <= 0
u = randn;
end

Exercise 1.6. Luke and Leia are playing a coin-flipping game with a supposedly fair coin.
Each time it comes up heads, Luke pays Leia one dollar; if it comes up tails, Leia pays Luke one
dollar. At the beginning of the game Luke has ten dollars and Leia five. The game is over
when either one is broke. Write a MATLAB program to simulate one such game.


Another useful control flow statement is the for loop, whose general form is

for count
statements
end

Unlike a while loop, the for loop executes the statements for a pre-fixed number of times.
As an example, the following codes generate five draws from the truncated standard normal
distribution.

x = zeros(1,5); %% create a storage vector


for i=1:5
u = randn;
while u <= 0
u = randn;
end
x(i)=u;
end
>> x

x =

1.4900 2.2339 0.7921 0.4840 0.1044

Exercise 1.7. Consider the coin-flipping game between Luke and Leia in Exercise 1.6.
Simulate that game 1000 times and record the winner. How many times does Luke win?

1.4 Function handles and function files

In previous sections we have introduced some built-in functions in MATLAB. For instance,
sqrt is a function that takes an argument and returns an output (its square root). Later on
we will need to create our own functions that take one or more input arguments, operate
on them, and return the results. One way to create new functions is through function
handles. For example,

f = @(x) x.^2 + 5*x - 10 ; % Note the use of the dot


creates the function f(x) = x² + 5x − 10 with the function handle f.¹ To evaluate f(10),
we can type feval(f,10) or more simply,

>> f(10)

ans =

140

Function handles can be passed to other functions as inputs. For instance, if we want to
find the minimum point of f(x) in the interval (−10, 10), we can use the built-in function
fminbnd:

>> [xmin fmin] = fminbnd(f,-10, 10)

xmin =

-2.5000

fmin =

-16.2500

Note that fminbnd takes three inputs (a function handle, and the two end-points of the
interval) and returns two outputs (the minimizer and the minimal value). In our example,
the minimizer of f(x) in (−10, 10) is −2.5, where its value is −16.25. Creating a function
that takes more than one input is just as easy. For example,

g = @(x,y) x.^2 + y.^3 + x.*y

defines the two-variable function g(x, y) = x² + y³ + xy.

Exercise 1.8. Create a function in MATLAB that can be used to evaluate the standard
normal density
    φ(x) = (2π)^(−1/2) e^(−x²/2).

What is the value of φ(2)?


¹ Technically, the above command creates an anonymous function with a function handle called f. The
function handle gives you a means of invoking the function.


For more complex functions that involve multiple lines and intermediate variables, we need
the command function. For example, the following codes take a column vector of data
and compute its mean and standard deviation:

% function STAT
% INPUT: a column vector of data x
% OUTPUT: mean and standard deviation of x.
function [meanx, stdevx] = stat(x)
n = length(x);
meanx = sum(x)/n;
stdevx = sqrt(sum(x.^2)/n - meanx.^2);

It is important to note that all the codes must be written and saved in a separate m-file.
Also, the name of the file should coincide with the name of the function; in this case the
file must be called stat.m. After saving the file, it can be used the same way as other
built-in functions, for example:

>> [meanx stdx] = stat(randn(100,1))

meanx =

0.0530

stdx =

0.9902

Exercise 1.9. Write a function in MATLAB that takes two series of the same length, say,
x = (x1 , . . . , xT )′ and y = (y1 , . . . , yT )′ , and computes their sample correlation coefficient
rxy defined as
    rxy = Σ_{t=1}^T (xt − x̄)(yt − ȳ) / √( Σ_{t=1}^T (xt − x̄)² Σ_{t=1}^T (yt − ȳ)² ),

where x̄ and ȳ are respectively the sample means of x and y.


Exercise 1.10. Modify the codes in Exercise 1.9 to compute the sample p-th order au-
tocorrelation coefficient of a series y = (y1 , . . . , yT )′ :
    rp = Σ_{t=p+1}^T (yt − ȳ)(yt−p − ȳ) / √( Σ_{t=p+1}^T (yt − ȳ)² Σ_{t=p+1}^T (yt−p − ȳ)² ),

where ȳ is the sample mean of y.


1.5 Graphics

MATLAB has several high-level graphical routines and very extensive plotting capabilities.
It allows users to create various graphical objects including two- and three-dimensional
graphs. In addition, one can also have a title on a graph, add a legend, change the font
and font size, label the axis, etc. For more information, in the command window click on
Help and next select Demos. Then choose Graphics followed by 2-D Plots.


Figure 1.1: A plot of the graph y = sin(x) from 0 to 2π.

In MATLAB the most basic function used to create 2-D graphs is plot. For example, to
make a graph of y = sin(x) on the interval from x = 0 to x = 2π, we use the following
codes:

x = 0:.01:2*pi;
y = sin(x);
plot(x,y);
xlabel(’x’);
ylabel(’y’);

The graph produced is reported in Figure 1.1. Note that the command x = 0:.01:2*pi;
creates a vector with components ranging from 0 to 2π in steps of 0.01. Another useful
command to create a grid is linspace (use help linspace to learn more about this
function).
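
For instance, the following two commands (the number of grid points in the second is arbitrary) build comparable grids with the colon operator and with linspace:

x1 = 0:.01:2*pi;              % grid from 0 to 2*pi in steps of 0.01
x2 = linspace(0,2*pi,500);    % grid of 500 equally spaced points from 0 to 2*pi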

Exercise 1.11. Investigate the function plot. Try plot(x,y,’r’,’linewidth’,2).


Exercise 1.12. Plot the standard normal density function in Exercise 1.8 from −3 to 3.

Another useful function is hist, which allows us to plot histograms. For example,

hist(randn(1000,1),50);

creates a histogram where the 1000 standard normal draws are put into 50 equally spaced
bins.


Figure 1.2: A histogram of 1000 standard normal draws.

Exercise 1.13. Generate 1000 standard normal draws and use the function histc to
count the number of draws that fall into the 5 bins: (−∞, −2), [−2, −1), [−1, 1), [1, 2) and
[2, ∞).

It is often desirable to plot several graphs in the same figure window. For this purpose
we need the function subplot. The function subplot(i,j,k) takes three arguments: the
first two tell MATLAB that an i × j array of plots will be created, and the third is the
running index that indicates the k-th subplot is currently generated. Suppose we wish
to plot the functions y = sin(0.5x²) and y = sin(2x) in the same figure window. A little
modification of the above codes will accomplish this goal:

x = 0:.01:2*pi;
y1 = sin(x.^2/2);
y2 = sin(2*x);
subplot(1,2,1); plot(x,y1); title(’y = sin(0.5x^2)’);
subplot(1,2,2); plot(x,y2); title(’y = sin(2x)’);


and the results are reported in Figure 1.3.


Figure 1.3: Plots of the graphs y = sin(0.5x²) and y = sin(2x) from 0 to 2π.

In addition, one can also easily produce 3-D graphical objects in MATLAB. To illustrate
various useful routines, suppose we want to plot the density function of the bivariate
normal distribution (see Section 4.1) given by

    f(x, y; ρ) = 1/(2π√(1 − ρ²)) exp( −(x² − 2ρxy + y²) / (2(1 − ρ²)) ).

As in plotting a 2-D graph, we first need to build a grid, and this can be done with the
function meshgrid. After computing the values of the function at each point on the grid,
we can plot the 3-D graph using mesh. For example, we use the following codes

rho = .6;
[x y] = meshgrid(-2:.1:2, -2:.1:2); % to build a 2D grid
z = 1/(2*pi*sqrt(1-rho^2)) * exp(-(x.^2 -2*rho*x.*y + y.^2)/(2*(1-rho^2)));
mesh(x,y,z);

to plot the bivariate normal density function with ρ = 0.6 in Figure 1.4.



Figure 1.4: The density function of the bivariate normal distribution with ρ = 0.6.

Moreover, a contour plot can be easily produced by the function contour:

contour(x,y,z);


Figure 1.5: A contour plot of the bivariate normal density function with ρ = 0.6.


Exercise 1.14. Plot the bivariate t density function (see Section 4.1) given by
    g(x, y; ν) = 1/(2π) (1 + (x² + y²)/ν)^(−(ν/2 + 1)).
Create also a contour plot.

1.6 Handling Sparse Matrices

A sparse matrix is simply a matrix that contains a large proportion of zeros. As such,
computations involving sparse matrices can typically be done much faster than those
involving full matrices. In addition, as most of the elements in a sparse matrix are zeros,
storage cost of a sparse matrix is also often small. In later chapters we will deal with some
large but sparse matrices. So it would be useful to first discuss how to handle them in
MATLAB.

Perhaps the most useful function for creating sparse matrices is sparse. For example,
suppose the 4 × 5 matrix

    W = [ 1  0  0  0  0 ]
        [ 0  1  0  0  0 ]
        [ 0  0  2  0  0 ]
        [ 0  0  0  3  1 ]
is stored as a full matrix in MATLAB. The command sparse(W) converts W to sparse
form by squeezing out any zero elements:

>> sparse(W)

ans =

(1,1) 1
(2,2) 1
(3,3) 2
(4,4) 3
(4,5) 1

Notice that only the non-zero elements in W are stored. In general, we can create a matrix
S by the command S = sparse(i,j,s,m,n), which uses vectors i, j, and s to generate
an m × n sparse matrix such that S(i(k), j(k)) = s(k). For example, to create the matrix
W above, we first need to build a vector s that stores all the non-zero elements:

s = [1 1 2 3 1]’;


Next, we create a vector i that stores the row position for each element in s. For example,
the first element in s should be in the first row, the second element in second row, and so
on. We then do the same thing for the column positions and store them in the vector j:

i = [1 2 3 4 4]’;
j = [1 2 3 4 5]’;

Finally,

W = sparse(i,j,s,4,5);

creates the 4 × 5 matrix W above.


Exercise 1.15. Create the matrix
 
    H = [  1   0   0   0   0 ]
        [ -1   1   0   0   0 ]
        [  0  -1   1   0   0 ]
        [  0   0  -1   1   0 ]
        [  0   0   0  -1   1 ]

using the function sparse.

There are several useful built-in functions for creating special sparse matrices. For exam-
ple,

I = speye(100);

creates the 100 × 100 sparse identity matrix. Of course we can accomplish the same goal
by using

I = sparse(1:100,1:100,ones(1,100));

though the latter is more clumsy. Another useful function is spdiags, the sparse version of
diag, that can be used to extract and create sparse diagonal matrices. Use help spdiags
to learn more about this function.
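
As a small illustration (the diagonal values below are arbitrary), spdiags can be used both to build a sparse diagonal matrix and to extract its non-zero diagonals:

e = (1:5)';              % values to place on the main diagonal
D = spdiags(e, 0, 5, 5); % 5 x 5 sparse matrix with 1,...,5 on the main diagonal
spdiags(D)               % extract the non-zero diagonals of D (here, the main diagonal)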
Exercise 1.16. Create the matrix H in Exercise 1.15 by using only the function spdiags.

As mentioned earlier, one main advantage of working with sparse rather than full matrices
is that computations involving sparse matrices are usually much quicker. For instance,
it takes about 2.7 seconds to obtain the Cholesky decomposition of the full 5000 × 5000
identity matrix:


tic; chol(eye(5000)); toc;


Elapsed time is 2.728245 seconds.

whereas the same operation takes only about 0.015 seconds for a sparse 5000×5000 identity matrix:

tic; chol(speye(5000)); toc;


Elapsed time is 0.014867 seconds.

1.7 Optimization Routines

MATLAB provides various built-in optimization routines. In this section we discuss some
of them that are useful in later chapters. Note that all the optimization routines in
MATLAB are framed in terms of minimization. In order to perform maximization, some
minor changes to the objective function are required. More precisely, suppose we want to
maximize the function f(x) and find a maximizer xmax = argmax_x f(x). Instead of the
original maximization problem, consider minimizing −f(x) and noting that

    xmax = argmax_x f(x) = argmin_x (−f(x)).

Hence, without loss of generality, we will focus on minimization routines. One basic
minimization function is fminbnd, which finds the minimum of a single-variable function
on a fixed interval. To illustrate its usage, suppose we wish to minimize the function
f(x) = sin(x²) over the interval [0, 3] (see Figure 1.6).

Figure 1.6: A plot of f(x) = sin(x²) from 0 to 3.

After defining the function f(x) = sin(x²) using the command


f = @(x) sin(x.^2);

we pass f to fminbnd, which takes three inputs — the function handle, lower and upper
bounds of the interval — and gives two outputs — the minimizer and value of the function
evaluated at the minimizer:

[xmin fmin] = fminbnd(f,0,3);

For this example, we have

>> [xmin fmin]

ans =

2.1708 -1.0000

The function fminbnd can only be used to minimize a univariate function on a closed
interval. For multivariate minimization, one very useful function is fminsearch, which finds
the unconstrained minimum of a function of several variables. fminsearch takes two in-
puts, namely, the function handle and a starting value. Like fminbnd, fminsearch gives
two outputs: the minimizer and value of the function evaluated at the minimizer. As an
example, suppose we wish to maximize the bivariate normal pdf

    f(x1, x2; ρ) = 1/(2π√(1 − ρ²)) exp( −(x1² − 2ρx1x2 + x2²) / (2(1 − ρ²)) )

with respect to x = (x1, x2) while fixing ρ. To this end, first define g(x1, x2) = −f(x1, x2; ρ):

rho = .6;
g = @(x) -1/(2*pi*sqrt(1-rho^2)) ...
*exp(-(x(1).^2 -2*rho*x(1).*x(2) +x(2).^2)/(2*(1-rho^2)));

Note that the variable x is a 2 × 1 vector. Then, we pass g to fminsearch with starting
values, say, [1 − 1]:

[xmin gmin] = fminsearch(g, [1 -1]);

For this example, we have

>> [xmin gmin]

ans =

0.0000 0.0000 -0.1989


That is, the mode of f (x1 , x2 ; ρ) is x = (0, 0), and f (0, 0) = 0.1989.

1.8 Further Readings and References

As mentioned in the text, the official documentation site is

http://www.mathworks.com/help/techdoc/

A good place to learn more about the major functionality in MATLAB is the MATLAB
Getting Started Guide available at

http://www.mathworks.com/help/pdf_doc/matlab/getstart.pdf.



Chapter 2

Principles of Forecasting

Broadly speaking, a forecast is simply a statement about certain future observables. To


make our discussion concrete, let yt be the quantity of interest observed at time t. It might
be some measure of the rate of inflation, yield on a certain U.S. Treasury bond, AUD/USD
exchange rate, annual number of tourists arriving in Australia, etc. In typical situations
we would also have observed yt over a certain time horizon, say, from time t = 1 to t = T .
Suppose we wish to forecast the yet-to-be-observed quantity yT+h for some integer h ≥ 1.
Let ŷT+h denote the point forecast, or our "best guess", given all the available information
up to time t = T. How do we produce a reasonable point forecast ŷT+h?

One simple forecasting rule is simply to use the last observed value yT as a prediction
for yT +h . This rule of thumb is trivial to implement and might be reasonable in some
situations. However, one obvious drawback is that it does not take advantage of all
available information — it ignores all the data except the last observed value yT — and as
such, the point forecast ŷT+h = yT is likely to be poor in most applications. A better rule
of thumb might be to construct forecasts as weighted averages of past observations, where
higher weights are given to more recent observations than distant ones. Seemingly this is
more reasonable, and includes the first forecasting rule as a special case, where the weight
for the last observation is one and all other weights are zero. In fact, this is the main
idea behind a class of popular forecasting methods called exponential smoothing. Notice
that neither forecasting rule requires the user to specify a statistical model — the data
generating process for the observations y1 , . . . , yT is not specified — and we don’t know
how the data arise or what features they have. For this reason, these forecasting rules
are sometimes described as ad hoc forecasting methods. The main advantages of these ad
hoc rules of thumb are that they are seemingly reasonable, and are typically very easy to
implement. The main drawback, however, is that without a formal model it is difficult to
analyze the statistical properties of the forecasting rules.

The other approach of forecasting is model-based — before we can make a forecast for
yT+h, we must first build a statistical model that summarizes the dynamic patterns in the


historical data, and estimate the model parameters using past data. This model-based
approach forces us to think carefully about how the data are generated, and what key
features of the data we want to model. The idea is that if we have a good model that fits
observed data well, then this model might give reasonable forecasts, provided that there is
something to be learned from the past data. The main advantage of this approach is that it
gives us a framework to model relevant features of the data important in producing good
forecasts. In addition, since we have a statistical model, we can analyze the statistical
properties of the point forecast or even go beyond a mere point forecast. The price of the
model-based approach, on the other hand, is that it requires some background in statistical
inference, and the implementation is often computationally intensive.

Before we go into the details of both approaches — ad hoc methods are discussed in
Chapter 3 and model-based forecasting is covered in Chapters 4–6 — we will discuss in
this chapter various fundamental concepts in forecasting, including forecast horizon, point
vs density forecasts, direct vs iterated forecasts, loss function, and methods for comparing
non-nested forecast models. In what follows, we will first have an overview of several
popular forecasting models. Note that throughout this chapter we assume the model
parameters are known, and the estimation problem is discussed in later chapters.

2.1 Data Generating Process

A model is simply a stylized description of the phenomenon of interest. One way to


formulate a model is through a description of the underlying data generating process, i.e.,
a set of assumptions or equations that dictate how the data are produced. In this section
we will introduce several popular time-series models, or data generating processes, which
can generate certain features generally found in time-series data, such as autocorrelation,
trends and cycles. These simple specifications will serve as building blocks to help us
construct more sophisticated models later on.

Example 2.1 (AR(1) process). Suppose we generate a time series y0 , y1 , y2 , . . . , as


follows: first we set y0 = c where c is a fixed constant. Then, for t = 1, 2, . . . , we generate
yt according to the recursive equation:

yt = ρyt−1 + εt , εt ∼ N(0, σ 2 ). (2.1)

This data generating process is called a first-order autoregressive process or simply


AR(1). For the special case ρ = 1, the model in (2.1) is called a random walk. The
following MATLAB codes generate two AR(1) processes with ρ = 0.5 and ρ = 1. Both
are initialized with y0 = 1.

T = 50;
y1 = zeros(T,1); y2 = zeros(T,1);
rho1 = .5; rho2 = 1;


sig = 1;
for t=1:T
if t == 1
y1(t) = 1; y2(t) = 1;
else
y1(t) = rho1*y1(t-1) + sig*randn;
y2(t) = rho2*y2(t-1) + sig*randn;
end
end


Figure 2.1: Two realizations of the AR(1) processes with ρ = 0.5 (left panel) and ρ = 1
(right panel).

We use the above codes to generate two realizations for each AR(1) process, and the results
are plotted in Figure 2.1. Note that the two AR(1) processes behave quite differently. In
particular, the AR(1) process with ρ = 0.5 (left panel) tends to fluctuate around 0, whereas
the one with ρ = 1 has an apparent trend (right panel).
Exercise 2.1. Suppose X ∼ N(0, In ), where In is the n × n identity matrix. If we let
Y = µ + CX, show that Y ∼ N(µ, Σ), where Σ = CC′ .

Exercise 2.2. Write a MATLAB program to generate data from the following AR(2)
process with a drift:
yt = µ + ρ1 yt−1 + ρ2 yt−2 + εt , εt ∼ N(0, σ 2 ),
for t = 1, 2, . . . , T , where the process is initialized with y−1 = y0 = µ. Plot one realization
of the AR(2) process with T = 50, µ = 5, ρ1 = 0.5, ρ2 = 0.1 and σ 2 = 2.
Example 2.2 (MA(1) process). Suppose ε1 , ε2 , . . . are independent and identically
distributed (iid) N(0, σ 2 ) random variables and ε0 = 0. If we generate yt according to the
equation
yt = θεt−1 + εt , (2.2)


then {y1 , y2 , . . .} is called a first-order moving average process or simply MA(1).
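
As with the AR(1) code above, a few lines of MATLAB suffice to simulate one realization of an MA(1) process; the values θ = 0.8 and σ = 1 below are chosen purely for illustration:

T = 50; theta = .8; sig = 1;
e = [0; sig*randn(T,1)];        % e(1) is eps_0 = 0, followed by T iid N(0,sig^2) draws
y = theta*e(1:T) + e(2:T+1);    % y_t = theta*eps_{t-1} + eps_t for t = 1,...,T
plot(y);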

By construction, the data generated from the AR(1) and MA(1) models exhibit autocorre-
lation (and quite distinct autocorrelation patterns). In addition, the random walk process
is also able to produce data with a stochastic trend. In the following example, we describe
a class of simple models that can generate both trends and cycles.

Example 2.3 (Models with Trend and Cycle). Typical features of time-series data
include trends and cycles. In this example we study a class of simple models that can
generate both features. Suppose yt is generated via the following equation:

yt = mt + ct + εt , εt ∼ N(0, σ 2 ), (2.3)

where mt is a slowly changing function known as the trend component and ct is a


periodic function of t referred to as the cycle component.


Figure 2.2: The trend-cycle models with mt = 0 and ct = 4 cos(πt/5) + 2 sin(πt/5) (left
panel); and mt = 0.5t and ct = 4 cos(πt/5) + 2 sin(πt/5) (right panel).

As an example, consider the linear trend

mt = a0 + a1 t,

where a0 is the level and a1 is the slope of the trend. A popular choice for ct is the
following periodic function:

ct = b1 cos(ωt) + b2 sin(ωt),

where b1 and b2 control the magnitude of the cycle component whereas ω determines the
frequency with which ct completes a cycle. We use the following codes to generate two
realizations from (2.3), and the series are plotted in Figure 2.2.


T = 50; sig = 1;
t = (1:T)’; % construct the vector [1 2 3 ...]’

a0 = 0; a1 = 0; b1 = 4; b2 = 2; omega = 2*pi/10;
y1 = a0 + a1*t + b1*cos(omega*t) + b2*sin(omega*t) + sig*randn(T,1);

a0 = 0; a1 = .5; b1 = 4; b2 = 2; omega = 2*pi/10;


y2 = a0 + a1*t + b1*cos(omega*t) + b2*sin(omega*t) + sig*randn(T,1);

For the trend-cycle model on the left, the trend component is set to be zero: mt = 0, and
the time series only exhibits cyclical behavior, whereas the time series on the right has a
linear trend mt = 0.5t. Moreover, in both examples, it takes 10 (2π/ω) periods for ct to
complete one cycle.

Exercise 2.3. Write a MATLAB program to generate data from the trend-cycle model
in Example 2.3 with a quadratic trend component

mt = a0 + a1 t + a2 t2 ,

and the same cycle component

ct = b1 cos(ωt) + b2 sin(ωt).

Plot one realization of the model with T = 50, a0 = 0, a1 = 0.5, a2 = 0.1, b1 = 4, b2 =


0, ω = 2π/4 and σ 2 = 2.

In addition to the time series of interest, yt , often we observe other relevant variables. For
instance, in forecasting the international visitor arrivals in Australia, we might also have
other seemingly relevant variables such as AUD/USD exchange rate, oil price (fuel costs),
GDP growth in major tourist markets, etc. By utilizing this additional information, one
might be able to produce better forecasts. One of the popular ways to incorporate these
additional variables is via the linear regression model discussed in the following example.

Example 2.4 (Linear regression model). Let xt−1 = (xt−1,1 , . . . , xt−1,k ) be a 1×k vec-
tor of relevant variables in predicting yt . The variables xt−1 are often called regressors,
exogenous variables, or explanatory variables. For instance, if yt is the unemploy-
ment rate at time t, xt−1 might contain inflation rate, GDP growth rate and unemployment
rate at time t − 1. The linear regression model specifies the linear relationship between yt
and the regressors as follows:

yt = xt−1,1 β1 + · · · + xt−1,k βk + εt , εt ∼ N(0, σ 2 ), (2.4)

where β = (β1 , . . . , βk )′ is a k × 1 vector of regression coefficients. For later reference,


we write the above linear regression in matrix form. Specifically, by stacking (2.4) over


t = 1, . . . , T, we can write it more succinctly as


     
    [ y1 ]   [ x0,1     x0,2     · · ·   x0,k   ]          [ ε1 ]
    [ y2 ]   [ x1,1     x1,2     · · ·   x1,k   ] [ β1 ]   [ ε2 ]
    [ :  ] = [   :        :      . . .     :    ] [ :  ] + [ :  ] ,
    [ yT ]   [ xT−1,1   xT−1,2   · · ·   xT−1,k ] [ βk ]   [ εT ]

or equivalently,

    y = Xβ + ε,    ε ∼ N(0, σ²IT),                                    (2.5)

where IT is the T × T identity matrix, and

    y = [ y1 ]     X = [ x0,1     x0,2     · · ·   x0,k   ]     ε = [ ε1 ]
        [ y2 ]         [ x1,1     x1,2     · · ·   x1,k   ]         [ ε2 ]
        [ :  ]         [   :        :      . . .     :    ]         [ :  ]
        [ yT ]         [ xT−1,1   xT−1,2   · · ·   xT−1,k ]         [ εT ] .

By a simple change of variable, (2.5) implies that y follows a multivariate normal distri-
bution: y ∼ N(Xβ, σ 2 IT ). In particular, the expected value of y is Xβ.

Exercise 2.4. Write the trend-cycle model in Example 2.3 as a linear regression model
by defining X and β appropriately.

2.2 Forecast Horizon

The forecast horizon is simply the number of periods between today and the date of the
forecast. For example, if we have data on quarterly inflation rates and we want to forecast
the inflation rate one year from now, then the forecast horizon is four (four quarters), and
we speak of a four-step-ahead forecast. Should we have monthly data, a forecast one year
from now is a 12-step-ahead forecast. In general, given a particular data frequency, an
h-step-ahead forecast makes a statement about the future observable that is h periods
ahead into the future. Consideration of the forecast horizon is important for two main
reasons. First, forecast horizon affects the forecast, and how the forecast is made. For
instance, two different approaches for producing an h-step-ahead forecast with h > 1 are
available, namely, the iterated forecast and the direct forecast. We will illustrate these two
approaches via the following examples.

Example 2.5 (h-step-ahead forecasts for the AR(1) process). Consider the AR(1)
process in Example 2.1:
yt = ρyt−1 + εt , εt ∼ N(0, σ 2 ), (2.6)
and suppose we wish to make a one-step-ahead forecast, denoted as ŷT+1. Given the model
parameters θ = (ρ, σ 2 )′ and the last observed data point yT , one could use the conditional
expectation of yT +1 given θ and all the information up to time T as a point forecast.


Specifically, let It denote the information set at time t that summarizes all the available
information up to time t. Then the point forecast for yT+1 is ŷT+1 = E(yT+1 | IT, θ).
Later we will show that this conditional expectation of yT +1 is optimal in a well-defined
sense. In our AR(1) setting, the information set It is simply all the observed data up to
time t, i.e., It = {y0 , y1 , . . . , yt }. Moreover, using (2.6) with t = T + 1, we have

yT +1 = ρyT + εT +1 , εT +1 ∼ N(0, σ 2 ).

Therefore, it follows that E(yT +1 | IT , θ) = ρyT .

Now, let’s consider how we can produce a two-step-ahead forecast. Using (2.6) once again
with t = T +2, one might think that the conditional expectation E(yT +2 | IT +1 , θ) = ρyT +1
could be used as a point forecast for yT +2 . However, this approach is not feasible as it
requires the information at time t = T + 1. Particularly, we need to know yT +1 in order
to compute E(yT +2 | IT +1 , θ), but of course we only have information up to time t = T .
One feasible approach is to iterate (2.6) to get

    yT+2 = ρyT+1 + εT+2 = ρ(ρyT + εT+1) + εT+2 = ρ²yT + ρεT+1 + εT+2,

where εT+1 and εT+2 are independent N(0, σ²). By taking the expectation of the above
expression given the model parameters θ and the information set IT, we can use
E(yT+2 | IT, θ) = ρ²yT as a two-step-ahead forecast. By exactly the same reasoning, an h-step-ahead iterated
forecast is given by E(yT+h | IT, θ) = ρ^h yT.
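
As a small illustration, the iterated forecasts ρyT, ρ²yT, . . . , ρ^h yT can be computed in a single line of MATLAB; the values of ρ, yT and h below are only placeholders:

rho = .5; yT = 2; h = 4;     % illustrative parameter value, last observation and horizon
yhat = rho.^(1:h) * yT       % iterated forecasts rho*yT, rho^2*yT, ..., rho^h*yT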

An alternative approach is to change the original model in (2.6) to

    yt+h = ρ̃yt + εt+h,    εt+h ∼ N(0, σ̃²),                             (2.7)

and use E(yT+h | IT, θ̃) = ρ̃yT as an h-step-ahead direct forecast. Strictly speaking, (2.7)
is no longer an AR(1) model (though it might be viewed as a restricted AR(h) model
in which all but one of the coefficients are zero), and it has quite distinct dynamics compared to
the AR(1) model in (2.6). As such, the two approaches — iterated vs direct — might give
quite different forecasts.

For one-step-ahead forecasts, the iterated and direct forecasts coincide. But for longer
forecast horizons, the two approaches generally give distinct results. In empirical ap-
plications, it is often found that iterated forecasts perform better than direct forecasts,
especially when the forecast horizon is long. However, there are situations where computing
iterated forecasts is not feasible, and we will discuss one such case in the next
example.

Example 2.6 (h-step-ahead forecasts for linear regression model). For the linear
regression model in Example 2.4

yt = xt−1 β + εt , εt ∼ N(0, σ 2 ),


the iterated forecasting approach is not feasible for h > 1. This is because in predicting
yT +h , we need the variables xT +h−1 that are not in the information set IT available at
time t = T . Moreover, the regression model does not specify a recursive relation between
xt and xt−1 . Instead, we could follow the direct forecasting approach and consider the
following linear regression model:
    yt = xt−h β̃ + εt,    εt ∼ N(0, σ²).

We can then use the conditional expectation E(yT+h | IT, θ̃) = xT β̃, where θ̃ = (β̃′, σ²)′,
as an h-step-ahead forecast.

Exercise 2.5. Suppose we wish to make an iterated two-step-ahead forecast using the
AR(2) model in Exercise 2.2. Compute the conditional expectation E(yT +2 | IT , θ), where
IT is the information set at time t = T and θ = (µ, ρ1 , ρ2 , σ 2 )′ .

The second reason why forecast horizon is important is that it affects the types of models
used for making forecasts. For example, in forecasting GDP of a growing economy in
the near future (e.g. next quarter), short-term fluctuations over the business cycle are
important for making an accurate forecast. On the other hand, should we want to make
a long horizon forecast over, say, the next decade, short-term fluctuations become less
important. Instead, it is far more essential to model accurately the long-term trend, as
the next example illustrates.

Example 2.7 (Short- vs long-term forecast in the trend-cycle model). Consider


the trend-cycle model in Example 2.3:

yt = mt + ct + εt , εt ∼ N(0, σ 2 ),

where mt is the trend component and

ct = b1 cos(ωt) + b2 sin(ωt),

is the cycle component. It is easy to see that the cycle component ct is bounded in t. In fact,
we have

    |ct| = |b1 cos(ωt) + b2 sin(ωt)| ≤ |b1 cos(ωt)| + |b2 sin(ωt)| ≤ |b1| + |b2|.

In other words, no matter how large t becomes, the value of |ct | is at most |b1 | + |b2 |. On
the other hand, the trend component, by construction, is increasing (or decreasing). For
example, if we specify a linear trend

mt = a0 + a1 t,

then the value of |mt | is unbounded. Since |ct | is bounded while |mt | is not, this means
that |Eyt /mt | approaches 1 when t is sufficiently large. Since the trend component dom-
inates over the long run, it is paramount to model it correctly for long horizon forecasts.


To appreciate this point, suppose instead of the linear trend above, we use a different
specification for mt , say, a quadratic trend

mt = a0 + a1 t + a2 t2 .

Then we would in general have a very different forecast compared to the original linear
specification, as a quadratic trend grows (or declines) much faster than a linear one over
a long horizon.

2.3 Point, Interval and Density Forecast

In discussing the forecast horizon in Section 2.2, we focus on one particular type of forecast,
namely, point forecast. That is, we produce a single number as our best guess for the
quantity we are interested in predicting. To be more precise, we use the conditional
expectation E(yT +h | IT , θ) as our h-step-ahead forecast for the future observable yT +h ,
where IT is the information set at time t = T and θ is the set of model parameters.
However, in many situations a point forecast is not informative enough to aid decision
making. As a simple example, consider the problem of forecasting the GDP growth rate
next quarter. We might give a point forecast of 1% as our best guess. But how much
confidence do we place on this number? Or put differently, what is the variability of the
point forecast? Suppose the growth rate can be as low as -5% in the worst scenario and
5% in the best turn of events. Specifically, let’s say with probability 0.95 the growth rate
falls within the interval (−5%, 5%), and we call this an interval forecast. Even though
the point forecast represents our best guess, the additional information manifested in the
interval forecast is obviously valuable as it helps our decision making (see the discussion
on loss function in the next section).

Can we do better than an interval forecast? Or to ask the question in a different way:
what is the largest amount of information we can realistically obtain regarding the quantity
yT +h ? Since the future observable yT +h is a random variable, we can characterize all the
information about yT +h by a probability density function, and we call this a density
forecast. Now the question is, how can we obtain a suitable density function for yT +h ?
Suppose for the moment that the values of the model parameters θ are known (we will
relax this seemingly unrealistic assumption in later chapters). Since at time t = T all the
available information is summarized by the information set IT , the conditional density
f (yT +h | IT , θ) represents the best available information about yT +h at time t = T , and
we will use this as a density forecast for yT +h .

Example 2.8 (Interval and density forecasts for the AR(1) process). Consider
again the AR(1) process in Example 2.1:

yt = ρyt−1 + εt , εt ∼ N(0, σ 2 ). (2.8)


Given the time-series history IT = {y0 , y1 , . . . , yT } and the model parameters θ = (ρ, σ 2 )′ ,
recall that we showed in Example 2.5 that the conditional expectation E(yT +1 | IT , θ) is
simply ρyT , and we used this as a point forecast for yT +1 . Of course, we have much more
information than a mere point forecast. Once again we use (2.8) with t = T + 1 to obtain

yT +1 = ρyT + εT +1 , εT +1 ∼ N(0, σ 2 ).

By a simple change of variable (see Exercise 2.1), it follows that (yT +1 | yT , θ) ∼ N(ρyT , σ 2 ).
That is, a density forecast for yT +1 is the normal density f (yT +1 | IT , θ) with mean ρyT
and variance σ 2 . From this one can easily derive an interval forecast. For instance, a 95%
interval forecast is simply: (ρyT − 1.96σ, ρyT + 1.96σ).
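
As a small sketch, given values of ρ, σ and the last observation yT (the numbers below are placeholders), the point forecast and the 95% interval forecast can be computed as follows:

rho = .5; sig = 1; yT = 2;                % illustrative parameter values and last observation
yhat = rho*yT;                            % point forecast E(y_{T+1} | I_T, theta)
CI = [yhat - 1.96*sig, yhat + 1.96*sig]   % 95% interval forecast for y_{T+1}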
Exercise 2.6. Under the AR(1) model in (2.8), derive a two-step-ahead 95% interval
forecast for yT +2 .
Exercise 2.7. Instead of the normal AR(1) model, let’s consider the following Student-t
variant:
yt = ρyt−1 + εt , εt ∼ t(ν, 0, σ 2 ).
Derive a one-step-ahead point forecast and a 95% interval forecast for yT +1 .
Example 2.9 (Density forecast for the MA(1) process). Suppose we wish to obtain
a density forecast for yT +1 under the MA(1) process in Example 2.2:

yt = θεt−1 + εt , (2.9)

where ε1 , . . . , εT are iid N(0, σ 2 ) random variables and ε0 = 0. To this end, we want to
compute the conditional density f (yT +1 | IT , θ), where IT = {y1 , . . . , yT } and θ = (θ, σ 2 )′ .
We first introduce the following useful notation: for a time series z1 , z2 , . . . , zt , we let
z1:t = (z1 , z2 , . . . , zt )′ . For example, z1:T denotes the whole sample from t = 1 to t = T .
Now, note that y1:T = (y1 , . . . , yT )′ is a linear combination of normal random variables,
and therefore y1:T follows a multivariate normal distribution. More specifically, write (2.9)
in the following matrix form:
    
    [ y1 ]   [ 1  0  0  · · ·  0 ] [ ε1 ]
    [ y2 ]   [ θ  1  0  · · ·  0 ] [ ε2 ]
    [ y3 ] = [ 0  θ  1  · · ·  0 ] [ ε3 ] ,
    [ :  ]   [ :  :  :  . . .  : ] [ :  ]
    [ yT ]   [ 0  0  · · ·  θ  1 ] [ εT ]

or y1:T = Γ(θ)ε1:T, where Γ(θ) denotes the T × T lower triangular matrix above and
ε1:T ∼ N(0, σ²IT). By a change of variable, we have y1:T ∼ N(0, σ²Γ(θ)Γ(θ)′).

Now to compute the conditional density f (yT +1 | IT , θ), note that the information set IT
is nothing other than the history of the time series up to time t = T , i.e., y1:T . Hence,
f (yT +1 | IT , θ) is simply the conditional density of yT +1 given y1:T and θ, and it can
be obtained in two steps. First, we show that the joint density of (y1:T , yT +1 ) is normal.


Then, by using some standard results for the normal density, we can derive the conditional
density of yT +1 given y1:T . Now, write
    
    [ y1:T ]   [ Γ(θ)    0 ] [ ε1:T ]
    [ yT+1 ] = [ b(θ)′   1 ] [ εT+1 ] ,

where b(θ) is a T × 1 vector of zeros except for the last element, which is θ. In particular,
we have Γ(θ)b(θ) = b(θ), Γ(θ)⁻¹b(θ) = b(θ), and b(θ)′b(θ) = θ². Once again the joint
density for (y1:T, yT+1) is normal, with mean vector 0 and covariance matrix

    σ² [ Γ(θ)    0 ] [ Γ(θ)′   b(θ) ]  =  σ² [ Γ(θ)Γ(θ)′     Γ(θ)b(θ) ]
       [ b(θ)′   1 ] [ 0       1    ]        [ b(θ)′Γ(θ)′    1 + θ²   ] .

Since the joint density is normal, the conditional density for yT +1 given y1:T is also normal,
and we only need to determine the conditional mean and conditional variance. By standard
results for the normal density, we have

ybT +1 = E [yT +1 | y1:T , θ] = b(θ)′ (Γ(θ)Γ(θ)′ )−1 y1:T


= b(θ)′ Γ(θ)−1 y1:T ,
Var [yT +1 | y1:T , θ] = σ 2 (1 + θ 2 − b(θ)′ (Γ(θ)Γ(θ)′ )−1 b(θ))
= σ2 .

In other words, the conditional density f(yT+1 | IT, θ) is normal with mean ŷT+1 and
variance σ².
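
The conditional mean ŷT+1 = b(θ)′Γ(θ)⁻¹y1:T is straightforward to compute in MATLAB, for example with the sparse-matrix tools of Section 1.6. The sketch below uses simulated data and illustrative values of θ and σ in place of a real sample:

T = 50; theta = .8; sig = 1;
e = [0; sig*randn(T,1)];
y = theta*e(1:T) + e(2:T+1);                               % simulated MA(1) data y_{1:T}
Gam = spdiags([theta*ones(T,1) ones(T,1)], [-1 0], T, T);  % Gamma(theta)
b = [zeros(T-1,1); theta];                                 % b(theta)
yhat = b'*(Gam\y)                                          % conditional mean of y_{T+1}; the variance is sig^2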

Exercise 2.8. Consider the following MA(1) model with an intercept:

yt = µ + θεt−1 + εt ,

where ε1 , . . . , εT are iid N(0, σ 2 ) random variables and ε0 = 0. Derive a density forecast
for yT +1 .

Exercise 2.9. For the MA(1) model in (2.9), derive a two-step-ahead density forecast for
yT +2 .

It is important to emphasize that so far all the forecasts produced are conditioned on the
model parameters θ. For example, we use the conditional expectation E(yT +h | IT , θ) as
an h-step-ahead point forecast, where the condition set contains θ. Of course in a real
world application one never knows the values of the model parameters, and they have to
be estimated from the data. We will defer the estimation problem to later chapters, where
we will also discuss how one handles parameter uncertainty.


2.4 Loss Function

Forecasts are not made in a vacuum; they are made to guide decision making under
uncertainty. As such, in making a forecast, one needs to take into account the associated
loss if the forecast is wrong. The loss or economic cost associated with a forecast error
is summarized by the loss function. Consider, for example, the problem of predicting
whether or not it will rain tomorrow. There are only two possible outcomes: rainy or
sunny. If we think that it will rain tomorrow, we will bring an umbrella; otherwise we
will not. If it rains and we bring an umbrella, or when it is sunny and we do not have
an umbrella, then our loss is zero. On the other hand, if it rains and we do not have an
umbrella, we will get wet and incur a cost (disutility) of, say, 10 units. If it is sunny and
we bring an umbrella, though not as bad as getting wet, it is still a hassle and we put a
disutility of, say, 2 units on this forecast error.

Table 2.1: Losses associated with weather forecasting errors.

                       Rainy   Sunny
    Bring an umbrella     0       2
    No umbrella          10       0

Obviously, our action (bring an umbrella or not) will not only depend on the weather fore-
cast, but also on the costs of the forecast errors, i.e., the loss function. To see this, suppose
we take it as given that the probability of raining tomorrow is 0.2 (so the probability that
it will be sunny is 0.8). If we bring an umbrella, our expected cost is 1.6 (0×0.2 + 2×0.8),
while the expected cost of not bringing an umbrella is 2 (10×0.2 + 0×0.8). Thus if we are
minimizing our expected cost, we will bring an umbrella when we leave home tomorrow,
even though we think that it is much more likely to be sunny than rainy.
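
The same expected-cost calculation can be written compactly in MATLAB (a small sketch of the arithmetic above; the loss matrix follows Table 2.1):

L = [0 2; 10 0];      % rows: bring an umbrella / no umbrella; columns: rainy / sunny
p = [0.2; 0.8];       % probabilities of rain and sunshine
L*p                   % expected cost of each action: 1.6 and 2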

The loss function in our hypothetical example in Table 2.1 is an instance of an asymmetric
loss structure, where the losses are different under the two bad outcomes. On the other
hand, should the losses incurred be the same, we say that the loss function is symmetric.
In general, under a symmetric loss function, the error above the actual value causes the
same loss as the same magnitude of error below the actual value. Two widely popular
symmetric loss functions are the squared loss or quadratic loss L(ŷ, y) = (ŷ − y)², and
the absolute loss L(ŷ, y) = |ŷ − y|, where ŷ is the forecast and y the actual future value.
The main difference between the two loss functions is that the quadratic loss function penalizes
large forecast errors much more heavily than the absolute loss function, as illustrated in
Figure 2.3. In addition, the absolute loss function is not differentiable at 0, whereas the
quadratic loss function is differentiable on the whole real line.



Figure 2.3: Two popular symmetric loss functions: quadratic loss (left panel) and absolute
loss (right panel).

In Section 2.3 we presented various forecast objects, including point and density forecasts.
As discussed there, a density forecast essentially summarizes all that can be learned about
the future observable yT +h given the current information IT . But often it is difficult to
convey the information content in the density function f (yT +h | IT , θ). A natural question
is, given the density forecast f(yT+h | IT, θ), can we produce a point forecast ŷT+h that is
optimal in a well-defined sense? To address this question, we will first need to define what
it means to be optimal. Since the usual reason for producing forecasts is to guide decision
making, it seems reasonable to define an optimal point forecast as the one that minimizes
the expected loss with respect to a given loss function.
Example 2.10 (Optimal point forecast under quadratic loss). Given the density
forecast f(yT+h | IT, θ) and the quadratic loss function L(ŷ, y) = (ŷ − y)², what is the
point forecast that minimizes the expected loss? To find out, first note that the expected
loss can be written as

    E[L(ŷT+h, yT+h) | IT, θ] = ∫ (ŷT+h − yT+h)² f(yT+h | IT, θ) dyT+h,

where the expectation is taken with respect to the density f(yT+h | IT, θ). Now, we want
to minimize E[L(ŷT+h, yT+h) | IT, θ] with respect to ŷT+h. We assume that we can differentiate
under the integral sign. (We can do that, for instance, if f(yT+h | IT, θ) is a member of
the exponential family. If you don't know what the exponential family is, don't worry
about it.) We set the first order condition to 0 to find the critical point:

    2 ∫ (ŷT+h − yT+h) f(yT+h | IT, θ) dyT+h = 0,

    ∫ ŷT+h f(yT+h | IT, θ) dyT+h = ∫ yT+h f(yT+h | IT, θ) dyT+h,

    ŷT+h = E[yT+h | IT, θ],


where we used the fact that ∫ f(yT+h | IT, θ) dyT+h = 1, as f(yT+h | IT, θ) is a density
function. In other words, the optimal point forecast under quadratic loss is the conditional
expectation E [yT +h | IT , θ].

Exercise 2.10. For a continuous density f(yT+h | IT, θ), show that the optimal point
forecast under the absolute loss is the median mT+h, i.e., the value satisfying

    ∫_{−∞}^{mT+h} f(yT+h | IT, θ) dyT+h = 1/2.

(Hint: use the Fundamental Theorem of Calculus and the fact that ∫_{−∞}^{∞} = ∫_{−∞}^{ŷT+h} + ∫_{ŷT+h}^{∞}.)

2.5 Evaluation of Forecast Models

In any application there are typically several seemingly reasonable models that we might
use to compute forecasts, and a priori it is difficult to choose among these competing
models. For example, to forecast future GDP we might consider models with a linear or
a quadric deterministic trend, or models with stochastic trends. But how do we select
which model to use to produce forecasts? This brings us the problem of model selection,
which is of great importance in the context of forecasting. Broadly speaking, there are
two approaches to choosing the best model. Under the in-sample fitting approach, we
use the full sample to estimate the model parameters, and then compute some measures
for assessing model-fit. Typically they are based on the residuals from the forecasts, and
the two popular ones are the mean absolute error (MAE):
    MAE = (1/T) Σ_{t=1}^T |yt − ŷt|,

and the mean squared error (MSE):

    MSE = (1/T) Σ_{t=1}^T (yt − ŷt)²,

where ŷt is the estimate for yt under the model. Then we choose the model that has the
lowest MAE or MSE. The idea is that if we have a model that fits the historical data well,
it probably will produce reasonable forecasts in the future. One main criticism of focusing
on MAE or MSE as a model selection criterion is that they are prone to the problem of
over-fitting — the selected model is excessively complex relative to the amount of data
available such that it describes the idiosyncrasies in the data rather than the underlying
relationship.
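
As a minimal sketch (assuming column vectors y of observations and yhat of fitted values
from some model are already in memory), the two criteria can be computed in MATLAB as
follows.

% In-sample fit measures, given observations y and fitted values yhat
% (both T x 1 column vectors; yhat comes from whatever model is being assessed)
MAE = mean(abs(y - yhat));    % mean absolute error
MSE = mean((y - yhat).^2);    % mean squared error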

Since both MAE and MSE depend only on how well the model fits the historical data and
ignore model complexity, both criteria tend to favor complex models. In fact, we show


in Example 2.11 that if we have two nested specifications and estimate the parameters
via the ordinary least squares (OLS) method (more on this in Chapter 3), then the MSE
for the more complex model is no larger than that of the simpler one. The fact that the
more complex model is always favored — simply by virtue of having more parameters
and being more complex — is disconcerting. This would not be a problem if we
had an infinite amount of data. In practice, however, we only have a fixed number of
observations. With a finite sample, if we have a more complex model, the estimates for
the model parameters become more imprecise. Any forecasts based on these inaccurately
estimated model parameters tend to be poor. This is the main reason why overfitted
models generally have poor predictive performance.
Example 2.11 (Potential over-fitting problem). Suppose we have two trend specifi-
cations, namely, the linear trend and the quadratic trend, where the point estimates for
yt under the two specifications are respectively
ŷt¹ = a0 + a1 t,
ŷt² = b0 + b1 t + b2 t².
Observe that the linear trend specification is nested within the quadratic one, i.e., the
latter becomes the former if one sets b2 = 0. We want to show that the MSE under the
quadratic trend is always no larger than the MSE under the linear specification, if the
parameters are estimated via the ordinary least squares (OLS) method (see Chapter 3 for
details). To this end, define the function

f(c0, c1, c2) = (1/T) ∑_{t=1}^{T} (yt − c0 − c1 t − c2 t²)².

The parameters (a0, a1) and (b0, b1, b2) are estimated via the OLS method. That is,

(â0, â1) = argmin_{a0,a1} ∑_{t=1}^{T} (yt − ŷt¹)² = argmin_{a0,a1} f(a0, a1, 0),

(b̂0, b̂1, b̂2) = argmin_{b0,b1,b2} ∑_{t=1}^{T} (yt − ŷt²)² = argmin_{b0,b1,b2} f(b0, b1, b2).

In words, the OLS estimates (â0, â1), for example, are obtained by minimizing the sum
of squares ∑_{t=1}^{T} (yt − a0 − a1 t)² with respect to a0 and a1, or equivalently, by minimizing
the function f(a0, a1, 0) with respect to a0 and a1, while fixing the third argument at 0.
Also note that the MSE under the linear trend specification is MSE1 = f(â0, â1, 0), while
the MSE under the quadratic trend specification is MSE2 = f(b̂0, b̂1, b̂2). By construction,
(b̂0, b̂1, b̂2) is the minimizer of the function f, and therefore f(b̂0, b̂1, b̂2) ≤ f(c0, c1, c2) for
all values of c0, c1 and c2. In particular,

MSE2 = f(b̂0, b̂1, b̂2) ≤ f(â0, â1, 0) = MSE1.

Hence, the MSE under the quadratic trend is no larger than the MSE under the linear
one, as claimed.
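
This point is easy to illustrate numerically. The following MATLAB codes are a sketch
with simulated data (not part of the original example): they fit a linear and a quadratic
trend to the same artificial series and compare the two in-sample MSEs.

% Nested trend specifications: the richer model never has a larger in-sample MSE
rng(1); T = 100; t = (1:T)';
y = 1 + 0.05*t + randn(T,1);          % simulated data with a linear trend
X1 = [ones(T,1) t];                   % linear trend regressors
X2 = [ones(T,1) t t.^2];              % quadratic trend regressors
b1 = (X1'*X1)\(X1'*y);  MSE1 = mean((y - X1*b1).^2);
b2 = (X2'*X2)\(X2'*y);  MSE2 = mean((y - X2*b2).^2);
fprintf('MSE1 = %.4f, MSE2 = %.4f\n', MSE1, MSE2);   % MSE2 <= MSE1 always holds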


In view of the over-fitting problem, it seems reasonable to penalize models with more
parameters — in this way, if two models fit the data more or less equally well, the simpler
model with fewer parameters would then be favored. Two popular selection criteria that
explicitly take model complexity into account are the Akaike information criterion
(AIC) and the Bayesian information criterion (BIC), which are defined as follows:

AIC = MSE × T + 2k,
BIC = MSE × T + k log T,

where k is the number of model parameters and T the sample size. Given a set of competing
models, the preferred model is the one with the minimum AIC/BIC value. As is obvious
from the definitions, both criteria not only reward goodness-of-fit as measured by the
mean squared error, but also include a penalty term that is an increasing function of
the number of model parameters. Consequently, the two criteria discourage over-fitting,
and are generally better measures for selecting a forecast model than MSE or MAE. It is
also worth noting that the penalty term in BIC is generally larger than that in AIC. As
such, BIC favors more parsimonious models. Despite their difference in the penalty, AIC
and BIC often agree on the best model. If they are in conflict, it is recommended that the
more parsimonious model be selected — in our context this means using the model with
the minimum BIC.
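
As a minimal sketch of the computation (assuming the in-sample MSE, the sample size T
and the number of parameters k are already available), the two criteria as defined above
are simply:

% Information criteria as defined above (MSE, T and k assumed to be in memory)
AIC = MSE*T + 2*k;
BIC = MSE*T + k*log(T);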

Instead of using a particular in-sample fitting criterion to select the best model, an al-
ternative approach considers performing a pseudo out-of-sample forecasting exercise.
Since we are not interested in fitting the historical data per se — the goal is to use the
model that best fits past data to make (hopefully good) forecasts — arguably a better
criterion for choosing a good forecast model is to select the one that, well, makes good
forecasts. Of course, we do not know ex ante which model does best in forecasting.
The idea of a pseudo out-of-sample forecasting exercise is to simulate the experience of a
real-time forecaster: we use only a portion of the dataset, say, from time t = 1 to t = T0 to
estimate the model parameters and make the forecast ŷT0+h|T0, where we emphasize that the
forecast depends only on data up to time T0. We compare the forecast with the actual
observed value yT0 +h . Then we move forward to date T0 + 1 and repeat this exercise
recursively. Finally, we compute some summary of the forecast errors in this pseudo out-
of-sample forecasting exercise, and pick the model that has the minimal forecast error.
Similar to MAE and MSE, two popular measures of forecast error are the mean absolute
forecast error (MAFE):

MAFE = (1/(T − h − T0 + 1)) ∑_{t=T0}^{T−h} |yt+h − ŷt+h|t|,

and the mean squared forecast error (MSFE):

MSFE = (1/(T − h − T0 + 1)) ∑_{t=T0}^{T−h} (yt+h − ŷt+h|t)²,

where yt+h is the actual observed value at time t + h and ŷt+h|t is the h-step-ahead forecast
given the information up to time t. In contrast to MAE and MSE, which are measures


of in-sample goodness-of-fit, both MAFE and MSFE are out-of-sample performance
measures. As such, even if we pick the model with the smallest MAFE or MSFE, we
do not face the same over-fitting problem. In the following example, we compare
three models with different trend specifications for forecasting U.S. GDP.
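
Before turning to the example, note that once the recursive forecasts have been stored,
both measures are simple averages. A minimal MATLAB sketch, assuming ytph holds the
realized values yt+h and fcst the corresponding stored forecasts ŷt+h|t for a given model
(both column vectors; fcst is a hypothetical name):

% Out-of-sample accuracy for one model, given realized values ytph and forecasts fcst
MAFE = mean(abs(ytph - fcst));     % mean absolute forecast error
MSFE = mean((ytph - fcst).^2);     % mean squared forecast error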

Example 2.12 (Forecasting U.S. GDP). In Figure 2.4 we plot the U.S. seasonally
adjusted real Gross Domestic Product (GDP) from 1947Q1 to 2009Q4 (in trillions of 2005
U.S. Dollars). There is a clear trend in the U.S. GDP data, and it seems to grow faster
than a linear one. In this example we compare three different trend specifications, namely,
linear, quadratic and cubic trends:

ŷt¹ = a0 + a1 t,
ŷt² = b0 + b1 t + b2 t²,
ŷt³ = c0 + c1 t + c2 t² + c3 t³.


Figure 2.4: U.S. seasonally adjusted real GDP from 1947Q1 to 2009Q4 (in trillions of 2005
U.S. dollars).

Note that the cubic trend specification includes the quadratic one as a special case, which
in turn nests the linear trend. We defer the problem of estimation to Chapter 3; here
we only present the estimation and forecasting results. We compute the MSE, AIC and
BIC for each specification, and the three measures of in-sample fit are reported in
Table 2.2.


Table 2.2: MSE, AIC and BIC under each of the three trend specifications.

                         Linear trend   Quadratic trend   Cubic trend
MSE                          0.543           0.045           0.039
AIC                        140.88           17.39           17.95
BIC                        147.94           27.98           32.07
Number of parameters            2               3               4

Consistent with the graph in Figure 2.4, the linear specification does not fit the data that
well. In fact, by simply adding a quadratic term the MSE decreases by over 90%. However,
although the cubic specification fits the data better than the quadratic one as expected (if
this was not what you expected, see the discussion in Example 2.11), the improvement in
overall fit is marginal (a further 13% reduction in MSE). Does the reduction in MSE justify
the additional complexity of the cubic specification? Both AIC and BIC suggest that it is
not worth the extra flexibility. In fact, both criteria indicate that the best specification among
the three is the quadratic one, when balancing goodness-of-fit against model complexity. It is
also worth noting that BIC penalizes model complexity more heavily than AIC, and the
former favors the quadratic specification more strongly than the latter.


Figure 2.5: Fitted values of U.S. GDP under the linear trend (left panel), the quadratic
trend (middle panel) and the cubic trend (right panel) specifications.

Another way to assess the goodness-of-fit is to look at the fitted values ŷt¹, ŷt² and ŷt³ under
the three specifications. From Figure 2.5 it is obvious that the linear trend does not fit
the observed U.S. GDP that well. Specifically, it consistently underestimates the observed
values between 1960 and the late 1990s, while overestimating them before and after that period.
On the other hand, both the quadratic and cubic trends fit the data well, and the fitted
values under the two specifications are almost indistinguishable.


Table 2.3: MSFE under each of the three trend specifications.

          Linear trend   Quadratic trend   Cubic trend
h = 4         0.746           0.073           0.079
h = 20        1.335           0.188           0.394

According to both AIC and BIC, the quadratic specification is the best among the three.
But how does it perform in the pseudo out-of-sample forecasting exercise? To address
this question, we compute the MSFE for each of the three specifications with h = 4
(one-year-ahead forecast) and h = 20 (five-year-ahead forecast). The recursive forecasting
exercises start from the 40-th period (T0 = 40, 10 years of data), and the results are
reported in Table 2.3. Again, the linear specification is the worst among the three, with an
MSFE more than 10 times larger than that of the quadratic trend for the one-year-
ahead forecast and 7 times larger for the five-year-ahead forecast. And consistent with
the in-sample-fitting result, the quadratic trend does better than the cubic specification
in both the one- and five-year-ahead pseudo out-of-sample forecasting exercises. This
example shows that the model that best fits historical data — the cubic specification has
the minimum MSE — might not necessarily have the best forecasting performance.


Figure 2.6: Out-of-sample forecasts under the quadratic trend and the cubic trend speci-
fications with h = 4 (left panel) and h = 20 (right panel).

Lastly, we report in Figure 2.6 the pseudo out-of-sample forecasts under the quadratic
and the cubic trend specifications. Although the out-of-sample forecasting performance
of the two specifications is similar for the one-year-ahead forecasts, the five-year-ahead
forecasts under the cubic trend specification are much more volatile, resulting in a much larger
MSFE compared to the quadratic trend.


2.6 Further Readings and References

Christoffersen and Diebold (1997) study the optimal prediction problem under general
asymmetric loss functions and characterize the optimal predictor. A recent paper that
compares direct and iterated forecasts is Marcellino, Stock, and Watson (2006).



Chapter 3

Ad Hoc Forecasting Methods

Ad hoc forecasting methods are a collection of seemingly reasonable rules of thumb that
are typically easy to implement and do not require the user to specify a statistical model
for the data generating process. They are sometimes described as a “black box” fore-
casting approach where the statistical properties of the forecasts are often difficult to
analyze. When statistical statements can be made about the forecasts, they typically
involve asymptotic arguments (the law of large numbers, the central limit theorem, etc.).
In this chapter we will discuss several popular ad hoc forecasting methods, including the
Ordinary Least Squares (OLS) and the exponential smoothing methods.

3.1 The Ordinary Least Squares

Let yt be the variable of interest and let xt denote a 1 × k vector of relevant variables
for determining yt . Recall that the variables xt are often called regressors, exogenous
variables, or explanatory variables. Instead of specifying a (conditional) distribution
for yt as in the linear regression model (see Example 2.4), we only specify the (conditional)
expectation E[yt | β]. More specifically, we assume that

ŷt ≡ E[yt | β] = xt β,    (3.1)

for some k × 1 parameter vector β. Of course, the assumption in the linear regression
model, particularly the assumption that the disturbance term εt is N(0, σ 2 )-distributed
(conditioned on xt ), implies (3.1). But assuming (3.1) alone does not commit us to a
particular distributional assumption. As such, (3.1) does not uniquely specify a model
in the sense that even if we assume (3.1), we still don’t know how one can generate the
data. In addition, it is impossible to derive the conditional density f (yT +h | IT , θ) without
further distributional assumptions.

Now, assuming only (3.1), we would like to derive a sensible estimator — a procedure


to obtain a guess on β given the data. What is a good way to obtain an estimator for
β? Our assumption (3.1) says that “on average” yt equals ŷt = xt β. Hence, the error
(yt − xt β) should be small on average. Thus, one seemingly reasonable criterion for finding
an estimator for β is simply to locate the β that minimizes the sum of squared errors:

β̂ = argmin_β ∑_{t=1}^{T} (yt − xt β)².    (3.2)

We call the estimator β̂ thus obtained the Ordinary Least Squares (OLS) estimator.
The next question is, can we solve (3.2) analytically? That is, can we find an explicit
formula for β̂ as a function of xt and yt? Notice that (3.2) in general is a multi-dimensional
minimization problem as the variable β = (β1 , . . . , βk )′ is a k × 1 vector. One way to
proceed is to first differentiate the sum of squared errors with respect to β1 to obtain a
first order condition. Then differentiate the sum again with respect to β2 to obtain another
equation, and so forth. At the end we obtain k simultaneous equations and we solve for the
k unknowns, namely, β = (β1 , . . . , βk )′ . In principle this approach is straightforward, but
the implementation is typically time-consuming and the resulting formula for β is messy.
Instead, we rewrite the problem in matrix notation, and use matrix differentiation to
obtain the system of k simultaneous equations, which can then be solved easily. That is
why we need to digress and discuss matrix differentiation first.

3.1.1 Digression: Matrix Differentiation

Let x be an n × 1 vector and let ψ(x) denote an m × 1 vector-valued function. In other
words, ψ is a mapping that maps

(x1, . . . , xn)′ → (ψ1(x), . . . , ψm(x))′.

The Jacobian matrix of the transformation ψ is the m × n matrix defined as:

Jψ(x) ≡ ∂ψ(x)/∂x ≡ [ ∂ψ1(x)/∂x1  · · ·  ∂ψ1(x)/∂xn ]
                   [      ⋮         ⋱         ⋮     ]
                   [ ∂ψm(x)/∂x1  · · ·  ∂ψm(x)/∂xn ]

That is, Jψ is the matrix of all first-order partial derivatives of the vector-valued function
ψ with respect to the vector x.
Exercise 3.1. Let A be an n × n matrix and u an n × 1 vector. Further let v = Au and
w = u′A. Show that the k-th elements of v and w are respectively

vk = ∑_{j=1}^{n} akj uj,    wk = ∑_{j=1}^{n} ajk uj.


Exercise 3.2. Let A be an n × n matrix, u an n × 1 vector and v = u′Au. Show that

v = ∑_{i=1}^{n} ∑_{j=1}^{n} aij ui uj.

Example 3.1 (Differentiating a linear transformation). Let ψ(x) = Ax for some
m × n constant matrix A. We claim that

∂ψ(x)/∂x = A.    (3.3)

To prove (3.3), we first let aij denote the (i, j)-th element of A, and we write ψ as

ψ(x) = Ax = (∑_{k=1}^{n} a1k xk, . . . , ∑_{k=1}^{n} amk xk)′.

To find the (i, j)-th element of the m × n Jacobian matrix Jψ, we differentiate the
i-th element of ψ with respect to xj:

∂ψi(x)/∂xj = ∂/∂xj ∑_{k=1}^{n} aik xk = aij.

In other words, the (i, j)-th element of Jψ is aij, the (i, j)-th element of A. Therefore,
the claim follows.
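
As a numerical sanity check (a sketch, not part of the original notes), one can approximate
the Jacobian of ψ(x) = Ax by finite differences in MATLAB and compare it with A:

% Finite-difference check that the Jacobian of psi(x) = A*x equals A
rng(1);
A = randn(3,4);  x = randn(4,1);  step = 1e-6;
J = zeros(3,4);
for j = 1:4
    e = zeros(4,1);  e(j) = step;
    J(:,j) = (A*(x+e) - A*x)/step;    % j-th column of the numerical Jacobian
end
disp(max(abs(J(:) - A(:))));          % should be numerically zero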

Exercise 3.3. Let z be a 1 × n vector and let ϕ(z) denote a 1 × m vector-valued function.
The Jacobian matrix of the transformation ϕ is the n × m matrix defined as:

Jϕ(z) ≡ ∂ϕ(z)/∂z ≡ [ ∂ϕ1(z)/∂z1  · · ·  ∂ϕm(z)/∂z1 ]
                   [      ⋮         ⋱         ⋮     ]
                   [ ∂ϕ1(z)/∂zn  · · ·  ∂ϕm(z)/∂zn ]

If A is an n × n constant matrix, show that

∂(zA)/∂z = A.
∂z
Example 3.2 (Differentiating a quadratic form). Let ψ(x) = x′Ax for some n × n
constant matrix A. We claim that

∂ψ(x)/∂x = x′(A + A′).    (3.4)

It follows immediately that if A is symmetric, i.e., A = A′, then

∂ψ(x)/∂x = 2x′A.


First observe that the quadratic form ψ(x) = x′Ax is a scalar, and therefore the Jacobian
Jψ is a 1 × n vector. By Exercise 3.2, we have

ψ(x) = ∑_{i=1}^{n} ∑_{j=1}^{n} aij xi xj.

The k-th element of Jψ is obtained by differentiating ψ(x) with respect to xk:

∂ψ(x)/∂xk = ∂/∂xk ∑_{i=1}^{n} ∑_{j=1}^{n} aij xi xj = ∑_{j=1}^{n} akj xj + ∑_{i=1}^{n} aik xi.

By Exercise 3.1, the first term on the right-hand side is equal to the k-th element of Ax, or
equivalently the k-th element of x′A′, whereas the second term equals the k-th element of
x′A.
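
The claim in (3.4) can likewise be verified numerically; the following sketch uses the same
conventions as the previous check:

% Finite-difference check that the derivative of x'*A*x is x'*(A + A')
rng(2);
n = 4;  A = randn(n);  x = randn(n,1);  step = 1e-6;
g = zeros(1,n);                               % numerical gradient (1 x n row vector)
for j = 1:n
    e = zeros(n,1);  e(j) = step;
    g(j) = ((x+e)'*A*(x+e) - x'*A*x)/step;
end
disp(max(abs(g - x'*(A + A'))));              % should be close to zero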

With this digression on matrix multiplication and differentiation, we now have all the
results needed to derive the formula for the OLS estimator.

3.1.2 OLS Computation

To compute the OLS estimator, we first write the sum of squared errors in matrix notation.
To this end, stack the observations over t into y and X as in Example 2.4. Then (3.2)
can be rewritten as

β̂ = argmin_β (y − Xβ)′(y − Xβ).

Now, differentiate the sum of squared errors with respect to β. By setting the k first-order
derivatives to zero, we obtain the following system of k linear equations:

∂/∂β (y − Xβ)′(y − Xβ) = ∂/∂β (y′y − 2y′Xβ + β′X′Xβ) = −2y′X + 2β′X′X = 0.    (3.5)

We further assume that this system of linear equations is linearly independent, i.e., none
of the equations can be derived algebraically from the others. In particular, we assume
that the matrix X has full column rank — the k columns of X are all linearly independent.
It then follows from linear algebra that β can be solved for uniquely. In fact, the solution
to (3.5) is

β̂ = (X′X)⁻¹X′y.    (3.6)

To check that β̂ is indeed a minimum, we compute the second-order derivative of the
sum of squared errors. Note that the first-order derivative in (3.5) is a 1 × k row vector,
and we differentiate it with respect to the row vector β′. By Exercise 3.3, we have

∂²/(∂β∂β′) (y − Xβ)′(y − Xβ) = ∂/∂β′ (−2y′X + 2β′X′X) = 2X′X.

Since X′X is positive definite, the critical point in (3.6) is in fact the minimum.
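
As a quick check of (3.6) (a sketch with simulated data, separate from the GDP application
that follows), the formula can be compared with MATLAB’s built-in least-squares solve:

% Verify the OLS formula (3.6) on simulated data
rng(3);
T = 200;  k = 3;
X = [ones(T,1) randn(T,k-1)];          % regressors including an intercept
beta_true = [1; 0.5; -2];
y = X*beta_true + randn(T,1);
beta_formula = (X'*X)\(X'*y);          % OLS via (3.6)
beta_backslash = X\y;                  % MATLAB's least-squares solution
disp([beta_formula beta_backslash]);   % the two columns should coincide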


Exercise 3.4. Let PX = X(X′X)⁻¹X′ and let β̂ denote the OLS estimator. Show that
PX is idempotent, i.e., PX PX = PX, and

(y − Xβ̂)′(y − Xβ̂) = y′(In − PX)y.

We now return to Example 2.12 where we compared three different trend specifications
for forecasting U.S. GDP. There we did not provide the computational details of the OLS
estimates and the various in-sample fitting and out-of-sample forecasting measures.
Example 3.3 (Forecasting U.S. GDP: computation details). In Example 2.12 we
computed the MSE, AIC and BIC for each of the three trend specifications, namely, the
linear, quadratic and cubic trends. We first consider the problem of computing the OLS
estimate using the full sample; later we’ll discuss the details of the recursive forecasting
exercise. Now, by defining the matrix X appropriately, the OLS estimates for each of
the specifications can be easily obtained via (3.6). For instance, in the quadratic trend
specification, we have an intercept, a time index t and its square, i.e., xt = (1, t, t²) and
hence,

    [ 1   1    1  ]
    [ 1   2    4  ]
X = [ 1   3    9  ] .
    [ ⋮   ⋮    ⋮  ]
    [ 1   T   T²  ]

Given the OLS estimate β̂, it follows from our assumption (3.1) that the fitted value is
simply ŷt = xt β̂. The following MATLAB codes compute the in-sample fitting measures
for the quadratic trend specification.

load USGDP.csv;
y = USGDP;
T = size(y,1);
t = (1:T)'; % construct a column vector (1, 2, 3, ... T)'

%% quadratic trend model
X = [ones(T,1) t t.^2];     % regressors: intercept, t and t^2
beta = (X'*X)\(X'*y);       % OLS estimate via (3.6)
yhat = X*beta;              % fitted values
MSE = mean((y-yhat).^2);
AIC = T*MSE + 3*2;          % k = 3 parameters
BIC = T*MSE + 3*log(T);

To compute the MSFE in the recursive forecasting exercise, we first need to obtain ŷt+h|t,
the h-step-ahead forecast given the data up to time t for each specification. But in our
case it is ŷt+h|t = xt+h β̂, where the OLS estimate β̂ is computed using only data up to
time t. The following MATLAB codes compute the MSFEs for each of the three trend
specifications.


T0 = 40;
h = 4; % h-step-ahead forecast
syhat = zeros(T-h-T0+1,3);
ytph = y(T0+h:end); % observed y_{t+h}
for t = T0:T-h
    yt = y(1:t);
    s = (1:t)';
    nt = length(s); % sample size

    %% linear trend
    Xt = [ones(nt,1) s];
    beta1 = (Xt'*Xt)\(Xt'*yt);
    yhat1 = [1 t+h]*beta1;

    %% quadratic trend
    Xt = [ones(nt,1) s s.^2];
    beta2 = (Xt'*Xt)\(Xt'*yt);
    yhat2 = [1 t+h (t+h)^2]*beta2;

    %% cubic trend
    Xt = [ones(nt,1) s s.^2 s.^3];
    beta3 = (Xt'*Xt)\(Xt'*yt);
    yhat3 = [1 t+h (t+h)^2 (t+h)^3]*beta3;

    %% store the forecasts
    syhat(t-T0+1,:) = [yhat1 yhat2 yhat3];
end
MSFE1 = mean((ytph-syhat(:,1)).^2);
MSFE2 = mean((ytph-syhat(:,2)).^2);
MSFE3 = mean((ytph-syhat(:,3)).^2);

3.1.3 Properties of the OLS Estimator

In this section we briefly discuss various properties of the OLS estimator for β in the linear
regression setting:
yt = xt β + εt , (3.7)
where the error terms εt ’s are independent and identically distributed (iid). A more
thorough treatment of the statistical properties of the OLS estimator can be found in any
standard econometrics textbook. Recall that we assume the conditional expectation of


the error term εt given β and the regressors xt to be zero. That is,
E[εt | xt , β] = 0. (3.8)
In addition, we further assume that
Var[εt | xt , β] = σ 2 . (3.9)
For later reference, we stack the linear regression in (3.7) over t = 1, . . . , T and write
y = Xβ + ε.
We know that, by definition, the OLS estimator

β̂ = (X′X)⁻¹X′y

minimizes the sum of squared errors of the regression. What else can we say about the
estimator β̂? We’ll first show that the OLS estimator is unbiased, i.e.,

E[β̂ | X, β] = β.
In words, it means that the OLS estimator “on average” equals the unknown parameter
vector β, and for obvious reasons this is a desirable property. To show its unbiasedness, first
note that by (3.8) and the fact that the expectation operator is linear, we have

E[y | X, β] = Xβ.

Hence,

E[β̂ | X, β] = E[(X′X)⁻¹X′y | X, β] = (X′X)⁻¹X′E[y | X, β] = (X′X)⁻¹X′Xβ = β.
With assumption (3.9) (on the variance of εt), we can also derive the covariance matrix
for β̂. Specifically, (3.9) implies that the covariance matrix of ε is Σε = σ²IT. Therefore,
the covariance matrix of β̂, denoted as Σβ̂, is given by

Σβ̂ = (X′X)⁻¹X′ Σε ((X′X)⁻¹X′)′
   = (X′X)⁻¹X′ σ²IT X(X′X)⁻¹
   = σ²(X′X)⁻¹X′X(X′X)⁻¹
   = σ²(X′X)⁻¹.

So far we have shown that the OLS estimator β̂ is unbiased and its covariance matrix is
Σβ̂ = σ²(X′X)⁻¹. It turns out that β̂ is also the optimal linear unbiased estimator for
β in the sense that it has the minimum variance among all linear unbiased estimators.
Indeed, this fact has a name, and it’s called the Gauss-Markov Theorem.
Theorem 3.1 (Gauss-Markov). In the linear regression (3.7) with assumptions (3.8)
and (3.9), the OLS estimator β̂ has the minimum variance in the class of linear unbiased
estimators for the parameter vector β.

Again, more details about the proof of the Gauss-Markov Theorem can be found in any
standard econometrics textbook.
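
Both properties derived above can be illustrated by simulation. The following sketch (not
part of the original notes) regenerates the errors many times for a fixed design matrix and
compares the simulated mean and covariance of β̂ with β and σ²(X′X)⁻¹:

% Monte Carlo illustration of unbiasedness and of the covariance formula
rng(4);
T = 100;  R = 5000;  sigma = 2;
X = [ones(T,1) (1:T)'];                % fixed design: intercept and time trend
beta = [1; 0.1];
bhat = zeros(R,2);
for r = 1:R
    y = X*beta + sigma*randn(T,1);     % new errors, same X and beta
    bhat(r,:) = ((X'*X)\(X'*y))';
end
disp([mean(bhat)' beta]);              % simulated mean vs true beta
disp(cov(bhat));                       % compare with sigma^2*inv(X'*X) below
disp(sigma^2*inv(X'*X));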


3.2 Application: Modeling Seasonality

In Examples 2.12 and 3.3 we discussed modeling data that exhibit trending behavior. We
now turn our attention to study seasonality. A seasonal pattern is a regular within-a-year
pattern that repeats itself every year. Seasonality arises from, among others, preference,
technology and social institutions. The seasonal weather pattern is perhaps the most obvious
example. Any consumption or production that involves the weather, such as energy (heat-
ing), agricultural products, and cough syrups, also tends to be seasonal. Another major
factor of seasonal variation is the calendar. For example, retail sales, at least in Western
countries, tend to peak in the Christmas season, whereas Chinese consumers usually do their
holiday shopping just before the Chinese New Year.

One of the easiest ways to model seasonal variation is to introduce seasonal dummies.
For example, suppose yt is observed quarterly and does not show trending behavior. One
might specify yt as fluctuating around the level µ:

yt = µ + εt , (3.10)

where the expected value of the error term εt is zero but otherwise unrestricted. Note
that under the specification (3.10), the conditional expectation of yt is µ, no matter which
quarter we are in. Put differently, the simple specification in (3.10) does not allow for
seasonal variation. Now, let’s say a closer look at the historical data reveals that yt
exhibits seasonal patterns. Particularly, the peaks seem to occur regularly at the fourth
quarter (Christmas season). Thus, other things equal, we expect a larger yt if it is the
fourth quarter. To allow for this seasonal variation, we consider

yt = µ + αDt + εt ,

where Dt is a dummy variable that equals 1 if it is the fourth quarter, and 0 otherwise.
What the dummy variable does is to allow the conditional expectations of yt to differ
between the fourth quarter and other quarters. More explicitly,

E[yt | µ, α] = µ        if Dt = 0,
E[yt | µ, α] = µ + α    if Dt = 1.

By the same logic, one can include more dummy variables to allow for more complex sea-
sonal pattern. However, caution must be taken not to include too many dummy variables
due to the problem of perfect multicollinearity, which occurs when there is an exact
linear relation among the regressors. Perfect multicollinearity is a problem because if we
have an exact linear relation among the regressors — i.e., one variable can be written as
a linear combination of the others — then the matrix X will not have full column rank.
Consequently, X′ X is not invertible and the OLS estimate can’t be computed.

Returning to the previous example, let Dit be a dummy variable for the i-th quarter,
i.e., Dit = 1 if time t is the i-th quarter and Dit = 0 otherwise. Consider the following


specification where we include all four quarter dummy variables:

yt = µ + α1 D1t + α2 D2t + α3 D3t + α4 D4t + εt.

Then the regressors are xt = (1, D1t , D2t , D3t , D4t ), and we have (why?)

D1t + D2t + D3t + D4t = 1,

an exact linear relation among the regressors. One easy fix to this problem of perfect
multicollinearity is to remove the intercept, or one of the dummy variables. For instance,
if we consider

yt = α1 D1t + α2 D2t + α3 D3t + α4 D4t + εt,

then the regressors xt = (D1t, D2t, D3t, D4t) are linearly independent. In addition, we have
E[yt | θ] = αi if Dit = 1, where θ = (α1, α2, α3, α4)′.
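
The rank problem can be seen directly in MATLAB. The following sketch (with an
illustrative quarter indicator Q that cycles through 1 to 4) shows that including the
intercept together with all four dummies produces a rank-deficient X, while dropping the
intercept restores full column rank:

% Perfect multicollinearity: intercept plus all four quarterly dummies
T = 40;
Q = repmat((1:4)', T/4, 1);                    % illustrative quarter indicator
D1 = double(Q==1); D2 = double(Q==2); D3 = double(Q==3); D4 = double(Q==4);
Xfull  = [ones(T,1) D1 D2 D3 D4];              % intercept + all four dummies
Xnoint = [D1 D2 D3 D4];                        % intercept dropped
disp([rank(Xfull) rank(Xnoint)]);              % Xfull has 5 columns but rank 4; Xnoint has rank 4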

Example 3.4 (Forecasting U.S. retail e-commerce sales). We plot the quarterly
U.S. retail e-commerce sales from 1999Q4 to 2011Q1 (not seasonally adjusted; in billions
U.S. dollars) in Figure 3.1. Two prominent features of the series are trend and seasonality.
Despite the downturn of retail sales around the global financial crisis in late 2008, the time
series overall exhibits a linear upward trend. In addition, there is a clear seasonal pattern
— the sales figures jump up in the fourth quarter, presumably due to Cyber Monday and
Christmas sales.


Figure 3.1: U.S. retail e-commerce sales from 1999Q4 to 2011Q1 (not seasonally adjusted;
in billions U.S. dollars).

In this example we focus on modeling seasonality and thus assume a linear trend for all
specifications; it is straightforward to consider quadratic or other deterministic trends.


Let Dit be a dummy variable for the i-th quarter, i.e., Dit = 1 if time t is the i-th quarter
and Dit = 0 otherwise. Now, consider the following three different specifications:
S1:  ŷt¹ = a0 + a1 t + α4 D4t,
S2:  ŷt² = a0 + a1 t + α1 D1t + α4 D4t,
S3:  ŷt³ = a1 t + α1 D1t + α2 D2t + α3 D3t + α4 D4t.

In the first specification, we include only a dummy variable for the fourth quarter. Implic-
itly, we assume that the sales in the first three quarters are the same (after controlling for
the linear trend), while allowing only the sales in the fourth quarter to be different. In the
second specification, we include dummy variables for both the first and fourth quarters.
As such, we further allow the sales in the first quarter to be different. In the last speci-
fication, the sales in each quarter can potentially be different from the rest. The dataset
USecomm.csv has two columns, where the first column contains the retail e-commerce sales
values, and the second is a quarter indicator that takes value from 1 to 4. The follow-
ing MATLAB codes compute the in-sampling fitting measures for the third seasonality
specification.

load 'USecomm.csv';
y = USecomm(:,1); Q = USecomm(:,2);
T = length(y); t = (1:T)';
%% construct 4 dummy variables
D1 = (Q == 1); D2 = (Q == 2); D3 = (Q == 3); D4 = (Q == 4);
%% 3rd spec: linear trend + all dummies
X = [t D1 D2 D3 D4]; % drop the intercept to avoid multicollinearity
beta = (X'*X)\(X'*y);
yhat = X*beta;
MSE = mean((y-yhat).^2);
AIC = T*MSE + 5*2;      % k = 5 parameters
BIC = T*MSE + 5*log(T);

Table 3.1: MSE, AIC and BIC under each of the three seasonality specifications.

                           S1        S2        S3
MSE                      5.408     5.352     5.346
AIC                     254.79    254.20    255.90
BIC                     260.28    261.51    265.05
Number of parameters         3         4         5

We report in Table 3.1 the MSE, AIC and BIC under each of the three specifications. Once
again, since the three specifications are nested, the MSEs are decreasing in the complexity


of the specification. However, allowing for more complex seasonal variation by including
additional dummy variables beyond the one for the fourth quarter improves the overall
goodness-of-fit only a little. In fact, the AIC criterion (marginally) prefers the second specification ŷt²,
whereas the BIC criterion favors the first. We plot the fitted lines under the first (S1 )
and second (S2 ) specifications in Figure 3.2, along with the actual values of the retail e-
commerce sales. Note that both fitted lines look identical. It is also interesting to observe
that the seasonal pattern is more pronounced after about 2006 — before 2006 the fitted
lines always overestimate the increase in the sales in the fourth quarter, whereas after 2006
the increases are often underestimated. To fit the data better, it seems that we would
need specifications that allow for time-varying parameters.

2000 2002 2004 2006 2008 2010

Figure 3.2: Fitted values under specifications S1 and S2, along with the actual U.S. retail
e-commerce sales from 1999Q4 to 2011Q1 (not seasonally adjusted; in billions U.S. dollars).

We also perform a pseudo out-of-sample forecasting exercise to determine which speci-
fication has the lowest forecasting errors. Specifically, we compute the MSFE for each of
the three specifications with h = 1 and h = 2. The recursive forecasting exercises start
from the 20-th period (T0 = 20, 5 years of data). The results of the recursive forecasting
exercise are reported in Table 3.2. It turns out that the simplest specification — the one
with a linear trend and a dummy on the fourth quarter only — has the smallest MSFE,
though the difference is small. In fact, the out-of-sample forecasts in Figure 3.3 show that
the first two specifications give almost identical forecasts for both forecast horizons.


Table 3.2: MSFE under each of the three seasonality specifications.

            S1       S2       S3
h = 1     11.04    11.14    11.30
h = 2     12.47    12.49    12.53


Figure 3.3: Out-of-sample forecasts under specifications S1 and S2 with h = 1 (left panel)
and h = 2 (right panel).

The following MATLAB codes are used:

T0 = 20;
h = 2; % h-step-ahead forecast
syhat = zeros(T-h-T0+1,3);
ytph = y(T0+h:end); % observed y_{t+h}
for t = T0:T-h
    yt = y(1:t);
    s = (1:t)';
    nt = length(s); % sample size
    D1t = D1(1:t); D2t = D2(1:t); D3t = D3(1:t); D4t = D4(1:t);
    %% S1
    Xt = [ones(nt,1) s D4t];
    beta1 = (Xt'*Xt)\(Xt'*yt);
    yhat1 = [1 t+h D4(t+h)]*beta1;
    %% S2
    Xt = [ones(nt,1) s D1t D4t];
    beta2 = (Xt'*Xt)\(Xt'*yt);
    yhat2 = [1 t+h D1(t+h) D4(t+h)]*beta2;


    %% S3
    Xt = [s D1t D2t D3t D4t];
    beta3 = (Xt'*Xt)\(Xt'*yt);
    yhat3 = [t+h D1(t+h) D2(t+h) D3(t+h) D4(t+h)]*beta3;
    %% store the forecasts
    syhat(t-T0+1,:) = [yhat1 yhat2 yhat3];
end
MSFE1 = mean((ytph-syhat(:,1)).^2);
MSFE2 = mean((ytph-syhat(:,2)).^2);
MSFE3 = mean((ytph-syhat(:,3)).^2);

It is also worth noting that the absolute forecast performance of all the specifications is
quite bad — the root mean squared forecast error (square root of the MSFE) is about 3.3 billion
U.S. dollars, about 10% of the average retail e-commerce sales in the sample. A closer look at
Figure 3.3 also reveals that the forecasts consistently underestimate the actual values in
the first half of the sample, and overestimate the sales just after the global financial crisis.
The latter point might be expected as all three specifications considered can only allow
for a trend and seasonal variation, but not cyclical behavior (business cycle).

3.3 Exponential Smoothing

The basic idea of smoothing methods is to construct forecasts as weighted averages of past
observations, where more weight is given to recent observations than to distant ones. By
computing weighted averages as forecasts, we are “smoothing” the observed time series.
The “exponential” part of the name comes from the fact that the weights in these methods
not only decrease with time, but do so exponentially.

3.3.1 Simple Additive Smoothing

The basic framework of exponential smoothing is simple. Suppose a series µ1 , µ2 , . . .


follows a random walk, i.e.,
µt = µt−1 + ηt , (3.11)
where η1 , . . . , ηT are iid N(0, ση2 ). As we discussed previously in Example 2.1, the level
µt wanders randomly up and down, and in Example 2.5 we showed that the optimal
h-step-ahead point forecast (under quadratic loss) is the conditional expectation
E(µT +h | µ1 , . . . , µT ) = µT .
Put differently, the best forecast of any future value is simply the current value. Let’s
suppose, however, that we don’t observe µt ; instead we observe yt :
yt = µt + εt , (3.12)


where εt ’s are iid N(0, σε2 ), and are independent of ηt at all leads and lags, i.e., εt and ηs
are independent for every s and t. If we observed µ1 , . . . , µT , we could compute

E(yT +h | µ1 , . . . , µT ) = E(µT +h | µ1 , . . . , µT ) = µT

as our h-step-ahead point forecast for yT +h . But of course we do not observe the current
level µT , and therefore we could not use it as our point forecast. Actually, the local level
model (3.11)–(3.12) is an example of a state-space model, and estimation of such models
is quite involved. We will discuss likelihood-based estimation in Chapter 6; for now we
present an easy smoothing method to obtain reasonable forecasts.

Let’s say we start with a reasonable forecast for yt given historical information available
at time t − 1. We call this forecast “the level” Lt−1 , i.e.,

ŷt|t−1 = Lt−1.

At time t, yt is observed, and we update the level by adding a fraction 0 ≤ α ≤ 1 of the


forecast error to it:
Lt = Lt−1 + α(yt − Lt−1 ). (3.13)

The reason for doing this is that we want to adjust the current level Lt to correct for
the forecast error (yt − Lt−1 ) — if we underestimate yt , i.e., yt > Lt−1 , we make the
next forecast Lt larger; if we overestimate yt , we make Lt smaller — and the size of the
adjustment depends on α. By simple algebra, we can write (3.13) equivalently as

Lt = αyt + (1 − α)Lt−1 .

Hence, the new level Lt can be viewed as a weighted average of the old level Lt−1 and the
new observation yt . The parameter α is called the smoothing parameter. If α is large,
we give more weight to the new information. In the extreme case of α = 1, we do not
learn from history at all and just use the most recent observation yt as our forecast. If α
is small, we give more weight to the historical information, and in the extreme case where
α = 0, we do not learn from new information at all, and our forecast never changes. The
updating formula for this simple exponential smoothing method is quite straightforward.
The only thing we need is a starting point or initial value L1 . One popular choice is simply
L1 = y1 , or one could use the average of the first few observations.

Algorithm 3.2 (Simple Additive Smoothing). 1. Initialize with L1 = y1 .

2. Then for t = 2, . . . , T update the level Lt via:

Lt = αyt + (1 − α)Lt−1 .

The “smoothing” part of the method is now obvious. Let’s discuss the “exponential” part.
To this end, suppose 0 < α < 1 and let’s write the first few levels in terms of observations


y1 , y2 , . . . as follows:
L1 = y1,
L2 = αy2 + (1 − α)L1 = αy2 + (1 − α)y1,
L3 = αy3 + (1 − α)L2 = αy3 + α(1 − α)y2 + (1 − α)²y1.

In fact, one can show by mathematical induction that

Lt = αyt + α(1 − α)yt−1 + α(1 − α)²yt−2 + · · · + (1 − α)^{t−1} y1.    (3.14)
In other words, the simple exponential smoothing in Algorithm 3.2 takes a weighted aver-
age of all previous observations to forecast the next observation, where more weights are
given to more recent observations. In addition, the weights decline exponentially as we
go further back in history. In fact, the larger the α, the smaller the weights of distant
observations.
Exercise 3.5. Prove that

Lt = αyt + α(1 − α)yt−1 + α(1 − α)²yt−2 + · · · + (1 − α)^{t−1} y1.

To make a one-step-ahead forecast for yT +1 , we can simply use LT . How about forecasting
yT +2 ? To forecast yT +2 with the exponential smoothing method, we could replace yT +1
with the forecast ŷT+1|T in the updating formula:

ŷT+2|T = α ŷT+1|T + (1 − α)LT = αLT + (1 − α)LT = LT.

More generally, the h-step-ahead forecast under simple exponential smoothing is LT,
i.e., ŷT+h|T = LT.
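
The notes give MATLAB codes for the Holt-Winters variants below; for completeness, here
is an analogous sketch of Algorithm 3.2 (assuming a T × 1 data vector y is in memory and
a smoothing parameter alpha has been chosen), whose h-step-ahead forecast is simply the
final level LT:

% Simple additive (exponential) smoothing, Algorithm 3.2
alpha = 0.5;                          % smoothing parameter in [0,1]
T = length(y);
L = zeros(T,1);
L(1) = y(1);                          % initialize with the first observation
for t = 2:T
    L(t) = alpha*y(t) + (1-alpha)*L(t-1);
end
yforecast = L(T);                     % h-step-ahead forecast for any h >= 1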

3.3.2 Holt-Winters Smoothing

The local level model (3.11)–(3.12) only wanders randomly up and down, and does not
have a deterministic trend. Consequently, the simple additive smoothing in Algorithm 3.2
is suitable for data that do not seem to have a deterministic trend. Now imagine that we
have not only a slowly evolving local level, but also a trend with a slowly evolving local
slope:
yt = µt + λt t + εt , εt ∼ N(0, σε2 )
µt = µt−1 + ηt , ηt ∼ N(0, ση2 )
λt = λt−1 + νt , νt ∼ N(0, σν2 ),
where the error terms are independent of each other at all leads and lags. To forecast yt ,
the Holt-Winters smoothing method suggests a two-component approach. Specifically,
the forecast ŷt|t−1 consists of two components, a “level” Lt−1 as before, and a “slope” bt−1:

ŷt|t−1 = Lt−1 + bt−1.


The level and the slope are updated via

Lt = αyt + (1 − α)(Lt−1 + bt−1),
bt = β(Lt − Lt−1) + (1 − β)bt−1.

The updating formula for Lt is the same as before, except that now the forecast is ŷt|t−1 =
Lt−1 + bt−1. The updating formula for bt is again a weighted average of new and old
information: the current change in level (Lt − Lt−1) and the previous slope bt−1. Lastly, to initialize
the smoother we set L1 = y1 and b1 = y2 − y1.

Algorithm 3.3 (Holt-Winters Smoothing). 1. Initialize with L1 = y1 and b1 = y2 − y1.

2. Then for t = 2, . . . , T update the level Lt and slope bt via

Lt = αyt + (1 − α)(Lt−1 + bt−1),
bt = β(Lt − Lt−1) + (1 − β)bt−1.

Given the values of LT and bT, the one-step-ahead forecast ŷT+1|T is given by

ŷT+1|T = LT + bT.

To produce a two-step-ahead forecast, we simply use the forecast ŷT+1|T instead of yT+1
to update the level and the slope:

LT+1|T = α(LT + bT) + (1 − α)(LT + bT) = LT + bT,
bT+1|T = β(LT+1|T − LT) + (1 − β)bT = bT.

Hence, the two-step-ahead forecast ŷT+2|T is given by

ŷT+2|T = LT+1|T + bT+1|T = LT + 2bT.

Generally, the h-step-ahead forecast is given by

ŷT+h|T = LT + hbT,

for h > 0.

Example 3.5 (Forecasting U.S. GDP with Holt-Winters Smoothing). In Exam-


ple 2.12 we considered three trend specifications — namely, linear, quadratic and cubic
trends — for forecasting U.S. GDP, and we found that the quadratic trend specification
performed best in terms of both in-sample fitting and out-of-sample forecasting mea-
sures. In this example we use the Holt-Winters smoothing method to produce forecasts
over the horizons h = 4 and h = 20, with two sets of smoothing parameters: α = β = 0.8
and α = β = 0.5. The following MATLAB codes are used to compute the MSFEs:


T0 = 40;
h = 4; % h-step-ahead forecast
syhat = zeros(T-h-T0+1,1);
ytph = y(T0+h:end); % observed y_{t+h}
alpha = .5; beta = .5; % set values for smoothing parameters
Lt = y(1); bt = y(2) - y(1);
for t = 2:T-h
    newLt = alpha*y(t) + (1-alpha)*(Lt+bt);
    newbt = beta*(newLt-Lt) + (1-beta)*bt;
    yhat = newLt + h*newbt;
    Lt = newLt; bt = newbt; % update Lt and bt
    if t >= T0 % store the forecasts for t >= T0
        syhat(t-T0+1,:) = yhat;
    end
end
MSFE = mean((ytph-syhat).^2);

Table 3.3: MSFEs under the Holt-Winters smoothing methods.

          quadratic trend   α = β = 0.8   α = β = 0.5
h = 4          0.073            0.038         0.038
h = 20         0.188            0.828         0.647

We compute the MSFE for each set of smoothing parameters with h = 4 (one-year-
ahead forecast) and h = 20 (five-year-ahead forecast). In order to compare with the results
in Example 2.12, we also start the recursive forecasting exercises from the 40-th period
(T0 = 40). We report the MSFEs for the Holt-Winters smoothing method in Table 3.3,
along with the MSFEs under the quadratic trend specification as a comparison. It is
interesting to note that for short-horizon forecasts (h = 4), the performance of the Holt-
Winters smoothing method is much better, with only about half the forecast error of
the quadratic trend specification. However, its performance is much worse for long-horizon
forecasts (h = 20).

3.3.3 Holt-Winters Smoothing with Seasonality

To accommodate data with seasonality, we simply add an extra component to the point
forecast ŷt|t−1:

ŷt|t−1 = Lt−1 + bt−1 + St−s,


where s is the periodicity of seasonality, i.e., s = 4 for quarterly data and s = 12 for
monthly data. The level, slope and seasonality are then updated as follows:

Lt = α(yt − St−s) + (1 − α)(Lt−1 + bt−1),
bt = β(Lt − Lt−1) + (1 − β)bt−1,
St = γ(yt − Lt) + (1 − γ)St−s.

Finally, the three components are initialized at t = s:

Ls = (1/s)(y1 + · · · + ys),
bs = 0,
Si = yi/Ls    for i = 1, . . . , s.

Algorithm 3.4 (Holt-Winters Smoothing with Seasonality). 1. Initialize at t = s:

Ls = (1/s)(y1 + · · · + ys),
bs = 0,
Si = yi/Ls    for i = 1, . . . , s.

2. Then for t = s + 1, . . . , T update the level Lt , slope bt and seasonality St via

Lt = α(yt − St−s) + (1 − α)(Lt−1 + bt−1),
bt = β(Lt − Lt−1) + (1 − β)bt−1,
St = γ(yt − Lt) + (1 − γ)St−s.

Then the one-step-ahead forecast is given by

ŷT+1|T = LT + bT + ST+1−s.

More generally, the h-step-ahead forecast is

ŷT+h|T = LT + hbT + ST+h−s.

Example 3.6 (Forecasting U.S. retail e-commerce sales with Holt-Winters smooth-
ing). In Example 3.4 we compared the forecast performance of three seasonality speci-
fications for forecasting U.S. retail e-commerce sales. We found that although the three
specifications gave very similar forecasts, their absolute forecast performance was quite
bad. In this example we revisit the pseudo out-of-sample forecasting exercise, and see if
the Holt-Winters smoothing method with seasonality gives better forecasts.


Table 3.4: MSFEs under the Holt-Winters smoothing method.

            S1     α = β = γ = 0.5   α = β = γ = 0.8
h = 1     11.04          5.55             10.38
h = 2     12.47          8.64             21.53

We consider two sets of smoothing parameters, namely, α = β = γ = 0.5 and α = β =


γ = 0.8. Obviously, different combinations of α, β and γ can be entertained. The pseudo
out-of-sample forecasting results are reported in Table 3.4 with h = 1 and h = 2. We
also include the results for the seasonality specification S1 in Example 3.4 for comparison.
Once again, for short-horizon forecasts (h = 1), the Holt-Winters smoothing method with both
sets of smoothing parameters outperforms S1. But for the longer horizon (h = 2), the forecasts
with larger smoothing parameters seem to be quite unstable, leading to large forecast
errors.


Figure 3.4: Holt-Winters smoothing method forecasts with h = 1 (left panel) and h = 2
(right panel).

In Figure 3.4 we also plot the out-of-sample forecasts under the Holt-Winters smoothing
method. Comparing with the forecasts in Figure 3.3, the Holt-Winters smoothing method
with α = β = γ = 0.5 seems to better trace the actual sales values than the seasonality
specifications.

The following MATLAB codes implement Algorithm 3.4 to compute the forecasts:

T0 = 20;
h = 1; % h-step-ahead forecast
s = 4;
syhat = zeros(T-h-T0+1,1);


ytph = y(T0+h:end); % observed y_{t+h}
alpha = .5; beta = .5; gamma = .5; % set values for smoothing parameters
St = zeros(T-h,1);
Lt = mean(y(1:s)); bt = 0; St(1:4) = y(1:s)/Lt; % initialize
for t = s+1:T-h
    newLt = alpha*(y(t) - St(t-s)) + (1-alpha)*(Lt+bt);
    newbt = beta*(newLt-Lt) + (1-beta)*bt;
    St(t) = gamma*(y(t)-newLt) + (1-gamma)*St(t-s);
    yhat = newLt + h*newbt + St(t+h-s);
    Lt = newLt; bt = newbt; % update Lt and bt
    if t >= T0 % store the forecasts for t >= T0
        syhat(t-T0+1,:) = yhat;
    end
end
MSFE = mean((ytph-syhat).^2);

3.4 Further Readings and References



Bibliography

P. F. Christoffersen and F. X. Diebold. Optimal prediction under asymmetric loss. Econometric
Theory, 13:808–817, 1997.

M. Marcellino, J. H. Stock, and M. W. Watson. A comparison of direct and iterated multistep
AR methods for forecasting macroeconomic time series. Journal of Econometrics,
135:499–526, 2006.

J. H. Stock and M. W. Watson. Why has U.S. inflation become harder to forecast? Journal
of Money, Credit and Banking, 39:3–33, 2007.



