Automatic Forecasting SnapStat

STATGRAPHICS – Rev. 7/3/2009

Automatic Forecasting SnapStat

Summary
The Automatic Forecasting SnapStat creates a one-page summary of forecasts generated for a
time series. Like the Automatic Forecasting procedure, this SnapStat tries a collection of
forecasting models and selects the one that gives the best fit according to a specified criterion.
Unlike that procedure, however, the SnapStat output is preformatted to fit on a single page.

Sample StatFolio: autocastsnapstat.sgp

Sample Data:
The file baseball.sgd contains the leading batting average in U.S. Major League Baseball for
each year between 1901 and 2004. A batting average is the proportion of times that a
player gets a hit out of all at-bats that result in either a hit or an out. The table below shows a
partial list of the data from that file. The batting averages are expressed in points
out of 1000, so a player batting 333 got a hit one-third of the time.

Year    Leading average
1901    422
1902    376
1903    355
1904    381
1905    377
1906    358
1907    350
1908    354
1909    377
1910    385
…       …
2004    372

Forecasts are desired for the next several years.

© 2009 by StatPoint Technologies, Inc. Automatic Forecasting SnapStat - 1


Data Input
The data input dialog box requests the name of the column containing the time series data and
information about how it was sampled:

• Data: numeric column containing n equally spaced numeric observations.

• Time indices: time, date or other index associated with each observation. Each value in this
column must be unique and arranged in ascending order.

• Sampling Interval: If time indices are not provided, this defines the interval between
successive observations. For example, the baseball data were collected once every year,
beginning in 1901.

• Seasonality: the length of seasonality s, if any. The data is seasonal if there is a pattern that
repeats at a fixed period. For example, monthly data typically have a seasonality of s = 12.
Hourly data that repeat every day have a seasonality of s = 24. If no entry is made, the data is
assumed to be nonseasonal (s = 1).

• Trading Days Adjustment: a numeric variable with n observations used to normalize the
original observations, such as the number of working days in a month. The observations in
the Data column will be divided by these values before being plotted or analyzed. There
must be enough entries in this column to cover both the observed data and the number of
periods for which forecasts are requested.

• Select: subset selection.

• Number of Forecasts: number of periods following the end of the data for which forecasts
are desired.
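
The trading days adjustment described above can be illustrated with a minimal Python sketch. The function name, the sample values, and the error handling are hypothetical and are not part of STATGRAPHICS; the sketch only shows the division step and the requirement that the adjustment column cover the forecast periods as well.

```python
def adjust_for_trading_days(data, trading_days, n_forecasts=0):
    """Divide each observation by its trading-day count.

    trading_days must cover the observed periods plus the forecast horizon.
    """
    needed = len(data) + n_forecasts
    if len(trading_days) < needed:
        raise ValueError(
            f"need {needed} trading-day entries, got {len(trading_days)}"
        )
    # zip stops at the end of the observed data; the extra entries are
    # reserved for the forecast periods.
    return [y / d for y, d in zip(data, trading_days)]


# Hypothetical monthly totals and working-day counts (illustrative only).
sales = [2100.0, 1900.0, 2310.0]
days = [21, 19, 22, 20, 21]  # 3 observed months + 2 forecast months
per_day = adjust_for_trading_days(sales, days, n_forecasts=2)
print(per_day)
```

The adjusted series is what would then be plotted and modeled.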


Output
The output from the SnapStat consists of a single page of graphs and numerical statistics.

SnapStat: Automatic Forecasting

Data variable: Leading average
Model: ARIMA(0,1,1)

RMSE=17.7  MAE=13.96  MAPE=3.81%  ME=-1.077  MPE=-0.48%

Period   Forecast   Lower 95% Limit   Upper 95% Limit
2005     366.743    330.772           402.715
2006     365.715    327.642           403.788
2007     365.591    326.667           404.515
2008     365.58     325.925           405.235
2009     365.579    325.216           405.943
2010     365.579    324.519           406.64

[Figure: five panels. Time Series Plot (Leading average: actual, forecast, 95.0% limits); Residual
Autocorrelations (autocorrelations vs. lag); Residual Plot; Residual Periodogram (ordinate vs.
frequency); Normal Probability Plot (percentage vs. residual).]

Model Statistics and Forecasts (top left)
The top left section of the output summarizes the selected forecasting model, which in this case
is an ARIMA(0,1,1) model. Included are:

• Summary Statistics: table of summary statistics calculated from the one-period ahead
forecast errors (error made in forecasting the value at time t given all data through time t-1).
The statistics include the root mean squared error (RMSE), the mean absolute percentage
error (MAPE), and the mean absolute error (MAE), all of which measure the variability of
the one-period ahead forecast errors. Small values are preferred. The mean error (ME) and
mean percentage error (MPE) measure bias and should be close to zero.

• Forecasts: table of forecasted values and probability limits. The forecasts are made given all
available data. The probability limits are calculated at the level specified on the Forecasting
tab of the Preferences dialog box, accessible via the Edit menu.
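
As an illustration of how the five summary statistics relate to the one-period ahead forecast errors, the following Python sketch computes them from paired actual and forecast values. It assumes the textbook definitions with a divisor of n; the function name is hypothetical, and the software may use a different divisor (for example, one adjusted for the number of estimated coefficients).

```python
import math


def forecast_error_summary(actual, forecast):
    """Summary statistics of one-period ahead forecast errors.

    Assumes plain averages over n errors; percentage errors are taken
    relative to the actual values, which must be nonzero.
    """
    errors = [a - f for a, f in zip(actual, forecast)]
    pct = [100.0 * e / a for e, a in zip(errors, actual)]
    n = len(errors)
    return {
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),  # variability
        "MAE": sum(abs(e) for e in errors) / n,             # variability
        "MAPE": sum(abs(p) for p in pct) / n,               # variability, in %
        "ME": sum(errors) / n,                              # bias
        "MPE": sum(pct) / n,                                # bias, in %
    }


# Illustrative values only, not taken from baseball.sgd.
stats = forecast_error_summary([100.0, 200.0, 400.0], [110.0, 190.0, 400.0])
for name, value in stats.items():
    print(f"{name} = {value:.3f}")
```

In this toy example the errors cancel, so ME is zero even though RMSE and MAE are not, which is why bias and variability are reported separately.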

Time Sequence Plot (top right)


The plot shows:

1. The observed data Y_t, shown as point symbols, including any replacements for missing
values.

2. The one-step-ahead forecasts F_t(1), displayed as a solid line through the points. These are
created using the fitted model, forecasting each time period t+1 using only the
information available at time t. The one-step-ahead forecast errors e_t are observable as the
vertical distance between the observations and the solid line.

3. Forecasts for future values F_n(k) made at time t = n, the last time at which observed data
are available. These are shown by the extension of the solid forecast line beyond the last
observation.

4. Probability limits for the forecasts at the 100(1-α)% confidence level, calculated
assuming that the noise in the system follows a normal distribution.

For mathematical details regarding the calculations, see the documentation for the Forecasting
procedure.
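
To make the plotted quantities concrete, here is a minimal Python sketch for the textbook ARIMA(0,1,1) model without a constant, Y_t = Y_(t-1) + a_t - θ a_(t-1). The value of θ, the start-up rule, and the function name are assumptions made for illustration; the software estimates its own coefficients and may initialize and compute limits differently, so this sketch is not expected to reproduce the tabulated forecasts.

```python
import math
from statistics import NormalDist


def ima11_forecasts(y, theta, n_ahead, level=0.95):
    """One-step-ahead forecasts, future forecasts, and normal limits."""
    # fitted[t] forecasts y[t] using only data through t-1.
    fitted = [y[0]]  # assumption: the first forecast equals the first value
    for t in range(1, len(y)):
        prev_error = y[t - 1] - fitted[t - 1]
        fitted.append(y[t - 1] - theta * prev_error)
    errors = [obs - f for obs, f in zip(y, fitted)]

    # F_n(k): for this model the forecast is the same for every lead k.
    point = y[-1] - theta * errors[-1]

    # Residual standard deviation, skipping the artificial first error.
    sigma = math.sqrt(sum(e * e for e in errors[1:]) / (len(y) - 1))
    z = NormalDist().inv_cdf(0.5 + level / 2.0)

    limits = []
    for k in range(1, n_ahead + 1):
        # k-step error variance: sigma^2 * (1 + (k - 1) * (1 - theta)^2)
        half = z * sigma * math.sqrt(1.0 + (k - 1) * (1.0 - theta) ** 2)
        limits.append((point - half, point, point + half))
    return fitted, limits


fitted, limits = ima11_forecasts([10.0, 12.0, 11.0, 13.0], theta=0.5, n_ahead=3)
for lower, point, upper in limits:
    print(f"{lower:.2f}  {point:.2f}  {upper:.2f}")
```

The limits widen with the lead time k because forecast errors accumulate, which matches the widening 95% limits in the sample output.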

Residual Autocorrelations (center left)


The residual autocorrelations measure the correlation between residuals from the fitted
forecasting model that are separated by a given lag. If the model has captured all of the dynamic
structure in the data, then the residuals should be random (white noise). In such a case, all of the
estimates should be within the probability limits, as in the above plot.
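
A rough Python sketch of the calculation follows. It uses the standard sample autocorrelation and the common large-sample limits of ±z/√n for white noise; both the function name and the choice of limits are assumptions, and the software may use a different formula for its probability limits.

```python
import math
from statistics import NormalDist


def residual_autocorrelations(residuals, max_lag, level=0.95):
    """Sample autocorrelations at lags 1..max_lag and an approximate limit."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [r - mean for r in residuals]
    c0 = sum(d * d for d in dev)
    acf = [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
        for k in range(1, max_lag + 1)
    ]
    # Approximate white-noise limit: estimates outside +/- limit are suspect.
    limit = NormalDist().inv_cdf(0.5 + level / 2.0) / math.sqrt(n)
    return acf, limit


# Illustrative alternating residuals: strongly non-random by construction.
acf, limit = residual_autocorrelations([1.0, -1.0, 1.0, -1.0, 1.0, -1.0], max_lag=2)
flagged = [k + 1 for k, r in enumerate(acf) if abs(r) > limit]
print(acf, limit, flagged)
```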

Residual Plot (center right)
This plot shows the residuals in sequential order. It can be helpful in finding outliers or identifying
trends that the forecasting model has missed. Ideally, the residuals should behave like a random
set of observations from a normal distribution.

Residual Periodogram (bottom left)


The residual periodogram can be used to identify cyclical components that have not been
captured by the forecasting model. The periodogram plots the power remaining at each of the
Fourier frequencies. If the residuals are random, there should be approximately equal power at all
frequencies, which is why a random time series is often called “white noise”. Any large spikes
could indicate a cycle at a fixed frequency that, if modeled, might improve the forecasts.
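
The idea can be sketched in Python by evaluating the discrete Fourier transform of the mean-centered residuals at the Fourier frequencies j/n. The function name and the 2/n scaling of the ordinates are assumptions; scaling conventions vary, and only the relative heights matter when looking for spikes.

```python
import cmath
import math


def periodogram(x):
    """Return (frequency, ordinate) pairs at the Fourier frequencies j/n."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    result = []
    for j in range(1, n // 2 + 1):
        freq = j / n  # cycles per sampling interval, at most 0.5
        dft = sum(
            d * cmath.exp(-2j * math.pi * freq * t) for t, d in enumerate(dev)
        )
        result.append((freq, (2.0 / n) * abs(dft) ** 2))
    return result


# Illustrative series with a pure cycle of period 4 (frequency 0.25).
series = [1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0]
peak_freq, peak_power = max(periodogram(series), key=lambda pair: pair[1])
print(peak_freq, peak_power)
```

For this constructed series all of the power falls at frequency 0.25, the kind of isolated spike that would suggest an unmodeled cycle.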

Residual Normal Probability Plot (bottom right)


The normal probability plot is used to determine whether the residuals left behind by the
forecasting model follow a normal distribution. If so, they should fall approximately along the
reference line. A plot such as that displayed above, which shows some curvature in the tails, is
indicative of a situation where the data have some positive skewness. In such cases, it may be
helpful to transform the data using Analysis Options.
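
The coordinates behind such a plot can be sketched in Python as follows. The plotting positions (i - 0.375)/(n + 0.25) are one common convention and are an assumption here, as is the function name; the software may use different positions and a different reference line.

```python
from statistics import NormalDist


def normal_plot_points(residuals):
    """Pair each sorted residual with its expected standard normal score."""
    n = len(residuals)
    standard_normal = NormalDist()
    return [
        (r, standard_normal.inv_cdf((i - 0.375) / (n + 0.25)))
        for i, r in enumerate(sorted(residuals), start=1)
    ]


# Illustrative residuals with one large positive value (right skew).
points = normal_plot_points([-3.0, -1.0, 0.0, 2.0, 15.0])
for residual, score in points:
    print(f"{residual:6.1f}  {score:6.3f}")
```

If the residuals were normal, the pairs would lie close to a straight line; the large value 15.0 in this toy data would bend the upper end away from it.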

Analysis Options

• Display: if desired, the plot may be limited to the specified number of most recent
observations.

• Transformation: the transformation to be applied to the data, if any. If Box-Cox is selected,
the program will automatically determine an appropriate power transformation to normalize
the data, after adding the specified Addend to each data value. Note: the Box-Cox option can
be very time-consuming if many models are being compared, since the program will fit every
model at each iteration of the Box-Cox optimization algorithm.
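
For orientation, the basic Box-Cox power transformation with an addend can be sketched in Python as below. This is the standard textbook form with a hypothetical function name; the software may use a rescaled variant, and the search for the best power is not shown.

```python
import math


def box_cox(values, power, addend=0.0):
    """Apply the textbook Box-Cox transformation after adding the addend."""
    shifted = [v + addend for v in values]
    if any(s <= 0 for s in shifted):
        raise ValueError("values plus addend must be positive")
    if power == 0:
        return [math.log(s) for s in shifted]  # limiting case of the power form
    return [(s ** power - 1.0) / power for s in shifted]


# Illustrative values only.
print(box_cox([1.0, 4.0], power=0.5))
print(box_cox([0.0, 9.0], power=0, addend=1.0))
```

The addend exists because the transformation requires positive values; it shifts the data before the power is applied.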


SnapStat Defaults
The defaults used by the Automatic Forecasting SnapStat are set on the Forecasting tab of the
Preferences dialog box under the Edit menu:

• Models Included: specify the models that should be fit to the data. These are the models
from which the “best” model will be selected. Descriptions of each of the models are given
in the Forecasting documentation. For several of the models, additional options are provided:

Random walk model – check include constant to consider a model containing a constant
as well as one without a constant.

Moving average model – select the maximum span to consider. Models will be fit of
spans 2 through the number indicated.

ARIMA AR Terms – specify the maximum order p of the autoregressive terms in the
model.

ARIMA MA Terms – specify the maximum order q of the moving average terms in the
model. You may elect instead to consider only models for which q = p – 1.

ARIMA Differencing – specify the maximum order of differencing d. Select Include
constant to consider models that include a constant term when differencing is performed.

• Information Criterion: the criterion used to select the best model.

• Forecast Limits: percentage used for the forecast probability limits.
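
To show how the ARIMA settings above bound the search, the following Python sketch enumerates candidate (p, d, q) orders from the three maximums, with an optional restriction to q = p - 1. The function and parameter names are hypothetical, and the exact set of models the software fits (for example, whether it includes the case p = q = 0) is an assumption.

```python
from itertools import product


def candidate_arima_orders(max_p, max_q, max_d, q_equals_p_minus_1=False):
    """List (p, d, q) orders up to the given maximums."""
    orders = []
    for p, d in product(range(max_p + 1), range(max_d + 1)):
        if q_equals_p_minus_1:
            # Only q = p - 1 is considered, so p must be at least 1.
            qs = [p - 1] if p >= 1 else []
        else:
            qs = range(max_q + 1)
        orders.extend((p, d, q) for q in qs)
    return orders


print(len(candidate_arima_orders(2, 1, 1)))
print(candidate_arima_orders(2, 1, 1, q_equals_p_minus_1=True))
```

Raising any of the maximums multiplies the number of models to fit, which is one reason the restricted q = p - 1 option can save time.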

The procedure fits each of the models indicated and selects the model that gives the smallest
value of the selected criterion. There are three criteria to choose from:

Akaike Information Criterion


The Akaike Information Criterion (AIC) is calculated from

\[ \mathrm{AIC} = 2\ln(\mathrm{RMSE}) + \frac{2c}{n} \qquad (1) \]

where RMSE is the root mean squared error during the estimation period, c is the number of
estimated coefficients in the fitted model, and n is the sample size used to fit the model. Notice
that the AIC is a function of the variance of the model residuals, penalized by the number of
estimated parameters. In general, the model will be selected that minimizes the mean squared
error without using too many coefficients (relative to the amount of data available).

Hannan-Quinn Criterion
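The Hannan-Quinn Criterion (HQC) is calculated from

\[ \mathrm{HQC} = 2\ln(\mathrm{RMSE}) + \frac{2c\,\ln(\ln(n))}{n} \qquad (2) \]

where c is the number of estimated coefficients and n is the sample size used to fit the model.
This criterion uses a different penalty for the number of estimated parameters than the AIC does.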

Schwarz-Bayesian Information Criterion


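The Schwarz-Bayesian Information Criterion (SBIC) is calculated from

\[ \mathrm{SBIC} = 2\ln(\mathrm{RMSE}) + \frac{c\,\ln(n)}{n} \qquad (3) \]

where c is the number of estimated coefficients and n is the sample size used to fit the model.
Again, the penalty for the number of estimated parameters differs from that of the other criteria.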
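
The three criteria and the selection rule can be sketched together in Python. The sketch evaluates equations (1) to (3) with c as the number of estimated coefficients and n as the sample size; the function names and the candidate fits are hypothetical and are not results from baseball.sgd.

```python
import math


def information_criteria(rmse, c, n):
    """Evaluate AIC, HQC and SBIC for one fitted model."""
    base = 2.0 * math.log(rmse)
    return {
        "AIC": base + 2.0 * c / n,
        "HQC": base + 2.0 * c * math.log(math.log(n)) / n,
        "SBIC": base + c * math.log(n) / n,
    }


def select_best(models, criterion="AIC"):
    """Return the model with the smallest value of the chosen criterion."""
    return min(
        models,
        key=lambda m: information_criteria(m["rmse"], m["c"], m["n"])[criterion],
    )


# Hypothetical fits: model B has a slightly smaller RMSE but more coefficients.
candidates = [
    {"name": "A", "rmse": 17.7, "c": 1, "n": 104},
    {"name": "B", "rmse": 17.6, "c": 3, "n": 104},
]
for crit in ("AIC", "HQC", "SBIC"):
    print(crit, select_best(candidates, crit)["name"])
```

Here the small gain in RMSE does not offset the penalty for two extra coefficients, so the simpler model is preferred under all three criteria. For n of about 16 or more, the per-coefficient penalty is smallest for the AIC and largest for the SBIC, so the SBIC tends to favor the most parsimonious models.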
