MODELS ON ROAD TRAFFIC FORECASTING: IDENTIFICATION AND DISCUSSION OF DIFFERENT TIME SERIES MODELS
FERNANDO FERNANDES NETO

Instituto de Pesquisas Tecnológicas do Estado de São Paulo (IPT) / Secretaria do Planejamento e Desenvolvimento Regional do Estado de São Paulo
Palácio dos Bandeirantes - Av. Morumbi, 4500, 1º Andar, Sala 142, Morumbi, São Paulo/SP
E-mails: fernandofernandes@planejamento.sp.gov.br / nando.fernandes.neto@gmail.com

CLAUDIO GARCIA
Escola Politécnica da Universidade de São Paulo, Departamento de Engenharia de Telecomunicações e Controle
Avenida Prof. Luciano Gualberto, trav. 3, 158, Butantã, São Paulo/SP, Brasil - 05508900
E-mail: clgarcia@lac.usp.br
Abstract: This paper discusses and calibrates univariate models (scalar approach, SARIMAX) and multivariate models (vector approach, VAR and VEC) for forecasting traffic, measured in equivalent axles, in the Anchieta-Imigrantes system. The best-performing models in the backtesting procedure were those of the second type (vector), with a mean absolute error of approximately 3% at a monthly frequency.
Keywords: VAR, VEC, ARIMA, SARIMA, identification, time series, toll roads
Resumo: Neste artigo são discutidos e calibrados modelos univariados (abordagem escalar, SARIMAX) e multivariados (abordagem vetorial, VAR e VEC) para a previsão de tráfego em eixos equivalentes no sistema Anchieta-Imigrantes. Os modelos que tiveram melhor desempenho no backtesting foram os do segundo tipo (vetorial), tendo erro médio absoluto de aproximadamente 3% em uma frequência mensal.
Palavras-chave: VAR, VEC, ARIMA, SARIMA, identificação, séries temporais, rodovias
1 Introduction
One of the main problems in the toll road sector is cash flow planning and forecasting, owing both to its idiosyncratic complexity (e.g. levels of service, seasonal effects and the inertial evolution of traffic) and to the impact of other variables such as the Gross Domestic Product (GDP).
A wide range of methods has been applied to traffic forecasting, from time series models, Kalman filter based models and neural networks to Markov chain models, simulation models and linear regression models, as shown by Bolshinsky and Freidman (2012), or a combination of them, according to Fillatre et al. (2005), with data ranging from high to low frequency.
It is also important to note that, despite the rich literature on traffic forecasting, little attention has been paid to the predictive ability of most of these methods, as can be seen in Bain (2009). In fact, there is a considerable error range in U.S. traffic forecasts, as pointed out by the same author:

"actual traffic turned out to lie between 86% below forecast to 51% above forecast. This considerable error range illustrates the possible magnitude of uncertainty when traffic risk is passed to the private sector."

Hence, planning and forecasting play a fundamental role in this field, since most of the necessary investments and, consequently, the corresponding decisions and cash outflows must take into account a very long timeline: conception, construction, maturation of the project until full capacity, etc.
Thus, the main goal of this paper is to discuss alternative traffic forecasting methods for toll roads, in this case vector autoregressive models (namely VAR and VEC) and univariate time series models based on Seasonal ARIMAX, both discussed in the next section, using as an illustration one of the most important highway systems in Brazil, the Anchieta-Imigrantes System.
This paper is divided into the following sections: introduction, methodology, presentation of the problem, results, analysis of the results and conclusion.

2 Methodology
2.1 Univariate Models
The univariate approach in the present paper is based on Seasonal ARIMAX models, a natural extension of the classical ARIMAX models. A seasonal model is the product of two ARIMAX polynomials, one describing the regular structure of the time series and the other its seasonal structure, as can be seen in Morettin and Toloi (2004), Box and Jenkins (1976) and Hamilton (1994).
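As a minimal illustration of this product structure, the sketch below multiplies a regular autoregressive polynomial by a seasonal one, both written in the lag operator B. The coefficient values and the helper name `poly_mul` are hypothetical and chosen only for illustration; they are not estimates from this paper.

```python
def poly_mul(p, q):
    """Multiply two polynomials in the lag operator B.

    Each polynomial is a list of coefficients whose index is the power of B.
    """
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


# Hypothetical coefficients, for illustration only.
phi = 0.5        # regular AR(1) coefficient
Phi = 0.8        # seasonal AR(1) coefficient, seasonal period of 12 months

regular = [1.0, -phi]                      # 1 - phi * B
seasonal = [1.0] + [0.0] * 11 + [-Phi]     # 1 - Phi * B^12

combined = poly_mul(regular, seasonal)

# Keep only the non-zero terms: powers 0, 1, 12 and the cross term at 13.
nonzero = {power: c for power, c in enumerate(combined) if c != 0.0}
print(nonzero)
```

The cross term at lag 13 is what distinguishes a multiplicative seasonal model from one that simply adds a lag-12 term to the regular polynomial.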
2.2 Multivariate Models
The multivariate models are mainly based on Vector Autoregression (VAR) models. These are nothing more than a multivariable extension of the classical scalar autoregressive (AR) models, in the sense that the process is described in terms of matrices and vectors instead of scalars. Thus, the model allows for mutual causal relationships among all the variables in the dynamic system. For example, a VAR(p) process can be written as:

$$Y_t = \Phi_1 Y_{t-1} + \Phi_2 Y_{t-2} + \dots + \Phi_p Y_{t-p} + a_t \qquad (1)$$

where the $\Phi_i$ terms are square matrices of order $n$; $Y_{t-i}$ are $n \times 1$ vectors of endogenous variables; $a_t$ is an $n \times 1$ vector of uncorrelated residuals; $n$ is the number of endogenous variables and $p$ is the number of lags.
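The recursion in Equation (1) can be sketched as a one-step-ahead forecast, in which the residual is replaced by its expected value of zero. The function names and all numerical values below are hypothetical; this is only an illustration of the matrix form, not the estimation code used in the paper.

```python
def mat_vec(m, v):
    """Product of a matrix (list of rows) and a vector (list)."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def var_forecast(coefs, history):
    """One-step-ahead VAR(p) forecast with the residual set to zero.

    coefs:   list of p square matrices, one per lag (lag 1 first)
    history: list of past observation vectors, most recent last
    """
    if len(history) < len(coefs):
        raise ValueError("history must contain at least p observations")
    pred = [0.0] * len(history[-1])
    for lag, phi in enumerate(coefs, start=1):
        contribution = mat_vec(phi, history[-lag])
        pred = [a + b for a, b in zip(pred, contribution)]
    return pred


# Hypothetical VAR(2) with n = 2 endogenous variables.
phi_1 = [[0.5, 0.1],
         [0.0, 0.3]]
phi_2 = [[0.2, 0.0],
         [0.1, 0.1]]
history = [[1.0, 2.0],    # observation at t-2
           [3.0, 4.0]]    # observation at t-1

forecast = var_forecast([phi_1, phi_2], history)
print(forecast)
```

Each forecast component mixes lagged values of every variable, which is the sense in which the vector model lets the variables influence one another.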
In addition, as with the classical scalar autoregressive (AR) models, if all variables are stationary this model can be estimated by Ordinary Least Squares (OLS). On the other hand, when one or more variables in a VAR model are non-stationary, the OLS results may no longer be valid. Consequently, the theory of cointegration was developed in order to analyze possible relationships between non-stationary time series.
Furthermore, Granger and Newbold (1974) discussed and exposed the problem of spurious regressions between non-stationary time series. They verified that, given two completely unrelated non-stationary series, a regression of one on the other may produce an apparently significant relationship.
Therefore, if two variables are non-stationary and have a long-run equilibrium relationship, they may be cointegrated; that is, each series is non-stationary on its own, yet there is a stable relationship between them (a linear combination of the two that is stationary), as shown by Engle and Granger (1987), Ashley and Granger (1979) and Johansen (1988).
Vector Error Correction (VEC) models were then developed. They can be seen as extensions of the VAR in which an error correction term is introduced, according to Hendry and Juselius (2000, 2001) and Lütkepohl (1991).
In order to verify the cointegration assumption, the approach adopted in this paper is first to check that all variables are non-stationary, using the Augmented Dickey-Fuller test (Dickey and Fuller, 1979) at the 5% significance level. Then, only if the variables are non-stationary, and following Engle and Granger (1987), the cointegration residuals are obtained by running a regression over the variables, and these residuals are tested for stationarity. If the residuals are stationary (again according to the Augmented Dickey-Fuller test), the time series are cointegrated; otherwise they are not.
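The two-step procedure above can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic, the helper names are hypothetical, and a plain Dickey-Fuller regression (no constant, no lagged differences) is used in place of the augmented test applied in the paper. The resulting statistic would still have to be compared against critical values tabulated for residual-based cointegration tests, which are not reproduced here.

```python
import random


def ols_simple(x, y):
    """OLS of y on a constant and x; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def df_tstat(u):
    """t-statistic of rho in the regression  du_t = rho * u_{t-1} + e_t."""
    lagged = u[:-1]
    diffs = [b - a for a, b in zip(u[:-1], u[1:])]
    sxx = sum(v * v for v in lagged)
    rho = sum(v * d for v, d in zip(lagged, diffs)) / sxx
    resid = [d - rho * v for v, d in zip(lagged, diffs)]
    s2 = sum(e * e for e in resid) / (len(diffs) - 1)
    return rho / (s2 / sxx) ** 0.5


random.seed(0)

# Synthetic cointegrated pair: y2 is a random walk, y1 = 2 * y2 + noise.
y2 = [0.0]
for _ in range(299):
    y2.append(y2[-1] + random.gauss(0, 1))
y1 = [2.0 * v + random.gauss(0, 1) for v in y2]

# Step 1: cointegrating regression of y1 on y2.
intercept, slope = ols_simple(y2, y1)
residuals = [a - intercept - slope * b for a, b in zip(y1, y2)]

# Step 2: unit-root statistic on the residuals.
stat = df_tstat(residuals)
print(slope, stat)
```

A strongly negative statistic points towards stationary residuals and hence towards cointegration; for two unrelated random walks the statistic would typically be much closer to zero.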
In order to explain how the VEC model structure is obtained, one can start from a dynamic system of two variables that are cointegrated (by hypothesis), following Hendry and Juselius (2000, 2001), Lütkepohl (1991, 2004) and Morettin (2011).
Let $Y_{1,t}$ and $Y_{2,t}$ be two non-stationary cointegrated variables, and assume that there is an equilibrium relation between them given by:

$$Y_{1,t} - \beta Y_{2,t} = \varepsilon_t \sim N(0, \sigma^2) \qquad (2)$$

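If the variations in $Y_{1,t}$ and $Y_{2,t}$ are assumed to depend on the deviation from this equilibrium at $t-1$, it follows that:

$$\Delta Y_{1,t} = \alpha_1 (Y_{1,t-1} - \beta Y_{2,t-1}) + a_{1,t}, \quad a_{1,t} \sim N(0, \sigma_1^2) \qquad (3.1)$$

$$\Delta Y_{2,t} = \alpha_2 (Y_{1,t-1} - \beta Y_{2,t-1}) + a_{2,t}, \quad a_{2,t} \sim N(0, \sigma_2^2) \qquad (3.2)$$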
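To see why this pair of equations pulls the system back towards equilibrium, it helps to track the deviation itself. The short derivation below is an illustrative addition: it writes $\alpha_1$, $\alpha_2$ for the two adjustment coefficients and $\beta$ for the equilibrium coefficient, and introduces the symbol $z_t$ for the deviation, which is not part of the original notation.

```latex
% Deviation from equilibrium (z_t is a symbol introduced for this sketch)
z_t = Y_{1,t} - \beta Y_{2,t}

% Subtract beta times (3.2) from (3.1):
\Delta z_t = \Delta Y_{1,t} - \beta \Delta Y_{2,t}
           = (\alpha_1 - \beta \alpha_2)\, z_{t-1} + a_{1,t} - \beta a_{2,t}

% Hence the deviation follows a scalar AR(1) process:
z_t = (1 + \alpha_1 - \beta \alpha_2)\, z_{t-1} + a_{1,t} - \beta a_{2,t}

% which is stationary (deviations die out) when
\left| 1 + \alpha_1 - \beta \alpha_2 \right| < 1
```

Under this condition any gap between the two series shrinks over time, which is the error correction behaviour the model is named after.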

This error correction model can be generalized so that the corrections towards equilibrium may also depend on previous changes in the variables, due to possible autocorrelation:

$$\Delta Y_{1,t} = \alpha_1 (Y_{1,t-1} - \beta Y_{2,t-1}) + \gamma_{1,1} \Delta Y_{1,t-1} + \gamma_{1,2} \Delta Y_{2,t-1} + a_{1,t}, \quad a_{1,t} \sim N(0, \sigma_1^2) \qquad (4.1)$$

$$\Delta Y_{2,t} = \alpha_2 (Y_{1,t-1} - \beta Y_{2,t-1}) + \gamma_{2,1} \Delta Y_{1,t-1} + \gamma_{2,2} \Delta Y_{2,t-1} + a_{2,t}, \quad a_{2,t} \sim N(0, \sigma_2^2) \qquad (4.2)$$

where this model is actually a VAR model (of order 2 in the levels of the variables). In order to verify that, one can simply put this pair of equations into matrix form, resulting in:

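$$\Delta Y_t = \alpha \beta' Y_{t-1} + A \Delta Y_{t-1} + a_t \qquad (5)$$

where:

$$\alpha = \begin{bmatrix} \alpha_1 \\ \alpha_2 \end{bmatrix}, \quad \beta' = \begin{bmatrix} 1 & -\beta \end{bmatrix}, \quad A = \begin{bmatrix} \gamma_{1,1} & \gamma_{1,2} \\ \gamma_{2,1} & \gamma_{2,2} \end{bmatrix} \qquad (6)$$

or rewriting as:

$$Y_t = (\alpha \beta' + A + I)\, Y_{t-1} - A\, Y_{t-2} + a_t \qquad (7)$$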
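The equivalence between the error correction form of Equation (5) and the level form of Equation (7) can be checked numerically. In the sketch below the adjustment vector, the equilibrium vector, the matrix of short-run coefficients and the two past observations are all hypothetical values, and the residual is set to zero in both forms.

```python
def mat_vec(m, v):
    """Product of a matrix (list of rows) and a vector (list)."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def outer(col, row):
    """Outer product of a column vector and a row vector."""
    return [[c * r for r in row] for c in col]


# Hypothetical parameters, for illustration only.
alpha = [-0.4, 0.1]               # adjustment coefficients
beta_row = [1.0, -2.0]            # equilibrium vector: Y1 - 2 * Y2
A = [[0.2, 0.0],
     [0.1, 0.3]]                  # short-run coefficient matrix
identity = [[1.0, 0.0],
            [0.0, 1.0]]

y_lag1 = [10.0, 4.0]              # observation at t-1
y_lag2 = [9.0, 4.5]               # observation at t-2

# Error correction form: change = (alpha beta') Y_{t-1} + A * (Y_{t-1} - Y_{t-2})
pi = outer(alpha, beta_row)
diff_lag = [a - b for a, b in zip(y_lag1, y_lag2)]
change = [p + q for p, q in zip(mat_vec(pi, y_lag1), mat_vec(A, diff_lag))]
y_from_vec = [a + b for a, b in zip(y_lag1, change)]

# Level form: Y_t = (alpha beta' + A + I) Y_{t-1} - A Y_{t-2}
phi_1 = [[pi[i][j] + A[i][j] + identity[i][j] for j in range(2)] for i in range(2)]
phi_2 = [[-A[i][j] for j in range(2)] for i in range(2)]
y_from_var = [p + q for p, q in zip(mat_vec(phi_1, y_lag1), mat_vec(phi_2, y_lag2))]

print(y_from_vec, y_from_var)
```

Both forms return the same prediction up to rounding, so the choice between them is a matter of which parameters one wants to read off directly.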

According to Gujarati and Porter (2011), such a relationship can be generalized and is guaranteed by the Granger Representation Theorem, which shows that any VAR(p) can be written as a VEC(q) and vice versa. Depending on the autocorrelation structure, it may be useful to have both a VEC(q) model and its respective VAR(p). More details can be found in Greene (2002).
3 Presentation of the Problem
This paper considers a VAR and a VEC model with the following variables, all of them endogenous: traffic and Gross Domestic Product (GDP). It also considers two kinds of univariate SARIMAX models, one with a seasonal difference plus a stochastic seasonal shock and another with an autoregressive seasonal term.
The GDP series is available at the website of IPEA (Instituto de Pesquisa Econômica Aplicada, the Brazilian Institute of Applied Economic Research), while the other series are publicly available upon request to ARTESP (Agência Reguladora de Transportes do Estado de São Paulo, the Transportation Regulatory Agency of São Paulo State, Brazil). The time series encompass observations from March 31st, 1998 until July 31st, 2013. The last six observations are left out to test the forecast accuracy of the models.
A main concern that may be raised is that treating the Gross Domestic Product as an endogenous variable may seem counter-intuitive. However, it is known that traffic can act as a leading indicator of GDP behavior, and this assumption is in fact tested in this paper through the verification of cointegration between both variables.
Traffic was normalized to an equivalent vehicle base, in order to convert different types of vehicles into car equivalents, e.g. a heavy truck is equivalent to n cars, while a light truck is equivalent to n-2 cars.
Seasonality in the vector models was accounted for by including a vector of dummy variables, since the data are monthly.
Then, with all the time series normalized and the seasonal effects accounted for, the rank of cointegration and the number of lags must be established.
In this case, the rank of cointegration is the number of cointegrating vectors, which is tested according to Johansen (1988), and the lowest value of the information criterion determines the number of lags in both univariate and multivariate models, as suggested in Lütkepohl and Krätzig (2004). For multivariate models, the Bayesian Information Criterion was chosen because it imposes stronger penalties on the inclusion of new parameters, and this kind of model naturally has a larger number of parameters. For univariate models, on the other hand, the Akaike Information Criterion was used, since these models generally have fewer parameters than the multivariate ones.
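The difference in penalties between the two criteria can be made concrete with their textbook definitions, AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the number of parameters and n the number of observations. The log-likelihoods, parameter counts and sample size below are hypothetical, and the software packages used in the paper may report rescaled versions of these criteria.

```python
import math


def aic(loglik, k):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * k - 2 * loglik


def bic(loglik, k, n):
    """Bayesian (Schwarz) Information Criterion: k ln(n) - 2 ln L."""
    return k * math.log(n) - 2 * loglik


# Hypothetical comparison: a small model versus a larger one with a
# slightly higher log-likelihood, both fitted to n = 180 observations.
n_obs = 180
small = {"loglik": -100.0, "k": 3}
large = {"loglik": -98.0, "k": 6}

for name, model in (("small", small), ("large", large)):
    print(name,
          aic(model["loglik"], model["k"]),
          bic(model["loglik"], model["k"], n_obs))

# Penalty per extra parameter: 2 under AIC, ln(n) under BIC.
print(math.log(n_obs))
```

For any sample with more than about eight observations ln(n) exceeds 2, so BIC penalizes each additional parameter more heavily than AIC, which is the property invoked above for the multivariate models.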
The estimation of the parameters and all the tests mentioned were carried out using GRETL (Gnu Regression, Econometrics and Time-series Library) for the multivariate models and R for the univariate models.
4 Results
Table 1 shows the results of the Bayesian Information Criterion lag search for the multivariate models.

Table 1. Bayesian Information Criterion of the Lag Search
lags    BIC
1       46.740746*
2       46.811174
3       46.868567
4       46.958411
5       46.970219
6       47.066916

As can be seen in this table, the lowest BIC (marked with *) is obtained with one lag, so the multivariate models must have only one lag.
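The selection rule applied to Table 1 amounts to picking the lag with the smallest criterion value. A minimal sketch using the values reported in the table (the variable names are illustrative):

```python
# BIC values by number of lags, copied from Table 1.
bic_by_lag = {
    1: 46.740746,
    2: 46.811174,
    3: 46.868567,
    4: 46.958411,
    5: 46.970219,
    6: 47.066916,
}

# The selected lag order is the one that minimizes the criterion.
best_lag = min(bic_by_lag, key=bic_by_lag.get)
print(best_lag)  # 1
```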
For the univariate models, the most common combinations of moving average (shock) and autoregressive terms were tested using the auto.arima function, provided in the forecast package of the R statistical software, to find the optimal regular ARIMA structure. It resulted in an ARIMA polynomial of the form ARIMA(p=1, d=1, q=4).
Then the two most usual seasonal polynomials were calibrated: SARIMA(p=1, d=0, q=0) and SARIMA(p=0, d=1, q=1).
The rank of cointegration was determined according to the Johansen (1988) test; for a null rank matrix (the null hypothesis), the p-value is 0.03. So, the statistical evidence points out that there is no cointegrating relationship between the variables. Despite that, the VEC model was still estimated in this paper for comparison purposes.
Thus, four different models were obtained, as follows.

Seasonal Model with Seasonal Difference:

$$\Delta\Delta_{12} Y_t = 0.4864\,\Delta\Delta_{12} Y_{t-1} + 0.0447\,a_{t-1} - 0.2215\,a_{t-2} - 0.0978\,a_{t-3} - 0.5514\,a_{t-4} - 0.6753\,a_{t-12} + 34710.72 \qquad (8)$$

Seasonal Model with Autoregressive Seasonal Components:

$$\Delta Y_t = 0.5280\,\Delta Y_{t-1} + 0.8141\,\Delta Y_{t-12} + 0.0231\,a_{t-1} - 0.2902\,a_{t-2} - 0.1227\,a_{t-3} - 0.5641\,a_{t-4} + 2679.039 \qquad (9)$$

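VAR Model with Seasonal Dummies:

$$\begin{bmatrix} Y \\ PIB \end{bmatrix}_t = \begin{bmatrix} 0.2523 & 9.7520 \\ 0.0019 & 0.9735 \end{bmatrix} \begin{bmatrix} Y \\ PIB \end{bmatrix}_{t-1} + \begin{bmatrix} K_1 \\ K_2 \end{bmatrix} \qquad (10)$$

where $K_1$ and $K_2$ are the seasonal dummy terms given in the following table ($PIB$, Produto Interno Bruto, denotes the GDP):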

Table 2. Seasonal Parameter Estimates of the VAR Model
            K1          K2
S1          181443      -7270.76
S2          -623254     -4400.54
S3          -214817     12191.9
S4          -460430     4863.24
S5          -560794     12545.8
S6          -653837     5743.71
S7          -296413     2613.97
S8          -481878     5324.9
S9          -451521     -1374.43
S10         -195290     13806.3
S11         -395400     6612.35
Constant    1259780     -5468.02

Thus, if the month to be predicted is January, one must add the coefficient S1 to the constant, and so on according to the month being predicted.
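The rule above can be sketched with the estimates of Table 2. The function name is hypothetical, and the treatment of December as the baseline month (constant only, since there are eleven dummies) is an assumption made here; the text only states the rule explicitly for January.

```python
# Seasonal dummy estimates from Table 2 (S1..S11) and the constants.
S_K1 = [181443, -623254, -214817, -460430, -560794, -653837,
        -296413, -481878, -451521, -195290, -395400]
S_K2 = [-7270.76, -4400.54, 12191.9, 4863.24, 12545.8, 5743.71,
        2613.97, 5324.9, -1374.43, 13806.3, 6612.35]
CONSTANT_K1 = 1259780
CONSTANT_K2 = -5468.02


def deterministic_term(month):
    """Return (K1, K2) for a calendar month numbered 1..12.

    Assumption: S1 corresponds to January, S11 to November, and
    December is the baseline month that receives the constant only.
    """
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    if month == 12:
        return CONSTANT_K1, CONSTANT_K2
    return CONSTANT_K1 + S_K1[month - 1], CONSTANT_K2 + S_K2[month - 1]


k1_january, k2_january = deterministic_term(1)
print(k1_january, k2_january)
```

For January the first component is the constant 1259780 plus S1 = 181443, that is, 1441223.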
Finally, the VEC model with seasonal dummies is
presented as follows.

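$$\Delta \begin{bmatrix} Y \\ PIB \end{bmatrix}_t = \begin{bmatrix} 0.74791 & 9.769 \\ 0.0019 & 0.0247 \end{bmatrix} \Delta \begin{bmatrix} Y \\ PIB \end{bmatrix}_{t-1} + \begin{bmatrix} 9.769 \\ 0.0247 \end{bmatrix} \left[ PIB - 0.0765\,Y \right]_{t-1} + \begin{bmatrix} K_1 \\ K_2 \end{bmatrix} \qquad (11)$$

where $K_1$ and $K_2$ are the seasonal dummy terms given in the following table: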

Table 3. Seasonal Parameter Estimates of the VEC Model
            K1          K2
S1          181590      -7256.59
S2          -622809     -4357.79
S3          -214438     12228.3
S4          -460173     4887.3
S5          -560589     12565.6
S6          -653788     5748.41
S7          -296368     2618.3
S8          -481723     5339.82
S9          -451418     -1364.57
S10         -195095     13825
S11         -395336     6618
Constant    911177      -1494.86

5 Analysis of the Results
In order to select the best model, the out-of-sample forecasting accuracy is measured in terms of the mean absolute error, as follows.

Table 4. Out-of-sample Errors of the Models
Model                               Mean Absolute Error
ARIMA(1,1,4) - Seasonal IMA(1)      11.28%
ARIMA(1,1,4) - Seasonal AR(1)       10.70%
VAR(1)                              3.23%
VEC(1)                              3.14%
Thus, the very surprising result is that the VEC(1) model, which according to the existing literature should not even have been estimated, is the best model in terms of out-of-sample performance. Nonetheless, it was already expected that a multivariate model would perform better than a univariate model, since more information is included.

Another interesting fact is that the log-likelihoods of the univariate models are far better than those of the multivariate ones, as can be seen in the following table, under the criterion adopted here that the model with the lowest log-likelihood is the best one.

Table 5. Log-Likelihood of the Models
Model                               Log-Likelihood
ARIMA(1,1,4) - Seasonal IMA(1)      -2272.78
ARIMA(1,1,4) - Seasonal AR(1)       -2451.21
VAR(1)                              -1268.54
VEC(1)                              -1268.54

Hence, based on these results, it seems that the backtesting procedure is a very important part of the modeling process, since the log-likelihood estimate does not provide all the information necessary to determine which model is the best.
When analyzing the models' fitted values against the observed values (Obs in Figure 2), it is possible to verify that the Seasonal ARIMAX models converge more slowly towards the observed values than the vector-based models. This can be explained by the fact that the univariate seasonal models rely on past observed values to forecast the seasonal factors. The vector-based models (Figure 1), on the other hand, rely on deterministic seasonal dummy variables. Thus, even when past values are not yet available to the autoregressive part, the dummies already provide estimates of the seasonal fluctuations.
Another interesting point is that the multivariate models, despite having a larger number of variables, had a poorer performance within the sample; so, basically, the models that were actually overfitted were the univariate ones.
Finally, the most important feature of vector models in terms of policy analysis is shown here: the impulse-response structure that can be retrieved from the system, following Sims (1980).
This method is based on the decomposition of the covariance matrix using a Cholesky algorithm, to obtain what is called a Structural VAR/VEC. Consider a VAR with contemporaneous relationships, as in the expression below:

$$\Phi_0 Y_t = \Phi_1 Y_{t-1} + \Phi_2 Y_{t-2} + \dots + \Phi_n Y_{t-n} + K + a_t \qquad (12)$$

and multiplying the whole equation by the inverse of $\Phi_0$ one gets a VAR as in Equation (1), which can be estimated using the traditional OLS algorithm. Therefore, after decomposing the covariance matrix, it is possible to impose causal restrictions in order to retrieve the matrix of contemporaneous relationships.
So, for example, if the economy (GDP) is expected to cause the traffic on the road, one may infer how the dynamics between the time series behave through the impulse response of traffic to GDP.
This is a powerful tool that enables the researcher to verify dynamic effects instead of just applying a first-order (linear) approximation, as in the traditional simple linear regression over the logarithms of the variables (a procedure known as elasticity calculation).
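The mechanics of an orthogonalized impulse response can be sketched for a two-variable VAR(1): the residual covariance matrix is factored by a Cholesky decomposition, and the resulting shocks are propagated through the coefficient matrix. The coefficient matrix, the covariance matrix and the function names below are hypothetical; they are not the estimates of this paper, and the variable ordering (shocks to the first variable affect the second contemporaneously, but not the reverse) is an assumption implied by the lower-triangular factor.

```python
import math


def cholesky_2x2(s):
    """Lower-triangular P such that P P' = s, for a symmetric
    positive-definite 2x2 matrix s."""
    p11 = math.sqrt(s[0][0])
    p21 = s[1][0] / p11
    p22 = math.sqrt(s[1][1] - p21 ** 2)
    return [[p11, 0.0], [p21, p22]]


def mat_mul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def impulse_responses(phi, sigma, horizons):
    """Orthogonalized responses of a VAR(1): phi^h P for h = 0..horizons.

    Entry [h][i][j] is the response of variable i, h periods after a
    one-standard-deviation orthogonal shock to variable j.
    """
    theta = cholesky_2x2(sigma)
    responses = [theta]
    for _ in range(horizons):
        theta = mat_mul(phi, theta)
        responses.append(theta)
    return responses


# Hypothetical VAR(1) coefficients and residual covariance matrix.
phi = [[0.5, 0.2],
       [0.1, 0.4]]
sigma = [[4.0, 1.0],
         [1.0, 2.0]]

irf = impulse_responses(phi, sigma, 5)
for h, theta in enumerate(irf):
    print(h, theta[1][0])   # response of variable 2 to a shock in variable 1
```

Plotting one entry of these matrices against the horizon gives a curve of the kind shown in an impulse-response figure; for a stable system the responses die out as the horizon grows.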


Figure 3. Impulse-Response of Trafego (traffic) to a Shock in PIB (GDP)

As can be seen in Figure 3, a standard shock (a unit shock in terms of the covariance matrix retrieved from the VAR/VEC models) in the evolution of GDP causes an increase of 50 thousand vehicles after 4 months, and the response stabilizes after 5 months.
6 Conclusion
In this paper it was shown that it is possible to build a multivariable autoregressive model to describe the traffic data of one of the most important toll roads in Brazil, with significant seasonal effects and a large number of vehicles.
Four kinds of models were estimated: a VAR, a VEC and two kinds of Seasonal ARIMAX models. Furthermore, methodologies for testing cointegration between the variables, testing for unit roots and obtaining the optimal lag structure were discussed.
It was observed that both multivariate methodologies produced very similar forecasts, as did both univariate models. However, the two kinds of models differed significantly from each other in both the long run and the short run, the first kind (multivariate) being the best, producing reasonable forecasts with a mean absolute error of about 3%.
It is also important to note that this paper shows the usefulness of impulse-response analysis, which seems to be far more reasonable for policy analysis than the traditional elasticity measures derived from simple linear regression models.
As a perspective for future work, it is suggested to expand this analysis to other large road systems in Brazil and other countries, to continue updating the existing database and to verify possible structural and parameter changes in these models. It is also suggested to include in this comparison the performance of NARX models (nonlinear autoregressive models with exogenous inputs) and standard neural-network-based models, either using only autoregressive components of the dependent variable or evaluating the inclusion of other candidate independent variables (e.g. GDP).
References
ASHLEY, R.A., GRANGER, C.W.J. (1979). Time series analysis of residuals from St. Louis model. In Journal of Macroeconomics, vol. 1, 373-394.
BAIN, R. (2009). Error and optimism bias in toll road traffic forecasts. Working Paper, RePEc.
BOLSHINSKY, E., FREIDMAN, R. (2012). Traffic flow forecast survey. Technical report, Technion - Israel Institute of Technology.
BOX, G.E.P., JENKINS, G.M. (1976). Time Series Analysis: Forecasting and Control, 1st Edition, San Francisco, Holden-Day.
DICKEY, D.A., FULLER, W.A. (1979). Distribution of the estimators for autoregressive time series with a unit root. In Journal of the American Statistical Association, vol. 74, 427-431.
ENGLE, R.F., GRANGER, C.W.J. (1987). Cointegration and error correction: Representation, estimation and testing. In Econometrica, vol. 55, 251-276.
FILLATRE, L., MARAKOV, D., VATON, S. (2005). Forecasting Seasonal Traffic Flows. Workshop EuroNGI, Paris, December.
GRANGER, C.W.J., NEWBOLD, P. (1974). Spurious Regressions in Econometrics. In Journal of Econometrics, vol. 2, 111-120.
GREENE, W.H. (2002). Econometric Analysis, 5th Edition, Upper Saddle River, New Jersey, Prentice Hall.
GUJARATI, D.N., PORTER, D.C. (2011). Econometria Básica, Editora Bookman, São Paulo.
HAMILTON, J.D. (1994). Time Series Analysis, 1st Edition, Princeton, New Jersey, Princeton University Press.
HENDRY, D.F., JUSELIUS, K. (2000). Explaining Cointegration Analysis: Part 1. In The Energy Journal, International Association for Energy Economics, vol. 21, no. 1, 1-42.
HENDRY, D.F., JUSELIUS, K. (2001). Explaining Cointegration Analysis: Part 2. In The Energy Journal, International Association for Energy Economics, vol. 22, no. 1, 75-120.
IPEADATA. Available at http://www.ipeadata.gov.br, accessed on 01/11/2013.
JOHANSEN, S. (1988). Statistical analysis of cointegration vectors. In Journal of Economic Dynamics and Control, vol. 12, 231-254.
LÜTKEPOHL, H. (2004). Applied Time Series Econometrics, 1st Edition, New York, Cambridge University Press.
LÜTKEPOHL, H. (1991). Introduction to Multiple Time Series Analysis, Heidelberg, Springer Verlag.
MORETTIN, P.A. (2011). Econometria Financeira: Um Curso em Séries Temporais Financeiras, 1ª Edição, São Paulo, Editora Edgard Blücher.
MORETTIN, P.A., TOLOI, C. (2004). Análise de Séries Temporais, 1ª Edição, São Paulo, Editora Edgard Blücher.
SCHWARZ, G. (1978). Estimating the dimension of a model. In The Annals of Statistics, vol. 6, 461-464.
SIMS, C. (1980). Macroeconomics and Reality. In Econometrica, vol. 48, no. 1, 1-48.
