arXiv:2103.01992v2 [cs.LG] 5 Mar 2021

Improving Neural Networks for Time-Series Forecasting using Data Augmentation and AutoML
Indrajeet Y. Javeri, Mohammadhossein Toutiaee, Ismailcem B. Arpinar, and John A. Miller
Department of Computer Science
University of Georgia, Athens, Georgia
Email: {iyj83163, mt31429, budak, jamill}@uga.edu
Tom W. Miller
Department of Economics, Finance & Quantitative Analysis
Kennesaw State University, Kennesaw, Georgia
Email: tmiller@kennesaw.edu

Abstract—Statistical methods such as the Box-Jenkins method for time-series forecasting have been prominent since their development in 1970. Many researchers rely on such models as they can be efficiently estimated and also provide interpretability. However, advances in machine learning research indicate that neural networks can be powerful data modeling techniques, as they can give higher accuracy for a plethora of learning problems and datasets. In the past, they have been tried on time-series forecasting as well, but their overall results have not been significantly better than the statistical models, especially for intermediate length time-series data. Their modeling capacities are limited in cases where enough data may not be available to estimate the large number of parameters that these non-linear models require. This paper presents an easy to implement data augmentation method to significantly improve the performance of such networks. Our method, Augmented-Neural-Network, which involves using forecasts from statistical models, can help unlock the power of neural networks on intermediate length time-series and produces competitive results. It shows that data augmentation, when paired with Automated Machine Learning techniques such as Neural Architecture Search, can help to find the best neural architecture for a given time-series. The combination of these techniques yields significant enhancement for two configurations of our method on a COVID-19 dataset, improving forecasting accuracy by 19.90% and 11.43%, respectively, over the neural networks that do not use augmented data.

Index Terms—statistical models, time-series forecasting, neural networks, data augmentation, AutoML, COVID-19

I. INTRODUCTION

In recent years, neural networks have been used extensively in time-series forecasting to capture the nonlinear information that traditional linear statistical models cannot fully capture, and where nonlinear regression models may be challenging to develop. Earlier models also tried combining neural networks with traditional models, where linear components of the time-series data were captured using the statistical models and the neural networks were then trained to capture the leftover non-linearities. While neural networks were not competitive in many earlier time-series forecasting competitions, an automated neural network model was judged the 9th best model in the M3 time-series forecasting competition [1] held in 2000. More recently, several neural network models have been a part of the M4 in 2018 and M5 in 2020 time-series forecasting competitions [2], [3] and have been competitive and, in some cases, better than the traditional statistical models.

A major challenge with the neural network-based models arises when we are looking at scenarios where less training data is available, such that the networks are not able to fully learn the information from the limited number of samples. Short and intermediate length time-series forecasting is one such area where neural networks struggle to compete with other models. However, the performance of neural networks increases significantly as we get to longer length time-series, as indicated in multiple studies, e.g., [4], [5], [6].

This paper focuses on intermediate length time-series data, those with a length comparable to the M4 competition average of 1035 data points, as neural networks often dominate for longer time-series and tend to be inferior for shorter time-series. Several open datasets that record daily records concerning the COVID-19 Pandemic fall into this intermediate length category. In addition, the highly dynamic nature of pandemic data makes it challenging for time-series forecasting. Finally, the importance of having accurate forecasts for the pandemic cannot be overestimated.

The paper provides a comparative analysis between traditional statistical models and neural network models. In particular, it shows that the performance of regular neural networks is competitive but not better than statistical models. However, we show that data augmentation can significantly improve the quality of out-of-sample forecasts, making them better than the traditional models.

We also focus on building the neural network architecture through Automated Machine Learning (AutoML) by using Neural Architecture Search (NAS) to efficiently search for the best performing model architecture for the given learning task and dataset. This, combined with the data augmentation, gives the neural networks large performance boosts that help them to perform better than the statistical models.

Much of the earlier work in time-series forecasting using neural networks specifies the use of time-series features from the statistical model forecasts as inputs for the neural networks. This includes the top-performing models in the M4 competition, such as the hybrid method of exponential smoothing and recurrent neural networks [7] and FFORMA [8]. Although
these methods do not aim to increase the sample size of the training data made available to the neural networks, they aid in providing more information to the neural networks, helping them to better model the non-linear patterns within the data and thus improving the generalization performance of these models. The data augmentation approach that we use interleaves the forecasts collected from a statistical model to build a time-series that is double the length of the original time-series, providing more samples for the neural networks to train on.

Producing accurate forecasts for the COVID-19 Pandemic has been challenging due to the length of the time-series data that are available. Only now is a year's worth of data available for the United States. This paper examines the COVID-19 modeling studies that have been conducted and applies common modeling techniques from two main categories: (1) statistical models, and (2) machine learning models. The first category of models represents a robust and relatively easy to apply set of models that should be used in modeling studies (at least as a point of comparison). The second category, that of machine learning, supports more intense pattern matching that has the potential to produce more accurate forecasts. The disadvantages include the necessity for more data, tuning of hyper-parameters, architecture search, and interpretability.

Modeling studies are important for several reasons. Forecasting is important, so people and their governments know what to expect in the next few weeks. The COVID-19 Pandemic is unique in the sense that the last pandemic with this (or more) significance was the 1918 H1N1 Influenza Pandemic [9].

Modeling and forecasting studies for COVID-19 are grouped into four major approaches: a) SEIR/SIR models, b) Simulation/Agent-based models, c) Curve-fitting models, and d) Predictive models.

SEIR/SIR [10] models are common epidemiological modeling techniques that predict the pandemic course through a system of equations that divides an estimated population into different compartments: susceptible, exposed, infected, and recovered. A set of mathematical rules governs the equations as to how the assumptions about the disease process, Non-Pharmaceutical Interventions, social mixing, and public vaccinations would affect the population.

Simulation/Agent-based [11] models simulate a community where individuals represent "agents" in that community. The pandemic course spreads among the agents based upon the virus assumptions and rules about social mixing, governmental policy, and other behavioral patterns in the studied community.

Curve-fitting [12] models refer to fitting a mathematical model by considering the current state and estimating the relationships between features of the pandemic. The future of the pandemic course can be approximated fully from assumptions about the contagion, public health policies, or other limitations and uncertainties of disease spread parameters and/or experiences in other communities.

Predictive models play a key role in modern pandemic prediction. Scientists apply different statistical and machine learning algorithms for modeling the future of morbidity and mortality of the disease, and they produce state-of-the-art results even in the presence of many hidden variables.

The rest of this paper is organized as follows: Section 2 focuses on the COVID-19 datasets. Related work is reviewed in Section 3. Section 4 describes the epidemiological dynamics. Sections 5 and 6 discuss the statistical and machine learning models. The methodology used for data augmentation, AutoML, and rolling validation is discussed in Section 7. Finally, results are provided in Section 8, and conclusions and future work are given in Section 9.

II. COVID-19 DATASETS

Out of the numerous data sources available for tracking the COVID-19 pandemic in the US, we have two primary data sources which collect information from the public health authorities throughout the country. First, The Johns Hopkins Coronavirus Resource Center (CRC) is a continuously updated source of COVID-19 data, with its data repository stored at https://github.com/CSSEGISandData/COVID-19. Secondly, The COVID Tracking Project is a volunteer organization collecting COVID-19 data used by multiple organizations and research groups.

This paper utilizes the national COVID-19 dataset composed of data collected and stored at https://covidtracking.com/data. The dataset consists of COVID-19 case data for testing, hospitalization, and patient outcomes from all over the United States.

A. COVID-19 Dataset - United States

The SCALATION COVID-19 dataset https://github.com/scalation/data/tree/master/COVID contains data about COVID-19 cases, hospitalizations, deaths, and several more columns in the United States. This dataset is in a CSV file with 391 rows and 18 columns.

This study focuses on the incident deaths or daily death increase column, using it as a time-series for the neural networks and statistical models.

The time-series includes daily death counts in the United States starting from January 13, 2020 to February 6, 2021. We eliminate the first 44 day records and feed our models time-series data starting February 26, 2020, when the first deaths due to COVID-19 in the United States were recorded. After eliminating the first 44 days, the new size of the time-series is 347 days. Ending the dataset on February 6, 2021 has the modeling advantage that vaccination effects may be ignored, but after this time, models should take vaccinations under consideration. Unfortunately, there will initially be limited data available on vaccinations.

B. Data Preprocessing

Data is preprocessed by smoothing the outliers that lie 3.5 standard deviations away from the rolling mean. The rolling mean is the 6 time point local average, such that we consider 3 time-points before and after the current point. Observations within 3.5 standard deviations are left unchanged, while the observations which are considered as outliers are replaced by Kalman Smoothing using a Local Level Model [13].
C. Optimum Lag Selection

In time-series modeling, it is common to use lagged features as inputs to the models. A lag value is the target value from one of its previous time points. For example, the k-th lag value of y_t will be y_{t-k}. We need to define models such that they look at p previous lags while estimating their parameters; they would also require a p valued vector sequence to make forecasts once the parameters are estimated from the training data.

The choice of p needs to be determined before taking any model under consideration. Depending on the characteristics and dynamics of the model as well as the time-series, we choose a particular value for p. The choice of p largely impacts the performance of the models on forecasts, as it can control the modeling capabilities of the model. We determine the optimal number of lags to be in the range of 18 to 21 using the Auto-correlation function (ACF), as shown in Figure 1.

Fig. 1: ACF Plot of time-series
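The ACF-based lag selection can be reproduced in a few lines; the sketch below is illustrative only (the file and column names are assumptions, not the paper's code).

```python
# Illustrative sketch of ACF-based lag selection (assumed file/column names;
# the paper reports useful lags in the 18-21 range from a plot like Figure 1).
import pandas as pd
from statsmodels.tsa.stattools import acf

deaths = pd.read_csv("covid_us.csv")["deathIncrease"]  # hypothetical input
rho = acf(deaths, nlags=30)
# Inspect which lags retain substantial autocorrelation.
for k, r in enumerate(rho):
    if r > 0.5:
        print(f"lag {k}: acf = {r:.2f}")
```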
III. RELATED WORK

In time-series forecasting, neural networks have been competing with many statistical models for many years. The M3 competition in the year 2000 saw a single neural network architecture being compared with 23 other statistical models [1]. The Automated ANN model presented in this competition [14] first demonstrated the potential of using an automated procedure for the selection of neural network architecture, paving the way for AutoML techniques such as NAS to be used in the context of time-series forecasting. However, in the case of shorter length time-series, the automatic architecture selection, even though competitive, was not better than the other statistical models considered.

In the continuation of previous competitions, the M4 time-series competition, which consisted of many short and intermediate length time-series, saw a greater number of neural network-based models, some of which were able to perform better than statistical models. One of the major findings of this competition was the "hybrid" approach, which included both statistical and machine learning features [2], [15]. The hybrid method by Slawek Smyl [7] used a Dynamic Computational Graph Neural Network system to combine Exponential Smoothing (statistical model) with Long Short-Term Memory (LSTM) networks to give a single hybrid and powerful forecasting method that was the winner of the M4 competition, producing forecasts which were nearly 10% better (in sMAPE) than the combination benchmarks considered in the competition. The second most accurate method, FFORMA [8], also involved a combination of 7 statistical methods and 1 ML method, where the weights for averaging/combining such methods were also produced by an ML algorithm. This encouraged the time-series research community to develop techniques that use a combination of machine learning models with time-series features, as indicated in [2]. In light of these previous works on intermediate length time-series, we explore the performance of neural networks, compare them with statistical benchmarks, and empirically show that using data augmentation can help boost the performance of such neural network models.

Beyond Fully Connected Neural Networks, many Recurrent Neural Network (RNN) based frameworks have been proposed for intermediate length time-series forecasting, as indicated in [16], [17]. More recent studies such as [18] show the application of RNNs on several short and intermediate length time-series, where the results are better than statistical models for some datasets, while the RNNs perform worse than ARIMA and ETS models on diverse datasets such as the M4, which consists of 48,000 time-series.

Convolutional Neural Networks (CNN) have also been applied, as in [19], which uses dilated convolutions that can provide broader access to historical values. They claimed that the CNN performs on par with recurrent-type networks, whilst requiring significantly fewer parameters and computation to achieve comparable performance.

With the advent of "Attention Mechanisms" in recent years, these techniques allow the network to directly concentrate on important time points in the past. Application of such networks to longer length time-series shows improvements over RNN-based competitors, which has been reflected in [20], [21], [22]. These types of networks often employ an encoder to summarize historical information and a decoder to integrate it with known future values [22], [20]. However, there has been little evidence for the performance of such networks on shorter and intermediate length time-series data.

Although advanced neural network architectures have achieved state-of-the-art results in the image and text domains, a limited number of studies are available around intermediate length time-series analysis, because such techniques require a plethora of input data for training. Our work adds to this growing body of machine learning research with a novel approach to framing, evaluating, and comparing intermediate length time-series forecasts for the COVID-19 prediction task; in comparison with other powerful techniques, we show that our newly proposed technique (AUG-NN) can indeed outperform other competitors.

A. Modeling Studies for COVID-19

Several papers include forecasts for COVID-19 deaths, positive cases, and hospitalizations. The authors in [23] forecasted
the spread of the disease in 10 largely affected countries including the United States. They applied the ARIMA and Prophet time-series forecasting models to the number of positive cases and deaths, and they showed that ARIMA performed better than FBProphet [24] on a scale of different error metrics such as MAPE and RMSE. Another similar study that applied an ARIMA is found in [25], where they extensively tested and tuned different parameters to minimize the prediction error, and they discussed that ARIMA models are appropriate for generating forecasts for the pandemic crisis.

The predictive approach by machine learning was studied by [26], where they applied the GWO-LSTM model to forecast the daily new cases, cumulative cases, and deaths for India, the USA, and the UK. The network parameters of the LSTM network are optimized by the Grey Wolf Optimizer (GWO), and they claimed that their hybrid architecture obtained a better performance compared to the baseline (ARIMA). However, they rely on access to online big data from Google Trends to evaluate other real-world conditions, in order to feed more information to the LSTMs.

IV. EPIDEMIOLOGICAL DYNAMICS

The traditional Susceptible-Infected-Removed (SIR) model proposed by Kermack and McKendrick (1927) [27], [28], [29] is a system of three ordinary differential equations that are non-linearly interrelated. The newer SEIR models [35] are made more realistic by the addition of an Exposed state, which can capture individuals who have been exposed but are not yet infectious. Also, there are several extended versions of SEIR models. This paper applies a minor extension that includes states that match the available COVID-19 datasets. The extended version has two extra compartments (Hospitalized and Deaths) in the equations, yielding a SEIHRD model.

A. State Transition Diagram

The state transition diagram is shown in Figure 2, with forward transitions S -> E -> I -> H (rates q_1, q_2, q_3) and transitions from states I and H into states R and D (rates q_4 through q_7). If there is a significant reinfection rate, an edge can be added from state R to state S.

Fig. 2: State Transition Diagram for SEIHRD Models

B. Difference Equations

The difference equations indicate how the number of individuals in each compartment (state) changes with each discrete time change (each day). See [36] for a derivation of these difference equations from a set of ordinary differential equations.

S_t = S_{t-1} + d_t - e_t                      (Susceptible)
E_t = E_{t-1} + e_t - i_t                      (Exposed)
I_t = I_{t-1} + i_t - h_t - p_r r_t            (Infected only)
H_t = H_{t-1} + h_t - (1 - p_r) r_t - d_t      (Hospitalized)
R_t = R_{t-1} + r_t                            (Recovered)
D_t = D_{t-1} + d_t                            (Died)

where p_r is the probability a recovered individual was not hospitalized, upper case letters indicate the number of individuals in the state for the given t (day), and lower case letters indicate the individuals added to the state on the given t (day).

e_t = q_1 S_{t-1} (I_{t-1} + H_{t-1}) / N
i_t = q_2 E_{t-1}
h_t = q_3 I_{t-1}
r_t = q_4 I_{t-1} + q_5 H_{t-1}
d_t = q_6 I_{t-1} + q_7 H_{t-1}

The relationships between p_r, q_4 and q_5 are as follows:

r_t = q_4 I_{t-1} / p_r = q_5 H_{t-1} / (1 - p_r)

Substituting for the daily increases (lower case letters), the state equations become the following:

S_t = S_{t-1} + q_6 H_{t-1} - q_1 S_{t-1} (I_{t-1} + H_{t-1}) / N
E_t = E_{t-1} + q_1 S_{t-1} (I_{t-1} + H_{t-1}) / N - q_2 E_{t-1}
I_t = I_{t-1} + q_2 E_{t-1} - (q_3 + q_4) I_{t-1}
H_t = H_{t-1} + q_3 I_{t-1} - (q_5 + q_6) H_{t-1}
R_t = R_{t-1} + q_4 I_{t-1} + q_5 H_{t-1}
D_t = D_{t-1} + q_6 H_{t-1}
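As an illustration of how these difference equations can be iterated, the following is a minimal sketch using the substituted state equations above; the rate constants q_1..q_6, the population, and the initial compartments are assumed values for demonstration, not estimates from this paper.

```python
# Minimal sketch iterating the substituted SEIHRD state equations above.
# The rates q[1]..q[6], population N, and initial state are illustrative
# assumptions, not fitted values from this paper.

def seihrd_step(S, E, I, H, R, D, q, N):
    new_exposed = q[1] * S * (I + H) / N
    S2 = S + q[6] * H - new_exposed
    E2 = E + new_exposed - q[2] * E
    I2 = I + q[2] * E - (q[3] + q[4]) * I
    H2 = H + q[3] * I - (q[5] + q[6]) * H
    R2 = R + q[4] * I + q[5] * H
    D2 = D + q[6] * H
    return S2, E2, I2, H2, R2, D2

N = 330_000_000                                   # rough U.S. population
q = {1: 0.30, 2: 0.20, 3: 0.05, 4: 0.10, 5: 0.08, 6: 0.01}
state = (N - 100.0, 50.0, 50.0, 0.0, 0.0, 0.0)    # (S, E, I, H, R, D)
for day in range(347):                            # length of the study series
    state = seihrd_step(*state, q, N)
print(f"simulated cumulative deaths: {state[5]:,.0f}")
```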
V. STATISTICAL TIME-SERIES MODELS

The statistical time-series models considered in this work do not distinguish between state and measurement variables. They also consider longer lags beyond, e.g., d_{t-1}. In particular, four time-series models are applied for forecasting the number of deaths per day d_t with a horizon of 1 to 14 days. Random Walk (RW) and Auto-Regressive (AR) models are the simplest of the univariate time-series forecasting techniques that we applied in this work. In an effort to capture previous shocks, AR is coupled with a Moving Average (MA) component. This extra component, along with the Integrated (I) component in ARIMA, which specifies the level of differencing for making the time-series stationary, can facilitate the discovery of more complex patterns in data. Our preliminary experiment found ARIMA useful, since we found ARIMA(3, 1, 2) to be the best global model with the lowest Akaike Information Criterion (AIC) when evaluated on the training set. Later we describe how
the parameters of ARIMA can be dynamically calibrated for different horizons. Additionally, we expect to improve the prediction accuracy by adding the Seasonality component in ARIMA, as Figure 1 testifies that such effects exist in the data at every 7 lags. We are interested in making comparisons between our proposed approach and current statistical methods through running multiple experiments and showing that the AUG-NN technique can indeed produce the best results. A brief overview of the statistical methods mentioned above is provided in the following:

1) The Random Walk (RW) model states that the future value y_t is the previous value y_{t-1} disturbed by white noise ε_t ∼ N(0, Iσ^2), and the forecast is just y_{t-1}:

y_t = y_{t-1} + ε_t

2) The Auto-Regressive (AR) model of order p, AR(p), has p autoregressive terms (lags [y_{t-1}, ..., y_{t-p}]):

y_t = δ + φ_1 y_{t-1} + ... + φ_p y_{t-p} + ε_t

where δ is the offset and φ_j is the parameter multiplying the j-th lag, y_{t-j}.

3) An Auto-Regressive Integrated Moving Average model, ARIMA(p, d, q), which is a generalization of the Auto-Regressive Moving Average (ARMA) class, provides a better prediction of future points in a series where the data show evidence of non-stationarity in the mean. The new hyper-parameter d is the number of times the time-series is differenced. When there is a trend in the data (e.g., the values or levels are increasing over time), an ARIMA(p, 1, q) may be effective. This model will take a first difference of the values in the time-series, namely:

y'_t = y_t - y_{t-1}

4) The Seasonal, Auto-Regressive, Integrated, Moving-Average SARIMA(p, d, q)×(P, D, Q)_s model has p autoregressive terms (lags [y_{t-1}, ..., y_{t-p}]), d differences, and q moving average terms (based on errors/shocks). In addition, it has P seasonal autoregressive terms (lags [y_{t-s}, ..., y_{t-Ps}]), D seasonal differences, and Q seasonal moving average terms. The period (or season) is given by s.
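For concreteness, the SARIMA configuration used later for data augmentation can be fit as sketched below; this is an illustrative statsmodels-based sketch (with an assumed input file), not the authors' implementation.

```python
# Sketch of fitting the SARIMA(1,0,0)x(3,1,1)_7 configuration used later for
# augmentation (statsmodels illustration; file/column names are assumed).
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

deaths = pd.read_csv("covid_us.csv")["deathIncrease"]  # hypothetical input
model = SARIMAX(deaths, order=(1, 0, 0), seasonal_order=(3, 1, 1, 7))
result = model.fit(disp=False)
print(result.aic)                   # AIC, as used for model selection above
print(result.forecast(steps=14))    # 14-day-ahead out-of-sample forecasts
```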
VI. MACHINE LEARNING MODELS

A. Neural Networks

Neural Networks allow us to capture the non-linear patterns in the data through (1) the use of multiple hidden layers, and (2) several neurons with non-linear activation functions in each hidden layer. With enough hidden layers and neurons to capture the non-linearity, and enough training samples to estimate the parameters through backpropagation, neural network-based models serve as powerful data modeling techniques in the machine learning domain. However, for pure neural network or machine learning-based models, the main disadvantage remains the inability to train a model containing a large number of parameters on intermediate length time-series. For COVID-19 prediction with a year's worth of daily data, the length of the time-series is still shorter than ideal for training neural networks. Therefore, in this paper, we only consider some simpler network architectures, rely on AutoML to find better models, and utilize data augmentation to compensate.

A simple type of Neural Network generalizes an AR(p) model. We use a three-layer (one hidden) neural network, where the input layer has a node for each of the p lags and the current time t. The hidden layer has n_h nodes while the output layer has n_o.

x_t = [y_{t-p}, ..., y_{t-1}, t]
y_t = W_2^T f(W_1^T x_t + b_1) + b_2 + ε_t

where W_1 ∈ R^{(p+1)×n_h} and W_2 ∈ R^{n_h×n_o} are the weight matrices, and b_1 ∈ R^{n_h} and b_2 ∈ R^{n_o} are the bias vectors. There is only an activation function (vectorized f) for the hidden layer, as the last layer uses the Identity activation function. For the hidden layer nodes, during preliminary testing, we tried multiple activation functions such as TanH, Sigmoid, and the Exponential Linear Unit (eLU). However, we achieve the best performance with the Rectified Linear Unit (ReLU) activation function, as specified by the NAS. The number of hidden neurons n_h is 1024, and the number of output layer neurons n_o is 14 for the regular time-series and 28 for the augmented time-series. While collecting out-of-sample forecasts for the augmented series, we collect only the even horizon forecast values, because the odd horizons represent the half-a-day forecasts when compared to the original time-series.
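A minimal Keras sketch of this three-layer network is shown below, assuming p = 21 lags and the 14-horizon output head; it mirrors the architecture described above, but it is not the exact NAS-produced model.

```python
# Minimal sketch of the three-layer network described above (assumed p = 21;
# the actual architecture in the paper comes out of the NAS, so details vary).
import tensorflow as tf

p = 21                    # number of lags (paper's optimum range is 18-21)
n_h, n_o = 1024, 14       # hidden width and horizons (28 when augmented)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(p + 1,)),                   # p lags + current time t
    tf.keras.layers.Dense(n_h, activation="relu"),    # hidden layer, ReLU
    tf.keras.layers.Dense(n_o, activation="linear"),  # identity output layer
])
model.compile(optimizer="adam",
              loss="mean_absolute_percentage_error")  # MAPE loss, ADAM
```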
As neural networks are supervised learning models, they require training and testing data in the form of an input matrix of size m × n, where m is the number of samples and n is the number of features. The output/response may be an m dimensional vector if we want to predict a single output per sample, or an m × k dimensional matrix, where k is the number of values we want to predict for each sample in the training/test set. In this study, we consider the direct multi-horizon forecasting method [37] for the neural networks and thus have k = 14 or k = 28 to forecast all horizons directly. We have n = (p + 1) as we consider p lags and the current time t.

In order to generate the m × (p+1) dimensional input matrix and the m × k dimensional output matrix, we use a sliding window approach [38] to convert the time-series into a supervised learning problem. Starting from the first value, we take p consecutive values in the time-series, and along with the current time t, this forms a (p + 1) dimensional vector which is a single sample in the input matrix. Next, starting from the (p + 1)-th timepoint, we take the next k values, which form the first k dimensional vector in the output matrix. This approach is repeated until the end of the time-series is reached, with a stride of 1 timepoint between samples.
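The sliding window construction can be sketched as follows; this is an assumed NumPy illustration of the scheme described above, not the paper's code.

```python
# Sketch of the sliding window construction described above: each sample is
# p lags plus the current time index, and the target is the next k values.
import numpy as np

def make_windows(y: np.ndarray, p: int, k: int):
    X, Y = [], []
    for t in range(p, len(y) - k + 1):
        X.append(np.append(y[t - p:t], t))  # p lags + current time t
        Y.append(y[t:t + k])                # next k values (horizons 1..k)
    return np.array(X), np.array(Y)

y = np.arange(40, dtype=float)              # stand-in series for illustration
X, Y = make_windows(y, p=21, k=14)
print(X.shape, Y.shape)                     # (6, 22) (6, 14)
```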
The loss function used is the mean absolute percentage error (MAPE), and the optimizer used is Adaptive Moment Estimation (ADAM) [39].

Several studies have demonstrated the viability of deep hybrid modeling over a single model [40], [41]. This type of architecture enables the training process to benefit from the
advantages of both models, while typically achieving results better than the individual CNN and LSTM. A CNN can be applied to time-series data since neighboring information is supposedly relevant for the analysis of time-stamped data. However, this relevancy is short-term due to the constraint imposed by the size of the convolutional kernels. On the contrary, an LSTM can store and output information during the training process by capturing the long-term time dependency of the input data. The ConvLSTM architecture replaces matrix multiplication with a convolution operation at each gate in the LSTM cell, capturing informative features by convolution operations while providing the LSTM cell with sequenced data. By stacking multiple ConvLSTM layers and forming an encoding-forecasting structure, it is possible to build an end-to-end trainable model for spatio-temporal forecasting.

VII. METHODOLOGY

We use forecasts from a statistical model as the basis for our data augmentation approach for the neural network models.
A. Data Augmentation

Neural Networks learn the output estimation function from the training data iteratively through the process of backpropagation. Typically, this process is carried out by one of the gradient-based optimization algorithms, such as gradient descent or ADAM, and it requires a large number of data samples given the large number of trainable parameters in these networks. This puts us in a challenging spot when the number of trainable parameters/weights of the neural networks far exceeds the number of data samples that are available to us for training. With time-series data, this is a common scenario and a point where we have to think about data augmentation approaches to aid the training of such networks. One of the major findings from the results of the M4 competition [2] was the superiority of a hybrid approach that utilizes both statistical and ML features [42]. Based on these findings, we combine the first horizon forecasts generated by the SARIMA(1, 0, 0)×(3, 1, 1)_7 model to build an augmented version of the time-series, such that these forecasts serve as a basis to calculate the intermediate or half-a-day time point values for our daily COVID-19 data. In order to generate these intermediate time points, the next day's forecasted value is averaged with the previous day's true data value. Thus, this interleaved time-series is double the size of the original time-series, allowing for better training of the neural networks. Apart from data augmentation, this also allows statistical feature information to be fed into the neural networks in order to facilitate better generalization performance, thus improving their performance on out-of-sample forecasts.

For generating the forecasts from the SARIMA(1, 0, 0)×(3, 1, 1)_7 model, it is trained on the entire time-series, and we collect in-sample forecasts for half the size of the original time-series. The latter half of the forecasts are collected as out-of-sample forecasts by retraining the model only on the first half of the time-series.
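The interleaving itself is a few lines; the sketch below assumes a NumPy array of daily values and a same-length array of one-step-ahead SARIMA forecasts, collected as described above.

```python
# Sketch of the augmentation: half-a-day points are the average of the next
# day's SARIMA one-step forecast and the previous day's true value, giving an
# interleaved series roughly twice the original length (inputs are assumed).
import numpy as np

def augment(y: np.ndarray, yhat_next: np.ndarray) -> np.ndarray:
    """y[t] is the true value for day t; yhat_next[t] is the one-step-ahead
    SARIMA forecast of day t+1. Returns [y0, m0, y1, m1, ...]."""
    midpoints = (y[:-1] + yhat_next[:-1]) / 2.0
    out = np.empty(2 * len(y) - 1)
    out[0::2] = y          # original daily values at even positions
    out[1::2] = midpoints  # half-a-day values at odd positions
    return out

y = np.array([100.0, 120.0, 90.0])
yhat_next = np.array([115.0, 95.0, 105.0])   # forecasts of days 1, 2, 3
print(augment(y, yhat_next))                 # [100. 107.5 120. 107.5 90.]
```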
B. AutoML

Automated machine learning (AutoML) refers to the process of automating the application of machine learning to real-world problems. It has shown promise in advancing the use of machine learning on a variety of problem areas and datasets, as indicated in a detailed survey by He, Zhao, and Chu [43]. Specifically, in the area of deep learning, the concept of Neural Architecture Search (NAS) has shown promise in finding better neural network architectures tailored to the specific learning problem and dataset. Techniques such as Bayesian Optimization (BO) for hyperparameters [44] improve and automate the tuning of neural network architectures, whereas NAS tries to search for the best network architecture using BO as its base at every iteration.

The AutoML framework used is Auto-Keras [45]. NAS is carried out on the training set, where we allow the last 10% of the training set to be used as a validation set while evaluating the model performance. Repeated trials help us find a baseline architecture suitable for the task, which we further fine-tune by including the validation set in the training set to generate the out-of-sample forecasts on the test set.
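A minimal Auto-Keras search over the windowed data might look like the sketch below; this is an assumed illustration (the variable names come from the earlier sliding window sketch, and settings such as max_trials are not specified in the paper).

```python
# Sketch of a NAS run with Auto-Keras over the sliding window matrices
# (X_train, Y_train, X_test are assumed to exist, e.g. from make_windows);
# max_trials is an assumed setting, and the last 10% of the training set
# serves as validation, as described above.
import autokeras as ak

reg = ak.StructuredDataRegressor(max_trials=20, overwrite=True)
reg.fit(X_train, Y_train, validation_split=0.10)
best_model = reg.export_model()      # baseline architecture to fine-tune
forecasts = reg.predict(X_test)      # direct multi-horizon forecasts
```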
C. Rolling Validation for Multiple Horizons

Due to the dependencies between instances near each other in time, k-fold cross-validation will not work for time-series data. A simple form of rolling validation divides a dataset into an initial training set and a test set. For example, in this work, the first 60% of the samples are taken as the training set and the rest as the test set. For horizon h = 1 forecasting, the first value in the test set is forecasted based on the model produced by training on the training set. The error is the difference between the actual value in the test set and the forecasted value, which is used to calculate the symmetric mean absolute percentage error (sMAPE).

For simple, efficient models, the process of rolling forward to forecast the next value in the test set often involves retraining the model after including the first value of the test set in the training set. To maintain the same size for the training set, we remove the first value from the training set. We adjust the frequency of retraining for the statistical models such that we forecast k_t samples ahead in the test set before including them in the training set and retraining our model. In this study, the frequency of retraining k_t for the statistical family of models is set to 5 samples. The neural networks are only trained once on the training data, without any retraining, for generating the out-of-sample forecasts.
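For reference, sMAPE and the rolling scheme can be sketched as below; the sMAPE variant shown (the mean of 2|y - ŷ| / (|y| + |ŷ|), in percent) is the common M-competition form and is an assumption about the paper's exact definition.

```python
# Sketch of the evaluation loop: sMAPE (M4-style form, assumed to match the
# paper's definition) plus rolling one-step forecasts with retraining every
# k_t = 5 steps, using the earlier SARIMAX configuration as an example.
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

def smape(actual, forecast):
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    return 100.0 * np.mean(2 * np.abs(actual - forecast)
                           / (np.abs(actual) + np.abs(forecast)))

def rolling_forecast(y, split=0.6, k_t=5):
    n_train = int(len(y) * split)
    preds = []
    for start in range(n_train, len(y), k_t):
        window = y[start - n_train:start]          # fixed-size training window
        fit = SARIMAX(window, order=(1, 0, 0),
                      seasonal_order=(3, 1, 1, 7)).fit(disp=False)
        preds.extend(fit.forecast(steps=min(k_t, len(y) - start)))
    return smape(y[n_train:], preds)
```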
VIII. RESULTS

Our empirical results for the various statistical and neural network models are shown in Table I and Table II, respectively. We collect multi-horizon rolling forecasts on daily deaths for the next 2 weeks, i.e., from h = 1 through h = 14, and use sMAPE as our primary performance metric, which is one of the standard performance metrics in time-series forecasting [2], [37].

In order to obtain the best performing model for each horizon, we train and evaluate several models of each type by
changing the auto-regressive non-seasonal order (p) of such models. The (p) in Table I indicates the auto-regressive non-seasonal order for the model with the lowest sMAPE for the given horizon.

For the AR(p) model, we evaluate p = 1 through p = 14. Similarly, for the ARIMA(p, 1, 0) model, we use a fixed single difference with no MA component for all horizons, whereas p goes from 1 through 14. The SARIMA(p, 0, 0)×(3, 1, 1)_7 uses no non-seasonal differencing and no non-seasonal MA component, whereas it includes P = 3 and Q = 1 seasonal components, with 1 seasonal differencing component and a seasonal period of 7 days.

The highlighted values in each row indicate the best performing model for the given horizon, having the lowest sMAPE score amongst all the statistical models. The SARIMA model gives the best in-sample fit and performs the best for most horizons. The ARIMA model is able to be competitive on horizons 5, 6, and 7.

TABLE I: Multi-Horizon (h) Rolling Forecasts: Statistical Models (sMAPE)

Horizon     RW      AR          ARIMA       SARIMA
in-sample   30.00   21.62 (12)  19.25 (14)  17.46 (12)
h = 1       30.28   14.80 (13)  14.90 (9)   13.50 (2)
h = 2       51.44   16.30 (10)  16.50 (10)  15.10 (1)
h = 3       56.51   16.20 (11)  16.20 (9)   15.50 (4)
h = 4       55.51   17.00 (11)  16.30 (9)   15.80 (4)
h = 5       51.39   17.10 (9)   16.30 (8)   16.60 (4)
h = 6       33.01   16.80 (9)   16.20 (9)   16.30 (9)
h = 7       17.36   16.90 (9)   16.30 (11)  16.50 (4)
h = 8       31.84   21.10 (11)  19.90 (14)  18.80 (9)
h = 9       49.48   23.00 (12)  22.20 (11)  20.70 (9)
h = 10      54.45   23.50 (11)  21.70 (11)  20.70 (9)
h = 11      54.36   24.30 (9)   21.50 (10)  20.80 (9)
h = 12      49.82   24.50 (9)   21.60 (10)  21.30 (9)
h = 13      33.39   24.20 (9)   21.80 (9)   21.40 (9)
h = 14      21.40   24.90 (9)   21.80 (9)   21.30 (9)
Table II compares the performance of two configurations of neural network-based models. Configuration I refers to the feed forward neural network, and configuration II refers to the ConvLSTM network. The NN column refers to the neural networks run on the regular time-series with 347 samples, whereas AUG-NN refers to the Augmented-Neural-Network run on the data augmented time-series with 692 samples. Since the augmented time-series has a resolution of half-a-day, we collect the sMAPE for every even horizon output from h = 2 through h = 28 to give us the actual 14 day ahead forecasts.

TABLE II: Multi-Horizon (h) Rolling Forecasts: Neural Network Models (sMAPE)

Horizon     NN      AUG-NN   ConvLSTM   AUG-ConvLSTM
in-sample   11.52   11.04    14.37      10.49
h = 1       19.39   13.06    18.15      14.07
h = 2       20.74   14.59    18.48      15.96
h = 3       16.86   15.00    19.31      16.99
h = 4       19.34   14.22    20.28      18.08
h = 5       22.62   15.96    19.55      18.79
h = 6       20.16   15.70    21.13      18.01
h = 7       22.40   16.64    21.30      20.49
h = 8       23.38   18.16    25.28      21.72
h = 9       21.85   18.86    26.52      22.72
h = 10      22.92   19.81    27.89      23.56
h = 11      24.71   18.92    28.47      26.24
h = 12      22.81   19.76    27.95      26.99
h = 13      21.24   20.34    29.97      25.58
h = 14      23.44   20.87    31.54      28.78
The AUG-NN performed much better when compared to the regular neural nets, due to the increased number of training samples as well as the information fed in from the SARIMA model. The improvement in sMAPE was also seen on the ConvLSTM model.

Fig. 3: Out-Of-Sample Forecasts (h = 1)

Figure 3 shows the true values (in blue) vs. the horizon 1 out-of-sample forecasts (in orange) of the AUG-NN model. The forecasts in the initial days closely follow the true death count and are only off in the later days, where the fluctuations and trend are higher.

TABLE III: Improvement Due to Augmentation

Horizon   SARIMA   NN      AUG-NN   Improvement (%)
h = 1     13.50    19.39   13.06    32.65
h = 2     15.10    20.74   14.59    29.65
h = 3     15.50    16.86   15.00    11.03
h = 4     15.80    19.34   14.22    26.47
h = 5     16.60    22.62   15.96    29.44
h = 6     16.30    20.16   15.70    22.12
h = 7     16.50    22.40   16.64    25.71
h = 8     18.80    23.38   18.16    22.33
h = 9     20.70    21.85   18.86    13.68
h = 10    20.70    22.92   19.81    13.57
h = 11    20.80    24.71   18.92    23.43
h = 12    21.30    22.81   19.76    13.37
h = 13    21.40    21.24   20.34    4.24
h = 14    21.30    23.44   20.87    10.96

Table III shows the improvement of AUG-NN over the neural network trained on the original time-series. In terms of improvement due to augmentation, on average we see a 19.90% improvement across all horizons, with a maximum improvement of 32.65% seen for the horizon 1 forecasts. While the NN model does not perform as well as the SARIMA model, the AUG-NN model substantially outperforms the SARIMA model. Also, the ConvLSTM model saw an improvement of 11.43% on average across all horizons.

IX. CONCLUSIONS AND FUTURE WORK

In this paper, we present a data augmentation method in order to improve the quality of out-of-sample forecasts for neural networks used for time-series forecasting. Our empirical analysis shows that, in comparison to the traditional statistical models, neural networks struggle to produce better quality forecasts on intermediate length time-series forecasting, where the sample size is limited.
However, with the data augmentation paired with AutoML to search for the optimal neural network architecture, the performance of such neural networks is significantly boosted, making them better than the best performing statistical models. We also show that this performance improvement is possible not just with the feed forward networks, but also with other deep learning models such as the ConvLSTM model.

When compared with the methods using the hybrid approaches described in our related work, our method is much simpler to implement and offers relatively larger performance improvements. The simplicity of our data augmentation method makes it an easy yet powerful technique to improve the performance of neural networks on time-series data.

In the context of forecasting on intermediate length time-series, our work serves as a baseline to improve the performance of neural networks, and it encourages further research that incorporates additional deep learning architectures along with epidemiological models such as the SEIR/SIR model for more in-depth analysis and comparison.
REFERENCES

[1] S. Makridakis and M. Hibon, "The M3-competition: results, conclusions and implications," International Journal of Forecasting, 2000.
[2] S. Makridakis, E. Spiliotis, and V. Assimakopoulos, "The M4 competition: Results, findings, conclusion and way forward," International Journal of Forecasting, 2018.
[3] ——, "The M5 accuracy competition: Results, findings and conclusions," International Journal of Forecasting, 2020.
[4] H. Peng, S. U. Bobade, M. E. Cotterell, and J. A. Miller, "Forecasting traffic flow: Short term, long term, and when it rains," in International Conference on Big Data. Springer, 2018.
[5] H. Peng, N. Klepp, M. Toutiaee, I. B. Arpinar, and J. A. Miller, "Knowledge and situation-aware vehicle traffic forecasting," in 2019 IEEE International Conference on Big Data (Big Data). IEEE, 2019.
[6] Y. Qin, D. Song, H. Chen, W. Cheng, G. Jiang, and G. Cottrell, "A dual-stage attention-based recurrent neural network for time series prediction," arXiv preprint, 2017.
[7] S. Smyl, "A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting," International Journal of Forecasting, 2020.
[8] P. Montero-Manso, G. Athanasopoulos, R. J. Hyndman, and T. S. Talagala, "FFORMA: Feature-based forecast model averaging," International Journal of Forecasting, 2020.
[9] D. M. Morens, J. K. Taubenberger, H. A. Harvey, and M. J. Memoli, "The 1918 influenza pandemic: lessons for 2009 and the future," Critical Care Medicine, 2010.
[10] L. López and X. Rodo, "A modified SEIR model to predict the COVID-19 outbreak in Spain and Italy: simulating control scenarios and multi-scale epidemics," Available at SSRN 3576802, 2020.
[11] M. S. Shamil, F. Farheen, N. Ibtehaz, I. M. Khan, and M. S. Rahman, "An agent based modeling of COVID-19: Validation, analysis, and recommendations," medRxiv, 2020.
[12] Z. Zhao, K. Nehil-Puleo, and Y. Zhao, "How well can we forecast the COVID-19 pandemic with curve fitting and recurrent neural networks?"
[13] J. J. Commandeur and S. J. Koopman, An Introduction to State Space Time Series Analysis. Oxford University Press, 2007.
[14] S. D. Balkin and J. K. Ord, "Automatic neural network modeling for univariate time series," International Journal of Forecasting, 2000.
[15] S. Makridakis, E. Spiliotis, and V. Assimakopoulos, "The M4 competition: 100,000 time series and 61 forecasting methods," International Journal of Forecasting, 2020.
[16] S. S. Rangapuram, M. Seeger, J. Gasthaus, L. Stella, Y. Wang, and T. Januschowski, "Deep state space models for time series forecasting," in Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018.
[17] Y. Wang, A. Smola, D. Maddix, J. Gasthaus, D. Foster, and T. Januschowski, "Deep factors for forecasting," in International Conference on Machine Learning. PMLR.
[18] H. Hewamalage, C. Bergmeir, and K. Bandara, "Recurrent neural networks for time series forecasting: Current status and future directions," International Journal of Forecasting, 2021.
[19] A. Borovykh, S. Bohte, and C. W. Oosterlee, "Conditional time series forecasting with convolutional neural networks," arXiv preprint, 2017.
[20] C. Fan, Y. Zhang, Y. Pan, X. Li, C. Zhang, R. Yuan, D. Wu, W. Wang, J. Pei, and H. Huang, "Multi-horizon time series forecasting with temporal attention learning," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019.
[21] S. Li, X. Jin, Y. Xuan, X. Zhou, W. Chen, Y.-X. Wang, and X. Yan, "Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting," arXiv preprint, 2019.
[22] B. Lim, S. O. Arik, N. Loeff, and T. Pfister, "Temporal fusion transformers for interpretable multi-horizon time series forecasting," arXiv preprint, 2019.
[23] N. Kumar and S. Susan, "COVID-19 pandemic prediction using time series forecasting models," in 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT). IEEE, 2020, pp. 1–7.
[24] S. J. Taylor and B. Letham, "Forecasting at scale," The American Statistician, 2018.
[25] O.-D. Ilie, R.-O. Cojocariu, A. Ciobica, S.-I. Timofte, I. Mavroudis, and B. Doroftei, "Forecasting the spreading of COVID-19 across nine countries from Europe, Asia, and the American continents using the ARIMA models," Microorganisms, 2020.
[26] S. Prasanth, U. Singh, A. Kumar, V. A. Tikkiwal, and P. H. Chong, "Forecasting spread of COVID-19 using Google Trends: A hybrid GWO-deep learning approach," Chaos, Solitons & Fractals, 2020.
[27] W. O. Kermack and A. G. McKendrick, "A contribution to the mathematical theory of epidemics," Proceedings of the Royal Society of London, Series A, 1927.
[28] W. Goffman and V. Newill, "Generalization of epidemic theory," Nature.
[29] R. M. Anderson, "Discussion: the Kermack-McKendrick epidemic threshold theorem," Bulletin of Mathematical Biology, 1991.
[30] M. Toutiaee, S. Amirian, J. A. Miller, and S. Li, "Stereotype-free classification of fictitious faces," arXiv preprint arXiv:2005.02157, 2020.
[31] M. Toutiaee, A. Keshavarzi, A. Farahani, and J. A. Miller, "Video contents understanding using deep neural networks," arXiv preprint arXiv:2004.13959, 2020.
[32] M. Toutiaee, "Occupancy detection in room using sensor data," arXiv preprint arXiv:2101.03616, 2021.
[33] M. Toutiaee and J. Miller, "Gaussian function on response surface estimation," arXiv preprint, 2021.
[34] M. Toutiaee, "OLSH: Occurrence-based locality sensitive hashing."
[35] H. W. Hethcote and P. van den Driessche, "Some epidemiological models with nonlinear incidence," Journal of Mathematical Biology.
[36] J. A. Miller, "Introduction to computational data science using ScalaTion," 2020.
[37] S. B. Taieb, R. J. Hyndman et al., Recursive and Direct Multi-Step Forecasting: The Best of Both Worlds. Citeseer, 2012.
[38] T. Edwards, D. Tansley, R. Frank, and N. Davey, "Traffic trends analysis using neural networks," in Proceedings of the International Workshop on Applications of Neural Networks to Telecommunications, 1997.
[39] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[40] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, "Convolutional LSTM network: A machine learning approach for precipitation nowcasting," Advances in Neural Information Processing Systems, 2015.
[41] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," arXiv preprint, 2014.
[42] S. Makridakis, E. Spiliotis, and V. Assimakopoulos, "The M4 competition: 100,000 time series and 61 forecasting methods," International Journal of Forecasting, 2020.
[43] X. He, K. Zhao, and X. Chu, "AutoML: A survey of the state-of-the-art," Knowledge-Based Systems, 2021.
[44] J. Wu, X.-Y. Chen, H. Zhang, L.-D. Xiong, H. Lei, and S.-H. Deng, "Hyperparameter optimization for machine learning models based on Bayesian optimization," Journal of Electronic Science and Technology.
[45] H. Jin, Q. Song, and X. Hu, "Auto-Keras: An efficient neural architecture search system," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM.
[46] G. E. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung, Time Series Analysis: Forecasting and Control. John Wiley & Sons, 2015.