
Improving Deep Learning Forecast Using Data Augmentation

Sasan Barak1∗ , Ehsan Mirafzali2 , Mohammad Joshaghani3


1. Department of Decision Analytics and Risk, Southampton Business School, Southampton, UK
2. Faculty of Computer Engineering, University of Isfahan, Isfahan, Iran
3. Faculty of Economics, Management and Administrative Sciences, Semnan University, Semnan, Iran

Abstract

The promising results of deep learning in data analytic competitions and real-world applications have generated numerous new time series analysis and forecasting methods. However, in most cases, deep forecasting methods require adequate amounts of time series data for learning the essential characteristics of the time series. Therefore, in this study, we propose a novel data augmentation-based deep forecasting framework that enhances the forecasting accuracy on different time series data. We use the Variational Autoencoder (VAE) as a deep generative model for time series data augmentation and create a Python library named AugmentTS for time series forecasting using augmentation. The VAE learns a robust representation of the data from an encoded latent space of the time series using hierarchical neural network layers. A hybrid bi-directional Long short-term memory-Convolutional neural network (biLSTM-Conv) is trained on the augmented data, and then the acquired knowledge is transferred to the original dataset. In our evaluation of real-world time series datasets, we show that the proposed method can significantly improve the accuracy of basic deep forecasting models. We also provide empirical evidence of the approach's efficacy against widely accepted univariate forecasting methods and discuss the advantages and limitations of the proposed approach.

Keywords: Forecasting, Time series augmentation, Variational autoencoder, LSTM-convolutional neural network, Transfer learning.



Email address: s.barak@soton.ac.uk (Sasan Barak1∗ )

Preprint submitted to Journal of LATEX Templates June 17, 2023

1. Introduction
It is essential to make accurate forecasts in industry and businesses as this drives efficient
decision-making regarding future goals. As the modern economy and technology have devel-
oped rapidly, techniques in business forecasting have increased, and more advanced methods

using deep learning provide breakthrough accuracy and have gained popularity. The applications of deep
learning methods to calculate robust forecasts show promising results in many forecasting com-

petitions, e.g., M5 and M4 competitions (Makridakis et al. (2020b); Makridakis et al. (2020a)).
They also demonstrate remarkable development in other time series analysis tasks such as time
series anomaly detection (Chalapathy et al. (2020)), and time series classification (Fawaz et al.

(2019)).
Because deep learning methods are inherently data-ravenous and require vast numbers of
model parameters to be estimated, it is vital to ensure a sufficient amount of related time
series when building robust methods. In the absence of adequate amounts of time series

data, deep learning methods cannot reach their full potential in accuracy, and they are not
capable of learning the critical characteristics of the time series. In such circumstances, the
data augmentation methods address the insufficient data issue by generating synthetic data to
increase the number of observations available for model training.
Data augmentation methods are used to generate new datasets artificially. They increase
the sample size in use and thereby empower the model to learn various aspects of the data
characteristics better. Data augmentation attempts to raise the generalisation ability of the
trained models by reducing overfitting and expanding their decision boundary (Shorten &
Khoshgoftaar (2019)). The need for generalisation is particularly substantial for real-world
data, and augmentation can help networks to be trained on small datasets (Olson et al. (2018))
or datasets with imbalanced classes (Hasibi et al. (2019)). In the time series studies, data
augmentation improves models in forecasting, classification, and fraud detection (Bandara
et al. (2021); Fons et al. (2021)). However, time series augmentation can be challenging and
needs a much more effective method since most augmentation methods are not data-driven.
Generative models can play an important role in automatically generating time series data
from the learnt representation of the data.
This study proposes a new deep forecasting framework using a data-driven time series
synthesiser method, which learns the time series’ robust representation and creates new data
points by sampling from an encoded representation of the time series. In this framework, the
Variational Autoencoder (VAE) (Kingma & Welling (2014)) is a deep generative model used for
time series data augmentation. It applies a variational inference framework to approximate the
exact posterior distribution and encodes the latent space of the time series using hierarchical
neural network layers. VAEs can be widely used to augment time series data since they
derive meaningful features from data. This procedure also decreases the chance of overfitting
the training data by smoothing and removing outliers. After augmenting the time series
data, a hybrid bi-directional Long short-term memory-Convolutional neural network (biLSTM-
Conv) is trained in the augmented data. Then the acquired knowledge is transferred to the
original dataset. In this model, bi-LSTM layers capture the long-term dependencies of the time

series, and convolutional layers capture the localised shape-based features. We use residual
connections (He et al. (2016)) in some layers to maximise information flow between layers.
Also, we apply a Gaussian noise layer to ensure the noise robustness of the model.
One of the advantages of using VAEs is their ability to smooth out noisy or erratic time
series data. This smoothing effect arises due to the probabilistic nature of the VAE framework,

which allows it to capture the underlying distribution of the input data and generate new

samples from that distribution. By doing so, VAEs can remove outliers and other anomalous
data points that may hinder the accuracy of a forecasting or function approximation model.
The smoother data generated by VAEs improves the accuracy and robustness of these
models, particularly in situations where the real-world data is noisy or contains unexpected

fluctuations.
Also, to illustrate the utility of the synthetic time series, we propose a new method that
compares original and synthetic time series based on their meta-features. The statistical
properties of the augmented time series and original ones are similar when they have the same
distribution of the time series meta-features.
In our evaluation of real-world time series datasets, we show that the proposed method
can significantly improve the accuracy of basic deep forecasting models and state-of-the-art
univariate forecasting methods by training the network in representative augmented data and
transferring the acquired knowledge to the real dataset. While most deep forecasting methods
work on increasing the complexity of networks to obtain better results, in this paper, we
focus on augmenting time series from the latent space of VAE and employ the potential of
transfer learning to avoid overfitting in forecasting using deep neural networks. We evaluate
the robustness of the proposed approach for different types of time series datasets to highlight
the conditions better when the proposed approach performs well.
As a summary of the novelties of this paper:

1. This work is entirely data-driven, where every procedure is designed to enhance the
quality of univariate time series data.


2. Using Variational Autoencoders (VAEs), we show that our proposed augmentation method
generates better samples than other benchmarks. Specifically, VAEs work very well be-
cause they learn a low-dimensional representation of the data by minimizing the recon-
struction error between the original and generated samples. Moreover, VAEs provide
regularization through the Kullback-Leibler divergence term in the loss function, which
helps prevent overfitting.


3. Our proposed forecaster, which combines a bidirectional Long Short-Term Memory (bi-
LSTM) and Convolutional Neural Network (CNN), outperforms statistical methods such
as MAPA, ARIMA, ETS, etc., in terms of forecasting accuracy.
4. Deep Learning methods generally perform better on augmented time series data, and
our proposed VAE-based augmentation technique has a smoothing effect that leads to
well-generated data for other deep learning algorithms. Therefore, this technique helps
to reduce the complexity of the approximation needed in deep learning forecasting.
5. We provide an AugmentTS library on GitHub that includes all the codes necessary
to reproduce the results presented in this paper. This library can be used by other

researchers to further investigate our proposed methodology and extend it to their own
datasets.

The rest of the paper is organised as follows. In section 2, we provide a comprehensive


literature review of synthetic data methods and their applications, specifically in time series. In

section 3, we describe the structure of the proposed deep generative model and the forecasting
framework. In section 4, we describe data, experimental design, forecasting error measures,

and forecasting benchmarks. Section 5 provides the empirical results and summarises our
findings. Section 6 presents a discussion of the approach and, finally, Section 7 concludes and
indicates directions for further research.

2. Literature Review

2.1. Augmentation

Deep learning has attained remarkable developments in many fields, including computer vision,
natural language processing (NLP), and time series-related tasks, i.e., time series classification
(Fawaz et al. (2019)), time series anomaly detection (Gamboa (2017)), and time series fore-
casting (Han et al. (2019)). Despite the excellent performance of the deep learning models in
time series forecasting and classification, the success of the deep learning models relies heavily
on a large amount of data to eschew overfitting or underfitting. However, many time series
tasks do not have a sufficient number of time series. Data augmentation is an effective tool
to enhance the training data’s size and quality. It leads to robust performance in the deep
learning models (Bandara et al. (2021)), i.e. robustness against model misspecification and
small sample issues. Wen et al. (2020) divide time series data augmentation into the basic and
advanced methods. Basic data augmentation methods include time domain, frequency domain,
and time-frequency domain methods. Advanced data augmentation methods are divided into
decomposition-based, statistical generative, and learning-based methods.

2.1.1. Basic Data Augmentation Methods


Time-domain methods directly manipulate the original input time series, like adding Gaus-
sian noise or more complicated noise patterns, e.g., pike, step-like trend, and slope-like trend.
Moreover, window cropping (or slicing), window warping, and flipping are three more common
basic time-domain methods (Le Guennec et al. (2016); Wen & Keyes (2019)). Frequency do-
main methods employ perturbations in both amplitude and phase spectrum in the frequency
domain. Lee et al. (2019) adopt the amplitude adjusted Fourier transform (AAFT) and apply
random phase shuffle in the phase spectrum after Fourier transform. Next, they implement
rank-ordering of the time series after inverse Fourier transform. Time-frequency domain meth-
ods contain attributes of both prior methods. Park et al. (2019) propose SpecAugment to make
a data augmentation in a time-frequency domain. The augmentation process consists of warp-
ing the features, making blocks of the frequency channels, and masking blocks of the time
steps. Their results show the accuracy improvement in the performance of neural networks in
the speech recognition model.
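To make the basic time-domain transformations above concrete, the following minimal NumPy sketch implements jittering (added Gaussian noise), window slicing, and flipping; the function names and default parameters are illustrative assumptions, not taken from any of the cited implementations.

import numpy as np

def jitter(x, sigma=0.03):
    """Add Gaussian noise to a univariate series (time-domain augmentation)."""
    return x + np.random.normal(loc=0.0, scale=sigma, size=x.shape)

def window_slice(x, ratio=0.9):
    """Crop a random contiguous window covering `ratio` of the series length."""
    win = int(np.ceil(ratio * len(x)))
    start = np.random.randint(0, len(x) - win + 1)
    return x[start:start + win]

def flip(x):
    """Reverse the series in time (flipping)."""
    return x[::-1]

# Example: augment a toy seasonal series
series = np.sin(np.linspace(0, 8 * np.pi, 120)) + 0.1 * np.random.randn(120)
augmented = [jitter(series), window_slice(series), flip(series)]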

2.1.2. Advanced Data Augmentation Methods

Wen et al. (2020) classify advanced data augmentation methods into decomposition-based,
statistical generative, and learning-based methods.
Decomposition-based methods decompose time series to trend, seasonality, and remainder

signal, e.g., a Seasonal-Trend decomposition procedure based on Loess (STL) and Robust STL.
Decomposition-based methods are adopted to generate augmented time series for forecasting

and anomaly detection. Bergmeir et al. (2016) apply bootstrap on the STL decomposed
residual to generate augmented signals, in which the trend and the seasonality of the original
time series are added back to it to create the final augmented time series. Gao et al. (2020)
decompose time series into trend, seasonality, and residual by applying robust decomposition

methods. Then they implement the frequency domain and the time domain augmentation on
the residual. Their results show an improvement in anomaly detection performance compared
with the same method without augmentation.

Statistical generative methods try to model the time series dynamics with statistical models.
Originally, statistical models characterise the conditional distribution of the time series by
assuming that the value at time t relies on the previous points. Once the initial value is defined,
a new time series is generated following the conditional distribution. Cao et al. (2014) propose
a mixture of Gaussian trees for modelling a multi-modal minority class time series dataset.
Smyl & Kuber (2016) propose a statistical algorithm that uses samples of parameters and
forecast paths calculated by the LGT (Local and Global Trend) algorithm. Moreover, Kang
et al. (2020) propose GRATIS, which uses a mixture of autoregressive models to generate time
series with diverse and controllable characteristics.
Learning-based methods are divided into embedding space methods, deep generative mod-
els, and automated data augmentation methods. These are based on learning the underlying
characteristic distribution of the time series data, in contrast to the previous methods, which
are not data-driven.
Embedding space methods suggest that simple transformations on the encoded inputs or
the latent space, rather than the original data, create more robust synthetic data. They
showed robust performance in the time series research. DeVries & Taylor (2017) propose
MODALS (Modality agnostic Automated Data Augmentation in the Latent Space) and train
a classification model jointly with the diverse compositions of the latent space augmentations
to generate synthetic data from the learnt latent space.
Deep generative models show the capability of generating near-realistic high-dimensional


data like audio, text, and image. These models, developed for sequential data such as audio and
text, can usually be extended to model the time series data. Generative Adversarial Networks
(GANs) are popular deep generative models. Esteban et al. (2017) propose a Recurrent GAN
(RGAN) and a Recurrent Conditional GAN (RCGAN) to produce realistic real-valued multi-
dimensional time series data. The RGAN uses RNN in the generator and discriminator,
whereas the RCGAN adopts both RNNs conditioned on exogenous variables. Yoon et al.
(2019) introduce TimeGAN, a generative time series model, trained adversarially and jointly
via a learnt embedding space with both supervised and unsupervised losses.

Automated data augmentation methods search for an optimal data augmentation policy
through different approaches such as reinforcement learning, meta-learning, or evolutionary
search. Cubuk et al. (2019) introduce AutoAugment to automatically search for improved
data augmentation policies in a reinforcement learning framework. Fons et al. (2021) propose
a model that learns the weight of each augmented sample that affects the prediction loss.

Then the model chooses an appropriate transformation based on the ranking of the predicted

training loss.
Although the choice of augmentation methods is vast, generating a representative time
series that can increase the forecasting accuracy of the main time series has been a focal point
for the business forecasting task. This cannot be achieved unless an adopted approach conveys

the main time series characteristics into the augmentation dataset. Overcoming this issue has
been one of the central motivations for this work. None of the basic augmentation methods
has a learning process to learn from the main data, similar to the advanced methods like
STL-based methods. VAEs are a class of deep-latent variable models that employ variational

inference to learn the complex intractable posterior over the data distribution. VAEs produce
latent variables to generate new data and deal with high variability in complex datasets. There
is a lack of research in time series augmentation using VAEs.
2.1.3. Variational Autoencoder for data Augmentation

Nishizaki (2017) uses VAE to produce latent variables to generate new data for acoustic mod-
elling. Wu et al. (2019) apply VAE as a data augmentation model for embedding-based speaker
verification, in which they aim to check the speaker’s identity from the speech signal. Wang
et al. (2020) use norm-VAE to generate synthetic data to solve unsupervised domain adapta-
tion problems in image classification. In the time series research, Hsu (2017) introduces a new
method for time series forecasting based on Bayesian variational inference. He proposes a new
method to apply multiple latent variables with different transition steps and discard redundant
input variables. Zeroual et al. (2020) show that VAEs performed better than benchmarks to
forecast COVID-19 time series data. Also, Ullah et al. (2020) apply Variational Recurrent Neu-
ral Networks (VRNNs) for forecasting clinical time series data. Demir et al. (2021) use VAE
and Wasserstein generative adversarial networks to augment time series and forecast electric-
ity market prices. Although prior works investigated variants of VAEs to perform time series
augmentation, there is a lack of studies in analysing the impact of time series augmentation
with VAEs in forecasting performance, studying the effects of the augmentation on different
domains, seasonality, and length of time series, and finally assessing the reasons for improving
or impairing the forecast accuracy.
2.2. Transfer Learning

To overcome deficiencies in both the quantity and quality of the datasets available to state-of-the-art deep learning models, transfer learning methods have been developed. They are
attracting increasing attention in time series forecasting research. Ribeiro et al. (2018) adopt
transfer learning in a forecasting framework for cross-building energy prediction. Laptev et al.

(2018) show that the feature-transfer method improves accuracy over the traditional transfer
learning methods, particularly where there is not a sufficient target dataset. Bandara et al.
(2021) implement transfer learning with data augmentation methods to increase the forecasting
performance of the global forecasting methods.
In the time series forecasting literature, only a few papers study whether data augmenta-

tion with transfer learning increases the forecasting capability of deep learning models. This

study proposes a new time series augmentation method in which transfer learning significantly
improves forecasting accuracy. This is specifically the case where the length of the original
time series is short or insufficient to train the deep forecasters’ large number of parameters.
In contrast with prior augmentation studies in time series, we build our approach based on

the latent variable model that can infer hidden structures in the underlying data to generate
synthetic time series.

3. Methodology
This section proposes our augmentation-based forecasting model using the Variational AutoEn-
coder and hybrid bi-LSTM-Conv1D architectures. Figure 1 illustrates the whole procedure of
the framework that contains min-max scaling preprocessing, data generation using VAEs, ap-
plying rolling window on the data, forecasting using hybrid bi-LSTM-Conv1D, and finally,
applying inverse min-max scaling on the predicted data. In the following, we will discuss each
part in separate subsections.

3.1. Preprocessing

In the data preprocessing phase, first, we split the dataset $D_{n,t}$ into $D_{n,1:t-(w+h)}$ and $D_{n,t-(w+h)+1:t}$. The former is the training data and the latter is the test data, where $n$ is the number of time series, $t$ is the number of time steps, and $w$ is the required window size with the forecast horizon $h$. We apply min-max scaling to rescale the range of features to a fixed range between 0 and 1, which makes the model training easier and avoids numerical instability. This scaling method can be defined
as follows:
$$X_{\text{scaled}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}, \tag{1}$$

where X is the input data, the subscripts min and max are used to denote each column’s
minimum and maximum values, respectively. We only generate the training part of the data
to avoid data leakage with the VAE. This satisfies our requirements to feed the data to the VAE
to draw samples from the latent space. Afterwards, we use a rolling window with a rolling size
of 1 to ensure a constant data size. By treating this method as a data augmentation technique,
we take advantage of collecting larger amounts of data. By performing this rolling window, one can see that the training input becomes $Train_{n,\,i:w+i}$ and the training output becomes $Train_{n,\,w+i+1:w+h+i}$, where $i = 0 : \mathrm{length}(train) - (h + w)$. The test input would be $Test_{n,\,:w}$, and the test output would be $Test_{n,\,w+1:}$.
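As an illustration of this preprocessing step, the following sketch shows min-max scaling fitted on the training portion only (Eq. (1)) and a rolling window with stride 1 that pairs each window of length w with the next h observations; the helper names and toy sizes are assumptions for illustration, not the AugmentTS implementation.

import numpy as np

def min_max_scale(x, x_min, x_max):
    """Rescale to [0, 1] using statistics of the training portion only (Eq. 1)."""
    return (x - x_min) / (x_max - x_min)

def rolling_window(series, w, h):
    """Slide a window of size w with stride 1; each input window is paired
    with the following h observations as the forecast target."""
    X, y = [], []
    for i in range(len(series) - (w + h) + 1):
        X.append(series[i:i + w])
        y.append(series[i + w:i + w + h])
    return np.array(X), np.array(y)

# Toy usage: 100-step series, window of 12, horizon of 4
series = np.random.rand(100)
train, test = series[:-16], series[-16:]           # hold out the last w + h points
lo, hi = train.min(), train.max()
train_scaled = min_max_scale(train, lo, hi)
X_train, y_train = rolling_window(train_scaled, w=12, h=4)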

[Figure 1 depicts the proposed pipeline: min-max scaling and train-test split (preprocessing), data augmentation with a Variational AutoEncoder built from LSTM layers (producing original, generated, and combined training data), a rolling window, the hybrid LSTM+Conv1D forecaster, and inverse min-max scaling (postprocessing).]

Figure 1: The proposed framework

3.2. Data augmentation using VAE


Variational Inference is the primary building block of Variational AutoEncoders, our generative
model for creating synthetic time-series data. The whole idea of this framework comes from
statistical mechanics and mean field methods. In the beginning, it is preferable to scrutinize
some critical ideas of the Variational Inference framework and how they act in modelling and
generating time series. Moreover, the discussion of building a low-variance estimator to solve
Variational problems leads us to the Variational AutoEncoder models, followed by the possible
changes in their latent space distribution.

3.2.1. Variational inference


In computing a posterior distribution, having many variables makes the exact inference compu-
tationally intractable, therefore, quite impractical. This problem is addressed by approximate
inference methods, which use some approximation to make the inference problem tractable.
Jordan et al. (1999) introduce this solution as Variational Inference (VI), an approximate
method that casts the problem of inference as an optimisation problem. In this case, we need
to approximate a tractable distribution called variational distribution qϕ to make it close to


the true posterior distribution pθ . Minimising a distance measure between two probability
distributions is required to make our variational distribution a good approximation of the true
distribution. In the Variational Inference literature, this distance is usually Kullback-Leibler


(KL) divergence with the formula:
 
$$D_{KL}\left(q_\phi(Z)\,\|\,p_\theta(Z \mid X)\right) = -E_{q_\phi(Z)}\left[\log \frac{p_\theta(Z \mid X)}{q_\phi(Z)}\right] \tag{2}$$

where X denotes observed variables, and Z is used for latent variables. In our paper, X is the
training time series data, and Z represents the input time series. Ideally, we want to attain a
good approximation with a minimised KL divergence.
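As a small numerical illustration of this objective, the closed-form KL divergence between two univariate Gaussians (a standard result, not derived in this paper) shows how the divergence vanishes when the variational distribution matches the target and grows as it moves away:

import numpy as np

def kl_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ), closed form for univariate Gaussians."""
    return (np.log(sigma_p / sigma_q)
            + (sigma_q**2 + (mu_q - mu_p)**2) / (2 * sigma_p**2) - 0.5)

# The divergence is 0 when q and p coincide and grows as q drifts from p.
print(kl_gaussians(0.0, 1.0, 0.0, 1.0))   # 0.0
print(kl_gaussians(1.0, 1.0, 0.0, 1.0))   # 0.5
print(kl_gaussians(2.0, 1.5, 0.0, 1.0))   # larger still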

In order to minimise this divergence, we should first derive a lower bound.

$$\begin{aligned}
D_{KL}\left(q_\phi(Z)\,\|\,p_\theta(Z \mid X)\right) &= -E_{q_\phi(Z)}\left[\log \frac{p_\theta(Z \mid X)}{q_\phi(Z)}\right] \\
&= -E_{q_\phi(Z)}\left[\log p_\theta(Z \mid X) - \log q_\phi(Z)\right] \\
&= -E_{q_\phi(Z)}\left[\log \frac{p_\theta(Z, X)}{p_\theta(X)} - \log q_\phi(Z)\right] \\
&= -E_{q_\phi(Z)}\left[\log p_\theta(Z, X) - \log p_\theta(X) - \log q_\phi(Z)\right] \\
&= -E_{q_\phi(Z)}\left[\log p_\theta(Z, X) - \log q_\phi(Z)\right] + E_{q_\phi(Z)}\left[\log p_\theta(X)\right]
\end{aligned} \tag{3}$$

$\log p_\theta(X)$ in (3) is independent of $Z$, so its expectation under $q_\phi(Z)$ is $\log p_\theta(X)$. Our cost function therefore simplifies to

$$D_{KL}\left(q_\phi(Z)\,\|\,p_\theta(Z \mid X)\right) = -E_{q_\phi(Z)}\left[\log p_\theta(Z, X) - \log q_\phi(Z)\right] + \log p_\theta(X) \tag{4}$$

where the first term on the right-hand side of (4) is the variational lower bound or Evidence
Lower Bound (ELBO), and we denote it by L. Koller & Friedman (2009) show that minimising
the KL divergence is equivalent to maximising the lower-bound L. Therefore, our objective
becomes maximising ELBO in order to achieve a good approximation.

3.2.2. Black-box variational inference


So far we have discussed our objective, which is maximising ELBO.

$$\mathcal{L}(p_\theta, q_\phi) = E_{q_\phi(Z \mid X)}\left[\log p_\theta(X, Z) - \log q_\phi(Z \mid X)\right] \tag{5}$$


where θ are parameters of p, and similarly, ϕ are parameters of q. We use gradient descent
to maximise the ELBO and assume that qϕ is differentiable with respect to its parameters.
According to Ranganath et al. (2013), this method is known as black-box variational inference.
Its objective is to maximise ELBO while simultaneously optimising over parameters θ and ϕ.
The former pushes log pθ (X) up, and the latter keeps ELBO tight around log pθ (X). With
this approach, we can derive the score function estimator, i.e.,

$$\nabla_{\theta,\phi}\, E_{q_\phi(Z)}\left[\log p_\theta(X, Z) - \log q_\phi(Z)\right] = E_{q_\phi(Z)}\left[\left(\log p_\theta(X, Z) - \log q_\phi(Z)\right)\nabla_\phi \log q_\phi(Z)\right] \tag{6}$$
However, this estimator has too high a variance under the Monte Carlo estimate to be useful for practical purposes (Ranganath et al. (2013)). An approach to address this problem
is to reformulate ELBO, deriving a new non-Monte Carlo gradient estimator based on the
reparametrisation trick to make the estimate of expectations differentiable.

3.2.3. The SGVB estimator and reparametrisation trick

From the previous sections, by reformulating equations 4 and 5, we conclude that ELBO can
be written as

$$\mathcal{L}(p_\theta, q_\phi) = -D_{KL}\left(q_\phi(Z \mid X)\,\|\,p_\theta(Z)\right) + E_{q_\phi(Z \mid X)}\left[\log p_\theta(X \mid Z)\right] \tag{7}$$
where the first term on the right-hand side is the negative KL divergence of the approximate
posterior which can be computed analytically, the second term on the right-hand side is the
reconstruction term which requires a sampling procedure. We should optimise this equation
with respect to both ϕ and θ. For this problem, the Monte Carlo gradient estimator of a

function f (Z) can be written as

$$\nabla_\phi\, E_{q_\phi(Z)}\left[f(Z)\right] = E_{q_\phi(Z)}\left[f(Z)\,\nabla_\phi \log q_\phi(Z)\right] \simeq \frac{1}{L}\sum_{l=1}^{L} f\!\left(Z^{(l)}\right)\nabla_\phi \log q_\phi\!\left(Z^{(l)}\right), \quad Z^{(l)} \sim q_\phi(Z). \tag{8}$$
Again, because of the Monte Carlo nature of this gradient estimate, this estimator has a high variance, and we also cannot differentiate through a sampling operation (Ranganath et al. (2013)). This problem can be addressed by a simple trick called reparametrisation. We can reparametrise $\tilde{Z} \sim q_\phi(Z \mid X)$ using a differentiable transformation of a noise variable; i.e.,

$$\tilde{Z} = g_\phi(\epsilon, X) \quad \text{with} \quad \epsilon \sim p(\epsilon), \tag{9}$$


where $\epsilon$ is a random noise sample from a distribution $p(\epsilon)$ such as the standard Gaussian distribution. $g_\phi(\epsilon, X)$ is a deterministic transformation of a random noise to a probability distribution. In the case of neural networks, this estimator allows us to backpropagate through this transformation. Now the Monte Carlo estimate of the expectation of a function $f(Z)$ with respect to $q_\phi(Z \mid X)$ can be written as:
qφ (Z|X) can be written as:
$$E_{q_\phi(Z \mid X)}\left[f(Z)\right] = E_{p(\epsilon)}\left[f\left(g_\phi(\epsilon, X)\right)\right] \simeq \frac{1}{L}\sum_{l=1}^{L} f\!\left(g_\phi(\epsilon^{(l)}, X)\right) \quad \text{where } \epsilon^{(l)} \sim p(\epsilon) \tag{10}$$
Applying this result to (7) yields an estimator with lower variance than Monte Carlo estimates (Kingma & Welling (2014)), called the Stochastic Gradient Variational Bayes (SGVB) estimator, given in equation (11):
$$\tilde{\mathcal{L}}(\theta, \phi; X) = -D_{KL}\left(q_\phi(Z \mid X)\,\|\,p_\theta(Z)\right) + \frac{1}{L}\sum_{l=1}^{L} \log p_\theta\!\left(X \mid Z^{(l)}\right) \tag{11}$$

where $Z^{(l)} = g_\phi(\epsilon^{(l)}, X)$, $\epsilon^{(l)} \sim p(\epsilon)$, $L$ is the number of samples, and $l$ indexes a single sample.

3.2.4. Variational AutoEncoder

Using neural networks as an encoder qφ (Z | X), we can take advantage of using deeper layers
and capture a much better representation of the data. The Variational AutoEncoder is a
neural network architecture that maximises the ELBO in (11), which is equivalent to minimising

the KL divergence and maximising likelihood. We usually let the variational distribution be
Gaussian-distributed. Then, by reparametrisation of Gaussian distribution, we sample from
$g_\phi(X, \epsilon) = \mu + \sigma \odot \epsilon$, where $\epsilon$ is a noise sample from the standard Gaussian distribution, $\mu$ is the mean, and $\sigma$ is the variance of the Gaussian distribution.
In this paper, since the data is time series, we use bidirectional LSTM layers, which are

suitable for modelling sequential data, in the layers of the Variational AutoEncoder. Using

this architecture for VAE improves the approximation procedure of the variational distribution
qφ (Z | X).
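A minimal Keras-style sketch of this sampling step is given below, assuming the encoder outputs a mean and log-variance head on top of a bidirectional LSTM; the layer sizes are placeholders rather than the tuned values reported in Section 4.5, and the sketch is illustrative, not the AugmentTS implementation.

import tensorflow as tf

class GaussianSampling(tf.keras.layers.Layer):
    """Reparametrisation trick: z = mu + exp(log_var / 2) * eps, eps ~ N(0, I).
    Sampling is a deterministic, differentiable transform of noise, so gradients
    can flow back into the encoder."""
    def call(self, inputs):
        mu, log_var = inputs
        eps = tf.random.normal(shape=tf.shape(mu))
        return mu + tf.exp(0.5 * log_var) * eps

# Illustrative encoder head: a bi-LSTM followed by two dense heads for mu / log_var.
inputs = tf.keras.Input(shape=(52, 1))                       # (timesteps, features)
h = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(inputs)
mu = tf.keras.layers.Dense(8)(h)                             # latent dimension of 8
log_var = tf.keras.layers.Dense(8)(h)
z = GaussianSampling()([mu, log_var])
encoder = tf.keras.Model(inputs, [mu, log_var, z])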

3.2.5. Log-Normal Latent Space

We are interested in using a heavy-tailed distribution when dealing with a dataset with nu-
merous tail data. This is beneficial in modelling tail data since a light-tailed distribution like
the Gaussian distribution is not sensitive enough to them. Further attempts in using other
variational probability distributions are described in (Figurnov et al. (2018); Naesseth et al.
(2017)).
So our first chosen distribution for latent space is the log-normal distribution. This is a
skewed heavy-tailed distribution with the following probability density function.
$$f(X) = \frac{1}{X\sigma\sqrt{2\pi}}\exp\left(-\frac{(\ln X - \mu)^2}{2\sigma^2}\right). \tag{12}$$

The Kullback-Leibler divergence for this distribution is the same as for the Gaussian distribution. Therefore, we only need a reparametrisation trick to sample from a log-normal distribution. Our choice for this reparametrisation is $\exp(\epsilon)$ where $\epsilon \sim \mathcal{N}(\mu, \sigma^2)$. The algorithm of the VAE model for time series generation is given in Algorithm 1.
Algorithm 1 Variational AutoEncoder for time series data generation.
  Initialise a bi-LSTM encoder enc which inputs time series X and outputs µ, σ of dimension dim.
  Initialise a bi-LSTM decoder dec which inputs latent space samples z and outputs the reconstructed time series.
  θ, φ, dim ← Initialise encoder and decoder parameters and latent space dimension.
  repeat
      X^M ← Minibatch of M time series (drawn from the full time series dataset)
      µ^M, σ^M = enc(X^M)
      if Gaussian latent space then
          ε ∼ N(0, 1)
          z^M = g_φ(X^M, ε) = µ^M + σ^M ⊙ ε   (or z^M = g_φ(X^M, ε) = µ^M + e^{σ^M/2} ⊙ ε)
      else if Log-normal latent space then
          ε ∼ N(µ^M, σ^M)
          z^M = g_φ(X^M, ε) = e^ε
      end if
      X̃^M = dec(z^M)
      Reconstruction Error = MSE(X^M, X̃^M)
      L̃^M(X^M; θ, φ) = −D_KL(q_φ(z^M | X^M) ‖ p_θ(z^M)) + Reconstruction Error
      g ← ∇_{θ,φ} L̃^M(X^M; θ, φ)   (Gradients of minibatch estimator)
      θ, φ ← Update parameters using gradients g (e.g. Adam)
  until convergence of parameters (θ, φ)
  z^N ← Draw N samples from the latent space
  X^N_synthetic ← dec(z^N)
  return X^N_synthetic

3.3. Input data for deep learning

In this paper, we use two approaches for training deep models. The first approach only uses
the VAE-generated time series, called (G), and the second approach is combined, called (C),
which uses original data (O) and generated data (G) together for the training purpose. Our

sample size of the VAE is equal to the original data size. Therefore, the (C) approach doubles
the data for the training, and it may help reduce the risk of overfitting. However, the (G)

approach can be helpful for sensitive data, where it is restricted for training on the main data
(O).
It is worth noting that the test data are separated at the beginning, and there is no chance
of data leakage between the training and test dataset.
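Schematically, the three training variants can be assembled as below; the callable vae is a stand-in for the generator of Section 3.2 and is replaced by a trivial perturbation only to keep the sketch self-contained.

import numpy as np

def build_training_sets(original, vae):
    """Assemble the three training variants used in this paper:
    O = original only, G = VAE-generated only, C = original + generated."""
    generated = vae(original)   # the VAE sample size equals the original data size
    return {
        "O": original,
        "G": generated,
        "C": np.concatenate([original, generated], axis=0),  # doubles the training data
    }

# Stand-in generator used only for illustration
datasets = build_training_sets(np.random.rand(111, 105),
                               lambda x: x + 0.01 * np.random.randn(*x.shape))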

3.4. Forecasting and Architecture

When the datasets are sequential, since the data timesteps are related to each other, tradi-

tional feed-forward neural networks suffer from a lack of memory units to model this relation.
Recurrent neural networks (RNNs) address this problem by using feedback connections al-
lowing information to remain. RNNs’ distributed hidden states facilitate them to remember
past information and their non-linear dynamics to update their neurons. These networks are
helpful in some tasks like time series forecasting, speech recognition, and language modelling.
However, they suffer from the vanishing or exploding gradients problem. If the RNNs’ weights
are too large, gradients grow exponentially, which is exploding gradients. On the other hand, if
the RNN’s weights are too small, gradients shrink exponentially, which is vanishing gradients.
These problems lead RNNs to learn sequential data insufficiently.
Long-Short Term Memory (LSTM) networks address the vanishing gradient problem by
ot

allowing gradients’ flow to remain unchanged. LSTM contains gates that define if the data
can pass through or not, depending on the data’s priority. The gates also make the network
tn

learn what to save, what to forget, what to remember, and what to output. The cell state and
hidden state are used in gathering data for processing in the next state. For a forward pass in
a single cell of LSTM the following equations are calculated:
$$\begin{aligned}
\text{Input gate:} \quad & i_t = \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \\
\text{Forget gate:} \quad & f_t = \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \\
\text{Output gate:} \quad & o_t = \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) \\
\text{Intermediate cell state:} \quad & \tilde{c}_t = \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) \\
\text{Cell state:} \quad & c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t \\
\text{Output vector:} \quad & h_t = o_t \circ \tanh\left(c_t\right),
\end{aligned} \tag{13}$$


where xt is the LSTM cell input vector, W, U, b are weight matrices and bias vectors which are
trainable parameters of the model. it , ft , and ot are input, forget, and output gates’ activation
vector, respectively. c̃t is the candidate to update ct , which is the cell state vector. ht is the
output vector of a LSTM cell, and ◦ denotes the Hadamard product.
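A direct NumPy transcription of a single forward step of Eq. (13) is sketched below; the toy dimensions are arbitrary and the weights are randomly initialised purely for illustration.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, W, U, b):
    """Single forward step of Eq. (13). W, U, b are dicts keyed by gate name."""
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])        # input gate
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])        # forget gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])        # output gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])    # candidate state
    c_t = f_t * c_prev + i_t * c_tilde                            # new cell state
    h_t = o_t * np.tanh(c_t)                                      # new hidden state
    return h_t, c_t

# Toy dimensions: 1-dimensional input, 4 hidden units
d, k = 1, 4
W = {g: np.random.randn(k, d) * 0.1 for g in "ifoc"}
U = {g: np.random.randn(k, k) * 0.1 for g in "ifoc"}
b = {g: np.zeros(k) for g in "ifoc"}
h, c = np.zeros(k), np.zeros(k)
h, c = lstm_cell(np.array([0.5]), h, c, W, U, b)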
In this paper, we design a variational autoencoder with three bidirectional LSTM layers

both in the encoder and decoder. We use these layers to capture the time series data’s latent
representation efficiently. For the forecasting model, first, we use a Gaussian noise layer fol-
lowed by two bi-directional LSTM layers with bias. We choose the standard deviation of this
Gaussian noise layer using KerasTuner. Each of these LSTM layers is followed by a dropout
regularisation. Then we use a concatenation layer to concatenate the network’s input and

output. We feed this concatenated output to two bi-LSTM layers, each followed by a dropout

regularisation. On top of this network, we use two one-dimensional convolutional layers. We
use Bi-LSTM layers to preserve information from past to future and future to past. Addi-
tionally, we utilise one-dimensional convolution layers to capture meaningful features and take
advantage of the local spatial coherency of time series. The latter is beneficial in reducing

the number of operations needed to process time series. The choice of using a bi-directional
LSTM layer followed by Conv1D layers is based on the effectiveness of these architectures in
capturing temporal dependencies and feature extraction from sequential data. LSTM is a type
of Recurrent Neural Networks that can handle long-term dependencies by selectively retaining

or forgetting previous inputs through its memory cell structure. Bi-directionality in LSTM
allows the network to learn from both past and future inputs, which can be beneficial for time
series forecasting tasks. CNNs, on the other hand, are known for their ability to extract local
features from input signals and can be used to improve the accuracy of predictions by identi-
fying important patterns in the data. Furthermore, Conv1D layers are particularly useful for
time series data as they can capture changes in the signal over time by sliding a filter across
the data, which can help to detect trends or anomalies that might not be apparent in the raw
input. The details of this architecture are given in Section 4.5. The forecasting model architecture is
depicted in Figure 2.
[Figure 2 depicts the forecaster as a stack: inputs X(1), ..., X(n) → Gaussian noise layer → two Bi-LSTM layers → concatenation with the input → two Bi-LSTM layers → three Conv1D layers → prediction.]

Figure 2: Deep forecasting model. First, we add Gaussian noise to the input data and feed it to two consecutive Bi-LSTM layers in the proposed forecasting model. Then we concatenate the output of these two Bi-LSTM layers with the time series input. Subsequently, we again feed this concatenated data to two consecutive Bi-LSTM layers. Finally, we feed the output of these two Bi-LSTM layers to three consecutive one-dimensional convolution layers to get our predictions.

3.5. Postprocessing

Since we scaled the data in the preprocessing phase of the framework, we rescale the data by
applying inverse min-max scaling to measure the loss functions correctly. In Algorithm 2, the whole procedure of the time series generation and forecasting algorithm is written.
Algorithm 2 Time series forecasting.
  Initialise the set of layers' weights W = {w1, w2, w3, w4, w5, w6, w7} and Gaussian noise with standard deviation σ
  X_Train, X_Test ← Train-Test Split(Dataset)
  X_TrainScaled, X_TestScaled ← Scale the train and test splits of the time series input using min-max scaling fitted on X_Train
  X_TrainGenerated ← VAE(X_TrainScaled)
  X_TrainCombined ← Concatenate(X_TrainGenerated, X_TrainScaled)
  X_WindowedTest, y_WindowedTest ← Rolling Window(X_Test)   (rolling window described in Section 3.1)
  for O, G, C do   (defined in Section 3.3)
      X_WindowedTrain, y_WindowedTrain ← Rolling Window(X_Train(O, G, C))
  end for
  repeat   (Forecaster)
      X^M, y^M ← Random minibatch of M time series drawn from X_WindowedTrain, y_WindowedTrain (each version of the dataset (O, G, C) is forecast separately)
      ε ∼ N(0, σ)
      X̃^M = X^M + ε
      X̃1^M ← BiLSTM(BiLSTM(X̃^M; w1); w2)
      X^M_Concatenated = Concatenate(X̃1^M, X^M)
      X̃2^M ← BiLSTM(BiLSTM(X^M_Concatenated; w3); w4)
      ỹ^M ← Conv1D(Conv1D(Conv1D(X̃2^M; w5); w6); w7)
      L^M(ỹ^M, y^M; W) = Mean Squared Error(ỹ^M, y^M)
      g ← ∇_W L^M(ỹ^M, y^M; W)   (Gradients of minibatch estimator)
      W ← Update parameters using gradients g (e.g. Adam)
  until convergence of parameters (W)
  ỹ_Test ← Forecaster(X_WindowedTest)
  ỹ_TestRescaled, y_TestRescaled ← Inverse Min-Max Scaling(ỹ_Test, y_WindowedTest)
  Compute MASE, sMAPE, RelAvgRMSE of ỹ_TestRescaled and y_TestRescaled
  return losses

4. Experimental Design
4.1. Datasets
In this paper, we use six real-world time series datasets to perform augmentation and fore-
casting, described briefly in the following; a summary of the datasets is given in Table 1.

• NN5 Dataset (Crone (2009)): This is a daily dataset from the NN5 competition contain-
ing UK daily cash withdrawals at many ATMs, and we aggregate them into a weekly
dataset.

• M3 Dataset (Makridakis & Hibon (2000)): This contains the 3003 series of the M3-Competition,
which contains different types of time series data (micro, industry, macro, finance, de-
mography, others) and various time intervals. In this study, the models’ performances
are investigated monthly, quarterly, and yearly, consisting of 1428, 756, and 645 time
series, respectively.

• Australian Electricity Demand (AUSElec) (O’Hara-Wild et al. (2021)): This dataset


contains five half-hourly time series for the electricity demand of five states in Australia:
Victoria, New South Wales, Queensland, Tasmania, and South Australia. We aggregate
the dataset into a weekly dataset.

Dataset        N     Lmin  Lmax  T          S
NN5            111   105   105   weekly     52
M3 Yearly      645   20    47    yearly     1
M3 Quarterly   756   24    72    quarterly  4
M3 Monthly     1428  66    144   monthly    12
AUSElec        5     313   313   weekly     52
Electricity    321   156   156   weekly     52
US Births      1     240   240   monthly    12
Traffic        862   731   731   daily      365

Table 1: Summary of the datasets used in the experiments. N is the number of time series, Lmin and Lmax are the minimum and maximum lengths of the time series, T is the sampling rate of the time series, and S is the seasonality of the time series.

• Electricity Hourly Dataset (Lai (2017a)): This dataset shows the electricity consumption
of 370 clients in 15-minute periods in Kilowatt from 2011 to 2014 with 321 time series.
We aggregate this to weekly, which has a length of 156.

• US Births Dataset (Pruim et al. (2020)): This dataset contains one time series repre-

senting the number of births in the United States from 01/01/1969 to 31/12/1988. We
aggregate the dataset into a monthly dataset.
• Traffic Dataset (Lai (2017b)): This dataset contains 862 hourly time series of the road
occupancy rates on the freeways of the San Francisco Bay area from 2015 to 2016, and
we aggregate them into a daily dataset.

4.2. Evaluation Scheme

To measure the performance of the proposed framework and benchmarks, we adopt three
evaluation metrics commonly used in the forecasting literature. These are the symmetric
Mean Absolute Percentage Error (sMAPE), Mean Absolute Scaled Error (MASE) (Hyndman
& Koehler (2006)), and Relative average root-mean-square error (RelAvgRMSE), which are
defined as follows:

$$\text{sMAPE} = \frac{2}{m}\sum_{t=1}^{m}\frac{|F_t - A_t|}{|F_t| + |A_t|}, \tag{14}$$

where At represents the observation at time t, Ft is the generated forecast, and m indicates
the forecast horizon.
$$\text{MASE} = \frac{\frac{1}{m}\sum_{t=1}^{m}|F_t - A_t|}{\frac{1}{n-S}\sum_{t=S+1}^{n}|A_t - A_{t-S}|}, \tag{15}$$

where n is the number of observations in the training set of a time series, and S refers to the
length of the seasonal period in a given time series.


v q P 
m 2
u 1
N (F − A )
t t
u Y
m t=1
RelAvgRMSE = N , (16)
u
t q P
1 m 0 2
i=1 m t=1 (Ft − At )

where $N$ is the total number of individual time series in the dataset and $F_t'$ is the forecast generated by the baseline model. RelAvgRMSE is a geometric mean of the ratio

of the root-mean-square error (RMSE) between the candidate model and the baseline model
(Davydenko & Fildes (2013)). The Naı̈ve forecast is selected as the baseline model to calculate
RelAvgRM SE for all datasets. When RelAvgRM SE error is below 1, the forecast is better
than the Naı̈ve forecast and vice versa.
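For reference, the three error measures of Eqs. (14)-(16) can be computed with a few lines of NumPy; the function signatures below are illustrative assumptions and presume the actuals, forecasts, and baseline (naïve) forecasts are supplied as arrays.

import numpy as np

def smape(actual, forecast):
    """Symmetric MAPE, Eq. (14)."""
    return 2.0 * np.mean(np.abs(forecast - actual) / (np.abs(forecast) + np.abs(actual)))

def mase(actual, forecast, train, season):
    """MASE, Eq. (15): forecast MAE scaled by the in-sample seasonal naive MAE."""
    scale = np.mean(np.abs(train[season:] - train[:-season]))
    return np.mean(np.abs(forecast - actual)) / scale

def rel_avg_rmse(actuals, forecasts, baselines):
    """RelAvgRMSE, Eq. (16): geometric mean over series of RMSE ratios vs. a baseline
    (the naive forecast in this paper)."""
    ratios = [np.sqrt(np.mean((f - a) ** 2)) / np.sqrt(np.mean((b - a) ** 2))
              for a, f, b in zip(actuals, forecasts, baselines)]
    return np.exp(np.mean(np.log(ratios)))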

4.3. Forecasting Benchmarks

4.3.1. Exponential Smoothing

Hyndman et al. (2008) propose the state space representation of the parameters of exponential
smoothing models, typically referred to as ETS. In this paper, the ETS model refers to the

best model selected automatically, using the Akaike information criterion (AIC), which we
implement in the R package smooth (Svetunkov (2019)).

4.3.2. Multiple temporal aggregation algorithm

Kourentzes & Petropoulos (2014) introduce the multiple temporal aggregation algorithm (MAPA)
to precisely target time series’ components with the help of multiple temporal structures. The
ETS models are used in the first MAPA versions since the forecasters needed these for each
temporal component. The resulting components (trend, level, seasonality) are aggregated to
create the final model. In this paper, we use the R package MAPA (Kourentzes & Petropoulos
(2018)) for implementing MAPA.

4.3.3. TBATS

This method is suggested by De Livera et al. (2011) to address the problems such as complex
seasonal patterns and calendar effects which classical exponential smoothing models cannot
realise. The model is named after the key components comprising its algorithms: Trigonometric
seasonal modes, Box-Cox transformation, ARMA errors, and Trend and Seasonal components.
In this study, we implement the TBATS using a forecast package in the R language (Hyndman
et al. (2019)).

4.3.4. STheta Method


The STheta method is a univariate forecasting method implemented for forecasting a non-
seasonal time series. In this method, a new time series is calculated by solving a second-order
difference equation by decomposing the original time series into the lines called ‘Theta lines’.
Each of these lines is extrapolated with a forecasting algorithm, and the forecasts are combined
to create the forecast for the main time series (Assimakopoulos & Nikolopoulos (2000)). We
apply it via the R package forecTheta (Fiorucci et al. (2016)).

4.3.5. ARIMA

In statistics and particularly in the time series analysis, an autoregressive integrated moving
average (ARIMA) model is a generalisation of an autoregressive moving average (ARMA)
model. In a variation of the ARIMA method, Hyndman & Khandakar (2008) adopted a new
technique, which combines unit root tests, minimisation of the AIC, and Maximum Likelihood

Estimate (MLE) to select an ARIMA model automatically. We apply auto.ARIMA model
from the forecast package in R (Hyndman et al. (2019)).

4.4. Augmentation Benchmarks

This study explores two common augmentation methods to generate time series. Then we
compare their results with those from the VAE.

4.4.1. Moving Block Bootstrapping (MBB)

In our research, we adopt the MBB technique as a benchmark for time series augmentation,
following the procedure which is introduced in Bergmeir et al. (2016). MBB is a common

bootstrapping technique in time series forecasting (Athanasopoulos et al. (2018)). To create
multiple copies of a time series, MBB first uses STL to decompose and subsequently remove
seasonal and trend components of a time series. Furthermore, the MBB technique is applied

to the remainder of the time series - i.e., seasonally and trend - adjusted series - to generate
multiple versions of the residual components. Finally, the bootstrapped residual components
are added back together with the corresponding trend and seasonal to make new bootstrapped
versions of a time series. In the MBB technique, the artificially augmented data closely resem-
ble the distribution of the original training dataset - i.e., with similar seasonality and trend.
We apply the MBB implementation available in the bld.mbb.bootstrap function from the R

package forecast (Hyndman et al. (2019)).

4.4.2. Dynamic Time Warping Barycentric Averaging (DBA)

The second procedure is the Dynamic Time Warping (DTW)-based time series augmentation
technique introduced by Forestier et al. (2017a). The DBA method averages a set of time series
to generate new synthetic samples, so being able to mix characteristics of different time series
when generating new series leads to better accounting for the global characteristics in a group
of time series. As characteristics of the original dataset are considered to generate new time
series, similar to MBB, DBA can also generate augmented series similar to the original training
dataset. We apply the implementation of the DBA method from Forestier et al. (2017b).
4.5. Model parameters

In choosing the hyperparameters and activation functions of VAE and bi-LSTM models, we
use KerasTuner as a robust hyperparameter optimisation framework and its random search
module. Finding the optimal neurons was performed by searching among a group of 32, 64,
128, 256, and 512 neurons and ReLU, ELU for the activation functions. Regarding the initial
learning rate choice, we searched in an interval of 0.001 to 0.1 with the step of 0.002. Further-
more, the learning rate decay of 0.001 is used as the learning rate scheduler. All regularisation
parameters, such as dropout rates, are chosen using the aforementioned framework.
For the encoder part of the VAE model, we use three LSTM layers with linear activation
and 512, 256, and 64 neurons, respectively. We parameterise the mean and variance of LSTM
layers’ output with two dense layers and perform the reparametrisation trick afterwards. Note

that the number of dense neurons equals the latent space dimension, which is 8. We use two
linear LSTM layers for the decoder part with 256 and 512 neurons, respectively. Finally, we
use a time-distributed layer to apply a dense layer on all input slices. The number of neurons in
this dense layer should be matched with the VAE’s input data dimension. The decoding term
of equation (11) is a model depending on the data we are modelling and can be Gaussian, Bernoulli, or another distribution (Kingma & Welling (2014)). In choosing a loss function for the
VAE, first, we want to minimise the Kullback-Leibler divergence between the approximate
posterior and exact posterior, which is the output of the encoder. Then, since our data is real-
valued, we choose Mean Squared Error (MSE) to ensure the similarity between the input data
and the reconstructed data, which is the decoder’s output. Consequently, the loss function for

this architecture is the summation of KL-divergence and MSE loss.
For the forecasting model (see Figure 2), in the beginning, we use a Gaussian noise layer
with a standard deviation of 0.01. It is followed by a sequence of layers: a bi-directional LSTM
layer, a dropout layer, another bi-directional LSTM layer, and the last dropout layer. We use

512 neurons for bi-directional LSTMs and a dropout rate of 0.9. The output of this sequence
is then concatenated with the input data and fed into a sequence of bi-directional LSTM
and dropout layers similar to the previous step. Again, we use 512 neurons for bi-directional
LSTMs in both layers and a dropout rate of 0.9 and 0.4 for the first and the second layers,
respectively. All the bi-directional LSTM layers are linear. On top of the proposed sequence,
we implement two 1D convolutional layers with 64 filters per layer, ReLU as an activation
function and kernel size and stride of 1. We apply Adam optimiser and Mean Absolute Error
(MAE) loss to train the forecaster.
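A sketch of this forecaster in Keras, following the reported layer sizes, is shown below; the output head (taking the last horizon steps of a final one-filter convolution) is an assumption made to keep the example self-contained, since the projection to the forecast horizon is not spelled out above.

import tensorflow as tf
from tensorflow.keras import layers

def build_forecaster(window=52, horizon=12):
    """Sketch of the forecaster of Sections 3.4 and 4.5 (layer sizes as reported)."""
    inp = layers.Input(shape=(window, 1))
    x = layers.GaussianNoise(0.01)(inp)
    x = layers.Bidirectional(layers.LSTM(512, activation="linear", return_sequences=True))(x)
    x = layers.Dropout(0.9)(x)                    # reported dropout rate
    x = layers.Bidirectional(layers.LSTM(512, activation="linear", return_sequences=True))(x)
    x = layers.Dropout(0.9)(x)
    x = layers.Concatenate()([x, inp])            # concatenation with the network input
    x = layers.Bidirectional(layers.LSTM(512, activation="linear", return_sequences=True))(x)
    x = layers.Dropout(0.9)(x)
    x = layers.Bidirectional(layers.LSTM(512, activation="linear", return_sequences=True))(x)
    x = layers.Dropout(0.4)(x)
    x = layers.Conv1D(64, kernel_size=1, strides=1, activation="relu")(x)
    x = layers.Conv1D(64, kernel_size=1, strides=1, activation="relu")(x)
    x = layers.Conv1D(1, kernel_size=1)(x)        # assumed per-step projection back to one channel
    out = layers.Lambda(lambda t: t[:, -horizon:, 0])(x)   # assumed head: last `horizon` steps
    model = tf.keras.Model(inp, out)
    model.compile(optimizer="adam", loss="mae")   # Adam with MAE loss, as reported
    return model

# Example: a weekly series with a 52-step window and a 12-step horizon
model = build_forecaster(window=52, horizon=12)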
Our experiments are on a single Nvidia Tesla P100 GPU with 16 GB RAM on the Google
Colab platform. We use the scikit-learn library for preprocessing and postprocessing phases
and the TensorFlow framework and Keras for model training.


5. Results
In this section, we present the results of our proposed model based on three error measures on
six datasets. We also report the average error of each forecasting model on all datasets.

We calculate the Average column of each table based on the following formula:

$$\text{avg}_i = \frac{1}{J}\sum_{j=1}^{J}\left(L_{ij} - \min_{i'} L_{i'j}\right) \quad \text{for } i = 1, \dots, I \tag{17}$$

where L is the matrix of losses with row and column iterators i and j, respectively. I and J are
the number of rows (models) and columns (datasets), respectively. This calculation aims to
find the average of how much worse the models are on each dataset than a dynamic benchmark, which
is the model with the least error on that dataset. All the datasets use Gaussian distribution
to approximate posterior distribution, except the Electricity dataset, for which the log-normal
distribution is employed. For the M3 dataset, the weighted average loss on all yearly, quarterly,
and monthly frequencies is presented. Additionally, the results are the average between the
outcomes of five different random seeds in training the forecasting models.
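The Average column of Eq. (17) amounts to the following small NumPy computation over the loss matrix; the toy numbers are illustrative only.

import numpy as np

def average_column(losses):
    """Eq. (17): for each model (row), average its gap to the best model per dataset (column)."""
    losses = np.asarray(losses)
    return np.mean(losses - losses.min(axis=0, keepdims=True), axis=1)

# Toy example with 3 models and 2 datasets
print(average_column([[0.20, 0.10],
                      [0.18, 0.12],
                      [0.25, 0.11]]))   # row-wise average distance to the column minimum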

According to Table 2, our augmented deep forecasting models using both versions of the
VAE - combined (VAE-C) and generated (VAE-G) - outperform all other types of statistical
augmentation methods (MBB and DBA) as well as the deep forecasting benchmark (LSTM)
in all datasets. Moreover, our proposed models obtain lower error than statistical benchmark
forecasting methods in all datasets except the M3 and US-Births. The average of our error in

all datasets, the Average column, shows a dramatic improvement of VAE-C and VAE-G (at

a rate of almost ten times) compared to all other benchmarks. It demonstrates the potential
of our augmentation-based deep forecasting model as a robust global forecasting algorithm.
In some cases, VAE cannot capture the underlying representation of the data in its entirety;
therefore, the forecaster performs poorly on the generated data (G). We address this problem

by feeding the combined data (C), which is a mixture of the original data (O) and generated
data (G), to the forecaster. On average, the VAE-C approach performs slightly better than
the VAE-G approach.
These results show that training the forecaster on well-generated synthetic data enhances the accuracy of the test results.

Model          NN5    AUSElec  Electricity  US-Births  Traffic  M3     Average
LSTM           0.182  0.037    0.092        0.023      0.324    0.134  0.014
LSTM(VAE-G)    0.191  0.027    0.090        0.022      0.270    0.136  0.005
LSTM(VAE-C)    0.176  0.028    0.094        0.019      0.274    0.135  0.003
LSTM(DBA-G)    0.368  0.052    0.152        0.041      0.332    0.194  0.072
LSTM(DBA-C)    0.303  0.053    0.126        0.038      0.311    0.164  0.048
LSTM(MBB-G)    0.276  0.037    0.099        0.024      0.316    0.142  0.031
LSTM(MBB-C)    0.238  0.033    0.096        0.023      0.310    0.137  0.022
ETS            0.220  0.081    0.138        0.022      0.309    0.135  0.033
MAPA           0.203  0.083    0.128        0.020      0.304    0.132  0.027
ARIMA          0.244  0.072    0.104        0.011      0.306    0.140  0.028
TBATS          0.206  0.065    0.138        0.022      0.310    0.136  0.028
STHETA         0.220  0.081    0.137        0.020      0.312    0.132  0.033

Table 2: sMAPE Error Measures
Based on these results, our approach, LSTM(VAE-C), outperforms the best statistical augmentation algorithm, LSTM(MBB-C), by an average of 86.36% in terms of sMAPE across all datasets. Likewise, the LSTM(VAE-C) approach surpasses the best statistical forecasting algorithm (MAPA) by an average of 88.89% in terms of sMAPE.
Table 3 presents the result of RelAvgRMSE error on six datasets. On average, the VAE-
C model achieves better accuracy in this error measure than all other models. Both the
VAE-based models attain high accuracy in seasonal data like NN5. It can be inferred that
augmentation can enhance the potential of deep forecasting algorithms in forecasting seasonal
datasets. Moreover, MBB-G achieves better results in the AUSElec and Electricity datasets
than all other models because the AUSElec dataset has a short length that is unsuitable for
a VAE to capture the latent space adequately. On the other hand, the Electricity dataset has
numerous tail data, and a Gaussian latent space cannot capture them effectively. Changing the
latent space distribution from Gaussian to log-normal distribution for the Electricity dataset
is effective in achieving better results than DBA but not compared to MBB. For the US-
Birth, traffic, and M3 datasets, the ARIMA model, DBA-G and MAPA outperformed all

20

This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=4251343
other models, respectively. On average, it is evident that VAE models have better accuracy
than statistical methods and augmentation-based deep models.

Model          NN5    AUSElec  Electricity  US-Births  Traffic  M3     Average
LSTM           0.409  0.512    0.619        0.587      1.038    0.868  0.156
LSTM(VAE-G)    0.433  0.451    0.600        0.564      0.862    0.861  0.112
LSTM(VAE-C)    0.389  0.465    0.616        0.430      0.859    0.873  0.089
LSTM(DBA-G)    0.827  0.768    0.928        0.838      0.772    0.919  0.325
LSTM(DBA-C)    0.618  0.758    0.761        0.879      0.859    0.905  0.280
LSTM(MBB-G)    0.711  0.324    0.571        0.552      0.822    0.914  0.132
LSTM(MBB-C)    0.525  0.410    0.599        0.509      0.881    0.883  0.118
ETS            0.494  0.922    1.031        0.499      1.054    0.808  0.285
MAPA           0.471  1.129    0.952        0.492      1.048    0.799  0.299
ARIMA          0.595  1.013    0.678        0.245      1.049    0.806  0.214
TBATS          0.472  0.782    1.003        0.679      1.042    0.817  0.283
STHETA         0.504  1.090    1.013        0.464      1.074    0.804  0.308

Table 3: RelAvgRMSE Error Measures

In this table, our LSTM(VAE-C) approach outperforms the best statistical augmentation algorithm, LSTM(MBB-C), by an average of 24.57% in terms of RelAvgRMSE across all datasets, and surpasses the best purely statistical forecasting algorithm (ARIMA) by an average of 58.41%.

Finally, Table 4 presents the MASE results of our proposed models on the six datasets. VAE-C achieves the best MASE on average; indeed, averaged over all three error measures and all datasets, it is the best global forecasting model. In detail, VAE-G outperforms all other models on the AUSElec and Electricity datasets, ARIMA obtains the best result for the US-Births dataset, and the best results for the Traffic and M3 datasets are obtained by the DBA-G and STHETA models, respectively. Finally, for the NN5 dataset, VAE-C shows the best result, while the DBA-based methods perform poorly.

One drawback is that the forecaster cannot achieve results on datasets with short time series that are as good as those on datasets with long time series; this might be addressed by augmentation that lengthens each series. As with the previous error measures, on average both VAE models are more accurate than the statistical methods and the statistical augmentation-based deep models.

Model          NN5    AUSElec  Electricity  US-Births  Traffic  M3     Average
LSTM           0.728  0.466    0.782        0.820      2.043    1.433  0.254
LSTM(VAE-G)    0.904  0.405    0.778        0.755      1.631    1.470  0.199
LSTM(VAE-C)    0.696  0.427    0.782        0.698      1.667    1.471  0.166
LSTM(DBA-G)    1.639  0.787    1.353        1.461      1.077    1.923  0.582
LSTM(DBA-C)    1.612  0.971    1.099        1.359      1.355    1.686  0.556
LSTM(MBB-G)    0.893  0.515    0.831        0.765      1.420    1.536  0.202
LSTM(MBB-C)    0.841  0.483    0.814        0.778      1.587    1.474  0.205
ETS            0.920  1.154    1.409        0.848      2.168    1.435  0.531
MAPA           0.857  1.172    1.231        0.761      2.105    1.382  0.460
ARIMA          1.024  1.007    0.833        0.425      2.128    1.430  0.350
TBATS          0.866  0.940    1.650        0.836      2.173    1.482  0.533
STHETA         0.934  1.153    1.378        0.771      2.197    1.366  0.509

Table 4: MASE Error Measures
In this table, our LSTM(VAE-C) approach outperforms the best statistical augmentation algorithm, LSTM(MBB-G), by an average of 17.82% in terms of MASE across all datasets, and surpasses the best purely statistical forecasting algorithm (ARIMA) by an average of 52.57%.

Overall, the VAE shows its ability to learn a representation of the data through hierarchies of LSTM layers and to generate new time series by sampling from the learned latent space. In contrast, the MBB method does not change the trend and seasonality of a series when creating new ones; it only manipulates the residual component. The DBA method generates a new time series by averaging a set of existing time series. In neither case is there a learning process or a data-driven approach to generating the datasets.
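To make the contrast concrete, the sketch below shows a moving-block-bootstrap-style augmentation that resamples only the remainder of an additive decomposition, leaving trend and seasonality untouched. It is a simplified illustration built on statsmodels' STL, not the exact MBB implementation used as a benchmark here.

import numpy as np
from statsmodels.tsa.seasonal import STL

def mbb_augment(y, period, block_size=8, seed=0):
    """Generate one bootstrapped series: keep trend + seasonality, resample the remainder in blocks."""
    rng = np.random.default_rng(seed)
    stl = STL(np.asarray(y, float), period=period).fit()
    resid = np.asarray(stl.resid)
    n = len(resid)
    # Stitch together randomly chosen, overlapping blocks of residuals.
    starts = rng.integers(0, n - block_size, size=n // block_size + 1)
    new_resid = np.concatenate([resid[s:s + block_size] for s in starts])[:n]
    return np.asarray(stl.trend) + np.asarray(stl.seasonal) + new_resid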
We also check the significance of the differences between the proposed VAE-based forecasting methods and the benchmark methods. The overall Friedman rank sum test for sMAPE yields a p-value of 2.87 × 10^-11, indicating significant differences between the VAE-based methods and the benchmarks on average across all datasets. The corresponding tests for the RelAvgRMSE and MASE measures yield p-values of 1.61 × 10^-5 and 8.89 × 10^-7, respectively. Based on the test results for these three error measures, we establish that the VAE augmentation-based methods can, on average, significantly improve time series forecasting compared to the other augmentation and forecasting methods.
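The Friedman rank sum test itself is available in SciPy. The snippet below illustrates the procedure on a subset of the sMAPE values from Table 2 (one row per dataset, one column per method); because the paper's reported p-values are computed over the full set of methods, this small example will not reproduce them exactly.

import numpy as np
from scipy.stats import friedmanchisquare

# sMAPE values from Table 2: rows = datasets, columns = a subset of the methods.
#          VAE-C  VAE-G  MBB-C  DBA-C   ETS   ARIMA
errors = np.array([
    [0.176, 0.191, 0.238, 0.303, 0.220, 0.244],  # NN5
    [0.028, 0.027, 0.033, 0.053, 0.081, 0.072],  # AUSElec
    [0.094, 0.090, 0.096, 0.126, 0.138, 0.104],  # Electricity
    [0.019, 0.022, 0.023, 0.038, 0.022, 0.011],  # US-Births
    [0.274, 0.270, 0.310, 0.311, 0.309, 0.306],  # Traffic
    [0.135, 0.136, 0.137, 0.164, 0.135, 0.140],  # M3
])

# Each column (method) is one related sample; the datasets act as blocks.
stat, p_value = friedmanchisquare(*errors.T)
print(f"Friedman chi-square = {stat:.2f}, p-value = {p_value:.4f}")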

5.1. Reproducibility of results

The code used in this study is available in our Python library, AugmentTS, developed for time series forecasting using augmentation: https://github.com/DrSasanBarak/AugmentTS. The results are reproducible using this library.

6. Feature-based analysis of augmentation

We empirically show the advantages of the proposed deep augmentation-based forecasting method with respect to the two benchmark augmentation methods. The proposed VAE method is better than the benchmarks at increasing the forecasting capability of the hybrid CNN-LSTM network, averaged over all three metrics. We now discuss whether the proposed method captures a robust representation of the data.

General data quality describes the overall distributional similarity between the real and synthetic datasets. The similarity of marginal distributions, the performance of regression models, and the predictive performance of machine learning models trained on the synthetic dataset and tested on the real dataset have been proposed to quantify the representativeness of a synthetic dataset (Arnold & Neunhoeffer (2021)). According to our results, the deep forecaster trained on the VAE's synthetic time series achieves higher accuracy on the real test data than the forecasters trained on MBB and DBA data.
This section proposes a new method to verify how similar the synthetic data are to the original dataset. One way to analyse a time series is to represent it by its features: a time series feature is a number that encodes rich information about the series, such as trend or seasonality. For a comprehensive overview, see chapter 4 of Hyndman & Athanasopoulos (2021). Fulcher (2018, chapter 2) uses time series features to define a measure of similarity between pairs of time series and argues how similarity can be determined from these features. For this purpose, we calculate 12 common time series features and plot their distributions; all of the features have previously been used in a forecasting context (Hyndman et al. (2015); Talagala et al. (2018); Montero-Manso et al. (2020)) and are described in Appendix A. As illustrated in Figure B.1a, for the AUSElec dataset, the VAE and DBA time series have a distribution similar to the original data, which suggests that the series generated by VAE and DBA are very similar to the original dataset in the 12 selected features. The density distribution of the MBB method, however, is different and shows weaker performance.

For the Electricity dataset, as shown in Figure B.1b, the MBB method works as well as the other two methods, but for some features, such as x_acf10, VAE covers the feature density better than the other methods. For all M3 datasets, VAE keeps the synthetic time series features closer to those of the real dataset than DBA and MBB do; see, for instance, the unitroot_kpss feature. DBA, by contrast, only weakly follows the original feature distributions. In all M3 datasets - monthly, quarterly, and yearly - DBA tends to have higher entropy on average than VAE and MBB. Entropy measures the forecastability of a time series: lower entropy indicates a higher signal-to-noise ratio and better forecastability. Figure B.1e empirically shows this, with a higher entropy feature in the dataset with lower predictability. However, analysing the entropy feature alone is not a robust way to estimate forecasting performance and should be accompanied by other approaches; for instance, in the Traffic (B.1f) and NN5 (B.1c) datasets, the average entropy of MBB is higher than that of DBA, yet the forecasting error of MBB is lower than that of DBA.


For the Traffic and NN5 datasets, VAE shows promising results in creating a distribution similar to the original data compared with DBA and MBB. For example, in the Traffic dataset, as illustrated in Figure B.1f, VAE performs better in covering the density of the diff1_acf10 feature. For the NN5 dataset, as illustrated in Figure B.1c, MBB and DBA perform similarly to VAE in covering some features, e.g., unitroot_kpss and nonlinearity, but VAE covers the density of the diff1_acf10 and arch_r2 features better than the other two methods. In general, this figure shows that data augmentation techniques that generate time series with characteristics similar to the original dataset achieve better results than those that generate time series with diverse characteristics.

To better understand the feature space of the real data compared to the three types of augmentation (VAE, MBB, and DBA), we project it into a two-dimensional space with t-distributed stochastic neighbour embedding (t-SNE) in Figure 3. t-SNE is a nonlinear dimension-reduction approach that retains both local and global structure of the high-dimensional feature space in a single map. To create this figure, we first extract time series features with the EfficientFCParameters() setting of the TsFresh package (Christ et al. (2018)) and then map the extracted features to two dimensions using t-SNE. Each colour in Figure 3 shows the distribution of one dataset in feature space. It can be seen that VAE-generated time series diversify the real data pattern better than the other augmentation methods; however, there is no clear distinction between the methods for more complex and noisy data with outliers.
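The projection in Figure 3 can be reproduced along the following lines. This is a minimal sketch; the long-format column names, the t-SNE hyperparameters, and the NaN handling are assumptions rather than the exact settings used to produce the figure.

import pandas as pd
from tsfresh import extract_features
from tsfresh.feature_extraction import EfficientFCParameters
from sklearn.manifold import TSNE

def feature_space_2d(series_dict, random_state=0):
    """Extract tsfresh features for each series and project them to 2-D with t-SNE."""
    # Long format expected by tsfresh: one row per observation, with a series id and a time index.
    long_df = pd.concat(
        pd.DataFrame({"id": name, "time": range(len(y)), "value": list(y)})
        for name, y in series_dict.items()
    )
    feats = extract_features(long_df, column_id="id", column_sort="time",
                             column_value="value",
                             default_fc_parameters=EfficientFCParameters())
    feats = feats.dropna(axis=1)  # drop features that could not be computed
    tsne = TSNE(n_components=2, perplexity=min(30, len(feats) - 1),
                random_state=random_state)
    coords = tsne.fit_transform(feats.values)
    return pd.DataFrame(coords, index=feats.index, columns=["dim1", "dim2"])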
To improve the current augmentation technique, we use a more complicated probability distribution in the latent space. Choosing a log-normal latent-space distribution yields better performance on some datasets, particularly those containing numerous tail observations or complex underlying representations; heavy-tailed distributions can therefore improve performance on some complex datasets. Based on Figure 3b, the Electricity dataset contains numerous tail observations, which are hard for a VAE with a Gaussian latent space to capture. Heavy-tailed distributions, such as the log-normal, gamma, and Student's t-distributions, model tail data better than light-tailed distributions. Using a log-normal latent-space distribution gives the VAE a more robust performance on the sparse Electricity dataset.
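A simple way to obtain a log-normal latent space while keeping the reparameterization trick is to exponentiate a Gaussian sample. The sketch below is a minimal PyTorch illustration of that sampling step only; the framework and the surrounding encoder/decoder architecture are assumptions, not a description of our exact implementation.

import torch

def sample_lognormal_latent(mu, log_sigma):
    """Reparameterized draw from a log-normal latent: z = exp(mu + sigma * eps), eps ~ N(0, I)."""
    eps = torch.randn_like(mu)
    gaussian_z = mu + torch.exp(log_sigma) * eps   # usual Gaussian reparameterization
    return torch.exp(gaussian_z)                   # exponentiate -> heavier right tail than Gaussian

# Hypothetical use inside a VAE forward pass:
# mu, log_sigma = encoder(x)
# z = sample_lognormal_latent(mu, log_sigma)
# x_hat = decoder(z)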

[Figure 3 (image): t-SNE projection of the time series feature space, two components per dataset. Panels: (a) AUS Electricity, (b) Electricity, (c) NN5, (d) M3Monthly, (e) M3Quarterly, (f) M3Yearly, (g) Traffic.]

Figure 3: Two components of time series features space
7. Conclusion
Using deep forecasting models with a limited number of time series may weaken the deep model's learning and decrease forecasting accuracy. This paper proposes a novel use of the VAE in time series forecasting. The VAE augments the original time series by learning their underlying distribution through an encoded latent space and hierarchical neural network layers, which provides much clearer time series characteristics for training deep models. Moreover, we show that the VAE produces more realistic data than the MBB and DBA methods, based on analysing the distributions of the time series meta-features.

We develop a Python library, AugmentTS, that augments time series data using deep generative models, visualises the latent space of the generative models, and forecasts time series using deep neural networks. The hybrid biLSTM-Conv model is trained using two approaches, the combined and the generated: the combined approach trains on the augmented time series together with the original time series database, while the generated approach trains only on the augmented data. Once the deep model is trained, the acquired network is transferred to the original dataset.

In this paper, we highlight some challenges in capturing the underlying distributions of time series data and discuss how to address them by choosing distributions other than the Gaussian. Our evaluation on real-world time series datasets shows that time series generated from the latent space provide data-driven features for deep learning models and significantly improve forecasting accuracy. We also provide empirical evidence of the approach's efficacy against widely accepted univariate forecasting methods. Furthermore, although there is no clear boundary between the two proposed transfer learning approaches, in many cases the 'combined' strategy is better on more extensive data, while the 'generated' strategy works better on small datasets.

For further studies, we aim to extend our approach to generate multivariate time series data, which is challenging with simple VAE architectures. We also plan to use more complicated latent-space distributions for complex datasets. Additionally, increasing the ratio of sampled synthetic data to original data will be studied in future research.
References

Arnold, C., & Neunhoeffer, M. (2021). Really useful synthetic data – a framework to evaluate the quality of differentially private synthetic data. arXiv:2004.07740.
Assimakopoulos, V., & Nikolopoulos, K. (2000). The theta model: a decomposition approach to forecasting. International Journal of Forecasting, 16, 521–530.
Athanasopoulos, G., Song, H., & Sun, J. A. (2018). Bagging in tourism demand modeling and forecasting. Journal of Travel Research, 57, 52–68.
Bandara, K., Hewamalage, H., Liu, Y.-H., Kang, Y., & Bergmeir, C. (2021). Improving the accuracy of global forecasting models using time series data augmentation. Pattern Recognition, 120, 108148.
Bergmeir, C., Hyndman, R. J., & Benítez, J. M. (2016). Bagging exponential smoothing methods using STL decomposition and Box–Cox transformation. International Journal of Forecasting, 32, 303–312.
Cao, H., Tan, V. Y., & Pang, J. Z. (2014). A parsimonious mixture of Gaussian trees model for oversampling in imbalanced and multimodal time-series classification. IEEE Transactions on Neural Networks and Learning Systems, 25, 2226–2239.
Chalapathy, R., Khoa, N. L. D., & Chawla, S. (2020). Robust deep learning methods for anomaly detection. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 3507–3508).
Christ, M., Braun, N., & Neuffer, J. (2018). Tsfresh. https://github.com/blue-yonder/tsfresh.
Crone, S. F. (2009). NN5 forecasting competition. http://www.neural-forecasting-competition.com/NN5/datasets.htm. Accessed: 2012-8-13.
Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., & Le, Q. V. (2019). AutoAugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 113–123).
Davydenko, A., & Fildes, R. (2013). Measuring forecasting accuracy: the case of judgmental adjustments to SKU-level demand forecasts. International Journal of Forecasting, 29, 510–522. doi:10.1016/j.ijforecast.2012.09.002.
De Livera, A. M., Hyndman, R. J., & Snyder, R. D. (2011). Forecasting time series with complex seasonal patterns using exponential smoothing. Journal of the American Statistical Association, 106, 1513–1527.
Demir, S., Mincev, K., Kok, K., & Paterakis, N. G. (2021). Data augmentation for time series regression: Applying transformations, autoencoders and adversarial networks to electricity price forecasting. Applied Energy, 304, 117695.
DeVries, T., & Taylor, G. W. (2017). Dataset augmentation in feature space. arXiv preprint arXiv:1702.05538.
Esteban, C., Hyland, S. L., & Rätsch, G. (2017). Real-valued (medical) time series generation with recurrent conditional GANs. arXiv preprint arXiv:1706.02633.
Fawaz, H. I., Forestier, G., Weber, J., Idoumghar, L., & Muller, P.-A. (2019). Deep learning for time series classification: a review. Data Mining and Knowledge Discovery, 33, 917–963.
Figurnov, M., Mohamed, S., & Mnih, A. (2018). Implicit reparameterization gradients. Advances in Neural Information Processing Systems, 31.
Fiorucci, J. A., Louzada, F., Yiqi, B., & Fiorucci, M. J. A. (2016). Package ‘forectheta’.
Fons, E., Dawson, P., Zeng, X.-j., Keane, J., & Iosifidis, A. (2021). Adaptive weighting scheme for automatic time-series data augmentation. arXiv preprint arXiv:2102.08310.
Forestier, G., Petitjean, F., Dau, H. A., Webb, G. I., & Keogh, E. (2017a). Generating synthetic time series to augment sparse datasets. In 2017 IEEE International Conference on Data Mining (ICDM) (pp. 865–870). IEEE.
Forestier, G., Petitjean, F., Dau, H. A., Webb, G. I., & Keogh, E. (2017b). Generating synthetic time series to augment sparse datasets. In Data Mining (ICDM), 2017 IEEE International Conference on (pp. 865–870). IEEE.
Fulcher, B. D. (2018). Feature-based time-series analysis. In Feature Engineering for Machine Learning and Data Analytics (pp. 87–116). CRC Press.
Gamboa, J. C. B. (2017). Deep learning for time-series analysis. arXiv preprint arXiv:1701.01887.
Gao, J., Song, X., Wen, Q., Wang, P., Sun, L., & Xu, H. (2020). RobustTAD: Robust time series anomaly detection via decomposition and convolutional neural networks. arXiv preprint arXiv:2002.09545.
Han, Z., Zhao, J., Leung, H., Ma, K. F., & Wang, W. (2019). A review of deep learning models for time series prediction. IEEE Sensors Journal, 21, 7833–7848.
Hasibi, R., Shokri, M., & Dehghan, M. (2019). Augmentation scheme for dealing with imbalanced network traffic classification using deep learning. arXiv preprint arXiv:1901.00204.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–778).
Hsu, D. (2017). Time series forecasting based on augmented long short-term memory. CoRR, abs/1707.00666. URL: http://arxiv.org/abs/1707.00666.
Hyndman, R., & Athanasopoulos, G. (2021). Forecasting: Principles and Practice, 3rd edition. OTexts: Melbourne, Australia.
Hyndman, R., Athanasopoulos, G., Bergmeir, C., Caceres, G., Chhay, L., O’Hara-Wild, M., Petropoulos, F., Razbash, S., Wang, E., & Yasmeen, F. (2019). forecast: Forecasting functions for time series and linear models. URL: http://pkg.robjhyndman.com/forecast. R package version 8.10.
Hyndman, R., Koehler, A. B., Ord, J. K., & Snyder, R. D. (2008). Forecasting with Exponential Smoothing: The State Space Approach. Springer Science & Business Media.
Hyndman, R. J., & Khandakar, Y. (2008). Automatic time series forecasting: the forecast package for R. Journal of Statistical Software, 27, 1–22.
Hyndman, R. J., & Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22, 679–688.
Hyndman, R. J., Wang, E., & Laptev, N. (2015). Large-scale unusual time series detection. In 2015 IEEE International Conference on Data Mining Workshop (ICDMW) (pp. 1616–1619). IEEE.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37, 183–233. doi:10.1023/A:1007665907178.
Kang, Y., Hyndman, R. J., & Li, F. (2020). GRATIS: Generating time series with diverse and controllable characteristics. Statistical Analysis and Data Mining: The ASA Data Science Journal, 13, 354–376.
Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. arXiv:1312.6114.
Koller, D., & Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. Adaptive Computation and Machine Learning. MIT Press. URL: https://books.google.co.in/books?id=7dzpHCHzNQ4C.
Kourentzes, N., & Petropoulos, F. (2014). MAPA: Multiple aggregation prediction algorithm.
Kourentzes, N., & Petropoulos, F. (2018). MAPA: Multiple Aggregation Prediction Algorithm. URL: https://CRAN.R-project.org/package=MAPA. R package version 2.0.4.
Lai, G. (2017a). Electricity hourly dataset. https://github.com/laiguokun/multivariate-time-series-data.
Lai, G. (2017b). Traffic hourly dataset. https://github.com/laiguokun/multivariate-time-series-data.
Laptev, N., Yu, J., & Rajagopal, R. (2018). Reconstruction and regression loss for time-series transfer learning. In Proc. SIGKDD MiLeTS.
Le Guennec, A., Malinowski, S., & Tavenard, R. (2016). Data augmentation for time series classification using convolutional neural networks. In ECML/PKDD Workshop on Advanced Analytics and Learning on Temporal Data.
Lee, T. E. K., Kuah, Y., Leo, K.-H., Sanei, S., Chew, E., & Zhao, L. (2019). Surrogate rehabilitative time series data for image-based deep learning. In 2019 27th European Signal Processing Conference (EUSIPCO) (pp. 1–5). IEEE.
Makridakis, S., & Hibon, M. (2000). The M3-Competition: results, conclusions and implications. International Journal of Forecasting, 16, 451–476.
Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2020a). The M4 Competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting, 36, 54–74.
Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2020b). The M5 accuracy competition: Results, findings and conclusions. International Journal of Forecasting.
Montero-Manso, P., Athanasopoulos, G., Hyndman, R. J., & Talagala, T. S. (2020). FFORMA: Feature-based forecast model averaging. International Journal of Forecasting, 36, 86–92.
Naesseth, C., Ruiz, F., Linderman, S., & Blei, D. (2017). Reparameterization gradients through acceptance-rejection sampling algorithms. In Artificial Intelligence and Statistics (pp. 489–498). PMLR.
Nishizaki, H. (2017). Data augmentation and feature extraction using variational autoencoder for acoustic modeling. In 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) (pp. 1222–1227). doi:10.1109/APSIPA.2017.8282225.
O’Hara-Wild, M., Hyndman, R., & Wang, E. (2021). Australian electricity demand dataset. https://cran.r-project.org/package=tsibbledata.
Olson, M., Wyner, A. J., & Berk, R. (2018). Modern neural networks generalize on small data sets. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (pp. 3623–3632).
Park, D. S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E. D., & Le, Q. V. (2019). SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779.
Pruim, R., Kaplan, D., & Horton, N. (2020). US births dataset. https://cran.r-project.org/package=mosaicData.
Ranganath, R., Gerrish, S., & Blei, D. M. (2013). Black box variational inference. arXiv:1401.0118.
Ribeiro, M., Grolinger, K., ElYamany, H. F., Higashino, W. A., & Capretz, M. A. (2018). Transfer learning with seasonal and trend adjustment for cross-building energy forecasting. Energy and Buildings, 165, 352–363.
Shorten, C., & Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep learning. Journal of Big Data, 6, 1–48.
Smyl, S., & Kuber, K. (2016). Data preprocessing and augmentation for multiple short time series forecasting with recurrent neural networks. In 36th International Symposium on Forecasting.
Svetunkov, I. (2019). smooth: Forecasting using state space models. R package version 2.
Talagala, T. S., Hyndman, R. J., Athanasopoulos, G. et al. (2018). Meta-learning how to forecast time series. Monash Econometrics and Business Statistics Working Papers, 6, 16.
Ullah, U., Xu, Z., Wang, H., Menzel, S., Sendhoff, B., & Bäck, T. (2020). Exploring clinical time series forecasting with meta-features in variational recurrent models. In 2020 International Joint Conference on Neural Networks (IJCNN) (pp. 1–9). IEEE.
Wang, Q., Meng, F., & Breckon, T. P. (2020). Data augmentation with norm-VAE for unsupervised domain adaptation. arXiv:2012.00848.
Wen, Q., Sun, L., Yang, F., Song, X., Gao, J., Wang, X., & Xu, H. (2020). Time series data augmentation for deep learning: A survey. arXiv preprint arXiv:2002.12478.
Wen, T., & Keyes, R. (2019). Time series anomaly detection using convolutional neural networks and transfer learning. arXiv preprint arXiv:1905.13628.
Wu, Z., Wang, S., Qian, Y., & Yu, K. (2019). Data augmentation using variational autoencoder for embedding based speaker verification. In INTERSPEECH (pp. 1163–1167).
Yoon, J., Jarrett, D., & Van der Schaar, M. (2019). Time-series generative adversarial networks.
Zeroual, A., Harrou, F., Dairi, A., & Sun, Y. (2020). Deep learning methods for forecasting COVID-19 time-series data: A comparative study. Chaos, Solitons & Fractals, 140, 110121.
Appendix A. List of Time Series Features

Here, we describe the 12 time series features used in the discussion part of the paper.

Trend: It measures the strength of the trend in the time series; for non-trendy time series, the value is close to 0. The trend strength is calculated as
\[
\text{Trend} = 1 - \frac{\operatorname{Var}(e_t)}{\operatorname{Var}(f_t + e_t)}, \tag{A.1}
\]
where f_t and e_t are the smoothed trend component and the remainder of the time series, respectively.

Unitroot_pp: The statistic for the "Z-alpha" version of the Phillips-Perron (PP) unit root test, with a constant, a trend, and lag one.

Entropy: The spectral entropy is the Shannon entropy
\[
-\int_{-\pi}^{\pi} \hat{f}(\lambda) \log \hat{f}(\lambda)\, d\lambda, \tag{A.2}
\]
where \hat{f}(\lambda) is an estimate of the spectral density of the data. It quantifies how forecastable a time series is: a low value indicates a high signal-to-noise ratio, and a large value occurs when a series is difficult to forecast.

X_acf10: The sum of squares of the first ten autocorrelation coefficients of the time series.

Spike: A spike in a time series is any time point with a residual value greater than two times the standard deviation of the residuals. This feature measures the spikiness of a time series.

Diff1_acf10: The sum of squares of the first ten autocorrelation coefficients of the differenced series.

Arch_r2: The R^2 value of an autoregressive (AR) model applied to the pre-whitened time series. Pre-whitening means removing the mean, trend, and AR information from the time series.

Garch_r2: GARCH stands for generalised autoregressive conditional heteroskedasticity. Garch_r2 is the R^2 value of an AR model applied to {z_t^2}, where z_t are the residuals after fitting a GARCH(1,1) model to the pre-whitened time series.

Curvature: The curvature of a time series, calculated from the coefficients of an orthogonal quadratic regression.

Unitroot_kpss: In the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test, the null hypothesis is that the data are stationary, and we look for evidence that this hypothesis is false; small p-values suggest the time series is not stationary. The unitroot_kpss feature is the statistic for the KPSS unit root test with a linear trend and lag one.

Linearity: The linearity of a time series, calculated from the coefficients of an orthogonal quadratic regression.

Nonlinearity: The nonlinearity coefficient is calculated by modifying the statistic used in Teräsvirta's nonlinearity test. The test uses the statistic X^2 = T log(SSE1/SSE0), where SSE1 and SSE0 are the sums of squared residuals from a nonlinear and a linear autoregression, respectively. Nonlinearity takes large values when the series is nonlinear and values near 0 when it is linear.
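Several of these features can be computed with standard Python tooling. The sketch below is a minimal illustration (not the exact feature package used for the figures) of the trend strength, x_acf10, and a normalised spectral entropy; the STL period and the normalisation of the entropy are assumptions.

import numpy as np
from scipy.signal import periodogram
from statsmodels.tsa.seasonal import STL
from statsmodels.tsa.stattools import acf

def trend_strength(y, period):
    """Trend = 1 - Var(remainder) / Var(trend + remainder), from an STL decomposition."""
    stl = STL(np.asarray(y, float), period=period).fit()
    return max(0.0, 1.0 - np.var(stl.resid) / np.var(stl.trend + stl.resid))

def x_acf10(y):
    """Sum of squares of the first ten autocorrelation coefficients."""
    return float(np.sum(acf(np.asarray(y, float), nlags=10)[1:] ** 2))

def spectral_entropy(y):
    """Shannon entropy of the normalised periodogram (normalised to [0, 1], unlike Eq. (A.2))."""
    _, psd = periodogram(np.asarray(y, float))
    psd = psd[psd > 0]
    p = psd / psd.sum()
    return float(-(p * np.log(p)).sum() / np.log(len(p)))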

Appendix B. Distribution of time series features

[Figure B.1 (images): density distributions of the 12 time series features (trend, unitroot_pp, entropy, x_acf10, spike, diff1_acf10, arch_r2, garch_r2, curvature, unitroot_kpss, linearity, nonlinearity) for the Original, VAE, MBB, and DBA series. Panels: (a) AUS Electricity, (b) Electricity, (c) NN5, (d) M3Monthly, (e) M3Quarterly, (f) San Francisco Traffic.]

Figure B.1: Time Series Features' Distribution
