BACHELOR THESIS

Author: Jonas Fontana
Supervisors: Abdelouahab Khelifati, Dr. Mourad Khayati, Prof. Dr. Philippe Cudré-Mauroux

eXascale Infolab
Department of Informatics

Abstract
With the explosion of time series data, numerous studies have focused on developing tools to analyze them. However, training, improving, and evaluating these methods requires a lot of data. Large datasets are often very homogeneous, and in addition, for many domain-specific applications, little or no data is available. To this end, various synthetic time series augmentation techniques have been proposed. These techniques rely on different principles and pursue different goals when generating new data, which makes it difficult to choose the best technique for a given use case.
In this thesis, we study six augmentation techniques: Anomalies Injection, DBA, autoregressive models, GAN, InfoGAN, and TimeGAN. We empirically evaluate these techniques using datasets that exhibit different characteristics (for example, some have very evident cyclical patterns, while others contain anomalies). In addition, we introduce a new tool to automate the evaluation, parameterize the techniques, and visualize the generated time series. We conclude that machine learning methods are in most cases the best option, but that simple methods (such as DBA and Anomalies Injection) and methods based on autoregressive models are also valid options in some situations.
Contents

Abstract  iii

1 Introduction  1
  1.1 Motivation  1
  1.2 Problem Definition  1
  1.3 Existing Work  2

2 Background  3
  2.1 Time Series Properties and Features  3
    2.1.1 Spectral Entropy  3
    2.1.2 Distribution  4
    2.1.3 Autocorrelation  4
  2.2 Time Series Evaluation Metrics  4
    2.2.1 Mutual Information  4
    2.2.2 Correlation  5
    2.2.3 Root Mean Squared Error (RMSE)  6

4 Experiments  13
  4.1 Setup  13
    4.1.1 Machine  13
    4.1.2 Datasets  13
    4.1.3 Metrics  15
  4.2 Results  16
    4.2.1 Statistical metrics  17
    4.2.2 Distribution-based metrics  18
    4.2.3 Information-based metrics  18
    4.2.4 Runtime  19
  4.3 Recommendations  21

5 Augmentation Framework  25
  5.1 Implementation  25
  5.2 Scenarios  26

6 Conclusion  27

Bibliography  29
List of Figures
5.1 Use example: searching for the best algorithm combining the 3 scenarios 26
Chapter 1
Introduction
1.1 Motivation
In the last decade, there has been an explosion of time series data. From financial markets to river and sea levels, medical data, and speech analysis: time series are everywhere. Every sequence of values recorded in time order forms a time series. Consequently, a large number of time series analysis methods are appearing for tasks such as classification, forecasting, anomaly detection, and more. However, training, improving, and evaluating these methods requires a lot of data. Paradoxically, large datasets are often very homogeneous, which makes it challenging to train a model. In addition, for many domain-specific applications, little or no data is available. For this reason, new time series augmentation methods are being created as well.
In this thesis, we organize the different methods and propose a way to evaluate them. We also present a tool to facilitate this task and show an example of how to use it with different datasets and with different metrics.
The rest of this thesis is organized as follows: in (2.2 Time Series Evaluation Metrics) we will present a way to evaluate time series, and in (3 Time Series Generation Methods) we will show some implementations for each of the discussed classes. The results of the analysis performed are presented in (4 Experiments). Finally, in (5 Augmentation Framework) we will present a tool we created to facilitate the use and the comparison of the techniques considered.
Chapter 2
Background
In this chapter, we will briefly discuss the major techniques used to evaluate time series. We will introduce a list of time series features and metrics based on these features, which we will use to evaluate the generated time series. We will explain how they work and discuss why we consider them important.
2.1.2 Distribution
Another important characteristic of a time series is the way in which its values are distributed. If the values follow a normal (Gaussian) distribution, its mean and variance are enough to fully describe it. However, most of the time this is not the case. To show the distribution of a time series, we can therefore partition the range of possible values into bins and count how many points fall into each one. It is then possible to plot the results using a histogram, as shown in Figure 2.1a and Figure 2.1b.
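This binning can be sketched in a few lines of NumPy (the noisy sine wave below is a hypothetical stand-in for a real series, not one of the thesis datasets):

```python
import numpy as np

# Hypothetical stand-in for a real time series: a noisy sine wave.
rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 8 * np.pi, 500)) + rng.normal(0, 0.3, 500)

# Partition the range of values into 20 bins and count the points in each one.
counts, bin_edges = np.histogram(series, bins=20)

# counts[i] holds how many points fall into [bin_edges[i], bin_edges[i+1]);
# plotting counts as a bar chart gives a histogram like Figure 2.1.
```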
FIGURE 2.1: (A) Example of a time series and the corresponding distribution; (B) Example of time series distribution functions
2.1.3 Autocorrelation
Given two variables (in this case, two time series), the correlation between them is a measure of their similarity: it measures how they vary together. The correlation function is defined as

Corr(x, y) = Cov(x, y) / (σ(x)σ(y))

where Cov(x, y) is the covariance of x and y, and σ(x) is the standard deviation of x. It can take a value between -1 (perfect negative correlation) and +1 (perfect positive correlation).
As the name suggests, the autocorrelation of a time series is its correlation with itself, shifted by n steps. The autocorrelation function measures the correlation of the series with a copy of itself shifted by 1 step, by 2 steps, by 3 steps, and so on. Figure 2.2 shows an example of two autocorrelation functions.
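A minimal NumPy sketch of this shifted-copy computation (using a toy periodic series, not one of the evaluated datasets):

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Correlation of a series with a copy of itself shifted by 1..max_lag steps."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    var = np.dot(x, x)  # lag-0 autocovariance, used for normalization
    return np.array([np.dot(x[:-lag], x[lag:]) / var for lag in range(1, max_lag + 1)])

# Toy series: two full sine cycles, so the period is about 100 samples.
series = np.sin(np.linspace(0, 4 * np.pi, 200))
acf = autocorrelation(series, max_lag=50)
# acf[0] (lag 1) is close to +1; near half the period the series anti-correlates.
```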
2.2.1 Mutual Information
Given two variables v1 and v2, the Mutual Information measures the amount of information we can obtain about v2 by looking at v1 (and vice versa, as it is symmetric).
In this case, given two time series s1 and s2 (the original one and the synthetic one), Mutual Information can be used to measure the similarity between them, thus evaluating the generation algorithm. The Mutual Information (MI) between time series X and Y can be computed as

MI(X, Y) = Σ_{i=1}^{|X|} Σ_{j=1}^{|Y|} p(i, j) · log( p(i, j) / (p(i)·p(j)) )

where p(i, j) = |Xi ∩ Yj| / N is the probability that an object picked at random falls into both classes Xi and Yj ([Scikit-learn: Clustering]), as illustrated by Fig. 2.3.
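The formula above can be estimated from data by replacing p(i, j) with cell frequencies of a joint histogram. A NumPy-only sketch (the bin count of 16 is an arbitrary choice for this illustration):

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Estimate MI by replacing p(i, j) with cell frequencies of a joint histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_ij = joint / joint.sum()
    p_i = p_ij.sum(axis=1, keepdims=True)   # marginal distribution of x
    p_j = p_ij.sum(axis=0, keepdims=True)   # marginal distribution of y
    nz = p_ij > 0                           # 0 * log(0) is taken as 0
    return float(np.sum(p_ij[nz] * np.log(p_ij[nz] / (p_i @ p_j)[nz])))

rng = np.random.default_rng(1)
a, b = rng.normal(size=5000), rng.normal(size=5000)
mi_self = mutual_information(a, a)    # high: a series fully determines itself
mi_indep = mutual_information(a, b)   # near 0: independent series share no information
```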
2.2.2 Correlation
As introduced in 2.1.3, correlation is a measure of the similarity between two variables (in this case, between two time series). It is computed as

Corr(x, y) = Cov(x, y) / (σ(x)σ(y))

A value of +1 means that the two series vary identically, -1 that they vary perfectly oppositely, and 0 that they vary independently from each other. Fig. 2.4 shows three examples of correlation.
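In NumPy this is `np.corrcoef`. A toy check of the boundary values (synthetic series, chosen only to illustrate):

```python
import numpy as np

t = np.linspace(0, 6 * np.pi, 300)
original = np.sin(t)
noisy = original + 0.1 * np.random.default_rng(2).normal(size=300)

corr_noisy = np.corrcoef(original, noisy)[0, 1]         # close to +1: near-identical trends
corr_opposite = np.corrcoef(original, -original)[0, 1]  # -1: perfect negative correlation
```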
Chapter 3
Time Series Generation Methods
In this chapter, for each approach mentioned in 1.3 we will present some concrete techniques. These are the techniques that will then be used to generate synthetic data in 4.
With this algorithm, a new time series is created as the weighted average of the already existing time series. The only parameterization lies in how the weights are distributed. A first approach is to distribute them randomly across all the available time series; theoretically, this would make it possible to generate an infinite number of new time series. However, more advanced approaches exist. According to [Forestier et al., 2017], the best results are obtained when a "main" time series receives most of the weight, and the remainder is distributed across the other series according to an exponential function of their distance: the farther they are from this main time series, the smaller the weight they receive. Formally, the weight for time series i is given by
w_i = exp( ln(0.5) · DTW(T_i, T*) / d*_NN )        (3.1)

where T* is the "main" time series selected and d*_NN is the distance between T* and its nearest neighbor. Once all the weights have been distributed, the weighted average of N time series under DTW is the time series that minimizes
arg min_T Σ_{i=1}^{N} w_i · DTW²(T, T_i)        (3.2)
Fig. 3.3 shows a set of time series (left) and a new time series generated with
DBA starting from this set (right).
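A sketch of the weighting scheme of Eq. 3.1, with a naive dynamic-programming DTW and a toy dataset (both are illustrative assumptions, not the implementation benchmarked in this thesis):

```python
import numpy as np

def dtw(a, b):
    """Naive O(len(a)*len(b)) dynamic-programming DTW distance."""
    cost = np.full((len(a) + 1, len(b) + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[len(a), len(b)]

def exponential_weights(dataset, main_idx):
    """Weights of Eq. 3.1: w_i = exp(ln(0.5) * DTW(T_i, T*) / d*_NN)."""
    dists = np.array([dtw(t, dataset[main_idx]) for t in dataset])
    d_nn = dists[dists > 0].min()   # distance from T* to its nearest neighbor
    return np.exp(np.log(0.5) * dists / d_nn)

# Toy dataset: three noisy copies of a sine wave plus one unrelated random series.
rng = np.random.default_rng(3)
t = np.linspace(0, 2 * np.pi, 60)
dataset = [np.sin(t) + rng.normal(0, s, 60) for s in (0.0, 0.05, 0.1)]
dataset.append(rng.normal(0, 1, 60))
w = exponential_weights(dataset, main_idx=0)
# w[0] == 1 for the main series; its nearest neighbor gets exactly 0.5; the rest, less.
```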
An AR(p) model expresses each value as a linear combination of the p previous values:

X_t = c + φ_1·X_{t-1} + ... + φ_p·X_{t-p} + ε_t

where c is a constant and ε_t is a white-noise term. Based on the existing data, the model fits a coefficient φ for each of the p previous values. Once the model is constructed, it can then be used to generate new time series, one point at a time.
AR models can be extended to more sophisticated models, for example ARMA
(Autoregressive Moving Average), ARIMA (Autoregressive Integrated Moving Av-
erage), or SARIMA (Seasonal Autoregressive Integrated Moving Average). Despite
the increasing number of components, they are all based on the idea that past values
determine the current one.
Algorithm 3 Pseudocode of AR
Input: original dataset, window size n
for time series t in input dataset do
    find n coefficients φi for time series t using regression
    for length of time series to generate do
        generate new point using coefficients φ
    keep only the new points
Output: set of new time series
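The pseudocode above can be realized with plain NumPy least squares. The following is a sketch under the assumption of a simple AR(p) with Gaussian noise, not the exact implementation used in the experiments:

```python
import numpy as np

def fit_ar(series, p):
    """Fit the constant c and coefficients phi_1..phi_p of an AR(p) by least squares."""
    n = len(series)
    X = np.ones((n - p, p + 1))
    for k in range(1, p + 1):
        X[:, k] = series[p - k : n - k]   # column k holds the values lagged by k steps
    coeffs, *_ = np.linalg.lstsq(X, series[p:], rcond=None)
    return coeffs                          # [c, phi_1, ..., phi_p]

def generate_ar(coeffs, seed, length, noise_std=0.0, rng=None):
    """Generate `length` new points one at a time from the fitted coefficients."""
    p = len(coeffs) - 1
    rng = rng or np.random.default_rng()
    out = list(seed[-p:])                  # start from the last p observed values
    for _ in range(length):
        nxt = coeffs[0] + sum(coeffs[k] * out[-k] for k in range(1, p + 1))
        out.append(nxt + (rng.normal(0, noise_std) if noise_std else 0.0))
    return np.array(out[p:])

# Toy check: simulate an AR(2) process, refit it, and generate a new series.
rng = np.random.default_rng(0)
x = np.zeros(2000)
for t in range(2, 2000):
    x[t] = 0.5 + 0.6 * x[t - 1] + 0.3 * x[t - 2] + rng.normal(0, 0.5)
coeffs = fit_ar(x, p=2)        # should land near (0.5, 0.6, 0.3)
synthetic = generate_ar(coeffs, x, length=200, noise_std=0.5, rng=rng)
```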
min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]    ([Goodfellow et al., 2014])
The training succeeds when the discriminator is wrong about 50% of the time.
At this point, the generator should be able to create synthetic data that is indistin-
guishable from the real one. Fig. 3.4a shows a scheme of a GAN model.
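To make the value function concrete, here is a toy NumPy evaluation with a fixed logistic discriminator and a shift-only generator. All names, distributions, and parameters below are illustrative assumptions; this is not a full GAN training loop:

```python
import numpy as np

rng = np.random.default_rng(5)

def discriminator(x):
    """Hypothetical fixed discriminator: logistic score, higher for 'real-looking' inputs."""
    return 1.0 / (1.0 + np.exp(-(x - 1.5)))

def value_fn(mu_g, n=100_000):
    """Empirical V(D, G): real data ~ N(3, 1); the generator shifts noise z ~ N(0, 1) by mu_g."""
    x = rng.normal(3.0, 1.0, n)            # samples from p_data
    g = rng.normal(0.0, 1.0, n) + mu_g     # generator output G(z)
    return np.mean(np.log(discriminator(x))) + np.mean(np.log(1.0 - discriminator(g)))

# The generator minimizes V: moving its output onto the data distribution lowers it.
v_far, v_matched = value_fn(0.0), value_fn(3.0)
```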
3.4.1 TimeGAN
GANs were not created specifically for time series. Indeed, they are mostly used
with images, and therefore are not particularly good when it comes to properties
specific to time series.
For this reason, [Yoon, Jarrett, and Schaar, 2019] developed an alternative called
TimeGAN. The goal of TimeGAN is to capture not only the properties within each
time point, but also across time ([Yoon, Jarrett, and Schaar, 2019]). In other words, it
tries to consider temporal dynamics. To do this, it uses four components (instead of two): an embedding function, a recovery function, a sequence generator, and a sequence discriminator. By training the autoencoding components (the first two) jointly with the adversarial components (the last two), the model should be able to simultaneously learn to encode features, generate representations, and iterate across time ([Yoon, Jarrett, and Schaar, 2019]). In this way, the adversarial components act on the latent space provided by the embedding function, learning the underlying temporal dynamics.
Fig. 3.4b shows a scheme of a TimeGAN model.
3.4.2 InfoGAN
Another evolution of GAN is InfoGAN, presented by [Chen et al., 2016]. The idea behind InfoGAN is to encourage the model to learn meaningful, disentangled representations ([Chen et al., 2016]) through mutual information maximization. According to the authors, the underlying problem is that GANs impose no restrictions on how the generator may use the input vector, which can therefore end up encoding a very entangled representation. However, "many domains naturally decompose into a set of semantically meaningful factors" ([Goodfellow et al., 2014]). For example, the image of a face can be described through eye color, the distance between the eyes, hair length, etc. To encourage this representation, InfoGAN decomposes the input vector into two parts: a component z, which acts as the source of incompressible noise, and a component c, whose structure is meant to enforce the learning of the meaningful features. The Generator thus becomes G(z, c). To ensure that it does not simply ignore the c part, InfoGAN requires G(z, c) to have high mutual information (see 2.2.1) with c. Fig. 3.4c shows a scheme of an InfoGAN model.
Chapter 4
Experiments
In this chapter, we will evaluate the data generated with the techniques presented in 3. First, we will summarize the specifications of the machine we used. After that, we will present the datasets used in the experiments, as well as the metrics considered. We will then analyze the results obtained, and at the end of the chapter we will derive a recommendation on which class of techniques to use in which situation.
4.1 Setup
In this section, we will present the setup of our experiments. In particular, we will describe the machine used, the datasets considered, and the evaluation methods. The algorithms used have already been presented in 3 Time Series Generation Methods.
4.1.1 Machine
For the experimental part, we chose to run our code on a Linux server with the following specifications:
4.1.2 Datasets
In this section, we will briefly describe the datasets used to compare the different generation techniques. They have been chosen to exhibit different characteristics, for example a larger or smaller number of records, a stronger or weaker seasonal pattern, the presence or absence of anomalies/spikes, and the uniformity or variation of the time series.
Alabama_weather
The Alabama_weather dataset contains longer and more varied time series than the considered UCR datasets. The time series in this dataset show no particular trend or seasonality; some contain noise, others much less. To keep the generation time reasonable, only a subset has been considered (limiting the number of time series and their length).
Currency2
As the name suggests, the Currency2 dataset contains a list of currency values. Most of the time series in this dataset have a similar value range, but there are some exceptions. They show no evident cyclical pattern or common trend, and they tend not to change too abruptly.
Alabama_weather_3072
The major drawback of the BasicGAN implementation we found is that it requires a dataset with a specific structure, namely 3072 time series of length 3072. To obtain it, we manually created a dataset by shifting a 3072-point window over a time series of length 6143, with a step of 1. As a consequence, the resulting dataset is extremely uniform, but this allows us to consider the BasicGAN technique as well. The original time series comes from the Alabama_weather dataset.
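This windowing construction can be reproduced in a couple of NumPy lines (using a placeholder series; the real input is the length-6143 Alabama_weather series):

```python
import numpy as np

series = np.arange(6143, dtype=float)  # placeholder for the original length-6143 series

# Slide a 3072-point window with step 1: 6143 - 3072 + 1 = 3072 windows,
# giving exactly the 3072 x 3072 layout that the BasicGAN implementation expects.
windows = np.lib.stride_tricks.sliding_window_view(series, 3072)
```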
Summary
Table 4.1 summarizes the datasets used. Fig. 4.1 shows them, while Fig. 4.2 shows only the first three time series of each dataset.
4.1.3 Metrics
The first step in evaluating the generation algorithms is to choose how to evaluate them. Since our goal is to perform a general (rather than task-specific) analysis, we have decided to select some features of the original time series and observe whether or not they are preserved in the generated ones. In addition, we have also selected some metrics to capture the relation between the original time series and the generated ones. Below is the list of what we decided to consider:
Mean. Although this is arguably the least interesting property we consider, it is a
good way to understand if a method generates data on the same scale as the
real one. If not, it might be necessary to process the generated data before being
able to use it. It is therefore important to know it.
Variance. This is a relatively simple way to compare the distribution of the new time
series with that of the original ones. Alone it would not be enough, but it is still
an important property to check.
Spectral Entropy (see 2.1.1). Applying this to both the original data and the generated one allows us to see which method better preserves the amount of information in the time series. If the generated data has a much lower Spectral Entropy than the original data, it might not be optimal for tasks such as forecasting.
Mutual Information (see 2.2.1) shared between the original data and the synthetic one. This is an important evaluation metric and, in contrast to the previous features, it directly compares the new data with the original one (instead of extracting properties from both and comparing these).
Correlation (see 2.2.2) between the original time series and the synthetic one. A correlation very close to +1 or -1 means that the generated data follows (almost) the same or the exactly opposite values as the original one, with only small variations, and will therefore not be very useful. At the same time, a correlation close to 0 indicates that the generated data is completely different from the original one, and a model trained on it could therefore perform poorly on the real data.
Distribution (see 2.1.2). The distribution is a very important property of time series,
and it is important to evaluate whether or not the generating algorithms are
capable of preserving it.
Autocorrelation (see 2.1.3). The particularity of time series is that time is impor-
tant. In particular, values at a given time might strongly depend on values at
previous times. Autocorrelation measures this dependence, and it is useful to
evaluate how much the time-dependencies are preserved.
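Of these, the Spectral Entropy is the least standard to compute. Below is a NumPy sketch of one common definition, the Shannon entropy of the normalized power spectrum; the exact variant used in the thesis experiments may differ:

```python
import numpy as np

def spectral_entropy(series):
    """Normalized Shannon entropy of the power spectrum: near 0 for a pure tone,
    near 1 for white noise, which spreads its power evenly over all frequencies."""
    psd = np.abs(np.fft.rfft(series - np.mean(series))) ** 2
    psd = psd / psd.sum()
    nz = psd > 0                             # skip empty bins to avoid log(0)
    return -np.sum(psd[nz] * np.log(psd[nz])) / np.log(len(psd))

rng = np.random.default_rng(4)
sine = np.sin(np.linspace(0, 20 * np.pi, 1024))  # almost all power in one frequency
noise = rng.normal(size=1024)                    # power spread across the spectrum
h_sine, h_noise = spectral_entropy(sine), spectral_entropy(noise)
```

The mean and variance above are simply `np.mean` and `np.var` of the series.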
4.2 Results
In this section, we will group the metrics listed in 4.1.3 into logical classes and for
each of them we will analyze the results across all the datasets and all the generation
methods.
Statistical metrics. This group includes the mean and the variance. They are classic statistical measures that give a first insight into the considered data.
Distribution-based metrics. This group includes the correlation with the original dataset, the distribution, and the autocorrelation. They are very important to understand whether the generated data preserves the distribution of the values in the original dataset and, if so, whether it does so by generating completely new data or just by copying the original one.
Information-based metrics. This group includes entropy and mutual information. Both are based on information theory and analyze the amount of information preserved during the generation.
that the data generated with this method is less similar (i.e. has a higher distance)
to the original one than those generated with other methods, but despite this, it
shares a lot of information with it. InfoGAN has very similar results w.r.t. mutual
information, but slightly better preservation of the entropy in the datasets without
spikes.
4.2.4 Runtime
Below we present the times (in seconds) needed to generate synthetic data with every method and for each of the six main datasets considered. The reported time only covers the generation (and, for InfoGAN and TimeGAN, the training), but not the time needed to extract the metrics, plot the results, save the data, or any other task. In addition, note that the values presented are the result of a single experiment. Although it would be more rigorous to do multiple runs for each value and consider the mean, in this case the exact result is not particularly important. What is investigated here is rather the order of growth of each algorithm, and the relation between the methods.
TABLE 4.2: Runtime (in seconds) per method and dataset

Runtime (s)          BeetleFly  Coffee    Ham        Lightning7  Alabama_weather  Currency2
                     (20x512)   (28x286)  (105x431)  (73x319)    (15x2000)        (20x2610)
AnomaliesInjection   3.4792     3.7648    13.2981    10.4487     3.9154           2.4432
AR                   8.6032     4.1136    26.3448    3.3944      57.1649          18.2239
DBA                  5.1887     4.0671    104.6299   33.9407     54.898           143.2891
InfoGAN (200 eps)    222.7651   232.73    1381.9785  400.3487    2576.8138        2659.7723
TimeGAN (200 eps)    301.3007   343.3313  319.0164   288.4910    266.4494         269.8655
From Table 4.2 we can see that AnomaliesInjection, AR, and DBA require very little time compared to GAN-based methods. In addition, their required time is less influenced by the size of the input dataset.
Generation with InfoGAN takes much longer, and the required time depends heavily on the size of the input dataset. For small datasets, the time required by TimeGAN is similar to that of InfoGAN. However, for TimeGAN the time remains constant as the input dataset grows, leading to much faster generation with bigger inputs.
4.3 Recommendations
In this section, we summarize which class of algorithms is better in which situation, based on the results obtained. In particular, we consider three variables: the similarity of the generated data, the size, and the presence of anomalies. In the first case, we consider a generated time series to be similar to the original one when it consists of almost the same values, with only some minor changes. More formally, two similar time series have a high correlation and a small RMSE (Root Mean Squared Error, not analyzed here). The second variable depends on the size of the required output: we distinguish between the case where it is enough to double the initial (original) input, and the case where we need to generate a bigger dataset. Lastly, we distinguish input datasets with anomalies (e.g. 4.1d) from datasets without anomalies (e.g. 4.1b).
The classes of algorithms considered are those just analyzed, meaning simple
methods (including DBA and Anomalies Injection), autoregressive methods (AR),
and ML methods (e.g. GANs).
The recommendations derived are illustrated in Fig. 4.17.
Chapter 5
Augmentation Framework
To automate the experimental part of this work, and to facilitate its replicability, we have implemented a tool that brings together the selected algorithms1. It has been created for Ubuntu 20.04 and is mainly written in Python 3.6. The tool allows the user to generate new time series starting from any dataset in the right format (a .csv file with time series as columns). The datasets considered in this work are included with the code, but users can also add their own.
It is possible to choose among six generation techniques: AnomaliesInjection, AR, DBA, Basic GAN, InfoGAN, and TimeGAN. The tool takes care of hiding the differences in their implementations (different programming languages, input formats, etc.) and standardizes the output format. In addition, upon request, it extracts some metrics from both the new and the original data in order to evaluate the generation.
5.1 Implementation
As mentioned, the sources for the different techniques are very heterogeneous. Some are written in Python, others in Java. Some have a ready-to-use implementation, while others are just a library, or even only an article, and need additional work to be used. In addition, some implementations use different versions of the same library (in particular TensorFlow) or require different input formats. TimeGAN and InfoGAN also produce sub-samples of time series, which need to be reconstructed into full time series to be compared with the other techniques.
Our contribution with this tool is to hide all these differences and make it possible to use the different techniques in a uniform and seamless way. Table 5.1 summarizes these differences.
1 The code and the datasets are publicly available at https://github.com/Fontanjo/TS_generation_benchmark
FIGURE 5.1: Use example: searching for the best algorithm combining the 3 scenarios
5.2 Scenarios
The main goal of the generation tool is to simplify the generation of synthetic time series using different algorithms, as well as the visualization and analysis of the results. Figure 5.1 shows how to use the tool to choose the best algorithm for a given situation, and below is a more specific description of the main use scenarios:
Scenario 1: Visualize datasets. The first thing a user might be interested in is visualizing a dataset. In this scenario, the user can add their own dataset to the framework and use the provided tools to visualize it. In addition, they can visualize the distribution of the dataset, as well as the distribution of each individual time series in it. The same holds for autocorrelation, for which it is possible to generate a single plot for the entire dataset or a separate plot for each time series.
Scenario 2: Analyze datasets. In this scenario, a user studies a dataset based on a set of properties and metrics. It is possible to extract the mean, the spectral entropy, and all the properties described in 2.1. It is also possible to specify two datasets and compute metrics such as mutual information and correlation between them. All these results are stored in a CSV file, and they can be plotted with gnuplot using another provided script.
Scenario 3: Augment datasets. In this scenario, the user has a dataset they want to augment. They can add this dataset to the framework, then select the algorithm they want to use and customize its parameters. For example, they can specify the number of epochs for ML-based methods, or the frequency and type of anomalies for anomalies injection. It is also possible to specify the size of the generated data (number of time series and their length).
Chapter 6
Conclusion
Bibliography