
Data Min Knowl Disc (2016) 30:476–509

DOI 10.1007/s10618-015-0425-y

Time series representation and similarity based on local autopatterns

Mustafa Gokce Baydogan1 · George Runger2

Received: 17 November 2014 / Accepted: 19 June 2015 / Published online: 7 July 2015
© The Author(s) 2015

Abstract Time series data mining has received much greater interest along with the
increase in temporal data sets from different domains such as medicine, finance, multi-
media, etc. Representations are important to reduce dimensionality and generate useful
similarity measures. High-level representations such as Fourier transforms, wavelets,
piecewise polynomial models, etc., were considered previously. Recently, autoregres-
sive kernels were introduced to reflect the similarity of the time series. We introduce
a novel approach to model the dependency structure in time series that generalizes
the concept of autoregression to local autopatterns. Our approach generates a pattern-
based representation along with a similarity measure called learned pattern similarity
(LPS). A tree-based ensemble-learning strategy that is fast and insensitive to parame-
ter settings is the basis for the approach. Then, a robust similarity measure based on
the learned patterns is presented. This unsupervised approach to represent and mea-
sure the similarity between time series generally applies to a number of data mining
tasks (e.g., clustering, anomaly detection, classification). Furthermore, an embedded
learning of the representation avoids pre-defined features and an extraction step which
is common in some feature-based approaches. The method generalizes in a straight-
forward manner to multivariate time series. The effectiveness of LPS is evaluated on
time series classification problems from various domains. We compare LPS to eleven
well-known similarity measures. Our experimental results show that LPS provides fast
and competitive results on benchmark datasets from several domains. Furthermore,
LPS provides a research direction and template approach that breaks from the linear
dependency models to potentially foster other promising nonlinear approaches.

Responsible editor: Eamonn Keogh.

Corresponding author: Mustafa Gokce Baydogan (mustafa.baydogan@boun.edu.tr)

1 Department of Industrial Engineering, Boğaziçi University, Istanbul, Turkey
2 School of Computing, Informatics and Decision Systems Engineering, Arizona State University, Tempe, AZ, USA

Keywords Time series · Similarity · Pattern discovery · Autoregression · Regression tree

1 Introduction

Machine learning on large time series databases has received considerable interest
over the past decade as time series data has increased in applications. An important
challenge for the analysis is the high dimensionality of time series data. Much of
the research has focused on a high-level representation by transforming the original
data to another domain to reduce the dimension (Ratanamahatana et al. 2010). More-
over, the trends, shapes and patterns within the data often provide more information
than the individual values (Ratanamahatana et al. 2010). Consequently, higher-level
representations are also preferred to capture such properties (Lin et al. 2007). These
representations include Fourier transforms, wavelets, piecewise polynomial models,
etc. (Lin et al. 2003). Also, discretization approaches to represent the time series have
become popular in the last decade. For example, symbolic aggregate approximation
(SAX) (Lin et al. 2007, 2012; Shieh and Keogh 2008) is a simple symbolic repre-
sentation for univariate series that segments the series into fixed-length intervals (and
uses a symbol to represent the mean of the values). This representation is similar to
piecewise aggregate approximation (PAA) (Chakrabarti et al. 2002). Overviews of the
time series representation approaches were provided by Fu (2011), Lin et al. (2007),
Ratanamahatana et al. (2010), Wang et al. (2013).
A generative model is another approach, in which the series is represented through the
learned model parameters (Chen et al. 2013; Warren Liao 2005). These approaches are
referred to as “model-based kernels” (Chen et al. 2013). Approaches in this category
assume a parametric model of a certain form. Kernels such as probability product
kernels (Jebara et al. 2004), subsequence kernels (Kuksa and Pavlovic 2010), Fisher
kernels (Jaakkola et al. 1999), etc., implicitly generate a similarity measure after
transforming the series based on a model. Autoregressive (AR) kernels (Cuturi 2011)
in this category assume that there is a linear recurrence relation between the time series
values. AR models focus on the dynamic aspects of time series by specifying that a
value at a specific time depends linearly on previous values.
An efficient and effective similarity search over time series databases is another
important topic for time series learning as such data become ubiquitous. A distance
measure that can properly capture the underlying information and reflect the similarity
of the data is of fundamental importance for a variety of data mining tasks such as
clustering, anomaly detection, classification, etc. (Han and Kamber 2001). See Wang
et al. (2013) for a comprehensive evaluation and comparison of the most popular time
series similarity approaches.
As a parameter-free approach, similarity based on Euclidean distance is very popu-
lar and it was shown to work well for many applications (Wang et al. 2010). Euclidean
distance falls in the category of lock-step measures because it compares the ith value


of one time series to the ith value of another (Wang et al. 2013). This makes Euclidean
distance sensitive to noise, scaling, translation and dilation of the patterns within
the time series. On the other hand, it can perform well for certain applications as the
training data size increases (Wang et al. 2013).
Alternatively, elastic measures compute the similarity invariant to certain nonlinear
variations in the time dimension. This is achieved through the comparison of one-to-
many points as in dynamic time warping (DTW) (Ratanamahatana and Keogh 2005) or
one-to-many/one-to-none points as in longest common subsequence (LCSS) (Latecki
et al. 2005). DTW distance is considered to be strong for many time series data mining
problems (Ratanamahatana and Keogh 2005). Alternative approaches based on the idea
of DTW have also been proposed in the literature. A weighting scheme is used by WDTW
(Jeong et al. 2011) to weight against large warpings. Derivative DTW (DDTW) (Keogh
and Pazzani 2001) makes use of the differences between consecutive time values. Also,
edit-distance-based approaches are shown to be competitive in this domain. Edit distance
with a real penalty (ERP) (Chen et al. 2005), time warp edit (TWE) distance (Marteau
2009) and the move-split-merge (MSM) (Stefan et al. 2013) are some of the successful
approaches in this category.
Definition of the similarity is also critical for “similarity-based kernels” (Lowe
1995). These kernels make use of the similarity information for time series data mining.
For example, a kernel based on DTW is proposed for applications to speech recog-
nition tasks by Cuturi (2011). Similarity-based kernels do not compare the dynamics
directly, but measure alignments between time series (Gaidon et al. 2011). Most of the
time series kernels (both similarity and model-based kernels) attempt to solve certain
invariance problems in the time dimension. Consequently, there are relationships with
the computer vision literature where patches are extracted from the images to account
for certain invariances such as location, scale, etc. Motivated by similar ideas, studies
based on time series segments (Baydogan et al. 2013; Grabocka and Schmidt-Thieme
2014; Lin et al. 2012) were recently proposed in the time series mining literature to
handle the invariances. Time series are characterized by feature vectors derived from
their segments using a bag-of-words (BoW) type of representation (Baydogan et al.
2013).
The Learned Pattern Similarity (LPS) method described here is also motivated by
BoW approaches. LPS first learns a representation from the segments of the time series
in a manner similar to autoregression, and then introduces a similarity measure based on
this representation. In order to illustrate the basics of LPS, we use a synthetic time
series classification problem. Consider a two-class problem in which series from class
one are defined by three peaks, whereas two peaks define class two, regardless of
locations. Figure 1a illustrates 10 time series from each class with a heatmap with
time on the x-axis and the time series on the y-axis. Figure 1b plots the values at time t
versus t + 1 to provide intuition regarding the AR models. A model trained on values
at time t to predict the value at t + 1 is called an AR model of lag 1, AR(1). As can be
seen from the scatter plot, the linearity assumption in the AR models is restrictive. The
peaks in this example are the main reason for the non-linear autoregressive behavior
(i.e., points around x = 2). Hence, we use a nonlinear approach to model the depen-
dencies based on a tree-based learning strategy. This allows our model to capture more
complex dependencies with a robust model (with few parameters, and insensitivity to
their settings). Analogous to autoregression, we view these dependencies as autopatterns.
A regression tree is trained on the data visualized in Fig. 1. The structure of the tree is
provided in Fig. 2a.

Fig. 1 Time series database with 20 instances. The time series are 400 time units long. Values are presented
as a heatmap with time shown on the x-axis and time series values on the y-axis (a). Scatter plot of values
at time t and t + 1 over all the time series (b). Note that there are several overlapping observations at point
(2, 2)
The simple example in Fig. 1 (2 vs. 3 peaks) shows how the trees can encode the
dependency structure in the time series. However, AR modeling has the potential to
miss location information which can be important for some time series analysis problems.
Consider a case where a single predictor segment cannot adequately separate the
classes (e.g., where the locations of the peaks define the classes). Suppose a time series
database has time series of length 100 from two classes. Class 1 is defined by a peak
between time points 1 and 50, whereas Class 2 has the same peak between time points
51 and 100. Obviously, an AR(1) model cannot capture the difference of these series
as it generates the same representation for both classes. In such cases, modeling the
change of the autocorrelation structure over time is important. Consequently, instead
of learning a single tree structure, LPS trains an ensemble of regression trees to account
for multiple predictor segments of multiple lengths. The concept of encoding the local
autopatterns present in the time series through generalized models of autocorrelation
remains the same as the simple example in Fig. 1. However, the segments are allowed
to change in location and length at every split node of the trees in the ensemble. This
is analogous to modeling autocorrelation at multiple lags and at multiple locations
as in autoregressive kernels (Cuturi 2011), but with more expressive models than the
linear autoregressive counterparts. Furthermore, model-based approaches generally
fit models to each individual time series and compare their parameters. Modeling
each series separately is an iterative and potentially time-consuming process. On the
other hand, our LPS approach fits a single auto-pattern model to all series simultane-
ously.
LPS enjoys the benefits of recursive partitioning of the feature space to capture
nonlinear relationships and ensembles to identify behaviors that differ in regions of


Fig. 2 Regression tree trained on observations at time t to predict observations at time point t + 1 (a).
Corresponding terminal node distribution for time series 1 and 11 from classes 1 and 2 respectively (b).
Level differences in the frequency of observations at each terminal node reveal the difference of the time
series

the feature space. One also needs to differentiate the models of individual time series.
In LPS, each series is represented by the distribution of the values assigned to regions
defined by the recursive partitioning (terminal nodes) learned by the trees. Implicitly,
trees learn regions where dependencies are similar. Then, for each time series, the
frequency of the values residing at each terminal node of the learned ensemble is used
in the representation. This is illustrated as a boxplot in Fig. 2b for two time series from
different classes and one tree.
LPS extends to multivariate time series (MTS) in a straightforward manner without
any additional computational cost. Most of the studies on MTS similarity make use
of univariate approaches and weight the distance between each attribute to generate
a final similarity measure. This is common in many gesture recognition (GR) tasks
(Liu et al. 2009). For example, Akl and Valaee (2010), Liu et al. (2009) focused
on GR based on DTW distance. With the high dimensionality introduced by many
attributes and longer series, it becomes difficult to compute the similarity between
multivariate series. Also, the relationship between the attributes is not considered
when the similarity is computed over individual series and this is problematic for
certain applications with interaction effects between the attributes (as discussed by
Baydogan and Runger 2014). Our LPS similarity measure considers the interactions
between the individual attributes of a MTS.
Our approach inherits the properties of the tree-based ensembles. That is, it can
handle numerical, categorical and ordinal data, nonlinear and interactive effects. It is
scale invariant and robust to missing values. Most of the existing time series represen-
tation approaches have problems handling missing values or the data types other than
numeric. LPS can handle dilations and translations of the patterns (i.e., scale and shift
invariance) with the representation learning. These comments apply to both univariate
and multivariate time series. Furthermore, LPS allows for a simple parallel imple-
mentation which makes it computationally efficient. Our approach provides fast and
competitive results on benchmark datasets from the UCR time series database (Keogh
et al. 2011) and other published work (Frank and Asuncion 2010; Hills et al. 2014;
Lines and Bagnall 2014; Olszewski 2012; Rakthanmanon and Keogh 2013; Sübakan
et al. 2014; CMU 2012).


LPS provides a generalized approach to model dependencies that are nonlinear (along
with dilations and translations), extending the concept of autoregression. We
think of these dependencies as local autopatterns in the time series. Thus, LPS provides
a research direction for time series modeling that breaks from the linear dependency
models to potentially foster other promising nonlinear approaches. LPS provides an
example template for the steps to generate nonlinear autopatterns, local in time, rep-
resent time series, and produce similarity measures that can be used in a number of
analysis tasks. This template can be a guide for alternatives that extend upon LPS.
The remainder of this paper is organized as follows. Section 2 provides background
and a summary of related work. Section 3 describes the framework for learning the
patterns and computing the similarity. Section 4 demonstrates the effectiveness and
efficiency of our proposed approach by testing on a full set of benchmark datasets.
Section 5 provides conclusions.

2 Background and related work

A univariate time series, x^n = (x^n(1), x^n(2), . . . , x^n(t), . . . , x^n(T)), is an ordered set
of T values. We assume time series are measured at equally spaced time points. A
time series database, X, stores N univariate time series.

2.1 Autoregressive model

The autoregressive model of lag p, AR(p), is a collection of linear models to predict a
value at time t, x^n(t), based on the previous values x^n(t − 1), x^n(t − 2), . . . , x^n(t − p).
The form of AR(p) models is

    x^n(t) = \sum_{j=1}^{p} \phi_j \, x^n(t - j) + \epsilon_t    (1)

where the mean is assumed to be zero and the regression coefficients, φ_j, are
parameters to be estimated. Given the lag p, there are several approaches to estimate the
coefficients. Least-squares estimation is commonly employed to find the regression
coefficients. This approach assumes that the error terms, ε_t, have independent Gaussian
(normal) distributions with zero-mean and constant variance.
AR(p) models the lagged dependence between the observations. However, AR
models of this type assume linear relations, which can be problematic for some
applications. Moreover, the optimum model lag is not known a priori and has to be determined
via lag selection criteria. Also, the coefficients may change over time, but Eq. 1 assumes
that relations are the same for the entire time period.
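To make Eq. 1 concrete, least-squares estimation of the AR(p) coefficients can be sketched as follows. This is an illustrative NumPy sketch, not the paper's code; the function name `fit_ar` and the simulated series are our own.

```python
import numpy as np

def fit_ar(x, p):
    """Estimate the AR(p) coefficients of Eq. 1 by least squares (zero-mean series assumed)."""
    T = len(x)
    # Design matrix: row for time t holds the p previous values x(t-1), ..., x(t-p)
    X = np.column_stack([x[p - j:T - j] for j in range(1, p + 1)])
    y = x[p:]
    phi, *_ = np.linalg.lstsq(X, y, rcond=None)
    return phi

# Example: a series generated from x(t) = 0.7 x(t-1) + noise,
# so the estimate should be close to 0.7
rng = np.random.default_rng(0)
x = np.zeros(500)
for t in range(1, 500):
    x[t] = 0.7 * x[t - 1] + rng.normal(scale=0.1)
phi = fit_ar(x, p=1)
```

The same least-squares machinery applies for any lag p; only the design matrix grows.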

2.2 Regression trees

Our approach makes use of regression trees, but much differently than the traditional
approach. A regression tree partitions the feature space to decrease the impurity of a


target y at the terminal nodes (Breiman et al. 1984). The impurity at a node is usually
measured with the sum of squared errors, SSE = \sum_i (y_i − ȳ)^2, where the sum and
the mean ȳ are computed over the instances assigned to the node. A split is chosen
to minimize the weighted average of the SSE over the child nodes. Finding the best
partition is generally computationally infeasible (Hastie et al. 2009). Hence, regression
trees use a greedy strategy to partition the input space. The prediction for an instance
assigned to terminal node m is the mean of the target attribute ȳ over the instances
in the training set assigned to m. Models of this type are sometimes called piecewise
constant regression models as they partition the predictor space in a set of regions and
fit a constant value within each region.
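The greedy split selection described above can be sketched as follows. `best_split` is a hypothetical helper that scans every candidate split point on one predictor and minimizes the summed SSE of the two child nodes (equivalent to minimizing their weighted average).

```python
import numpy as np

def best_split(x, y):
    """Return the split value on predictor x minimizing the summed SSE of target y
    over the two child nodes (the greedy regression-tree criterion)."""
    order = np.argsort(x)              # candidate splits lie between sorted x values
    xs, ys = x[order], y[order]
    best_sse, best_val = np.inf, None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue
        left, right = ys[:i], ys[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_val = sse, (xs[i - 1] + xs[i]) / 2
    return best_val

# Piecewise-constant target: the best split should fall at x = 0.5
x = np.linspace(0, 1, 100)
y = np.where(x < 0.5, -1.0, 1.0)
split = best_split(x, y)
```

The two child-node means then serve as the piecewise-constant predictions.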

2.3 Time series representation

Several representations have been proposed for efficient data mining in time series
databases. We refer to Ratanamahatana et al. (2010) for detailed categorization and
description of these approaches. Discrete representations are common in time series
research (Ratanamahatana et al. 2010). For example, SAX (Lin et al. 2007) discretizes
the values based on the mean of values in fixed-length intervals. This representation
is similar to PAA mentioned previously (Chakrabarti et al. 2002).
A traditional role for tree-based learners with time series is to approximate with a
piecewise constant model in a recursive manner (Geurts 2001). A popular regression
tree-based representation uses (t, x n (t)) as the data where the time index t is the
only predictor and x n (t) is the target (Geurts 2001). See Fig. 3 for one of the time
series from the CBF dataset (Keogh et al. 2011). Initially, the mean of all values is
zero. A split minimizing the weighted sum of squared error (SSE) for the parent node
partitions the values into two nodes for which the mean values are −0.83 and 0.42. The
tree recursively partitions the time series values to minimize overall SSE in a greedy
manner. Because time is used as the predictor, the values residing at each terminal
node are contiguous and define an interval as shown in Fig. 3a. The discretized vector
has 128 elements (length of the times series) in this example.
The number of values residing at each terminal node can be used to represent the
time series (Geurts 2001). There are six terminal nodes defining the discretization
illustrated in Fig. 3a. Simply, a vector of length six, holding these counts, can be
used to represent the whole time series.
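A rough sketch of this kind of representation: because time is the only predictor, every terminal node is a contiguous interval, so a simple greedy interval splitter recovers the per-node counts. `grow_intervals` is our own illustrative helper (best-first splitting, a simplification of full recursive tree training), not the paper's code.

```python
import numpy as np

def grow_intervals(x, n_leaves):
    """Greedy piecewise-constant partition of the time axis: repeatedly take the
    split that reduces SSE the most, until n_leaves intervals remain."""
    def sse(seg):
        return ((seg - seg.mean()) ** 2).sum()
    intervals = [(0, len(x))]
    while len(intervals) < n_leaves:
        best = None
        for k, (a, b) in enumerate(intervals):
            if b - a < 2:
                continue
            total = sse(x[a:b])
            for cut in range(a + 1, b):
                gain = total - sse(x[a:cut]) - sse(x[cut:b])
                if best is None or gain > best[0]:
                    best = (gain, k, cut)
        _, k, cut = best
        a, b = intervals.pop(k)
        intervals[k:k] = [(a, cut), (cut, b)]
    return intervals

# Three constant regimes -> three intervals; the interval lengths are the
# per-terminal-node counts used to represent the series
x = np.concatenate([np.full(40, -0.83), np.full(48, 0.42), np.full(40, -0.2)])
intervals = grow_intervals(x, n_leaves=3)
counts = [b - a for a, b in intervals]
```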
Tree-based representations for time series were considered by Baydogan and
Runger (2014), Baydogan et al. (2013) specifically for classification tasks. The previ-
ous work differs from the methods here in a number of ways. First, the previous work
utilized the class attribute for the representation. Also, Baydogan et al. (2013) used a
quite different approach in which simple features (such as means and standard devia-
tions) were extracted from segments before a codebook was generated. The work by
Baydogan and Runger (2014) considered node counts for a representation. However,
the procedure was again focused on the class attribute for splitting rules and used a
substantially different data structure (in addition, without overlapping segments). Here
the approach is entirely unsupervised, and we generate splits in a different manner.


Fig. 3 A representation of time series from CBF dataset (Keogh et al. 2011) (a) and the regression tree
trained to obtain the representation (b). a Feature space. b Regression tree

We provide a new representation and also develop a similarity measure that can be
used for data mining tasks other than classification.

2.4 Time series similarity

Popular time series similarity measures have been summarized and evaluated by Lines
and Bagnall (2014). Eleven measures are empirically compared on 75 time series
classification datasets from different domains. The conclusion is that there is no one
measure that can significantly outperform the others (Lines and Bagnall 2014). Also,
it is shown that there is no statistically significant difference in the performance of the
elastic measures. The top three ranked algorithms on these datasets are claimed to be
WDTW, MSM and DTW with the best warping window (referred to as DTWBest).
As these approaches perform approximately at the same level, DTWBest is used for
comparisons, which is a common practice in the literature (Batista et al. 2014). See
Lines and Bagnall (2014) and Wang et al. (2013) for further discussion of the time
series similarity measures.
Moreover, similarity computation for MTS is a challenging task as the problem of
finding similarity between multiple series is not well-defined. To solve this problem,
similarity-based approaches are commonly employed over individual attributes of
MTS and the similarity over individual series of MTS is weighted to obtain a final
similarity measure. However, MTS are not only characterized by individual attributes
but also by their relationships.

3 Time series representation based on local autopatterns

LPS learns dependency patterns (autopatterns) from the time series by modeling the
relationships between the time series segments. We introduce a segmentation that is
related conceptually to multiple lag values for autocorrelation. After representing each
time series as a matrix of segments, a tree-based learning strategy to discover the
dependency structure is discussed. A BoW-type representation that encodes the dependency
patterns is generated for each time series. Then, a novel similarity measure based on
the proposed representation, named “learned pattern similarity” (LPS), is introduced.

3.1 Recursive partitioning on time series segments to learn autopatterns

Our approach extracts all possible segments of length L (L < T) starting from each
time index t = 1, 2, . . . , T − L + 1. Here, a segment refers to the values from
the series that are contiguous in time. A segment starting at time t is denoted as
s_t^n = {x^n(t), x^n(t + 1), . . . , x^n(t + L − 1)}. The segment matrix S^n in Eq. 2 for each
time series x^n is generated with columns equal to all possible segments (T − L + 1
segments of length L are possible) for each series:

    S^n_{L \times (T-L+1)} =
    \begin{bmatrix}
    x^n(1) & x^n(2)   & \cdots & x^n(T-L+1) \\
    x^n(2) & x^n(3)   & \cdots & x^n(T-L+2) \\
    \vdots & \vdots   &        & \vdots     \\
    x^n(L) & x^n(L+1) & \cdots & x^n(T)
    \end{bmatrix}    (2)
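The segment matrix of Eq. 2 is a straightforward sliding-window construction; a minimal sketch (the function name `segment_matrix` is ours):

```python
import numpy as np

def segment_matrix(x, L):
    """Segment matrix S^n of Eq. 2: column t holds the length-L segment starting at t."""
    T = len(x)
    return np.column_stack([x[t:t + L] for t in range(T - L + 1)])

x = np.arange(1.0, 9.0)       # toy series of length T = 8
S = segment_matrix(x, L=3)    # shape (3, 6): T - L + 1 = 6 segments of length 3
```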

After generating the segment matrix for each time series in the database, we concatenate
the matrices row-wise to learn the dependence relations over all the time series.
We denote this pooled segment matrix as S_{NL×(T−L+1)}. Our approach uses regression trees
to identify the structural dependencies between the time series observations. Before
training the regression tree based on the segment matrix, a random column of S, the r-th,
is selected as the target segment. Then, we train a regression tree which selects
a random p-th column of the segment matrix as the predictor at each split. Note that
the index p used here is different from the lag parameter p used by AR(p) models.
Similar to the split selection criterion in regression trees, the value that minimizes SSE
is used as the split decision. This is illustrated in the simple example of Sect. 1, where the
split is determined to be T1 < 1.740247 (Fig. 2a). The regression tree trained in this
manner learns a nonlinear autoregressive model. The index of the column determines
the starting time of a segment. Therefore, the lag level is determined by the selection
of p and r . In order to allow for the discovery of autopatterns based on the multiple
(potentially) different local relationships, p is selected randomly at each node. Ran-
dom strategies related to this were also shown to perform well in another regression
context by Geurts et al. (2006). A random selection of p at each split also enables LPS
to model a dependency that changes over time.
The setting of L basically sets an upper bound on the lag level in the approach.
Obviously, the lag cannot be greater than T − L. To model all possible lag levels,
we introduce a new learning strategy that trains J trees, {g j , j = 1, 2, . . . , J }, in an
ensemble framework. In addition to selecting a random predictor segment at each node
to account for multiple lags, each tree uses a random segment length in the approach.
This allows for a large number of possible lag levels to be modeled. Also, the depth
of the trees is restricted to be D to control the complexity. Algorithm 1 shows the
steps to build a single tree. The method that generates the split value in Step 6 can be

modified for computational speed. We consider two splitting strategies: “regression”
and “random” splits. This is discussed further in Sect. 3.3.

Algorithm 1 Regression tree algorithm: tree(S, depth, r), where S is an NL × (T −
L + 1) matrix with ij-th entry s_ij and r is a random target column of S in the
regression-split setting.
1: if cannot further split or depth = D then
2:    Designate this node as a terminal node
3:    Return
4: end if
5: Select a random column index p
6: Obtain p*, the split point for column p
7: S_left ← instances (rows) i with s_ip ≤ p*
8: S_right ← instances (rows) i with s_ip > p*
9: depth = depth + 1
10: tree(S_left, depth, r)
11: tree(S_right, depth, r)
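A rough Python port of Algorithm 1 under the “random splits” strategy of Sect. 3.3; `build_tree` and its bookkeeping are our own sketch, and the regression-split variant would replace the uniform draw with an SSE-minimizing scan against the target column r.

```python
import numpy as np

def build_tree(S, rows, depth, D, rng, leaves):
    """Sketch of Algorithm 1 with "random splits": at each node pick a random
    predictor column p and a uniform split point p* within that column's range.
    Each terminal node records which rows (segments) reached it."""
    if depth == D or len(rows) < 2:
        leaves.append(rows)                  # terminal node
        return
    p = rng.integers(S.shape[1])             # Step 5: random column index
    lo, hi = S[rows, p].min(), S[rows, p].max()
    if lo == hi:                             # cannot split further
        leaves.append(rows)
        return
    p_star = rng.uniform(lo, hi)             # Step 6, "random splits" version
    mask = S[rows, p] <= p_star
    build_tree(S, rows[mask], depth + 1, D, rng, leaves)   # Steps 7 and 10
    build_tree(S, rows[~mask], depth + 1, D, rng, leaves)  # Steps 8 and 11

rng = np.random.default_rng(0)
S = rng.normal(size=(60, 10))    # stand-in for the pooled segment matrix
leaves = []
build_tree(S, np.arange(60), depth=0, D=3, rng=rng, leaves=leaves)
```

With depth limit D, a tree has at most 2^D terminal nodes, and every segment lands in exactly one of them.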

When all time series are used for training, the algorithm is analogous to searching
for common patterns over all time series. Each tree generates a representation and the
final time series representation is obtained via concatenation. For simplicity, assume
that all trees contain the same number of terminal nodes R. The general case is easily
handled. Let H_j(x^n) denote the R-dimensional frequency vector of instances in the
terminal nodes from tree g_j for time series x^n. We concatenate the frequency vectors
over the trees to obtain the final representation of each time series, denoted as H(x^n),
of length R × J (modified in the obvious way for non-constant R). Our representation
summarizes the patterns in the time series based on the terminal node distribution of
the instances over the trees.
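Assembling the representation from terminal-node counts can be sketched as follows; the leaf assignments below are hypothetical, standing in for the output of a two-tree ensemble.

```python
import numpy as np

def frequency_vector(leaf_ids, R):
    """H_j(x^n): how many of the series' segments fall in each of tree j's R terminal nodes."""
    return np.bincount(leaf_ids, minlength=R)

# Hypothetical terminal-node assignments of one series' segments
# (tree 1 has R = 4 terminal nodes, tree 2 has R = 3)
leaves_tree1 = np.array([0, 0, 2, 3, 3, 3])
leaves_tree2 = np.array([1, 1, 1, 0, 2, 2])

# H(x^n): concatenation of the per-tree frequency vectors, length R1 + R2
H = np.concatenate([frequency_vector(leaves_tree1, 4),
                    frequency_vector(leaves_tree2, 3)])
```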
The descriptions and examples are provided for time series of the same length, but
lengths can differ. Our segment extraction scheme should be modified in such cases.
Keeping the same number of segments, longer segments should be extracted for longer
series. Then the representation obtained should be normalized for each series based
on the segment lengths.
Moreover, interpolation is commonly required to estimate the missing values for
time series with missing values. However, the estimation method itself adds an addi-
tional parameter to the time series problem. Our proposed approach naturally handles
the data with missing values, without any additional steps, because tree-based learning
implicitly handles the attributes with missing values (Breiman et al. 1984). Robust-
ness of the proposed approach to missing values is empirically evaluated in Sect. 4.7.
Although the descriptions are provided for numerical time series, LPS can also be
applied to categorical time series such as DNA sequences.

3.2 Extension to multivariate time series

An MTS is an M-attribute time series. In the multivariate scenario, the segment matrix
S n should be generated for each attribute of the multivariate series and concatenated
column wise to obtain a segment matrix of size L × (M × (T − L + 1)) for each


multivariate series. A positive property of LPS is that the rest of the algorithm remains
the same. With the help of the random selection of the columns at each node of the
tree, interactions between multiple attributes are modeled. This enables our approach
to model generalized cross-correlation at different lag levels. Depending on the number
of attributes, the number of trees and the depth level might be set larger to capture the
relevant information. Also, the complexity of LPS is not affected due to the random
segment selection at each iteration.
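The column-wise concatenation for MTS can be sketched as follows (function names are ours; each attribute contributes its own block of segment columns):

```python
import numpy as np

def segment_matrix(x, L):
    """Columns are all length-L segments of a univariate series."""
    return np.column_stack([x[t:t + L] for t in range(len(x) - L + 1)])

def mts_segment_matrix(X, L):
    """Concatenate per-attribute segment matrices column-wise:
    X has shape (M, T); the result has shape (L, M * (T - L + 1))."""
    return np.hstack([segment_matrix(attr, L) for attr in X])

X = np.random.default_rng(0).normal(size=(2, 50))   # M = 2 attributes, T = 50
S = mts_segment_matrix(X, L=5)                       # shape (5, 2 * 46) = (5, 92)
```

Because trees pick predictor columns at random from this wider matrix, splits can mix columns from different attributes, which is what models the cross-attribute interactions.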

3.3 Splitting strategies

The split decision is one step of LPS and our approach considers two splitting strategies.
In the first alternative, referred to as “random splits”, the split value is determined
randomly from a uniform distribution based on the minimum and maximum of the
values in Step 6 of Algorithm 1.
The second alternative introduces a split similar to the ones used in regression
trees. In this alternative, tree construction in Algorithm 1 is modified slightly to learn
a regression tree. A regression tree requires a target and a random column is chosen to
be the target for each tree. Then, Step 6 sets the split value to minimize the weighted
average of the SSE on the target column over the child nodes. This alternative, referred
to as “regression splits”, provides certain benefits. With the regression tree approach,
a search for autopatterns is done in a more intelligent way as opposed to “random
splits”.
With an explicit objective function (i.e., minimize the weighted average of the SSE),
the split value on a random predictor column is selected to partition the values of the
target around the child nodes' mean levels. In a sense, two columns of the segment
matrix (predictor and target) are discretized simultaneously. If the dependencies of the
patterns within series are important for similarity, “regression splits” has the potential
to work better. In other words, “regression splits” model a dependency between time
periods and it has the potential to work well if this behavior is important. This is
especially important for MTS as the relationship between multiple attributes is likely
to provide information about the learning task. Rows (segments) from the segment
matrix S are assigned to tree nodes as in the “random splits” case.
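As a sketch, the selection of a "regression split" at a node can be written as follows (function and variable names are ours, not the paper's code; minimizing the size-weighted average of the child SSEs is equivalent to minimizing their plain sum at a fixed node, since the two differ only by the constant node size):

```python
def best_regression_split(rows, predictor, target):
    """Scan candidate thresholds on a chosen predictor column and keep the
    one minimizing the summed SSE of the target column over the two child
    nodes. `rows` is a list of segment-matrix rows at the current node."""
    pairs = sorted((row[predictor], row[target]) for row in rows)

    def sse(ys):
        # Sum of squared errors around the mean of the child node.
        m = sum(ys) / len(ys)
        return sum((y - m) ** 2 for y in ys)

    best_value, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no valid threshold between equal predictor values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = sse(left) + sse(right)
        if score < best_score:
            best_score = score
            # Midpoint between adjacent distinct predictor values.
            best_value = 0.5 * (pairs[i - 1][0] + pairs[i][0])
    return best_value, best_score
```

For example, with rows whose target values jump from 1 to 5 as the predictor crosses 1.5, the sketch recovers that threshold with zero residual error.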
Generating a split value with "random splits" is computationally very fast. On
the other hand, "regression splits" evaluate all possible split values at each node,
and additional complexity arises from sorting before the evaluation of candidate split
locations. This requires more computation, but tree-based methods are well known to be
computationally fast (Breiman et al. 1984). Both strategies are evaluated empirically
in Sect. 4, and further discussion of the split choice is provided in Sect. 4.5.

3.4 Difference series

Regression trees find dependencies between segments based on the mean levels of the
values. In order to introduce dependencies in terms of trends into the representation, we
also generate segments (both predictor and target in the case of "regression splits")
from the differences of consecutive values. Here, difference segments of length T − L
for each time series are generated as

\begin{bmatrix}
x^n(2)-x^n(1) & x^n(3)-x^n(2) & \cdots & x^n(T-L+1)-x^n(T-L) \\
x^n(3)-x^n(2) & x^n(4)-x^n(3) & \cdots & x^n(T-L+2)-x^n(T-L+1) \\
\vdots & \vdots & & \vdots \\
x^n(L+1)-x^n(L) & x^n(L+2)-x^n(L+1) & \cdots & x^n(T)-x^n(T-1)
\end{bmatrix} \qquad (3)

In our modified approach with differences, the difference columns are concatenated
column-wise with the original segment matrix S, and segments are randomly selected
from this enlarged matrix. Hence, a segment matrix of size N L × (2T − 2L + 1)
is used for representation learning in Algorithm 1. As studied in the experiments,
a better representation can potentially be learned with this strategy. The addition of
the difference segments does not affect the complexity because our approach selects
a random segment at each iteration. With the addition of the difference series, LPS
enjoys advantages similar to derivative dynamic time warping (DDTW) (Keogh
and Pazzani 2001), which measures similarity based on trends by estimating
the local derivatives of the data. Similar information is captured by LPS through the
difference series.
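A minimal sketch of the difference-segment construction for a single series (names are ours; the rows are the overlapping windows of the first differences, matching the matrix in Eq. 3):

```python
def difference_segments(x, L):
    """For one series x of length T, compute the first differences
    d(t) = x(t+1) - x(t) and window them into L overlapping segments,
    each of length T - L, as in Eq. (3)."""
    T = len(x)
    d = [x[t + 1] - x[t] for t in range(T - 1)]  # length T - 1
    return [d[r:r + T - L] for r in range(L)]
```

Concatenating these rows column-wise with the corresponding L rows of the original segment matrix (each of length T − L + 1) yields the stated per-series width of 2T − 2L + 1.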

3.5 Similarity measure

Given the representation described previously, a similarity measure is developed. Suppose
h_k^n is the kth entry of H(x^n); then the similarity between the time series x^n and
x^{n'} is set to

\mathrm{sim}(x^n, x^{n'}) = \frac{1}{R \times J} \sum_{k=1}^{R \times J} \min(h_k^n, h_k^{n'}) \qquad (4)
As the similarity measure counts the number of matched values in the representation,
LPS can be categorized as a pattern-based similarity measure. Because of the random
selection of segments, we aggregate the similarity over all the trees as given in Eq. 4.
This enables our approach to capture patterns from different lags and locations. By
matching based on the minimum number of values in the pattern, the measure bears some
relationship to similarity approaches based on subsequences, such as the longest common
subsequence (LCSS) (Latecki et al. 2005). This matching strategy also allows us to
handle the problem of dilation.
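The aggregation in Eq. 4 can be sketched directly (the inputs are the concatenated terminal-node counts of two series over all J trees; names are ours):

```python
def lps_similarity(h_a, h_b):
    """Similarity of two series from their LPS representations: the average
    of the element-wise minima of the terminal-node counts (Eq. 4). Both
    vectors have length R * J."""
    assert len(h_a) == len(h_b)
    return sum(min(a, b) for a, b in zip(h_a, h_b)) / len(h_a)
```

Each minimum counts the values of the two series that land in the same terminal node of the same tree, so the measure rewards co-occurring local patterns regardless of where in the series they occur.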
Instead of generating a measure of similarity using Eq. 4, we propose a dissimilarity
measure to benefit from bounding schemes such as early abandoning (Keogh et al.
2005), which can accelerate the similarity search over the time series. The dissimilarity
between the time series x^n and x^{n'} is set to

\mathrm{dissim}(x^n, x^{n'}) = \frac{1}{R \times J} \sum_{k=1}^{R \times J} \left| h_k^n - h_k^{n'} \right| \qquad (5)


The dissimilarity measure in Eq. 5 penalizes the number of mismatched values between
the time series. Moreover, it conveys the same information as the similarity measure in
Eq. 4, with the opposite sign. This can be seen as follows. The absolute difference in
the sum in Eq. 5 can be written as

\mathrm{dissim}(x^n, x^{n'}) = \frac{1}{R \times J} \sum_{k=1}^{R \times J} \left[ \max(h_k^n, h_k^{n'}) - \min(h_k^n, h_k^{n'}) \right] \qquad (6)

If the sum is distributed over the terms in Eq. 6, we obtain the sum of the maxima
minus the sum of the minima. The sum of the entries in a representation is a constant,
B, for each series, where B is equal to the sum of the segment lengths considered for
each tree. Hence,

\sum_{k=1}^{R \times J} \left[ \max(h_k^n, h_k^{n'}) + \min(h_k^n, h_k^{n'}) \right] = \sum_{k=1}^{R \times J} \max(h_k^n, h_k^{n'}) + \sum_{k=1}^{R \times J} \min(h_k^n, h_k^{n'}) \qquad (7)

2B = \sum_{k=1}^{R \times J} \max(h_k^n, h_k^{n'}) + \sum_{k=1}^{R \times J} \min(h_k^n, h_k^{n'}) \qquad (8)

\sum_{k=1}^{R \times J} \max(h_k^n, h_k^{n'}) = 2B - \sum_{k=1}^{R \times J} \min(h_k^n, h_k^{n'}) \qquad (9)

Plugging Eq. 9 back into Eq. 6 produces the dissimilarity in Eq. 10, which has the same
summation term as the similarity measure in Eq. 4, but with a negative sign. Because
the remaining terms in Eq. 10 are constant, the similarity in Eq. 4 is essentially the
opposite of the dissimilarity in Eq. 10.

\mathrm{dissim}(x^n, x^{n'}) = \frac{2}{R \times J} \left[ B - \sum_{k=1}^{R \times J} \min(h_k^n, h_k^{n'}) \right] \qquad (10)

Although the length of the final representation can exceed the time series length,
depending on D and J, the measure is still computationally efficient, as further
illustrated in Sect. 4.4. Moreover, bounding strategies still work for MTS with the
proposed representation and similarity measure because LPS transforms an MTS into a
univariate vector.
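The equivalence of Eqs. 5 and 10 can be checked numerically on toy counts (all numbers are our own illustration; the entries of each representation must sum to the same constant B):

```python
def dissim_abs(h_a, h_b):
    """Eq. (5): mean absolute difference of the representations."""
    return sum(abs(a - b) for a, b in zip(h_a, h_b)) / len(h_a)

def dissim_min(h_a, h_b, B):
    """Eq. (10): the same quantity written via the summed minima, using
    |a - b| = a + b - 2 * min(a, b) and the constant per-series total B."""
    return 2.0 * (B - sum(min(a, b) for a, b in zip(h_a, h_b))) / len(h_a)

# Toy representations whose entries each sum to B = 10.
h_a = [4, 3, 2, 1, 0]
h_b = [1, 1, 3, 2, 3]
assert sum(h_a) == sum(h_b) == 10
assert dissim_abs(h_a, h_b) == dissim_min(h_a, h_b, B=10)
```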

3.6 Parameters

There are four parameters in our approach: the splitting strategy, the number of trees
J, the depth D and the segment length L. However, LPS is robust to the settings of
these parameters if they are set within a certain range. For example, L is selected
randomly for each tree. J and D can be set large if there is no concern regarding the
computation time. Similarly, "regression splits" are preferred if training time is not a
problem. LPS is quite insensitive to parameter settings, and we illustrate its robustness
on several data sets to support this claim empirically.


If there is information regarding the application, one may want to set the parameters
accordingly. The most important parameter of the approach is L. First, L sets
an upper bound of T − L on the lag, as discussed earlier. Therefore, if only short-term
dependencies are important with "regression splits", L may be set large. This way,
dependencies are modeled over shorter time windows. However, interesting patterns
of the time series may be missed if long-term dependencies are important. To account
for long-term dependencies, a smaller L is preferred.
We prefer to handle the setting of L with a simple approach that leverages
the large number of trees typically used in LPS. Instead of setting L to a certain
level, L is set randomly for each tree. This provides robust performance, as shown in
our experiments, and removes the need to specify a value for L. Another option is to
set the parameters based on the cross-validation accuracy on training data. Section 4
further discusses how the parameters are handled in the experiments.
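A sketch of the random segment-length strategy (the [0.05, 0.95] proportion bounds follow the experimental setup in Sect. 4; the rounding and function name are our assumptions):

```python
import random

def sample_segment_lengths(T, J, low=0.05, high=0.95, seed=1):
    """Rather than tuning L, draw a proportion gamma uniformly from
    [low, high] for each of the J trees and set L = floor(gamma * T)."""
    rng = random.Random(seed)
    return [max(1, int(rng.uniform(low, high) * T)) for _ in range(J)]
```

Because every tree sees a different L, the ensemble covers both short and long segment lengths without any cross-validation.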

3.7 Algorithmic complexity

The time for learning the representation is mainly determined by the training of the
trees. The time complexity of building a single tree is O(νηβ), where ν = 1 is the
number of features evaluated at each split, η = N × L is the number of instances
in the segment matrix and β = D is the depth of the tree. Because we set L as a
proportion of the full time series length, we define γ such that L = γT. As we build J
trees in a random fashion, the overall complexity of training is O(J N T D). Moreover,
the columns of S are generated at the splitting stage to avoid unnecessary storage of the
overlapping segments. Hence, our proposed approach is efficient in terms of memory
usage.
The testing complexity is determined by the complexity of the representation and
the classification. Representing a time series requires the traversal of the trees, which
is O(T J D). The time complexity of the classification is similar to that of
NNEuclidean, which is linear in the representation length. Each time series is
represented with a vector of length R × J, where R is the number of terminal nodes.
Here, R is determined by the depth parameter D and is not constant. Assuming that R is
constant and equal to the maximum possible value, R = 2^D, the worst-case testing
complexity of LPS is O(N J 2^D).
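As a concrete illustration of the worst-case representation size, using the J = 200 and D = 6 defaults adopted later in the experiments:

```python
# A fully grown binary tree of depth D has at most 2**D terminal nodes,
# so each series is represented by at most R * J values.
J, D = 200, 6
R_max = 2 ** D          # 64 terminal nodes per tree in the worst case
rep_length = R_max * J  # 12800 entries per series

assert R_max == 64
assert rep_length == 12800
```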
Theoretically, the worst-case complexity of LPS in testing is exponential in D.
However, the proposed approach is very fast in practice, as further discussed in
Sect. 4.4. If even a small decrease in the computation time is of practical concern,
bounding schemes can be used to accelerate the approach. The simplest
and best-known scheme for NNEuclidean is early abandoning (Keogh et al. 2005),
as mentioned earlier. For example, during the computation of the LPS similarity for
nearest-neighbor classification, we can stop the calculation if the current sum of the
absolute differences between each pair of corresponding entries exceeds the best
similarity so far (Keogh et al. 2006). The computation time can be reduced significantly
with this bounding scheme (Rakthanmanon et al. 2012).
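A minimal sketch of early abandoning for a 1NN search with the L1-style dissimilarity of Eq. 5 (the constant 1/(R × J) normalization is dropped since it does not change the ranking; names are ours):

```python
def early_abandon_1nn(query_rep, candidate_reps):
    """Find the nearest neighbour of query_rep among candidate_reps under
    the (unnormalized) sum of absolute differences, abandoning a candidate
    as soon as its running sum exceeds the best distance seen so far."""
    best_dist, best_idx = float("inf"), -1
    for idx, cand in enumerate(candidate_reps):
        running = 0.0
        for a, b in zip(query_rep, cand):
            running += abs(a - b)
            if running >= best_dist:  # cannot beat the current best: abandon
                break
        else:  # the inner loop completed without abandoning
            best_dist, best_idx = running, idx
    return best_idx, best_dist
```

For example, searching `[1, 2, 3]` against `[[1, 2, 10], [1, 2, 4], [0, 0, 0]]` returns the second candidate (distance 1), and the third candidate is abandoned after its very first term.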
More importantly, almost all the steps of LPS are embarrassingly parallel. The trees
in the ensemble can be trained in parallel to learn the representation. Likewise, the


similarity computation can be done in parallel over multiple trees. This makes LPS
very suitable for large-scale similarity search in a parallel environment.

4 Experiments and results

LPS is implemented as an R (R Core Team 2014) package named LPStimeSeries, which is
publicly available. We also provide a MATLAB implementation of LPS (Baydogan
2013). All the scripts and information needed to run both implementations are provided
at (Baydogan 2013) to promote full reproducibility of our results. Here we report the
results from the R implementation of LPS.
The effectiveness of the proposed representation and similarity measure is evaluated
based on the accuracy of a one-nearest-neighbor (1NN) classifier. This scheme was
proposed as an objective evaluation method by Keogh and Kasetty (2003). We refer the
reader to Wang et al. (2013) for further discussion of the evaluation methods for the
similarity measures. We emphasize that our similarity measure has greater applicability
to other tasks, but the 1NN analysis provides a convenient structure for comparisons.
Our approach is tested on 75 univariate time series classification datasets, where
46 are available in the UCR time series database (Keogh et al. 2011) and the rest
are available in other published work (Hills et al. 2014; Lines and Bagnall 2014;
Rakthanmanon and Keogh 2013). As discussed by Lines and Bagnall (2014), these
datasets have diverse characteristics such as the lengths of the series, the number of
classes, etc. We standardize each time series to zero mean and unit standard deviation.
This adjusts for potentially different baselines or scales that are not considered to
be relevant (or persistent) for a learner. We also performed experiments on several
multivariate time series classification problems. The details of the experimentation
and results are discussed in Sect. 4.2.
The datasets are grouped into four categories by Lines and Bagnall (2014) to facil-
itate better interpretation. The largest category with 29 data sets is the group of sensor
readings. The second largest category has 28 data sets from image outline classifica-
tion. As many of the problems in this category are not rotationally aligned, classifiers
working in the time domain have the potential to fail for this category (Lines and
Bagnall 2014). The challenge with rotational invariance is common for many of the
data sets in this category. The third group has 12 data sets taken from motion capture
devices attached to human subjects. The simulated data sets form the last category.
See Lines and Bagnall (2014) for the details of the categorization.
Our approach is compared to the nearest neighbors (NN) classifiers discussed by Lines
and Bagnall (2014). Eleven similarity measures are considered in our comprehensive
evaluation: DTW and DDTW, dynamic time warping and derivative dynamic time
warping with the full warping window (Ratanamahatana and Keogh 2005; Keogh
and Pazzani 2001); DTWBest and DDTWBest, DTW and DDTW with the window
size determined through cross-validation; DTWWeight and DDTWWeight,
weighted versions of DTW and DDTW (Jeong et al. 2011); LCSS, longest common
subsequence (Latecki et al. 2005); MSM, move-split-merge (Stefan et al. 2013); TWE,
time warp edit distance (Marteau 2009); ERP, edit distance with real penalty (Chen
et al. 2005); and ED, Euclidean distance.


The results for all NN classifiers were obtained from Lines and Bagnall (2014). As
mentioned by Lines and Bagnall (2014), some of these measures require certain hyper-
parameters to be set. The parameter optimization is done on the training set through
cross-validation by allowing at most 100 model evaluations for each approach.
The most important parameter in LPS is the segment length setting as discussed
in Sect. 3.6. We introduce two strategies for the segment length setting. The first
strategy sets the segment length L as the proportion of full time series length, γ ∈
{0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95}, and, furthermore, we consider the depth D ∈
{2, 4, 6} based on a leave-one-out cross-validation on the training data. For the cross-
validation, we train 25 trees (J = 25) for each fold. This version of LPS is named
LPSBest, as it is analogous to DTWBest. Hence, we allow for |γ| × |D| = 7 ×
3 = 21 model evaluations in our study. Although the depth could be fixed, since LPS is
insensitive to the depth setting if it is set large enough, three levels of depth are evaluated.
Because the ensemble is trained once with the largest depth setting and the
similarity is evaluated for each depth setting, evaluating smaller depth levels introduces
no additional training cost. After the parameters providing
the best cross-validation accuracy are obtained, we train J = 200 trees in the ensemble
with the selected parameters to obtain the final representation.
In the second strategy, the segment length L is chosen randomly for each tree as
the proportion of full time series length between 0.05 × T and 0.95 × T . Also, the
number of trees J and the depth D are fixed to 200 and 6, respectively, for all datasets.
This version of LPS is referred to as LPS. The values of the parameters are set the
same for all datasets to illustrate the robustness of LPS. In other words, no parameter
tuning is conducted for this strategy. As we discussed in Sect. 3.6, J and D can be set
large if there is no concern regarding the computational time. Empirical evaluation of
this strategy is provided in Sect. 4.3.
The random selection of the segment lengths in LPS saves significant computational
time in training because we avoid the cross-validation step. Also, both splitting options
(regression and random) are evaluated for this strategy. We run 10 replications and
report the median performance because of the random nature of our proposed approach.

4.1 Classification accuracy

Tables 1 and 2 summarize the median error rates from 10 replications of our algorithm
on the test data. We also provide the classification of the data sets in terms of the
problem types in these tables. Comparison of LPS to multiple classifiers over all
datasets is done using a procedure suggested by Demšar (2006). The testing procedure
employs a Friedman test (Friedman 1940) followed by the Nemenyi (1963) test if a
significant difference is identified by Friedman's test. It is essentially a non-parametric
analogue of analysis of variance based on the ranks of the methods on each dataset (Lines
and Bagnall 2014).
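The critical-difference computation used by the Nemenyi test can be sketched as follows (the formula is from Demšar 2006; the q constant is the studentized-range value from his tables, quoted approximately):

```python
import math

def nemenyi_cd(q_alpha, k, n):
    """Nemenyi critical difference for k classifiers compared on n datasets:
    CD = q_alpha * sqrt(k * (k + 1) / (6 * n))."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

# With k = 13 classifiers, n = 75 datasets and q_0.05 ~= 3.313, this
# reproduces the critical difference of about 2.107 reported below.
```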
Figure 4 shows the average ranks for all classifiers on 75 datasets. LPSBest has the
best average rank and LPS is second best. Based on the Friedman test, we find that
there is a significant difference between the 13 classifiers at the 0.05 level. Proceeding
with the Nemenyi test, we compute the critical difference (CD). This test concludes


Table 1 The full results of LPS based measures and 11 similarity measures on 75 datasets (part one of two)

Columns (left to right): LPSBest, LPS; DTW: Full, Best, Weight; DDTW: Full, Best, Weight; LCSS; MSM; TWE; ERP; ED

Adiac• 0.211 0.235 0.396 0.389 0.394 0.414 0.330 0.327 0.749 0.373 0.366 0.391 0.389
ArrowHead• 0.200 0.200 0.429 0.217 0.211 0.320 0.211 0.206 0.217 0.257 0.229 0.183 0.183
ARSim 0.004 0.046 0.404 0.417 0.407 0.101 0.103 0.101 0.270 0.311 0.491 0.433 0.489
Beef 0.367 0.300 0.367 0.333 0.300 0.333 0.300 0.300 0.233 0.533 0.400 0.333 0.333
BeetleFly• 0.150 0.150 0.350 0.350 0.450 0.150 0.100 0.200 0.350 0.600 0.550 0.300 0.350
BirdChicken• 0.050 0.000 0.350 0.350 0.400 0.350 0.300 0.300 0.250 0.350 0.550 0.400 0.400
Car 0.183 0.167 0.267 0.233 0.217 0.267 0.217 0.217 0.167 0.100 0.083 0.233 0.267
CBF 0.002 0.000 0.003 0.006 0.003 0.408 0.428 0.409 0.010 0.030 0.009 0.002 0.148
ChlorineConc 0.352 0.358 0.378 0.375 0.373 0.353 0.351 0.349 0.441 0.382 0.379 0.362 0.369
CinCECGtorso 0.064 0.216 0.378 0.071 0.068 0.375 0.071 0.088 0.068 0.081 0.237 0.118 0.102
Coffee 0.071 0.036 0.000 0.000 0.000 0.071 0.036 0.036 0.000 0.107 0.000 0.000 0.000
Computers 0.136 0.186 0.124 0.124 0.124 0.212 0.200 0.212 0.244 0.220 0.160 0.116 0.272
CricketX♣ 0.282 0.309 0.236 0.246 0.236 0.462 0.438 0.415 0.249 0.241 0.241 0.244 0.421
CricketY♣ 0.208 0.213 0.218 0.205 0.187 0.546 0.482 0.454 0.182 0.177 0.238 0.162 0.346
CricketZ♣ 0.305 0.278 0.215 0.177 0.187 0.459 0.454 0.408 0.215 0.249 0.221 0.179 0.387
DiatomSize• 0.049 0.049 0.036 0.075 0.036 0.291 0.092 0.118 0.101 0.046 0.049 0.075 0.075
DistPhalanxAge• 0.237 0.234 0.245 0.201 0.223 0.216 0.237 0.237 0.194 0.209 0.209 0.252 0.252
DistPhalanxOut• 0.234 0.237 0.239 0.254 0.246 0.246 0.228 0.210 0.268 0.246 0.279 0.236 0.239
DistPhalanxTW• 0.327 0.335 0.324 0.324 0.324 0.345 0.317 0.345 0.381 0.338 0.309 0.331 0.317
Table 1 continued

Columns (left to right): LPSBest, LPS; DTW: Full, Best, Weight; DDTW: Full, Best, Weight; LCSS; MSM; TWE; ERP; ED

Earthquakes 0.331 0.335 0.295 0.309 0.281 0.353 0.331 0.353 0.317 0.338 0.324 0.295 0.302
ECGFiveDays 0.155 0.188 0.243 0.200 0.245 0.307 0.282 0.304 0.233 0.230 0.221 0.197 0.200
ElectricDevices 0.273 0.271 0.329 0.295 0.303 0.333 0.300 0.303 0.562 0.287 0.358 0.305 0.456
FaceAll• 0.242 0.232 0.192 0.192 0.206 0.127 0.118 0.103 0.199 0.191 0.214 0.207 0.286
FaceFour• 0.040 0.057 0.170 0.102 0.125 0.375 0.261 0.284 0.034 0.057 0.148 0.136 0.216
FacesUCR• 0.098 0.069 0.106 0.091 0.087 0.166 0.152 0.149 0.046 0.037 0.083 0.077 0.229
fiftywords• 0.213 0.190 0.310 0.235 0.229 0.308 0.237 0.231 0.202 0.187 0.207 0.288 0.369
fish• 0.094 0.054 0.166 0.166 0.154 0.103 0.080 0.040 0.131 0.063 0.069 0.126 0.217
FordA 0.090 0.098 0.276 0.206 0.213 0.203 0.183 0.181 0.212 0.228 0.252 0.205 0.314
FordB 0.223 0.252 0.341 0.330 0.320 0.295 0.284 0.281 0.275 0.277 0.302 0.314 0.404
GunPoint♣ 0.000 0.003 0.093 0.087 0.020 0.007 0.000 0.007 0.027 0.027 0.047 0.053 0.087
Haptics♣ 0.562 0.575 0.601 0.594 0.607 0.698 0.591 0.594 0.623 0.578 0.549 0.627 0.627

Herring• 0.398 0.430 0.406 0.344 0.313 0.344 0.406 0.438 0.328 0.344 0.250 0.266 0.266
InlineSkate♣ 0.494 0.514 0.629 0.615 0.598 0.738 0.725 0.785 0.587 0.576 0.576 0.589 0.676
ItalyPower 0.053 0.073 0.060 0.039 0.057 0.110 0.027 0.054 0.049 0.064 0.050 0.040 0.039
LargeKitchen 0.157 0.347 0.264 0.264 0.264 0.269 0.285 0.269 0.456 0.243 0.299 0.387 0.517
Lightning2 0.197 0.213 0.131 0.131 0.098 0.328 0.230 0.180 0.230 0.180 0.164 0.131 0.246
Lightning7 0.411 0.315 0.274 0.288 0.233 0.425 0.301 0.315 0.425 0.247 0.260 0.260 0.425

Bold shows the best error rate


Legend for the data set category: • image outline classification, ♣ motion classification,  sensor reading classification,  simulated data classification


Table 2 The full results of LPS based measures and 11 similarity measures on 75 datasets (part two of two)

Columns (left to right): LPSBest, LPS; DTW: Full, Best, Weight; DDTW: Full, Best, Weight; LCSS; MSM; TWE; ERP; ED

MALLAT 0.093 0.103 0.069 0.090 0.058 0.074 0.052 0.048 0.091 0.067 0.067 0.090 0.090
MedicalImages• 0.297 0.279 0.270 0.261 0.274 0.349 0.341 0.336 0.341 0.261 0.299 0.324 0.311
MidPhalanxAge• 0.523 0.536 0.539 0.539 0.565 0.506 0.461 0.481 0.435 0.506 0.539 0.506 0.526
MidPhalanxOut• 0.208 0.232 0.247 0.199 0.223 0.223 0.206 0.216 0.227 0.254 0.289 0.220 0.254
MidPhalanxTW• 0.497 0.419 0.649 0.682 0.688 0.643 0.636 0.662 0.649 0.649 0.636 0.656 0.695
MoteStrain 0.114 0.076 0.175 0.134 0.141 0.291 0.228 0.204 0.131 0.128 0.191 0.130 0.125
NonInvThorax1 0.183 0.202 0.274 0.196 0.205 0.599 0.404 0.399 0.215 0.193 0.188 0.185 0.196
NonInvThorax2 0.147 0.161 0.173 0.132 0.142 0.421 0.304 0.274 0.170 0.117 0.128 0.121 0.132
OliveOil 0.133 0.133 0.167 0.133 0.167 0.133 0.167 0.133 0.600 0.167 0.133 0.133 0.133
OSULeaf• 0.134 0.248 0.409 0.401 0.376 0.120 0.128 0.112 0.211 0.227 0.223 0.397 0.483
Phalanges• 0.226 0.220 0.260 0.228 0.237 0.244 0.196 0.198 0.219 0.248 0.281 0.242 0.246
Plane 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.010 0.000 0.000 0.038
ProxPhalanxAge• 0.112 0.132 0.137 0.127 0.137 0.141 0.156 0.141 0.112 0.122 0.122 0.132 0.127
ProxPhalanxOut• 0.172 0.172 0.213 0.196 0.189 0.182 0.162 0.165 0.175 0.192 0.223 0.220 0.206
ProxPhalanxTW• 0.278 0.254 0.234 0.278 0.239 0.220 0.224 0.229 0.229 0.273 0.224 0.259 0.278
Refr.Devices 0.329 0.320 0.509 0.515 0.488 0.549 0.560 0.549 0.376 0.416 0.480 0.365 0.515
ScreenType 0.440 0.431 0.427 0.445 0.448 0.477 0.485 0.477 0.435 0.536 0.485 0.459 0.533
ShapeletSim  0.006 0.083 0.333 0.328 0.261 0.494 0.456 0.511 0.106 0.139 0.178 0.367 0.444
ShapesAll• 0.218 0.188 0.278 0.247 0.252 0.185 0.182 0.173 0.203 0.190 0.337 0.247 0.272
Table 2 continued

Columns (left to right): LPSBest, LPS; DTW: Full, Best, Weight; DDTW: Full, Best, Weight; LCSS; MSM; TWE; ERP; ED

SmallKitchen 0.225 0.224 0.299 0.256 0.293 0.296 0.296 0.299 0.456 0.232 0.293 0.272 0.645
SonyRobot1 0.225 0.240 0.276 0.301 0.266 0.258 0.309 0.268 0.319 0.260 0.319 0.301 0.301
SonyRobot2 0.123 0.136 0.171 0.143 0.140 0.149 0.150 0.149 0.183 0.126 0.139 0.179 0.143
StarLightCurves 0.033 0.038 0.096 0.097 0.096 0.098 0.091 0.086 0.126 0.114 0.119 0.150 0.147
SwedishLeaf• 0.072 0.072 0.208 0.154 0.126 0.115 0.096 0.107 0.112 0.104 0.109 0.138 0.211
Symbols• 0.030 0.038 0.055 0.069 0.055 0.114 0.087 0.085 0.050 0.033 0.030 0.080 0.108
SyntheticControl 0.027 0.027 0.007 0.017 0.007 0.433 0.433 0.433 0.047 0.027 0.013 0.027 0.120
ToeSegmentation1♣ 0.077 0.092 0.105 0.101 0.105 0.136 0.154 0.136 0.167 0.132 0.132 0.110 0.325
ToeSegmentation2♣ 0.100 0.108 0.077 0.108 0.077 0.246 0.154 0.162 0.046 0.115 0.138 0.108 0.377
Trace 0.020 0.020 0.000 0.010 0.000 0.000 0.010 0.000 0.030 0.070 0.010 0.050 0.240
TwoLeadECG 0.061 0.059 0.134 0.149 0.134 0.084 0.086 0.084 0.203 0.060 0.040 0.102 0.260
TwoPatterns 0.014 0.008 0.000 0.002 0.000 0.003 0.003 0.003 0.001 0.001 0.002 0.000 0.093

UwaveX♣ 0.189 0.175 0.278 0.226 0.226 0.357 0.270 0.269 0.229 0.232 0.229 0.228 0.265
UwaveY♣ 0.263 0.240 0.376 0.302 0.303 0.463 0.377 0.368 0.332 0.302 0.314 0.319 0.336
UwaveZ♣ 0.253 0.236 0.357 0.327 0.335 0.472 0.375 0.377 0.317 0.301 0.312 0.336 0.351
UwaveAll♣ 0.025 0.034 0.107 0.035 0.034 0.150 0.066 0.063 0.038 0.036 0.061 0.044 0.052
wafer 0.001 0.004 0.020 0.004 0.003 0.022 0.003 0.003 0.010 0.003 0.004 0.004 0.005
WordSynonyms• 0.270 0.251 0.367 0.260 0.276 0.417 0.315 0.303 0.260 0.229 0.254 0.321 0.382
yoga• 0.136 0.130 0.164 0.157 0.147 0.180 0.171 0.160 0.140 0.135 0.132 0.153 0.170

Bold shows the best error rate


Legend for the data set category: • image outline classification, ♣ motion classification,  sensor reading classification,  simulated data classification


Fig. 4 The average ranks for all classifiers on 75 datasets. The critical differences at 0.05 and 0.10 levels
are 2.107 and 1.957, respectively. The performance of LPSBest is significantly better than DTWBest at
level 0.10

that two classifiers have a significant difference in their performances if their average
ranks differ by at least the critical difference (Demšar 2006). The critical differences
at significance levels 0.05 and 0.10 are 2.107 and 1.957, respectively. The performance
of LPSBest is not significantly different at significance level 0.05 when compared to
LPS, MSM, WDTW and DTWBest, but it is significantly better than
DTWBest at level 0.10, and LPSBest has the best average rank. The full set of results
is available at (Baydogan 2013).
A more detailed view of the results follows the graphical approach from Ding
et al. (2008). Scatter plots in Fig. 5 show pairwise comparisons of error rates from
LPSBest against LPS, MSM, WDTW and DTWBest. Each axis represents a method
and each dot represents the error rate for a particular dataset. We draw the line x = y
to represent the region where both methods perform about the same. A point above
the line indicates that the approach on the x axis has better accuracy than the one on
the y axis for the corresponding dataset.
As seen in Fig. 5, our experiments show that LPSBest performs slightly better for
most of the datasets than MSM, WDTW and DTWBest. Similar performance of LPS-
Best and LPS is further discussed in Sect. 4.5. LPSBest performs significantly worse
for only four data sets (CricketX, CricketY, CricketZ, Coffee). However, Coffee only
has 28 test series and LPSBest misclassifies two more instances than the competitor for
this data set. Also, Cricket and Uwave datasets treat the attributes from a multivariate
time series separately. The performance based on the individual attributes may not be
conclusive as the class has the potential to be defined by the relationships between the
attributes.
We also compare the approaches based on the problem category to determine any
sensitivities to the type of datasets. Table 3 illustrates the average ranks of five similarity
measures based on the problem types. LPSBest is only outperformed in one case:
WDTW with the simulated data. This is our least interesting problem type over the
four problem categories. MSM provides similar performance to LPSBest for the image
data. Otherwise, LPSBest provides the best scores and LPS is second best.
We also compare approaches with respect to the length of the time series. A data
set is labeled as “long” if its length is greater than the median length over the 75 time


Fig. 5 Error rates for LPSBest (median of 10 replications) versus LPS, WDTW, MSM and DTWBest.
Four scatter plots on [0, 0.7] axes compare LPSBest (x axis) against a LPS, b WDTW, c MSM and
d DTWBest (y axes)

Table 3 Average ranks of five classifiers for the 75 datasets by problem category

LPSBest LPS WDTW MSM DTWBest

Image 2.607 2.429 3.607 2.697 3.661


Motion 2.167 2.542 3.167 3.584 3.542
Sensor 2.207 2.811 3.207 3.500 3.276
Simulated 2.833 3.000 2.334 3.167 3.667
Overall 2.400 2.640 3.280 3.187 3.494

series, 300 time units; otherwise it is "short". This strategy divides the problems into
two groups (with almost equal numbers of data sets), and the same test is conducted
for each length category. Figure 6 shows the average ranks for all classifiers in each
category. LPSBest performs considerably better for the longer time series (i.e., the rank
difference from the closest competitor, MSM, is greater for the "long" category).
As mentioned by Batista et al. (2014), in order to understand the predictive capa-
bility of an algorithm, it is useful to show if better performance of a method can be
predicted ahead of time. The a priori usefulness of a classifier can be assessed with a
Texas sharpshooter plot introduced by Batista et al. (2014). The main idea is that the


Fig. 6 The average ranks for all classifiers based on the length of the time series. LPSBest performs
considerably better for the longer time series. a Long time series (T > 300). CDs for 0.05 and 0.1 levels
are 2.999 and 2.786, respectively. b Short time series (T ≤ 300). CDs for 0.05 and 0.1 levels are 2.960 and
2.749, respectively

ratio of the cross-validation accuracies of two classifiers should be consistent with the
ratio of their accuracies on the test data. If the method provides reliable ratios, then
one can decide which classifier to use. In other words, if both ratios are greater than
one, we predict a gain in one classifier and also observe a gain (true positive, TP). If
both ratios are less than one, we expect a degradation in performance and observe
worse performance (true negative, TN). In all other cases, the outcome is not desirable.
A Texas sharpshooter plot of LPSBest versus DTWBest is provided in Fig. 7.
For 58 of the data sets, we correctly predicted LPSBest to be better or worse (i.e., TP
+ TN). There are a number of data sets whose ratio is around the point (1, 1). These data
sets represent marginal increases/decreases in accuracy (Batista et al. 2014). There is
only one data set with a substantive false negative, MidPhalanxTW, where LPSBest
was predicted to decrease accuracy, but the actual accuracy increased.
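The TP/TN/FP/FN bookkeeping behind a Texas sharpshooter plot can be sketched as follows (gains are accuracy ratios of the method over the baseline, expected from cross-validation on training data and actual on test data; the function name is ours):

```python
def sharpshooter_region(expected_gain, actual_gain):
    """Classify one dataset into a quadrant of the Texas sharpshooter plot
    (Batista et al. 2014) from its expected and actual accuracy-gain ratios."""
    if expected_gain > 1 and actual_gain > 1:
        return "TP"  # predicted a gain and observed one
    if expected_gain < 1 and actual_gain < 1:
        return "TN"  # predicted a loss and observed one
    if expected_gain > 1:
        return "FP"  # predicted a gain but observed a loss
    return "FN"      # predicted a loss but observed a gain
```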
The test error rates of LPSBest and LPS with regression versus LPS with random
splits are illustrated in Fig. 8. Reported values are the median of the error rates over
10 replications. There are few datasets for which “regression splits” are performing


Fig. 7 Texas sharpshooter plot of LPSBest vs DTWBest. The scatter plot of expected accuracy gain versus
actual accuracy gain over 75 data sets, with quadrants labeled true positive, false negative, true negative
and false positive, shows whether one can rely on the cross-validation accuracy of LPSBest when compared
to DTWBest. Expected accuracy gain is the ratio of the cross-validation accuracies of LPSBest and
DTWBest (on training data); similarly, actual accuracy gain is computed by dividing LPSBest's accuracy
by DTWBest's on the test data. Points around (1, 1) are not interesting, as they represent marginal
increases/decreases in accuracy (Batista et al. 2014)
Fig. 8 The test error rates of LPSBest and LPS with regression splits versus LPS with random splits (two
scatter plots on [0, 0.7] axes: a LPSBest and b LPS regression, each against LPS random). "Regression
splits" slightly improves the accuracy, but the improvements are small. Overall, LPSBest and LPS with
"regression splits" provide better error rates than "random splits" on 51 and 45 data sets, respectively

substantially better than "random splits". "Regression splits" slightly improve the
accuracy, but the improvements are small. Overall, LPSBest and LPS with "regression
splits" provide better error rates than "random splits" on 51 and 45 data sets,
respectively. The good performance of "random splits" is discussed further in Sect. 4.5.

4.2 LPS for multivariate time series similarity

Similarity computation over multivariate time series is a challenging task. Most of
the similarity-based approaches compute the similarity over individual attributes of
MTS and use a weighted sum of the similarities to obtain a final similarity measure
for the MTS. However, MTS are characterized not only by their individual attributes
but also by the relationships between them. Hence, the use of similarities between
individual attributes alone might be problematic for MTS. Moreover, some of the
attributes of MTS may be
categorical or they may differ in scale even if they are numeric. For example, con-
sider a network traffic classification problem where observations are the series of
network flows (Sübakan et al. 2014) and the aim is to identify the application that
generates the flow. Each flow is defined as a series of network packets transferred
between IP-Port pairs. Each packet is characterized by four attributes: packet size,
transfer direction, payload and the duration between this packet and the previous
packet. The transfer direction of a packet (i.e. upstream or downstream) is basically
a categorical attribute. The duration between packets has a different scale. Similarity
computation in such cases may not be effective with traditional approaches when
individual attributes are considered. On the other hand, LPS is capable of generating
a similarity measure based on the relationships between the attributes under these
circumstances.
We illustrate the benefits of our proposed approach on 15 MTS datasets introduced
by Frank and Asuncion (2010), Keogh et al. (2011), Olszewski (2012), Sübakan et al.
(2014), CMU (2012). The characteristics of the datasets are provided in Table 4. We
compare LPS to DTW with a full window. We only consider DTW with a full window
since the selection of the best window based on individual attributes is not well-defined
in multivariate settings. Furthermore, DTW requires each attribute to have the same
scale. We employ the same strategy as in Sect. 4 and standardize each attribute to zero
mean and unit standard deviation. This transformation is not performed for LPS, as trees
are scale invariant and this standardization has the potential to distort the interactions
between the attributes (Cortina 1993). Another transformation is required for categorical
predictors to obtain numerical attributes for DTW: the direction information in the
“Network Flow” dataset is transformed to a binary representation.
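The preprocessing applied for DTW (and skipped for LPS) can be sketched as follows. This is our own illustration on a toy two-attribute flow: the 0/1 direction encoding stands in for the binary representation of transfer direction, and the duration values are invented.

```python
import numpy as np

def standardize_attributes(mts):
    """Z-normalize each attribute (row) of a multivariate series to zero
    mean and unit standard deviation, as done for DTW in the experiments.
    `mts` is an (attributes x time) array."""
    mean = mts.mean(axis=1, keepdims=True)
    std = mts.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # guard against constant attributes
    return (mts - mean) / std

# Toy flow: direction encoded as 0/1 (upstream/downstream), duration in ms.
direction = np.array([0, 1, 1, 0, 1], dtype=float)
duration = np.array([12.0, 300.0, 5.0, 41.0, 7.0])
flow = np.vstack([direction, duration])
z = standardize_attributes(flow)
print(z.mean(axis=1))  # each attribute now has (near-)zero mean
```

Standardizing the binary direction row is exactly the operation the text argues is not well-defined semantically, even though it is mechanically possible.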
Table 4 summarizes the error rates of LPS and DTW with a full warping window.
We use the same settings as in the univariate case for LPS (D = 6, J = 200) and
report the median error rate over 10 replications. LPS performs better than DTW with
a full window for most of the datasets. For some of the datasets such as Japanese
Vowels or Network Flow, DTW suffers from the standardization of each attribute. For
example, Network Flow data has the transfer direction and the duration between the
consecutive packets as time series data. Standardization of the binary variable used to
represent the transfer direction is not well-defined. Similarly, scaling the time between
consecutive packets is also problematic. On the other hand, LPS models the local
correlation within and between the attributes of multivariate time series. The Network
Flow dataset contains irregularly-spaced time series (time between the packets is not
the same), but the time between consecutive packets is considered as an attribute.
As a tree-based approach, LPS can model the dependency structure for this type of
dataset. On the other hand, the warping path computation in DTW is problematic for
Network Flow as the time between the observations is not constant. Note that there is
no parameter tuning to make LPS comparable to DTW with a full window. The same
set of parameters is used for all datasets; this is even the same setting used by
LPS for univariate time series.

123
Table 4 Multivariate time series classification datasets and their characteristics

Dataset            # of attributes  Length    # of classes  Train  Test  LPS    DTW full  Source
PEMS              963              144       7             267    173   0.156  0.168     Frank and Asuncion (2010)
KickvsPunch ♣       62               274–841   2             16     10    0.100  0.100     CMU (2012)
WalkvsRun ♣         62               128–1918  2             28     16    0.000  0.000     CMU (2012)
CMU_MOCAP_S16 ♣     62               127–580   2             29     29    0.000  0.069     CMU (2012)
AUSLAN ♣            22               45–136    95            1140   1425  0.246  0.238     Frank and Asuncion (2010)
ArabicDigits      13               4–93      10            6600   2200  0.029  0.092     Frank and Asuncion (2010)
Japanese Vowels   12               7–29      9             270    370   0.049  0.351     Frank and Asuncion (2010)
Wafer             6                104–198   2             298    4896  0.038  0.040     Olszewski (2012)
Network Flow      4                50–997    2             803    534   0.032  0.288     Sübakan et al. (2014)
Handwri. Char. ♣    3                60–182    20            300    2558  0.035  0.033     Frank and Asuncion (2010)
UWaveMTS ♣          3                315       8             896    3582  0.020  0.071     Keogh et al. (2011)
ECG               2                39–152    2             100    100   0.180  0.150     Olszewski (2012)
Libras ♣            2                45        15            180    180   0.097  0.200     Frank and Asuncion (2010)
DigitsShape •       2                30–98     4             24     16    0.000  0.063     Sübakan et al. (2014)
Shapes •            2                52–98     3             18     12    0.000  0.000     Sübakan et al. (2014)

• image outline classification, ♣ motion classification,  sensor reading classification

Moreover, Table 2 also reports the results for uWaveGestureLibrary where the axes
are concatenated to obtain a univariate time series; this is shown in the row for uWaveAll.
Consider the performance of full LPS with regression splits, where the error rate is
0.034. Although a combined univariate analysis with LPS accounts for some relationships
between the individual axes (depending on the segment length; shorter segments are
required to model long-term dependencies), explicitly modeling the relationships between
individual attributes (i.e., cross-correlation) through the segment selection in the
multivariate version (with full LPS and regression splits) improves the results
substantially to 0.020. This result is also better than the LPSBest result (i.e., 0.025)
for the combined univariate series (uWaveAll). A multivariate implementation of LPS is
also provided as a MATLAB script at Baydogan (2013).

4.3 Sensitivity analysis

Given a splitting strategy, LPS requires the setting of three parameters: the segment
length (L) and the tree-building parameters (J and D). Segment length is the only
parameter of importance in our approach because LPS is robust to the settings of J
and D if they are set sufficiently large. Moreover, the segment length is selected
randomly by each tree in LPS, and this has been shown to work well. To illustrate the
robustness of our approach, we choose five datasets (Fish, SwedishLeaf, SmallKitchen,
MedicalImages, WordSynonyms) and show the classification accuracy for selected
tree-building parameters (J and D). These datasets provide a reasonable number of
training and test time series. Here 10 replications are conducted for each setting
combination. The tree depths and numbers of trees considered in this experiment are
D ∈ {2, 4, 6, 8, 10} and J ∈ {10, 50, 100, 250, 500}.
Average test error rates over the datasets and the replications are shown in Fig. 9. The
depth is fixed at D = 6 in the experiments that use different J settings, for which the
results are illustrated in Fig. 9a, b. We also report the average 10-fold cross-validation
(CV) error rate for each setting. Larger J provides better error rates. Moreover, the
results are more stable with a larger number of trees (i.e., error variance is smaller over
10 replications). If there is no concern regarding the computation time, J should be
set large. Our experiments in Sect. 4 used J = 200 which provided reasonable results
over all datasets. The change in the average CV error rates is similar to the progression
of test error rates.
Figure 9c, d show the sensitivity of the error rates to the depth parameter, D, when
J = 200. As discussed in Sect. 3.6, D can be dropped by growing a full tree. However,
the control over the representation size is lost with this approach. If there is a need to
control the complexity of the representation, this may be inappropriate. The figures
show limited sensitivity to D for modest depth values. As for J , the CV and test error
rate progressions are similar.
We also performed another experiment on all datasets to illustrate the robustness of
LPS to depth setting. We ran 10 replicates of LPS over all datasets with D ∈ {6, 8, 10}
and J = 100. Figure 10 shows the boxplot of the error rates obtained for each setting.
Each boxplot shows the distribution of 75×10 = 750 error rates. There is no significant
difference between the performances for different depth settings.


Fig. 9 Test and 10-fold CV error rates for selected values of J (D = 6) and D (J = 200) over five datasets
(10 replications)

Fig. 10 The boxplot of the error rates obtained for each depth setting, D ∈ {6, 8, 10}.
Each boxplot shows the distribution of 75 × 10 = 750 error rates. The distribution of
error rates does not change significantly for different depth settings

4.4 Computational complexity

We implemented LPS as an R (R Core Team 2014) package, and our experiments use an Ubuntu
14.04 system with 16 GB RAM and a dual-core CPU (i7-3540M, 3.0 GHz). Although the
CPU can handle four threads in parallel, only a single thread is used.


The StarLightCurves dataset from Keogh et al. (2011) is used to demonstrate the effect
of the parameters J, N, T and D on the testing times empirically. Testing time is
the elapsed time for querying the training time series database to find the
best match (i.e., the most similar time series), and small query times are required by
many applications. We randomly selected a proportion γ ∈ {0.2, 0.4, 0.6, 0.8, 1}
of the number of time series (γN) and of the length of the time series (γT). The
levels considered for J and D are J ∈ {50, 100, 150, 200, 250} and D ∈ {4, 5, 6, 7, 8}.
Here 10 replications are conducted for each setting combination.
The time for testing (J = 200 and D = 6) is illustrated in Fig. 11a. A linear increase
in the query time with γN and γT is consistent with the complexity of LPS discussed
in Sect. 3.7. Consequently, LPS is very convenient for large databases of long time series.
In practice, the testing time does not increase exponentially with D, as illustrated in
Fig. 11b; empirically, the growth in computational time is closer to D log D. The
proposed approach is also very fast in practice. If even a small decrease in the
computational time is of practical concern, bounding schemes can be used to accelerate
this approach. The simplest and best-known approach for NNEuclidean is early abandoning
(EA) (Keogh et al. 2005), as mentioned earlier. Similar to EA for NNEuclidean, we can
stop the calculation if the current sum of the absolute differences between each pair of
representations exceeds the best similarity so far (Keogh et al. 2006). The computational
time can be reduced substantially with this bounding scheme (Rakthanmanon et al. 2012).
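The early-abandoning idea can be sketched as follows. This is our own minimal illustration of a 1-NN query over representation vectors with the L1 distance, not the implementation in the LPS package; the toy vectors are invented.

```python
import numpy as np

def nn_query_early_abandon(query_rep, train_reps):
    """1-NN search over representation vectors with the L1 distance,
    abandoning a candidate as soon as its partial sum exceeds the best
    distance found so far."""
    best_dist, best_idx = float("inf"), -1
    for idx, rep in enumerate(train_reps):
        dist = 0.0
        for q, r in zip(query_rep, rep):
            dist += abs(q - r)
            if dist >= best_dist:   # early abandon: cannot beat best match
                break
        else:                       # loop finished: full distance computed
            best_dist, best_idx = dist, idx
    return best_idx, best_dist

reps = np.array([[0.0, 1.0, 2.0],
                 [1.0, 1.0, 1.0],
                 [5.0, 5.0, 5.0]])
idx, dist = nn_query_early_abandon(np.array([1.0, 1.0, 0.9]), reps)
print(idx, dist)
```

The `for`/`else` construct only updates the best match when a candidate survives the full summation, which is what makes the abandoned partial sums safe to discard.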

4.5 Discussion on LPS based approaches and the splitting strategies

Section 3.1 discusses how LPS models the relationships within the time series (both
univariate and multivariate). Empirical evidence for the ability of LPS to model such
relationships is provided by the good results of LPS-based measures on the ARSim
dataset. The error rate of LPSBest is 0.004, whereas the closest competitors (DDTW and
WDDTW) provide 0.101. ARSim is a simulated data set designed to introduce

Fig. 11 Test times with changing values of the parameters J, N, T and D. a J = 200 and
D = 6. b γN = 1 and γT = 1


a challenge for classifiers working in the time domain (Bagnall et al. 2012). It is a
binary classification problem in which the two classes follow different AR(7) models.
Because of the autoregressive nature of LPS (i.e., one segment is used to predict
another), the error rates are substantially lower for this data set.
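Data of this flavor can be generated as below. This is our own sketch: the AR(7) coefficient vectors are invented placeholders, not the actual ARSim parameters, which are specified in Bagnall et al. (2012).

```python
import numpy as np

def simulate_ar(coeffs, n, burn_in=100, seed=0):
    """Simulate an AR(p) series x_t = sum_i coeffs[i] * x_{t-i} + noise,
    discarding an initial burn-in segment."""
    rng = np.random.default_rng(seed)
    p = len(coeffs)
    x = np.zeros(burn_in + n)
    for t in range(p, burn_in + n):
        past = x[t - p:t][::-1]          # x_{t-1}, ..., x_{t-p}
        x[t] = np.dot(coeffs, past) + rng.standard_normal()
    return x[burn_in:]

# Two hypothetical AR(7) coefficient vectors, one per class (NOT the
# actual ARSim parameters).
class0 = [0.4, -0.2, 0.1, 0.05, -0.05, 0.02, 0.01]
class1 = [-0.4, 0.2, -0.1, 0.05, 0.05, -0.02, 0.01]
X = [simulate_ar(c, 200, seed=i)
     for i, c in enumerate([class0] * 10 + [class1] * 10)]
y = [0] * 10 + [1] * 10
```

Because both classes share the same marginal noise distribution, methods that compare raw values in the time domain struggle, while a model of the lagged dependency structure separates them.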
The good performance of “random splits” shown in Fig. 8 is not surprising because
of the substantial overlap between the columns of a segment matrix. The columns of a
segment matrix are obtained by sliding a window of length L by one time unit (stride
= 1). Therefore, patterns are highly likely to be captured with the help of recursive
partitioning, even if each segment is selected randomly.
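The overlap can be made concrete with a small sketch (our own illustration) that builds a segment matrix by sliding a length-L window with stride 1:

```python
import numpy as np

def segment_matrix(x, L):
    """All length-L subsequences of x as columns, obtained by sliding a
    window one time unit at a time (stride = 1).  Adjacent columns share
    L - 1 values, which is why even random splits tend to find patterns."""
    T = len(x)
    return np.column_stack([x[i:i + L] for i in range(T - L + 1)])

x = np.arange(8, dtype=float)   # toy series 0..7
S = segment_matrix(x, L=3)
print(S.shape)                  # (3, 6): six overlapping segments
print(S[:, 0], S[:, 1])         # columns overlap in two of three values
```

A split rule learned on one column therefore tends to carry over to its neighbors, so a randomly chosen segment is rarely far from an informative one.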
However, “random splits” is likely to perform poorly for multivariate time series
because the dependencies between the multiple series are not considered explicitly. In
the UWaveAll dataset, a multivariate time series is transformed into a univariate time
series by concatenating the individual axes. Although it is treated as a univariate time
series in Table 1, “regression splits” can model the dependencies between the axes, with
an error rate of 0.034 (Sect. 4.2), whereas the error rate of “random splits” is 0.051.
This dataset is a good example because gestures are not defined by the movements over
individual axes; the interaction of the movements over different axes is important to
define the classes. Thus, the multivariate version of LPS has an error rate of 0.022. As
parameter selection introduces a better choice of the segment length, which is important
for this particular problem, the error rate for LPSBest is also very small (i.e., 0.025).

4.6 Use of learned representation as an input to learning algorithms

LPS obtains similarity from a representation based on tree-based ensembles. The
learned representation can also be used as an input to learning algorithms. Each tree
in the ensemble generates a representation, and the combined representation obtained
from the trees generates a sparse vector. Each tree potentially captures different
information about the series, which is important for many learning tasks.
In order to illustrate other potential uses of the learned representation, we apply
principal component analysis (PCA) to two alternative representations of the CBF
dataset. The first representation considers the raw values as the feature vector (a vector
of length 128). Then we generate the LPS representation with the same parameter
settings as in our experiments. Figure 12 plots the first two PCA scores for each time
series from both representations. To assess the performance of both representations
visually, the class information is color-coded. PCA on the LPS representation shows
better separation of the three classes as illustrated in Fig. 12. Although this is an
example for a classification task, the LPS representation can be used for other tasks
such as forecasting and it has a natural extension to clustering, anomaly detection,
etc., with the proposed similarity measure.
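The PCA step can be sketched with a plain SVD. This is our own stand-in on random data: rows are instances and columns are features (raw values or a learned sparse representation); the real experiment loads the CBF series and the LPS representation instead.

```python
import numpy as np

def pca_scores(X, k=2):
    """First k principal-component scores via SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T   # project onto the top-k right singular vectors

# Toy stand-in for a feature matrix (30 instances, 128 features).
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 128))
scores = pca_scores(X, k=2)
print(scores.shape)  # (30, 2): one 2-D point per time series, as in Fig. 12
```

Plotting the two score columns colored by class label reproduces the kind of comparison shown in Fig. 12 for either representation.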
If the learned representation is used as the input to a learning algorithm, feature
selection might help. LPS generates a sparse representation where each tree provides
different information. With the help of feature selection, relevant information can be
captured. Split choice should be less important in such cases, and the use of “random
splits” has the advantage of computational efficiency. However, the details of this
study are beyond the scope of the current work.



Fig. 12 First two principal components from PCA applied to the LPS representation and the raw time
series of the CBF dataset. PCA on LPS representation shows better separation of the three classes. a PCA
on raw time series. b PCA on LPS representation

4.7 Missing values

Estimation of missing values is commonly employed for time series datasets.
However, the estimation method introduces a new parameter to the time series problem.
In contrast, LPS handles data with missing values without the need for any
additional step, as our learning strategy inherits the properties of decision tree learning.
The uWaveGestureLibrary dataset (Liu et al. 2009) is used here to illustrate the
performance of the multivariate version of LPS when there are missing values. To simplify
the experiments, difference series are discarded. For each axis and instance, we randomly
removed a proportion γ ∈ {0.01, 0.05, 0.1, 0.25, 0.5} of the values in the training data
and the test data. The error rates over 10 replications are provided in Fig. 13. For the

Fig. 13 Boxplot of the test error rates with different proportions of missing values for uWaveGestureLibrary
(Liu et al. 2009) (10 replications). LPS is robust to large proportions of missing values without requiring
specific mechanisms


gesture recognition task, LPS performs reasonably well even with large proportions
of missing values.
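The masking protocol can be sketched as follows. This is our own illustration of removing a γ proportion of values per series; the real experiment applies it to each axis of uWaveGestureLibrary before feeding the data to LPS.

```python
import numpy as np

def mask_missing(X, gamma, seed=0):
    """Randomly replace a `gamma` proportion of the values in each series
    (row of X) with NaN, mimicking the missing-value experiment."""
    rng = np.random.default_rng(seed)
    X = X.astype(float).copy()
    for row in X:
        n_missing = int(round(gamma * row.size))
        idx = rng.choice(row.size, size=n_missing, replace=False)
        row[idx] = np.nan
    return X

X = np.arange(20, dtype=float).reshape(2, 10)  # two toy series of length 10
Xm = mask_missing(X, gamma=0.5)
print(np.isnan(Xm).sum(axis=1))  # 5 missing values per series
```

Because the split rules in decision trees can route instances with missing values without imputing them, no further preprocessing step is needed before training.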

5 Conclusions

This study proposes a novel time series representation based on the idea of nonlinear
dependencies, described as autopatterns, that can occur locally in time. Time
series segments are extracted and then partitioned according to simple rules to detect
these patterns. Random partitions, and partitions generated from regression trees applied
to random predictor and target segments, are used to generate a representation.
Consequently, features are learned within the method based on the dependencies in the time
series. This avoids the feature extraction step that is common in feature-based
approaches. The method discovers the patterns of the time series with a tree-based
ensemble learning strategy. The approach conceptually generalizes autoregressive
models to detect dependencies that are potentially local and nonlinear in the time
series. Consequently, LPS is based on a traditional approach, but provides a promising
new research direction.
A robust similarity measure called learned pattern similarity (LPS), based on the
matching patterns of the time series, is also presented. Our experimental results show
that LPS does not require the setting of many parameters and provides fast and
competitive results on benchmark datasets from several domains. Also, an R (R Core
Team 2014) package named LPStimeSeries was implemented as part of this study.

Acknowledgments This research was partially supported by the Scientific and Technological Research
Council of Turkey (TUBITAK) Grant Number 114C103.

References
Akl A, Valaee S (2010) Accelerometer-based gesture recognition via dynamic-time warping, affinity prop-
agation, compressive sensing. In: 2010 IEEE International conference on acoustics speech and signal
processing (ICASSP), pp 2270–2273
Bagnall A, Davis LM, Hills J, Lines J (2012) Transformation based ensembles for time series classification.
In: SDM, vol. 12. SIAM, pp 307–318
Batista G, Keogh E, Tataw O, de Souza V (2014) CID: an efficient complexity-invariant distance for time
series. Data Min Knowl Discov 28(3):634–669. doi:10.1007/s10618-013-0312-3
Baydogan MG (2013) Learned pattern similarity (LPS). Homepage: www.mustafabaydogan.com/learned-pattern-similarity-lps.html
Baydogan MG, Runger G (2014) Learning a symbolic representation for multivariate time series classifi-
cation. Data Min Knowl Discov pp 1–23. doi:10.1007/s10618-014-0349-y
Baydogan MG, Runger G, Tuv E (2013) A bag-of-features framework to classify time series. IEEE Trans
Pattern Anal Mach Intell 35(11):2796–2802
Breiman L, Friedman J, Olshen R, Stone C (1984) Classification and regression trees. Wadsworth, Belmont
Chakrabarti K, Keogh E, Mehrotra S, Pazzani M (2002) Locally adaptive dimensionality reduction for
indexing large time series databases. ACM Trans Database Syst 27(2):188–228
Chen H, Tang F, Tino P, Yao X (2013) Model-based kernel for efficient time series analysis. In: Proceedings
of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM,
New York, pp 392–400


Chen L, Özsu MT, Oria V (2005) Robust and fast similarity search for moving object trajectories. In:
Proceedings of the 2005 ACM SIGMOD International conference on management of data, SIGMOD
’05. ACM, New York, pp 491–502. doi:10.1145/1066157.1066213
Cortina JM (1993) Interaction, nonlinearity, and multicollinearity: implications for multiple regression. J
Manag 19(4):915–922
Cuturi M (2011) Fast global alignment kernels. In: Getoor L, Scheffer T (ed) Proceedings of the 28th
international conference on machine learning (ICML-11). ACM, New York, pp 929–936
CMU (2012) Graphics Lab Motion Capture Database. Homepage: mocap.cs.cmu.edu
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Ding H, Trajcevski G, Scheuermann P, Wang X, Keogh E (2008) Querying and mining of time series data:
experimental comparison of representations and distance measures. Proc VLDB Endow 1:1542–1552
Frank A, Asuncion A (2010) UCI machine learning repository. http://archive.ics.uci.edu/ml
Friedman M (1940) A comparison of alternative tests of significance for the problem of m rankings. Ann
Math Stat 11(1):86–92. http://www.jstor.org/stable/2235971
Fu T (2011) A review on time series data mining. Eng Appl Artif Intell 24:164–181
Gaidon A, Harchaoui Z, Schmid C (2011) A time series kernel for action recognition. In: BMVC 2011-
British machine vision conference. BMVA Press, Dundee, pp 63–1
Geurts P (2001) Pattern extraction for time series classification. Principles of data mining and knowledge
discovery. Lecture Notes in Computer Science, vol 2168. Springer, Berlin, pp 115–127
Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Mach Learn 63(1):3–42
Grabocka J, Schmidt-Thieme L (2014) Invariant time-series factorization. Data Min Knowl Discov 28(5–6):1455–1479
Han J, Kamber M (2001) Data mining: concepts and techniques. The Morgan Kaufmann Series in Data
Management Systems. Elsevier Books, Oxford. http://books.google.com/books?id=6hkR_ixby08C
Hastie T, Tibshirani R, Friedman J (2009) Elements of statistical learning. Springer, Berlin
Hills J, Lines J, Baranauskas E, Mapp J, Bagnall A (2014) Classification of time series by shapelet trans-
formation. Data Min Knowl Discov 28(4):851–881. doi:10.1007/s10618-013-0322-1
Jaakkola T, Diekhans M, Haussler D (1999) Using the Fisher kernel method to detect remote protein
homologies. In: ISMB, vol 99, pp 149–158
Jebara T, Kondor R, Howard A (2004) Probability product kernels. J Mach Learn Res 5:819–844. http://dl.
acm.org/citation.cfm?id=1005332.1016786
Jeong YS, Jeong MK, Omitaomu OA (2011) Weighted dynamic time warping for time series classification.
Pattern Recognit 44(9):2231–2240. doi:10.1016/j.patcog.2010.09.022
Keogh E, Kasetty S (2003) On the need for time series data mining benchmarks: a survey and empirical
demonstration. Data Min Knowl Discov 7(4):349–371
Keogh E, Lin J, Fu A (2005) HOT SAX: efficiently finding the most unusual time series subsequence. In:
Proceedings of the fifth IEEE international conference on data mining, ICDM ’05. IEEE Computer
Society, Washington, DC, pp 226–233
Keogh E, Wei L, Xi X, Lee SH, Vlachos M (2006) LB_Keogh supports exact indexing of shapes under
rotation invariance with arbitrary representations and distance measures. In: Proceedings of the 32nd
international conference on very large data bases, VLDB ’06. VLDB Endowment, pp 882–893
Keogh E, Zhu Q, Hu BYH, Xi X, Wei L, Ratanamahatana CA (2011) The UCR time series
classification/clustering. Homepage: www.cs.ucr.edu/~eamonn/time_series_data/
Keogh EJ, Pazzani MJ (2001) Derivative dynamic time warping. In: SDM, vol. 1. SIAM, pp 5–7
Kuksa P, Pavlovic V (2010) Spatial representation for efficient sequence classification. In: 2010 20th Inter-
national conference on pattern recognition (ICPR), pp 3320–3323
Latecki L, Megalooikonomou V, Wang Q, Lakaemper R, Ratanamahatana C, Keogh E (2005) Partial elastic
matching of time series. In: Fifth IEEE international conference on data mining, pp 701–704
Liao TW (2005) Clustering of time series data-a survey. Pattern Recogn 38(11):1857–1874. doi:10.1016/
j.patcog.2005.01.025
Lin J, Keogh E, Lonardi S, Chiu B (2003) A symbolic representation of time series, with implications for
streaming algorithms. In: Proceedings of the 8th ACM SIGMOD workshop on research issues in data
mining and knowledge discovery. ACM Press, New York, pp 2–11
Lin J, Keogh E, Wei L, Lonardi S (2007) Experiencing SAX: a novel symbolic representation of time series.
Data Min Knowl Discov 15:107–144


Lin J, Khade R, Li Y (2012) Rotation-invariant similarity in time series using bag-of-patterns representation.
J Intell Inf Syst 39(2):287–315
Lines J, Bagnall A (2014) Time series classification with ensembles of elastic distance measures. Data Min
Knowl Discov 29(3):565–592. doi:10.1007/s10618-014-0361-2
Liu J, Wang Z, Zhong L, Wickramasuriya J, Vasudevan V (2009) uWave: Accelerometer-based personalized
gesture recognition and its applications. IEEE International conference on pervasive computing and
communications, pp 1–9
Lowe DG (1995) Similarity metric learning for a variable-kernel classifier. Neural Comput 7(1):72–85
Marteau PF (2009) Time warp edit distance with stiffness adjustment for time series matching. IEEE Trans
Pattern Anal Mach Intell 31(2):306–318. doi:10.1109/TPAMI.2008.76
Nemenyi P (1963) Distribution-free multiple comparisons. Princeton University, Princeton
Olszewski RT (2012) http://www.cs.cmu.edu/~bobski/. Accessed June 10
R Core Team (2014) R: A language and environment for statistical computing. R Foundation for Statistical
Computing, Vienna. http://www.R-project.org/
Rakthanmanon T, Campana B, Mueen A, Batista G, Westover B, Zhu Q, Zakaria J, Keogh E (2012) Searching
and mining trillions of time series subsequences under dynamic time warping. In: Proceedings of the
18th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’12.
ACM, New York, pp 262–270
Rakthanmanon T, Keogh E (2013) Fast shapelets: a scalable algorithm for discovering time series shapelets.
In: Proceedings of the SIAM international conference on data mining (SDM), pp 668–676. doi:10.1137/1.9781611972832.74
Ratanamahatana C, Keogh E (2005) Three myths about dynamic time warping data mining. In: Proceedings
of SIAM international conference on data mining (SDM05), vol 21, pp 506–510
Ratanamahatana CA, Lin J, Gunopulos D, Keogh E, Vlachos M, Das G (2010) Mining time series data.
In: Maimon O, Rokach L (eds) Data mining and knowledge discovery handbook. Springer, Berlin, pp
1049–1077
Shieh J, Keogh E (2008) iSAX: indexing and mining terabyte sized time series. In: Proceedings of the 14th
ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’08. ACM,
New York, pp 623–631
Stefan A, Athitsos V, Das G (2013) The move-split-merge metric for time series. IEEE Trans Knowl Data
Eng 25(6):1425–1438. doi:10.1109/TKDE.2012.88
Sübakan YC, Kurt B, Cemgil AT, Sankur B (2014) Probabilistic sequence clustering with spectral learn-
ing. Dig Signal Process 29(0):1–19. doi:10.1016/j.dsp.2014.02.014. http://www.sciencedirect.com/
science/article/pii/S1051200414000517
Wang Q, Megalooikonomou V, Faloutsos C (2010) Time series analysis with multiple resolutions. Inf Syst
35(1):56–74
Wang X, Mueen A, Ding H, Trajcevski G, Scheuermann P, Keogh E (2013) Experimental comparison of
representation methods and distance measures for time series data. Data Min Knowl Discov 26(2):275–
309

