
JID: IS

ARTICLE IN PRESS [m5G;April 18, 2018;12:52]


Information Systems 000 (2018) 1–11

Contents lists available at ScienceDirect

Information Systems
journal homepage: www.elsevier.com/locate/is

Searching for variable-speed motions in long sequences of motion capture data
Jan Sedmidubsky∗, Petr Elias, Pavel Zezula
Masaryk University, Botanicka 68a, Brno, 602 00, Czechia

Article info

Article history:
Received 10 March 2017
Revised 26 October 2017
Accepted 11 April 2018
Available online xxx

Keywords:
Content-based retrieval
Motion capture data
Subsequence matching
Speed-invariant retrieval
Similarity measure
Hierarchical segmentation
Indexing

Abstract

Motion capture data digitally represent human movements by sequences of body configurations in time. Subsequence searching in long sequences of such spatio-temporal data is difficult as query-relevant motions can vary in execution speeds and styles and can occur anywhere in a very long data sequence. To deal with these problems, we employ a fast and effective similarity measure that is elastic. The property of elasticity enables matching of two overlapping but slightly misaligned subsequences with a high confidence. Based on the elasticity, the long data sequence is partitioned into overlapping segments that are organized in multiple levels. The number of levels and sizes of overlaps are optimized to generate a modest number of segments while being able to trace an arbitrary query. In a retrieval phase, a query is always represented as a single segment and fast matched against segments within a relevant level without any costly post-processing. Moreover, visiting adjacent levels makes possible subsequence searching of time-warped (i.e., faster or slower executed) queries. To efficiently search on a large scale, segment features can be binarized and segmentation levels independently indexed. We experimentally demonstrate effectiveness and efficiency of the proposed approach for subsequence searching on a real-life dataset.
© 2018 Elsevier Ltd. All rights reserved.

1. Introduction

Current motion capturing technologies can accurately record human motion at high spatial and temporal resolutions. The recorded motion is represented as an ordered sequence of poses that describe skeleton configurations in corresponding video frames. The skeleton configuration is represented by a set of 3D coordinates determining positions of the captured body joints in space. The recorded motion sequences are used in a variety of applications, e.g., in healthcare to recognize movement disorders, in sports to analyze performances of top athletes, or in computer animation to browse large databases of human motions for production of realistic-looking games or movies. These applications require effective and efficient search operations to increase the reusability and findability of data that were expensive to record.

A search operation is primarily specified by a query object that can be either selected as an existing example [1], or modeled by special interfaces such as hand-drawn sketches [2] or puppet models [3]. We especially focus on the query-by-example subsequence matching operation: Given a short query sequence and a very long data sequence, search the data sequence and locate its subsequences that are the most similar to the query sequence. For example, find occurrences of perfect backflip landings within hundreds of hours of exercise recordings. Locating query-relevant subsequences is a hard task since their lengths and positions (i.e., beginnings and endings) are unknown. Moreover, the query cannot be anticipated in advance and need not correspond to any semantic or known action class, so textual-annotation-based retrieval cannot be applied. To deal with these problems, a fine segmentation technique along with an effective similarity measure is needed.

The contribution of this paper is an efficient subsequence matching approach that is schematically illustrated in Fig. 1. The retrieval process is based on a multi-level segmentation structure that produces a minimum number of segments with respect to an elasticity property of a similarity measure. The elasticity allows segments to be shifted by much more than a single frame, which increases overall search performance. We further binarize segment features and employ a disk-based index structure to index individual segmentation levels independently and thus access them in parallel. This enables real-time searching in long motion sequences, spanning even dozens of days. A proposed speed-invariant retrieval algorithm additionally supports searching for query-relevant motions that are executed faster or slower. The whole multi-level segmentation structure is customizable to compromise between search performance and accuracy. It is also dynamic to support the addition of new content.

∗ Corresponding author.
E-mail address: xsedmid@fi.muni.cz (J. Sedmidubsky).

https://doi.org/10.1016/j.is.2018.04.002
0306-4379/© 2018 Elsevier Ltd. All rights reserved.

Please cite this article as: J. Sedmidubsky et al., Searching for variable-speed motions in long sequences of motion capture data, Infor-
mation Systems (2018), https://doi.org/10.1016/j.is.2018.04.002

Fig. 1. Flowchart diagram presenting a general overview on the proposed subsequence matching approach in motion capture data.

2. Related work

Subsequence matching methods for motion capture data generally require (1) a segmentation technique to partition a data sequence into meaningfully long data segments, (2) a similarity measure to compare query and data segments, and (3) a retrieval algorithm to efficiently localize query-similar subsequences by grouping the most relevant data segments.

2.1. Segmenting motions

A segmentation technique partitions the long data sequence into short segments that are directly comparable with segment(s) of the query sequence. For specific applications, data segments can be identified with respect to observed changes in repeating movement patterns [4] or changes in pose distribution [5]. The alternative is a semantic segmentation which localizes non-overlapping segments corresponding to some of the predefined classes (e.g., walking, kicking and jumping) [1,6,7]. Semantic and domain-specific segmentations are suitable for discovering parts that comply with the predefined semantics. On the other hand, they are rather inconvenient for subsequence retrieval in general because of their poor ability to locate queries that do not belong to any predefined semantic class, such as transitional motions in between two consecutive actions. Consequently, semantic segmentations are better applicable in search scenarios where query categories are anticipated, for example in event detection or stream annotation.

This problem is traditionally overcome by a fine-grained segmentation that partitions both the query and the data sequence into short segments of a fixed size. The query can be partitioned into overlapping segments using the sliding-window principle and the data sequence into disjoint (i.e., non-overlapping) segments to reduce the data replication, or vice versa [8,9]. In both cases, however, query segments need to be matched with data segments respecting their temporal order. The temporal alignment of candidate matches can be very computationally demanding as search complexity rises with the query length. We avoid this problem by considering a query as a single segment. Such a query segment is evaluated against a multi-level overlapping segmentation to ensure that an arbitrary query-relevant data subsequence (bounded in length by a user) highly overlaps with at least one data segment of a similar length. For more details regarding existing online, semi-online and offline segmentation techniques we refer to a study in [10].

2.2. Comparing similarity

A variety of similarity measures exists to determine pair-wise proximity of motions. The first group of similarity measures directly exploits the time-series representation of motion data. Multi-dimensional time series of body joints (positions or angles) are compared by temporal alignment techniques, such as Dynamic Time Warping (DTW) [11,12] and its variants [13–15], or Longest Common Subsequence (LCS) [16]. Although such techniques deal well with temporal discrepancies, such as faster and slower movements of otherwise same actions, their comparison has quadratic time complexity. Multiple time series can be represented by shapelets that have a great descriptive power and prove to be very effective and efficient in binary classification [17]. Multivariate time series can further be transformed into univariate time series, such as in [18], where motion primitives are obtained by a hierarchical clustering. A comprehensive review on time-series mining can be found in [19].

The feature-based comparison constitutes the second family of approaches. Feature extraction is applied to discover the key motion characteristics in lower-dimensional vector spaces. The string-matching Smith–Waterman distance [20,21] is applied to match motions based on their string representation assembled from a vocabulary of motion features. Features can be either carefully chosen manually [22] or acquired by supervised learning using Neural Networks [23,24] or Support Vector Machines [25]. Deep [26,27] and recurrent [23] neural networks constitute the best-performing methods in recognizing actions. Despite being very effective, they are inherently designed for classifying actions and are hardly ever employed to quantify a pair-wise similarity, which is required for the task of subsequence matching. Fortunately, some exceptions exist where discriminative feature vectors are extracted and used for a similarity comparison. For example in [24], the 160-bit features extracted from a deep auto-encoder are quickly compared by the Hamming distance. Other metric functions are commonly used to compare feature vectors, such as the weighted Euclidean distance in [28].

In this work we employ a convolutional neural network trained on the motion data domain to achieve recognition rates comparable with state-of-the-art classifiers as demonstrated in [29]. Importantly, we benefit from a fixed-size 4096-dimensional highly descriptive feature representation that is extracted from the last hidden layer of the network. The proposed similarity measure is thus a fundamental building block for the design of our subsequence matching framework. This is particularly reflected in the segmentation structure that is tailored to maximally utilize the expedient properties of the employed similarity measure.

2.3. Indexing and retrieving motions

By conveniently combining a segmentation technique and a similarity measure, a subsequence retrieval algorithm is used to locate query-relevant parts within a very long, but already segmented, data sequence. Sequential search is commonly used, such as the A-LTK method [30] or the string-matching-based Knuth–Morris–Pratt search algorithm [31]. However, a sequential scan does not provide much scalability for searching in large volumes of motion data. To significantly speed up similarity searching, multi-dimensional or metric-based index structures can be utilized [32]. To efficiently access data segment features, a self-organizing map is used in [33] or a trie-based structure in [34]. The M-Index structure in [35] is used to efficiently search for the key poses that are the most similar to selected poses of a query sequence. The obtained ranked sets of candidate poses are post-processed in temporal order to identify query-relevant subsequences. In this paper, we also focus on large-scale subsequence searching. By utilizing the disk-based PPP-Codes index [36], we are able to search a several-day long sequence within a single second, which outperforms the best-performing approaches. Both effectiveness and efficiency of our retrieval algorithms are evaluated and compared with related work in more detail in the experimental part.

2.4. Our contributions

We extend our recent idea [37] by proposing a new speed-invariant retrieval algorithm for subsequence matching in motion capture data. Speed invariance reflects the ability to search for motions executed faster or slower but otherwise similar to a query. The algorithm employs a segmentation technique that organizes data segments within a multi-level structure. This multi-level structure ensures the traceability of an arbitrary query-relevant data subsequence, which is bounded in its size. In particular, each level of this structure keeps overlapping data segments of a given size. The number of levels and the number of segments within each level are constructed to be minimal with respect to the query-length limits and the ability of a similarity measure to compare slightly cropped/extended/shifted motions.

We employ a similarity measure which is tolerant to slower/faster and slightly changed motions and also generates fixed-size feature vectors for motions of variable lengths. As the fixed-size features are compared by the Euclidean distance, the segments in each level can be efficiently indexed. Since we primarily focus on search performance, we introduce a new transformation of feature vectors into a bit representation. The bit representation further decreases space occupancy and search costs by an order of magnitude with a negligible impact on search effectiveness. We also study large-scale subsequence matching to enable searching potentially in a 48-day long motion sequence within one second.

3. Similarity of motion data

To compare any pair of rather short motion sequences, we adopt a similarity measure proposed in [29]. It uses very effective and fixed-size feature vectors extracted from motions of variable lengths and compares them by the efficient Euclidean distance. The feature extraction is based on visualizing the normalized joint trajectories into a 2D image, fine-tuning a deep convolutional neural network by the generated images, and extracting the 4096D feature vector of a high descriptive power from the last hidden network layer. The whole feature extraction process takes about 25 ms.

The employed similarity measure has very convenient properties for segmentation-based subsequence matching because it (1) extracts very effective and fixed-size 4096-dimensional feature vectors for motions of variable lengths, (2) compares these vectors by the efficient Euclidean distance that can be indexed, (3) tolerates a non-trivial degree of segmentation error when comparing similar motions, and (4) tolerates faster and slower executions of the same motions. The important properties are described in the following paragraphs.

• Effectiveness. A high descriptive power of feature vectors can be further increased by fine-tuning the neural network for the specific application purpose, e.g., by feeding the network with training samples of categorized motion images. Such a fine-tuned network generates feature vectors that not only better distinguish motions that are present during the training phase, but also cluster similar samples on which the network has never been explicitly trained. This is especially true when training samples come from a diverse dataset of actions.
• Efficiency and indexability. The fixed size of features for variably long motions enables a fast similarity calculation by the Euclidean distance, without the need of using any expensive warping technique. The fixed size of features also enables utilizing disk/main-memory-based metric index structures [32]. The float-number values of feature vectors can be further binarized to occupy less memory space and to provide much faster comparison by the Hamming distance.
• Elasticity. This property denotes the ability of the similarity measure to tolerate “imprecisely” segmented motion sequences. It means that the similarity measure returns a near-zero distance between a given action (e.g., jump) and the same action with some additional/cropped frames. Fig. 2a/b shows how these modifications influence the accuracy on the action recognition scenario.
• Speed-invariance. The same action, such as running, can be performed faster or slower. Fig. 2c shows that the proposed similarity measure reliably recognizes actions which are performed faster or slower. Even a 10-times faster/slower execution of the same action can be recognized with a negligible error.

Fig. 2. Tolerance of the similarity measure on the action recognition scenario evaluated by 1-NN queries on a database of 1464 HDM05 motions divided into 15 categories. Each annotated query action is compared against all the remaining actions that are modified by (a) removing/adding both beginning and ending frames, (b) shifting actions to remove/add a preceding/following frame content and (c) accelerating/decelerating actions (i.e., by removing/adding frames uniformly within actions). In particular, the database actions are (a) cropped/expanded by 10–90%, (b) shifted by 10–90%, and (c) 2–10-times accelerated/decelerated with respect to their original frame content.

Elasticity, as the ability to recognize imprecisely segmented motions, is a very synergistic property that is highly leveraged in the proposed segmentation method. It allows for much bigger shifts between consecutive overlapping segments compared to the conventional methods [1,8]. Hand-in-hand with the elasticity of the similarity measure, the amount of the shift governs the trade-off between the search accuracy and performance. To achieve the maximum versatility, a user-defined covering-factor parameter is introduced in the following section. This parameter strongly influences the segmentation policy as it denotes strict requirements on the allowed spacing of segments. The lower the spacing, the higher the data traceability, but also the higher the space and performance demands.

Additionally, the speed-invariance property brings the advantage of searching for faster and slower motion executions. The degree of tolerated speed-up is also controlled by a user-defined stiffness parameter. This parameter determines how the segmentation structure is traversed to find desirably faster or slower subsequences with respect to a query.

4. Multi-level segmentation structure of data sequences

To search for query-relevant subsequences, the data sequence has to be partitioned into segments. Traditional methods [8] suggest partitioning the data sequence into disjoint (non-overlapping) segments, and the query sequence into overlapping segments using the sliding window principle (or vice versa). Such partitioning facilitates locating relevant data segments that are similar to some query segments. Although the data segments can be indexed and efficiently retrieved, this concept has the following disadvantages:

• Longer queries are partitioned into a higher number of query segments. For each query segment an independent search (i.e.,

sub-query) has to be executed to retrieve the most similar data segments;
• The retrieved data segments of all sub-queries have to be intelligently merged, respecting the chronological order of query and retrieved segments, to construct a set of relevant subsequences as the query result.

To overcome these problems, we propose to consider the query as a single segment. It means that only a single search is executed without the need of any other merging procedure. However, this would require sliding data segments for every potential query size, which results in a huge number of data segments. This number can be dramatically reduced when the used similarity function can deal with a certain versatility in segmentation: sliding data segments can then be shifted by much more than a single frame and can be constructed just for specific sizes of queries.

4.1. Problem formalization

We partition the data sequence into segments in a way that an arbitrary data subsequence (bounded in length) overlaps with at least one segment in the majority of frames. Consequently, having a query as a single segment, each query-relevant data subsequence highly overlaps with at least one data segment. The high overlap ensures that relevant subsequences are always findable just by searching for similar segments. To quantify the high overlap between the specific subsequence and segment, we define covering factor $cf \in [0, 1)$ which determines the maximum ratio between the number of their non-overlapping frames and the segment length. In other words, the covering factor quantifies the tolerance of the similarity function towards cropped/added content of two similar motions. The following definition defines the covering factor formally.

Definition 1. Given data sequence $m$ and covering factor $cf \in [0, 1)$: We say that any subsequence $m[i' : j']$ is cf-covered by segment $m[i : j]$ if and only if

$$\frac{|i' - i| + |j' - j|}{j - i} \le cf.$$

Our objective is to partition the data sequence into segments having optimal sizes and minimum possible overlaps with respect to the covering factor. To ensure that an arbitrary subsequence is covered by at least one segment, we need to restrict the subsequence length by the minimum $l_{\min} \in \mathbb{N}$ and maximum $l_{\max} \in \mathbb{N}$ value (the maximum length is supposed to be much shorter than the length of the data sequence). According to these limits, we partition the data sequence according to the following objective.

Objective 1. Given data sequence $m$ and minimum $l_{\min}$ and maximum $l_{\max}$ subsequence length: Partition sequence $m$ into a minimum number of segments so that an arbitrary subsequence $m[i : j]$ (bounded in length $l_{\min} \le j - i \le l_{\max}$) is cf-covered by at least one segment.

4.2. Multi-level segmentation

A query sequence $m_Q$ is also restricted to a limited length in $[l_{\min}, l_{\max}]$ and always considered as a single segment. To partition the data sequence $m$ according to Objective 1, we need to cover all potential query-relevant subsequences. Since positions (beginning and ending frames) of relevant hits are not known in advance, all possible data subsequences of the restricted length have to be covered.

Our idea is to define segmentation levels responsible for groups of queries in certain length intervals. Each level has its own size of segments that overlap by a fixed-size margin to cf-cover an arbitrary data subsequence. We naturally aim to minimize the number of such levels as well as the size of overlaps between segments with respect to the predefined covering factor $cf$. These observations imply the following important lemma.

Lemma 1. A single segmentation level with segments of fixed size $l$ can cf-cover the subsequences having their lengths in range $[l \cdot (1 - cf),\ l \cdot (1 + cf)]$.

Proof. According to Lemma 1, a single level with segments of fixed size $l$ can cover only the subsequences which are maximally $l \cdot (1 + cf)$ long. Suppose that this statement is not true; then some segment $m[i : j]$ can also cover subsequence $m[i' : j']$ which is longer than $l \cdot (1 + cf) = (j - i) \cdot (1 + cf)$ frames:

$$
\begin{aligned}
j' - i' &> (j - i) \cdot (1 + cf) \\
j' - i' &> j - i + j \cdot cf - i \cdot cf \\
j' - i' &> j - i + cf \cdot (j - i) \\
j' - i' - j + i &> cf \cdot (j - i) \\
\frac{j' - i' - j + i}{j - i} &> cf \\
\frac{|i - i'| + |j' - j|}{j - i} &> cf,
\end{aligned}
$$

where the last step uses $j' - j + i - i' \le |j' - j| + |i - i'|$. This is in contradiction with Definition 1. Similarly, a segment of $l$ frames can cover the subsequences which have minimally $l \cdot (1 - cf)$ frames. Due to the analogy with the previous case, the proof is omitted. □

4.2.1. Lengths of segments

To minimize the total number of segmentation levels (i.e., also the total number of segments), the individual levels have to cover subsequences of the largest possible length interval. At the same time, the first level with segments of length $l^1$ needs to cover the shortest possible subsequences of length $l_{\min}$:

$$l_{\min} = l^1 \cdot (1 - cf) \;\Leftrightarrow\; l^1 = \frac{l_{\min}}{1 - cf}.$$

Consequently, this level also covers the subsequences which are up to $l^1 \cdot (1 + cf)$ frames long (based on Lemma 1). The second


level then covers the subsequences of at least $l^1 \cdot (1 + cf)$ frames, so $l^2 = l^1 \cdot (1 + cf)/(1 - cf)$. Similarly to the first level, the second one covers maximally the subsequences of $l^2 \cdot (1 + cf)$ frames. This continues until the $n$th segmentation level covers the longest possible subsequences of $l_{\max}$ frames:

$$l^{n-1} \cdot (1 + cf) < l_{\max} \le l^n \cdot (1 + cf) \;\Leftrightarrow\; l^n \ge \frac{l_{\max}}{1 + cf}.$$

Respecting these properties, the segment length is determined by constructing the individual levels. The fixed length $l^r$ of segments at the $r$th level ($r \in [1, n]$) can be recursively defined as:

$$l^1 = \frac{l_{\min}}{1 - cf}, \qquad l^r = l^{r-1} \cdot \frac{1 + cf}{1 - cf}. \tag{1}$$

4.2.2. Number of segmentation levels

The number $n$ of segmentation levels can be calculated as $n = x + 1$, where $x$ denotes the power parameter needed to skip to another level and is computed as:

$$
\begin{aligned}
\frac{l_{\min}}{1 - cf} \cdot \left(\frac{1 + cf}{1 - cf}\right)^{x} &= \frac{l_{\max}}{1 + cf} \\
\left(\frac{1 + cf}{1 - cf}\right)^{x} &= \frac{l_{\max} \cdot (1 - cf)}{l_{\min} \cdot (1 + cf)} \\
x \cdot \log\left(\frac{1 + cf}{1 - cf}\right) &= \log\left(\frac{l_{\max} \cdot (1 - cf)}{l_{\min} \cdot (1 + cf)}\right) \\
x &= \log_{\frac{1 + cf}{1 - cf}}\left(\frac{l_{\max} \cdot (1 - cf)}{l_{\min} \cdot (1 + cf)}\right)
\end{aligned}
$$

$$\Rightarrow\; n = \left\lceil \log_{\frac{1 + cf}{1 - cf}}\left(\frac{l_{\max} \cdot (1 - cf)}{l_{\min} \cdot (1 + cf)}\right) \right\rceil + 1. \tag{2}$$

4.2.3. Overlaps of segments

The size of overlap among segments in each level is selected to minimize the number of segments while they all together cf-cover all possible subsequences bounded in length $[l_{\min}, l_{\max}]$. The lowest possible overlap we can afford corresponds exactly to $100 \cdot (1 - cf)\,\%$ of frames with respect to the segment length. The initial position $s^r_j$ of the $j$th segment at the $r$th level is then recursively defined as:

$$s^r_1 = 1, \qquad s^r_j = s^r_{j-1} + l^r \cdot cf. \tag{3}$$

Lemma 2. The segments at the specific $r$th level can be shifted by at most $l^r \cdot cf$ frames to cf-cover any subsequence of length in $[l^r \cdot (1 - cf),\ l^r \cdot (1 + cf)]$.

Proof. Given the specific $r$th segmentation level with segments of fixed length $l^r$ and an arbitrary subsequence $m[i, j]$ belonging to this level, then:

1. We show that the shift between segments of $l^r \cdot cf$ frames is of a maximum possible size. Assume that the shift is higher, i.e., $l^r \cdot cf + 1$ frames, and there exist two consecutive segments $m'[i : i + l^r]$ and $m''[i + l^r \cdot cf + 1 : i + l^r \cdot cf + 1 + l^r]$ between which subsequence $m\left[i + \frac{l^r \cdot cf + 1}{2},\ i + \frac{l^r \cdot cf + 1}{2} + l^r\right]$ of the same length $l^r$ is located. Then this subsequence has the same number of non-overlapping frames with both the segments $m'$ and $m''$. Considering $m'$ and according to Definition 1, then:

$$\frac{\left|i + \frac{l^r \cdot cf + 1}{2} - i\right| + \left|i + \frac{l^r \cdot cf + 1}{2} + l^r - (i + l^r)\right|}{l^r} \le cf \;\Rightarrow\; \frac{\left|\frac{l^r \cdot cf + 1}{2}\right| + \left|\frac{l^r \cdot cf + 1}{2}\right|}{l^r} \le cf \;\Rightarrow\; \frac{l^r \cdot cf + 1}{l^r} \le cf \;\Rightarrow\; cf + \frac{1}{l^r} \le cf,$$

which is not valid for any (positive) length $l^r$ of segment.

2. It can be shown that an arbitrary subsequence of length in range $[l^r \cdot (1 - cf),\ l^r \cdot (1 + cf)]$ is covered by at least one segment. This can be proven via an induction step. □

Fig. 3 illustrates the multi-level segmentation structure along with segments generated at the first two levels.

4.2.4. Number of segments

When we partition the data sequence $m$ of $|m|$ frames, the number $n^r$ of segments at the $r$th level is determined as:

$$n^r = 1 + \left\lceil \frac{|m| - l^r}{l^r \cdot cf} \right\rceil. \tag{4}$$

4.2.5. Replication factor

The covering factor $cf$ has the most important influence on the number of segmentation levels as well as the total number of generated segments. To get an idea about “global” overlaps, we define the replication factor $rf$ that indicates how many times the same frame is repeated in segments. Supposing that the data sequence is much longer than the segments, we express the replication factor $rf$ as:

$$rf \approx \frac{n}{cf}, \tag{5}$$

where $n$ stands for the number of segmentation levels. For example, having the covering factor $cf = 0.2$ and four segmentation levels, each frame of the original data sequence is involved twenty times in the specific segments.

4.3. Index construction

The data sequence is preprocessed to be partitioned into the multi-level segmentation structure. Having specified covering factor $cf$ and minimum $l_{\min}$ and maximum $l_{\max}$ query length, the number $n$ of segmentation levels is calculated according to Eq. (2). In each $r$th level ($r \in [1, n]$), the data sequence is partitioned into $n^r$ segments of a fixed length of $l^r$ frames by applying Eqs. (1) and (3). As the data sequence is partitioned, each segment is independently processed to extract the 4096-dimensional feature vector using the deep convolutional neural network; see [29] for more detailed information.

The feature vectors within each level can also be independently indexed to speed up the retrieval process. As the feature vectors are compared by the Euclidean distance, any metric-based index structure can be utilized. We compare the naive sequential scan with the usage of an indexing structure in the experimental evaluation in Section 6. In addition, the whole multi-level structure is dynamic because it simply enables adding the feature vectors of segments of new data sequences into each level.

5. Subsequence retrieval in multi-level segmentation structure

Within the preprocessing phase a user specifies three compulsory parameters (covering factor $cf$, minimum $l_{\min}$ and maximum $l_{\max}$ length of query) so that the multi-level segmentation structure can be constructed for a long data sequence. The objective of the retrieval phase is to search the long data sequence and locate its subsequences that are the most similar to a short query sequence, which is bounded in length $[l_{\min}, l_{\max}]$. We approximate the localization of such subsequences by traversing the multi-level structure and identifying data segments that are the most similar to the query. Although the identified data segments need not be perfectly aligned with query-relevant subsequences, they should overlap in the majority of frames, i.e., relevant subsequences are cf-covered by identified data segments according to Lemma 1.
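The coverage guarantee that this retrieval step relies on can be checked by brute force on a toy level. The Python sketch below is illustrative only and rests on several assumptions that are not part of the paper: frames are 0-based, segments and subsequences are half-open intervals `(start, end)` whose length is `end - start`, the segment length and shift are integers, and the helper names `is_cf_covered`, `segment_starts` and `all_covered` are hypothetical. The covering factor is kept as an exact fraction so that boundary cases are not affected by floating-point rounding.

```python
import math
from fractions import Fraction


def is_cf_covered(sub, seg, cf):
    """Definition 1 for half-open frame intervals sub = (i', j') and seg = (i, j)."""
    (i2, j2), (i, j) = sub, seg
    # Integer form of (|i' - i| + |j' - j|) / (j - i) <= cf, so no float rounding.
    return (abs(i2 - i) + abs(j2 - j)) * cf.denominator <= cf.numerator * (j - i)


def segment_starts(m_len, seg_len, shift):
    """0-based variant of Eq. (3): each segment starts `shift` frames after the previous one."""
    return list(range(0, m_len - seg_len + 1, shift))


def all_covered(m_len, seg_len, cf, shift):
    """Brute-force check that every subsequence in the level's length range is cf-covered."""
    segments = [(s, s + seg_len) for s in segment_starts(m_len, seg_len, shift)]
    shortest = math.ceil(seg_len * (1 - cf))    # l * (1 - cf), lower bound of Lemma 1
    longest = math.floor(seg_len * (1 + cf))    # l * (1 + cf), upper bound of Lemma 1
    for length in range(shortest, longest + 1):
        for start in range(m_len - length + 1):
            sub = (start, start + length)
            if not any(is_cf_covered(sub, seg, cf) for seg in segments):
                return False
    return True


cf = Fraction(1, 5)              # covering factor 0.2, kept exact
seg_len = 125                    # first-level segment length for l_min = 100
max_shift = int(seg_len * cf)    # l * cf = 25 frames

# Sequence lengths are chosen so that the last segment ends exactly at the sequence end.
print(all_covered(500, seg_len, cf, max_shift))        # shift allowed by Lemma 2
print(all_covered(489, seg_len, cf, max_shift + 1))    # shift one frame too large
```

Under these assumptions the first check succeeds and the second fails, which mirrors the two parts of Lemma 2: a shift of $l^r \cdot cf$ frames is sufficient, and a shift that is one frame larger leaves some subsequence uncovered.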


Fig. 3. Graphical illustration of segmentation: Based on covering factor $cf = 0.2$ and query length limits $l_{\min} = 100$ and $l_{\max} = 500$, the four segmentation levels are computed (left) and used to partition the data sequence (right), where only first- and second-level segments are visualized. E.g., the first-level segments can 0.2-cover any data subsequence of length in [100, 150].
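The configuration in Fig. 3 can be reproduced with a short calculation. The following Python sketch is a minimal illustration, not the authors' implementation: the function names `level_lengths`, `level_count` and `segment_count` are hypothetical, exact fractions are used only to avoid rounding at level boundaries, the rounding in Eq. (4) is assumed to be a ceiling, and the replication factor is taken as the number of levels divided by the covering factor, which agrees with the twenty-fold replication quoted in the text for $cf = 0.2$ and four levels.

```python
import math
from fractions import Fraction


def level_lengths(l_min, l_max, cf):
    """Eq. (1): l^1 = l_min / (1 - cf), l^r = l^(r-1) * (1 + cf) / (1 - cf).

    Levels are added until the last one reaches l_max, i.e. l^n * (1 + cf) >= l_max.
    """
    lengths = [Fraction(l_min) / (1 - cf)]
    while lengths[-1] * (1 + cf) < l_max:
        lengths.append(lengths[-1] * (1 + cf) / (1 - cf))
    return lengths


def level_count(l_min, l_max, cf):
    """Closed form of Eq. (2), evaluated with floating-point logarithms."""
    base = float((1 + cf) / (1 - cf))
    ratio = float(l_max * (1 - cf) / (l_min * (1 + cf)))
    return math.ceil(math.log(ratio, base)) + 1


def segment_count(m_len, seg_len, cf):
    """Eq. (4), assuming the rounding is a ceiling."""
    return 1 + math.ceil((m_len - seg_len) / (seg_len * cf))


cf = Fraction(1, 5)   # covering factor 0.2 as an exact fraction
lengths = level_lengths(100, 500, cf)
for r, l in enumerate(lengths, start=1):
    low, high = l * (1 - cf), l * (1 + cf)
    print(f"level {r}: l = {float(l):.3f}, covers [{float(low):.2f}, {float(high):.2f}]")
print("levels via Eq. (2):", level_count(100, 500, cf))
print("replication factor:", float(len(lengths) / cf))
```

The iterative construction and the closed form both give four levels for these parameters, and the first level covers lengths from 100 to 150 frames, as stated in the caption. In practice the non-integer lengths of the higher levels would have to be rounded to whole frames; how this is done is not specified here.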

The retrieval phase evaluates a k-nearest-neighbor (k-NN) query that is specified by the number k of the most similar data segments to be returned and sequence $m_Q$ as the query object. The query sequence is first preprocessed to extract its 4096-dimensional feature vector, which can then be compared to the feature vectors of data segments within the multi-level segmentation structure.

We propose two variants of retrieval algorithms. We first introduce a speed-dependent algorithm that returns data segments of similar length as the query sequence. Then, we present a speed-invariant algorithm capable of searching for subsequences that can be executed slower or faster and thus the lengths of corresponding data segments can differ much more.

5.1. Speed-dependent retrieval algorithm

The speed-dependent algorithm searches for relevant data segments having a similar length as the query sequence. In particular, we first localize a responsible level containing the segments of a similar length as the query and then search for query-similar data segments within that responsible level.

Based on Lemma 1, a segmentation level is responsible for queries of length within interval $[(1 - c_f) \cdot l^r, (1 + c_f) \cdot l^r]$, where $l^r$ is the length of segments within the rth level. In other words, all possibly-relevant data subsequences of length within such interval are $c_f$-covered by segments in the responsible level. Consequently, the query-responsible level contains the data segments whose length differs by at most about $c_f \cdot 100\%$, e.g., the length of retrieved data segments is at most about 20% shorter or longer with respect to the query length when $c_f = 0.2$. We determine the index of such responsible segmentation level for the query of $|m_Q|$ frames by the function $level(|m_Q|) \in \mathbb{N}$ as:

$$level(|m_Q|) = \begin{cases} 1 & |m_Q| \le l^{min} \cdot \frac{1 + c_f}{1 - c_f} \\ 1 + \left\lceil \log_{\frac{1 + c_f}{1 - c_f}} \frac{|m_Q| \cdot [1 - c_f]}{l^{min} \cdot [1 + c_f]} \right\rceil & \text{otherwise.} \end{cases} \qquad (6)$$

Considering the example in Fig. 3, the second level (i.e., $level() = 2$) is responsible for all queries ranging from 150 to 224 frames.

The advantage of this algorithm is that it accesses data segments within only a single segmentation level, which gives an upper bound on search efficiency: search complexity depends on the first level containing the highest number of segments. On the other hand, if relevant data subsequences are performed slower or faster than the query, they occur in surrounding levels of the accessed responsible level and thus this algorithm is not able to localize them.

5.2. Speed-invariant retrieval algorithm

The speed-invariant algorithm solves the problem of finding data segments that are relevant but have different lengths with respect to the query. For example, the same movement can be performed slower or faster and thus the lengths of its executions differ significantly. To solve this problem, we propose to search for relevant data segments not only in the responsible segmentation level but also in surrounding levels. The number of surrounding levels to be accessed is determined by the stiffness parameter $s_f \in (0, 1]$.

The stiffness parameter quantifies how much shorter and longer data segments can be retrieved with respect to the original query length $|m_Q|$. In other words, we want to search for those data segments that are relevant not only to the original query but also to its up to $\frac{1}{s_f}$-times slower or faster variants. For example, if $s_f = 1$, retrieval is the same as that of the speed-dependent algorithm; if $s_f = 0.5$, the algorithm searches for motion subsequences that can be executed up to two times slower or faster.

To guarantee retrieval of $\frac{1}{s_f}$-times slower/faster subsequences, we search for data segments in all levels that are responsible for at least one query length within interval $[|m_Q| \cdot s_f, |m_Q| \cdot \frac{1}{s_f}]$, where $|m_Q|$ stands for the actual query length. The indexes of the responsible levels can be determined as interval $[r_{min}, r_{max}]$ of integers, where $r_{min}$ and $r_{max}$ values are computed on the basis of function $level()$ as:

$$r_{min} = level(|m_Q| \cdot s_f), \qquad r_{max} = level\left(|m_Q| \cdot \frac{1}{s_f}\right). \qquad (7)$$

Considering the example in Fig. 3, the four levels (i.e., $r_{min} = 1$ and $r_{max} = 4$) are accessed by the retrieval algorithm for the query of 200 frames and stiffness parameter $s_f = 0.5$.

5.2.1. Query length restrictions

Since the speed-invariant algorithm accesses multiple levels based on the query length and user-defined stiffness parameter $s_f$, we need to either (1) create additional segmentation levels for queries approaching minimum/maximum $l^{min}$/$l^{max}$ bounds, or (2) restrict the admissible query-length interval, without the impact on the segmentation structure. We focus on the second option by restricting the query length to be within interval:

$$\left[ l^{min} \cdot \frac{1}{s_f},\; l^{max} \cdot s_f \right]. \qquad (8)$$

For instance, having the original admissible query-length interval [100, 500], we limit this interval into a new one [200, 250] by the possibility to retrieve up to two times slower/faster subsequences (i.e., $s_f = 0.5$).

5.3. Summary

The advantage of both retrieval algorithms against traditional subsequence search methods is their high efficiency which is ensured by:
1. comparing variable-length segments by fixed-size feature vectors,

2. having only one query segment evaluated against data segments,
3. presenting the results of a k-NN query directly without the need for any further post-processing, such as matching retrieved and query segments with respect to their temporal order [35].

From the effectiveness point of view, the great advantage is the possibility to retrieve segments of motions that are performed up to $\frac{1}{s_f}$-times slower/faster or are not perfectly aligned with relevant subsequences up to $100 \cdot c_f\,\%$ with respect to the query length. High effectiveness is ensured by the used similarity measure that is designed to tolerate imprecise segmentation and accelerated/slowed motions.

Table 1
Table of symbols.

Symbol                Description
$m$                   data sequence
$|m|$                 length of data sequence in number of frames
$m[i : j]$            subsequence of data sequence $m$ starting at the ith frame (inclusive) and ending at the jth frame (exclusive), i.e., $|m[i : j]| = j - i$
$m_Q$                 query sequence
$n$                   number of segmentation levels
$n^r$                 number of segments within the rth segmentation level
$l^r$                 length of segments at the rth segmentation level
$s^r_j$               starting frame of the jth segment at the rth segmentation level
rf                    replication factor
$c_f$, $s_f$          covering factor, stiffness (user-defined parameters)
$l^{min}$, $l^{max}$  minimum/maximum query length (user-defined parameters)

6. Experimental evaluation

We experimentally evaluate both effectiveness and efficiency of proposed subsequence retrieval algorithms that combine the multi-level segmentation structure with the elastic similarity measure to evaluate query-to-segment similarity. We also compare the results against existing subsequence matching approaches.

6.1. Dataset

Effectiveness and efficiency of both retrieval algorithms are evaluated on the motion capture dataset HDM05 [38]. This dataset contains 324 sequences performed by 5 different actors (with sampling frequency of 120 Hz). Similarly as in [1,35], we use a subset of 102 motion sequences (68 min in total) for which a ground truth is provided. This ground truth describes 1464 actions (subsequences within the 102 sequences) by 15 non-uniformly populated motion categories. The shortest action takes only 0.34 s (41 frames) while the longest one has 17.2 s (2063 frames).

6.2. Methodology

We concatenate 102 sequences into a single 68-min data sequence and set the minimum $l^{min} = 41$ and maximum $l^{max} = 2063$ query length according to the shortest and longest ground-truth actions. By considering five settings of covering factor $c_f \in \{0.1, 0.2, 0.3, 0.4, 0.5\}$, five different multi-level segmentation structures are built.

A segmentation structure is evaluated by k-nearest-neighbor (k-NN) queries for $k \in \{1, 5\}$. A k-NN query is constructed for each of 1464 ground-truth subsequences that are used as query objects. Using the speed-dependent or speed-invariant algorithm, the k most similar data segments are retrieved, excluding the segments that overlap with the query-object subsequence (i.e., exact match). We also exclude less-relevant segments that overlap with more relevant ones to finally obtain the k non-overlapping segments as the query result.

6.3. Analysis of the preprocessing phase

The computationally demanding preprocessing phase primarily involves the extraction of 4096-dimensional feature vectors for all data segments. The total number of segments depends on the choice of covering factor $c_f$. Its selection is always a trade-off between search performance and accuracy and depends on the requirements of a particular application. For instance, lower $c_f$ constitutes a higher overlap between consecutive segments and also a higher number of segmentation levels, implying a higher search accuracy. On the other hand, the data replication increases since the number of data segments grows rapidly. Using a GPU implementation, the extraction of feature vectors takes 25 ms on average for a single segment. Total extraction times for the 68-min dataset are depicted in the last column of Table 2. The analysis shows that settings $c_f \ge 0.2$ can be used for real-time processing.

Table 2
Analysis of the preprocessing phase of the 68-min subset of the HDM05 dataset for the minimum $l^{min} = 41$ and maximum $l^{max} = 2063$ query length and different settings of covering factor $c_f$. The "rf" column denotes the replication factor, i.e., how many times each motion frame is part of some data segment.

$c_f$   # of levels   # of segments (total)   # of segments (1st level)   rf      Feature ext. [min]
0.1     18            631,746                 111,774                     180.0   263.2
0.2     9             150,971                 51,230                      45.0    62.9
0.3     6             66,972                  31,526                      20.0    27.9
0.4     5             37,345                  21,955                      12.5    15.6
0.5     4             23,669                  16,393                      8.0     9.9

6.4. Analysis of effectiveness

Search effectiveness is measured for each k-NN query by precision as a ratio between retrieved true-positive segments and all k retrieved segments. The segment is considered as true positive if it overlaps with some ground-truth subsequence that is labeled with the same category as a query object. As the sparsest category in the dataset has only 6 motion instances, we analyze results for $k = 1$ and $k = 5$. The global precision is then averaged over all 1464 queries. We analyze both speed-dependent and speed-invariant retrieval algorithms.

6.4.1. Speed-dependent retrieval

A query is evaluated against data segments from a single level only to retrieve motions of a similar length as the query. The retrieved motions can be in an extreme case at most $(1 + c_f)$-times longer or $(1 - c_f)$-times shorter. As can be seen in Fig. 4, the precision @5 ranges from 82% to 87% and increases with a lower setting of covering factor $c_f$. A "finer-grained" segmentation linearly improves the search accuracy; however, the number of generated segments grows exponentially, as depicted on the right axis.

We further try to estimate how much error is introduced by the retrieval algorithm which is primarily influenced by the segmentation granularity and the quality of similarity measure. To quantify an error of the similarity measure, we evaluate its accuracy on the action recognition scenario. The 94.13% precision (i.e., the error of 5.87%) is achieved when the similarity measure is used to evaluate 1-NN queries on the ground-truth database of 1464 actions, without the necessity of any subsequence matching. Compared to the highest achieved subsequence retrieval precision of 90.12% (for $c_f = 0.1$ and $k = 1$), the decrease of 4.01 percentage points in accuracy is introduced by the segmentation. Such error is caused not only by (1) an imprecise alignment of data segments with respect

Fig. 4. Effectiveness of subsequence retrieval using the speed-dependent algorithm on the 68-min data sequence with different partitionings (i.e., various settings of covering factor $c_f$).

Fig. 5. Effectiveness of subsequence retrieval using the speed-invariant algorithm on the 68-min data sequence by evaluating artificially modified queries. Original queries are artificially modified to be up to 2x faster/slower. This speed-up/deceleration corresponds to stiffness $s_f \in [0.5, 1]$.

to the ground truth but also by (2) the increase in data volume from 1464 actions to 111,774 segments.

In case of the speed-dependent retrieval algorithm, the covering factor of $c_f = 0.2$ is recommended because it produces a reasonable number of data segments (4-times fewer segments) in balance with a high search accuracy (only 0.5% less) compared to $c_f = 0.1$.

6.4.2. Speed-invariant retrieval

A query is evaluated against data segments originating from multiple levels. This allows us to retrieve query-relevant subsequences that are performed faster or slower. Since the used 1464 ground-truth queries differ within a single category by about 25% in their lengths on average, we set the stiffness parameter as $s_f = 0.8$. Such setting improves the search precision @5 by only about 0.25 percentage points with respect to the speed-dependent algorithm.

Artificially accelerated/decelerated queries. As the used ground truth contains only a small number of intra-category motions that differ at least about 50% in their length, we artificially create faster/slower queries. A faster query is simulated by omitting some of its original frames, while a slower one is obtained by adding some artificial frames into the original motion by interpolating joint coordinates of surrounding frames. We first filter out 745 (out of 1464) queries that violate the query-length restriction in Eq. (8) for $l^{min} = 41$, $l^{max} = 2063$ and fixed stiffness $s_f = 0.5$. The remaining 719 queries are modified in lengths by accelerating/decelerating them randomly to be up to 2-times faster or slower.

When the modified queries are evaluated using the speed-dependent algorithm (which corresponds to the speed-invariant algorithm with setting $s_f = 1.0$), the achieved precision is 85.16% for $k = 5$. By using the speed-invariant algorithm, the precision starts increasing as the stiffness parameter decreases (see Fig. 5). The highest accuracies around 88.55% are obtained for $s_f \in [0.6, 0.7]$, which corresponds to an average size of query-length modification. Such significant gain in accuracy requires visiting only 2–3 segmentation levels in parallel on average. On the other hand, the precision decreases for stiffness $s_f \le 0.5$ since additionally visited levels no longer contain relevant results and only increase a chance for false matches.

6.5. Analysis of retrieval efficiency

Efficiency of the retrieval phase primarily depends on the total number of generated data segments, which is influenced by the setting of covering factor $c_f$. In case of the speed-invariant retrieval algorithm, search times are additionally influenced by user-defined stiffness parameter $s_f$ that determines the number of segmentation levels to be accessed with respect to a query. Nevertheless, segmentation levels can be accessed in parallel and then search times of both retrieval algorithms can be, for simplicity, considered as the same.

We analyze retrieval efficiency in the following three scenarios. First, the feature vectors of all data segments are stored in main memory and evaluated by the sequential scan approach. Second, original feature vectors are transformed into bit vectors to significantly reduce space occupancy and speed up retrieval in main memory. Third, a disk-based index structure is utilized to scale to very large databases.

6.5.1. Sequential scan in main memory

The most-populated first segmentation level gives the upper bound on search performance. Without any index structure, the feature vectors of data segments of the used 68-min dataset can be stored in main memory and accessed sequentially. By changing the covering factor from 0.5 to 0.1, the retrieval process needs from 66 to 447 ms to evaluate a single query by browsing from 16 k to 112 k data segments (see Table 2 for the number of generated data segments). Actual search times for particular settings of covering factor are also presented in Fig. 6 by the "Search time on original vectors" curve. These search times are measured on the speed-dependent algorithm by using a single CPU (i7 960 at 3.2 GHz) that can perform approximately 250,000 distance computations per second. For such a small dataset having hundreds of thousands of data segments, the sequential scan approach is sufficient.

6.5.2. Sequential scan in main memory using the bit-vector representation

To further significantly speed up the retrieval process, we transform original 4096-dimensional feature vectors of real numbers into 4096-dimensional vectors of bits. In particular, each dimension of a vector is transformed either into bit 1 in case the original dimension is a non-zero value, or into bit 0 when the original value is zero. The transformed 4096-dimensional bit vectors are then compared for similarity by the Hamming distance. By transforming the original features into bit vectors, we measure the impact on both search effectiveness and efficiency.

Even though our bit-vector implementation is not optimal (each distance computation creates a new BitSet object in Java), the performance gain is significant. Fig. 6 demonstrates about 20-times more efficient search of the bit-vector approach in comparison with the original vectors: a single CPU (i7 960 at 3.2 GHz) can compute in one second about 5.53 M Hamming distances on bit

Fig. 6. Trade-off between precision (left y axis) and search times (right y axis) for different settings of $c_f$ evaluated using the speed-dependent algorithm and 5-NN queries on original and transformed bit vectors.

vectors compared to 0.25 M Euclidean distances on original vectors. On the other hand, precision of the bit-vector representation is only about 2 percentage points worse on average with respect to the original vectors.

The bit-vector transformation also significantly decreases the space occupancy: a single original vector occupies about 16 kB (4 bytes per dimension), while a transformed bit vector takes only 0.5 kB (1 bit per dimension), which is 32-times less. This allows us to possibly read a database of 8 M motions (i.e., bit vectors) into 4 GB of main memory and evaluate a query by the sequential scan within 1.5 s.

6.5.3. Efficiency study on an indexed disk-oriented approach

Our objective is to efficiently search motion datasets having a total length in order of days or even months. However, such large motion datasets are still not available, e.g., the whole CMU (http://mocap.cs.cmu.edu/) and HDM05 datasets take together only about 12 h of motion data. To estimate search efficiency on large motion data, we adopt the experimental evaluation presented by Novak et al. [36]. They introduce an efficient disk-oriented approximate index structure, called the PPP-Codes, and evaluate it on very similar feature vectors. In particular, they use the same reference model of neural network to extract 4096-dimensional feature vectors for common photographs. Even if the domain of photographs is different from that of our motion images, the extracted feature vectors exhibit similar characteristics.

The experiment in [36] indexes 20 million image feature vectors that are stored on an SSD disk. In the retrieval phase, a single 1-NN query is evaluated in 800 ms on average, while achieving a 96% recall, where recall is the percentage of the same vectors retrieved by the PPP-Codes with respect to the sequential scan approach. Such a high recall value is reached by accessing only about 10,000 vectors out of 20 M. The level of PPP-Codes search approximation can be further controlled by a user to find an appropriate trade-off between search effectiveness and efficiency. In general, the PPP-Codes can access two orders of magnitude fewer vectors than the sequential scan, while achieving a very high recall. More detailed information about the experiment and the PPP-Codes structure is available in [36].

We could possibly employ the PPP-Codes to index each segmentation level independently. By considering the same settings as in Fig. 3 (i.e., covering factor $c_f = 0.2$ and minimum query length $l^{min} = 100$), the most-populated first level with 20 million data segments of length 125 frames corresponds to about a 48-day long motion. Such a very long sequence could then be searched in real time (i.e., up to 1 second) to obtain the most similar subsequence with respect to a query motion.

6.6. Comparison with state-of-the-art approaches

To the best of our knowledge, there is a limited number of approaches able to provide subsequence search in motion data on a large scale. Only the following approaches [30,31,33,35,39] fairly evaluate effectiveness and efficiency purely in the scenario of subsequence matching in motion data. We compare our approach with these papers from both points of view and also with recent well-performing classifiers in effectiveness and annotation-based methods that could potentially scale to large volumes of motion data in efficiency.

6.6.1. Effectiveness

As there is no established benchmark for measuring effectiveness of subsequence matching in motion data, it is hard to provide a fair quantitative comparison as dozens of different datasets (e.g., CMU, HDM05 [38], NTU [40]) are used. Moreover, some approaches [39] even use their own collection. We evaluate our approach on the same subset of 15 categories of the HDM05 dataset as the approach in [35], which achieves the precision of 83.7%. We outperform this result by reaching the precision of 89.65% using the speed-dependent algorithm (with setting $c_f = 0.2$). Even though other related works are evaluated on different datasets, they achieve qualitatively comparable or worse results than our approach. For example, the A-LTK 2.0 [30] reaches the top-match accuracy of up to 80% but on a very small dataset taking approximately 10 min in total. The probabilistic PCA-based approach [31] achieves about 80% accuracy on average over 5 pre-selected categories. The index-based approach [33] uses self-organizing maps and presents the accuracy of 77% on 13 different types of queries from the CMU dataset. The task of subsequence searching is performed using a user-controlled GUI in [41] whose effectiveness is hardly comparable as it is only supported by a field study in which domain experts analyze the quality of browsing experience by using different motion perspectives.

The majority of related work constitutes action-recognition methods that are commonly represented by purposely-trained classifiers [23,26,27,42]. We show that our approach is competitive to such classifiers. In particular, we employ our similarity measure and assign labels to already segmented queries using a simple 1-NN classifier, based on the labels of known training samples. We evaluate the results on challenging subsets of the HDM05 dataset

containing 65 and 130 categories [29]. The results presented in Table 3 show that we clearly outperform classifiers in [26,42] by utilizing the same amount of 50% of training data. On a simpler subset of the HDM05 dataset, which comprises only 65 categories, classifiers in [23,27] achieve the classification accuracy of 96.92% and 97.25%, respectively, using 90% of training data. We achieve a competitive accuracy of 93.52% using only 50% of training data. Such high-quality results indicate that the used similarity measure is well suited for processing motion capture data, e.g., for searching, subsequence matching, and recognizing actions.

Table 3
Comparison in effectiveness with the state-of-the-art classifiers on the HDM05 dataset with 65 and 130 categories.

Method               Amount of training data   Accuracy (%) HDM05-65   Accuracy (%) HDM05-130
Du et al. [23]       90%                       96.92                   N/A
Zhu et al. [27]      90%                       97.25                   N/A
Huang et al. [26]    50%                       N/A                     75.78
Laraba et al. [42]   50%                       N/A                     83.33
Our approach         50%                       93.52                   88.15

6.6.2. Efficiency

The method presented in [30] does not support indexing and requires dozens of seconds to evaluate a single query. Similarly, the approach in [39] needs more than 1 s to evaluate a single query within the dataset of only 1k motion sequences, which is several orders of magnitude less efficient than our approach. The indexed subsequence search in [33] requires 240 ms to evaluate a single query within the CMU subset comprising roughly 177k frames of training data. On the same dataset size, our approach requires 75 ms without indexing per query evaluation, even when using the original 4096-dimensional vectors. The algorithm proposed in [31] is much more efficient, taking 72 ms to search a much larger part of the CMU dataset of roughly 1.7M frames. If we apply the bit-vector representation on the same-size dataset, we need only 37 ms per query and still in a sequential way. Seemingly the fastest motion subsequence matching is provided in [34]. They index each feature (out of 43) within an independent trie-based structure that can be traversed within 6.5 ms on average. Depending on the number of user-defined query features, 25–40 ms are needed to search a 35-hour dataset. However, these times do not include additional post-processing costs such as accessing and merging candidate results. On a comparable dataset size, our PPP-Codes based structure needs only 3 ms to identify such results. Moreover, the approach in [34] lacks any evaluation of the search accuracy.

Additionally, our PPP-Codes index based approach has the bounded time complexity $\log(|m| / c_f)$ disregarding the query length $|m_Q|$, where $|m|$ denotes the length of data sequence. This is not true for approaches whose time complexity increases with the length of query, such as $|m| \cdot |m_Q|$ in [35] or $\log(|m| / c_f) \cdot |m_Q|$ in [43].

In summary, the synergy of the similarity measure along with the segmentation technique and bit-vector representation beats the existing approaches even in a sequential way. By applying the index structure, our approach can reach even higher efficiency and scale to very large datasets. Depending on the application scope and hardware requirements, our approach can be parametrized (using the $c_f$ parameter) to achieve an appropriate trade-off between speed and accuracy. Furthermore, it is designed to work in main memory or on disk in case of large volumes of motion recordings.

7. Conclusions

We propose a new speed-invariant subsequence matching algorithm that uses a synergy of elastic similarity measure and multi-level segmentation. The search space comprises overlapping segments of various sizes that ensure the bounded coverage of arbitrary parts within very long motion sequences. The size of overlaps and the total number of generated segments are bounded to be formally minimum with respect to the covering factor parameter, which reflects the versatility and effectiveness of the used similarity measure. Due to the efficient comparison of 4096-dimensional segment features by the Euclidean distance, the retrieval process is also very efficient. Even on a standard PC (i7 960 at 3.2 GHz, 4 GB RAM), about a 2-h (for $c_f = 0.2$) motion sequence can be searched in real time (< 1 s) for any query ranging from $l^{min} = 41$ frames to $l^{max} = 2063$ frames (17 s). Binarized features in combination with the Hamming distance enable in-memory real-time search in a much longer sequence of more than a day. The disk-based PPP-Codes index even increases the searchable length to a few dozen days. Moreover, sequences with a length of months could be searched within one second when query-length limits are reduced, a higher $c_f$ value is set, or parallel access of segmentation levels is considered. Besides high efficiency, we also achieve a high 1-NN search precision of 90.12% on a 68-min annotated sequence. In contrast to the existing solutions, our approach is able to localize query-relevant results that are executed faster or slower.

Acknowledgement

This research was supported by the Czech Science Foundation grant no. GA16-18889S.

References

[1] M. Müller, A. Baak, H.-P. Seidel, Efficient and robust annotation of motion capture data, in: ACM Symposium on Computer Animation (SCA 2009), ACM Press, 2009, p. 10.
[2] M.W. Chao, C.H. Lin, J. Assa, T.Y. Lee, Human motion retrieval from hand-drawn sketch, IEEE Trans. Vis. Comput. Graph 18 (5) (2012) 729–740.
[3] N. Numaguchi, A. Nakazawa, T. Shiratori, J.K. Hodgins, A puppet interface for retrieval of motion capture data, in: ACM SIGGRAPH/Eurographics Symposium on Computer Animation, in: SCA 2011, ACM, New York, NY, USA, 2011, pp. 157–166.
[4] A. Vögele, B. Krüger, R. Klein, Efficient unsupervised temporal segmentation of human motion, in: ACM Symposium on Computer Animation, 2014, pp. 167–176.
[5] J. Barbič, A. Safonova, J.-Y. Pan, C. Faloutsos, J.K. Hodgins, N.S. Pollard, Segmenting motion capture data into distinct behaviors, in: Graphics Interface, Canadian Human-Computer Communications Society, 2004, pp. 185–194.
[6] D. Bouchard, N. Badler, 7th International Conference on Intelligent Virtual Agents (IVA 2007), Springer Berlin Heidelberg, pp. 37–44.
[7] R. Lan, H. Sun, Automated human motion segmentation via motion regularities, Vis. Comput. 31 (1) (2015) 35–53.
[8] C. Faloutsos, M. Ranganathan, Y. Manolopoulos, Fast subsequence matching in time-series databases, SIGMOD Rec. 23 (2) (1994) 419–429.
[9] Y.-S. Moon, K.-Y. Whang, W.-K. Loh, Efficient time-series subsequence matching using duality in constructing windows, Inf. Syst. 26 (4) (2001) 279–293.
[10] J.F.S. Lin, M. Karg, D. Kulic, Movement primitive segmentation for human motion modeling: a framework for analysis, IEEE Trans. Hum. Mach. Syst. 46 (3) (2016) 325–339.
[11] C. Beecks, M. Hassani, J. Hinnell, D. Schüller, B. Brenger, I. Mittelberg, T. Seidl, Spatiotemporal similarity search in 3d motion capture gesture streams, in: 14th International Symposium on Advances in Spatial and Temporal Databases (SSTD 2015), Springer International Publishing, 2015, pp. 355–372.
[12] T.-Y. Kwak, Y.-J. Lee, A filtering method for searching similar multidimensional sequences under the time-warping distance, Inf. Syst. 28 (7) (2003) 791–813.
[13] B. Krüger, J. Tautges, A. Weber, A. Zinke, Fast local and global similarity searches in large motion capture databases, in: ACM Symposium on Computer Animation, in: SCA 2010, Eurographics Association, 2010, pp. 1–10.
[14] J. Valcik, J. Sedmidubsky, P. Zezula, Assessing similarity models for human-motion retrieval applications, Comput. Animat. Virtual Worlds 27 (5) (2016) 484–500.
[15] C. Beecks, M. Hassani, F. Obeloer, T. Seidl, Efficient query processing in 3D motion capture databases via lower bound approximation of the gesture matching distance, in: 2015 IEEE International Symposium on Multimedia (ISM 2015), 2015, pp. 148–153.


[16] C. Ren, X. Lei, G. Zhang, Motion data retrieval from very large motion databases, in: International Conference on Virtual Reality and Visualization (ICVRV 2011), 2011, pp. 70–77.
[17] L. Ye, E. Keogh, Time series shapelets: a novel technique that allows accurate, interpretable and fast classification, Data Min. Knowl. Discov. 22 (1) (2011) 149–182.
[18] F. Zhou, F. De la Torre, J.K. Hodgins, Hierarchical aligned cluster analysis for temporal clustering of human motion, IEEE Trans. Pattern Anal. Mach. Intell. 35 (3) (2013) 582–596.
[19] T.-c. Fu, A review on time series data mining, Eng. Appl. Artif. Intell. 24 (1) (2011) 164–181.
[20] S. Wu, Z. Wang, S. Xia, Indexing and retrieval of human motion data by a hierarchical tree, in: 16th ACM Symposium on Virtual Reality Software and Technology (VRST 2009), ACM Press, New York, NY, USA, 2009, pp. 207–214.
[21] J.-Y. Wang, H.-M. Lee, Recognition of human actions using motion capture data and support vector machine, in: World Congress on Software Engineering (WCSE 2009), 1, 2009, pp. 234–238.
[22] M. Müller, T. Röder, M. Clausen, Efficient content-based retrieval of motion capture data, in: ACM SIGGRAPH, ACM, 2005, pp. 677–685.
[23] Y. Du, W. Wang, L. Wang, Hierarchical recurrent neural network for skeleton based action recognition, in: Int. Conference on Computer Vision and Pattern Recognition (CVPR 2015), 2015, pp. 1110–1118.
[24] Y. Wang, M. Neff, Deep signatures for indexing and retrieval in large motion databases, in: 8th ACM Conference on Motion in Games, ACM, 2015, pp. 37–45.
[25] H. Kadu, C.-C. Kuo, Automatic human mocap data classification, IEEE Trans. Multimedia 16 (8) (2014) 2191–2202.
[26] Z. Huang, C. Wan, T. Probst, L. Van Gool, Deep learning on Lie groups for skeleton-based action recognition, arXiv:1612.05877 (2016) 1–10.
[27] W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, X. Xie, Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks, in: Thirtieth AAAI Conference on Artificial Intelligence, 2016, pp. 3697–3703.
[28] P. Elias, J. Sedmidubsky, P. Zezula, Motion images: an effective representation of motion capture data for similarity search, in: 8th International Conference on Similarity Search and Applications (SISAP 2015), Springer, 2015, pp. 250–255.
[29] J. Sedmidubsky, P. Elias, P. Zezula, Effective and efficient similarity searching in motion capture data, Multimed. Tools Appl. (2017) 1–22.
[30] K. Sugano, Y. Fang, K. Oku, K. Kawagoe, A coarse-to-fine method for subsequence matching of human behavior using multi-dimensional time-series approximation, in: 17th International Conference on Information Integration and Web-based Applications & Services (iiWAS 2015), ACM, New York, NY, USA, 2015, pp. 34:1–34:9.
[31] Z. Deng, Q. Gu, Q. Li, Perceptually consistent example-based human motion retrieval, in: Symposium on Interactive 3D Graphics and Games (I3D 2009), ACM, 2009, pp. 191–198.
[32] P. Zezula, G. Amato, V. Dohnal, M. Batko, Similarity Search: The Metric Space Approach, Advances in Database Systems, 32, Springer-Verlag, 2006.
[33] S. Wu, S. Xia, Z. Wang, C. Li, Efficient motion data indexing and retrieval with local similarity measure of motion strings, Vis. Comput. 25 (5) (2009) 499–508.
[34] M. Kapadia, I.-k. Chiang, T. Thomas, N.I. Badler, J.T. Kider Jr., Efficient motion retrieval in large motion databases, in: ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (I3D 2013), ACM, 2013, pp. 19–28.
[35] J. Sedmidubsky, J. Valcik, P. Zezula, A key-pose similarity algorithm for motion data retrieval, in: Advanced Concepts for Intelligent Vision Systems (ACIVS 2013), LNCS, 8192, Springer, 2013, pp. 669–681.
[36] D. Novak, J. Cech, P. Zezula, 8th International Conference on Similarity Search and Applications (SISAP 2015), Springer, pp. 237–243.
[37] J. Sedmidubsky, P. Elias, P. Zezula, Similarity searching in long sequences of motion capture data, in: 9th International Conference on Similarity Search and Applications (SISAP 2016), Springer, 2016, pp. 271–285.
[38] M. Müller, T. Röder, M. Clausen, B. Eberhardt, B. Krüger, A. Weber, Documentation Mocap Database HDM05, Technical Report, CG-2007-2, Universität Bonn, 2007.
[39] J.K. Tang, H. Leung, Retrieval of logically relevant 3D human motions by adaptive feature selection with graded relevance feedback, Pattern Recognit. Lett. 33 (4) (2012) 420–430. Intelligent Multimedia Interactivity.
[40] A. Shahroudy, J. Liu, T.-T. Ng, G. Wang, NTU RGB+D: a large scale dataset for 3D human activity analysis, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1010–1019.
[41] J. Bernard, N. Wilhelm, B. Krüger, T. May, T. Schreck, J. Kohlhammer, Motionexplorer: exploratory search in human motion capture data based on hierarchical aggregation, IEEE Trans. Vis. Comput. Graph. 19 (12) (2013) 2257–2266.
[42] S. Laraba, M. Brahimi, J. Tilmanne, T. Dutoit, 3D skeleton-based action recognition by representing motion capture sequences as 2D-RGB images, Comput. Animat. Virtual Worlds 28 (2017) 1–11.
[43] J. Sedmidubsky, P. Zezula, J. Svec, Fast subsequence matching in motion capture data, in: Advances in Databases and Information Systems, Springer, 2017, pp. 59–72.
