Querying Shapes of Histories



Rakesh Agrawal    Giuseppe Psaila*    Edward L. Wimmers    Mohamed Zaït

IBM Almaden Research Center


650 Harry Road, San Jose, CA 95120

*Current Address: Politecnico di Milano, Italy.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Proceedings of the 21st VLDB Conference
Zürich, Switzerland, 1995

Abstract

We present a shape definition language, called SDL, for retrieving objects based on shapes contained in the histories associated with these objects. It is a small, yet powerful, language that allows a rich variety of queries about the shapes found in historical time sequences. An interesting feature of SDL is its ability to perform blurry matching. A "blurry" match is one where the user cares about the overall shape but does not care about specific details. Another important feature of SDL is its efficient implementability. The SDL operators are designed to be greedy to reduce non-determinism, which in turn substantially reduces the amount of back-tracking in the implementation. We give transformation rules for rewriting an SDL expression into a more efficient form as well as an index structure for speeding up the execution of SDL queries.

1 Introduction

Historical time sequences constitute a large portion of data stored in computers. Examples include histories of stock prices, histories of product sales, histories of inventory consumption, etc. Assume a simple data model in which the database consists of a set of objects. Associated with each object is a set of sequences of real values. We call these sequences histories and each history has a name. For example, in a stock database, associated with each stock may be histories of opening price, closing price, the high for the day, the low for the day, and the trading volume.

The ability to select objects based on the occurrence of some shape in their histories is a requirement that arises naturally in many applications. For example, we may want to retrieve stocks whose closing price history contains a head and shoulder pattern [4]. We should be able to specify shapes roughly. For example, we may choose to call a trend uptrend even if there were some down transitions as long as they were limited to a specified number.

To this end, we propose a shape definition language, called SDL. It is a small, yet powerful, language that allows a rich variety of queries about the shapes found in histories. The most interesting feature of SDL is its capability for blurry matching. A "blurry" match is one where the user cares about the overall shape but does not care about specific details. For example, the user may be interested in a shape that is five time periods long and contains at least three ups but no more than one down. SDL has been designed to make it easy and natural to express such queries. Another important feature of SDL is that it has been designed to be efficiently implementable. Most of the SDL operators are greedy and therefore there is very little non-determinism (in the sense of multiple match possibilities) inherent in an SDL shape, which in turn substantially reduces the amount of back-tracking in the implementation. In addition, SDL provides the potential for rewriting a shape expression into a more efficient form as well as the potential for indexes for speeding up the implementation.

SDL benefits from a rich heritage of languages based on regular expressions, but this earlier work
has a different design focus that influences which expressions are easy to write, understand, optimize, and evaluate. For example, while the blurry matching of SDL is reminiscent of approximate matching for strings (e.g., [9]) or for patterns in time series [2], SDL allows the user to impose arbitrary conditions on the blurry match but requires that the user specify those conditions completely. The event specification languages in active databases [3] [5] [6] concentrate on detecting the endpoints of events rather than concentrating on intervals as SDL does. The SEQ work of [8] focused on building a framework for describing constructs from various existing sequence models.

Organization of the Paper The rest of the paper is organized as follows. In Section 2, we introduce SDL informally through examples; the formal semantics is given in Appendix A. In Section 3, we discuss the design rationale of SDL. We discuss its expressive power, its capability for blurry matching, its ease of use, and its efficient implementability. In Section 4, we give transformation rules for rewriting an SDL expression into an equivalent but more efficient form. In Section 5, we describe an index structure and show how it can be used to speed up the evaluation of SDL queries. We conclude in Section 6 with a summary. For an expanded version of this paper, see [1].

2 Shape Definition Language

We will introduce our shape definition language, SDL, informally through examples. The formal semantics is given in Appendix A. Every object in the database has associated with it several named histories. Each history is a sequence of real values. The behavior of a history can be described by considering the values assumed by the history at the beginning and the end of a unit time period; that is, by considering transitions from an instant to the following one. It is immediate then that a history generates a transition sequence based on an alphabet whose symbols describe classes of transitions.

2.1 Alphabet

The syntax for specifying an alphabet is:

    (alphabet (symbol lb ub iv fv))

Here symbol is a symbol of the alphabet being defined and the remaining four descriptors provide the definition for the symbol. The first two, lb and ub, are the lower and upper bounds respectively of the allowed variation from the initial value to the final value of the transition. The latter two, iv and fv, can be one of zero, nonzero and anyvalue, and specify constraints on the initial and final value respectively of the transition.

Table 1 gives an illustrative alphabet A. Consider the time sequence H in Figure 1.

Symbol      Description                                          lb     ub     iv        fv
up          slightly increasing transition                       .05    .19    anyvalue  anyvalue
Up          highly increasing transition                         .20    1.0    anyvalue  anyvalue
down        slightly decreasing transition                       -.19   -.05   anyvalue  anyvalue
Down        highly decreasing transition                         -1.0   -.19   anyvalue  anyvalue
appears     transition from a zero value to a non-zero value     0      1.0    zero      nonzero
disappears  transition from a non-zero value to a zero value     -1.0   0      nonzero   zero
stable      the final value nearly equal to the initial value    -.04   .04    anyvalue  anyvalue
zero        both the initial and final values are zero           0      0      zero      zero

Table 1: An Illustrative Alphabet A

Given alphabet A, a transition sequence corresponding to H will be:

    (zero appears up up up down stable Down down disappears)

Depending on the alphabet, there can be more than one transition sequence corresponding to a time sequence. For example, another transition sequence corresponding to H is:

    (zero stable up up up down stable Down down stable)

This ambiguity does not cause inconsistency at query time because the user specifies the particular shape to be matched. For example, if the user
had asked for stable, we will resolve the ambiguity between stable and zero in favor of stable.

Figure 1: Time Sequence H = (0 0 .02 .17 .35 .50 .45 .43 .15 .03 0)

We will use the alphabet A and the time sequence H throughout the paper to give concrete examples. We will use the notation H[i,j] to represent the subsequence of H consisting of elements from position i to the position j inclusive, 0 being the first position. H[i,i] will represent the null sequence since an elementary shape (see Section 2.2.1) requires at least one transition.

2.2 Shape Descriptors

Using the alphabet of the language, we can define classes of shapes that can be matched in histories or parts of them. The application of a shape descriptor P to a time sequence S produces a set of all the subsequences in S that match the shape P. If no subsequence in S matches P, then the result is an empty set. Depending on the descriptor, a null sequence can match a shape. For the convenience of the user, however, the null sequences are not reported to the user.

The syntax for defining a shape is:

    (shape name(parameters) descriptor)

A shape definition is identified by means of a name for the shape, which is followed by a possibly empty list of parameters (see Section 2.4) and then a descriptor for the shape. For example, here is a definition of a spike:

    (shape spike() (concat Up up down Down))

This definition has no parameters. The meaning of the descriptor will become clear momentarily.

2.2.1 Elementary Shapes

The simplest shape descriptor is an elementary shape. All the symbols of the alphabet correspond to elementary shapes. When an elementary shape is applied to a time sequence S, the resulting set contains all the subsequences of S that contain only the specified elementary shape.

For example, the shape descriptor (stable) applied to the time sequence H given in Figure 1 yields the set {H[0,1], H[1,2], H[9,10]}, where H[0,1] = (0 0), H[1,2] = (0 .02) and H[9,10] = (.03 0). The descriptor (zero) yields the set {H[0,1]}. Note that the subsequence H[0,1] is contained in the result set of both the descriptors because the transition corresponding to this subsequence satisfies the definitions of both stable and zero. Finally, the shape descriptor (Up) results in an empty set because H contains no Up transition.

2.3 Derived Shapes

Starting with the elementary shapes, complex shapes can be derived by recursively combining elementary and previously defined shapes. We describe next the set of operators available for this purpose.

Multiple Choice Operator any. The any operator allows a shape to have multiple values. The syntax is

    (any P1 P2 ... Pn)

where Pi is a shape descriptor. When a shape obtained by means of the any operator is applied to a time sequence S, the resulting set contains all the subsequences of S that match at least one of the Pi shapes.

For example, the shape (any zero appears) applied to the time sequence H yields the set {H[0,1], H[1,2]}, where H[0,1] = (0 0) which is a zero transition and H[1,2] = (0 .02) which is an appears transition.

Concatenation Operator concat. Shapes can be concatenated by using the operator concat:

    (concat P1 P2 ... Pn)

When a shape obtained by using the concat operator is applied to a time sequence S, first the
shape P1 is matched. If a matching subsequence s is found, P2 is matched in the subsequence of S immediately following the last element of s and the match is accepted if it is strictly contiguous to s, etc. For example, the shape descriptor

    (concat up up up (any stable down)
            (any stable down) (any down Down))

specifies that we are interested in detecting if an upward trend (indicated by three consecutive ups) has reversed (indicated by two stables or downs, followed by a down or Down). When applied to the time sequence H, it yields the set {H[2,8]}, where H[2,8] = (.02 .17 .35 .50 .45 .43 .15). The transition sequence corresponding to this subsequence is (up up up down stable Down).

Multiple Occurrence Operators exact, atleast, atmost. Shapes composed of multiple contiguous occurrences of the same shape can be defined using three other operators, exact, atleast and atmost:

    (exact n P)
    (atleast n P)
    (atmost n P)

When a shape obtained using exact/atleast/atmost is applied to a time sequence S, it matches all subsequences of S that contain exactly/at least/at most n contiguous occurrences of the shape P. In addition, the resulting subsequences are such that they are neither preceded nor followed by a subsequence that matches P. For example,

    (exact 2 up) yields ∅.
    (atleast 2 up) yields {H[2,5]}, where H[2,5] = (.02 .17 .35 .50).
    (atmost 2 up) yields {H[k,k] | 0 ≤ k ≤ 1 ∨ 6 ≤ k ≤ 10}.

The first shape results in an empty set because there is no subsequence in H which is exactly two transitions long, consisting entirely of up transitions, and neither preceded nor followed by an up transition. The second shape matched the subsequence consisting of three contiguous up transitions.

The result for the third shape merits further discussion. The shape (atmost 2 up) matches the null sequence at those positions of H that do not participate in an up transition. The other null sequences are not in the answer since they participate in a sequence of 3 consecutive ups. Since the final answer in this case is a set of null sequences and we do not report null sequences, the user will see ∅ as the answer. Allowing a null sequence to match atmost n P has the virtue that we can naturally specify

    (concat (atleast 2 up) (atmost 1 Down))

and match it to H[2,5] corresponding to the transition sequence up up up.

Bounded Occurrences Operator in. The in operator is the most interesting SDL operator. It permits blurry matching by allowing users to state an overall shape without giving all the specific details. The syntax is

    (in length shape-occurrences)

Here length specifies the length of the shape in number of transitions. The shape-occurrences has two forms.

In the first form, the shape-occurrences can be one of

    (precisely n P)
    (noless n Q)
    (nomore n R)

or a composition of them using the logical operators or and and.

When a shape defined using this form is applied to a time sequence S, the resulting set contains all subsequences of S that are length long in terms of number of time periods (transitions) and contain precisely (no less than/no more than) n occurrences of the shape P (Q/R). The n occurrences of P (Q/R) need not be contiguous in the matched subsequence; there may be arbitrary gap between any two of them. They may also overlap. For example, the shape descriptor

    (in 5 (and (noless 2 (any up Up))
               (nomore 1 (any down Down))))

specifies that we are interested in subsequences five intervals long that have at least two ups (either up or Up) and at most one down (either down or Down). When applied to the time sequence H, it yields the set {H[2,7]}, where H[2,7] = (.02 .17 .35 .50 .45 .43). The transition sequence corresponding to this subsequence is (up up up down stable). Note that the subsequence H[3,8] = (.17 .35 .50
.45 .43 .15) ≡ (up up down stable Down) is not in the answer because it has two downs. As another example, consider the shape

    (in 7 (precisely 0 Down))

We are looking for sequences seven time periods long that do not have any Down transitions. H[0,7] is the only subsequence of H that satisfies this constraint.

The operators precisely, noless, nomore should not be confused with the multiple occurrence operators exact, atleast, and atmost. The latter are "first class" operators that can be used to introduce shapes to be matched, whereas the former can only appear within the in operator and constrain the sub-shapes. More importantly, precisely, noless, and nomore allow overlaps and gaps, whereas exact, atleast, and atmost do not.

The second form for the shape-occurrences is:

    (inorder P1 P2 ... Pn)

where Pi is a shape descriptor. When a shape obtained using this form is applied to a time sequence, each of the resulting subsequences is length long and contains the shapes P1 through Pn in that order. Pi and Pi+1 may not overlap, but they may have arbitrary gap. For example, the shape descriptor

    (in 7 (inorder (atleast 2 (any up Up))
                   (in 4 (noless 3 (any down Down)))))

specifies that we are interested in subsequences seven time periods long. The matching subsequence must contain a subsequence that has at least two ups and that must be followed by another subsequence four intervals long that contains at least three downs. When applied to the time sequence H, it yields the set {H[2,9]}, where H[2,9] = (.02 .17 .35 .50 .45 .43 .15 .03) ≡ (up up up down stable Down down).

2.4 Parameterized Shapes

Shape definitions can be parameterized by specifying the names of the parameters in the parameter list following the shape name and using them in the definition of the shape in place of concrete values. Here is an example of a parameterized spike:

    (shape spike(upcnt dncnt)
        (concat (exact upcnt (any up Up))
                (exact dncnt (any down Down))))

When a parameterized shape P is used in the definition of another shape Q, the parameters of P must be bound. They can be bound to concrete values or to the parameters of Q. Here is an example:

    (shape doublepeak(width ht1 ht2)
        (in width (inorder spike(ht1 ht1)
                           spike(ht2 ht2))))

3 Design of SDL

SDL provides the following key advantages:

• a natural and powerful language for expressing shape queries

• capability for blurry matching

• reduction of output clutter

• an efficient implementation

3.1 Expressive Power of SDL

Using SDL, one can express a wide variety of queries about the shapes found in a history. Given a sequence and a shape, one type of query (called continuous matching in [8]) finds all the subsequences that match the shape; the other type of query (referred to as "regular matching" in this paper) produces a boolean indicating whether the entire sequence matches the shape.

Since SDL includes the operators concat, any, and atleast, SDL is equivalent in expressive power to regular expressions for regular matching. This equivalence is proven in [1]. Because SDL is designed to provide ease of expression together with an efficient implementation, it has several features to enhance its effectiveness. The atleast operator, which is a variant of the * operator of regular expressions, provides both efficiency gains and expressiveness enhancements for continuous matching. The * operator, once it has found the required number of matches, is allowed (nondeterministically) either to exit or to continue matching; whereas atleast is a greedy operator that does not exit until it has found as many matches as it can. In the regular matching case, the greedy nature of atleast does not cause a loss of expressive power since one can always write the shape so that subsequent shapes are not affected by the greedy nature of atleast. Details of this construction are given in [1].

In the case of continuous matching, the greedy semantics of atleast allow SDL to take advantage of contextual information to eliminate useless clutter. For example, given the shape (atleast 5 up), SDL will find all the maximal subsequences that have at least five consecutive ups. In other words, SDL does not report the non-maximal subsequences thereby eliminating useless clutter. Regular expressions would not be able to eliminate the clutter since they are unable to "look-ahead" to provide contextual information. If there happen to be seven consecutive ups in the history, SDL will report this single subsequence of length 7 whereas the regular expression would report six different (largely overlapping) subsequences; there would be three subsequences of length 5, two subsequences of length 6, as well as the entire subsequence of length 7. If, in the future, finding all such subsequences becomes important, a non-greedy version of atleast could be added easily to SDL.

3.2 Ease of Expression in SDL

SDL is designed to make it easy and natural to express shape queries. For example, the atleast operator provides a compact representation of repetitions that seems natural even to someone not familiar with regular expression notation. SDL provides a (non-recursive) macro facility (with parameters) that enhances readability by allowing commonly occurring shapes to be abstracted.

One of the most exciting features of SDL is the inclusion of the in operator that permits "blurry" matching in which the user cares about the overall shape but does not care about specific details. For example, to indicate an uptrend with a subsequence specified by the in operator, the user might specify (nomore 2 down) thereby limiting the number of downs that can occur in the subsequence. While the in operator can be simulated using regular expressions, it is not easy to do so. The details of the construction can be found in [1] and involve keeping track of how many times diverse finite automatons have entered accepting states. The in operator presents a much more natural method for expressing the desired shape.

It is instructive to give an example. Assume that a1, ..., an are "disjoint" elementary shapes (where two elementary shapes are disjoint if they never match the same transition sequence). Consider the problem of finding a "permutation" expression that matches exactly those sequences of length n that have precisely one occurrence of each ai. The straightforward approach of listing all such possible strings grows factorially. It is well-known that the permutation expression can be compacted a bit to exponential size but no further compaction is possible in regular expression notation. (See [1] for more details and for proofs.) Since at least exponential size is required, expressing permutations in regular expression notation is tedious, error-prone, and not particularly readable.

Parameterized shapes (macros) can dramatically reduce the size of a permutation expression. One can define (inductively) the parameterized shapes Pi to describe all permutations of i elements as follows:

    (shape P1(x1) (x1))
    (shape P2(x1, x2) (any (concat x1 P1(x2))
                           (concat x2 P1(x1))))
    (shape P3(x1, x2, x3) (any (concat x1 P2(x2, x3))
                               (concat x2 P2(x1, x3))
                               (concat x3 P2(x1, x2))))
    ...
    (shape Pi(x1, ..., xi)
        (any (concat x1 Pi-1(x2, ..., xi))
             ... (concat xi Pi-1(x1, ..., xi-1))))

Since each Pi has size O(i^2), a permutation expression for n elements has size O(n^3).

Blurry matching provides an even more effective permutation expression. For example, (in n (and (precisely 1 a1) ... (precisely 1 an))) does the trick in only linear size. It is instructive to examine the features of blurry matching that permit such a compact permutation expression. Blurry matching permits the use of conjunctive as well as disjunctive expressions. It is well known that adding "and" to regular expressions does not increase the expressive power of regular expressions but does permit more compact expressions (see Chapter 3 exercises in [7]). A permutation expression is such an example. The regular expression (a1 | ... | an) can be used to describe all the characters. By concatenating n copies, it is possible to express in O(n^2) size all sequences of length exactly n. It is also easy to see that the regular expression (a1 | ... | ai-1 | ai+1 | ... | an)* ai (a1 | ... | ai-1 | ai+1 | ... | an)* expresses all sequences that have exactly one ai. By conjuncting these expressions together, we obtain a regular expression with conjunctions that expresses permutations and has size O(n^2). As already noted, a (pure) regular expression that expresses permutations must have exponential size.

The compactness of permutation expressions in blurry shape notation is primarily due to the fact that blurry shapes permit conjunctions. Blurry shapes also enhance readability by allowing overlap directly whereas regular expressions (even with conjunctions) can handle overlap only indirectly by coding up the overlap in a different regular expression. Even though the permutation example is somewhat contrived to permit the easy analysis of the complexity and expressiveness of SDL versus regular expressions, it is representative of a large class of blurry queries that search for shapes which may occur in any order.

3.3 Efficient Implementability for SDL

Since the semantics of SDL specifies that operators such as atleast be greedy, any is the only operator that introduces any "non-determinism". (In this context, non-determinism means that there is some starting point that has at least two different subsequences that match starting from that particular starting point.) This implies that the amount of back-tracking an SDL implementation needs to do is substantially reduced. For example, in the shape (concat (atleast 4 P) (atleast 3 Q)) under the normal regular expression semantics, after 4 P's were found, the evaluator (i.e. automaton) would have to keep searching for P as well as begin searching for Q. In the SDL semantics, the search for Q would not begin until all the P's had been found.

In addition, SDL provides the potential for rewriting a shape expression into a more efficient form (Section 4) as well as the potential for indexes (Section 5).

4 Shape Rewriting

We now present a set of transformation rules to rewrite a shape expression into an equivalent but more efficient expression. SDL shape operators can be classified into the following groups:

• concat, exact, atleast, atmost, and inorder: Shape arguments must appear in the specified order without overlap.

• precisely, noless, and nomore: Shape arguments must appear in the specified order but can overlap.

• and, or and any: Shape arguments may appear in any order.

An operator can be rewritten using only operators belonging to the same group.

4.1 Idempotence, Commutativity, and Associativity

An operator has the idempotence property if the duplicates of a shape can be removed. It has the commutativity property if shapes can be permuted. The associativity property is useful for unnesting similar operators, after which redundant shapes can be removed using idempotence and commutativity. The any, or, and and operators are idempotent, commutative, and associative. The concat and inorder operators are associative (but not idempotent and commutative).

Here is an example of the application of these properties:

    (any P1 (any P2 P1))
    ≡ (any P1 P2 P1) - associativity
    ≡ (any P1 P1 P2) - commutativity
    ≡ (any P1 P2) - idempotence

4.2 Distributivity

The concat and and operators distribute over any and or operators:

    (concat P1 (any P2 P3))
    ≡ (any (concat P1 P2) (concat P1 P3))
    (and P1 (or P2 P3))
    ≡ (or (and P1 P2) (and P1 P3))

Deciding which form is less costly to match is similar to the problem of distributing the join over the union in relational query optimization, since concat and and result in joins and any and or result in a union of resulting sets (see Section 5).

4.3 Folding identical shapes in concat

Identical shapes inside the concat operator are folded using the exact operator. For example:

    (concat P1 P2 P2 ... P2 P3)
    ≡ (concat P1 (exact n P2) P3)

where n is the number of occurrences of P2 in the original shape definition, and P1 and P3 do not have a common suffix/prefix with P2. This transformation allows the index structure presented in Section 5 to be used to evaluate the subshape (exact n P2).
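The rules of Sections 4.1 to 4.3 can be prototyped as rewrites over a small expression tree. The sketch below is a hedged illustration: the nested-tuple encoding and the function names (flatten, simplify_any, distribute_concat, fold_concat) are assumptions of this example, and fold_concat does not check the side condition that the neighbouring shapes share no suffix/prefix with the folded shape.

```python
from itertools import groupby


def flatten(op, shape):
    """Associativity: unnest (op ... (op ...) ...) into one flat argument list."""
    if not (isinstance(shape, tuple) and shape and shape[0] == op):
        return [shape]
    args = []
    for arg in shape[1:]:
        args.extend(flatten(op, arg))
    return args


def simplify_any(shape):
    """Section 4.1 for `any`: flatten nested any's, then drop duplicate arguments.
    Dropping a later duplicate relies on commutativity plus idempotence."""
    seen, args = set(), []
    for arg in flatten("any", shape):
        if arg not in seen:
            seen.add(arg)
            args.append(arg)
    return ("any", *args)


def distribute_concat(p1, any_shape):
    """Section 4.2: (concat P1 (any P2 P3)) -> (any (concat P1 P2) (concat P1 P3))."""
    return ("any", *[("concat", p1, alt) for alt in any_shape[1:]])


def fold_concat(shape):
    """Section 4.3: fold runs of identical adjacent concat arguments into exact.
    Assumption: the caller has verified the suffix/prefix side condition."""
    if shape[0] != "concat":
        raise ValueError("expected a concat shape")
    folded = []
    for same, group in groupby(shape[1:]):
        n = len(list(group))
        folded.append(same if n == 1 else ("exact", n, same))
    return ("concat", *folded)


print(simplify_any(("any", "P1", ("any", "P2", "P1"))))
# ('any', 'P1', 'P2')
print(distribute_concat("P1", ("any", "P2", "P3")))
# ('any', ('concat', 'P1', 'P2'), ('concat', 'P1', 'P3'))
print(fold_concat(("concat", "P1", "P2", "P2", "P2", "P3")))
# ('concat', 'P1', ('exact', 3, 'P2'), 'P3')
```

Whether the distributed or the factored form is cheaper to evaluate is left open here, as in the text; the sketch only shows that the rewrites are mechanical.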

4.4 Multiple Occurrences Operators

The shape expressions involving a Multiple Occurrences Operator (MOO) can often be reduced to simpler expressions. The transformation rules fall into three categories, depending on how the MOO has been used: composed with another MOO, inside concat, or inside any.

Composition. When a MOO, M1, is composed with another MOO, M2, the result depends on what M1 is:

    ({exact|atleast} n (M2 m P))
    ≡ (M2 m P) if n = 1, ∅ if n > 1.
    (atmost n (M2 m P))
    ≡ (any (exact 0 (M2 m P)) (M2 m P)) if n ≥ 1.

In the rule for the atmost operator, the shape arguments to any in the right-hand side of the rule correspond to 0 and 1 occurrences of the atmost argument in the match.

Inside concat. When the concat operator is applied to two MOOs, M1 and M2, on the same shape, the result is ∅. The only exception is when M2 matches the null sequence, in which case the result is the same as yielded by M1. M2 can match the null sequence either because it is atmost or because the specified number of occurrences is 0.

    (concat (M1 n P) (M2 m P)) ≡ (M1 n P) if (M2 = atmost or m = 0), and ∅ otherwise.

Inside any. The operators atmost and atleast can match a range of number of occurrences of the specified shape, whereas exact matches only the specified number of occurrences. Therefore, their behavior differs inside any. Two atmost (or atleast) over the same shape are equivalent to one atmost (or atleast) with the number of occurrences equal to the maximum (or minimum) of the original ones.

    (any (atleast n P) (atleast m P))
    ≡ (atleast min(n, m) P)
    (any (atmost n P) (atmost m P))
    ≡ (atmost max(n, m) P)

If two exact over the same shape specify the same number of occurrences, they can be reduced to one exact; otherwise, the shape expression remains unchanged.

When different MOOs are used inside any, we have the following rules (the order in which different MOOs are written inside any is not important because any is commutative):

    (any (exact n P) (atleast m P))
    ≡ (atleast m P) if m ≤ n,
      (atleast n P) if n = m - 1
    (any (exact n P) (atmost m P))
    ≡ (atmost m P) if m ≥ n,
      (atmost n P) if m = n - 1
    (any (atmost n P) (atleast m P))
    ≡ (atleast 0 P) if m ≤ n + 1

The above rules are the consequence of the following rewritings of atleast and atmost:

    (atleast n P) ≡ (any (exact n P) (exact (n+1) P) ...
                         (exact (p-1) P) (exact p P))
    (atmost n P) ≡ (any (exact 0 P) (exact 1 P) ...
                        (exact (n-1) P) (exact n P))

where p is the length of the interval over which the matching is being performed.

4.5 The "in" operator

When composed with each other, the operators precisely, noless and nomore have the same properties as the MOOs. When used inside and or or operators, they have the same properties as MOOs when used inside concat or any operators, respectively.

When the length specified for the in operator is less than the guaranteed minimum length of the shape or the interval length where the match is to be performed, then the result is empty. The guaranteed minimum length can often be computed when the shape expression involves noless or precisely.

It might be tempting, but inorder cannot be rewritten using the other in operators because it is the only one in the in family that allows gaps but not overlap. For example, the following transformations are not valid:

    (or (inorder P1 P2) (inorder P2 P1))
    ≢ (and P1 P2)
    (inorder P ... P)
    ≢ (precisely n P)

5 Indexing

A straightforward method to evaluate a shape query will be to scan the entire database and match the
509
specified shape against each sequence. We propose a storage structure and show
how it is used for speeding up the implementation of SDL.

5.1 The Storage Structure
The proposed hierarchical storage structure, which also acts as an index
structure, consists of four layers. The top layer is an array indexed by a
symbol name from the alphabet. Its size is ns, where ns is the number of
symbols in the alphabet. Its elements point to one instance of the second
layer. An instance of the second layer is an array indexed by the start period
of the first occurrence of the symbol in the sequence, whose elements point to
one instance of the third layer. The size of an array of this layer is np,
where np is the maximum number of time periods in some time sequence. One
instance of the third layer is an array indexed by the maximum number of
occurrences of the associated symbol. Each element of this array points to a
sorted list of object-ids. Consider an array at this layer, being pointed to
from the kth element of a second-layer array. This array will have np - k
elements, starting from the kth position, because a symbol can occur at most
np - k times. Thus, the number of elements in a third-layer array depends on
its parent in the second layer. We use NULL, as a special value, to mark
elements corresponding to empty combinations, e.g., when a given symbol does
not start at a specific position in any of the sequences in the database.
Having created this structure, we no longer need the original data. Figure 2
illustrates this structure. The specific entries in this structure are for the
sequence given in Figure 1.
The size of the first three layers of the structure is independent of the
number of sequences in the database, whereas the fourth layer depends on the
number of sequences. In the worst case, the first three layers will have
$ns\,(1 + np + np \times (np + 1)/2)$ entries, which can be approximated to
$ns \times np^2/2$. This case arises when all the elements of all the arrays
are non-NULL. In the worst case, there can be a total of np entries in the
fourth layer for a sequence whose transition sequence does not contain any
identical symbol in two contiguous positions. In the best case, there will be
one entry. If sequences have on average k identical contiguous symbols, the
total number of entries in the index will roughly equal
$np \times (ns \times np/2 + nseq/k)$. The original data sequences can be
stored as sequences of tuples $(s, k')$, where $k'$ is the number of contiguous
occurrences of the symbol $s$, requiring $2 \times np \times nseq/k$ entries.
We generally expect np to be much smaller than nseq. Thus, if we were to store
sequences using the index storage structure, we can save storage as long as
$k < (2 \times nseq)/(ns \times np)$. For ns = 10, np = 50, nseq = 1000, k up
to 10 can save storage. In addition, the index can speed up query processing.

5.2 The Mapping Problem
There may be more than one transition sequence corresponding to a time
sequence. For example, the time sequence (0 0 0) can be mapped either to
(zero zero) or to (zero stable). One way to deal with this problem is to store
both mappings in the index. However, this may lead to an exponential explosion
in the number of mappings. Instead, we store only one form in the index as
explained below.
Assume the existence of a set $\mathcal{P}$ of primitive elementary shapes that
are disjoint (i.e. every transition is in at most one of the primitive shapes).
Thus, there is no ambiguity with regard to the members of the set
$\mathcal{P}$. Further assume that every elementary shape is the “union” of
some subset of $\mathcal{P}$ (i.e. every transition in the given elementary
shape is in exactly one of the primitive elementary shapes in the subset of
$\mathcal{P}$ corresponding to the given elementary shape). In this case, the
transformation rule $E \Leftrightarrow (\text{any } P_1 \ldots P_n)$ eliminates
the elementary shape $E$ in favor of the corresponding primitive elementary
shapes $P_1 \ldots P_n$ for which there is no ambiguity.
Since there might not already be a set of primitive elementary shapes, it
might be necessary to add new primitive elementary shapes. In general, this
requires an exponential number of new primitive elementary shapes since there
would need to be a new primitive elementary shape for every possible non-empty
subset of the original elementary shapes. Fortunately, there is a natural
sufficient condition that requires only a linear blowup in the number of new
primitive elementary shapes. If every primitive shape can be associated with
an interval of real numbers, then there is only linear blowup. To see this,
imagine n elementary shapes. These give rise to 2n endpoints. These endpoints
define at most 2n + 1 disjoint consecutive intervals. (There may be fewer than
2n + 1 intervals since some of the endpoints might coincide.) Add a new
primitive symbol for each such interval, giving rise

to 2n + 1 new primitive symbols.¹ Each of the original elementary shapes can
clearly be expressed as the “union” of the corresponding new primitive
elementary shapes. Intuitively, the fact that each of the original elementary
shapes has an associated interval implies that most of the “intersections”
between the original elementary shapes are empty and thus require no new
primitive shapes, thereby controlling the blowup.

¹ Extra primitive symbols may be needed to handle constraints on initial and
final values.

[Figure 2: An index structure for SDL queries. Layers shown: transition (or
shape) symbol (zero, appears, up, Up, stable, down, Down, disappears), start
period, and number of consecutive occurrences.]

5.3 Shape Matching Using the Index
Notation In the following, $P$ and $D$ denote an elementary and a derived
shape, respectively, $\mathit{eval}(D, [s, e])$ denotes the evaluation of shape
$D$ within the interval $[s, e]$, and $p$ denotes the length of the interval,
i.e., $p = e - s$. The result of eval is a set of tuples [oid, start, length],
where oid is the object-id, start is the start period, and length the length
of the matched subsequence. The notation shape[P].start[x].occur[y] means “get
object identifiers that have y occurrences of the shape P starting from x”,
and represents index traversal. The tuples resulting from matching the null
sequence have start = s and length = 0.

5.3.1 Operations on Elementary Shapes
We first consider the evaluation of elementary shapes and those shapes derived
by applying multiple occurrences operators on elementary shapes.

• Elementary shape

\[
\begin{aligned}
\mathit{eval}(P, [s, e]) = \{\, & [\mathit{oid}: o,\ \mathit{start}: i,\ \mathit{length}: 1] \mid \exists x, y \\
& (o \in \mathit{shape}[P].\mathit{start}[x].\mathit{occur}[y]) \wedge (\max(s, x) \le i < \min(x + y, e)) \,\}
\end{aligned}
\]

• exact

\[
\begin{aligned}
\mathit{eval}(\text{exact } n\ P, [s, e]) = \{\, & [\mathit{oid}: o,\ \mathit{start}: \max(s, x),\ \mathit{length}: n] \mid \exists x, y \\
& (o \in \mathit{shape}[P].\mathit{start}[x].\mathit{occur}[y]) \wedge (x \le e - n) \wedge (s + n \le x + y) \\
& \wedge\ (\min(e, x + y) - \max(s, x) = n) \,\}
\end{aligned}
\]

When n = 0, we cannot directly use the index to get subsequences that match
the null sequence. Instead, they are computed by the following expression:

\[
\mathit{eval}(\text{exact } 0\ P, [s, e]) = \{\, [\mathit{oid}: o,\ \mathit{start}: s,\ \mathit{length}: 0] \mid [o, s] \notin \mathit{eval}(\text{atleast } 1\ P, [s, e])[\mathit{oid}, \mathit{start}] \,\}
\]

• atmost

\[
\begin{aligned}
\mathit{eval}(\text{atmost } n\ P, [s, e]) = \{\, & [\mathit{oid}: o,\ \mathit{start}: \max(s, x),\ \mathit{length}: \min(e, x + y) - \max(s, x)] \mid \exists x, y \\
& (o \in \mathit{shape}[P].\mathit{start}[x].\mathit{occur}[y]) \wedge (x < n) \wedge (s < x + y) \\
& \wedge\ (\min(e, x + y) - \max(s, x) \le n) \,\} \\
& \cup\ \mathit{eval}(\text{exact } 0\ P, [s, e])
\end{aligned}
\]

• atleast

\[
\begin{aligned}
\mathit{eval}(\text{atleast } n\ P, [s, e]) = \{\, & [\mathit{oid}: o,\ \mathit{start}: \max(s, x),\ \mathit{length}: \min(e, x + y) - \max(s, x)] \mid \exists x, y \\
& (o \in \mathit{shape}[P].\mathit{start}[x].\mathit{occur}[y]) \wedge (x \le e - n) \wedge (s + n \le x + y) \\
& \wedge\ (\min(e, x + y) - \max(s, x) \ge n) \,\}
\end{aligned}
\]

When n = 0, $\mathit{eval}(\text{exact } 0\ P, [s, e])$ must be “unioned” to
the above expression.
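The index traversal shape[P].start[x].occur[y] and the evaluation expressions
for an elementary shape, exact, and atleast (for n > 0) can be illustrated with
a small Python sketch. It is only an illustration under stated assumptions:
nested dictionaries stand in for the layered arrays (a missing key plays the
role of NULL), the function names and the sample histories are hypothetical,
and the bound on i for elementary shapes is read as max(s, x) <= i < min(x + y, e).

```python
from collections import defaultdict
from itertools import groupby


def build_index(sequences):
    """Build index[symbol][start][occurrences] -> sorted list of object-ids.

    `sequences` maps an object-id to its transition sequence (list of symbols).
    Each maximal run of a symbol contributes one fourth-layer entry.
    """
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for oid in sorted(sequences):          # keeps every oid list sorted
        pos = 0
        for symbol, run in groupby(sequences[oid]):
            count = len(list(run))
            index[symbol][pos][count].append(oid)
            pos += count
    return index


def runs(index, shape):
    """Yield (oid, x, y): oid has y contiguous occurrences of shape from x."""
    for x, by_count in index.get(shape, {}).items():
        for y, oids in by_count.items():
            for oid in oids:
                yield oid, x, y


def eval_elementary(index, shape, s, e):
    """Single transitions of `shape` inside [s, e], as (oid, start, length)."""
    return {(oid, i, 1)
            for oid, x, y in runs(index, shape)
            for i in range(max(s, x), min(x + y, e))}


def eval_exact(index, n, shape, s, e):
    """Runs whose part inside [s, e] has length exactly n (n > 0 assumed)."""
    return {(oid, max(s, x), n)
            for oid, x, y in runs(index, shape)
            if x <= e - n and s + n <= x + y
            and min(e, x + y) - max(s, x) == n}


def eval_atleast(index, n, shape, s, e):
    """Runs whose part inside [s, e] has length at least n (n > 0 assumed)."""
    return {(oid, max(s, x), min(e, x + y) - max(s, x))
            for oid, x, y in runs(index, shape)
            if x <= e - n and s + n <= x + y
            and min(e, x + y) - max(s, x) >= n}


# Hypothetical sample data: two objects with four transitions each.
histories = {1: ["up", "up", "down", "up"], 2: ["down", "up", "up", "up"]}
index = build_index(histories)
print(eval_atleast(index, 2, "up", 0, 4))
```

Each query touches only the runs recorded for the requested symbol, which is
the point of the layered structure; the original sequences are never scanned.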

• precisely, nomore, noless
The evaluation of (precisely/nomore/noless n P) within the interval [s, e] is
similar to (exact/atmost/atleast n P) except that n must be equal/greater/
smaller than the sum of all P occurrences in [s, e].

5.3.2 Operations on Derived Shapes
The evaluation of more complex forms of derived shapes is performed using the
index structure inductively.

• concat
The result of matching one shape constrains the interval in which the next
shape should be searched. The following expression implements it for n = 2;
for n > 2, the evaluation is performed inductively:

$\mathit{eval}(\text{concat } D_1\ D_2, [s, e]) =$

Here $I_1$ denotes the interval where the matching of $D_2$ starts. It results
from the evaluation of $D_1$, and is given by
$I_1 = [\min(S_1.\mathit{start} + S_1.\mathit{length}), \max(S_1.\mathit{start} + S_1.\mathit{length})]$.
$D_1$ is evaluated first, then $I_1$, then $D_2$, followed by a join operation
between resulting sets, $S_1$ and $S_2$, using the predicate
$PR_1 = (S_1.\mathit{oid} = S_2.\mathit{oid}) \wedge (S_2.\mathit{start} = S_1.\mathit{start} + S_1.\mathit{length})$
and projection
$PJ_1 = [\mathit{oid}: S_1.\mathit{oid},\ \mathit{start}: S_1.\mathit{start},\ \mathit{length}: S_1.\mathit{length} + S_2.\mathit{length}]$.
The inductive evaluation for the concatenation of n shapes stops either when
the result of a join is empty or after all joins have been performed. In the
former case, the evaluation returns an empty set. Since $S_i$ elements are
sorted on oid, the join operations are implemented as merge-join.

• Multiple Occurrences Operators
We use the same evaluation schema as for the concat, replacing $D_i$ by $D$.
The exact and atmost operators have the same stopping condition as concat. The
exact operator returns the result of step n if the result of step n+1 is
empty, and the empty result otherwise. The atmost operator returns the result
of step i if $i \le n$ and the result of step i+1 is empty. For atleast the
evaluation stops when a join returns an empty set. It returns the result of
step i if $i \ge n$ and the step i+1 returns empty result.

• any

\[ \mathit{eval}((\text{any } D_1 \ldots D_n), [s, e]) = \bigcup_{1 \le i \le n} \mathit{eval}(D_i, [s, e]) \]

• in
The length parameter of in defines a family of intervals inside interval
[s, e] where the match should be performed. Thus, in is implemented by the
following expression:

\[ \mathit{eval}((\text{in } n\ D), [s, e]) = \bigcup_{s \le i \le e - n} \mathit{eval}(D, [i, i + n]) \]

The precisely, nomore, and noless operators have the same evaluation schema as
exact, atmost, and atleast, respectively, but a different definition for the
interval, predicate, and projection, because they allow gaps and overlap
between their shape arguments. Their definitions for the interval, predicate
and projection require an offset of at least one time period between two
consecutive shapes.
On the other hand, inorder does not accept overlap, and its evaluation schema
is the same as for concat with the exception that its definition of the
interval, predicate, and projection requires that two consecutive shapes,
$D_1$ and $D_2$, are separated by at least the length of the subsequence
matched by $D_1$.
Since we allow gaps and overlap between shapes inside and, it is implemented
as a join between the set of subsequences that match $D_1$ and $D_2$. The
shape order in the sequence does not matter. The or operator over two shapes,
$D_1$ and $D_2$, is implemented as the union of the set of subsequences that
match $D_1$ and the set of sequences that match $D_2$.

6 Summary
We presented SDL, a shape definition language for retrieving objects based on
shapes contained in the histories associated with the objects. SDL is designed
to be a small, yet powerful, language for expressing naturally and intuitively
a rich variety of queries about the shapes found in histories. SDL is
equivalent in expressive power to the regular expressions when finding if a
given sequence matches a particular shape. In the case of continuous matching
[8], where one finds all the subsequences of a given sequence that match a
particular shape, SDL provides context information that regular expressions
are unable to. Thus, SDL can discard the non-maximal subsequences thereby
eliminating useless clutter, whereas the regular expressions cannot provide
this service since they are unable to “look-ahead” to provide context
information.

A novel feature of SDL is its ability to perform “blurry” matching where the
user gives the overall shape but not all the specific details. SDL is
efficiently implementable: its operators are designed to limit
non-determinism, which in turn reduces back-tracking. An SDL query expression
can be rewritten into a more efficient form using transformation rules and its
execution can be sped up using our index structure.

Acknowledgment We thank Stefano Ceri and John Shafer for useful discussions.

7 Appendix A: Formal Semantics for SDL

Notation Let $\mathcal{H}$ be a sequence of real values describing a history.
Formally, a sequence is a function from an interval into the real numbers
where an interval is a finite set of consecutive non-negative integers. An
interval is frequently denoted by $[i, j]$. By $\mathit{length}(\mathcal{H})$,
we indicate the number of elements in the domain of the function that
represents the sequence $\mathcal{H}$. Every element in $\mathcal{H}$ is
identified by its position in the sequence. The first element for the whole
history is in position 0. We refer to the symbol in position i as
$\mathcal{H}[i]$, with $0 \le i < \mathit{length}(\mathcal{H})$.
Let $S \subseteq \mathcal{H}$ be a subsequence of $\mathcal{H}$ defined as
follows. Each element in $S$ is identified by its position in the original
sequence $\mathcal{H}$ and elements in $S$ are in the same order they are in
$\mathcal{H}$. The first element of $S$ is referred to as $\mathit{first}(S)$,
while the last as $\mathit{last}(S)$.
The subsequence of $\mathcal{H}$ from position i to position j inclusive is
represented as $\mathcal{H}[i, j]$, where
$0 \le i \le j < \mathit{length}(\mathcal{H})$. Similarly, $S[i, j]$, where
$\mathit{first}(S) \le i \le j \le \mathit{last}(S)$, indicates a subsequence
of $S$. The length of $S[i, j]$ is defined as
$\mathit{length}(S[i, j]) = j - i + 1$. Notice that
$S[i, j][k, l] = S[\max(i, k), \min(j, l)]$.
There exists an alphabet $\mathcal{A}$ of symbols and a mapping that can map
the values of any two consecutive elements of $\mathcal{H}$ into the symbols
of $\mathcal{A}$. Each symbol corresponds to an elementary shape. An
elementary shape induces a class containing all the subsequences of
$\mathcal{H}$ of length 2 that satisfy the definition of the corresponding
alphabet. We use the notation $s \in P$ to indicate a sequence $s$ belonging
to the class induced by $P$'s definition, where $P$ is an elementary shape.
The $\simeq$ operator is an application from a pair $(S, P)$, where $S$ is a
sequence and $P$ a shape, to a possibly empty set of intervals. This resulting
set of intervals contains all subsequences of $S$ that match the shape $P$.
Notice that the definition implies that if
$[k, l] \in \mathcal{H}[i, j] \simeq P$, then $i \le k \le l \le j$. The
interval $[k, k]$ denotes any null sequence since any elementary shape matches
only intervals that have a single transition (i.e. are of the form
$[k, k + 1]$).

Elementary shapes. Let $\mathcal{H}$ be a sequence and $P$ one of the symbols
in $\mathcal{A}$. Then

\[ [k, l] \in \mathcal{H}[i, j] \simeq P \ \text{ iff } \ \mathcal{H}[k, l] \in P \ \text{ and } \ i \le k < l = k + 1 \le j. \]

Derived Shape any. Let $\mathcal{H}$ be a sequence and $P_1 \ldots P_n$ some
shapes. Then

\[ \mathcal{H}[i, j] \simeq (\text{any } P_1\ P_2 \ldots P_n) = \bigcup_{k=1}^{n} \mathcal{H}[i, j] \simeq P_k. \]

Derived Shape concat. The syntax of the concatenation operator is:
(concat $P_1\ P_2 \ldots P_n$) for $n \ge 0$.
The following formulas give the semantics:

\[ \mathcal{H}[i, j] \simeq (\text{concat}) = \{\, [k, k] \mid i \le k \le j \,\}. \]

If $n \ge 1$, then
$[k, m] \in \mathcal{H}[i, j] \simeq (\text{concat } P_1 \ldots P_n)$ iff there
exists an $l$ such that $[k, l] \in \mathcal{H}[i, j] \simeq P_1$ and
$[l, m] \in \mathcal{H}[l, j] \simeq (\text{concat } P_2 \ldots P_n)$.

Derived Shapes: exact, atleast, atmost. The syntaxes are:
(exact n P)
(atleast n P)
(atmost n P)
where $n \ge 0$.
These operators provide richer forms of concatenation. Their semantics is
described as follows.

\[
\begin{aligned}
[k, l] \in \mathcal{H}[i, j] \simeq (\text{atleast } n\ P) \ \text{ iff } \
& \nexists m \le k\ ([m, k] \in \mathcal{H}[i, k] \simeq P) \ \text{ and} \\
& \nexists m \ge l\ ([l, m] \in \mathcal{H}[l, j] \simeq P) \ \text{ and} \\
& \exists m \ge n\ ([k, l] \in \mathcal{H}[i, j] \simeq (\text{concat } P_1 \ldots P_m) \ \text{ where } P_1 = \cdots = P_m = P)
\end{aligned}
\]

\[
\begin{aligned}
[k, l] \in \mathcal{H}[i, j] \simeq (\text{atmost } n\ P) \ \text{ iff } \
& \nexists m \le k\ ([m, k] \in \mathcal{H}[i, k] \simeq P) \ \text{ and} \\
& \nexists m \ge l\ ([l, m] \in \mathcal{H}[l, j] \simeq P) \ \text{ and} \\
& \exists m \le n\ ([k, l] \in \mathcal{H}[i, j] \simeq (\text{concat } P_1 \ldots P_m) \ \text{ where } P_1 = \cdots = P_m = P)
\end{aligned}
\]

\[
\begin{aligned}
[k, l] \in \mathcal{H}[i, j] \simeq (\text{exact } n\ P) \ \text{ iff } \
& \nexists m \le k\ ([m, k] \in \mathcal{H}[i, k] \simeq P) \ \text{ and} \\
& \nexists m \ge l\ ([l, m] \in \mathcal{H}[l, j] \simeq P) \ \text{ and} \\
& ([k, l] \in \mathcal{H}[i, j] \simeq (\text{concat } P_1 \ldots P_n) \ \text{ where } P_1 = \cdots = P_n = P)
\end{aligned}
\]

Derived Shape: in. The syntax is:
(in n P) where $n \ge 0$ indicates the length of the sequence in terms of time
periods (transitions) for which the condition expressed by the P argument must
hold.

\[ \mathcal{H}[i, j] \simeq (\text{in } n\ P) = \{\, [k, k + n] \mid i \le k \wedge k + n \le j \wedge [k, k + n] \in \mathcal{H}[k, k + n] \simeq P \,\}. \]

Derived Shapes: nomore, noless, precisely. The syntaxes are:
(nomore n P)
(noless n P)
(precisely n P)
where $n \ge 0$.
Even though these forms make sense in general, they are restricted to use
within the in shape.

\[ [k, l] \in \mathcal{H}[i, j] \simeq (\text{noless } n\ P) \ \text{ iff } \ i \le k \le l \le j \ \text{ and } \ \mathit{card}(\mathcal{H}[k, l] \simeq P) \ge n. \]
\[ [k, l] \in \mathcal{H}[i, j] \simeq (\text{nomore } n\ P) \ \text{ iff } \ i \le k \le l \le j \ \text{ and } \ \mathit{card}(\mathcal{H}[k, l] \simeq P) \le n. \]
\[ [k, l] \in \mathcal{H}[i, j] \simeq (\text{precisely } n\ P) \ \text{ iff } \ i \le k \le l \le j \ \text{ and } \ \mathit{card}(\mathcal{H}[k, l] \simeq P) = n. \]

Derived Shape: inorder. The syntax is:
(inorder $P_1 \ldots P_n$) for $n \ge 0$.
Even though this form makes sense in general, it is restricted to use within
the in shape.
$[k, m] \in \mathcal{H}[i, j] \simeq (\text{inorder } P_1 \ldots P_n)$ iff
there exist $l_0, k_1, l_1, \ldots, k_n, l_n$ such that
$i = l_0 \le k \le k_1 \le l_1 \le k_2 \le l_2 \le \cdots \le k_n \le l_n \le m \le j$
and $[k_u, l_u] \in \mathcal{H}[l_{u-1}, j] \simeq P_u$ for $1 \le u \le n$.

Derived Shapes: and, or. The syntaxes are:
(and $P_1 \ldots P_n$)
(or $P_1 \ldots P_n$)
where $n \ge 0$.
Even though these forms make sense in general, they are restricted to use
within the in shape.

\[ \mathcal{H}[i, j] \simeq (\text{or } P_1 \ldots P_n) = \mathcal{H}[i, j] \simeq (\text{any } P_1 \ldots P_n). \]
\[ \mathcal{H}[i, j] \simeq (\text{and } P_1\ P_2 \ldots P_n) = \bigcap_{k=1}^{n} \mathcal{H}[i, j] \simeq P_k. \]

References
[1] R. Agrawal, G. Psaila, E. L. Wimmers, and M. Zaït. Querying shapes of
histories. IBM Research Report RJ 9962 (87921), IBM Almaden Research Center,
San Jose, California, June 1995.
[2] D. J. Berndt and J. Clifford. Using dynamic time warping to find patterns
in time series. In KDD-94: AAAI Workshop on Knowledge Discovery in Databases,
pages 359-370, Seattle, Washington, July 1994.
[3] S. Chakravarthy, V. Krishnaprasad, E. Anwar, and S.-K. Kim. Composite
events for active databases: Semantics, contexts, and detection. In Proc. of
the VLDB Conference, pages 606-617, Santiago, Chile, September 1994.
[4] R. D. Edwards and J. Magee. Technical Analysis of Stock Trends. John
Magee, Springfield, Massachusetts, 1966.
[5] S. Gatziu and K. Dittrich. Detecting composite events in active databases
using petri nets. In Proc. of the 4th Int'l Workshop on Research Issues in
Data Engineering: Active Database Systems, pages 2-9, February 1994.
[6] N. Gehani, H. Jagadish, and O. Shmueli. Composite event specification in
active databases: Model & implementation. In Proc. of the VLDB Conference,
pages 327-338, Vancouver, British Columbia, Canada, August 1992.
[7] J. E. Hopcroft and J. D. Ullman. Introduction to Automata Theory,
Languages, and Computation. Addison-Wesley, Reading, Massachusetts, 1979.
[8] P. Seshadri, M. Livny, and R. Ramakrishnan. SEQ: A model for sequence
databases. In Proc. of the IEEE Int'l Conference on Data Engineering, Taiwan,
1995.
[9] S. Wu and U. Manber. Fast text searching allowing errors. Communications
of the ACM, 35(10):83-91, October 1992.
