
Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

Xuemin Lin†, Hongjun Lu‡, Jian Xu†, Jeffrey Xu Yu∗
† University of New South Wales, Sydney, Australia. {lxue,xujian}@cse.unsw.edu.au
‡ Hong Kong University of Sci. & Tech., Hong Kong, China. luhj@cs.ust.hk
∗ Chinese University of Hong Kong, Hong Kong, China. yu@se.cuhk.edu.hk

Abstract

Statistics over the most recently observed data elements are often required in applications involving data streams, such as intrusion detection in network monitoring, stock price prediction in financial markets, web log mining for access prediction, and user click stream mining for personalization. Among various statistics, computing a quantile summary is probably the most challenging because of its complexity. In this paper, we study the problem of continuously maintaining a quantile summary of the most recently observed N elements over a stream so that quantile queries can be answered with a guaranteed precision of εN. We developed a space-efficient algorithm for pre-defined N that requires only one scan of the input data stream and O(log(ε²N)/ε + 1/ε²) space in the worst case. We also developed an algorithm that maintains quantile summaries for the most recent N elements so that quantile queries on any most recent n elements (n ≤ N) can be answered with a guaranteed precision of εn. The worst case space requirement for this algorithm is only O(log²(εN)/ε²). Our performance study indicated that not only is the actual quantile estimation error far below the guaranteed precision, but the space requirement is also much less than the given theoretical bound.

1. Introduction

Query processing against data streams has recently received considerable attention and many research breakthroughs have been made, including processing of relational type queries [4, 6, 8, 15, 20], XML documents [7, 14], data mining queries [3, 16, 22], v-optimal histogram maintenance [12], data clustering [13], etc. In the context of data streams, a query processing algorithm is considered efficient if it uses very little space, reads each data element just once, and takes little processing time per data element.

Recently, Greenwald and Khanna reported an interesting work on efficient quantile computation [10]. A φ-quantile (φ ∈ (0, 1]) of an ordered sequence of N data elements is the element with rank ⌈φN⌉. It has been shown that in order to compute exactly the φ-quantiles of a sequence of N data elements with only p scans of the data sequence, any algorithm requires a space of Ω(N^(1/p)) [19]. While quite a lot of work has been reported on providing approximate quantiles with reduced space requirements and one scan of data [1, 2, 17, 18, 9], the techniques reported in [10] (referred to as the GK-algorithm hereafter) are able to maintain an ε-approximate quantile summary for a data sequence of N elements requiring only O((1/ε) log(εN)) space in the worst case and one scan of the data. A quantile summary is ε-approximate if it can be used to answer any quantile query within a precision of εN. That is, for any given rank r, an ε-approximate summary returns a value whose rank r′ is guaranteed to be within the interval [r − εN, r + εN].

While quantile summaries maintained by the GK-algorithm have their applications, such summaries do not have the concept of aging; that is, quantiles are computed for all N data elements seen so far, including those seen a long time ago. There is a wide range of applications where data elements seen early could be outdated and quantile summaries for the most recently seen data elements are more important. For example, the top ranked Web pages among the most recently accessed N pages should produce more accurate web page access prediction than the top ranked pages among all pages accessed so far, as users' interests are changing. In financial markets, investors are often interested in the price quantile of the most recent N bids. Datar et al. considered such a problem of maintaining statistics over data streams with regard to the last N data elements seen so far and referred to it as the sliding window model [5]. However, they only provided algorithms for maintaining aggregation statistics, such as computing the number of 1's and the sum of the last N positive integers. Apparently, maintaining order statistics (e.g. a quantile summary) is more complex than those simple aggregates. Several approximate join processing [4] and histogram techniques [11] based on the sliding window model have also been recently reported, but they are not relevant to maintaining order statistics.

Motivated by the above, we studied the problem of space-efficient one-pass quantile summaries over the most recent N tuples seen so far in data streams. Different from the GK-algorithm, where tuples in a quantile summary are merged based on capacities of tuples when space is needed for newly arrived data elements, we maintain the quantile summary in partitions based on time stamps so that outdated data elements can be deleted from the summary without af-
Proceedings of the 20th International Conference on Data Engineering (ICDE’04)


1063-6382/04 $ 20.00 © 2004 IEEE
fecting the precision. Since quantile information is local to each partition, a novel merge technique was developed to produce an ε-approximate summary from partitions for all N data elements. Moreover, we further extended the technique to maintain a quantile summary for the most recent N data elements in such a way that quantile estimates can be obtained for the n most recent elements for any n ≤ N.

To the best of our knowledge, no similar work has been reported in the literature. The contribution of our work can be summarized as follows.

• For the sliding window model, where quantile summaries are maintained for the N most recently seen elements in a data stream, we developed a one-pass deterministic ε-approximate algorithm to maintain quantile summaries. The algorithm requires a space of O(log(ε²N)/ε + 1/ε²).

• For the n-of-N model, where a quantile summary with a sliding window N can produce quantile estimates for any n (n ≤ N) most recent elements, we developed another one-pass deterministic approximate algorithm that requires a space of O(log²(εN)/ε²). The algorithm is ε-approximate for every n ≤ N. Note that the sliding window model can be viewed as a special case of the n-of-N model.

The rest of this paper is organized as follows. In section 2, we present some background information on quantile computation. Sections 3 and 4 provide our algorithms for the sliding window model and the n-of-N model, respectively. Discussions of related work and applications are briefly presented in section 5. Results of a comprehensive performance study are discussed in section 6. Section 7 concludes the paper.

2. Preliminaries

In this section, we first introduce the problem of quantile computation over a sequence of data, followed by a number of computation models. Finally we review the most closely related work.

2.1. Quantile and Quantile Sketch

In this paper, for notational simplification we will always assume that a data element is a value and that an ordered sequence of data elements in quantile computation always means an increasing ordering of the data values. Furthermore, we use N to denote the number of data elements.

Definition 1 (Quantile). A φ-quantile (φ ∈ (0, 1]) of an ordered sequence of N data elements is the element with rank ⌈φN⌉. The result of a quantile query is the data element for a given rank.

Figure 1: a data stream with arrival time-stamps
  time:  t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10  t11  t12  t13  t14  t15
  value: 12  10  11  10   1  10  11   9   6   7    8   11    4    5    2    3

Example 1. Figure 1 shows a sample sequence of data generated from a data stream, where each data element is represented by a value and the arrival order of data elements is from left to right. The total number of data elements in the sequence is 16. The sorted order of the sequence is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12. So, a 0.5-quantile returns the element ranked 8 (= 0.5 * 16), which is 8; and a 0.75-quantile returns the element 10, which ranks 12 in the sequence.

Munro and Paterson showed that any algorithm that computes the exact φ-quantiles of a sequence of N data elements with only p scans of the data sequence requires a space of Ω(N^(1/p)) [19]. In the context of data streams, we are only able to see the data once. On the other hand, the size of a data stream is theoretically infinite. It becomes impractical to compute exact quantiles for data streams. For most applications, keeping a quantile summary so that quantile queries can be answered with bounded errors is indeed sufficient.

Definition 2 (ε-approximate). A quantile summary for a data sequence of N elements is ε-approximate if, for any given rank r, it returns a value whose rank r′ is guaranteed to be within the interval [r − εN, r + εN].

Generally, a quantile summary may be in any form. Definition 2 does not lead to a specific query algorithm to find a data element from an ε-approximate summary within the precision of εN. To resolve this, a well-structured ε-approximate summary is needed to support approximate quantile queries. In our study, we use a quantile sketch (or sketch for brevity), a data structure proposed in [10], as the quantile summary of a data sequence.

Definition 3 (Quantile Sketch). A quantile sketch S of an ordered data sequence D is defined as an ordered sequence of tuples {(v_i, r_i^-, r_i^+) : 1 ≤ i ≤ m} with the following properties.
1. Each v_i ∈ D.
2. v_i ≤ v_{i+1} for 1 ≤ i ≤ m − 1.
3. r_i^- < r_{i+1}^- for 1 ≤ i ≤ m − 1.
4. For 1 ≤ i ≤ m, r_i^- ≤ r_i ≤ r_i^+, where r_i is the rank of v_i in D.

Example 2. For the sequence of data shown in Figure 1, {(1,1,1), (2,2,9), (3,3,10), (5,4,10), (10,10,10), (12,16,16)} is an example quantile sketch consisting of 6 tuples.

Greenwald and Khanna proved the following theorem:

Theorem 1. For a sketch S defined in Definition 3, if:
1. r_1^+ ≤ εN + 1,
2. r_m^- ≥ (1 − ε)N,
3. for 2 ≤ i ≤ m, r_i^+ ≤ r_{i-1}^- + 2εN,

then, for each φ ∈ (0, 1], there is a (v_i, r_i^-, r_i^+) in S such that ⌈φN⌉ − εN ≤ r_i^- ≤ r_i^+ ≤ ⌈φN⌉ + εN. That is, S is ε-approximate.

Clearly, for each rank r, a tuple (v_i, r_i^-, r_i^+) in such an ε-approximate sketch can be found by a linear scan, so that r − εN ≤ r_i ≤ r + εN. In the rest of the paper, we will use the three conditions in Theorem 1 to define an ε-approximate sketch.

Example 3. It can be verified that the sketch provided in Example 2 is 0.25-approximate with respect to the data stream in Figure 1. On the other hand, the sketch {(1,1,1), (3,2,10), (10,10,10), (12,16,16)} is 0.2813-approximate.

Note that the three conditions presented in Theorem 1 are more general than those originally in [10] due to our application to the most recent N elements. However, the proof techniques in Proposition 1 and Corollary 1 of [10] may lead to a proof of Theorem 1; we omit the details from this paper.

2.2. Quantile Sketches for Data Streams

Quantile summaries, or in our case quantile sketches, can be maintained for data streams under different computation models.

Data stream model. Most previous work uses the data stream model. That is, a sketch is maintained for all N data items seen so far.

Sliding window model. Under the sliding window model, a sketch is maintained for the most recently seen N data elements. That is, for any φ ∈ (0, 1], we compute φ-quantiles against the N most recent elements in a data stream seen so far, where N is a pre-fixed number.

n-of-N model. Under this model, a sketch is maintained for the N most recently seen data elements. However, quantile queries can be issued against any n ≤ N. That is, for any φ ∈ (0, 1] and any n ≤ N, we can return φ-quantiles among the n most recent elements in a data stream seen so far.

Consider the data stream in Figure 1.

• Under the data stream model, since a sketch is maintained for all N data elements seen so far, a 0.5-quantile returns 10 at time t11, and 8 at time t15.

• With the width N of the sliding window being 12, a 0.5-quantile returns 10 at time t11, and 6 at time t15, which is ranked sixth in the most recent N = 12 elements.

• Assume that the sketch is maintained since t0 with the n-of-N model; at time t15, quantile queries can be answered for any 1 ≤ n ≤ 16. A 0.5-quantile returns 6 for n = 12 and 3 for n = 4.

It can be seen that different sketches have different applications. For example, in databases, a sketch maintained under the data stream model is useful for estimating the sizes of relational operations, which is required for query optimization. For a Web server, it is more appropriate to maintain a sketch on accessed Web pages under the sliding window model so that page access prediction for cache management can be based on the most recent user access patterns. In a security house, a sketch of bid/ask prices maintained under the n-of-N model is more appropriate so that it can answer quantile queries from clients with different investment strategies.

2.3. Maintaining ε-Approximate Sketches

In our study, we are interested in deterministic techniques with performance guarantees. Such algorithms are only available under the data stream model or the model of disk resident data.

Manku, Rajagopalan and Lindsay [17] developed the first deterministic one-scan algorithm, with poly-logarithmic space requirement, for approximately computing φ-quantiles with the precision guarantee of εN. The Greenwald-Khanna algorithm [10] reduced the space complexity to O((1/ε) log(εN)) for a data stream with N elements seen so far. The GK-algorithm [10] maintains a sketch by a one-pass scan of a data stream to approximately answer quantile queries. A generated sketch is guaranteed ε-approximate.

For presentation simplification, the GK-algorithm uses two parameters g_i and ∆_i to control r_i^- and r_i^+ for each tuple (v_i, r_i^-, r_i^+) in a generated sketch such that for each i,

• Σ_{j≤i} g_j ≤ r_i ≤ Σ_{j≤i} g_j + ∆_i,
• r_i^- = Σ_{j≤i} g_j, (a)
• r_i^+ = Σ_{j≤i} g_j + ∆_i. (b)

The GK-algorithm maintains the following invariants to ensure that a generated sketch {(v_i, r_i^-, r_i^+) : 1 ≤ i ≤ m} is ε-approximate.

Invariant 1: For 2 ≤ i ≤ m, g_i + ∆_i < 2εN.
Invariant 2: v_1 is the first element in the ordered stream.
Invariant 3: v_m is the last element in the ordered stream.

2.4. Challenges

Note that in the sliding window model, the actual contents of the most recent N tuples change when new elements arrive, even though N is fixed. This makes the existing summary techniques based on a whole dataset not trivially applicable.

Example 4. The GK-algorithm generates the following sketch for the data stream depicted in Figure 1 if ε = 0.5.

{(1, 1, 1), (10, 9, 9), (12, 16, 16)}

The data elements involved in the sketch are 1, 10, and 12. If we want to compute the quantiles against the most recent 4 elements (i.e., 2, 3, 4, 5), this sketch is useless.

Clearly, it is desirable that the data elements outside a sliding window should be removed from our considerations. As the contents in a sliding window continuously change

when new elements arrive, it seems infeasible to remove the exact outdated elements without using a space of O(N). Therefore, the challenge is to develop a space-efficient technique to continuously partition a data stream and then summarize partitions to achieve high approximation accuracy. The quantile summary problem for the n-of-N model seems even harder.

3. One-Pass Summary for Sliding Windows

In this section, we present a space-efficient summary algorithm for continuously maintaining an ε-approximate sketch under a sliding window. The basic idea of the algorithm is to continuously divide a stream into buckets based on the arrival ordering of data elements such that:

• Data elements in preceding buckets are generated earlier than those in later buckets.
• The capacity of each bucket is ⌈εN/2⌉ to ensure ε-approximation.
• The algorithm issues a new bucket only after the preceding buckets are full.
• For each bucket, we maintain an ε/4-approximate sketch continuously by the GK-algorithm instead of keeping all data elements.
• Once a bucket is full, its ε/4-approximate sketch is compressed into an ε/2-approximate sketch with a space of O(1/ε).
• The oldest bucket is expired if currently the total number of elements is N + 1; consequently the sketch of the oldest bucket is removed.
• Local sketches are merged to approximately answer a quantile query.

Figure 2 illustrates the algorithm. Note that the GK-algorithm has been applied in our algorithm only because it has the smallest space guarantee among the existing techniques. Our algorithm will be able to accommodate any new algorithms for maintaining an ε-approximate sketch.

Figure 2: our summary technique (the most recent N elements are divided into buckets of ⌈εN/2⌉ elements each; the oldest bucket expires, each full bucket keeps a compressed ε/2-approximate sketch, and the current bucket is maintained by GK)

The rest of the section is organized as follows. We successively present our novel merge technique, compress technique, sketch maintenance algorithm, and query algorithm.

3.1. Merge Local Sketches

Suppose that there are l data streams D_i for 1 ≤ i ≤ l, and each D_i has N_i data elements. Further suppose that each S_i (1 ≤ i ≤ l) is an η-approximate sketch of D_i. In this subsection, we will present a novel technique to merge these l sketches such that the merged sketch is η-approximate with respect to ∪_{i=1}^{l} D_i. This technique is the key to ensuring the ε-approximation of our summary technique, although it is used only in the query part.

Algorithm 1 depicts the merge process. Each S_i (for 1 ≤ i ≤ l) is represented as {(v_{i,j}, r_{i,j}^-, r_{i,j}^+) : 1 ≤ j ≤ |S_i|}.

Algorithm 1 Merge
Input: {(S_i, N_i) : 1 ≤ i ≤ l}; each S_i is η-approximate.
Output: S_merge.
Description:
1: S_merge := ∅; r_0^- := 0; k := 0;
2: for 1 ≤ i ≤ l do
3:   r_{i,0}^- := 0;
4: end for
5: while ∪_{i=1}^{l} S_i ≠ ∅ do
6:   choose the tuple (v_{i,j}, r_{i,j}^-, r_{i,j}^+) with the smallest v_{i,j};
7:   S_i := S_i − {(v_{i,j}, r_{i,j}^-, r_{i,j}^+)};
8:   k := k + 1; v_k := v_{i,j};
9:   r_k^- := r_{k-1}^- + r_{i,j}^- − r_{i,j-1}^-;
10:  if k = 1 then
11:    r_k^+ := η Σ_{i=1}^{l} N_i + 1
12:  else
13:    r_k^+ := r_{k-1}^- + 2η Σ_{i=1}^{l} N_i;
14:  end if
15:  S_merge := S_merge ∪ {(v_k, r_k^-, r_k^+)};
16: end while

We first prove that S_merge is a sketch of ∪_{i=1}^{l} D_i. This is an important issue. Suppose that for a given rank r, we can find a tuple (v_k, r_k^-, r_k^+) in S_merge such that r − εN ≤ r_k^- ≤ r_k^+ ≤ r + εN. There is no guarantee that r_k (the rank of v_k) is within [r − εN, r + εN] unless r_k is between r_k^- and r_k^+.

Lemma 1. Suppose that there are l data streams D_i for 1 ≤ i ≤ l and each D_i has N_i data elements. Suppose that each S_i (1 ≤ i ≤ l) is an η-approximate sketch of D_i. Then, S_merge generated by the algorithm Merge on {S_i : 1 ≤ i ≤ l} is a sketch of ∪_{i=1}^{l} D_i which has Σ_{i=1}^{l} |S_i| tuples.

Proof. It is immediate that S_merge has Σ_{i=1}^{l} |S_i| tuples. Based on the algorithm Merge, the first 3 properties in the sketch definition can be immediately verified. We need only to prove that for each (v_k, r_k^-, r_k^+) ∈ S_merge, r_k^- ≤ r_k ≤ r_k^+, where r_k is the rank of v_k in the ordered ∪_{i=1}^{l} D_i.

For each (v_k, r_k^-, r_k^+) ∈ S_merge, let (v_{i,j_{k,i}}, r_{i,j_{k,i}}^-, r_{i,j_{k,i}}^+) denote the last tuple of S_i (1 ≤ i ≤ l) merged into S_merge no later than obtaining (v_k, r_k^-, r_k^+). Note that if j_{k,i} = 0 (i.e. no tuple in S_i has been merged yet), then we make v_{i,0} = −∞ and r_{i,0} = r_{i,0}^- = r_{i,0}^+ = 0.

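As an illustration only (the paper gives pseudocode; this Python transcription, its names, and the list-of-tuples representation are ours), Algorithm 1 can be implemented with a heap over the l sketches. Note that the transcription records each merged tuple after the if/else, so the first tuple is kept as well:

```python
import heapq

def merge(sketches, sizes, eta):
    """Algorithm 1 (Merge): combine eta-approximate sketches S_1..S_l over
    streams of sizes N_1..N_l into one sketch of their union.  Each sketch
    is a list of (v, rmin, rmax) tuples in increasing order of v."""
    total = sum(sizes)                       # N = N_1 + ... + N_l
    heap = [(s[0][0], i, 0) for i, s in enumerate(sketches) if s]
    heapq.heapify(heap)
    prev_rmin = [0] * len(sketches)          # r_{i,0}^- := 0 (lines 2-4)
    merged, rmin = [], 0                     # r_0^- := 0
    while heap:                              # line 5
        v, i, j = heapq.heappop(heap)        # line 6: smallest v_{i,j}
        rij_min = sketches[i][j][1]
        rmin += rij_min - prev_rmin[i]       # line 9: r_k^- update
        prev_rmin[i] = rij_min
        if not merged:
            rmax = eta * total + 1           # line 11: r_1^+
        else:
            rmax = merged[-1][1] + 2 * eta * total   # line 13: r_{k-1}^- + 2*eta*N
        merged.append((v, rmin, rmax))
        if j + 1 < len(sketches[i]):
            heapq.heappush(heap, (sketches[i][j + 1][0], i, j + 1))
    return merged

# The scenario used later in Section 3.4: local sketches (5,1,1),(7,3,3) over
# a bucket of 4 elements and (9,1,1) over 1 element, merged with eta = 1/2.
s_merge = merge([[(5, 1, 1), (7, 3, 3)], [(9, 1, 1)]], [4, 1], 0.5)
```

Running the sketch on that input yields (5, 1, 3.5) as the first merged tuple, matching the value worked out in Section 3.4.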
It should be clear that v_k ≥ v_{i,j_{k,i}} for 1 ≤ i ≤ l. Consequently, r_k ≥ Σ_{i=1}^{l} r_{i,j_{k,i}} ≥ Σ_{i=1}^{l} r_{i,j_{k,i}}^-. According to the algorithm Merge (line 9), it can be immediately verified that for each k,

r_k^- = Σ_{i=1}^{l} r_{i,j_{k,i}}^-.  (1)

Thus, r_k ≥ r_k^-.

Now we prove that r_k ≤ r_k^+ for k ≥ 2, as it is immediate that r_1 ≤ r_1^+. Suppose that r_k = Σ_{i=1}^{l} p_i, where each p_i denotes the number of elements from S_i not after v_k in the merged stream. Assume v_k is from stream D_α. Clearly, p_α = r_{α,j_{k,α}} ≤ r_{α,j_{k,α}}^+ ≤ r_{α,j_{k,α}−1}^- + 2ηN_α.

For each p_i ≠ 0 (i ≠ α), note that r_{i,j_{k,i}}^- + 2ηN_i ≥ r_{i,j_{k,i}+1}^+. If p_i > r_{i,j_{k,i}}^- + 2ηN_i, then r_k > r_{i,j_{k,i}}^- + 2ηN_i ≥ r_{i,j_{k,i}+1}^+; consequently, (v_{i,j_{k,i}}, r_{i,j_{k,i}}^-, r_{i,j_{k,i}}^+) is not the last tuple from S_i merged into S_merge before v_k. Contradiction! Therefore, p_i ≤ r_{i,j_{k,i}}^- + 2ηN_i for each p_i ≠ 0 (i ≠ α). By the algorithm Merge (line 9), for k ≥ 2, r_{k-1}^- = Σ_{i=1,i≠α}^{l} r_{i,j_{k,i}}^- + r_{α,j_{k,α}−1}^-. Thus, r_k ≤ r_k^+.

Theorem 2. Suppose that there are l data streams D_i for 1 ≤ i ≤ l and each D_i has N_i data elements. Suppose that each S_i (1 ≤ i ≤ l) is an η-approximate sketch of D_i. Then, S_merge generated by the algorithm Merge on {S_i : 1 ≤ i ≤ l} is an η-approximate sketch of ∪_{i=1}^{l} D_i.

Proof. Property 1 and property 3 in the definition of "η-approximate" (in Theorem 1) are immediate. From equation (1), property 2 immediately follows.

3.2. Sketch Compress

In this subsection, we will present a sketch compress algorithm to be used in our sketch construction algorithm. The algorithm is outlined in Algorithm 2. It takes a ξ/2-approximate sketch of a data set with N elements as input, and produces a ξ-approximate sketch with at most ⌈1/ξ⌉ + 2 tuples.

Algorithm 2 Compress
Input: a ξ/2-approximate sketch S;
Output: a ξ-approximate sketch S_comd with at most ⌈1/ξ⌉ + 2 tuples.
Description:
1: S_comd := ∅;
2: Add to S_comd the first tuple of S;
3: for 1 ≤ j ≤ ⌈1/ξ⌉ do
4:   let s be the first tuple (v_{k_j}, r_{k_j}^-, r_{k_j}^+) in S such that ⌈jξN⌉ − ξN/2 ≤ r_{k_j}^- ≤ r_{k_j}^+ ≤ ⌈jξN⌉ + ξN/2;
5:   S_comd := S_comd ∪ {s}.
6: end for
7: Add the last tuple of S to S_comd.

Note that S is an ordered sequence of tuples. In Algorithm 2, s is the first tuple satisfying the condition from the beginning of S. Clearly, the algorithm can be implemented in linear time with respect to the size of S.

Suppose that S_comd has the same ordering as S. Then it can be immediately verified that S_comd is a sketch. According to the condition given in line 4 of Algorithm 2, property 3 in the definition (in Theorem 1) of ξ-approximate is immediate. Therefore, S_comd is a ξ-approximate sketch of the data stream.

Theorem 3. Suppose that S is a ξ/2-approximate sketch. Then, S_comd generated by the algorithm Compress on S is ξ-approximate, and has at most ⌈1/ξ⌉ + 2 tuples.

Note that for the same precision guarantee, our new compress technique takes about half the number of tuples given by the compress technique in [10].

3.3. Sketch Construction and Maintenance

We are now ready to present our algorithm that continuously maintains a sketch to approximately answer a quantile query under the sliding window model. According to Theorem 2, in our sketch maintenance algorithm we need only to maintain a good approximate sketch for each bucket. The algorithm, outlined in Algorithm 3, follows the scheme depicted in Figure 2.

Algorithm 3 SW
Description:
1: k := 0;
2: for all newly generated data elements d do
3:   k := k + 1;
4:   if k = N + 1 then
5:     Drop the sketch of the oldest bucket;
6:     k := k − ⌈εN/2⌉;
7:   end if
8:   if BucketSize(current) ≥ ⌈εN/2⌉ then
9:     Compress(current, ξ = ε/2);
10:    current := NewBucket();
11:  end if
12:  increase BucketSize(current) by 1;
13:  GK(current ∪ {d}, ε/4);
14: end for

Note that in each bucket, the algorithm keeps 1) its sketch, 2) its time-stamp, and 3) the number of elements contained in the bucket. The algorithm is quite straightforward. For a new data item, if the total number k of elements remaining in our consideration exceeds the size of the sliding window N, the sketch for the oldest bucket is dropped (lines 4-6). If the current bucket size exceeds ⌈εN/2⌉, its corresponding sketch will be compressed to a sketch with a constant number O(1/ε) of tuples (line 9) and a new bucket is created as the current bucket (line 10). The new data element is always inserted into the sketch of the current bucket by the GK-

algorithm, maintaining ε/4-approximation (line 13). Note that in Algorithm 3, we use current to represent the sketch for the current bucket, and we also record the number of elements in the current bucket as BucketSize(current). Note that when a new bucket is issued (line 10), we initialize only its sketch (current) and its bucket size (BucketSize), and assign the time-stamp of the new bucket as the arrival time-stamp of the current new data element.

From the algorithm Compress, it is immediate that the local sketch for each bucket (except the current one) is ε/2-approximate restricted to its local data elements, while the current (last) one is ε/4-approximate. Further, the following space is required by the algorithm SW.

Theorem 4. The algorithm SW requires O(log(ε²N)/ε + 1/ε²) space.

Proof. The sketch in each bucket produced by the algorithm GK takes O(log(ε²N)/ε) space, which will be compressed to a space of O(1/ε) once the bucket is full. Clearly, there are O(1/ε) buckets. The theorem immediately follows.

3.4. Querying Quantiles

In this subsection, we present a new query algorithm which can always answer a quantile query within the precision of εN, in light of Theorem 2. Note that in the algorithm SW, the number N′ of data elements in the remaining buckets may be less than N; the maximum difference (N − N′) is ⌈εN/2⌉ − 1. Consequently, after applying the algorithm Merge on the remaining local sketches, we can guarantee only that S_merge is an ε/2-sketch of the N′ elements. There is even no guarantee that S_merge is a sketch of the N elements.

Figure 3: ε = 1 and N = 8 (the stream 1, 2, 3, ..., 9; the first bucket is expired and the element 9 is in the current bucket)

For example, suppose that a stream arrives in the order 1, 2, 3, 4, ..., 9 as depicted in Figure 3. When the element 9 arrives, the first bucket (with 4 elements) is expired and its local sketch is dropped. The local ε/2-approximate sketch could be (5, 1, 1) and (7, 3, 3) for the second bucket and (9, 1, 1) for the current bucket. After applying the algorithm Merge on them, the first tuple in S_merge is (5, 1, 3.5). Note that in this example, the most recent 8 elements should be 2, 3, 4, 5, ..., 9, and the rank of 5 is 4. As the rank 4 is not between 1 and 3.5, S_merge is not a sketch for these 8 elements.

To solve this problem, we use a "lift" operation, outlined in Algorithm 4, to lift the value of r_i^+ by ⌈ζN/2⌉ for each tuple i.

Algorithm 4 Lift
Input: S, ζ;
Output: S_lift;
Description:
1: S_lift := ∅;
2: for all tuples a_i = (v_i, r_i^-, r_i^+) in S do
3:   update a_i by r_i^+ := r_i^+ + ⌈ζN/2⌉
4:   insert a_i into S_lift
5: end for

Theorem 5. Suppose that there is a data stream D with N elements and there is a ζ/2-approximate sketch S of the most recent N′ elements in D, where 0 ≤ N − N′ ≤ ⌈ζN/2⌉. Then, S_lift generated by the algorithm Lift on S is a ζ-approximate sketch of D.

Proof. We first prove that S_lift is a sketch of D. Suppose that:

• (v_i, r_i^-, r_i^+ + ⌈ζN/2⌉) is a tuple in S_lift and (v_i, r_i^-, r_i^+) is a tuple in S;
• r_i is the rank of v_i in the ordered N′ data items;
• r_i′ is the rank of v_i in the ordered D.

Clearly, r_i′ ≥ r_i ≥ r_i^-, and r_i′ ≤ r_i + ⌈ζN/2⌉. As S is a sketch of the N′ data items, r_i ≤ r_i^+. Consequently, r_i′ ≤ r_i^+ + ⌈ζN/2⌉. The other sketch properties of S_lift can be immediately verified.

The ζ-approximation of S_lift can be immediately verified from the definition (in Theorem 1).

Our query algorithm is presented in Algorithm 5.

Algorithm 5 SW Query
Step 1: Apply the algorithm Merge (η = ε/2) to the local sketches to produce a merged ε/2-approximate sketch S_merge on the non-dropped local sketches.
Step 2: Generate a sketch S_lift from S_merge by the algorithm Lift with ζ = ε.
Step 3: For a given rank r = ⌈φN⌉, find the first tuple (v_i, r_i^-, r_i^+) in S_lift such that r − εN ≤ r_i^- ≤ r_i^+ ≤ r + εN. Then return the element v_i.

According to Theorems 1, 2, and 5, the algorithm SW Query is correct; that is, for each rank r = ⌈φN⌉ (φ ∈ (0, 1]) we are always able to return an element v_i such that |r − r_i| ≤ εN.

3.5. Query Costs

Note that in our implementation of the algorithm SW Query, we do not have to completely implement the algorithm Merge and the algorithm Lift, nor materialize S_merge and S_lift. The three steps in the algorithm SW Query can be implemented in a pipelined fashion:

Once a tuple is generated in the algorithm Merge, Step 2: If the number of 1-buckets is full (i.e.,  λ1  + 2),
it passes into Step 2 for the algorithm Lift. Once then merge the two oldest i-buckets into an i+1-bucket
the tuple is lifted, it passes to the Step 3. Once the (carry the oldest time-stamp) iteratively from i = 1. At
tuple is qualified for the query condition in Step 3, every level i, such merge is done only if the number of
the algorithm terminates and returns the data ele- buckets becomes  λ1  + 2; otherwise such a merge it-
ment in the tuple. eration terminates at level i.
Further, the algorithm Merge can be run in a way similar to e1 e2 e3 e4 e5 e6 e7 e8 e9 e10
a data stream
the l-way merge-sort fashion [21]. Consequently, the algo- 1−buckets

e1 e2 e3 Merge Starts
e4
rithm Merge runs in a O(m log l) where m is the total num- 2−bucket
ber of tuples in the l sketches. Since the algorithm Merge e1 e2 e3 e4 e5 Merge Starts
e6
takes the dominant costs, the algorithm SW Query also runs e1 e2 e3 e4 e5 e6 e7 Merge Starts e8
in time O(m log l). e1 e2 e3 e4 e5 e6 e7 e8 e9 Merge Starts
e10
Note that in our algorithms, we apply GK-algorithm, 4−bucket
which records (vi, gi, ∆i) instead of (vi, ri−, ri+) for each i. In fact, the calculations of ri− and ri+ according to the equations (a) and (b) can also be pipelined with steps 1-3 of the algorithm SW Query. Therefore, this does not affect the time complexity of the query algorithm.

4. One-Pass Summary under n-of-N

In this section, we present a space-efficient algorithm for summarising the most recent N elements to answer quantile queries for the most recent n (∀n ≤ N) elements.

As with the sliding window model, it is desirable that a good sketch supporting the n-of-N model have the following properties:

• every data element involved in the sketch should be included in the most recent n elements;

• to ensure ε-approximation, a sketch used to answer a quantile query against the most recent n elements should be built on the most recent n − O(εn) elements.

Different from the sliding window model, in the n-of-N model n is an arbitrary integer between 1 and N. Consequently, the data stream partitioning technique developed in the last section cannot always accommodate such requests; for instance, when n < εN/2, Example 4 is also applicable.

In our algorithm, we use the EH technique [5] to partition a data stream. Below we first introduce the EH partitioning technique.

4.1. EH Partitioning Technique

In EH, each bucket carries a time stamp, which is the time stamp of the earliest data element in the bucket. For a stream with N data elements, elements in the stream are processed one by one according to their arrival ordering such that EH maintains at most ⌈1/λ⌉ + 1 (for a given λ ∈ (0, 1)) "i-buckets" for each i. Here, an i-bucket consists of i data elements consecutively generated. EH proceeds as follows.

Step 1: When a new element e arrives, it creates a 1-bucket for e, and the time-stamp of the bucket is the time-stamp of e.

[Figure 4: EH Partition: λ = 0.5 (buckets over elements e1, ..., e10)]

Figure 4 illustrates the EH algorithm for 10 elements, where λ = 1/2 and each ei is assumed to have the time stamp i. The following theorems are immediate [5]:

Theorem 6. For N elements, the number of buckets in EH is always O(log(λN)/λ).

Theorem 7. The number N′ of elements in a bucket b has the property that N′ − 1 ≤ λN, where N is the number of elements generated after all elements in b.

4.2. Sketch Construction

In our algorithm, for each bucket b in EH, we record 1) a sketch Sb to summarize the data elements from the earliest element in b up to now, 2) the number Nb of elements from the earliest element in b up to now, and 3) the time-stamp tb.

[Figure 5: Algorithm nN: η = 0.5 (sketches Sa, ..., Sf over 4-, 2-, and 1-buckets, before and after inserting e)]

Once two buckets b and b′ are merged, we keep the information of b if b is earlier than b′. Figure 5 illustrates the algorithm.

We present our algorithm below in Algorithm 6. Note that to ensure ε-approximation, we choose λ = ε/(ε + 2) in our algorithm to run the EH partitioning algorithm, and GK-algorithm is also applied to maintain an ε/2-approximate sketch.

By Theorem 6, the algorithm nN maintains O(log(εN)/ε) sketches, and each sketch requires a space of O(log(εN)/ε) according to GK-algorithm. Consequently, the algorithm nN requires a space of O(log²(εN)/ε²).
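The EH partitioning rule above (keep at most ⌈1/λ⌉ + 1 buckets per size; when a size overflows, merge its two oldest buckets into one bucket carrying the older time stamp) can be sketched as follows. This is an illustrative reimplementation for λ = 0.5 with bucket sizes doubling as in Figure 4, not the authors' code:

```python
import math

def eh_insert(buckets, ts, lam=0.5):
    """Insert one element (identified by its timestamp `ts`) into an EH
    partition.  `buckets` maps bucket size i -> list of (timestamp, size)
    pairs, oldest first.  At most ceil(1/lam) + 1 i-buckets are kept per
    size; on overflow, the two oldest i-buckets merge into one bucket of
    twice the size that keeps the older timestamp, cascading upward."""
    cap = math.ceil(1 / lam) + 1
    buckets.setdefault(1, []).append((ts, 1))   # Step 1: a fresh 1-bucket
    i = 1
    while len(buckets.get(i, [])) > cap:
        b1, b2 = buckets[i][0], buckets[i][1]   # two oldest i-buckets
        del buckets[i][:2]
        # the merged bucket carries b1's (earlier) timestamp
        buckets.setdefault(2 * i, []).append((b1[0], 2 * i))
        i *= 2
    return buckets

buckets = {}
for t in range(1, 11):       # elements e1..e10 with timestamps 1..10
    eh_insert(buckets, t, lam=0.5)
```

After the ten insertions this yields two 1-buckets, two 2-buckets, and one 4-bucket, matching the shape of the partition in Figure 4.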

Proceedings of the 20th International Conference on Data Engineering (ICDE’04)


1063-6382/04 $ 20.00 © 2004 IEEE
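Each per-bucket sketch below is maintained by GK-algorithm, which stores tuples (vi, gi, ∆i) rather than explicit rank bounds; as recalled at the start of this section, ri− and ri+ are recovered by prefix sums. A minimal sketch of that conversion (ours, not the paper's pseudocode), with a hypothetical toy summary:

```python
def rank_bounds(summary):
    """Given GK tuples (v_i, g_i, delta_i) in value order, return
    (v_i, r_i_minus, r_i_plus) where r_i_minus = sum_{j<=i} g_j and
    r_i_plus = r_i_minus + delta_i."""
    out, running = [], 0
    for v, g, delta in summary:
        running += g                       # r_i^- accumulates the g's
        out.append((v, running, running + delta))
    return out

# hypothetical GK summary over 9 elements
tuples = [(10, 1, 0), (20, 3, 1), (30, 3, 1), (40, 2, 0)]
bounds = rank_bounds(tuples)
```

The first and last tuples of a GK summary bound the minimum and maximum exactly, which is why their ∆ values are 0 in the toy example.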
Algorithm 6 nN

Upon a new data element e arriving at time t:

Step 1 - create a new sketch: Record a new 1-bucket e, its time stamp te as t, and Ne = 0. Initialize a sketch Se.

Step 2 - drop sketches: If the number of 1-buckets is full (i.e., ⌈1/λ⌉ + 2), then do the following iteratively from i = 1 till j, where the current number (before this new element arrives) of j-buckets is not greater than ⌈1/λ⌉:

• get the two oldest buckets b1 and b2 among the i-buckets; and then

• drop b1 and b2 from the i-th bucket list; and then

• drop the sketch Sb2 (assuming that b1 is older than b2); and then

• add b1 together with its time stamp into the (i + 1)-buckets list.

Scan the sketch list from the oldest to delete the expired buckets b, i.e., those (Sb, Nb, tb) with Nb ≥ N.

Step 3 - maintain sketches: For each remaining sketch Sb, add e into Sb by GK-algorithm for ε/2-approximation, and set Nb := Nb + 1.

Algorithm 7 nN Query

Input: the sketches maintained by the algorithm nN where λ = ε/(ε + 2), n (n ≤ N), and r = ⌈φn⌉ (φ ∈ (0, 1]).

Output: an element v′ in the most recent n elements such that r − εn ≤ r′ ≤ r + εn, where r′ is the rank of v′ in the most recent n elements.

Step 1: For a given n (n ≤ N), scan the sketch list from the oldest and find the first sketch Sbn such that Nbn ≤ n.

Step 2: Apply the algorithm Lift to Sbn to generate Slift, where ζ = ε.

Step 3: For a given rank r, find the first tuple (vi, ri−, ri+) in Slift such that r − εn ≤ ri− ≤ ri+ ≤ r + εn. Return vi.

4.3. Querying Quantiles

In this subsection, we show that we can always get an ε-approximate sketch among the sketches maintained by Algorithm 6 to answer a quantile query for the most recent n (∀n ≤ N) elements. Our query algorithm, described in Algorithm 7, consists of 3 steps. First, we choose an appropriate sketch among the maintained sketches. Then, we use the algorithm Lift, and finally we check the query condition.

Theorem 8. Algorithm 7 is correct; that is,

• the algorithm is always able to return a data element;

• the element returned by the algorithm meets the required precision.

Proof. To prove the theorem, we need only to prove that Slift is an ε-approximate sketch of the most recent n elements. Note that in Algorithm 6, Sbn is maintained, by GK-algorithm for ε/2-approximation, over the most recent Nbn elements.

Case 1: Nbn < n. Suppose that b′n is the bucket in EH which is just before bn. Then, Nb′n > n, since Nbn is the largest such number that is not larger than n. Consequently, n − Nbn ≤ Nb′n − Nbn − 1. According to Theorem 7, n − Nbn ≤ (ε/(2 + ε))Nbn. This implies that n − Nbn < (ε/2)n; thus

    n − Nbn ≤ (ε/2)n.    (2)

Note that since Theorem 7 also covers an expired bucket, the inequality (2) holds if Sbn is the oldest. Based on the inequality (2) and Theorem 5, Slift is an ε-approximate sketch of the most recent n elements.

Case 2: n = Nbn. It is immediate that Slift is an ε-approximate sketch of the most recent n elements according to Theorem 5. In fact, in this case we do not have to use the algorithm Lift; however, to simplify the presentation of the algorithm, we still include the operation for this case.

Note that we can also apply a pipeline paradigm to steps 2 and 3, in a way similar to that in Section 3.5, to speed up the execution of Algorithm 7. Consequently, Algorithm 7 runs in O(log(εN)/ε) time.

5. Discussions and Applications

The algorithm developed by Alsabti-Ranka-Singh (ARS) [2] computes quantiles approximately in one scan of a dataset. Although this partition-based algorithm was originally designed for disk-resident data, it may be modified to support the sliding window. However, this algorithm requires a space of Ω(√N) [17]. Further, there is also a merge technique in the algorithm ARS, but it was specifically designed to support the algorithm ARS; for instance, it does not support the sketches generated by GK-algorithm to retain ε-approximation. Therefore, the merge technique in the algorithm ARS is not applicable to our algorithm SW. In the next section, we will also compare the modified ARS with the algorithm SW by experiments.

Note that in applying the EH partitioning technique [5], we may have another option: maintaining only local sketches for each bucket and then merging two local sketches when their corresponding buckets are merged. Though this can retain ε-approximation, we found that in some cases we have to keep all the tuples from two buckets; this prevents us
to work out a good space guarantee. In the next section, we will report the space requirements of a heuristic based on this option.

Now we show that the techniques developed in this paper can be immediately applied to other important quantile computation problems. Below we list three such applications.

Distributed and Parallel Quantile Computation: In many applications, data streams are distributed. In light of our merge technique and Theorem 2, we need only to maintain a local ε-approximate sketch for each local data stream to ensure the global ε-approximation.

Most Recent T Period: In some applications, users may be interested in computing quantiles against the data elements in the most recent T period. The algorithm nN may be immediately modified to serve this purpose. In the modification, to ensure ε-approximation we need only to change the expiration condition of a sketch: a sketch is expired if its time stamp is expired. Then we use the oldest remaining sketch to answer a quantile query. The space requirement is O(log²(εN)/ε²), where N is the maximum number of elements in a T period. Note that in this application, N is also unknown in each T; however, it can be ε-approximated by EH. We omit the details from the paper due to the space limitation.

Constraint-based Sliding Window: In many applications, the users may be interested in only the elements meeting some constraints in the most recent N elements. In this application, we can also modify the algorithm nN to support an ε-approximate quantile computation for the elements satisfying the constraints in the most recent N elements. In the modification of the algorithm nN, we only allow the elements meeting the constraints to be added, while each sketch still counts the number of elements "seen" so far (even those not meeting the constraints) and the number of elements seen so far that meet the constraints. Then, we use the oldest sketch to approximately answer a quantile query. The space requirement is O(log²(εN)/ε²). Similar to the most recent T period model, we need to ε-approximate the number of qualified elements in the most recent N elements.

6. Performance Studies

In our experiments, we modified the ARS-algorithm [2] to support the sliding window model. In our modification, we partition a data stream in the way shown in [17], which leads to the minimum space requirement. We implemented our algorithms SW and nN, as well as a heuristic, algorithm nN'. The algorithm nN' also adopts the EH partitioning technique; however, in nN' we maintain local sketches for each bucket in EH, and then use GK-algorithm [10] to merge two local sketches once the corresponding buckets have to be merged. As discussed in the last section, we did not obtain a good bound on the space requirements of nN'. All the algorithms are implemented in C++. We conducted experiments on a PC with an Intel P4 1.8GHz CPU and 512MB memory using synthetic and real datasets.

The possible factors that affect φ-quantile queries are shown in Table 1 with default values. The parameters are grouped in three categories: i) N (window size), ε (guaranteed precision), and φ (quantiles); ii) data distributions (Dd), specifying the "sortedness" of a data stream; and iii) query patterns (Qp), lengths of rank intervals (I) to be queried, and the number of queries (#Q).

Notation | Definition (Default Values)
N        | The size of the sliding window (800K)
ε        | The guaranteed precision (0.05)
φ        | (1/q, ..., (q − 1)/q) for a given q
Dd       | Data distribution (Uniform)
Qp       | The query patterns (Random)
I        | The length of a rank interval (εN)
#Q       | The number of queries (100,000)

Table 1: System parameters.

In our experiments, the data distributions (Dd) tested take the following random models: uniform, normal, and exponential. We also examined sorted or partially sorted data streams, such as globally ascending, globally descending, and partially sorted.

We use the estimation error as the error metric to evaluate the tightness of ε as an upper bound. The estimation error is represented as |r′ − r|/M, where r is the exact rank to be queried, r′ is the rank of a returned element, and M is the number of most recent elements to be queried. Here, M is N in a sliding window and is n in the n-of-N model.

As for query patterns (Qp), in addition to random φ-quantile queries, we also consider 80-20 rules, such that 80% of queries focus on a small rank interval (with length I) and 20% of queries access arbitrary ranks.

6.1. Sliding Window Techniques

In this subsection, we evaluate the performance of our sliding window techniques. In our experiments, we examined the average and maximum errors of φ-quantile queries based on a parameter q, making the φ-quantiles be of the form 1/q, 2/q, ..., (q − 1)/q.

Overall Performance. An evaluation of the overall performance of SW and ARS is shown in Figure 6 for the sliding window model. In this set of experiments, N = 800K and a data stream is generated using a uniform distribution with 1000K elements. We tested four different values of ε: 0.1, 0.075, 0.05 and 0.025.

Figure 6 (a) illustrates the average errors when q = 32. It shows that ARS performs similarly to SW regarding accuracy. This has been confirmed by a much larger

set of queries using another error metric: relative error (|r′ − r|/r). Figure 6 (c) shows the relative errors of all φ-quantile queries in the form of 1/q, 2/q, ..., (q − 1)/q (∀q ∈ [1, 800K]) when ε = 0.025. Overall, the relative error decreases while q increases, but is not necessarily monotonic. The fluctuations of SW are much smaller than those of ARS, though their average relative errors are very close.

[Figure 6: Avg/Max Errors, Space Consumptions, and Rank (sliding window): (a) Avg Errors (q = 32); (b) Space Consumption; (c) Rank vs Errors (ε = 0.025)]

Figure 6 (b) gives the space consumptions of SW and ARS, respectively, together with the theoretical upper-bound of SW. Note that we measure the space consumption by the maximum number of total sketch-tuples held in temporary storage during the computation. In the algorithm ARS, each sketch-tuple needs only to keep its data element; thus, its size is about 1/3 of those in SW. Therefore, the number of tuples we reported for the algorithm ARS is 1/3 of the actual number of tuples, for a fair comparison. We also showed the theoretical upper-bound of SW derived from that of GK-algorithm and the algorithm SW. Although the space of Ω(√N) required by the ARS-algorithm is asymptotically much larger than the theoretical space upper-bound of SW, there is still a chance for it to be smaller than that of SW when N is fixed and ε is small. This has been caught by the experiment when ε = 0.025; it is good to see that even in this situation, the actual space of SW is still much smaller than that of ARS.

Our experiments clearly demonstrated that the modified ARS is not as competitive as the algorithm SW. They have similar accuracy; however, the algorithm SW requires a much smaller space than the algorithm ARS. Further, our experiments also demonstrated that the actual performance of SW is much better than the worst-case based theoretical bounds regarding both accuracy and space.

Next we evaluate our SW techniques against the possible impact factors, such as data distributions and query patterns.

Data Distributions. We conducted another set of experiments to evaluate the impacts of 7 different distributions on the algorithm SW. These include three random models: uniform, normal, and exponential, as well as four sorted models: sorted, reverse sorted, block sorted, and semi-block sorted. Block sorted divides the data into a sequence of blocks B1, B2, ..., Bn such that the elements in each block are sorted. Semi-block sorted means that the elements in Bi are smaller than those in Bj if i < j; however, each block is not necessarily sorted.

[Figure 7: Space Consumptions vs Distributions]

[Figure 8: Errors vs Distributions (q = 32)]

Figures 7 and 8 report the experiment results. Note that in the experiments, N = 800K, the total number of elements is 1000K, and ε = 0.05. The experiment results demonstrated that the effectiveness of algorithm SW is not sensitive to data distributions.

Query Patterns. Finally, we ran 10K random φ-quantile queries, using our SW techniques, on a sliding window of N = 800K in the data stream used in the experiments in Figure 6.

In light of the 80-20 rules, we allocate 80% of the φ-quantile queries in a small interval I. In this set of experiments, we tested four values of I: 50K, 100K, 150K and 200K. We also tested the impact of the positions of these intervals with lengths I; specifically, we allocate the intervals at three positions: [1, 1 + I], [300K, 300K + I] and [600K, 600K + I].

[Figure 9: Query Patterns: (a) Avg Errors; (b) Max Errors]

The experiment results are shown in Figure 9. Figure 9 (a) shows the average errors, and Figure 9 (b) shows the maximum errors. Note that in Figure 9 (b), the four lines in each query cluster correspond, respectively, to the four different I values: 50K, 100K, 150K, and 200K.

The maximum errors approach the theoretical upper-bound simply because the number of queries is large and, therefore, the probability of having a maximum error is high.

6.2. n-of-N Techniques

[Figure 10: Avg/Max Errors, Space Consumptions, and Rank (n-of-N): (a) Avg Errors (q = 32); (b) Space Consumption; (c) Rank (ε = 0.025)]

We repeated a similar set of tests to those in Figure 6 to evaluate nN and nN'. The results are shown in Figure 10 for the n-of-N model, where we make n = 400K and N = 800K. We use the same data stream as in the experiments in Figure 6. The algorithm nN clearly outperforms nN', as well as the worst-case based theoretical bounds.

We conducted another set of experiments to examine the impacts of n and N on the algorithms nN and nN'. The data stream used is the same as above, with 1000K elements. The experiment results are reported in Figure 11.

[Figure 11: Space and Errors (n-of-N): (a) Space; (b) Avg Errors; (c) Max Errors]

Figure 11 (a) shows the space consumptions for the algorithm nN where N changes from N = 100K to N = 1000K. Note that the theoretic space upper-bound means the one for the algorithm nN.

Figure 11 (b) and (c) show the average and maximum errors, respectively, when N = 1000K and n changes from n = 200K to n = 1000K. For each n, we run 320 random queries.

6.3. Query Costs

[Figure 12: Query Performance (CPU)]

In Figure 12, we show the total query processing costs (CPU) for running 20K queries, where all the parameters take the default values and the number of data stream elements is 1000K. Note that the algorithm nN Query is the most efficient one, as there is no merge operation.

6.4. Real Dataset Testing

Topic detection and tracking (TDT) is an important issue in information retrieval and text mining (http://www.ldc.upenn.edu/Projects/TDT3/). We have archived the news stories received through the Reuters real-time datafeed, which contains 365,288 news stories and 100,672,866 (duplicate) words. All articles such as "the" and "a" are removed before term stemming.

[Figure 13: Real Dataset: (a) Space; (b) Errors (Avg, Max)]

We use N = 800K and ε = 0.05. Figure 13 shows the space consumptions and average/max errors using q = 32, for SW, nN and nN'. They follow similar trends to those for the synthetic data.

In summary, our experiment results demonstrated that the algorithm SW and the algorithm nN perform much better than their worst-case based theoretical bounds, respectively. In addition to good theoretical bounds, the algorithm SW and the algorithm nN also outperform the other techniques.
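The two error metrics reported throughout this section are straightforward to compute; a small helper of ours (not from the paper), where r is the exact rank, r_prime the rank of the returned element, and m the number of most recent elements queried (N for the sliding window, n for n-of-N):

```python
def estimation_error(r, r_prime, m):
    """|r' - r| / M: rank deviation normalized by the queried window size."""
    return abs(r_prime - r) / m

def relative_error(r, r_prime):
    """|r' - r| / r: rank deviation normalized by the queried rank itself."""
    return abs(r_prime - r) / r

# a 0.5-quantile query over M = 800K answered at rank 404,000 instead of 400,000
err = estimation_error(400_000, 404_000, 800_000)   # 0.005, well within eps = 0.05
```

An ε-approximate summary guarantees estimation_error ≤ ε for every query, which is why the plots above compare the measured errors against ε as an upper bound.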

Table 2: Comparing Our Results with Recent Results in Quantile Computation

Authors                    | Year | Computation Model           | Precision          | Space Requirement
Alsabti, et al. [2]        | 1997 | data sequence               | εN, deterministic  | Ω(√N)
Manku, et al. [17]         | 1998 | data sequence               | εN, deterministic  | O(log²(εN)/ε)
Manku, et al. [18]         | 1999 | data stream/appending only  | εN, conf = 1 − δ   | O(ε⁻¹ log²(ε⁻¹) + ε⁻¹ log²(log δ⁻¹))
Greenwald and Khanna [10]  | 2001 | data stream/appending only  | εN, deterministic  | O((1/ε) log(εN))
Gilbert, et al. [9]        | 2002 | data stream/with deletion   | εN, conf = 1 − δ   | O((log²|U| log log(|U|/δ))/ε²)
Lin, et al. [this paper]   | 2003 | data stream/sliding window  | εN, deterministic  | O(log(εN)/ε + 1/ε²)
Lin, et al. [this paper]   | 2003 | data stream/n-of-N          | εN, deterministic  | O(log²(εN)/ε²)
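The distributed application in Section 5 rests on merging ε-approximate sketches while preserving the precision. The following is our own toy illustration of the usual rank-bound bookkeeping for such a merge, not the paper's Theorem 2 construction; it assumes each summary is a value-sorted list of (value, rmin, rmax) tuples whose last rmax equals the summary's element count:

```python
import bisect

def merge_summaries(a, b):
    """Merge two quantile summaries over disjoint data sets.  Each tuple's
    rank bounds are widened by the ranks its value could occupy in the
    other summary: at least the predecessor's rmin below it, and at most
    the successor's rmax - 1 below it (or everything, past the end)."""
    def against(xs, ys):
        n_y = ys[-1][2] if ys else 0
        vals = [v for v, _, _ in ys]
        out = []
        for v, rmin, rmax in xs:
            i = bisect.bisect_right(vals, v)
            lo = ys[i - 1][1] if i > 0 else 0          # ranks surely below v in ys
            hi = ys[i][2] - 1 if i < len(ys) else n_y  # ranks possibly below v in ys
            out.append((v, rmin + lo, rmax + hi))
        return out
    return sorted(against(a, b) + against(b, a))

# exact (zero-error) summaries of {1, 3, 5} and {2, 4, 6}
a = [(1, 1, 1), (3, 2, 2), (5, 3, 3)]
b = [(2, 1, 1), (4, 2, 2), (6, 3, 3)]
merged = merge_summaries(a, b)
```

On these exact inputs the merged bounds collapse to the true ranks in the union; on ε-approximate inputs the same bookkeeping keeps the bounds within ε of the combined stream length.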

7. Conclusions

In this paper, we presented our results on maintaining quantile summaries for data streams. While there is quite a lot of related work reported in the literature, the work reported here is among the first attempts to develop space-efficient, one-pass, deterministic quantile summary algorithms with performance guarantees under the sliding window model of data streams. Furthermore, we proposed new techniques for the n-of-N model, which we believe has wide applications. As our performance study indicated, the algorithms proposed for both models provide much more accurate quantile estimates than the guaranteed precision while requiring much smaller space than the worst case bounds. In Table 2, we compare our results with the recent results in quantile computation under various models.

An immediate future work is to investigate the problem of maintaining other statistics under the n-of-N model. Furthermore, the technique developed in this work that merges multiple ε-approximate quantile sketches into a single ε-approximate quantile sketch is expected to have applications in distributed and parallel systems. One possible direction is to investigate the issues related to maintaining distributed quantile summaries for large systems under a sliding window.

References

[1] R. Agrawal and A. Swami. A one-pass space-efficient algorithm for finding quantiles. In S. Chaudhuri, A. Deshpande, and R. Krishnamurthy, editors, COMAD, 1995.
[2] K. Alsabti, S. Ranka, and V. Singh. A one-pass algorithm for accurately estimating quantiles for disk-resident data. In VLDB, pages 346–355, 1997.
[3] Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-dimensional regression analysis of time-series data streams. In VLDB, 2002.
[4] A. Das, J. Gehrke, and M. Riedewald. Approximate join processing over data streams. In SIGMOD, 2003.
[5] M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows (extended abstract). In SODA, pages 635–644, 2002.
[6] A. Dobra, M. N. Garofalakis, J. Gehrke, and R. Rastogi. Processing complex aggregate queries over data streams. In SIGMOD, 2002.
[7] M. Garofalakis and A. Kumar. Correlating XML data streams using tree-edit distance embeddings. In SIGMOD, 2003.
[8] J. Gehrke, F. Korn, and D. Srivastava. On computing correlated aggregates over continual data streams. In SIGMOD, pages 13–24, 2001.
[9] A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. How to summarize the universe: Dynamic maintenance of quantiles. In VLDB, 2002.
[10] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In SIGMOD, pages 58–66, 2001.
[11] S. Guha and N. Koudas. Approximating a data stream for querying and estimation: Algorithms and performance evaluation. In ICDE, 2002.
[12] S. Guha, N. Koudas, and K. Shim. Data-streams and histograms. In STOC, pages 471–475, 2001.
[13] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In FOCS, pages 359–366, 2000.
[14] A. Gupta and D. Suciu. Stream processing of XPath queries with predicates. In SIGMOD, 2003.
[15] J. Kang, J. Naughton, and S. Viglas. Evaluating window joins over unbounded streams. In ICDE, 2003.
[16] G. Manku and R. Motwani. Approximate frequency counts over data streams. In VLDB, 2002.
[17] G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Approximate medians and other quantiles in one pass and with limited memory. In SIGMOD, pages 426–435, 1998.
[18] G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets. In SIGMOD, pages 251–262, 1999.
[19] J. I. Munro and M. S. Paterson. Selection and sorting with limited storage. Theoretical Computer Science, 12, 1980.
[20] C. Olston, J. Jiang, and J. Widom. Adaptive filters for continuous queries over distributed data streams. In SIGMOD, 2003.
[21] R. Ramakrishnan. Database Management Systems. McGraw-Hill, 2002.
[22] Y. Zhu and D. Shasha. StatStream: Statistical monitoring of thousands of data streams in real time. In VLDB, 2002.
