Download as pdf or txt
Download as pdf or txt
You are on page 1of 14

120 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO.

1, JANUARY 2010

The Dynamic Bloom Filters


Deke Guo, Member, IEEE, Jie Wu, Fellow, IEEE, Honghui Chen, Ye Yuan, and Xueshan Luo

Abstract—A Bloom filter is an effective, space-efficient data structure for concisely representing a set, and supporting approximate
membership queries. Traditionally, the Bloom filter and its variants just focus on how to represent a static set and decrease the false
positive probability to a sufficiently low level. By investigating mainstream applications based on the Bloom filter, we reveal that
dynamic data sets are more common and important than static sets. However, existing variants of the Bloom filter cannot support
dynamic data sets well. To address this issue, we propose dynamic Bloom filters to represent dynamic sets, as well as static sets and
design necessary item insertion, membership query, item deletion, and filter union algorithms. The dynamic Bloom filter can control the
false positive probability at a low level by expanding its capacity as the set cardinality increases. Through comprehensive mathematical
analysis, we show that the dynamic Bloom filter uses less expected memory than the Bloom filter when representing dynamic sets with
an upper bound on set cardinality, and also that the dynamic Bloom filter is more stable than the Bloom filter due to infrequent
reconstruction when addressing dynamic sets without an upper bound on set cardinality. Moreover, the analysis results hold in stand-
alone applications, as well as distributed applications.

Index Terms—Bloom filters, dynamic Bloom filters, information representation.

1 INTRODUCTION

I NFORMATION representation and processing of member-


ship queries are two associated issues that encompass the
core problems in many computer applications. Representa-
great potential for representing a set in main memory [13] in
stand-alone applications. For example, SBFs have been used
to provide a probabilistic approach for explicit state model
tion means organizing information based on a given format checking of finite-state transition systems [13], to summar-
and mechanism such that information is operable by a ize the contents of stream data in memory [14], [15], to store
corresponding method. The processing of membership the states of flows in the on-chip memory at networking
queries involves making decisions based on whether an devices [16], and to store the statistical values of tokens to
item with a specific attribute value belongs to a given set. A speed up the statistical-based Bayesian filters [17].
standard Bloom filter (SBF) is a space-efficient data The SBF has been modified and improved from different
structure for representing a set and answering membership aspects for a variety of specific problems. The most
queries within a constant delay [1]. The space efficiency is important variations include compressed Bloom filters
achieved at the cost of false positives in membership [18], counting Bloom filters [12], distance-sensitive Bloom
queries, and for many applications, the space savings filters [19], Bloom filters with two hash functions [20], space-
outweigh this drawback when the probability of an error code Bloom filters [21], spectral Bloom filters [22], general-
is sufficiently low. ized Bloom filters [23], Bloomier filters [24], and Bloom
The SBF has been extensively used in many database filters based on partitioned hashing [25]. Compressed Bloom
applications [2], for example, the Bloom join [3]. Recently, it filters can improve performance in terms of bandwidth
has started receiving more widespread attention in net- saving when an SBF is passed on as a message. Counter
working literature [4]. An SBF can be used as a summariz- Bloom filters deal mainly with the item deletion operation.
ing technique to aid global collaboration in peer-to-peer Distance-sensitive Bloom filters, using locality-sensitive
(P2P) networks [5], [6], [7], support probabilistic algorithms hash functions, can answer queries of the form, “Is x close
for routing and locating resources [8], [9], [10], [11], and to an item of S?” Bloom filters with two hash functions use a
share Web cache information [12]. In addition, SBFs have standard technique in hashing to simplify the implementa-
tion of SBFs significantly. Space-code Bloom filters and
spectral Bloom filters focus on multisets, which support
. D. Guo, H. Chen, and X. Luo are with the Key Laboratory of C4ISR queries of the form, “How many occurrences of an item are
Technology, National University of Defense Technology, Changsha
410073, China. there in a given multiset?” The SBF and its mainstream
E-mail: {guodeke, chh0808}@gmail.com, xsluo@nudt.edu.cn. variations are suitable for representing static sets whose
. J. Wu is with the Department of Computer and Information Sciences, cardinality is known prior to design and deployment.
Temple University, 1805 N. Borad Street, Philadelphia, PA 19122.
E-mail: jiewu@temple.edu.
Although the SBF and its variations have found suitable
. Y. Yuan is with the Institute of Computer Systems, Northeastern applications in different fields, the following three obstacles
University, 132#, Shen Yang City, Liao Ning Province 110004, China. still lack suitable and practical solutions:
E-mail: linuxyy@gmail.com.
Manuscript received 26 May 2007; revised 19 July 2008; accepted 10 Feb. 1. For stand-alone applications that know the upper
2009; published online 18 Feb. 2009. bound on set cardinality for a dynamic set in
Recommended for acceptance by D. Gunopulos advance, a large number of bits are allocated for an
For information on obtaining reprints of this article, please send e-mail to:
tkde@computer.org, and reference IEEECS Log Number TKDE-2007-05-0239. SBF to represent all possible items of the dynamic set
Digital Object Identifier no. 10.1109/TKDE.2009.57. at the outset. This approach diminishes the space
1041-4347/10/$26.00 ß 2010 IEEE Published by the IEEE Computer Society
GUO ET AL.: THE DYNAMIC BLOOM FILTERS 121

efficiency of the SBF, and should be replaced by new sets. Section 4 evaluates the performance of DBFs in stand-
Bloom filters which always use an appropriate alone, as well as distributed applications. Section 5
number of bits as set cardinality changes. concludes this work.
2. For stand-alone applications that do not know the
upper bound on set cardinality of a dynamic set in
advance, it is difficult to accurately estimate a
2 CONCISE REPRESENTATION AND MEMBERSHIP
threshold of set size and assign optimal parameters QUERIES OF STATIC SETS
to an SBF in advance. In the event that the cardinality 2.1 Standard Bloom Filters
of the dynamic set exceeds the estimated threshold A Bloom filter for representing a set X ¼ fx1 ; . . . ; xn g of
gradually, the SBF might become unusable due to a n items is described by a vector of m bits, initially all set to 0.
high false positive probability. A Bloom filter uses k independent hash functions h1 ; . . . ; hk
3. For distributed applications, all nodes adopt the to map each item of X to a random number over a range
same configuration in an effort to guarantee the f1; . . . ; mg [1], [4] uniformly. For each item x of X, we define
interoperability of SBFs between nodes. In this case, its Bloom filter address as BfaddressðxÞ, consisting of hi ðxÞ
all nodes are required to reconstruct their local SBFs for 1  i  k, and the bits belonging to BfaddressðxÞ are set
once the set size of any node exceeds a threshold to 1 when inserting x. Once the set X is represented as a
value at the cost of large (sometimes huge) overhead. Bloom filter, to judge whether an element x belongs to X,
In addition, this approach requires that the nodes one just needs to check whether all the hi ðxÞ bits are set to 1.
with small sets must sacrifice more space so as to be If so, then x is a member of X (however, there is a probability
in accordance with nodes with large sets, hence that this could be wrong). Otherwise, we assume that x is not
reducing the space efficiency of SBFs and causing a member of X. It is clear that a Bloom filter may yield a false
large transmission overhead. positive due to hash collisions, for which it suggests that an
The SBF and variants do not take dynamic sets into element x is in X even though it is not. The reason is that all
account. To address the three obstacles, we propose indexed bits were previously set to 1 by other items [1].
dynamic Bloom filters (DBF) to represent a dynamic set, The probability of a false positive for an element not in
instead of rehashing the dynamic set into a new filter as the the set can be calculated in a straightforward fashion, given
set size changes [26]. DBF can control the false positive our assumption that hash functions are perfectly random.
probability at a low level by adjusting its capacity1 as the set Let p be the probability that a random bit of the Bloom filter
cardinality changes. We then compare the performances of is 0, and let n be the number of items that have been added
SBF and DBF in three categories of stand-alone applications, to the Bloom filters. Then, p ¼ ð1  1=mÞnk  enk=m as
which feature two different types of sets: static sets with n  k bits are randomly selected, with probability 1=m in
BF
known cardinality, and dynamic sets with or without an the process of adding each item. We use fm;k;n to denote the
upper bound on cardinality. Moreover, we evaluate the false positive probability caused by the ðn þ 1Þth insertion,
performance of DBFs in distributed applications. The major and we have the expression:
advantages of DBF are summarized as follows:
BF
fm;k;n ¼ ð1  pÞk  ð1  ekn=m Þk : ð1Þ
1. In stand-alone applications, a DBF can enhance its
capacity on-demand via an item insertion operation. In the remainder of this paper, the false positive
It can also control the false positive probability at an probability is also called the false match probability. We can
acceptable level as set cardinality increases. DBFs can calculate the filter size and number of hash functions given
shorten their capacities as the set cardinality de- the false match probability and the set cardinality according
BF
creases through item deletion and merge operations. to (1). From [1], we know that the minimum value of fm;k;n
m=n
2. In distributed applications, DBFs always satisfy the is 0:6185 when k ¼ ðm=nÞ ln 2. In practice, of course, k
requirement of interoperability between nodes when must be an integer, and smaller k might be preferred since
handling dynamic sets and occupying a suitable that would reduce the amount of computation required.
amount of memory to avoid unnecessary waste and For a static set, it is possible to know the whole set in
transmission overhead. advance and design a perfect hash function to avoid hash
3. In stand alone, as well as distributed applications, collisions. In reality, an SBF is usually used to represent
DBFs use less expected memory than SBFs when dynamic sets as well as static sets. Therefore, it is impossible
dealing with dynamic sets that have an upper to know the whole set and design k perfect hash functions
bound on set cardinality. DBFs are also more stable in advance. On the other hand, different perfect hash
than SBFs due to infrequent reconstruction when functions used by an SBF may cause hash collisions. Thus,
dealing with dynamic sets that lack an upper bound the perfect hash functions are not suitable for overcoming
on set cardinality. hash collisions in SBFs in theory, as well as practice.
On the other hand, a static set is typically not allowed to
The rest of this paper is organized as follows: Section 2
perform data addition and deletion operations once it is
surveys standard Bloom filters and presents the algebra
represented by an SBF. Thus, the bit vectors of the SBF will
operations on them. Section 3 studies the concise represen- stay the same over time, and then, the SBF can correctly
tation and approximate membership queries of dynamic reflect the set. Therefore, the membership queries based on
the SBF will not yield a false negative in this scenario.
1. The capacity of a filter is defined as the largest number of items which
could be hashed into the filter such that the false match probability does not However, the SBF must commonly handle a dynamic set that
exceed a given upper bound. is changing over time, with items being added and deleted.
122 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 1, JANUARY 2010

In order to support the data deletion operation, an SBF represented by a logical “and” operation between their bit
hashes the item to be deleted and resets the corresponding vectors.
bits to 0. It may, however, set a location to 0, which is also Theorem 3. If BF ðA \ BÞ; BF ðAÞ, and BF ðBÞ use the same m
mapped by other items. In such a case, the SBF no longer and hash functions, then 2 BF ðA \ BÞ ¼ BF ðAÞ \ BF ðBÞ
correctly reflects the set and will produce false negative with probability ð1  1=mÞk jAA\BjjBA\Bj .
judgments with high probability. To address this problem,
Proof. Assume that the number of hash functions is k. We can
Fan et al. introduced counting Bloom filters (CBFs) [12].
derive (2) according to Definitions 1, 2, and Theorem 1:
Each entry in the CBF is not a single bit but rather a small
counter that consists of several bits. When an item is added, BF ðAÞ \ BF ðBÞ ¼ ðBF ðA  A \ BÞ \ BF ðB  A \ BÞÞ
the corresponding counters are incremented; when an item
[ BF ðA \ BÞ:
is deleted, the respective counters are decremented. The
experimental results and mathematical analysis show that ð2Þ
four bits for each counter is large enough to avoid In fact, the items of set A \ B contribute the same bits
overflows [12]. whose value is 1 to Bloom filters BF ðA \ BÞ and
2.2 Algebra Operations on Bloom Filters BF ðAÞ \ BF ðBÞ. According to (2), it is easy to derive
that BF ðAÞ \ BF ðBÞ equals to BF ðA \ BÞ only if
We use two standard Bloom filters, BF ðAÞ and BF ðBÞ, as
BF ðA  A \ BÞ \ BF ðB  A \ BÞ ¼ 0.
the representations of two different static sets A and B,
For any item z 2 ðB  A \ BÞ, the probability that bits
respectively.
hash1 ðzÞ; . . . ; hashk ðzÞ of BF ðA  A \ BÞ are 0 should be
Definition 1 (Union of standard Bloom filters). Assume that 2
pk ¼ ð1  1=mÞk jAA\Bj . Thus, we can infer that the
BF ðAÞ and BF ðBÞ use the same m and hash functions. Then,
the union of BF ðAÞ and BF ðBÞ, denoted as BF ðCÞ, can be probability that BF ðB  A \ BÞ \ BF ðA  A \ BÞ ¼ 0
2
represented by a logical “or” operation between their bit should be ð1  1=mÞk jAA\BjjBA\Bj . Theorem 3 is
vectors. true. u
t
Theorem 1. If BF ðA [ BÞ; BF ðAÞ, and BF ðBÞ use the same m 2.3 Related Works
and hash functions, then BF ðA [ BÞ ¼ BF ðAÞ [ BF ðBÞ.
The most closely related work is split Bloom filters [27].
Proof. Assume that the number of hash functions is k. We They increase their capacity by allocating a fixed s  m bit
choose an item y from set A [ B randomly, and y must matrix instead of an m-bit vector as used by the SBF to
also belong to set A or B. Bits hashi ðyÞ of BF ðA [ BÞ are represent a set. A certain number of s filters, each with m
set to 1 for 1  i  k, and at the same time, bits hashi ðyÞ bits, are employed and uniformly selected when inserting
of BF ðAÞ or BF ðBÞ are set to 1; thus, BF ðAÞ½hashi ðyÞ [ an item of the set. The false match probability increases as
BF ðBÞ½hashi ðyÞ are also set to 1. On the other hand, we the set cardinality grows. An existing split Bloom filter must
chose an item x from set A or B randomly, and said x be reconstructed using a new bit matrix if the false match
also belongs to set A [ B. Bits hashi ðxÞ of BF ðAÞ [ probability exceeds an upper bound. In practice, the split
BF ðBÞ are set to 1 for 1  i  k, and at the same time, Bloom filters also need to estimate a threshold of set
bits hashi ðxÞ of BF ðA [ BÞ are also set to 1. Thus, cardinality, and encounter the same problems faced by SBF.
BF ðA [ BÞ½i ¼ BF ðAÞ½i [ BF ðBÞ½i for 1  i  m. The- Although dynamic Bloom filters adopt a similar structure as
orem 1 is proved to be true. u
t split Bloom filters, they are different in the following
Theorem 2. The false positive probability of BF ðA [ BÞ is not aspects. First, split Bloom filters always consume s  m bits
less than that of both BF ðAÞ and BF ðBÞ. At the same time, and waste too much memory before the set cardinality
the false positive probability of BF ðAÞ [ BF ðBÞ is greater reaches ðm  ln 2Þ=k, whereas dynamic Bloom filters allo-
than or equal to that of BF ðAÞ as well as BF ðBÞ. cate memory in an incremental manner. Second, split Bloom
filters do not support the data deletion operation, which is
Proof. Assume that the sizes of sets A; B, and A [ B are required in order to really support dynamic sets, whereas
na ; nb , and nab , respectively. According to (1), we can dynamic Bloom filters do support dynamic sets. Third,
calculate the false positive probability for BF ðAÞ; BF ðBÞ, dynamic Bloom filters propose dedicated solutions for four
and BF ðA [ BÞ. different scenarios, as shown in Section 4.
In fact, given the same k and m, (1) is a monotonically Another related work is scalable Bloom filters [28], which
increasing function of n. It is true that jA [ Bj  address the same problem and adopt a similar solution
maxðjAj; jBjÞ; thus, nab is not less than na and nb . We proposed by DBFs [26]. A scalable Bloom filter also employs
could infer that the false positive probability of BF ðA [ a series of SBFs in an incremental manner, but uses a
BÞ is not less than that of BF ðAÞ and BF ðBÞ. According different method to allocate memory for each SBF. It
to Theorem 1, we know that BF ðA [ BÞ ¼ BF ðAÞ [ allocates m  ai1 bits for its ith SBF, where a is a given
BF ðBÞ; thus, the false positive probability of BF ðAÞ [ positive integer and 1  i  s, while a DBF allocates m bits
BF ðBÞ is also not less than the value of BF ðAÞ, as well as for each DBF. It achieves a lower false positive probability
BF ðBÞ. Theorem 2 is proven to be true. u
t than a DBF that holds the same number of SBFs by using
Definition 2 (Intersection of Bloom filters). Assume that more memory, but suffers from a drawback due to the use
BF ðAÞ and BF ðBÞ use the same m and hash functions. Then, of heterogeneous SBFs featuring different sizes and hash
the intersection BF ðAÞ and BF ðBÞ, denoted as BF ðCÞ, can be functions. It causes large overhead due to the need to
GUO ET AL.: THE DYNAMIC BLOOM FILTERS 123

calculate the Bloom filter address for items in each SBF inserts x into the active SBF and increments nr by one for
when performing a membership query operation. In a DBF, the active SBF. If X does not decrease after deployment,
the Bloom filter address of items in each SBF is the same. only the last SBF of the DBF will be active, whereas the
This makes it possible to optimize the storage and retrieval other SBFs are full. Otherwise, these full SBFs may become
of a DBF by using the bit slice approach, as shown in active if some items are removed from the set X.
Section 4.5. On the other hand, it also lacks an item deletion It is convenient to represent X as a DBF by invoking
operation and solutions for dedicated application scenarios. Algorithm 1 repeatedly. After achieving the DBF, we can
answer any set membership queries based on the DBF
instead of X. The detailed process is illustrated in
3 CONCISE REPRESENTATION AND MEMBERSHIP Algorithm 2, which uses an item x as input. If all the
QUERIES OF DYNAMIC SET hashj ðxÞ counters are set to a nonzero value for 1  j  k in
DBF focuses on addressing dynamic sets with changing the first SBF, then the item x is a member of X. Otherwise,
cardinality rather than static sets, which were addressed by the DBF checks its second SBF, and so on. In summary, x is
the previous version. It should be noted that DBFs can not a member of X if it is not found in all SBFs, and is a
support static sets. Throughout this paper, an SBF is called member of X if it is found in any SBF of the DBF.
active only if its false match probability does not reach a Algorithm 2. Query (x)
designed upper bound; otherwise, it is called full. Let nr be Require: x is not null
the number of items accommodated by an SBF. The nr is 1: for i ¼ 1 to s do
equal to the capacity c for a full SBF and less than c for an
2: counter 0
active SBF. In the rest of this paper, we use SBF to imply
3: for j ¼ 1 to k do
counting Bloom filters for the sake of supporting the item
4: if StandardBFi ½hashj ðxÞ ¼ 0 then
deletion operation.
5: break
3.1 Overview of Dynamic Bloom Filters 6: else
A DBF consists of s homogeneous SBFs. The initial value of 7: counter counter þ 1
s is one, and the initial SBF is active. The DBF only inserts 8: if counter ¼ k then
items of a set into the active SBF, and appends a new SBF as 9: Return true
an active SBF when the previous active SBF becomes full. 10: Return false
The first step to implement a DBF is initializing the If an item x is removed from X, the corresponding DBF
following parameters: the upper bound on false match must execute Algorithm 3 with x as the input in order to
probability of the DBF, the largest value of s, the upper reflect X as consistently as possible. First of all, the DBF
bound on false match probability of the SBF, the filter size m must identify the SBF in which all the hashj ðxÞ counters are
of the SBF, the capacity c of the SBF, and number of hash set to a nonzero for 1  j  k. If no SBF exists that satisfies
functions k of the SBF. As we will discuss further on in this the constraint in the DBF, the item deletion operation will
paper, the approaches used to initialize these parameters be rejected since x does not belong to X. If there is only one
are not identical in different scenarios. For more informa- SBF satisfying the constraint, the counters hashj ðxÞ for 1 
tion, readers may refer to Section 4. j  k are decremented by one. If there are multiple SBFs
satisfying the constraint, then x may appear to be in
Algorithm 1. Insert (x)
multiple SBFs of the DBF. Thus, it is impossible for the DBF
Require: x is not null
to know which is the right one. If the DBF persists in
1: ActiveBF GetActiveStandardBF ðÞ
removing membership information of x from it, the wrong
2: if ActiveBF is null then
SBF may perform the item deletion operation with given
3: ActiveBF CreateStandardBF ðm; kÞ
probability. The wrong item deletion operation destroys the
4: Add ActiveBF to this dynamic Bloom filter.
DBF and leads to, at most, k potential false negatives. To
5: s sþ1
avoid producing false negatives, the membership informa-
6: for i ¼ 1 to k do
tion of such items is kept by the DBF, but removed from X.
7: ActiveBF ½hashi ðxÞ ActiveBF ½hashi ðxÞ þ 1
8: ActiveBF :nr ActiveBF :nr þ 1 Algorithm 3. Delete (x)
GetActiveStandardBF() Require: x is not null
1: for j ¼ 1 to s do 1: index null
2: if StandardBFj :nr < c then 2: counter 0
3: Return StandardBFj 3: for i ¼ 1 to s do
4: Return null 4: if BF[i].Query(x) then
Given a dynamic set X with n items, we will first show 5: index i
how a DBF is represented through a series of item insertion 6: counter counter þ 1
operations. Algorithm 1 contains the details regarding the 7: if counter > 1 then
process of the item insertion operation. It is clear that the 8: break
DBF should first discover an active SBF when inserting an 9: if counter ¼ 1 then
item x of X. If there are no active SBFs, the DBF creates a 10: for i ¼ 1 to k do
new SBF as an active SBF and increments s by one. The DBF 11: BF ½index½hashi ðxÞ BF ½index½hashi ðxÞ  1
124 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 1, JANUARY 2010

Fig. 1. False positive probabilities of dynamic and standard Bloom Fig. 2. The ratio of false positive probability of a standard Bloom filter to
filters are functions of the actual size n of a dynamic set, where the value of a DBF is a function of the actual size n of a dynamic set,
m ¼ 1; 280, k ¼ 7, and c ¼ 133. where k ¼ 7 and c ¼ 133.

12: BF ½index:nr BF ½index:nr  1 counters of bfaddressðxÞ in any SBF might have been set to
13: Merge() a nonzero value by items of X.
14: Return true If the cardinality of X is not greater than the capacity of an
15: else SBF (n  c), then the false match probability of the DBF can be
16: Return false calculated according to (1) since the DBF is just an SBF.
Merge() Otherwise, the false match probability of the DBF can be
1: for j ¼ 1 to s do calculated in a straightforward way. The false positive
BF
2: if StandardBFj :n < c then probability of the first s  1 SBFs is fm;k;c , and that of the last
BF
3: for k ¼ j þ 1 to s do SBF is fm;k;nl with nl ¼ n  c  bn=cc. Then, the probability
4: if StandardBFj :nr þ StandardBFk :nr < c that not all counters of bfaddressðxÞ in each SBF of the DBF are
BF bn=cc BF
then set to a nonzero value is ð1  fm;k;c Þ ð1  fm;k;nl
Þ. Thus, the
5: StandardBFj StandardBFj [ StandardBFk probability that all the counters of bfaddressðxÞ in at least one
6: StandardBFj :nr þ StandardBFk :nr SBF of the DBF are set to a nonzero value can be denoted as:
7: Clear StandardBFk from the dynamic DBF
 BF
bn=cc  BF

fm;k;c;n ¼ 1  1  fm;k;c;c 1  fm;k;c;n
Bloom filter. l

8: Break ¼ 1  ð1  ð1  ekc=m Þk Þbn=cc ð3Þ


  k 
Furthermore, two active SBFs should be replaced by the 1  1  ekðncbn=ccÞ=m :
union of them if the addition of their nr is not greater than
the capacity c of one SBF. The union operation of counting In the following discussion, we will use DBF, as well as
Bloom filters is similar to that of standard Bloom filters, SBF to represent X, and observe the changing trend of
DBF BF
which performs the addition operation between counter fm;k;c;n and fm;k;n as n increases continuously. For 1  n  c,
vectors instead of the logical or operation between bit the false positive probability of DBF equals that of SBF. In
vectors. Note that there is at most one pair of SBFs which this case, the DBF becomes the SBF, and (3) also becomes
satisfy the constraint of union operation after an item is (1). For n > c, the false positive probability of the DBF
removed from the DBF. increases gradually with n, while that of the SBF increases
The average time complexity of adding an item x to an SBF quickly to a high value, and then, slowly increases to almost
and a DBF is the same: OðkÞ, where k is the number of hash one. For example, when n reaches 10  c; fm;k;10cBF
becomes
functions used by them. The average time complexities of about 100 times larger than fm;k;cBF DBF
, but fm;k;c;10c is about 10
membership queries for SBF and DBF are OðkÞ and Oðk  sÞ, BF
times larger than fm;k;c . We can draw a conclusion from the
respectively. The average time complexities of a member (1), (3), and Fig. 1 that DBF scales better than SBF after the
deletion for SBF and DBF are OðkÞ and Oðk  sÞ, respectively. actual size of X exceeds the capacity of one SBF.
Furthermore, we use multiple DBFs and SBFs to
3.2 False Match Probability of Dynamic Bloom BF DBF
Filters represent X, and study the trend of fm;k;n =fm;k;c;n as the
cardinality n of X increases continuously. In our experi-
In this section, we analyze the false positive probability of a
DBF under two scenarios. Items of X are not allowed to be ments, we chose four kinds of DBFs using four SBFs, which
deleted from X in the first scenario, whereas they are were all different with respect to size. For all four SBFs, the
allowed to in the second scenario. number of hash functions is 7, and the predefined upper
As discussed above, a DBF with s ¼ dn=ce SBFs can bound on false positive probability is 0.0098. The experi-
represent a dynamic set X with n items. If we use the DBF mental results are shown in Fig. 2. It is obvious that all four
BF DBF
instead of X to answer a membership query, we may meet a curves follow a similar trend. The ratio fm;k;n =fm;k;c;n is a
false match at a given probability. We will evaluate the function of the actual size n of X. For 1  n  c, the ratio
probability with which the DBF yields a false positive equals 1. For n > c, the ratio quickly reaches to the peak due
DBF
judgment for an item x not in X. The reason is that all to the slow increase in fm;k;c;n , and the quick increase in
GUO ET AL.: THE DYNAMIC BLOOM FILTERS 125

Fig. 3. False positive probability of BF ðAÞ [ BF ðBÞ minus that of


DBF ðAÞ [ DBF ðBÞ is a function of size na of dynamic set A and nb of Fig. 4. The false positive probabilities of four kinds of DBFs are functions
set B, where m ¼ 1;280; k ¼ 7, and c ¼ 133. of the actual size n of a dynamic set, where k ¼ 7, and the predefined
threshold of false positive probability of each DBF is 0.0098.
BF
fm;k;n , and then, decreases slowly. After n exceeds c, the DBF
with a different parameter m scales better than the Theorem 5. If the size of A and B is not zero and less than c, the
corresponding SBF. In fact, the value of m has no effect on false positive probability of DBF ðAÞ [ DBF ðBÞ is less than
BF DBF that value of BF ðAÞ [ BF ðBÞ.
the trend of fm;k;n =fm;k;c;n .
We know that fm;k;n BF DBF
and fm;k;c;n are monotonically Proof. The false positive probability of DBF ðAÞ [ DBF ðBÞ
DBF
is denoted as fm;k;c;n a þnb
, and that of BF ðAÞ [ BF ðBÞ is
decreasing functions of m according to (1) and (3). In other BF
DBF DBF
denoted as fm;k;n a þnb
. Because the size of A and B is less
words, fm1 ;k;c;n
< fm 2 ;k;c;n
for m1 > m2 . This means that the than c, (4) can be simplified as (5). Let x ¼ ekna =m and
DBF DBF
curve of fm1 ;k;c;n
is always lower than the curve of fm2 ;k;c;n
as n y ¼ eknb =m , we can obtain (7), which denotes fm;k;n BF
a þnb
BF DBF
increases. In fact, so does fm;k;n . We also conduct experiments minus fm;k;c;n a þnb
according to (1) and (5):
to confirm this conclusion, and illustrate the result in Fig. 4. DBF
fm;k;c;n a þnb
¼ 1  ð1  ð1  ekna =m Þk Þ
ð5Þ
3.3 Algebra Operations on Dynamic Bloom Filters ð1  ð1  eknb =m Þk Þ;
Given two different dynamic sets A and B, we can use
two dynamic Bloom filters DBF ðAÞ and DBF ðBÞ, or
fðx; yÞ ¼ ð1  xyÞk þ ðð1  xÞð1  yÞÞk  ð1  xÞk
two standard Bloom filters BF ðAÞ and BF ðBÞ to represent ð6Þ
them, respectively.  ð1  yÞk ;

Definition 3 (Union of dynamic Bloom filters). Given the


fðaÞ  fðdÞ
same SBF, we assume that DBF ðAÞ and DBF ðBÞ use s1  m
and s2  m bit matrixes, respectively. DBF ðAÞ [ DBF ðBÞ ¼ fðdÞða  dÞ þ    þ f k1 ðdÞða  dÞk1 =ðk  1Þ! ð7Þ
k k
could result in a ðs1 þ s2 Þ  m bit matrix. The ith line vector þ f ðÞða  dÞ =k!; d <  < a;
equals the ith line vector of DBF ðAÞ for 1  i  s1 , and the
ði  s1 Þth line vector of DBF ðBÞ for s1 < i  ðs1 þ s2 Þ. fðcÞ  fðbÞ ¼ fðcÞðc  bÞ þ    þ f k1 ðbÞðc  bÞk1 =ðk  1Þ!
Theorem 4. The false positive probability of DBF ðAÞ [ þ f k ðÞðc  bÞk =k!; b <  < c:
DBF ðBÞ is larger than that of DBF ðAÞ, as well as DBF ðBÞ.
ð8Þ
Proof. Assume that DBF ðAÞ [ DBF ðBÞ; DBF ðAÞ, and
DBF ðBÞ uses the same SBF with parameters m; k; c, Let a ¼ 1  xy, b ¼ ð1  xÞ  ð1  yÞ, c ¼ 1  x, and
d ¼ 1  y. Thus, b < c < a; b < d < a because of 0 < x <
and the actual sizes of dynamic set A and B are na and
1 and 0 < y < 1. If c < d, then we obtain (6) and (7)
nb , respectively. The false positive probability of according to the Taylor formula. fðzÞ ¼ zk ; 0 < z < 1, is
DBF ðAÞ [ DBF ðBÞ is: a monotonically increasing function of z, and has a
DBF
 BF
ðbna =ccþbnb =ccÞ continuous k-rank derivative. The ith derivative is a
fm;k;c;n a þnb
¼ 1  1  fm;k;c monotonically increasing function for 1 < i  k. It is
  k 
 1  1  ekðna cbna =ccÞ=m ð4Þ obvious that a  D ¼ c  b; d < c < a; b < d < a. Thus,
  k  each item of fðaÞ is larger than the corresponding item
 1  1  ekðnb cbnb =ccÞ=m : of fðcÞ, and so, (7) is larger than 0. If c > d, the result
The false positive probabilities of DBF ðAÞ and is the same. Theorem 5 is proven to be true. u
t
DBF DBF
DBF ðBÞ are fm;k;c;n a
and fm;k;c;n b
, respectively. In fact,
DBF
given the same k; m, and c, the value of (4) minus fm;k;c;n On the other hand, we used MATLAB to calculate the
a BF DBF
DBF
is larger than 0, and the value of (4) minus fm;k;c;nb is also result of fm;k;n a þnb
minus fm;k;c;n a þnb
. As shown in Fig. 3, the
larger than 0. Thus, we can easily derive that the false false positive probability of DBF ðAÞ [ DBF ðBÞ is also less
positive probability of DBF ðAÞ [ DBF ðBÞ is larger than than that of BF ðAÞ [ BF ðBÞ, even though the size of A and
that of DBF ðAÞ, as well as DBF ðBÞ. u
t B exceeds c.
126 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 1, JANUARY 2010

3.4 Evaluations of Item Deletion Algorithm TABLE 1


3.4.1 Mathematical Analysis Experimental Upper Bound and Real Value of r with c ¼ 133
As mentioned above, multiple SBFs tend to allocate
BfaddressðxÞ for an item x 2 X in a DBF. It is clear that only
one SBF ever truly represented x during its input process; the
other SBFs are false positives. The item deletion operation of
DBF always omits such items, keeping their set membership
information. The motivation is to prevent the DBF from
producing potential false negatives caused by an incorrect ranges from 0 to k  1. We use the SDBM MersenneT wister
item deletion. As a direct result of the item deletion method to generate the two random integers for any item x.
operation, queries of such items will yield false positives. Let the output of the SDBM Hash function as the seed of the
Recall that X has n items, and the DBF uses s ¼ dn=ce SBFs. random number generator (RNG) MersenneTwister. Then,
After the representation of X, we can consider the event the MersenneTwister will produce the two desired random
where a particular item x of X appears to be in multiple SBFs. integers. The SDBM hash function seems to have a good
If x was represented by one of the first s  1 SBFs during the overall distribution for many different sets. It also works
item insertion process, the probability of this event can be well in situations where there are high variations in the
calculated by: MSBs of the items in a set. The MersenneTwister is capable
 s2   of quickly producing very high-quality pseudorandom
BF BF
f1 ðnÞ ¼ 1  1  fm;k;c 1  fm;k;n l
: ð9Þ numbers. This mechanism requires one hash function and
one random number generator to run k  1 rounds of (14) in
If item x was represented by the sth SBF during the item
insertion process, this event means that at least one of the order to generate a Bloom filter address BfadddressðxÞ for
first s  1 SBFs produce a false positive judgment for x, and item x. It provides a considerable amount of processing
the probability of this event can be calculated by: reduction compared to using k actual hashes. Kirsch and
Mitzenmacher show that this method does not increase the
 s1
BF
f2 ðnÞ ¼ 1  1  fm;k;c : ð10Þ probability of false positives [29].
The multiple address problem causes some items to
It is easy to understand that the value of (10) is larger remain in a DBF, even after they have been deleted from
than that of (9). We use (10) as an estimated upper bound set X. The ratio of the number of such items to the
on the probability that x of X appears to be in multiple cardinality of set X is denoted as r. Recall that f2 ðnÞ is an
SBFs, and then, achieve an estimated upper bound n  estimated upper bound on r based on mathematical
f2 ðnÞ on the number of such items which have more than analysis. The experimental upper bound on r, and the real
one Bloom filter address. Our experimental results show value of r, are the average values achieved from 100 rounds
that the real number is less than the estimated upper of simulations with different sets in each round.
bound. If all items of X are deleted, the DBF will try to Note that dynamic Bloom filters are designed to
perform the same operation. However, the DBF cannot represent many possible sets, and there are no benchmark
guarantee deletion of all items due to its special item
sets in the field of Bloom filters. Our experiments do not
deletion operation. In reality, the DBF still holds the
seek particular sets, but simply use names of files at some
membership information of at most n  f2 ðnÞ items. If
peers in a peer-to-peer trace as the data source. The P2P
other items join X at the same time the original items are
trace is a snapshot of files that were shared by eDonkey
deleted, the DBF can reflect the membership information
peers between 9 December 2003 and 2 February 2004. They
of all items of X and at most n  f2 ðnÞ remaining items. It
recorded 11,014,603 distinct files. We initialize 10 sets with
is logical that the false positive probability of the DBF is
always larger than the theoretical value. Our experimental cardinality i  c for 1  i  10 using the trace data, and
findings show similar results, and the difference between then, implement 10 DBFs. In our experiment, the para-
the real value and theoretical value is small. In other meters of each SBF are m ¼ 1;280, k ¼ 7, and c ¼ 133. For
words, the negative impact of the item deletion operation each DBF, the number of items possessing multiple Bloom
on a DBF can be controlled at an accepted level. filter addresses from the corresponding set is determined,
and then, the experimental upper bound of r is calculated.
3.4.2 Experimental Analysis For each DBF, the item deletion algorithm mentioned above
In this section, we will first describe the implementation of is performed, and the ratio of the number of remaining
k random and independent hash functions. Then, we will items to the original cardinality of the corresponding set is
compare the analytical model to the experimental results for the real value of r. Table 1 shows the experimental results
the number of items which have multiple Bloom filter under different dynamic sets.
addresses. One critical factor in our experiments is creating Fig. 5 shows that the real value of r is less than the
a group of k hash functions. In our experiments, they will be experimental upper bound, the estimated upper bound, and
generated as: the false positive probability of a DBF. All four curves
increase as the size of the dynamic set increases. The curve
hi ðxÞ ¼ ðg1 ðxÞ þ i  g2 ðxÞÞ mod m; ð11Þ
for the real value r increases smoothly and maintains a low
where g1 ðxÞ and g2 ðxÞ are two independent and random level when the real size of the set is less than 10 times that of
integers in the universe with range f1; 2; . . . ; mg, and i the estimated threshold. On the other hand, the frequency
GUO ET AL.: THE DYNAMIC BLOOM FILTERS 127

Fig. 5. False positive probability of dynamic Bloom filters and the Fig. 7. The ratio of the number of items which have at least one Bloom
percentage of data which have multiple Bloom filter addresses, where filter address before they are represented to the threshold c is a function
m ¼ 1;280, k ¼ 7, and c ¼ 133. of s and the threshold f.

of deleting all items from a set and its corresponding DBF is


3.5.1 Improvement of Item Insertion Operation
often low, with the period usually being long.
The item insertion algorithm proposed previously could be
Let’s consider the dynamic sets with n=c ¼ 10; 8; 5. We
optimized in two scenarios: First, sometimes it seems to a new
first delete all items of those sets based on the item deletion item of a normal set that it has been represented by a DBF,
algorithm, and then, continuously add new items to the set. even if it is not. In this case, the new item is still inserted into
It is easy to understand that the remaining items can an active SBF of the DBF. This kind of item may cause the DBF
increase the false match probability of the DBF. Fig. 6 shows to allocate unnecessary SBFs. Second, duplicate insertions of
that the more remaining items there are, the greater the false an identical item in a multiset do not necessarily harm an SBF,
but they may cause a DBF to extend unnecessarily [30].
positive probability will be. In practice, the frequency of
Solving this issue requires a membership query before each
deleting all items from a set and its corresponding DBF is insert; however, doing so increases the insertion complexity
very low; the false positive probability of a DBF is often from k to Oðk  sÞ. Considering the additional computational
larger than the expected value, but still maintains a lower, costs, DBFs should only adopt the improved item insertion
more stable level. On the other hand, the real capacity of a algorithm if the previous Algorithm 1 causes at least one
DBF decreases with the number of those remaining items. unnecessary SBF.
Applications can use the change in the real capacity as a As discussed in Section 3.4, n  f2 ðnÞ denotes an
estimation of the number of items which have multiple
metric to evaluate the influence of the item deletion
Bloom filter addresses after all items are represented, and
operation and decide whether to represent the updated are very similar to the experimental results. The experi-
set again. In the future, we will seek a better method to mental results, as well as the estimation, are larger than the
solve the multiple address problem for an item. number of items which already have at least one Bloom
filter address before they are represented in the two
3.5 Optimizations of Dynamic Bloom Filters scenarios. Thus, the improved algorithm should be used
In this section, we consider cases in which applications do only if:
not require DBFs to provide the item deletion operation. In
these cases, DBFs use SBFs as a component and may be n  f2 ðnÞ
ratio ¼  1;
c
optimized from the following aspects. None of the
optimization strategies proposed below are suitable to which implies that Algorithm 1 incurs at least one
other cases in which DBFs use CBFs as a component. unnecessary SBF. The ratio is a monotonically increasing
function of s and f, which denotes an upper bound on the
false match probability of an SBF. As shown in Fig. 7, the
value of ratio is always less than one if f  0:01; s  10, or
f  0:001; s  20; hence, it is not necessary to use the
improved algorithm. Under other conditions, Algorithm 1
should be replaced by the improved algorithm since the
former causes at least one unnecessary SBF.
The above discussions fail to consider the impact of a
multiset. If the distribution of duplicate items in a multiset
is known in advance, a multiset could be treated as a
normal set where those duplicate items act as identical
items in a normal set. They can also be analyzed in the same
way. Otherwise, we recommend adopting the improved
Fig. 6. False positive probabilities of dynamic Bloom filters are functions
item insertion algorithm, which could be optimized by a
of the actual size n of a dynamic set and the remaining data, where better method of storing dynamic Bloom filters, as dis-
m ¼ 1;280, k ¼ 7, and c ¼ 133. cussed below.
128 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 1, JANUARY 2010

Fig. 8. The ratio of the size of a Bloom filter to that of a DBF is a function Fig. 9. False positive probabilities of dynamic and standard Bloom filters
of a non-negative integer, which denotes the ratio of n to c. The are functions of the actual size of a dynamic set. Standard Bloom filters
experimental condition is the same as that in Fig. 3. can expand the filter size m to dn=ce  m. m ¼ 1;280, k ¼ 7, and c ¼ 133.

3.5.2 Compressed Dynamic Bloom Filters approaches to represent static sets; however, most applica-
In some distributed applications, an SBF at a node is usually tions often encounter dynamic sets without fixed cardinality,
delivered to one or more nodes as a message. In this case, as well as static sets. According to the structure of DBFs, we
there are the three metrics of SBFs we have seen so far: 1) the know that DBFs can represent dynamic sets as well as static
computational overhead of an item query operation; 2) the sets. In this section, we will first evaluate the performances of
size of the filter in memory; and 3) the false match rate, a DBFs and SBFs in stand-alone applications with three
fourth metric can be used: the size of a message used to different sets, and then, discuss the distributed applications
transmit an SBF across the network. The compressed Bloom of DBFs.
filters might significantly save bandwidth at the cost of larger
4.1 Static Set with Fixed Cardinality
uncompressed filters and some additional computation to
A dynamic set could be regarded as a series of static sets
compress and decompress the filter sent across the network.
over a sequence of discrete time. In this section, we use SBF
In the idealized setting, using compression always reduces
the false positive probability by adopting a larger Bloom filter and DBF to represent a dynamic set, and compare them
size and fewer hash functions than an SBF uses. Interested from two aspects at any given discrete time. The SBF is
readers may obtain details concerning all of the theoretical reconstructed according to its optimal configuration, as
and practical issues of compressed Bloom filters in [18]. determined by the set cardinality.
It is reasonable to compress a DBF by using compressed For a dynamic set X, let s be dn=ce, where n and c denote
Bloom filters instead of SBFs. The compressed DBFs and the cardinality of X, and capacity of SBF used by DBF,
MDDBFs could reduce both the transmission size and false respectively. For a DBF representing the set, the formula of
positive probability of the uncompressed versions, at the cost its false positive probability is simplified as (12). For an SBF
of larger memory and additional computation overheads. representing that set, it calculates how many bits it must
consume in order to achieve the same false positive
3.5.3 Approach for Storing Dynamic Bloom Filters probability as the DBF. Finally, we establish the relationship
There are two ways of storing a dynamic Bloom filter with a between (1) and (12), and achieve (13) to denote the ratio of
set number of s SBFs. These are referred to as the bit string the number of bits m1 used by an SBF, to the number of bits
and bit slice methods, respectively, [31]. The bit string s  m used by a DBF:
approach stores the s SBFs dependently and sequentially.
Instead of storing a DBF as an s  m long bit strings, the bit
DBF
fm;k;c;n ¼ 1  ð1  ð1  ekc=m Þk Þdn=ce ; ð12Þ
slice method stores a DBF as an m  s long bit slices. We
know that only a subset of the bit positions in each SBF need m1 k  c
¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi : ð13Þ
to be examined on a query. One problem with the bit string s  m m  lnð1  k 1  ð1  yÞs Þ
approach is that all s  m bits need to be retrieved on a
query. For the bit slice approach, only a fraction of the bit The following conclusions can be drawn from (13) and
slices need to be retrieved on a query. Fig. 8. To obtain the same false match probability, SBF and
DBF use the same bits to represent X if n  c. However, the
SBF consumes fewer bits than the DBF if n > c. The
4 PERFORMANCE EVALUATIONS difference of bits used by DBF and SBF is small if s is not
We use  to denote an upper bound on the false match too large.
probability of an SBF representing a static set with fixed Let us compare the false positive probability of an SBF
cardinality n. Given  and n, the parameters k and m could be and a DBF which use the same bits to represent an identical
optimized for the SBF with m ¼ dn  logðÞ= logð0:61285Þe dynamic set. That is, an SBF is allowed to expand its size to
and k ¼ dðm=nÞ ln 2e. Given  and m, the capacity c can be s  m, and also to rerepresent the dynamic set as the set
optimized for the SBF with c ¼ dm  logð0:61285Þ= logðÞe. It cardinality grows. In this case, the standard Bloom filter is
is clear that the amount of memory allocated to an SBF defined as NBF. The false match probability of an SBF could
increases linearly with n. SBF and its variations are practical be calculated according to (1). The false match probability of
GUO ET AL.: THE DYNAMIC BLOOM FILTERS 129

DBF
a DBF should still be (3). It is necessary to compare fm;k;c;n
NBF
and fm;k;c;n under this situation, and we have:
 k
NBF
fm;k;c;n ¼ 1  ekn=ðmdn=ceÞ : ð14Þ

The experimental results are shown in Fig. 9. We


DBF NBF BF
conclude that fm;k;c;n ¼ fm;k;c;n  fm;k;c for n  c. For n >
DBF NBF
c; fm;k;c;n grows as the set cardinality increases, and fm;k;c;n
fluctuates between i  c and ði þ 1Þc, where i is any non-
negative integer. Let nx < c be any non-negative integer;
NBF NBF
thus, fm;k;c;n x þði1Þc
is not larger than fm;k;c;n x þic
. In fact,
NBF
fm;k;c;n grows as the set cardinality increases in the whole
DBF Fig. 10. The memory size of a DBF under different set cardinality
range, but the increase rate is slower than that of fm;k;c;n . distributions: a uniform distribution, a normal distribution with u ¼ dN=2e
In summary, to achieve the same false match probability, and 2 ¼ 20, and a Zipf distribution with 0.4 as parameter, where N ¼
an SBF never uses more bits than a DBF to represent any 1;330 and  ¼ 0:0098.
static version of that dynamic set if the SBF is actively
reconstructed as the increase of set cardinality. It is clear Let pi represent the probability that X has i items where
that the SBF produces large, even huge, overhead due to P
1  i  N, i.e., N i¼1 pi ¼ 1. We associate the ith SBF of the
frequent reconstructions. On the contrary, a DBF is not
DBF with a ri , which implies an upper bound on the
required to be reconstructed if its false match probability is
probability that this SBF is used. Let r1 ¼ 1 and ri ¼
controlled at an acceptable level with the increase of set PN
cardinality. Note that the false match probability of a DBF j¼cði1Þþ1 pj for i ¼ 2; 3; . . . ; s, where c denotes the
might sometimes become too large to be tolerated by many capacity of any SBF. The expected number of bits used by
P
applications. It is necessary to occasionally reconstruct the the DBF is upper bounded by si¼1 m  ri . Recall that the
DBF. In the next section, we will compare DBFs and SBFs in bits used by each SBF is m ¼ dðN=sÞ  logðÞ= logð0:6185ÞÞe.
P
the whole lifetime of a dynamic set instead of comparing We formulate the problem to minimize si¼1 ri  ðN=sÞ 
them in a series of discrete time. logðÞ= logð0:6185Þ:
4.2 Dynamic Set with an Upper Bound on Set X
s
Cardinality Minimize ri  ðN=sÞ  logðÞ= logð0:6185Þ;
i¼1
In this section, we use SBF, as well as DBF to represent a
dynamic set X with an upper bound N on set cardinality. Let Subject to  ¼ 1  ð1  Þ1=s ; s > 0:
 denote the upper bound on false match probability of the
SBF, as well as DBF. In many applications, the distribution of The optimized value of parameter s could be derived
set cardinality covers a large range [32], [33]. In such a from the solution of this optimization problem. Once we
distribution, the upper bound is sometimes several orders of have s, it can be used to calculate the false match probability
magnitude larger than the mean or minimum cardinality.  and capacity c for these homogeneous SBFs, and then,
Applications usually allocate a large number of bits for an determine the parameters m and k according to the design
SBF at the outset with m ¼ dN  logðÞ=logð0:61285Þe. These method of an SBF. The set cardinality distribution has a
bits are large enough for the SBF to accommodate all possible direct impact on the result of this optimization problem. We
items of X, while decreasing the space efficiency of the SBF. compare the minimized memory size of a DBF under five
DBF, however, allocates enough bits in an incremental and different cardinality distributions through experiments:
on-demand fashion. A common objective of SBF and DBF is normal distribution, uniform distribution, random Zipf
to guarantee that the false match probability never exceeds distribution, minimum Zipf distribution, and maximum
the , and to make sure that they are not required to be Zipf distribution.
reconstructed as the set cardinality changes. The first two distributions have been widely used for
We now address the problem of designing a minimum generating synthetic sets to emulate real sets [34]. In many
size DBF when the probability density function of the set networking applications, it is observed that the set cardin-
cardinality is known. We assume that s homogeneous SBFs ality at each node conforms to a Zipf distribution with a
make up the DBF, and  denotes the upper bound on the long tail. There is a bijective mapping from the cardinality
false match probability of each SBF. The overall false match values to the rank values. The Zipf distribution is called
probability  for the DBF has to be apportioned among the random if any rank value is mapped to a random integer
individual SBFs. According to (3), we know that over a range f1; . . . ; Ng, uniformly. It is called minimum if
 ¼ 1  ð1  Þs . As a result, we can derive the value of the largest and second largest cardinality values are
parameter  with  ¼ 1  ð1  Þ1=s . mapped to the last and second to last ranks, respectively,
The capacity of the DBF is N since at most N items of the and so on. It is called maximum if the largest and second
set are accommodated by it. Since the N items are allocated largest cardinality values are mapped to the first and
to a certain number of s SBFs evenly, the capacity of the second ranks, respectively, and so on.
ith SBF is defined as ci ¼ dN=se for 1  i  s. In order to As shown in Fig. 10, all five curves follow a similar trend
solve this problem, we must determine the parameters m, k, as s increases under the constraints that N ¼ 1;330 and
, and s such that the false match probability DBF never  ¼ 0:0098. The expected memory size of each curve first
exceeds the upper bound . decreases as s grows, and then, increases after the s exceeds
130 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 1, JANUARY 2010

Fig. 11. Memory size under different values of set cardinality. Fig. 13. Memory size under different false match probabilities.

Fig. 12. Ratio of memory size of DBF to that of SBF under different set Fig. 14. Ratio of memory size of DBF to that of SBF under different false
cardinality distributions. match probabilities.

one or more keen points on the whole. It is clear that the


expected memory size under the maximum Zipf distribu-
tion is always larger than that under other distributions of
the same value of s. The reason is that the ri under the
maximum Zipf distribution is greater than that under other
distributions for 2  i  s and any given value of s. For
each curve of DBF, the minimum memory size is achieved
when the value of s is equal to a keen point on the curve,
and is less than that of an SBF under the same constraints.
We then evaluate the impact of set cardinality on the
minimum memory size of the SBF and the five different
DBFs where  ¼ 0:0098. Fig. 11 shows that a DBF with Fig. 15. Impact of Zipf parameter on minimum memory size.
random Zipf distribution uses almost the same amount of
memory as a DBF with uniform distribution, while DBFs form SBFs independent of false match probability. In
with maximum and minimum Zipf distributions consume experiments, we also focus on the influence of the false
the most and least memory among the five DBFs, respec- match probability on the memory size of DBF, and the ratio
tively. All DBFs, however, consistently outperform SBFs of memory size of DBF to that of SBF. As shown in Fig. 13,
independent of set cardinality, and the performance for each DBF, the memory size decreases as the false match
difference seems to widen as set cardinality increases. In probability increases. As shown in Fig. 14, for each DBF, the
experiments, we also focus on the influence of set cardin- ratio increases as the false match probability increases;
ality on the ratio of memory size of DBF to that of SBF. As however, it will always be less than 1 when the false match
shown in Fig. 12, DBFs with maximum Zipf distribution, probability is not larger than 5 percent. Actually, the ratio
random Zipf distribution, normal distribution, and mini- for each DBF will reach, at most, 1 when the false match
mum Zipf distribution can save about 5, 19, 20, and 35 probability exceeds 5 percent. That is, the five DBFs never
percent of the memory used by SBF, respectively. The consume more memory than SBF, and save more memory
experimental results show that the set cardinality has a as the false match probability decreases.
trivial impact on the ratio of the memory size of DBF to that
Fig. 15 shows the impact of the Zipf parameter on
of SBF, while the cardinality distributions have a major
memory size under different Zipf distributions,2 where N ¼
impact on that metric.
1;330 and  ¼ 0:0098. We observe that DBFs with a
Moreover, we evaluate the impact of false match
probability on the minimum memory size of SBF and the minimum Zipf distribution perform better as the parameter
five different DBFs, where N ¼ 13;300. Fig. 11 shows that value increases, and DBFs with a random Zipf distribution
DBFs with maximum and minimum Zipf distributions
2. A large Zipf parameter means that the frequencies of some cardinality
consume the most and least memory among the five DBFs, values are much higher than others. A small Zipf parameter means that the
respectively. The five DBFs, however, consistently outper- frequency of each cardinality value occurs just as often.
GUO ET AL.: THE DYNAMIC BLOOM FILTERS 131

cardinality gradually exceeds c. The false match probability


exceeds  once the set cardinality is greater than
n0  dlogð1  Þ= logð1  Þe. To deal with this issue, c is
reassigned a new value, which is at least greater than
n0  dlogð1  Þ= logð1  Þe, and the DBF is reconstructed
to represent the dynamic set again. This paper does not
discuss policies to enlarge n0 in detail for similar reasons. If
the false match probability of a new DBF exceeds  again,
the DBF must be adjusted in the same way.
In reality, it is unavoidable to reconstruct DBFs and SBFs
under specific conditions if the upper bound on set
cardinality is not known a priori. Fortunately, the adjust-
Fig. 16. Impact of standard deviation  on minimum memory size. ment frequency of DBF is lower than that of SBF, especially
when the difference between  and  is large; that is, DBF
outperform SBFs almost independent of the Zipf parameter causes less overhead and is more stable than SBF due to
value. If the Zipf parameter is less than 0.6, DBFs with a infrequent reconstructions as the increase of set cardinality.
maximum Zipf distribution also outperform SBFs; other- Note that if the difference between  and  is very low, the
wise, they perform worse than SBFs. Fig. 16 shows the benefit of our approach using left and right bounds on the
impact of standard deviation  on the memory size of DBFs false match probability becomes trivial. To address this rare
with normal distribution. We observe that DBFs with case, we use the approaches proposed in Section 4.2 after
normal distribution use more memory as the value of  estimating the cardinality distribution and initializing N ¼
increases; however, they always outperform SBFs. n0 and  ¼ .
4.3 Dynamic Set without an Upper Bound on Set 4.4 Distributed Application Scenarios
Cardinality
In the above discussions, we considered SBF and DBF as
In this section, we consider another scenario in which objects residing in memory in stand-alone applications. In
applications do not know the upper bound N in advance. Let distributed applications, however, they are not just objects
 and  denote the left and right upper bounds on the false that reside in memory, but objects that must be transferred
match probability with  < . In this scenario, applications between nodes. In this case, all nodes are required to adopt
use a pair of  and  instead of a tight upper bound . In the same configuration of m; k, and hash functions in order
reality, applications expect that the false match probability is to guarantee compatibility and interoperability of SBF or
less than , and also tolerate an event that the false match DBF between any pair of nodes.
probability is sometimes greater than , but less than . A We first consider the case in which the upper bounds N
common objective of DBF and SBF is to guarantee that the and  on the set size and false match probability over
false match probability never exceeds  as the set cardinality nodes, depending on the applications, are known a priori.
changes. In order to satisfy this objective, the DBF and the Nodes can construct a local, but homogeneous SBF with
SBF may be reconstructed as the cardinality changes. m ¼ dN  logðÞ= logð0:61285Þe and k ¼ dðm=NÞ ln 2e even
For a dynamic set, applications usually estimate a if these sets are different in size. This approach requires the
threshold n0 of the set cardinality and use an initial SBF to nodes with small sets to sacrifice more space to be in
represent the dynamic set. In practice, it is very difficult to accordance with those nodes with the large sets, hence
estimate n0 in an accurate manner. One possible approach hurting the space efficiency and causing large transmission
is to trace the change of a dynamic data set, and then, to overhead. As discussed in Section 4.2, DBF can address this
investigate the statistic metric of set size before using DBF drawback of SBF if the distribution of set sizes over nodes is
to represent that data set. The parameters m and k for an known by the relevant application. The reason is that each
SBF are initialized with m ¼ dn0  logðÞ= logð0:61285Þe node allocates just enough memory to a DBF according to
and k ¼ dðm=n0 Þ ln 2e. As shown in Fig. 1, the false match its set size, and can satisfy the requirement of compatibility
probability exceeds  dramatically when the set cardin- and interoperability of DBF with other nodes. Although the
ality exceeds n0 gradually. Moreover, the false match approaches proposed in Section 4.2 focus on stand-alone
probability exceeds  once the set cardinality is greater applications, they are also suitable to distributed applica-
than n0  logð  Þ. To handle this issue, n0 is assigned a tions. The only difference is that the expected number of
new value, which is at least greater than n0  logð  Þ, bits used by a DBF is minimized in stand-alone applica-
and then, the initial SBF is reconstructed using new tions, while the total number of bits used by DBFs at all
parameters to represent the dynamic set again. There exist nodes is minimized in distributed applications. For more
many policies to enlarge the value of n0 . This paper does information, we refer the reader to Section 4.2.
not discuss this in detail since different policies have less We then consider the case in which the upper bound
impact on the final results. If the false match probability on set sizes over nodes is not known in advance. In this
exceeds  again, the SBF is adjusted in the same way. scenario, as discussed in Section 4.3, applications impose
DBF first adopts an SBF to represent the dynamic set,  and  ( < ) as a pair of upper bounds on the false
and may expand its capacity by allocating more SBFs as the match probability over nodes, and estimate a threshold n0
set cardinality increases. As shown in Fig. 1, the false match on the upper bound on set sizes over nodes. If
probability of DBF increases slower than SBF when the set applications use SBFs to represent sets over nodes, an
132 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 1, JANUARY 2010

event where the size of the set at any node exceeds n0  dynamic Bloom filters do not outperform Bloom filters in
logð  Þ will trigger a reconfiguration of its SBF, thereby terms of false match probability when dealing with a static
propagating a new configuration to other nodes and set with the same size memory.
reconstructing an SBF at each node. It is clear that
frequent reconfigurations lead to huge overhead and
destroy the stability of applications. One possible solution ACKNOWLEDGMENTS
to this problem is to overestimate n0 and allocate a larger The authors would like to thank Maria Basta and Christian
SBF at each node. This solution, however, hurts the space
Esteve for their constructive comments and careful proof-
efficiency of SBF, and causes large transmission overhead.
If applications use DBFs instead of SBFs, all nodes reading. Deke Guo and Ye Yuan’s research is supported in
reconfigure their DBFs only if the set size at any node part by NSF China under Grant Nos. 60903206, 60873011,
exceeds n0  dlogð1  Þ= logð1  Þe. Note that the node 60873089, and the National Basic Research Program of
just expands its DBF without performing the consistency China (973 Program) under Grant No. 2006CB303103. The
operation over all nodes if its set size is greater than work of Jie Wu is supported in part by US National Science
n0  logð  Þ, but less than n0  dlogð1  Þ= logð1  Þe.
Foundation under Grants CNS 0422762, CNS 0434533, and
It is clear that the adjustment frequency of DBF is lower
than that of SBF, especially when the difference between CNS 0626240.
 and  is large. DBFs are more stable than SBFs with the
increase of set cardinality in this case. For more
information, we refer the reader to Section 4.3. REFERENCES
After discussing the use of approaches of DBF, we [1] B. Bloom, “Space/Time Tradeoffs in Hash Coding with Allowable
consider a necessary procedure to update a DBF in Errors,” Comm. ACM, vol. 13, no. 7, pp. 422-426, 1970.
distributed applications. For each DBF, we adopt an [2] J.K. Mullin, “Optimal Semijoins for Distributed Database Sys-
tems,” IEEE Trans. Software Eng., vol. 16, no. 5, pp. 558-560, May
incremental update procedure by only sending those SBFs 1990.
which are changed. For each varied SBF, the procedure to [3] L.F. Mackert and G.M. Lohman, “R Optimizer Validation and
send updates first inspects if an old version of the SBF exists Performance Evaluation for Distributed Queries,” Proc. 12th Int’l
in the previous DBF. If not, this must be an added SBF, and Conf. Very Large Data Bases (VLDB), pp. 149-159, Aug. 1986.
[4] A. Broder and M. Mitzenmacher, “Network Applications of
the update is simply the SBF itself. Otherwise, an update is Bloom Filters: A Survey,” Internet Math., vol. 1, no. 4, pp. 485-509,
sent by computing the xor of the current version with the 2005.
previous version. All updates can be compressed using [5] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, and D.
arithmetic coding before being sent reliably. At the other Geels, “Oceanstore: An Architecture for Global-Scale Persistent
end, the procedure to receive each updated SBF first Storage,” ACM SIGPLAN Notices, vol. 35, no. 11, pp. 190-201, 2000.
[6] J. Li, J. Taylor, L. Serban, and M. Seltzer, “Self-Organization in
inspects if a previous SBF exists. If not, this must be an Peer-to-Peer System,” Proc. ACM SIGOPS, Sept. 2002.
added new SBF, and the update is simply stored in the [7] F.M. Cuena-Acuna, C. Peery, R.P. Martin, and T.D. Nguyen,
corresponding DBF. Otherwise, the updated SBF is treated “PlantP: Using Gossiping to Build Content Addressable Peer-to-
as an incremental one, and its previous SBF is modified Peer Information Sharing Communities,” Proc. 12th IEEE Int’l
Symp. High Performance Distributed Computing, pp. 236-249, June
suitably by computing its bitwise xor with the new update. 2003.
[8] S.C. Rhea and J. Kubiatowicz, “Probabilistic Location and
Routing,” Proc. IEEE INFOCOM, pp. 1248-1257, June 2004.
5 CONCLUSION [9] T.D. Hodes, S.E. Czerwinski, and B.Y. Zhao, “An Architecture for
Secure Wide Area Service Discovery,” Wireless Networks, vol. 8,
A Bloom filter is an excellent data structure for succinctly
nos. 2/3, pp. 213-230, 2002.
representing static sets with fixed cardinality in order to [10] P. Reynolds and A. Vahdat, “Efficient Peer-to-Peer Keyword
support membership queries. However, it does not take Searching,” Proc. ACM Int’l Middleware Conf., pp. 21-40, June 2003.
dynamic sets into account. In reality, most applications [11] D. Bauer, P. Hurley, R. Pletka, and M. Waldvogel, “Bringing
often encounter dynamic data sets, as well as static sets. We Efficient Advanced Queries to Distributed Hash Tables,” Proc.
IEEE Conf. Local Computer Networks, pp. 6-14, Nov. 2004.
present dynamic Bloom filters to deal with dynamic sets, as [12] L. Fan, P. Cao, J. Almeida, and A. Broder, “Summary Cache: A
well as static sets. Dynamic Bloom filters not only inherit Scalable Wide Area Web Cache Sharing Protocol,” IEEE/ACM
the advantage of Bloom filters, but also have better features Trans. Networking, vol. 8, no. 3, pp. 281-293, June 2000.
than Bloom filters when dealing with dynamic sets. The [13] C.D. Peter and M. Panagiotis, “Bloom Filters in Probabilistic
Verification,” Proc. Fifth Int’l Conf. Formal Methods in Computer-
false match probability of Bloom filters increases exponen- Aided Design, pp. 367-381, Nov. 2004.
tially with the increase of the cardinality of a dynamic set, [14] C. Jin, W. Qian, and A. Zhou, “Analysis and Management of
while that of dynamic Bloom filters increases slowly Streaming Data: A Survey,” J. Software, vol. 15, no. 8, pp. 1172-
because it expands capacity in an incremental manner 1181, 2004.
according to the set cardinality. [15] F. Deng and D. Rafiei, “Approximately Detecting Duplicates for
Streaming Data Using Stable Bloom Filters,” Proc. 25th ACM
Through comprehensive mathematical analysis, we SIGMOD, pp. 25-36, June 2006.
prove that dynamic Bloom filters use less expected memory [16] F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G.
than Bloom filters when dealing with dynamic sets with Varghese, “Beyond Bloom Filters: From Approximate Member-
ship Checks to Approximate State Machines,” Proc. ACM
upper bounds on set cardinality, and that dynamic Bloom SIGCOMM, pp. 315-326, Sept. 2006.
filters are more stable than Bloom filters due to infrequent [17] K. Li and Z. Zhong, “Fast Statistical Spam Filter by Approximate
reconstruction when addressing dynamic sets without Classifications,” Proc. Joint Int’l Conf. Measurement and Modeling of
upper bounds on set cardinality. Moreover, the analytical Computer Systems, SIGMETRICS/Performance, pp. 347-358, June
2006.
results hold in stand-alone applications as well as dis- [18] M. Mitzenmacher, “Compressed Bloom Filters,” IEEE/ACM Trans.
tributed applications. The only disadvantage is that Networking, vol. 10, no. 5, pp. 604-612, 2002.
GUO ET AL.: THE DYNAMIC BLOOM FILTERS 133

[19] A. Kirsch and M. Mitzenmacher, “Distance-Sensitive Bloom Jie Wu is the chair of and a professor in the
Filters,” Proc. Eighth Workshop Algorithm Eng. and Experiments Department of Computer and Information
(ALENEX ’06), Jan. 2006. Sciences, Temple University. Prior to joining
[20] A. Kirsch and M. Mitzenmacher, “Building a Better Bloom Filter,” Temple University, he was a program director at
Technical Report tr-02-05.pdf, Dept. of Computer Science, the US National Science Foundation. His re-
Harvard Univ., Jan. 2006. search interests include wireless networks and
[21] A. Kumar, J. Xu, J. Wang, O. Spatschek, and L. Li, “Space-Code mobile computing, routing protocols, fault-toler-
Bloom Filter for Efficient Per-Flow Traffic Measurement,” Proc. ant computing, and interconnection networks.
23rd IEEE INFOCOM, pp. 1762-1773, Mar. 2004. He has published more than 450 papers in
[22] S. Cohen and Y. Matias, “Spectral Bloom Filters,” Proc. 22nd ACM various journals and conference proceedings.
SIGMOD, pp. 241-252, June 2003. He serves on the editorial board of the IEEE Transactions on Mobile
[23] R.P. Laufer, P.B. Velloso, and O.C.M.B. Duarte, “Generalized Computing. Dr. Wu was also general cochair for IEEE MASS 2006,
Bloom Filters,” Technical Report Research Report GTA-05-43, IEEE IPDPS 2008, and DCOSS 2009. He has served as an IEEE
Univ. of California, Los Angeles (UCLA), Sept. 2005. Computer Society distinguished visitor and is the chairman of the IEEE
[24] B. Chazelle, J. Kilian, R. Rubinfeld, and A. Tal, “The Bloomier Technical Committee on Distributed Processing (TCDP). He is a fellow
Filter: An Efficient Data Structure for Static Support Lookup of the IEEE.
Tables,” Proc. Fifth Ann. ACM-SIAM Symp. Discrete Algorithms
(SODA), pp. 30-39, Jan. 2004.
[25] F. Hao, M. Kodialam, and T.V. Lakshman, “Building High
Honghui Chen received the MS degree in
Accuracy Bloom Filters Using Partitioned Hashing,” Proc.
operational research and the PhD degree in
SIGMETRICS/Performance, pp. 277-287, June 2007.
management science and engineering from the
[26] D. Guo, J. Wu, H. Chen, and X. Luo, “Theory and Network
National University of Defense Technology,
Applications of Dynamic Bloom Filters,” Proc. 25th IEEE
Changsha, China, in 1994 and 2007, respec-
INFOCOM, Apr. 2006.
tively. Currently, he is a professor of Information
[27] M. Xiao, Y. Dai, and X. Li, “Split Bloom Filters,” Chinese J.
System and Management, National University of
Electronic, vol. 32, no. 2, pp. 241-245, 2004.
Defense Technology, Changsha, China. His
[28] P.S. Almeida, C. Baquero, N.M. Preguiça, and D. Hutchison,
research interests are requirement engineering,
“Scalable Bloom Filters,” Information Processing Letters, vol. 101,
Web services, information grid, and peer-to-peer networks. He has
no. 6, pp. 255-261, 2007.
coauthored three books in the past five years.
[29] A. Kirsch and M. Mitzenmacher, “Less Hashing, Same Perfor-
mance: Building a Better Bloom Filter,” Proc. 14th Ann. European
Symp. Algorithms (ESA ’06), pp. 456-467, Sept. 2006.
[30] J. Wang, M. Xiao, J. Jiang, and B. Min, “I-DBF: An Improved Ye Yuan received the BS and MS degrees in
Bloom Filter Representation Method on Dynamic Set,” Proc. Fifth computer science from Northeastern University,
Int’l Conf. Grid and Cooperative Computing Workshops, pp. 156-162, China, in 2004 and 2007, respectively. He is
Sept. 2006. currently working toward the PhD degree in the
[31] A. Kent and R.S. Davis, “A Signature File Scheme Based on Department of Computer Science, Northeastern
Multiple Organizations for Indexing Very Large Text Databases,” University. He is also a visiting scholar in the
J. Am. Soc. for Information Science, vol. 41, no. 7, pp. 508-534, 1990. Department of Computer Science and Engineer-
[32] M. Faloutsos, C. Faloutsos, and P. Faloutsos, “On Power-Law ing, Hong Kong University of Science and
Relationships of the Internet Topology,” Proc. ACM SIGCOMM, Technology. His current research focuses on
pp. 251-262, Aug. 1999. peer-to-peer computing, graph databases, uncertain database, prob-
[33] F. Hao, M. Kodialam, and T.V. Lakshman, “Incremental Bloom abilistic database, and wireless sensor networks.
Filters,” Proc. IEEE INFOCOM, 2008.
[34] S. Melnik and H.C. Molina, “Adaptive Algorithms for Set
Containment Joins,” ACM Trans. Database Systems, vol. 28,
Xueshan Luo received the BE degree in
pp. 56-99, 2003.
information engineering from Huazhong Institute
of Technology, Wuhan, China, in 1985, and the
Deke Guo received the BE degree in industry MS and PhD degrees in system engineering
engineering from Beijing University of Aeronautic from the National University of Defense Tech-
and Astronautic, China, in 2001, and the PhD nology, Changsha, China, in 1988 and 1992,
degree in management science and engineering respectively. He was a faculty member and an
from the National University of Defense Tech- associate professor at the National University of
nology, Changsha, China, in 2008. He was a Defense Technology from 1992 to 1994 and
visiting scholar in the Department of Computer from 1995 to 1998, respectively. Currently, he is a professor of
Science and Engineering, Hong Kong University Information System and Management, National University of Defense
of Science and Technology, from January 2007 Technology. His research interests are in the general areas of
to January 2009. Currently, he is an assistant information system and operation research. His current research
professor of Information System and Management, National University focuses on architecture of information system.
of Defense Technology, Changsha, China. His current research focuses
on peer-to-peer networks, Bloom filters, MIMO, data center networking,
wireless sensor networks, and wireless mesh networks. He is a member
of the ACM and the IEEE. . For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/publications/dlib.

You might also like