
Rosetta: A Robust Space-Time Optimized Range Filter for Key-Value Stores

Siqiang Luo, Subarna Chatterjee, Rafael Ketsetsidis, Niv Dayan, Wilson Qin, Stratos Idreos
Harvard University
ABSTRACT

We introduce Rosetta, a probabilistic range filter designed specifically for LSM-tree based key-value stores. The core intuition is that we can sacrifice filter probe time because it is not visible in end-to-end key-value store performance, which in turn allows us to significantly reduce the filter false positive rate for every level of the tree.

Rosetta indexes all binary prefixes of a key using a hierarchically arranged set of Bloom filters. It then converts each range query into multiple probes, one for each non-overlapping binary prefix. Rosetta has the ability to track workload patterns and adopt a beneficial tuning for each individual LSM-tree run by adjusting the number of Bloom filters it uses and how memory is spread among them to optimize the FPR/CPU cost balance.

We show how to integrate Rosetta in a full system, RocksDB, and we demonstrate that it brings as much as a 40x improvement compared to default RocksDB and 2-5x improvement compared to state-of-the-art range filters in a variety of workloads and across different levels of the memory hierarchy (memory, SSD, hard disk). We also show that, unlike state-of-the-art filters, Rosetta brings a net benefit in RocksDB's overall performance, i.e., it improves range queries without losing any performance for point queries.

[Figure 1: Rosetta brings new performance properties improving on and complementing existing designs. The figure plots range size (length in domain, 0-64) against memory (bits/key, roughly 10-26) and marks the regions where each design is best: a region where the memory cost is too high to use a filter, an LSM-tree KV-store with Rosetta (short ranges), an LSM-tree KV-store with SuRF (medium ranges), and an analytical system such as a column store (long ranges).]

ACM Reference Format:
Siqiang Luo, Subarna Chatterjee, Rafael Ketsetsidis, Niv Dayan, Wilson Qin, Stratos Idreos. 2020. Rosetta: A Robust Space-Time Optimized Range Filter for Key-Value Stores. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD'20), June 14-19, 2020, Portland, OR, USA. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3318464.3389731

1 INTRODUCTION

Log Structured Merge (LSM)-Based Key-Value Stores. LSM-based key-value stores in industry [2, 4, 5, 15, 30, 36, 43, 53, 55, 71, 73] increasingly serve as the backbone of applications across the areas of social media [6, 12], stream and log processing [14, 16], file structures [47, 65], flash memory firmware [26, 69], and databases for geo-spatial coordinates [3], time-series [50, 51] and graphs [32, 52]. The core data structure, the LSM-tree [58], allows systems to target write-optimization while also providing very good point read performance when it is coupled with Bloom filters. Numerous proposals for variations on the LSM-tree seek to strike different balances for the intrinsic read-write trade-off [7], demonstrated in a multitude of academic systems [8, 11, 24, 25, 27, 28, 42, 45-47, 54, 56, 61, 62, 64, 67, 72].

Filtering on LSM-trees. As LSM-based key-value stores need to support fast writes, the ingested data is inserted into an in-memory buffer. When the buffer grows beyond a predetermined threshold, it is flushed to disk and periodically merged with the old data. This process repeats recursively, effectively creating a multi-level tree structure where every subsequent level contains older data. Every level may contain one or more runs (merged data) depending on how often merges happen. The size ratio (i.e., how much bigger every level is compared to the previous one) defines how deep and wide the tree structure grows [48], and also affects the frequency of merging new data with old data. As the capacity of levels increases exponentially, the number of levels is logarithmic with respect to the number of times the buffer has been flushed.
To support efficient point reads, LSM-trees use in-memory Bloom filters to determine key membership within each persistent run. Each disk page of every run is covered by fence pointers in memory (with min-max information). Collectively, Bloom filters and fence pointers help reduce the cost of point queries to at most one I/O per run by sacrificing some memory and CPU cost [45].

Range Queries Are Hard. Range queries are increasingly important to modern applications, as social web application conversations [18], distributed key-value storage replication [63], statistics aggregation for time series workloads [49], and even SQL table accesses as tablename-prefixed key requests [17, 35] are all use cases that derive richer functionality from building atop of key-value range queries. While LSM-based key-value stores support efficient writes and point queries, they suffer with range queries [25, 27, 28, 34, 57]. This is because we cannot rule out reading any data blocks of the target key range across all levels of the tree. Range queries can be long or short based on selectivity. The I/O cost of a long range query emanates mainly from accessing the last level of the tree because this level is exponentially larger than the rest, whereas the I/O cost of short range queries is (almost) equally distributed across all levels.

State of the Art. Fence pointers can rule out disk blocks with contents that fall outside the target key range. However, once the qualifying data blocks are narrowed down, the queried range within a data block may be empty, a concern especially for short range queries. To improve range queries, modern key-value stores utilize Prefix Bloom filters [33] that hash prefixes of a predefined length of each key within Bloom filters [36]. Those prefixes can be used to rule out large ranges of data that are empty. While this is a major step forward, a crucial restriction is that this works only for range queries that can be expressed as fixed-prefix queries. The state of the art is the Succinct Range Filter (SuRF) [74], which can filter arbitrary range queries by utilizing a compact trie-like data structure. SuRF encodes a trie of n nodes with maximum fanout of 256. The trie is culled at a certain prefix length. The basic version of SuRF stores minimum-length prefixes such that all keys can be uniquely represented and identified. Other SuRF variants store additional information such as hash bits of the keys (SuRF-Hash) or extra bits of the key suffixes (SuRF-Real).

Problem 1: Short and Medium Range Queries are Significantly Sub-Optimal. Although trie-culling in SuRF is effective for long range queries, it is not very effective on short ranges as (i) short range queries have a high probability of being empty and (ii) the culled prefixes may not help as the keys differ mostly by the least significant bits. Similarly, fence pointers or Prefix Bloom filters cannot help with short range queries unless the data is extremely sparse. On the other hand, if a workload is dominated by (very) long range queries, then a write-optimized multi-level key-value store is far from optimal regardless of the filter used: instead, a scan-optimized columnar-like system is appropriate [1].

Problem 2: Lack of Support for Workloads With Key-Query Correlation or Skew. Unlike hash-based filters such as Bloom and Cuckoo filters [10, 37], both Prefix Bloom filters and SuRF are dependent on the distribution of keys and queries, and as such, are affected by certain properties of the distribution. (SuRF-Hash uses hashes to improve point queries, not for range filtering.) Specifically, a major source of false positives in these filters is when a queried key does not exist but shares the prefix with an existing one. Such workloads are common in applications that care about the local properties of data. For example, a common use case of an e-commerce application is to find the next order id corresponding to a given order id. This translates to querying the lexicographically smallest key that is greater than the current key. Other examples include time-series applications that often compare events occurring within a short time range for temporal analysis, or geo-spatial applications that deal with points-of-interest data which are queried to search a local neighborhood. In fact, the RocksDB API [36] accesses keys through an iterator that steps through the keys in lexicographical order. Similarly, existing filters do not take advantage of skew, for instance when key distributions are sparse and target ranges contain empty gaps in the key range.

Overall Problem: Sub-Optimal Performance for Diverse Workloads and Point Queries. Overall, existing filters in key-value stores suffer in FPR performance due to variation of one or more of the following parameters: range sizes, key distributions, query distributions, and key types. Another critical problem is that neither indexing prefixes with Prefix Bloom filters nor using SuRF (even SuRF-Hash) can help (much) with point queries. In fact, point query performance degrades significantly with these filters. Thus, for workloads that contain both point queries and range queries, an LSM-tree based key-value store with such filters needs to either maintain a separate Bloom filter per run to index full keys or suffer a high false positive rate for point queries.

Our Intuition: We Can Afford to Give Up CPU. All state-of-the-art filters, such as Bloom filters, Quotient filters [9], and SuRF, trade among the following to gain in end-to-end performance in terms of data access costs.

(1) Memory cost for storing the filter
(2) Probe cost, i.e., the time needed to query the filter
(3) False positive rate, which reflects how many empty queries are not detected by the filter, thus leading to unnecessary storage I/O

These tradeoffs make sense because an evolving hardware trend is that the contemporary speed of CPU processing facilitates fast in-memory computations in a matter of tens of
nanoseconds, whereas even the fastest SSD disks take tens of microseconds to fetch data. In this paper, we use an additional insight which we also demonstrate experimentally later on. When a range filter is properly integrated into a full system, the probe cost of the filter is, in fact, a small portion of the overall key-value store system CPU cost (less than 5%). This is because of additional CPU overhead emanating from range queries traversing iterators (requiring a series of fence pointer comparisons) for each relevant run, and resulting deserialization costs for maintaining the system block cache for both metadata (filters and fence pointers) and data. Thus, since we target specifically LSM-tree based key-value stores, we have even more leeway to optimize the range filter design with respect to both the probe cost (CPU) we can afford and the FPR we can achieve.

Rosetta. We introduce a new filter, termed Rosetta (A Robust Space-Time Optimized Range Filter for Key-Value Stores), which allows for efficient and robust range and point queries in workloads where state-of-the-art filters suffer. Rosetta is designed with the CPU/FPR tradeoff from the ground up and inherently sacrifices filter probe cost to improve on FPR. The core of the new design is in translating range queries into prefix queries, which can then be turned into point queries. Rosetta stores all prefixes for each key in a series of Bloom filters organized at different levels. For every key, each prefix is hashed into the Bloom filter belonging to the same level as that of the prefix length. For example, when the key 6 (which corresponds to 0110) is inserted to Rosetta, all possible prefixes ∅, 0, 01, 011, and 0110 are inserted to the first, second, third, fourth, and fifth Bloom filter respectively. A range query of size R (i.e., with the form [A, A+R−1]) is decomposed into at most 2 log2(R) dyadic ranges (i.e., ranges of size 2^r that share the same binary prefix for some length r). In essence, this design builds a series of implicit Segment Trees on top of the Bloom filters. Due to the multiple probes required (for every prefix), Rosetta has a high probe cost. Rosetta also adaptively auto-tunes to sacrifice as much probe cost as possible in favor of FPR by monitoring workload patterns and deciding the optimal bit allocation across Bloom filters (by flattening the implicit Segment Tree when possible).
Our contributions are as follows:

• We introduce a new range filter design for key-value stores which utilizes auxiliary CPU resources to achieve a low false positive rate. The core design is about storing all prefixes of keys in a series of Bloom filters organized in an implicit Segment Tree structure.
• We show that Rosetta approaches the optimal memory lower bound for any false positive rate within a constant factor.
• We show that Rosetta can be configured in versatile ways to balance the trade-off between filter probe cost, memory cost, and false positive rate. This is done by controlling the number of Bloom filters being used as well as the memory used at each Bloom filter.
• We show how to integrate Rosetta in a full system, RocksDB, and the effect of several design factors that affect full system performance, including the number of iterators per query, deserialization costs, seek cost, and SST file size.
• We demonstrate that compared to RocksDB (with fence pointers and Prefix Bloom filters), Rosetta brings an up to 40x improvement for short and medium range queries, while compared to RocksDB with SuRF, Rosetta improves by a factor of 2-5x. These improvements are observed across diverse workloads (uniform, correlated, skewed) and across different memory hierarchy layers (memory, SSD, hard disk).
• As components of end-to-end performance, we show the fundamental tradeoffs between filter construction latency, filter probe latency, and disk access latency for a given memory budget. We also relate these costs to the CPU and I/O of the whole key-value store and demonstrate that by utilizing more CPU in accessing filters, Rosetta significantly improves the overall key-value store performance compared to state-of-the-art filters.
• We also demonstrate that Rosetta brings a net benefit in RocksDB by not hurting point query performance. Rosetta can process worst-case point queries as well as point-query optimized RocksDB or better (while Prefix Bloom filters and SuRF incur a 60% and 40% overhead for point queries, respectively).
• We also show the holistic positioning of Rosetta compared to existing methods so applications can decide when to use which filter; Figure 1 shows this summary. Given the same memory budget, Rosetta is robust across workloads and beneficial for short range queries. SuRF is beneficial for medium range queries, while once the query range gets bigger a key-value store is not beneficial anymore.

2 ROSETTA

Rosetta in an LSM-tree KV-Store. A Rosetta instance is created for every immutable run of an LSM-tree. For every run of the tree, a point or range query first probes the corresponding Rosetta filter for this run, and only tries to access the run on disk if Rosetta returns a positive. As with Bloom filters and fence pointers in LSM-trees, every time there is a merge operation, the Rosetta filters of the source runs are dropped and a new Rosetta instance is created for the new run. We discuss in detail the integration of Rosetta in a full key-value store in Section 4. The terms used in the following discussion are summarized in Table 1.

Table 1: Terms used.

  Term   Definition
  n      Number of keys
  R      Maximum range query size
  M      Memory budget
  B_i    Bloom filter corresponding to length-i prefixes
  ϵ_i    False positive rate of B_i
  M_i    Memory allocated for B_i
  ϵ      Resulting false positive rate for a prefix query after considering recursive probes
  ϕ      General reference to the FPR of a Bloom filter in Rosetta
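To make the per-run interaction concrete, the following minimal Python sketch shows how a per-run range filter gates disk access in an LSM-tree read path. The `Run` class and the filter's `range_query` method are illustrative stand-ins for the machinery described in this section, not the paper's implementation.

```python
# Minimal sketch of how a per-run range filter gates disk access in an
# LSM-tree read path. Run and the filter interface are hypothetical and
# used only for illustration.

class Run:
    def __init__(self, keys, rosetta_filter):
        self.keys = sorted(keys)        # persisted, immutable run
        self.filter = rosetta_filter    # in-memory Rosetta for this run

def lsm_range_query(runs, low, high):
    """Collect matches for [low, high] across all runs, newest first."""
    results = []
    for run in runs:
        # Probe the in-memory filter first; skip the run (no I/O) on a negative.
        if not run.filter.range_query(low, high):
            continue
        # Positive (possibly a false positive): read the qualifying part of the run.
        results.extend(k for k in run.keys if low <= k <= high)
    return results
```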
2.1 Constructing Rosetta for LSM-tree Runs

For every run in an LSM-tree, a Rosetta exists which contains information about all the entries within the run. A Rosetta is created either when the in-memory buffer is flushed to disk or when a new run is created because two or more runs of older data are merged into a new one. Thus, insertion in Rosetta is done once all the keys which are going to be indexed by this Rosetta instance are known. All keys within the run are broken down into variable-length binary prefixes, thereby generating L distinct prefixes for each key of length L bits. Each i-bit prefix is inserted into the i-th Bloom filter of Rosetta. The insertion procedure is described in Algorithm 1, and an example is shown in Figure 2. Let us take inserting 3 (corresponding to 0011) as an example. As shown in Figure 2 (the red numbers), we insert prefixes 0, 00, 001 and 0011 into B_1, B_2, B_3 and B_4 respectively. Here insertion means that we index every prefix in the Bloom filter by hashing it and setting the corresponding bits in the Bloom filter. The prefix itself is not stored. Figure 3 shows the final state of Rosetta after inserting all keys in a run (keys 3, 6, 7, 8, 9, 11 in our example). We do not need to take care of dynamic updates to the data set (e.g., a new key arriving with a bigger length) since each separate instance of Rosetta is created for an immutable run at a time. In addition, given that every run of the LSM-tree might have a different set of keys, different Rosetta instances across the tree can have a different number of Bloom filters.

Algorithm 1: Insert
1  Function Insert(K):
       /* K is the key to be inserted. */
2      L ← number of bits in a key
3      B ← L Bloom filters
4      for i ∈ {L, . . . , 1} do
5          B_i.Insert(K)
6          K ← ⌊K/2⌋

[Figure 2: Rosetta indexes all prefixes of each key. The example inserts keys 3, 6, 7, 8, 9, 11 (binary 0011, 0110, 0111, 1000, 1001, 1011): B_1 receives the 1-bit prefixes {0, 1}, B_2 the 2-bit prefixes {00, 01, 10}, B_3 the 3-bit prefixes {001, 011, 100, 101}, and B_4 the full keys.]
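The insertion logic of Algorithm 1 can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions (fixed-length unsigned integer keys, one Bloom filter per prefix length), not the paper's implementation; the `BloomFilter` class is a tiny stand-in added only to make the sketch self-contained.

```python
import hashlib

class BloomFilter:
    """A tiny Bloom filter used only to make the sketch self-contained."""
    def __init__(self, num_bits=1024, num_hashes=4):
        self.bits = bytearray(num_bits)
        self.num_bits = num_bits
        self.num_hashes = num_hashes

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def query(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

class Rosetta:
    """Sketch of Rosetta construction: level i indexes all i-bit prefixes."""
    def __init__(self, key_bits):
        self.L = key_bits
        self.levels = [BloomFilter() for _ in range(key_bits)]  # levels[i-1] = B_i

    def insert(self, key):
        # Algorithm 1: insert the full key, then repeatedly drop the last bit,
        # storing each i-bit prefix in the i-th Bloom filter.
        for i in range(self.L, 0, -1):
            self.levels[i - 1].insert(key)
            key //= 2
```

With L = 4, inserting key 3 (0011) populates B_1 through B_4 with the prefixes 0, 00, 001, and 0011, matching the example in Figure 2.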
Implicit Segment Tree in Rosetta. Rosetta effectively forms an implicit Segment Tree [68] (as shown in Figure 3) which is used to translate range queries into prefix queries. Segment Trees are full binary trees and thus perfectly described by their size, so there is no need to store information about the position of the left and right children of each node.

2.2 Range and Point Queries with Rosetta

2.2.1 Range Lookups. We now illustrate how a range lookup is executed in Rosetta. For every run in an LSM-tree, we first probe the respective Rosetta instance. Algorithm 2 shows the steps in detail and Figure 3 presents an example.
[Figure 3: Rosetta utilizes recursive prefix probing. The example runs range(8, 12), i.e., range(1000, 1100), over the bit states of Bloom filters B_1 through B_4; the annotated probes are search(10, 11) at B_2, search(100, 101) and search(110) at B_3, and search(1000, 1001, 1010, 1011) and search(1100) at B_4, with the split point between the two dyadic ranges.]

Algorithm 2: Range Query
1   Function RangeQuery(Low, High, P = 0, l = 1):
        /* [Low, High]: the query range; P: the prefix. */
2       if P > High or P + 2^(L−l+1) − 1 < Low then
            /* P is not contained in the range */
3           return false
4       if P ≥ Low and P + 2^(L−l+1) − 1 ≤ High then
            /* P is contained in the range */
5           return Doubt(P, l)
6       if RangeQuery(Low, High, P, l + 1) then
7           return true
8       return RangeQuery(Low, High, P + 2^(L−l), l + 1)

9   Function Doubt(P, l):
10      if ¬B_l.PointQuery(P) then
11          return false
12      if l > L then
13          return true
14      if Doubt(P, l + 1) then
15          return true
16      return Doubt(P + 2^(L−l), l + 1)
First, the target key range is decomposed into dyadic intervals. For each dyadic interval, there is a sub-tree within Rosetta where all keys within that interval have been indexed. For every dyadic interval, we first search for the existence of the prefix that covers all keys within that interval. This marks the root of the sub-tree for that interval. In the example of Figure 3, the range query range(8, 12) breaks down into two dyadic ranges: [8, 11] (the red shaded area) and [12, 12] (the green shaded area). We first probe the Bloom filter residing at the root level of the sub-tree. The Segment Tree is implicit, so the length of the prefix determines which Bloom filter to probe. For example, the prefix 10∗ contains every key in [8, 11] and nothing more (i.e., 1000, 1001, 1010, 1011), so for the range [8, 11] we will probe 10 from the second Bloom filter. If this probe returns a negative, we continue in the same way for the next dyadic range. If the probe from every dyadic range returns negative, the range of the query is empty. As such, in the example of Figure 3, we will probe the second Bloom filter for the existence of 10 and, if that returns negative, we will probe the fourth Bloom filter for the existence of 1100. If both probes return negative, the range is empty (when inserting a key we insert all of its prefixes to the corresponding Bloom filter, so for example if there was a key in the range [8, 11], we would have inserted 10 to the second Bloom filter).

If at least one prefix in the target range returns positive, then Rosetta initiates the process of doubting. We probe Bloom filters residing at deeper levels of the sub-tree and, in turn, determine the existence of all possible prefixes within the particular dyadic interval. In the example of Figure 3, when querying the range [8, 11], if the probe for prefix 10∗ is positive, we further probe the third Bloom filter for 100∗ and 101∗. If both of these probes are negative, we can mark the sub-range as empty, since any key with prefix 10∗ will either have the prefix 100∗ or the prefix 101∗.

In general, all Bloom filters at the subsequent levels of Rosetta are recursively probed to determine the existence of intermediate prefixes by traversing the left and right sub-trees in a pre-order manner. The left sub-tree corresponds to the current prefix with 0 appended, while the right sub-tree corresponds to the current prefix with 1 appended. Recursing within a sub-tree terminates if either (a) a Bloom filter probe corresponding to a left-leaf in the sub-tree returns positive, which marks the whole sub-range as positive, or (b) case (a) does not happen while any Bloom filter probe corresponding to a right node returns negative, in which case the whole sub-range is marked as negative.

In the event that the range is found to be a positive, there is still a chance for Rosetta to reduce I/O by "tightening" the effective range to touch on storage. That is, if Rosetta finds that intervals at the edge of the requested range are empty, it will proceed to do I/O only for the narrower, effective key range which is determined as positive.
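The following Python sketch mirrors Algorithm 2 on top of the `Rosetta` class from the earlier sketch. It is an illustration under the same assumptions (integer keys of L bits, one Bloom filter per prefix length), not the paper's implementation; in this sketch a prefix is represented by its top bits as an integer plus its length.

```python
def range_query(rosetta, low, high, prefix=0, depth=0):
    """True if some key in [low, high] may exist; mirrors Algorithm 2.
    `prefix` holds the top `depth` bits of the candidate keys."""
    L = rosetta.L
    lo = prefix << (L - depth)            # smallest key under this prefix
    hi = lo + (1 << (L - depth)) - 1      # largest key under this prefix
    if lo > high or hi < low:
        return False                      # disjoint from the query range
    if low <= lo and hi <= high:
        return doubt(rosetta, prefix, depth)   # dyadic range fully inside the query
    # Partial overlap: descend to the two children (append a 0 bit / a 1 bit).
    return (range_query(rosetta, low, high, prefix << 1, depth + 1) or
            range_query(rosetta, low, high, (prefix << 1) | 1, depth + 1))

def doubt(rosetta, prefix, depth):
    """Confirm a candidate dyadic range by recursively probing deeper filters."""
    if depth == 0:
        # The empty prefix covers the whole domain; start doubting at its children.
        return doubt(rosetta, 0, 1) or doubt(rosetta, 1, 1)
    if not rosetta.levels[depth - 1].query(prefix):
        return False                      # definitely no key with this prefix
    if depth == rosetta.L:
        return True                       # full-length prefix: report a (maybe false) positive
    return (doubt(rosetta, prefix << 1, depth + 1) or
            doubt(rosetta, (prefix << 1) | 1, depth + 1))
```

For the example run {3, 6, 7, 8, 9, 11} with L = 4, `range_query(r, 8, 12)` returns True, while `range_query(r, 4, 5)` typically returns False (the probe for prefix 010 fails, barring a Bloom filter false positive).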
2.2.2 Point Lookups. The last level of Rosetta indexes the whole prefix of every key in the LSM-tree run that it represents. This is equivalent to indexing the whole key in a Bloom filter, which is how traditional Bloom filters are used in the default design of LSM-trees to protect point queries from performing unnecessary disk access. In this way, Rosetta can answer point queries in exactly the same way as point-query optimized LSM-trees: for every run of the LSM-tree, a point query only checks the last level of Rosetta for this run, and only tries to access disk if Rosetta returns a positive.

2.2.3 Probe Cost vs FPR. The basic Rosetta design described above intuitively sacrifices CPU cost during probe time to improve on FPR. This is because of the choice to index all prefixes of keys and subsequently recursively probe those prefixes at query time. In the next two subsections, we show how to further optimize FPR through memory allocation and further CPU sacrifices during probe time.

2.3 FPR Optimization Through Memory Allocation

A natural question arises: how much memory should we devote to each Bloom filter in a Rosetta instance for a given LSM-tree run? Alternatively, what is the false positive rate ϵ_i that we should choose for the filters in order to minimize the false positive rate for a given memory budget? Equally distributing the memory budget among all Bloom filters overlooks the different utilities of the Bloom filters at different Rosetta levels. In this section, we discuss this design issue.

A First-Cut Solution: FPR Equilibrium Across Nodes. Our first-cut solution is to achieve an equilibrium, such that an equal FPR ϵ is achieved for any Segment Tree node u by considering the filters both at u and at u's descendant nodes. As we will explain later in Section 3, this strategy gives a very good theoretical interpretation in terms of memory usage: it achieves a 1.44-approximation of the optimal memory usage. The detailed calculation for the FPR ϕ of the filter at u's level is as follows. By the aforementioned objective equilibrium, we suppose that a node u's children all have false positive rate ϵ, after taking the filters of their respective descendant nodes into account.
If we consider the filters of u and its descendants all together, an ultimate false positive occurs either when 1) the filter at u returns a positive and u's left sub-tree returns a positive, which leads to a positive rate ϕ·ϵ; or when 2) the filter at u returns a positive, u's left sub-tree returns a negative, and u's right sub-tree returns a positive, which gives a positive rate ϕ·(1−ϵ)·ϵ. Putting these two cases together gives the following equation:

    ϕ·ϵ + ϕ·(1−ϵ)·ϵ = ϵ  ⟹  ϕ·(2−ϵ) = 1  ⟹  ϕ = 1/(2−ϵ)

This means that we may set each non-terminal node's false positive rate to 1/(2−ϵ).
Optimized Allocation of Memory Across Levels. Based on the aforementioned idea, there are still many Bloom filters that are treated equally (except for the last level). However, we find that in general, the Bloom filters at shorter-prefix Rosetta levels are probed less than deeper levels. Intuitively, we should set lower FPRs for Bloom filters that are probed more frequently. To formalize this idea, we analyze the frequency of a Segment Tree node u being accessed (i.e., the Bloom filter at node u being probed), if we issue every query of size R once. Our analytical results are as follows.

Let level-r of Rosetta be the level which is r levels higher than the leaf level. Then, the frequency of a level-r node being accessed, g(r), is expressed as:

    g(r) = Σ_{0 ≤ c ≤ ⌊log R⌋ − r} g(r + c, R − 1)                                          (1)

    where  g(x, R − 1) =  1,                   x ∈ [0, ⌊log R⌋)
                          (R − 2^x + 1)/2^x,   x = ⌊log R⌋                                  (2)
                          0,                   x > ⌊log R⌋

To incorporate the idea of achieving lower FPRs for less frequently accessed filters, we aim to minimize the objective Σ_{0≤r≤⌊log R⌋} g(r)·ϵ_r (where ϵ_r is the FPR of the level-r Bloom filter of Rosetta), subject to the overall memory budget M. Let M_r be the number of bits assigned to the level-r Bloom filter. Then the following expression minimizes the aforementioned objective:

    M_r = − (n/(ln 2)^2) · ln(C/g(r))                                                       (3)

    where  C = (Π_{0≤r≤⌊log R⌋} g(r))^{1/⌊log R⌋} · e^{−(M/n)·((ln 2)^2/⌊log R⌋)}            (4)

The derivation of Equation 3 is as follows. First, we transform the objective function²:

    Σ_{0≤r≤⌊log R⌋} g(r)·ϵ_r = Σ_{0≤r≤⌊log R⌋} g(r)·e^{−(M_r/n)·(ln 2)^2}                    (5)

Then, by the AM-GM inequality³ and letting h = 1 + ⌊log R⌋, we have:

    Σ_{0≤r≤⌊log R⌋} g(r)·e^{−(M_r/n)·(ln 2)^2}
        ≥ h · (Π_{0≤r≤h−1} g(r)·e^{−(M_r/n)·(ln 2)^2})^{1/h}
        ≥ (1 + ⌊log R⌋) · (Π_{0≤r≤⌊log R⌋} g(r))^{1/⌊log R⌋} · e^{−(M/n)·(ln 2)^2/⌊log R⌋}   (6)

Given parameters R, r, M, n, the last value in Equation 6 is a constant. Hence, the minimum is achieved when the equality condition of the arithmetic-mean inequality holds, i.e.,

    g(r) · e^{−(M_r/n)·(ln 2)^2} = C                                                        (7)

    where  C = (Π_{0≤r≤⌊log R⌋} g(r))^{1/⌊log R⌋} · e^{−(M/n)·((ln 2)^2/⌊log R⌋)}

Transforming Equation 7 gives the optimal memory-bits allocation shown in Equation 3. When the solved M_r is negative, we reset M_r = 0 and rescale the other M_{r′} for r′ ≠ r to maintain the sum of memory bits at M.

    ² For ease of analysis, we assume each Bloom filter at a level in [0, ⌊log R⌋] has n unique keys.
    ³ (x_1 + x_2 + . . . + x_h)/h ≥ (Π_{1≤i≤h} x_i)^{1/h} for x_i ≥ 0; the equality holds when all x_i are equal.
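A small Python sketch of this allocation under the paper's assumptions (a maximum range size R ≥ 2, n keys per filter, a total budget of M bits). The function names and the clamp-and-rescale step for negative solutions are illustrative choices following the text above, not the paper's code.

```python
import math

def g(r, R):
    """Access frequency of a level-r Segment Tree node (Equations 1-2)."""
    log_r = math.floor(math.log2(R))
    def g_term(x):
        if 0 <= x < log_r:
            return 1.0
        if x == log_r:
            return (R - 2 ** x + 1) / (2 ** x)
        return 0.0
    return sum(g_term(r + c) for c in range(0, log_r - r + 1))

def allocate_bits(M, n, R):
    """Per-level bit budgets M_r following Equations 3-4, with negative
    solutions clamped to zero and the rest rescaled to sum to M. Assumes R >= 2."""
    log_r = math.floor(math.log2(R))
    weights = [g(r, R) for r in range(log_r + 1)]
    ln2_sq = math.log(2) ** 2
    # Equation 4: constant C shared by all levels.
    C = (math.prod(weights) ** (1.0 / log_r)) * math.exp(-(M / n) * ln2_sq / log_r)
    # Equation 3: raw per-level allocation (may be negative for rarely probed levels).
    raw = [-(n / ln2_sq) * math.log(C / w) for w in weights]
    clamped = [max(b, 0.0) for b in raw]
    scale = M / sum(clamped) if sum(clamped) > 0 else 0.0
    return [b * scale for b in clamped]
```

For example, with R = 64 and M = 10·n bits, the deeper (more frequently probed) levels receive most of the budget, while levels far above the leaves may receive none, matching the intuition that the upper levels mainly reduce probe work.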
2.4 FPR Optimization Through Further Probe Time Sacrifice

As we will show in Section 5, filter probe time is a tiny portion of overall query time in a key-value store. This means that we can explore opportunities to further sacrifice filter probe time if that might help us to further improve on FPR. In this section, we discuss such an optimization.

The core intuition of the optimization in this section is that the multiple levels of Rosetta require valuable bits. Effectively, all levels but the last one in each Rosetta instance help primarily with probe time, as their utility is to prune the parts of the range we are going to query for at the last level of Rosetta. The fewer such ranges, the lower the probe cost of the filter. At the same time, though, these bits at the top levels of Rosetta could be used to store prefixes with fewer conflicts at the bottom level, improving FPR. This gives a balance between FPR and probe cost. We present two new strategies based on this observation.

(i) Single-level filter for small ranges. This is an extreme design option of using only a single-level Rosetta. This means that all of Rosetta is stored in a single Bloom filter, maximizing the utility of the available memory budget but increasing probe time as all ranges need to be queried. Thus the probe time is linear in the range size.

(ii) Variable-level filter for large ranges. An alternative approach to the extreme single-level design is to selectively push more bits to lower levels (e.g., more aggressively pushing bits to lower levels to the point where a level might be empty). The amount of bits chosen determines the CPU/FPR balance. We achieve this by associating a Bloom filter with a weight equal to the sum of its probe frequency and all the probe frequencies of its above Bloom filters. The new weight can be expressed as follows:

    w(B_i) = Σ_{i ≤ r ≤ ⌊log R⌋} g(r)

Then we apply Equation 3 to compute the bits allocation based on the new weights (i.e., by replacing g(r) with w(B_r)). In this way, we shift more bits from upper levels to bottom levels. Also, when the M_r solved by Equation 3 is negative, we set M_r = 0 and update w(B_i) by adding up its current weight with all the weights of its above Bloom filters.
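Continuing the earlier allocation sketch, the variable-level weighting is simply a suffix sum over the probe frequencies g. This is an illustrative sketch reusing the hypothetical `g` helper defined above, not the paper's code.

```python
def variable_level_weights(R):
    """w(B_i) = sum of g(r) for i <= r <= floor(log2(R)), i.e., a suffix sum."""
    log_r = math.floor(math.log2(R))
    freqs = [g(r, R) for r in range(log_r + 1)]
    return [sum(freqs[i:]) for i in range(log_r + 1)]
```

Feeding these weights into the Equation-3 allocation (in place of g(r)) biases the budget further toward the bottom levels, which is the variable-level behavior evaluated in Figure 4.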
To evaluate the FPR performance and probe time of the original mechanism (by Equation 3), the single-level filter, and the new variable-level filter, we perform a test on 10 million keys with a Rosetta with 10 bits per key. Figure 4 varies the range size from 2 to 512 (x-axis). Each bar represents FPR (right figure) or probe cost (left figure). While the single-level filter has the best FPR, its probe cost is significantly higher than the other two methods starting from a range size of 32. This range size is also the turning point where the FPR performance of the new variable-level filter starts to overtake the original mechanism, while achieving a more reasonable probe cost compared with the single-level filter. This observation motivates a hybrid mechanism of both single-level and variable-level filters depending on the workload: for workloads where small ranges (i.e., at most size 16) are dominating, we employ the single-level filter, whereas in other cases, we build Rosetta with the mediated filter.

[Figure 4: Depending on range size, a different bits-allocation mechanism can be used. The left panel shows probe cost (microsecs) and the right panel shows FPR for the original, single-level, and variable-level mechanisms as the range size grows from 2 to 512.]

Finally, note that our solution to distribute the bits across the different Rosetta levels based on actual runtime filter use requires workload knowledge. This knowledge is attainable by extending the native statistics already captured in a key-value store. Given that a Rosetta instance is created for every run, we keep counters and histograms for query ranges, invoked Rosetta instances, and the corresponding hit rates on the underlying runs. More specifically, we record the query ranges seen and also record the utilization and performance of each Bloom filter in the invoked Rosetta instances. At compaction time, we reconcile these statistics and create the post-compaction Rosetta instances using this workload information to extract the weights and to decide on single vs. variable levels for each run.

3 THEORETICAL INSIGHTS

We show that Rosetta is very close to being space optimal, and that the query time is moderate.

3.1 Space Complexity is Near-optimal

It has been shown in [44] that in order to achieve an FPR of ϵ for any range query of size R, any data structure should use at least n·log(R/ϵ)/(1 − O(ϵ)) − O(n) bits, where n is the number of keys.

For memory allocation in Rosetta, we have introduced the first-cut solution and the optimized approaches. The first-cut solution is expressed in terms of the given FPR guarantee ϵ, while the spirit of the optimized approaches is to consider the utilities of each filter, thereby being workload aware. As the space lower bound is given with regard to ϵ, our discussion of the space complexity will focus on the first-cut solution, which is with regard to ϵ directly.

The first-cut solution of memory allocation gives an ϵ FPR for the last level but a 1/(2−ϵ) FPR for the other levels. Since each level in Rosetta may have at most n nodes, we get that the total memory required is:

    log e · (n log(1/ϵ) + n log(u) log(2 − ϵ)) ≈ 1.44 · n log(u/ϵ),

where u is the size of the key space. If we have a bound R on the maximum size of range queries, we may disregard some levels of the Rosetta, which gives us a memory usage of 1.44 · n log(R/ϵ). This means Rosetta is very close to the aforementioned theoretical lower bound.

3.2 Computational Complexity is Low

Construction. Given that each run in an LSM-tree contains sorted keys, we can avoid inserting common prefixes of consecutive keys, which means that we only insert unique prefixes to the Bloom filters. Thus, the number of insertions to Bloom filters that we perform is equal to the size of a binary trie containing the keys, which is upper bounded by n · L. This complexity is very close to the lower bound, which barely reads every bit of the keys.
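A sketch of this construction optimization, assuming sorted integer keys and the earlier `Rosetta` class; `insert_run` is an illustrative helper, not an API from the paper.

```python
def insert_run(rosetta, sorted_keys):
    """Insert a sorted run, skipping prefixes shared with the previous key.
    Each prefix is inserted exactly once, so the work is bounded by the size
    of the binary trie over the keys (at most n * L insertions)."""
    prev = None
    L = rosetta.L
    for key in sorted_keys:
        # Length of the longest common prefix with the previous key.
        common = 0
        if prev is not None:
            diff = prev ^ key
            common = L if diff == 0 else L - diff.bit_length()
        # Only the prefixes longer than the shared part are new.
        for i in range(L, common, -1):
            rosetta.levels[i - 1].insert(key >> (L - i))
        prev = key
```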
Querying. The query time complexity is proportional to the maximum number of prefixes that we break the range down into. In a Segment Tree that number is equal to 2L [29]. Similarly, a range of size R will need up to 2·log2(R) queries since it can break into at most that many prefixes. In the worst-case scenario, the number of probes needed for a range query is equal to the number of Segment Tree nodes under the sub-trees of every dyadic range, which is O(R). However, the computation ends as soon as there are no positive probes left to recurse over. Our analysis shows that in expectation, the number of probes is typically much less than the worst case. We divide the analysis into multiple cases.

Uniform Bits-Allocation. Consider a Rosetta with each of its Bloom filters having FPR equal to p = 0.5 + θ, with θ ≠ 0 and |θ| < 0.5. We analytically show that the expected runtime of a query is less than O(log R / θ^2) for an empty range. For an empty range, the process of doubting ends once either no positives are remaining, or the level of recursion reaches the last level of Rosetta. Let E be the expected number of Bloom filter probes under the condition that the level of recursion never reaches the last level (i.e., level L), and F the expected number of probes under the condition that it does. By linearity of expectation, we have that the total expected number of probes is equal to E + F. To upper bound E + F, we consider a Rosetta containing an infinite number of Bloom filters where each has an FPR equal to p. Querying with such an "infinite" Rosetta is at least as costly as querying with the original Rosetta. To see this, if a query never reaches level L or every probe in level L returns a negative, then the probe cost with either Rosetta is the same. However, if a query reaches level L and a probe in this level returns a positive, then with the original Rosetta the query stops immediately by returning positive, whereas with the infinite Rosetta the probe can further recurse to deeper levels, thereby incurring more probe cost. Next, we analyze the probe cost within each dyadic range in the infinite Rosetta, and we reuse the term E defined previously for ease of presentation.

Let E_i be the expected number of Bloom filter probes given that the query has encountered exactly i positives, which means that by linearity of expectation, E = Σ_{i=0}^{∞} E_i. Each positive probe either recurses into two negative probes (which then recurse no further), two positive probes (both of which recurse), or one positive (which recurses) and one negative (which does not). Therefore the probes that we perform form a binary tree whose leaves are the negative probes, and every other node is a positive probe. From this property, we have that there are i + 1 negatives, and as such 2i + 1 probes in total. Then, we have that E_i = P_i·(2i + 1), where P_i is the probability that the number of positive probes is equal to i. Let C_i be the i-th Catalan number, which is equal to the number of unique binary trees with 2i + 1 nodes [66]. We have that P_i = C_i · p^i · (1 − p)^{i+1}, since for each unique binary tree, the probability of Rosetta making the probes exactly in the shape of that tree is equal to p^i · (1 − p)^{i+1}: a positive occurs with probability p and a negative with probability (1 − p). By the approximation C_i ≃ 4^i/(i^{3/2}·√π) [66], we have that

    P_i ≃ ((1 − 4θ^2)^i / (i·√(iπ))) · (1 − p) ≤ (1 − 4θ^2)^i / (i·√(iπ))

Then, E_i = P_i·(2i + 1) ≤ (1 − 4θ^2)^i·(2i + 1) / (i·√(iπ)). We have that

    E − E_0 = Σ_{i=1}^{∞} E_i ≤ Σ_{i=1}^{∞} (1 − 4θ^2)^i·(2i + 1) / (i·√(iπ))
            ≤ Σ_{i=1}^{∞} 3i·(1 − 4θ^2)^i / (i·√(iπ))
            ≤ Σ_{i=1}^{∞} 3·(1 − 4θ^2)^i / √π
            ≤ 3 / (4θ^2·√π)

Finally, since E_0 is a constant and we have at most 2·log R dyadic ranges, we conclude that the probe cost is at most O(log R / θ^2) with the infinite Rosetta. This also upper bounds the probe cost of the original Rosetta, and thus the overall expected complexity of a range query is less than O(log R / θ^2).

Non-Uniform Bits-Allocation. We can relax the assumption by allowing unequal FPRs among the Bloom filters. We denote the FPR of the level-i Bloom filter by p_i. Let p_max = max{p_i} and p_min = min{p_i}. Similar to the above derivation, we have

    P_i ≤ C_i·p_max^i·(1 − p_min)^{i+1} ≃ (4^i / (i^{3/2}·√π))·(p_max·(1 − p_min))^i·(1 − p_min)    (8)

To give the same complexity analysis as in the case where all p_i are equal, it is sufficient to let p_max·(1 − p_min) < 1/4. Let θ′ = sqrt(1/4 − p_max·(1 − p_min)). Following our complexity analysis, we show that the expected probe time is O(log R / θ′^2).

4 INTEGRATING RANGE FILTERS TO AN LSM-BASED KEY-VALUE STORE

Standardization of Filters in RocksDB. We integrated Rosetta into RocksDB v6.3.6, a widely used LSM-tree based key-value store. We constructed a master filter template that exposes the API of the fundamental filter functionalities – populating the filter, querying the filter about the existence of one or more keys (point lookups and range scans), and serializing and deserializing the filter contents and its structure.

Implementing Full Filters. Each run of the LSM-tree is physically stored as one or more SST files; SST stands for Static Sorted Table, and each file is effectively a contiguously stored set of ordered key-value pairs. A Rosetta instance is created for every SST file.⁴ (⁴ Using block-based filters is also deprecated in RocksDB.)
Filters are serialized before they are persisted on disk.⁵ During background compactions, a new filter instance is built for the merged content of the new SST, while the filter instances for the old SSTs are destroyed. (⁵ For caching filters and fence pointers, we set cache_index_and_filter_blocks to true. We also ensure that the fence pointers and filter blocks have a higher priority than data blocks when the block cache is used. For this, we set cache_index_and_filter_blocks_with_high_priority=true and pin_l0_filter_and_index_blocks_in_cache=true.)

Minimizing Iterator-Overhead Through Configuration. For each range query, RocksDB creates a hierarchy of iterators. The number of iterators is equal to the number of SST files and thus the number of filter instances. Each such iterator is called a two-level iterator as it further maintains two child iterators to iterate over the data and non-data (filters and fence pointers) blocks (in the event that the corresponding filter returns a positive).

We found that a big part of the CPU cost of range queries in RocksDB [13] is due to the maintenance and consolidation of data coming from the hierarchical iterators. Unlike other levels, L0 in RocksDB comprises more than one run flushed from the in-memory buffer. This design decision helps with writes in the key-value store. But at the same time, it means that we need to create multiple iterators for L0, causing a big CPU overhead for empty queries (queries that do not move data). To reduce this overhead, we bound the number of L0 files (in our experiments we use 3 runs at L0⁶). Setting the number of L0 iterators too low may throttle compaction jobs and create long queues of pending files to be compacted. Hence, this value needs to be tuned based on the data size. We also restrict the growth of L0, thereby enforcing compaction and spawning most iterators only per level (and not per file), by setting max_bytes_for_level_base. (⁶ level0_file_num_compaction_trigger=3)
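For reference, the RocksDB settings named in this section and its footnotes can be collected as follows. This is only a summary of the values described above, presented as a Python dict for readability; it is not a RocksDB API call or a complete tuning recipe, and the value for max_bytes_for_level_base is a placeholder since the text leaves it to be tuned.

```python
# Configuration knobs referenced in this section (the names are actual RocksDB
# options; the grouping into a dict is only for presentation).
rocksdb_settings = {
    # Keep filter and fence-pointer (index) blocks cached with high priority.
    "cache_index_and_filter_blocks": True,
    "cache_index_and_filter_blocks_with_high_priority": True,
    "pin_l0_filter_and_index_blocks_in_cache": True,
    # Bound the number of L0 runs (and thus L0 iterators) to 3.
    "level0_file_num_compaction_trigger": 3,
    # Restrict L0 growth so iterators are mostly spawned per level, not per file.
    "max_bytes_for_level_base": "<tuned per data size>",
}
```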
Implementation Overview of a Range Query. For each range query in [low, high], we first probe all filter instances to verify the existence of the range. If all filters answer negative, we delete the iterator and return an empty result. If one or more filters answer positive, we probe the fence pointers and seek the lower end of the query using seek(low), which incurs I/O. If seek() returns a valid pointer, we perform a next() operation on the two-level iterator, which advances the child iterators to read the corresponding fence pointers followed by the data blocks. When the iterator reaches (or crosses) the upper bound high, next() returns false.
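The control flow can be summarized with the following Python-style sketch; `filters`, `iterator`, and their methods are simplified stand-ins for the RocksDB machinery described above, not actual RocksDB APIs.

```python
def execute_range_query(filters, iterator, low, high):
    """Filter-first range query flow: probe the filters, then seek and iterate."""
    # Step 1: probe every filter instance (one per SST file / run).
    if not any(f.range_query(low, high) for f in filters):
        iterator.close()
        return []                       # empty result, no storage I/O
    # Step 2: at least one positive; position the iterator (this incurs I/O).
    results = []
    if not iterator.seek(low):
        return results
    # Step 3: advance until the iterator reaches or crosses the upper bound.
    while iterator.valid() and iterator.key() <= high:
        results.append((iterator.key(), iterator.value()))
        iterator.next()
    return results
```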
Minimizing Deserialization-Overhead. To perform a filter probe for an SST file, we first need to retrieve the filter. If the filter comes from disk, we need to first deserialize it. Filter serialization incurs a low, one-time CPU cost as it takes place once during the creation of a filter instance. However, by default, deserialization takes place as many times as the number of queries. Being a trie with a tree structure, SuRF has a low deserialization cost as it can choose to deserialize only a part of the filter as needed. On the other hand, Rosetta, being based on Bloom filters, cannot take advantage of partial filter bits. To this end, we construct a dictionary containing the mapping between the deserialized bits of each Rosetta instance and its corresponding run in the LSM-tree. Every time a Rosetta instance is constructed, we add this mapping to the dictionary, which prevents frequent calls to deserialization at runtime. After each compaction, we delete the corresponding entries from the dictionary.

5 EXPERIMENTAL EVALUATION

We now demonstrate that by sacrificing filter probe cost, Rosetta achieves drastically lower FPR, bringing up to 70% net improvement in end-to-end performance over default RocksDB, and 50% against a SuRF-integrated RocksDB.

Baselines. We use the following baselines for comparison: the state-of-the-art range query filter, SuRF⁷ [74], integrated into RocksDB; RocksDB with Prefix Bloom filters; and vanilla RocksDB with no filters but only fence pointers. For SuRF, we used the open-source implementation⁸ with additional stubs to tailor the code path to the master template. For the other baselines (Prefix Bloom filters and vanilla RocksDB with no filters but only fence pointers), we use the default RocksDB implementation of each of them. All baselines use the same source RocksDB version v6.3.6. (⁷ https://github.com/efficient/SuRF; ⁸ https://github.com/efficient/rocksdb)

Workload and Setup. We generate YCSB key-value workloads that are variations of Workload E, a majority range scan workload modeling a web application use case [18]. We generate 50 million (50 × 10⁶) keys, each of size 64 bits (and 512-byte values), using both uniform and normal distributions. We vary the maximum size of range queries and the target scan distribution. We set a default budget of 22 bits per key for each filter unless specified otherwise. Rosetta can tune itself to achieve a specific memory budget, while with SuRF we need to implicitly trade memory for FPR by increasing a built-in parameter called suffix length. When given a memory budget, we use a binary search on the suffix length to locate the SuRF version that gives the closest memory cost to the given memory budget. Exceptional cases happen when the minimum possible memory occupied by SuRF (when we set the suffix length to 0) is still greater than the memory budget. In this case, we use the SuRF version that achieves its minimum possible memory budget. As filters are meant to improve performance by capturing false positives, and thus reducing the number of I/Os, a filter is best evaluated in the presence of empty queries. Therefore, for all our experiments, our workload is comprised of empty range and point queries to capture worst-case behavior.
[Figure 5: Rosetta improves end-to-end RocksDB performance across diverse workloads. (A1) Uniform data (50M) + uniform workload + 22 bits/key: workload execution latency broken into I/O cost, CPU-IO overlap, and CPU cost for SuRF (S) and Rosetta (R) across range sizes 1-64. (A2) Break-down of CPU cost for the uniform workload into serialized get, deserialization, filter probe, and residual seek. (A3) Comparison of FPR between SuRF and Rosetta across range sizes 2-64. (B) Uniform data (50M) + correlated workload + 22 bits/key. (C) Normal data (50M) + uniform workload + 22 bits/key. (D) Rosetta outperforms all baselines (fence pointers, Prefix Bloom filters, SuRF) in workload execution latency across range sizes.]
Experimental Infrastructure. We use a NUC5i3ryh server with a 5th Generation Intel Core i3 CPU, 8GB main memory, a 1 TB SATA HDD at 7200 RPM, and a 120GB Samsung 850 EVO m.2 SSD. The operating system is Ubuntu 16.04 LTS.

Rosetta Improves the Performance of RocksDB. In our first experiment we demonstrate the benefits of Rosetta in a uniform workload while varying the query range from 1 to 64. For each range size, we measure the end-to-end performance in terms of workload execution latency, for both Rosetta-integrated RocksDB and SuRF-integrated RocksDB. Figure 5(A1) shows the results, denoting the measurements of Rosetta and SuRF using R and S, respectively. Figure 5(A1) also reports how the total cost for each system is spread among disk I/Os, CPU usage, and the overlap between the two. The total time emanating from disk I/Os and overall CPU usage is obtained using the RocksDB-reported metrics block_read_time and iter_seek_cpu_nanos. The cost for I/O and CPU taken together exceeds the total cost due to cycle-stealing within the processor architecture, which prevents wasting CPU cycles. Therefore, the CPU-IO overlap contributes to both the CPU and I/O cost. For measuring the sub-costs due to serialization, deserialization, and filter probe, we have embedded custom statistics into RocksDB using the internal stopwatch() support.

Figure 5(A1) shows that the performance of Rosetta is up to 4× better compared to SuRF. For shorter ranges, up to a range size of 16, Rosetta incurs a negligible I/O overhead compared to SuRF with 22 bits per key. This behavior is due to the design differences between SuRF and Rosetta: SuRF employs a trie-culling design that prunes a full trie's bottom elements, which contain important information for answering short range queries, whereas Rosetta takes advantage of its lower-level Bloom filters for answering short range queries. This significantly reduces the FPR achieved by Rosetta for shorter ranges, as shown in Figure 5(A3). The mean FPR of Rosetta for shorter ranges is 0.00012, whereas it is 0.0242 for SuRF. This leads to a 70% reduction of I/O cost for shorter ranges in Rosetta. The gap closes for bigger ranges as more of the data has to be scanned and more areas in the filters need to be probed. As the range size increases beyond 16, although the FPR of Rosetta increases, it is still 1.5× less than that of SuRF, which reduces the I/O cost by 13.6%.

Sacrificing Filter Probe Cost Leads To Better Performance with Rosetta. In Figure 5(A2), we magnify the CPU cost reported in Figure 5(A1) and break it down further into sub-costs: fetching serialized filter bits before a filter probe, deserializing filter bits, probing the filter, and performing the residual seek() operations, which include routine jobs performed by RocksDB iterators – looking for checksum mismatches and I/O errors, going forward and backward over the data, filters, and fence pointers, and creating and managing database snapshots for each query. Figure 5(A2) shows that the cost of serialization is negligible (about 0.7% for both SuRF and Rosetta), and hence is not visible within the stacked breakdown. The CPU overhead of deserialization is significant yet close across both filters – about 14.5% and 11.1% for Rosetta and SuRF, respectively. Overall, the costs of serialization and deserialization do not fluctuate across diverse range sizes as both these operations account for a constant amount of CPU work per query.

The critical observation from Figure 5(A2) is that filter probe cost is a non-dominant portion of the overall CPU cost. Rosetta does incur a higher filter probe cost compared
to SuRF (19% of total CPU cost compared to 5%). The reason is it as positive, since it will have likely culled that key’s pre-
that Rosetta probes its internal Bloom filters once per dyadic fixes which contain the information to answer the query. In
interval of the target query range. Within each interval, the contrast, for any query, Rosetta is able to decompose it into
deeper Bloom filters are recursively probed until one of them prefix checking which is insensitive to the correlation.
returns a negative. On the other hand, with SuRF a probe
involves only checking the existence of (i) a smaller number Rosetta Enhances RocksDB for Skewed Key Distribu-
of (ii) fixed-length prefixes. tion. For this experiment, we generate a dataset compris-
However, this CPU-intensive design leads to bigger gains ing of 50M keys following a normal distribution and exe-
for Rosetta. It allows Rosetta to achieve lower FPR (Fig- cute a uniform workload on the dataset. In Figure 5(C), we
ure 5(A3)) resulting in 2 − 4× end-to-end improvement for demonstrate that Rosetta offers up to 2× performance bene-
RocksDB (Figure 5(A1)). The improved FPR does not only fits over SuRF with skewed keys for end-to-end performance
help with I/O but it also affects the fourth CPU-overhead in RocksDB. As more skewed keys are added to a run, the
contributing to the residual seek cost. On averaging over dif- number of distinct keys within the run decreases. This leads
ferent range sizes, the overhead due to residual seek is about to more prefix collisions which makes the trie-culling ap-
65% and 82% for Rosetta and SuRF, respectively, thereby dom- proach of SuRF vulnerable to false positives. On the other
inating the overall CPU cost. This overhead comprises of hand, Rosetta can resolve the collision through more probes
a (i) fixed component (invariable across varied range sizes) at the deeper level Bloom filters. On average, Figure 5(C)
due to creation and maintenance of top-level iterators in shows that Rosetta enhances performance over SuRF by 52%
RocksDB for all queries and (ii) a variable component due to and 19% for short and long range queries, respectively.
longer usage of child iterators per query due to reading more Rosetta improves default RocksDB by as much as 40x.
data or fence pointer blocks in case of false or true positives. We now demonstrate the benefits of Rosetta against the
For example, as we compare the results across Figures 5(A1), default filters of RocksDB using the 50M uniform workload
(A2), and (A3), we notice that Rosetta shows a fixed over- setup (as in Figure 5(A)). In Figure 5(D), we observe that both
head of about 2.7 seconds for shorter ranges, i.e., when FPR Prefix Bloom filters and fence pointers offer significantly
is significantly close to 0. The variable cost starts showing higher latency compared to both SuRF and Rosetta. This is
up as the FPR increases for range size 32 and 64. Due to high because fence pointers cannot detect empty range queries
FPR in SuRF, we observe that this part of the CPU-overhead especially when the range size is significantly lower than the
is significantly higher. size of the key domain (264 ). Prefix Bloom filters also cause
Therefore, sacrificing the CPU cost due to filter probe a high FPR due to frequent prefix collisions as the range
significantly reduces (a) the CPU overhead originating from size increases. Rosetta offers the lowest latency among all
seek() (which also dominates the total CPU cost) and of baselines bringing an up to 40× improvement over default
course (b) the total I/O cost of the workload due to lower RocksDB. While the benefit reduces with larger range sizes,
FPR. The consolidated effects of (a) and (b) make a significant key-value stores are fundamentally meant to support point
difference in the end-to-end performance for Rosetta. lookups, writes, and short range queries as opposed to scans
Rosetta Enhances RocksDB for Key-Correlated Workloads. In order to investigate the benefits of Rosetta across diverse workload types, we repeat the previous experiment on both correlated and skewed workloads. First, we analyze workloads that exhibit prefix similarity between queried keys and stored keys. For this experiment, we introduce a correlation factor θ such that a range query with correlation degree θ has its lower bound at a distance θ from the lower bound generated using the distribution. Therefore, for a range query of size R, if the generated lower bound is l, then we set the actual lower bound to be at key l + θ. For this experiment, we set θ = 1.

Figure 5(B) depicts that when the workload entails a key correlation, Rosetta-integrated RocksDB on average achieves a 75% lower latency than that of SuRF-integrated RocksDB. This is because when the lower bound of a query falls close to a key, even if the range is empty, SuRF will always answer positively, whereas Rosetta can still reject such an empty range at the deeper level Bloom filters. On average, Figure 5(C) shows that Rosetta enhances performance over SuRF by 52% and 19% for short and long range queries, respectively.
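For concreteness, correlated range queries of this form could be generated as in the following sketch (our own illustration, not the paper's workload harness; the stored key set stands in for the underlying distribution and is assumed non-empty):

#include <cstdint>
#include <random>
#include <utility>
#include <vector>

// Build range queries of size R whose lower bound lies at distance theta
// from a lower bound l drawn from the key distribution, i.e., the queried
// range is [l + theta, l + theta + R).
std::vector<std::pair<uint64_t, uint64_t>> MakeCorrelatedQueries(
    const std::vector<uint64_t>& keys, uint64_t R, uint64_t theta,
    size_t num_queries, std::mt19937_64& rng) {
  std::uniform_int_distribution<size_t> pick(0, keys.size() - 1);
  std::vector<std::pair<uint64_t, uint64_t>> queries;
  queries.reserve(num_queries);
  for (size_t i = 0; i < num_queries; ++i) {
    uint64_t l = keys[pick(rng)];        // lower bound drawn from the distribution
    uint64_t low = l + theta;            // shift by the correlation factor
    queries.emplace_back(low, low + R);  // often empty when keys are sparse
  }
  return queries;
}

With θ = 1, every query starts right next to a stored key, so a trie-based filter sees a long shared prefix and tends to answer positively even though the shifted range is usually empty.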
Rosetta improves default RocksDB by as much as 40x. We now demonstrate the benefits of Rosetta against the default filters of RocksDB using the 50M uniform workload setup (as in Figure 5(A)). In Figure 5(D), we observe that both Prefix Bloom filters and fence pointers offer significantly higher latency compared to both SuRF and Rosetta. This is because fence pointers cannot detect empty range queries, especially when the range size is significantly lower than the size of the key domain (2^64). Prefix Bloom filters also cause a high FPR due to frequent prefix collisions as the range size increases. Rosetta offers the lowest latency among all baselines, bringing an up to 40× improvement over default RocksDB. While the benefit reduces with larger range sizes, key-value stores are fundamentally meant to support point lookups, writes, and short range queries as opposed to scans and long range queries [38, 57, 60].

Rosetta's Construction Cost Adds Minimal Overhead. All types of filters in an LSM-based key-value store need to be reconstructed after a merge occurs. In our next experiment, we demonstrate the impact of this construction cost as we vary the data size. First, we set the size of L0 to be very high (30 runs) so that the entire data fits into L0 and there is no compaction. This way, we can isolate and measure the filter construction cost without taking into account the overlap with the compaction overhead. We vary the SST file size between 256 MB and 1 GB which, in turn, increases the number of filter instances created from 5 to 59. Figure 6(A) shows that the average filter construction cost of Rosetta is about 14% less than that of SuRF. This is because Rosetta creates a series of Bloom filters which are densely packed arrays, causing fewer memory misses compared to populating a tree. Next, in order to study the more holistic impact of filters on write costs, we restored the size of L0 to 3 (as in the other experiments) and measure the end-to-end workload execution latency for both filters as well as vanilla RocksDB with only fence pointers (denoted by F in the results). In Figure 6(B), we break down the overall cost into separate costs due to reads and writes, and writes are further broken down into sub-costs emanating from both compaction and filter creation. We observe that the write overhead is less for fence pointers as there is no filter construction overhead, but reads are expensive due to high false positives. On the other hand, both SuRF and Rosetta have higher write overhead due to the additional work of recreating filters after each compaction. To give more insights, we measure the total compaction time (T), and the bytes read (R) and written (W) during compaction, and denote the compaction overhead to be T/(R+W). We observe this overhead to be 0.006µs/byte, 0.008µs/byte, and 0.007µs/byte for Rosetta, SuRF, and fence pointers respectively. Thus, the overall overhead in Rosetta due to the more complex filter structure is minimal compared to SuRF or even compared to having no filter at all. Critically, this overhead is overshadowed by the improved FPR which, in turn, drastically improves query performance on every run of the tree. Of course, for purely insert-heavy workloads, no filters should be maintained (Rosetta, SuRF, or Bloom filters) as they would be constructed but never (or hardly ever) be used.

Figure 6: Rosetta incurs minimal filter construction overhead during compactions. [Panel (A): filter creation latency for SST sizes of 256 MB and 1 GB. Panel (B): workload execution latency for Rosetta, SuRF, and fence pointers only (F) at data sizes of 10M, 20M, and 30M, with writes broken down into compaction and filter creation.]

Rosetta Maintains RocksDB's Point-Query Efficiency. We now show that unlike other filters, Rosetta not only brings benefits in range queries but also does not hurt point queries. We experiment with a uniform workload of 50M keys. We vary the bits per key from 10 to 20 and measure the change in FPR for both SuRF and Rosetta, compared to the Bloom filters on RocksDB. Figure 7 shows the results. SuRF-Hash and SuRF-Real offer a drastically higher FPR. For example, the point-query FPR of SuRF-Hash is 10× worse compared to that of the default Bloom filter in RocksDB as it incurs more hash collisions with a large number of keys. Similarly, if we were to use Prefix Bloom filters, the FPR for point queries goes all the way up to 1 in this case. On the other hand, Rosetta only uses its last level Bloom filter which effectively indexes all bits of the keys and thus brings a great FPR. It keeps up the performance for larger memory budgets compared to the Bloom filters on RocksDB and even improves on it as more memory becomes available. The most critical observation from this experiment is that the filters which are meant to be used for range queries, i.e., SuRF-Real and Prefix Bloom filters, cannot give adequate point query performance. Thus, a key-value store would need to either maintain two filters per run (one for point queries and one for range queries), or lose out on performance for one of the query types. On the other hand, Rosetta can efficiently support both query types.

Figure 7: Rosetta maintains the point query performance on RocksDB. [Point-query FPR versus bits per key for the default Bloom filter, Rosetta, SuRF-Hash, SuRF-Real, and the Prefix Bloom filter.]
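The claim that point lookups need only the bottom of Rosetta's filter hierarchy can be made concrete with a toy sketch (ours; it glosses over Rosetta's actual tuning, which may collapse levels and spread memory unevenly among them):

#include <cstdint>
#include <functional>
#include <vector>

// Toy Bloom filter over 64-bit values using double hashing (illustrative
// only; not Rosetta's implementation).
class BloomFilter {
 public:
  BloomFilter(size_t num_bits, int num_hashes)
      : bits_(num_bits, false), k_(num_hashes) {}
  void Insert(uint64_t x) {
    for (int i = 0; i < k_; ++i) bits_[Position(x, i)] = true;
  }
  bool MayContain(uint64_t x) const {
    for (int i = 0; i < k_; ++i)
      if (!bits_[Position(x, i)]) return false;
    return true;
  }
 private:
  size_t Position(uint64_t x, int i) const {
    uint64_t h1 = std::hash<uint64_t>{}(x);
    uint64_t h2 = std::hash<uint64_t>{}(x ^ 0x9e3779b97f4a7c15ULL) | 1;
    return (h1 + static_cast<uint64_t>(i) * h2) % bits_.size();
  }
  std::vector<bool> bits_;
  int k_;
};

// One Bloom filter per prefix length: filters_[j] indexes the (j+1)-bit
// prefixes of every inserted key, so filters_[63] indexes the full keys.
class PrefixBloomHierarchy {
 public:
  PrefixBloomHierarchy(size_t bits_per_filter, int num_hashes)
      : filters_(64, BloomFilter(bits_per_filter, num_hashes)) {}

  void InsertKey(uint64_t key) {
    for (int len = 1; len <= 64; ++len)
      filters_[len - 1].Insert(key >> (64 - len));
  }

  // A point query probes only the last-level filter, which hashes the
  // complete key, i.e., the same work as RocksDB's default Bloom filter.
  bool MayContainKey(uint64_t key) const {
    return filters_[63].MayContain(key);
  }

  // A range query would additionally probe higher levels, once per dyadic
  // prefix covering the range (omitted here).

 private:
  std::vector<BloomFilter> filters_;
};

Because the bottom filter receives exactly one full-key insertion per key, its point-query behavior in this sketch tracks that of a standard Bloom filter with the same memory.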
Robust Benefits Across the Memory Hierarchy. For this experiment, we compare SuRF and Rosetta outside RocksDB to verify whether we see the same behavior as in the full system experiments. We use 10M keys, each of size 64 bits. The experiment is performed using a uniform workload. We measure end-to-end latency to access data in memory, HDD, or SSD, as shown in Figure 9. For all storage media, Rosetta does spend more time in filter probe cost (represented by p), but the improved resulting FPR translates to end-to-end benefits even in memory. Thus, this result verifies the same end-to-end behavior we observe when integrating the filters in RocksDB.

Figure 9: Rosetta improves performance across the memory hierarchy. [Response time of Rosetta and SuRF in memory, SSD, and HDD; in all three settings Rosetta pays a higher filter probe cost p (about 2.6 versus about 0.7 for SuRF) but achieves a far lower FPR (0.09 versus 0.95).]

Rosetta Provides Competitive Performance with String Data. Given that SuRF is based on a trie design while Rosetta is based on hashing, the natural expectation with string data is that SuRF will be better. We now show that Rosetta achieves competitive performance to SuRF with string data. We again use a standalone experiment here, outside RocksDB, to further isolate the properties of the filters.

We use a variable-length string data set, Wikipedia Extraction (WEX, https://aws.amazon.com/de/datasets/wikipedia-extraction-wex/), comprising a processed dump (of size 6M) of the English-language Wikipedia. The wiki markup for each article is transformed into machine-readable XML, and common relational features such as templates, infoboxes, categories, article sections, and redirects are extracted in tabular form. We generate a workload comprising 1 million queries drawn uniformly from the data set. We set the range size to be 128 and we vary the memory budget from 6 bits per key to 30 bits per key. For each setting of the memory budget, we record the FPR, the mean execution time of each query, and the filter probe cost. Figure 10 shows that overall both Rosetta and SuRF offer similar FPR, with SuRF outperforming Rosetta marginally. However, we also observe that there is a wide spectrum of memory budgets where SuRF cannot be applied as it demands a minimum memory of 20 bits per key to store the prefixes. Rosetta can support workloads with a low memory budget and still keep up a low FPR, leading to robust end-to-end performance.
Figure 10: Rosetta achieves good performance with strings across all memory budgets. [Mean response time per query, broken down into disk access, insertion, and filter probe costs, as the memory budget grows from 6 to 30 bits per key; bars are annotated with the achieved FPR.]

Figure 8: Rosetta exhibits a superior FPR-Memory tradeoff, compared to SuRF, across multiple workloads. Rosetta serves a larger spectrum of key-value workloads under diverse range sizes and memory budget. [Workload execution latency and FPR versus bits per key (10-30) for Rosetta and SuRF under uniform (A)-(C), correlated (E)-(G), and skewed (I)-(K) workloads with data sizes of 10M, 20M, and 30M; panels (D), (H), and (L) map which design is preferable for each combination of range size and memory budget, with analytical systems beyond range size 64 and no filter beyond a memory budget of 26 bits per key.]
Rosetta Exhibits a Superior FPR-Memory Tradeoff Across Diverse Workloads and Memory Budgets. We now return to the full RocksDB experiments and proceed to give a more high-level positioning on when to use which filter. We do that by fixing the range size to the longest range size of 64 (i.e., the worst case for Rosetta) and varying the bits per key allocated to each filter from 10 to 32. We also vary the data set to contain 10M to 30M keys. We measure both the FPR and the end-to-end workload execution latency for uniform, correlated, and skewed workloads.

Figures 8(A)-(C) depict results for uniform workloads. Figures 8(E)-(G) depict results for correlated workloads. And finally, Figures 8(I)-(K) depict results for skewed workloads. Across all workloads, we observe consistently that Rosetta achieves an overall superior performance and that it can make better use of additional memory to lower the FPR. Intuitively, this is because its core underlying structure is Bloom filters, where every additional bit may be utilized more, compared to SuRF, where the prefixes are culled beyond a definite length. Although the additional suffixes of SuRF-Real do help in reducing the FPR, it still falls behind the FPR achieved by Rosetta.
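As a rough, back-of-the-envelope guide to this effect (the textbook single-Bloom-filter approximation, not Rosetta's exact multi-filter analysis): a Bloom filter with m bits for n keys and k hash functions has

    FPR ≈ (1 − e^(−kn/m))^k,  minimized at k = (m/n)·ln 2,  giving  FPR ≈ 0.6185^(m/n).

For example, 10 bits per key gives an FPR of roughly 0.8% and 20 bits per key roughly 0.007%, so every additional bit per key multiplies the false positive rate by about 0.62. This is the sense in which every additional bit keeps being utilized by a Bloom-filter-based design.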
When the memory budget is very tight, SuRF does gain over Rosetta. However, the latency achieved at these constrained memory levels is substantially worse than with a few additional bits per key. Therefore, for most practical applications, especially on the cloud, it makes sense to devote the additional bits. This decision also depends on other factors such as the cloud cost policy, e.g., increasing the memory budget may lead to upgrading the hardware to high-end virtual machines, which also improves the cap on the maximum allowed throughput and hence may significantly reduce latency. Figure 11 repeats the same experiment for smaller ranges and shows that Rosetta is nearly always better.

Figure 11: Rosetta always improves performance for smaller range sizes. [Workload execution latency and FPR versus bits per key for range sizes 8, 16, and 32.]

Holistic Positioning of SuRF and Rosetta. Overall, our results indicate that both SuRF and Rosetta can bring significant advantages, complementing each other across different workloads. Figures 8(D), (H), and (L) summarize when each filter is most suitable with respect to data size, range size, and memory budget. All three figures denote the spectrum of scenarios of executing workloads on key-value stores. Within each figure, we partition horizontally to indicate different range sizes and vertically to indicate different memory budgets. The area beyond a memory budget of 26 bits per key is too high for filters to be used, and the area beyond a range size of 64 is meant to be served using analytical systems or column stores. Our observation is that Rosetta works better for short and medium range queries across all workload-memory combinations. For long range queries, although the point of separation slightly varies with the workload type, overall SuRF performs better with a low memory budget and Rosetta covers the area with a high memory budget.
6 RELATED WORK
We now discuss further related work on range filtering.

Range Filter with Optimal Space. Goswami et al. [44] analyze the space lower bound for a range filter. The primary technique of the proposed data structure is based on a pairwise-independent, locality-preserving hash function. The keys are hashed by the function, and then a weak prefix search data structure is built based on the hashed keys for answering range queries. Their algorithm does not use Bloom filters, dyadic ranges, or memory allocation strategies, which are the core design components of Rosetta.

Dyadic Intervals for Other Applications. The idea of decomposing large ranges of data into dyadic intervals has been applied widely for streaming applications [19, 20, 23] to enable computation of sophisticated aggregates – quantiles [41], wavelets [40], and histograms [39]. The canonical dyadic decomposition has also been exploited to determine and track the most frequently used data within a data store [22] as well as to support range query investigations for privacy-preserving data analysis, i.e., processing noisy variants of confidential user data [21]. Other efforts in this direction include designing persistent sketches so that the sketches can be queried about a prior state of the data [59, 70] and facilitating range queries using multi-keyword search with security-performance trade-offs [31].
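To make the dyadic decomposition concrete, the sketch below (ours, for illustration) splits an inclusive range of 64-bit keys into maximal dyadic intervals, i.e., aligned blocks of the form [p·2^i, (p+1)·2^i − 1]. Every such block is exactly the set of keys sharing one binary prefix of length 64 − i, so a structure built over prefixes (per-prefix Bloom filters, dyadic counters, and so on) can answer a range query with one probe per returned interval, at most about 2·log2 of the range size.

#include <cstdint>
#include <utility>
#include <vector>

// Split the inclusive range [low, high] into maximal dyadic intervals
// [p * 2^i, (p + 1) * 2^i - 1]. Each interval corresponds to one binary
// prefix of length 64 - i shared by all keys inside it.
std::vector<std::pair<uint64_t, uint64_t>> DyadicDecompose(uint64_t low,
                                                           uint64_t high) {
  std::vector<std::pair<uint64_t, uint64_t>> intervals;
  if (low == 0 && high == UINT64_MAX) {  // whole domain: a single prefix of length 0
    intervals.emplace_back(low, high);
    return intervals;
  }
  while (low <= high) {
    // Alignment of `low`: the largest i such that 2^i divides low
    // (__builtin_ctzll is a GCC/Clang builtin).
    int align = (low == 0) ? 64 : __builtin_ctzll(low);
    // Grow the block while it stays aligned and fits inside [low, high].
    uint64_t size = 1;
    int level = 0;
    while (level < align && size <= (high - low + 1) / 2) {
      size <<= 1;
      ++level;
    }
    intervals.emplace_back(low, low + size - 1);
    if (low + size - 1 == UINT64_MAX) break;  // avoid wrap-around at the top
    low += size;
  }
  return intervals;
}

For instance, DyadicDecompose(3, 10) returns {[3,3], [4,7], [8,9], [10,10]}: four prefix probes instead of eight point probes.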
Among these, the closest to our work are 1) the Count-Min sketch [20, 23] and 2) Persistent Bloom Filters (PBF) [59]. The Count-Min sketch integrates Bloom filters within dyadic partitions or segment trees. The work demonstrates the efficiency of approximate query answering using the integrated framework. However, Rosetta is different from the Count-Min sketch framework in the following ways: (i) Rosetta handles diverse workloads composed of different query types and interleaved access patterns, (ii) Rosetta specifically considers the CPU-memory contention affecting workload performance, and (iii) Rosetta considers resource-constrained situations where it is crucial to design an optimal allocation scheme for storing the filters and dyadic data partitions.

PBF [59] solves a different problem, i.e., answering a point query over a time range (e.g., does this IP address appear between 9pm and 10pm?). Our solution uses similar design elements, e.g., Bloom filters and dyadic range decomposition, but differs significantly in several ways. Firstly, we achieve a new balance, trading more CPU time for a better FPR. Furthermore, Rosetta achieves memory optimality with respect to the maximum range size and FPR, due to being able to effectively assign different memory budgets to different Bloom filters.

7 CONCLUSION
We introduce Rosetta, a probabilistic range filter designed specifically for LSM-tree based key-value stores. The core idea is to sacrifice filter probe time, which is not visible in end-to-end key-value store performance, in order to significantly reduce FPR. Rosetta brings benefits for short and medium range queries across various workloads (uniform, skewed, correlated) without hurting point queries.

ACKNOWLEDGEMENTS
We thank the anonymous reviewers for their valuable feedback. This work is partially funded by the USA Department of Energy project DE-SC0020200.
REFERENCES
[1] Abadi, D. J., Boncz, P. A., Harizopoulos, S., Idreos, S., and Madden, S. The Design and Implementation of Modern Column-Oriented Database Systems. Foundations and Trends in Databases 5, 3 (2013), 197–280.
[2] Alsubaiee, S., Altowim, Y., Altwaijry, H., Behm, A., Borkar, V. R., Bu, Y., Carey, M. J., Cetindil, I., Cheelangi, M., Faraaz, K., Gabrielova, E., Grover, R., Heilbron, Z., Kim, Y.-S., Li, C., Li, G., Ok, J. M., Onose, N., Pirzadeh, P., Tsotras, V. J., Vernica, R., Wen, J., and Westmann, T. AsterixDB: A Scalable, Open Source BDMS. Proceedings of the VLDB Endowment 7, 14 (2014), 1905–1916.
[3] Alsubaiee, S., Carey, M. J., and Li, C. LSM-based storage and indexing: An old idea with timely benefits. In Second International ACM Workshop on Managing and Mining Enriched Geo-Spatial Data (2015), pp. 1–6.
[4] Apache. Accumulo. https://accumulo.apache.org/.
[5] Apache. HBase. http://hbase.apache.org/.
[6] Armstrong, T. G., Ponnekanti, V., Borthakur, D., and Callaghan, M. LinkBench: a database benchmark based on the Facebook social graph. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (2013), pp. 1185–1196.
[7] Athanassoulis, M., Kester, M. S., Maas, L. M., Stoica, R., Idreos, S., Ailamaki, A., and Callaghan, M. Designing Access Methods: The RUM Conjecture. In Proceedings of the International Conference on Extending Database Technology (EDBT) (2016), pp. 461–466.
[8] Balmau, O., Didona, D., Guerraoui, R., Zwaenepoel, W., Yuan, H., Arora, A., Gupta, K., and Konka, P. TRIAD: Creating Synergies Between Memory, Disk and Log in Log Structured Key-Value Stores. In Proceedings of the USENIX Annual Technical Conference (ATC) (2017), pp. 363–375.
[9] Bender, M. A., Farach-Colton, M., Johnson, R., Kraner, R., Kuszmaul, B. C., Medjedovic, D., Montes, P., Shetty, P., Spillane, R. P., and Zadok, E. Don't Thrash: How to Cache Your Hash on Flash. Proceedings of the VLDB Endowment 5, 11 (2012), 1627–1637.
[10] Bloom, B. H. Space/Time Trade-offs in Hash Coding with Allowable Errors. Communications of the ACM 13, 7 (1970), 422–426.
[11] Bortnikov, E., Braginsky, A., Hillel, E., Keidar, I., and Sheffi, G. Accordion: Better Memory Organization for LSM Key-Value Stores. Proceedings of the VLDB Endowment 11, 12 (2018), 1863–1875.
[12] Bu, Y., Borkar, V. R., Jia, J., Carey, M. J., and Condie, T. Pregelix: Big(ger) Graph Analytics on a Dataflow Engine. Proceedings of the VLDB Endowment 8, 2 (2014), 161–172.
[13] Callaghan, M. CPU overheads for RocksDB queries. http://smalldatum.blogspot.com/2018/07/query-cpu-overheads-in-rocksdb.html, July 2018.
[14] Cao, Z., Chen, S., Li, F., Wang, M., and Wang, X. S. LogKV: Exploiting Key-Value Stores for Log Processing. In Proceedings of the Biennial Conference on Innovative Data Systems Research (CIDR) (2013).
[15] Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. Bigtable: A distributed storage system for structured data. In 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI) (2006), pp. 205–218.
[16] Chen, G. J., Wiener, J. L., Iyer, S., Jaiswal, A., Lei, R., Simha, N., Wang, W., Wilfong, K., Williamson, T., and Yilmaz, S. Realtime data processing at Facebook. In Proceedings of the 2016 International Conference on Management of Data (2016), pp. 1087–1098.
[17] CockroachLabs. CockroachDB. https://github.com/cockroachdb/cockroach.
[18] Cooper, B. F., Silberstein, A., Tam, E., Ramakrishnan, R., and Sears, R. Benchmarking cloud serving systems with YCSB. In Proceedings of the ACM Symposium on Cloud Computing (SoCC) (2010), pp. 143–154.
[19] Cormode, G. Sketch techniques for approximate query processing. In Foundations and Trends in Databases (2011).
[20] Cormode, G., Garofalakis, M. N., Haas, P. J., and Jermaine, C. Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches. Foundations and Trends in Databases 4, 1-3 (2012), 1–294.
[21] Cormode, G., Kulkarni, T., and Srivastava, D. Answering Range Queries Under Local Differential Privacy. In arXiv:1812.10942 (2018).
[22] Cormode, G., and Muthukrishnan, S. What's Hot and What's Not: Tracking Most Frequent Items Dynamically. In PODS (2003).
[23] Cormode, G., and Muthukrishnan, S. An Improved Data Stream Summary: The Count-Min Sketch and Its Applications. In LATIN 2004: Theoretical Informatics, 6th Latin American Symposium, Buenos Aires, Argentina, April 5-8, 2004, Proceedings (2004), vol. 2976, pp. 29–38.
[24] Dayan, N., Athanassoulis, M., and Idreos, S. Monkey: Optimal Navigable Key-Value Store. In Proceedings of the ACM SIGMOD International Conference on Management of Data (2017), pp. 79–94.
[25] Dayan, N., Athanassoulis, M., and Idreos, S. Optimal Bloom Filters and Adaptive Merging for LSM-Trees. ACM Transactions on Database Systems (TODS) 43, 4 (2018), 16:1–16:48.
[26] Dayan, N., Bonnet, P., and Idreos, S. GeckoFTL: Scalable Flash Translation Techniques For Very Large Flash Devices. In Proceedings of the ACM SIGMOD International Conference on Management of Data (2016), pp. 327–342.
[27] Dayan, N., and Idreos, S. Dostoevsky: Better space-time trade-offs for LSM-tree based key-value stores via adaptive removal of superfluous merging. In Proceedings of the 2018 International Conference on Management of Data (2018), pp. 505–520.
[28] Dayan, N., and Idreos, S. The log-structured merge-bush & the wacky continuum. In Proceedings of the 2019 International Conference on Management of Data (2019), pp. 449–466.
[29] de Berg, M., van Kreveld, M., Overmars, M., and Schwarzkopf, O. C. More geometric data structures. In Computational Geometry. 2000, pp. 211–233.
[30] DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. Dynamo: Amazon's Highly Available Key-value Store. ACM SIGOPS Operating Systems Review 41, 6 (2007), 205–220.
[31] Demertzis, I., Papadopoulos, S., Papapetrou, O., Deligiannakis, A., and Garofalakis, M. Practical Private Range Search Revisited. In ACM SIGMOD (2016).
[32] Dgraph. Badger Key-value DB in Go. https://github.com/dgraph-io/badger.
[33] Dharmapurikar, S., Krishnamurthy, P., and Taylor, D. E. Longest prefix matching using Bloom filters. In Proceedings of the 2003 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (2003), pp. 201–212.
[34] Dong, S., Callaghan, M., Galanis, L., Borthakur, D., Savor, T., and Strum, M. Optimizing Space Amplification in RocksDB. In Proceedings of the Biennial Conference on Innovative Data Systems Research (CIDR) (2017).
[35] Facebook. MyRocks. http://myrocks.io/.
[36] Facebook. RocksDB. https://github.com/facebook/rocksdb.
[37] Fan, B., Andersen, D. G., Kaminsky, M., and Mitzenmacher, M. Cuckoo Filter: Practically Better Than Bloom. In Proceedings of the ACM International Conference on emerging Networking Experiments and Technologies (CoNEXT) (2014), pp. 75–88.
[38] Gembalczyk, D., Schuhknecht, F. M., and Dittrich, J. An Experimental Analysis of Different Key-Value Stores and Relational Databases. In Datenbanksysteme fur Business, Technologie und Web (BTW'17) (2017).
[39] Gilbert, A., Guha, S., Indyk, P., Kotidis, Y., Muthukrishnan, S., and Strauss, M. Fast, small-space algorithms for approximate histogram maintenance. In Proceedings of the 34th ACM Symposium on Theory of Computing (2002).
[40] Gilbert, A., Kotidis, Y., Muthukrishnan, S., and Strauss, M. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In Proceedings of International Conference on Very Large Data Bases (2003).
[41] Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., and Strauss, M. How to summarize the universe: Dynamic maintenance of quantiles. In Proceedings of International Conference on Very Large Data Bases (2002).
[42] Golan-Gueta, G., Bortnikov, E., Hillel, E., and Keidar, I. Scaling Concurrent Log-Structured Data Stores. In Proceedings of the ACM European Conference on Computer Systems (EuroSys) (2015), pp. 32:1–32:14.
[43] Google. LevelDB. https://github.com/google/leveldb/.
[44] Goswami, M., Grønlund, A., Larsen, K. G., and Pagh, R. Approximate range emptiness in constant time and optimal space. In Proceedings of the twenty-sixth annual ACM-SIAM Symposium on Discrete Algorithms (2014), pp. 769–775.
[45] Idreos, S., Dayan, N., Qin, W., Akmanalp, M., Hilgard, S., Ross, A., Lennon, J., Jain, V., Gupta, H., Li, D., and Zhu, Z. Design continuums and the path toward self-designing key-value stores that know and learn. In Biennial Conference on Innovative Data Systems Research (CIDR) (2019).
[46] Jagadish, H. V., Narayan, P. P. S., Seshadri, S., Sudarshan, S., and Kanneganti, R. Incremental Organization for Data Recording and Warehousing. In Proceedings of the International Conference on Very Large Data Bases (VLDB) (1997), pp. 16–25.
[47] Jannen, W., Yuan, J., Zhan, Y., Akshintala, A., Esmet, J., Jiao, Y., Mittal, A., Pandey, P., Reddy, P., Walsh, L., Bender, M. A., Farach-Colton, M., Johnson, R., Kuszmaul, B. C., and Porter, D. E. BetrFS: A Right-optimized Write-optimized File System. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST) (2015), pp. 301–315.
[48] Jermaine, C., Omiecinski, E., and Yee, W. G. The Partitioned Exponential File for Database Storage Management. The VLDB Journal 16, 4 (2007), 417–437.
[49] Kahveci, T., and Singh, A. Variable length queries for time series data. In Proceedings of the 17th International Conference on Data Engineering (2001), pp. 273–282.
[50] Kondylakis, H., Dayan, N., Zoumpatianos, K., and Palpanas, T. Coconut palm: Static and streaming data series exploration now in your palm. In Proceedings of the 2019 International Conference on Management of Data (2019), pp. 1941–1944.
[51] Kondylakis, H., Dayan, N., Zoumpatianos, K., and Palpanas, T. Coconut: sortable summarizations for scalable indexes over static and streaming data series. The VLDB Journal (09 2019).
[52] Kyrola, A., and Guestrin, C. GraphChi-DB: Simple design for a scalable graph database system – on just a PC. arXiv preprint arXiv:1403.0701 (2014).
[53] Lakshman, A., and Malik, P. Cassandra - A Decentralized Structured Storage System. ACM SIGOPS Operating Systems Review 44, 2 (2010), 35–40.
[54] Li, Y., He, B., Yang, J., Luo, Q., Yi, K., and Yang, R. J. Tree Indexing on Solid State Drives. Proceedings of the VLDB Endowment 3, 1-2 (2010), 1195–1206.
[55] LinkedIn. Voldemort. http://www.project-voldemort.com.
[56] Lu, L., Pillai, T. S., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. WiscKey: Separating Keys from Values in SSD-conscious Storage. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST) (2016), pp. 133–148.
[57] Luo, C., and Carey, M. J. LSM-based Storage Techniques: A Survey. arXiv:1812.07527v3 (2019).
[58] O'Neil, P. E., Cheng, E., Gawlick, D., and O'Neil, E. J. The log-structured merge-tree (LSM-tree). Acta Informatica 33, 4 (1996), 351–385.
[59] Peng, Y., Guo, J., Li, F., Qian, W., and Zhou, A. Persistent bloom filter: Membership testing for the entire history. In Proceedings of the 2018 International Conference on Management of Data (2018), pp. 1037–1052.
[60] Pirza, P., Tatemura, J., Po, O., and Hacigumus, H. Performance Evaluation of Range Queries in Key Value Stores. J Grid Computing (2012).
[61] Raju, P., Kadekodi, R., Chidambaram, V., and Abraham, I. PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP) (2017), pp. 497–514.
[62] Ren, K., Zheng, Q., Arulraj, J., and Gibson, G. SlimDB: A Space-Efficient Key-Value Storage Engine For Semi-Sorted Data. Proceedings of the VLDB Endowment 10, 13 (2017), 2037–2048.
[63] Sears, R., Callaghan, M., and Brewer, E. Rose: Compressed, log-structured replication. 526–537.
[64] Sears, R., and Ramakrishnan, R. bLSM: A General Purpose Log Structured Merge Tree. In Proceedings of the ACM SIGMOD International Conference on Management of Data (2012), pp. 217–228.
[65] Shetty, P., Spillane, R. P., Malpani, R., Andrews, B., Seyster, J., and Zadok, E. Building Workload-Independent Storage with VT-trees. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST) (2013), pp. 17–30.
[66] Stanley, R. P. Catalan numbers. Cambridge University Press, 2015.
[67] Thonangi, R., and Yang, J. On Log-Structured Merge for Solid-State Drives. In Proceedings of the IEEE International Conference on Data Engineering (ICDE) (2017), pp. 683–694.
[68] Van Kreveld, M., Schwarzkopf, O., de Berg, M., and Overmars, M. Computational geometry algorithms and applications. Springer, 2000.
[69] Vincon, T., Hardock, S., Riegger, C., Oppermann, J., Koch, A., and Petrov, I. NoFTL-KV: Tackling write-amplification on KV-stores with native storage management.
[70] Wei, Z., Luo, G., Yi, K., Du, X., and Wen, J. R. Persistent Data Sketching. In ACM SIGMOD (2015).
[71] WiredTiger. Source Code. https://github.com/wiredtiger/wiredtiger.
[72] Wu, X., Xu, Y., Shao, Z., and Jiang, S. LSM-trie: An LSM-tree-based Ultra-Large Key-Value Store for Small Data Items. In Proceedings of the USENIX Annual Technical Conference (ATC) (2015), pp. 71–82.
[73] Yahoo. bLSM: Read-and latency-optimized log structured merge tree. https://github.com/sears/bLSM (2016).
[74] Zhang, Y., Li, Y., Guo, F., Li, C., and Xu, Y. ElasticBF: Fine-grained and Elastic Bloom Filter Towards Efficient Read for LSM-tree-based KV Stores. In Proceedings of the USENIX Conference on Hot Topics in Storage and File Systems (HotStorage) (2018).
