Rosetta: A Robust Space-Time Optimized Range Filter For Key-Value Stores
Algorithm 1: Insert
        /* K is the key to be inserted. */
1   Function Insert(K):
2       L ← number of bits in a key
3       B ← L Bloom filters
4       for i ∈ {L, . . . , 1} do
5           Bi.Insert(K)
6           K ← ⌊K/2⌋

[Figure 2: Rosetta indexes all prefixes of each key. For the example keys {3, 6, 7, 8, 9, 11} (base-10), the filters hold, in base-2: B1 = {0, 1}; B2 = {00, 01, 10}; B3 = {001, 011, 100, 101}; B4 = {0011, 0110, 0111, 1000, 1001, 1011}.]
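To make the prefix indexing concrete, here is a minimal C++ sketch of Algorithm 1 (our illustration under stated assumptions, not the paper's implementation); BloomFilter is a hypothetical helper class, and filters_[i − 1] plays the role of Bi:

    #include <cstdint>
    #include <functional>
    #include <vector>

    // Minimal Bloom filter over 64-bit values (illustrative, not tuned).
    class BloomFilter {
     public:
      explicit BloomFilter(size_t bits, int k = 6) : k_(k), a_(bits) {}
      void Insert(uint64_t v) {
        for (int i = 0; i < k_; ++i) a_[Pos(v, i)] = true;
      }
      bool MayContain(uint64_t v) const {  // false negatives are impossible
        for (int i = 0; i < k_; ++i)
          if (!a_[Pos(v, i)]) return false;
        return true;  // "probably present" -- may be a false positive
      }
     private:
      size_t Pos(uint64_t v, int i) const {  // double hashing: h1 + i*h2
        uint64_t h1 = std::hash<uint64_t>{}(v);
        uint64_t h2 = std::hash<uint64_t>{}(v ^ 0x9e3779b97f4a7c15ULL) | 1;
        return (h1 + uint64_t(i) * h2) % a_.size();
      }
      int k_;
      std::vector<bool> a_;
    };

    // One Bloom filter per prefix length: filters_[i - 1] indexes i-bit prefixes.
    struct Rosetta {
      Rosetta(size_t key_bits, size_t bits_per_filter)
          : L_(key_bits), filters_(key_bits, BloomFilter(bits_per_filter)) {}

      // Algorithm 1: insert every prefix of `key`, from the full L-bit key
      // down to the 1-bit prefix, by repeatedly halving the key.
      void Insert(uint64_t key) {
        for (size_t i = L_; i >= 1; --i) {
          filters_[i - 1].Insert(key);  // B_i receives the i-bit prefix
          key >>= 1;                    // K <- floor(K / 2)
        }
      }

      size_t L_;                          // L: number of bits in a key
      std::vector<BloomFilter> filters_;  // B_1 ... B_L
    };

With L = 4, Insert(3) stores 0011 in B4, 001 in B3, 00 in B2, and 0 in B1, matching the example of Figure 2.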
    n    Number of keys
    R    Maximum range query size
    M    Memory budget
    Bi   Bloom filter corresponding to length-i prefixes
    ϵi   False positive rate of Bi
    Mi   Memory allocated for Bi
    ϵ    Resulting false positive rate for a prefix query, after considering recursive probes
    ϕ    General reference for the FPR of a Bloom filter in Rosetta

Table 1: Terms used.

2.1 Constructing Rosetta for LSM-tree Runs

For every run in an LSM-tree, a Rosetta exists which contains information about all the entries within the run. A Rosetta is created either when the in-memory buffer is flushed to disk or when a new run is created because two or more runs of older data are merged into a new one. Thus, insertion in Rosetta is done once all the keys which are going to be indexed by this Rosetta instance are known. All keys within the run are broken down into variable-length binary prefixes, thereby generating L distinct prefixes for each key of length L bits. Each i-bit prefix is inserted into the i-th Bloom filter of Rosetta. The insertion procedure is described in Algorithm 1, and an example is shown in Figure 2. Let us take inserting 3 (corresponding to 0011) as an example. As shown in Figure 2 (the red numbers), we insert prefixes 0, 00, 001 and 0011 into B1, B2, B3 and B4 respectively. Here insertion means that we index every prefix in the Bloom filter by hashing it and setting the corresponding bits in the Bloom filter. The prefix itself is not stored. Figure 3 shows the final state of Rosetta after inserting all keys in a run (keys 3, 6, 7, 8, 9, 11 in our example). We do not need to take care of dynamic updates to the data set (e.g., a new key arriving with a bigger length) since each separate instance of Rosetta is created for an immutable run at a time. In addition, given that every run of the LSM-tree might have a different set of keys, different Rosetta instances across the tree can have a different number of Bloom filters.

Implicit Segment Tree in Rosetta. Rosetta effectively forms an implicit Segment Tree [68] (as shown in Figure 3) which is used to translate range queries into prefix queries. Segment Trees are full binary trees and thus perfectly described by their size, so there is no need to store information about the position of the left and right children of each node.

2.2 Range and Point Queries with Rosetta

2.2.1 Range Lookups. We now illustrate how a range lookup is executed in Rosetta. For every run in an LSM-tree, we first probe the respective Rosetta instance. Algorithm 2 shows the steps in detail and Figure 3 presents an example.

First, the target key range is decomposed into dyadic intervals. For each dyadic interval, there is a sub-tree within Rosetta where all keys within that interval have been indexed. For every dyadic interval, we first search for the existence of the prefix that covers all keys within that interval. This marks the root of the sub-tree for that interval. In the example of Figure 3, the range query range(8, 12) breaks down into two dyadic ranges: [8, 11] (the red shaded area) and [12, 12] (the green shaded area). We first probe the Bloom filter residing at the root level of the sub-tree. The Segment Tree is implicit, so the length of the prefix determines which Bloom filter to probe. For example, the prefix 10∗ contains every key in [8, 11] and nothing more (i.e., 1000, 1001, 1010, 1011), so for the range [8, 11] we will probe 10 in the second Bloom filter. If this probe returns a negative, we continue in the same way for the next dyadic range. If the probe for every dyadic range returns negative, the range of the query is empty. As such, in the example of Figure 3, we will probe the second Bloom filter for the existence of 10 and, if that returns negative, we will probe the fourth Bloom filter for the existence of 1100. If both probes return negative, the range is empty (when inserting a key we insert all of its prefixes into the corresponding Bloom filters, so, for example, if there was a key in the range [8, 11], we would have inserted 10 into the second Bloom filter).
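The decomposition step itself can be sketched as a stand-alone C++ helper (a hypothetical function of ours, not from the paper; it assumes the range does not span the entire 64-bit domain). It reproduces the example above: DyadicDecompose(8, 12) yields [8, 11] and [12, 12]:

    #include <cstdint>
    #include <utility>
    #include <vector>

    // Split [low, high] into maximal dyadic intervals: aligned power-of-two
    // blocks, each of which is covered by exactly one prefix.
    std::vector<std::pair<uint64_t, uint64_t>> DyadicDecompose(uint64_t low,
                                                               uint64_t high) {
      std::vector<std::pair<uint64_t, uint64_t>> out;
      while (true) {
        // Largest power-of-two block size permitted by the alignment of `low`.
        uint64_t size = (low == 0) ? (uint64_t(1) << 63) : (low & (~low + 1));
        // Shrink (staying a power of two) until the block fits in the range.
        while (size > high - low + 1) size >>= 1;
        out.push_back({low, low + size - 1});
        if (low + size - 1 == high) break;  // the whole range is covered
        low += size;
      }
      return out;
    }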
[Figure 3: Rosetta utilizes recursive prefix probing. The range query range(8, 12) = range(1000, 1100) splits at the split point into search(1000, 1001, 1010, 1011) for prefix 10 and search(1100); intermediate probes are search(10, 11), search(100, 101), and search(110) over BF1-BF4.]

Algorithm 2: Range Query
 1  Function RangeQuery(Low, High, P = 0, l = 1):
        /* [Low, High]: the query range; */
        /* P: the prefix. */
 2      if P > High or P + 2^(L−l+1) − 1 < Low then
            /* P is not contained in the range */
 3          return false
 4      if P ≥ Low and P + 2^(L−l+1) − 1 ≤ High then
            /* P is contained in the range */
 5          return Doubt(P, l)
 6      if RangeQuery(Low, High, P, l + 1) then
 7          return true
 8      return RangeQuery(Low, High, P + 2^(L−l), l + 1)
 9  Function Doubt(P, l):
10      if ¬Bl.PointQuery(P) then
11          return false
12      if l > L then
13          return true
14      if Doubt(P, l + 1) then
15          return true
16      return Doubt(P + 2^(L−l), l + 1)

If at least one prefix in the target range returns positive, then Rosetta initiates the process of doubting. We probe Bloom filters residing at deeper levels of the sub-tree and, in turn, determine the existence of all possible prefixes within the particular dyadic interval. In the example of Figure 3, when querying the range [8, 11], if the probe for prefix 10∗ is positive, we further probe for 100∗ and 101∗ in the third Bloom filter. If both of these probes are negative, we can mark the sub-range as empty, since any key with prefix 10∗ will either have the prefix 100∗ or the prefix 101∗.

In general, all Bloom filters at the subsequent levels of Rosetta are recursively probed to determine the existence of intermediate prefixes by traversing the left and right sub-trees in a pre-order manner. The left sub-tree corresponds to the current prefix with 0 appended, while the right sub-tree corresponds to the current prefix with 1 appended. Recursing within a sub-tree terminates if either (a) a Bloom filter probe corresponding to a left leaf in the sub-tree returns positive, which marks the whole sub-range as positive, or (b) case (a) does not happen while some Bloom filter probe corresponding to a right node returns negative, in which case the whole sub-range is marked as negative.
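Continuing the C++ sketch from Section 2.1 (again our illustration, with a slight reorganization of Algorithm 2), the lookups can be written as free functions over that Rosetta type; the recursion performs the dyadic decomposition implicitly, and we assume L ≤ 63 so the shifts are well-defined:

    #include <cstdint>

    // Doubting: confirm that some full key under the len-bit value `prefix`
    // is probably present; a negative from any Bloom filter is definitive.
    bool Doubt(const Rosetta& ros, uint64_t prefix, size_t len) {
      if (len > 0 && !ros.filters_[len - 1].MayContain(prefix)) return false;
      if (len == ros.L_) return true;              // survived down to a full key
      return Doubt(ros, prefix << 1, len + 1)        // left child: append 0
          || Doubt(ros, (prefix << 1) | 1, len + 1); // right child: append 1
    }

    // Point lookup (Section 2.2.2): only the last level is probed.
    bool PointQuery(const Rosetta& ros, uint64_t key) {
      return ros.filters_[ros.L_ - 1].MayContain(key);
    }

    // Range lookup: true if some key in [low, high] may exist.
    bool RangeQuery(const Rosetta& ros, uint64_t low, uint64_t high,
                    uint64_t prefix = 0, size_t len = 0) {
      uint64_t span = uint64_t(1) << (ros.L_ - len);  // keys under this node
      uint64_t lo = prefix * span, hi = lo + span - 1;
      if (lo > high || hi < low) return false;        // node outside the range
      if (low <= lo && hi <= high)                    // dyadic-interval root
        return Doubt(ros, prefix, len);
      // Partial overlap: descend into both children in pre-order.
      return RangeQuery(ros, low, high, prefix << 1, len + 1)
          || RangeQuery(ros, low, high, (prefix << 1) | 1, len + 1);
    }

For range(8, 12) with L = 4, RangeQuery probes B2 for the prefix 10 and B4 for 1100 before doubting downward, mirroring the example of Figure 3.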
In the event that the range is found to be positive, there is still a chance for Rosetta to reduce I/O by "tightening" the effective range to touch on storage. That is, if Rosetta finds that intervals at the edge of the requested range are empty, it will proceed to do I/O only for the narrower, effective key range which is determined as positive.

2.2.2 Point Lookups. The last level of Rosetta indexes the whole prefix of every key in the LSM-tree run that it represents. This is equivalent to indexing the whole key in a Bloom filter, which is how traditional Bloom filters are used in the default design of LSM-trees to protect point queries from performing unnecessary disk access. In this way, Rosetta can answer point queries in exactly the same way as point-query optimized LSM-trees: for every run of the LSM-tree, a point query only checks the last level of Rosetta for this run, and only tries to access disk if Rosetta returns a positive.

2.2.3 Probe Cost vs. FPR. The basic Rosetta design described above intuitively sacrifices CPU cost during probe time to improve on FPR. This is because of the choice to index all prefixes of keys and subsequently recursively probe those prefixes at query time. In the next two subsections, we show how to further optimize FPR through memory allocation and through further CPU sacrifices during probe time.

2.3 FPR Optimization Through Memory Allocation

A natural question arises: how much memory should we devote to each Bloom filter in a Rosetta instance for a given LSM-tree run? Alternatively, what is the false positive rate ϵi that we should choose for the filters in order to minimize the false positive rate for a given memory budget? Equally distributing the memory budget among all Bloom filters overlooks the different utilities of the Bloom filters at different Rosetta levels. In this section, we discuss this design issue.

A First-Cut Solution: FPR Equilibrium Across Nodes. Our first-cut solution is to achieve an equilibrium, such that an equal FPR ϵ is achieved for any segment tree node u by considering the filters both at u and at u's descendant nodes. As we will explain later in Section 3, this strategy has a good theoretical interpretation in terms of memory usage: it achieves a 1.44-approximation of optimal memory usage.
The detailed calculation for the FPR ϕ of the filter at u's level is as follows. By the aforementioned objective of equilibrium, we suppose that a node u's children all have false positive rate ϵ, after taking the filters of their respective descendant nodes into account. If we consider the filters of u and its descendants all together, an ultimate false positive arises in one of two cases: 1) the filter at u returns a positive and u's left sub-tree returns a positive, which leads to a positive rate ϕ · ϵ; or 2) the filter at u returns a positive and u's left sub-tree returns a negative while u's right sub-tree returns a positive, which gives a positive rate ϕ · (1 − ϵ) · ϵ. Putting these two cases together gives the following equation:

    ϕ · ϵ + ϕ · (1 − ϵ) · ϵ = ϵ  ⇒  ϕ · (2 − ϵ) = 1  ⇒  ϕ = 1/(2 − ϵ)

This means that we may set each non-terminal node's false positive rate to 1/(2 − ϵ).
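As a quick numeric illustration (our numbers, not the paper's): for a target per-node FPR of ϵ = 0.1,

    ϕ = 1/(2 − 0.1) = 1/1.9 ≈ 0.526,  and  ϕ·ϵ + ϕ·(1 − ϵ)·ϵ = 0.526·0.1 + 0.526·0.9·0.1 ≈ 0.1 = ϵ.

That is, each non-terminal filter can tolerate an FPR above 50% while its node as a whole still meets the 10% target; since a Bloom filter needs fewer bits per key as its FPR grows, the upper levels can be provisioned far more cheaply than the leaf level.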
Optimized Allocation of Memory Across Levels. Based on the aforementioned idea, there are still many Bloom filters that are treated equally (except for the last level). However, we find that, in general, the Bloom filters at shorter-prefix Rosetta levels are probed less than deeper levels. Intuitively, we should set lower FPRs for Bloom filters that are probed more frequently. To formalize this idea, we analyze the frequency of a segment tree node u being accessed (i.e., the Bloom filter at node u is probed), if we issue every query of size R once. Our analytical results are as follows.

Let level-r of Rosetta be the level which is r levels higher than the leaf level. Then, the frequency of a level-r node being accessed, g(r), is expressed as,

    g(r) = Σ_{0 ≤ c ≤ ⌊log R⌋ − r} g(r + c, R − 1)                                        (1)

    where g(x, R − 1) = 1 for x ∈ [0, ⌊log R⌋);
          g(x, R − 1) = (R − 2^x + 1)/2^x for x = ⌊log R⌋;
          g(x, R − 1) = 0 for x > ⌊log R⌋.                                                (2)

To incorporate the idea of achieving lower FPRs for less frequently accessed filters, we aim to minimize the objective Σ_{0 ≤ r ≤ ⌊log R⌋} g(r) · ϵr (where ϵr is the FPR of the level-r Bloom filter of Rosetta), subject to the overall memory budget M. Let Mr be the number of bits assigned to the level-r Bloom filter. Then the following expression minimizes the aforementioned function:

    Mr = −(n/(ln 2)²) · ln(C/g(r))                                                        (3)

    where C = (Π_{0 ≤ r ≤ ⌊log R⌋} g(r))^(1/⌊log R⌋) · e^(−(M/(n⌊log R⌋))·(ln 2)²)        (4)

The derivation of Equation 3 is as follows. First, we transform the objective function²:

    Σ_{0 ≤ r ≤ ⌊log R⌋} g(r) · ϵr = Σ_{0 ≤ r ≤ ⌊log R⌋} g(r) · e^(−(Mr/n)·(ln 2)²)        (5)

Then, by the AM-GM inequality³, letting h = 1 + ⌊log R⌋, we have,

    Σ_{0 ≤ r ≤ ⌊log R⌋} g(r) · e^(−(Mr/n)·(ln 2)²) ≥ h · (Π_{0 ≤ r ≤ h−1} g(r) · e^(−(Mr/n)·(ln 2)²))^(1/h)
    ⇒ Σ_{0 ≤ r ≤ ⌊log R⌋} g(r) · e^(−(Mr/n)·(ln 2)²) ≥ h · (Π_{0 ≤ r ≤ h−1} g(r))^(1/h) · e^(−(M/(nh))·(ln 2)²)
    ⇒ Σ_{0 ≤ r ≤ ⌊log R⌋} g(r) · e^(−(Mr/n)·(ln 2)²) ≥ (1 + ⌊log R⌋) · (Π_{0 ≤ r ≤ ⌊log R⌋} g(r))^(1/⌊log R⌋) · e^(−(M/(n⌊log R⌋))·(ln 2)²)    (6)

Given parameters R, r, M, n, the last value in Equation 6 is a constant. Hence, the minimum is achieved when the equality condition of the arithmetic-mean inequality holds, i.e.,

    g(r) · e^(−(Mr/n)·(ln 2)²) = C                                                        (7)

where C = (Π_{0 ≤ r ≤ ⌊log R⌋} g(r))^(1/⌊log R⌋) · e^(−(M/(n⌊log R⌋))·(ln 2)²). Transforming Equation 7 gives the optimal memory-bit allocation as shown in Equation 3. When solving for Mr gives a negative value, we reset Mr = 0 and rescale the other Mr′ for r′ ≠ r to maintain the sum of memory bits to be M.

² For ease of analysis, we assume each Bloom filter at a level in [0, ⌊log R⌋] has n unique keys.
³ (x1 + x2 + · · · + xh)/h ≥ (Π_{1 ≤ i ≤ h} xi)^(1/h), for xi ≥ 0; the equality holds when all xi are equal.
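A small C++ transcription of this allocation (our sketch: g(r) is expanded from Equations 1-2, C follows Equation 4, and negative allocations are clamped with the rest rescaled as described above; assumes R ≥ 2 and omits guards for brevity):

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Bits M_r for the Bloom filter at level r (r = 0 is the leaf level),
    // given n keys per filter, total budget M bits, and max range size R.
    std::vector<double> AllocateBits(double n, double M, double R) {
      const int logR = static_cast<int>(std::floor(std::log2(R)));
      const double ln2sq = std::log(2.0) * std::log(2.0);

      // g(r) from Equations 1-2: (logR - r) unit terms plus the boundary term.
      std::vector<double> g(logR + 1);
      for (int r = 0; r <= logR; ++r)
        g[r] = (logR - r) + (R - std::exp2(logR) + 1.0) / std::exp2(logR);

      // C from Equation 4, computed in log-space for numerical stability.
      double log_prod = 0.0;
      for (double v : g) log_prod += std::log(v);
      const double C = std::exp(log_prod / logR - (M / (n * logR)) * ln2sq);

      // M_r from Equation 3, clamped at zero and rescaled so the sum stays M.
      std::vector<double> Mr(logR + 1);
      double sum = 0.0;
      for (int r = 0; r <= logR; ++r) {
        Mr[r] = std::max(0.0, -(n / ln2sq) * std::log(C / g[r]));
        sum += Mr[r];
      }
      for (double& v : Mr) v *= M / sum;
      return Mr;
    }

Levels closer to the leaves (larger g(r), probed more often) receive more bits, while shallow levels whose Equation-3 value turns negative are zeroed out, exactly as the reset-and-rescale rule prescribes.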
2.4 FPR Optimization Through Further Probe Time Sacrifice

As we will show in Section 5, filter probe time is a tiny portion of overall query time in a key-value store. This means that we can explore opportunities to further sacrifice filter probe time if that might help us to further improve on FPR. In this section, we discuss such an optimization.

The core intuition of the optimization in this section is that the multiple levels of Rosetta require valuable bits. Effectively, all levels but the last one in each Rosetta instance help primarily with probe time, as their utility is to prune the parts of the range we are going to query for at the last level of Rosetta. The fewer such ranges, the lower the probe cost of the filter. At the same time, though, these bits at the top levels of Rosetta could be used to store prefixes with fewer conflicts at the bottom level, improving FPR. This gives a balance between FPR and probe cost. We present two new strategies based on this observation.

(i) Single-level filter for small ranges. This is an extreme design option of using only a single-level Rosetta. This means that all of Rosetta is stored in a single Bloom filter, maximizing the bits available to the last level. The drawback is probe time, as all ranges need to be queried; thus the probe time is linear in the range size.
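The linear probe cost is easy to see in code; a sketch reusing the BloomFilter type from the construction example (ours, for illustration only):

    #include <cstdint>

    // Single-level design: a range query must test every key in [low, high]
    // against the one Bloom filter, so probe cost grows linearly with R.
    bool SingleLevelRangeQuery(const BloomFilter& last_level,
                               uint64_t low, uint64_t high) {
      for (uint64_t k = low;; ++k) {
        if (last_level.MayContain(k)) return true;  // a possible hit
        if (k == high) break;  // checked here to avoid overflow past the max key
      }
      return false;  // a definitive negative: the range is empty
    }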
(ii) Variable-level filter for large ranges. An alternative approach to the extreme single-level design is to selectively push more bits at lower levels (e.g., more aggressively pushing bits at lower levels to the point where a level might be completely dropped).

[Figure: FPR vs. range size (2-512) for the single-level and variable-level designs.]
[Figure 5: Rosetta improves end-to-end RocksDB performance across diverse workloads. (A1) Uniform data (50M) + uniform workload + 22 bits/key; (A2) break-down of CPU cost for the uniform workload; (A3) comparison of FPR; (B) uniform data (50M) + correlated workload + 22 bits/key; (C) normal data (50M) + uniform workload + 22 bits/key; (D) Rosetta outperforms all baselines.]
Experimental Infrastructure. We use a NUC5i3ryh server with a 5th Generation Intel Core i3 CPU, 8GB main memory, a 1 TB SATA HDD at 7200 RPM, and a 120GB Samsung 850 EVO M.2 SSD. The operating system is Ubuntu 16.04 LTS.

Rosetta Improves the Performance of RocksDB. In our first experiment we demonstrate the benefits of Rosetta in a uniform workload while varying the query range from 1 to 64. For each range size, we measure the end-to-end performance in terms of workload execution latency, for both Rosetta-integrated RocksDB and SuRF-integrated RocksDB. Figure 5(A1) shows the results, denoting the measurements of Rosetta and SuRF using R and S, respectively. Figure 5(A1) also reports how the total cost for each system is spread among disk I/Os, CPU usage, and the overlap between the two. The total time emanating from disk I/Os and overall CPU usage are obtained using the RocksDB-reported metrics block_read_time and iter_seek_cpu_nanos. The cost for I/O and CPU taken together exceeds the total cost due to cycle-stealing within the processor architecture, which prevents wasting CPU cycles. Therefore, the CPU-I/O overlap contributes to both the CPU and I/O cost. For measuring the sub-costs due to serialization, deserialization, and filter probe, we have embedded custom statistics into RocksDB using the internal stopwatch() support.

Figure 5(A1) shows that the performance of Rosetta is up to 4× better compared to SuRF. For shorter ranges, up to a range size of 16, Rosetta incurs a negligible I/O overhead compared to SuRF with 22 bits per key. This behavior is due to the design differences between SuRF and Rosetta: SuRF employs a trie-culling design that prunes a full trie's bottom elements, which contain important information for answering short range queries, whereas Rosetta takes advantage of its lower-level Bloom filters for answering short range queries. This significantly reduces the FPR achieved by Rosetta for shorter ranges, as shown in Figure 5(A3). The mean FPR of Rosetta for shorter ranges is 0.00012, whereas it is 0.0242 for SuRF. This leads to a 70% reduction of I/O cost for shorter ranges in Rosetta. The gap closes for bigger ranges as more of the data has to be scanned and more areas in the filters need to be probed. As the range size increases beyond 16, although the FPR of Rosetta increases, it is still 1.5× less than that of SuRF, which reduces the I/O cost by 13.6%.

Sacrificing Filter Probe Cost Leads To Better Performance with Rosetta. In Figure 5(A2), we magnify the CPU cost reported in Figure 5(A1) and break it down further into sub-costs: fetching serialized filter bits before a filter probe, deserializing filter bits, probing the filter, and performing residual seek() operations, which include routine jobs performed by RocksDB iterators: looking for checksum mismatches and I/O errors, going forward and backward over the data, filters, and fence pointers, and creating and managing database snapshots for each query. Figure 5(A2) shows that the cost of serialization is negligible (about 0.7% for both SuRF and Rosetta), and hence is not visible within the stacked breakdown. The CPU overhead of deserialization is significant yet close across both filters: about 14.5% and 11.1% for Rosetta and SuRF, respectively. Overall, the costs of serialization and deserialization do not fluctuate across diverse range sizes as both these operations account for a constant amount of CPU work for each query.

The critical observation from Figure 5(A2) is that filter probe cost is a non-dominant portion of the overall CPU cost. Rosetta does incur a higher filter probe cost compared to SuRF (19% of total CPU cost compared to 5%). The reason is that Rosetta probes its internal Bloom filters once per dyadic interval of the target query range. Within each interval, the deeper Bloom filters are recursively probed until one of them returns a negative. On the other hand, with SuRF a probe involves only checking the existence of (i) a smaller number of (ii) fixed-length prefixes.

However, this CPU-intensive design leads to bigger gains for Rosetta. It allows Rosetta to achieve a lower FPR (Figure 5(A3)), resulting in a 2-4× end-to-end improvement for RocksDB (Figure 5(A1)). The improved FPR does not only help with I/O but also affects the fourth CPU overhead, contributing to the residual seek cost. Averaging over different range sizes, the overhead due to residual seek is about 65% and 82% for Rosetta and SuRF, respectively, thereby dominating the overall CPU cost. This overhead comprises (i) a fixed component (invariable across varied range sizes) due to the creation and maintenance of top-level iterators in RocksDB for all queries and (ii) a variable component due to longer usage of child iterators per query, caused by reading more data or fence pointer blocks in case of false or true positives. For example, as we compare the results across Figures 5(A1), (A2), and (A3), we notice that Rosetta shows a fixed overhead of about 2.7 seconds for shorter ranges, i.e., when FPR is significantly close to 0. The variable cost starts showing up as the FPR increases for range sizes 32 and 64. Due to the high FPR of SuRF, we observe that this part of its CPU overhead is significantly higher.

Therefore, sacrificing the CPU cost due to filter probes significantly reduces (a) the CPU overhead originating from seek() (which also dominates the total CPU cost) and of course (b) the total I/O cost of the workload due to lower FPR. The consolidated effects of (a) and (b) make a significant difference in the end-to-end performance for Rosetta.

Rosetta Enhances RocksDB for Key-Correlated Workloads. In order to investigate the benefits of Rosetta across diverse workload types, we repeat the previous experiment on both correlated and skewed workloads. First, we analyze workloads that exhibit prefix similarity between queried keys and stored keys. For this experiment, we introduce a correlation factor θ such that a range query with correlation degree θ has its lower bound at a distance θ from the lower bound generated using the distribution. Therefore, for a range query of size R, if the generated lower bound is l, then we set the actual lower bound to be at key l + θ. For this experiment, we set θ = 1.

Figure 5(B) depicts that when the workload entails a key correlation, Rosetta-integrated RocksDB on average achieves a 75% lower latency than that of SuRF-integrated RocksDB. This is because when the lower bound of a query falls close to a key, even if the range is empty, SuRF will always answer it as positive, since it will have likely culled that key's prefixes, which contain the information to answer the query. In contrast, for any query, Rosetta is able to decompose it into prefix checking, which is insensitive to the correlation.

Rosetta Enhances RocksDB for Skewed Key Distribution. For this experiment, we generate a dataset comprising 50M keys following a normal distribution and execute a uniform workload on the dataset. In Figure 5(C), we demonstrate that Rosetta offers up to 2× performance benefits over SuRF with skewed keys for end-to-end performance in RocksDB. As more skewed keys are added to a run, the number of distinct keys within the run decreases. This leads to more prefix collisions, which makes the trie-culling approach of SuRF vulnerable to false positives. On the other hand, Rosetta can resolve the collisions through more probes at the deeper-level Bloom filters. On average, Figure 5(C) shows that Rosetta enhances performance over SuRF by 52% and 19% for short and long range queries, respectively.

Rosetta improves default RocksDB by as much as 40×. We now demonstrate the benefits of Rosetta against the default filters of RocksDB using the 50M uniform workload setup (as in Figure 5(A)). In Figure 5(D), we observe that both Prefix Bloom filters and fence pointers offer significantly higher latency compared to both SuRF and Rosetta. This is because fence pointers cannot detect empty range queries, especially when the range size is significantly smaller than the size of the key domain (2^64). Prefix Bloom filters also cause a high FPR due to frequent prefix collisions as the range size increases. Rosetta offers the lowest latency among all baselines, bringing an up to 40× improvement over default RocksDB. While the benefit reduces with larger range sizes, key-value stores are fundamentally meant to support point lookups, writes, and short range queries as opposed to scans and long range queries [38, 57, 60].

Rosetta's Construction Cost Adds Minimal Overhead. All types of filters in an LSM-based key-value store need to be reconstructed after a merge occurs. In our next experiment, we demonstrate the impact of this construction cost as we vary the data size. First, we set the size of L0 to be very high (30 runs) so that the entire data fits into L0 and there is no compaction. This way, we can isolate and measure the filter construction cost without taking into account the overlap with the compaction overhead. We vary the SST file size between 256 MB and 1 GB which, in turn, increases the number of filter instances created from 5 to 59. Figure 6(A) shows that the average filter construction cost of Rosetta is about 14% less than that of SuRF. This is because Rosetta creates a series of Bloom filters, which are densely packed arrays, causing fewer memory misses compared to populating a tree.
Next, in order to study the more holistic impact of filters on write costs, we restore the size of L0 to 3 (as in other experiments) and measure the end-to-end workload execution latency for both filters as well as vanilla RocksDB with only fence pointers (denoted by F in the results). In Figure 6(B), we break down the costs due to reads and writes, and writes are further broken down into sub-costs emanating from both compaction and filter creation. We observe that the write overhead is lower for fence pointers as there is no filter construction overhead, but reads are expensive due to high false positives. On the other hand, both SuRF and Rosetta have higher write overhead due to the additional work of recreating filters after each compaction. To give more insights, we measure the total compaction time (T) and the read (R) and write (W) times, and compute the compaction overhead to be T/(R+W). We observe this overhead to be 0.006µs/byte, 0.008µs/byte, and 0.007µs/byte for Rosetta, SuRF, and fence pointers, respectively. Thus, the overall overhead in Rosetta due to the more complex filter structure is minimal compared to SuRF or even compared to having no filter at all. Critically, this overhead is overshadowed by the improved FPR which, in turn, drastically improves query performance on every run of the tree. Of course, for purely insert-heavy workloads, no filters should be maintained (Rosetta, SuRF, or Bloom filters) as they would be constructed but never (or hardly ever) be used.

[Figure 6: Rosetta incurs minimal filter construction overhead during compactions. (A) Filter creation latency for SST sizes of 256 MB and 1 GB; (B) workload execution latency for Rosetta, SuRF, and fence-pointer-only RocksDB at data sizes of 10M-30M, broken down into read, compaction, and filter creation costs.]

Rosetta Maintains RocksDB's Point-Query Efficiency. We now show that unlike other filters, Rosetta not only brings benefits in range queries but also does not hurt point queries. We experiment with a uniform workload of 50M keys. We vary the bits per key from 10 to 20 and measure the change in FPR for both SuRF and Rosetta, compared to the Bloom filters on RocksDB. Figure 7 shows the results. SuRF-Hash and SuRF-Real offer a drastically higher FPR. For example, the point-query FPR of SuRF-Hash is 10× worse compared to that of the default Bloom filter in RocksDB, as it incurs more hash collisions with a large number of keys. Similarly, if we were to use Prefix Bloom filters, the FPR for point queries goes all the way up to 1 in this case. On the other hand, Rosetta only uses its last-level Bloom filter, which effectively indexes all bits of the keys and thus brings a great FPR. It keeps up the performance for larger memory budgets compared to the Bloom filters on RocksDB and even improves on it as more memory becomes available.

The most critical observation from this experiment is that the filters which are meant for range queries, such as SuRF and Prefix Bloom filters, cannot give adequate point query performance. Thus, a key-value store would need to either maintain two filters per run (one for point queries and one for range queries), or lose out on performance for one of the query types. On the other hand, Rosetta can efficiently support both query types.

[Figure 7: Rosetta maintains the point query performance on RocksDB. Point-query FPR vs. bits/key for the Prefix Bloom filter, the default Bloom filter, and Rosetta.]

Robust Benefits Across the Memory Hierarchy. For this experiment, we compare SuRF and Rosetta outside RocksDB to verify whether we see the same behavior as in the full system experiments. We use 10M keys, each of size 64 bits. The experiment is performed using a uniform workload. We measure end-to-end latency with the data residing in memory, HDD, or SSD, as shown in Figure 9. For all storage media, Rosetta does spend more time in filter probe cost (represented by p), but the improved resulting FPR translates to end-to-end benefits even in memory. Thus, this result verifies the same end-to-end behavior we observe when integrating the filters in RocksDB.

[Figure 9: Rosetta improves performance across the memory hierarchy. End-to-end latency in memory, on SSD, and on HDD; p denotes filter probe cost, shown for 0.95-FPR and 0.09-FPR configurations.]

Rosetta Provides Competitive Performance with String Data. Given that SuRF is based on a trie design while Rosetta is based on hashing, the natural expectation with string data is that SuRF will be better. We now show that Rosetta achieves competitive performance to SuRF with string data. We again use a standalone experiment here, outside RocksDB, to further isolate the properties of the filters.

We use a variable-length string data set, Wikipedia Extraction (WEX)⁹, comprising a processed dump (of size 6M) of the English-language Wikipedia. The wiki markup for each article is transformed into machine-readable XML, and common relational features such as templates, infoboxes, categories, article sections, and redirects are extracted in tabular form. We generate a workload comprising 1 million

⁹ https://aws.amazon.com/de/datasets/wikipedia-extraction-wex/
[Figure 8: Rosetta exhibits a superior FPR-Memory tradeoff, compared to SuRF, across multiple workloads. Rosetta serves a larger spectrum of key-value workloads under diverse range sizes and memory budget. Panels sweep uniform, correlated, and skewed workloads over 10-30 bits/key and range sizes 2-64, showing latency and FPR for Rosetta and SuRF; some regions are marked "too high to use a filter" or "Analytical systems".]

… low memory budget and still keep up a low FPR, leading to robust end-to-end performance.

[Figure: disk access, insertion, and filter probe cost for Rosetta and SuRF at FPRs between 0.7 and 0.9.]