
Efficient Peer-To-Peer Lookup Based on a Distributed Trie

Michael J. Freedman∗
MIT Lab for Computer Science
mfreed@pdos.lcs.mit.edu

Radek Vingralek†
InterTrust STAR Lab
vingralekr@acm.org

∗ This work was partially done while the author was visiting InterTrust STAR Lab.
† Current and eternal contact: vingralekr@acm.org.

Abstract
Two main approaches have been taken for distributed key/value lookup operations in peer-to-peer systems: broadcast searches [1, 2] and location-deterministic algorithms [5, 6, 7, 9]. We describe a third alternative based on a distributed trie. This algorithm functions well in a very dynamic, hostile environment, offering security benefits over prior proposals. Our approach takes advantage of working-set temporal locality and global key/value distribution skews due to content popularity. Peers gradually learn system state during lookups, receiving the sought values and/or internal information used by the trie. The distributed trie converges to an accurate network map over time. We describe several modes of information piggybacking, and conservative and liberal variants of the basic algorithm for adversarial settings. Simulations show efficient lookups and low failure rates.

1 Introduction
We describe a set of algorithms for key-based lookup in a distributed system consisting of a number of uniform peers. A lookup service is a necessary component for peer-to-peer file-sharing systems, which need to map filenames to the location of peers that store them. Such an algorithm may return the files themselves or the addresses of the servers storing the files.

Most lookup algorithms deploy a lookup structure to efficiently locate values associated with keys. The lookup structure can be organized as an (extendible) hash table, a trie, a binary tree, a B-tree, etc. A distributed lookup algorithm trades off the efficiency of lookups for the maintenance of the lookup structure. Maintenance is initiated by either a key/value pair insert or by a membership change.

Lookups can be made very efficient by replicating the lookup structure on every peer. However, maintenance is slow as all peers must keep their lookup structure replicas consistent. To date, peer-to-peer systems have taken two approaches to reducing maintenance costs.

The first approach, adopted by Gnutella [2] and Freenet [1], eliminates the lookup structure altogether and thus requires no maintenance. However, lookups are implemented by a broadcast-like search on all known peers, costing efficiency and scalability. (Freenet aims to reduce overhead by broadcasting first to neighbors that previously returned similar keys and replicating key/value pairs along the entire path between sender and receiver.)

The second approach partitions the lookup structure and distributes a subset of partitions on each peer. Since maintenance updates are typically localized, peers update only a smaller number of partition replicas. The system assigns partitions to peers either statically or dynamically. In the static-assignment approach, peer addresses map to the key space and each node replicates only those partitions that are “close” to its address. Systems such as Chord [7] (mapping via consistent hashing), CAN [5] (routing on a d-torus), and Tapestry [9] and Pastry [6] (both related to Plaxton trees [4]) adopt this model. In the dynamic-assignment approach, peers replicate only partitions that they frequently access.

Relaxing the consistency criteria for partition replicas can further reduce maintenance costs. However, peers that update their local lookup structures lazily [3, 8] may hold stale replicas. They can commit addressing errors when requesting values from peers that are unavailable or that no longer hold the values. To ensure that the extra cost of addressing errors does not approach the cost of a broadcast, peers must limit addressing errors by piggybacking the updates on other traffic. Then, peers reconcile conflicting updates to achieve replica convergence.

We present a set of algorithms that exploit both dynamic partitioning based on peers’ access locality and lazy updates of the lookup structure to reduce maintenance cost. All algorithms piggyback trie state on lookup responses. Peers use timestamping to reconcile conflicting updates. The algorithms differ in the volume of the trie structure piggybacked and how aggressively the requester uses these partitions.

These algorithms do not preclude strong anonymity for either peers initiating a lookup or peers storing the corresponding value. The trie structure can index endpoints of fixed-length mix-net circuits, as opposed to the desired peers. The mix-net circuit will relay messages from an endpoint to an anonymized recipient. Therefore, we can provide anonymity at a constant cost of extra messages, by treating anonymity as a goal for the underlying communication channel.

Figure 1: Trie lookup structure (k = 10, m = 2). [Diagram not reproduced. The root node ********** carries the label "Root Routing Table"; its children are 00********, 01********, 10********, and 11********; below 01******** are 0100******, 0101******, 0110******, and 0111******; further down are 01010100**, 01010101**, 01010110**, and 01010111**, which lead to values V. Other labels: "m bits" and "2^m l-entry tables".]

2 System model
The system consists of n peers that all implement the following interface:

• lookup( key ). The callee sends to the caller the value associated with key if successful or a failure message. In both cases, the callee may piggyback additional state on the response (algorithm dependent).

• insert( key, value ). The callee inserts a <key, value> pair into its lookup structure.

• join(). The callee sends to the caller initial state needed to bootstrap lookup operations.

Peers cannot update or delete values that were previously inserted, although they can re-insert new versions under a different (or even the same) key. The caller is responsible for ensuring key uniqueness, if required. Peers join the system only intermittently, where “join” is defined as an initial entry into the network, not a reconnection after failure. Peers can leave at any time or fail silently; no maintenance operations are necessary in either case.

Each peer stores a number of key/value pairs locally, which were either inserted or looked up by the peer. The peer also stores partitions of a lookup structure organized as a trie, as shown in Figure 1. A trie representation is insensitive to the insertion ordering. Consequently, it is easier to merge two incomparable versions of the lookup structure.

Internal trie nodes consist of 2^m routing tables. Each routing table consists of l entries. An entry consists of a peer address a and a timestamp t. Each trie level “consumes” m bits of the k-bit key. If the node is a leaf (defined as having depth ⌈k/m⌉), then the entry in its matching routing table indicates that peer a was known at time t to hold the specified value. Otherwise, the entry at its ith routing table indicates that peer a was known at time t to hold a replica of the ith child of the node. The timestamps are generated locally by each peer; we assume loose clock synchronization (say, to within a few hours).

Each peer stores only the subset of trie nodes corresponding to its access pattern. However, all peers maintain the following invariant:

Ancestor invariant: If a peer holds a trie node, it must hold all ancestors of the node.

We represent trie paths more compactly by explicitly relying on this invariant: If a peer appears in some routing table, it is known to hold not only the node’s child, but also an entire path down to that child. Peers naturally maintain this invariant by replacing trie nodes and routing table entries based on access frequency. Logically, nodes closer to the trie root are more widely replicated by peers, removing any single point of failure.

3 Algorithms
The algorithms we present for a distributed trie lookup share most of the basic steps. They differ only in what state is piggybacked on lookup responses and how this state is used. We first describe the basic framework shared by all algorithms and subsequently the differences.

In order to join the distributed system, a peer needs to know the address of at least one participating peer, called its introducer. The introducer responds to a join request with its own root routing table, as labeled in Figure 1. The new peer uses this root routing table as a bootstrap mechanism when it exhausts the entries within its own trie during a lookup.

Insertion is performed locally by inserting the key/value pair (and the corresponding path from the root of the trie) into local storage. Alternatively, the peer could send an insert request to other peers, similar to [1]. The peers may decline to store the inserted pair. This paper does not consider such an insert operation, in order to focus solely on the effect of piggybacking state in lookups.

The lookup caller first checks local storage for the value corresponding to the lookup key. If present, the lookup process terminates. Otherwise, the caller initiates a distributed lookup process. We present an example of lookup(0101000000) for comprehension.

Caller A searches its local trie for the routing table that most specifically matches the lookup key. Such a routing table 010100* is shown with a solid box in Figure 1. Caller A then sends a lookup query to peer B, who has the latest timestamp in the routing table. Peer B is most recently known (to peer A) to hold the child trie node 010100* (that A does not currently have). B returns the actual value if B holds it. B’s response may either contain additional trie state or be a failure with no state. If B returns a deeper routing table, A drops down to that level and repeats this process. If B returns failure, caller A tries other peers in its routing table in decreasing timestamp order. If all such peers fail, A backtracks in its local trie and repeats the same process on the parent routing table. The figure shows a dashed line: routing table 010100* failed and is crossed out, and A backtracked to routing table 0101*.

Recall that each trie node’s routing tables are of maximum size l. Once the caller starts backtracking, we enumerate larger “virtual” routing tables: an entry list containing all peers thought to hold the desired child trie node. Peer A’s virtual view of 0101* is that level’s actual routing table merged with all tables in its subtrie (minus the entries in table 010100* already contacted). By the ancestor invariant, the peers holding trie nodes in the subtrie must also have a copy of the higher-level routing table. Therefore, we effectively increase the size of the higher-level routing tables without additional storage. If an entry’s routing table is full when new entries are installed, the least recent entry is evicted from this table and propagates up the trie according to timestamp.

Peers may backtrack their local tries during lookup up to the root routing table. If the value is not found at this time, the lookup process terminates with a failure.

3.1 Bounded, unbounded, and full path modes
The bounded, unbounded, and full path modes explore the tradeoff between the size of piggybacked trie state and the speed of convergence of peers’ tries to an accurate network map.

In bounded mode, the callee responds to a lookup with its most specific routing table matching the key (or the value itself), provided its routing table is more specific (deeper) than the caller’s current table. Otherwise, the callee responds with a failure. The caller integrates the more specific routing table into its local trie and proceeds with the lookup on this table.

In unbounded mode, the callee responds with its most specific routing table for the key, regardless of how this compares to the caller’s current routing table. This additional state is useful to pre-fetch information about new peers or more recent locations of higher-level trie nodes to be used when backtracking. The caller integrates the returned routing table into its local table at that same depth, by selecting the l most recent distinct entries from the two tables.

In full path mode, the callee responds with the entire path (consisting of routing tables) from the root table to its most specific routing table for the key. The caller integrates the path into its local trie using the same mechanism as in the unbounded mode.
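The lookup procedure of Section 3, restricted to bounded mode, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the simulator used in Section 4. The names Peer, merge_tables, deepest_prefix, handle_lookup, and lookup are hypothetical, as is the representation of a peer's trie as a dictionary from key prefix to routing table. An in-process dictionary stands in for the network, "distinct entries" is read as one entry per peer, the virtual table is used at every level rather than only after backtracking starts, and the upward propagation of evicted entries is omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical data model (an assumption, not taken from the paper's simulator).
Entry = Tuple[str, int]   # (peer address a, timestamp t)
Table = List[Entry]       # one routing table, at most l entries

@dataclass
class Peer:
    values: Dict[str, str] = field(default_factory=dict)  # locally stored key/value pairs
    trie: Dict[str, Table] = field(default_factory=dict)  # key prefix -> routing table
    alive: bool = True                                     # False models a silent failure

def merge_tables(local: Table, received: Table, l: int) -> Table:
    """Keep the l most recent entries, counting each peer once (newest timestamp)."""
    newest: Dict[str, int] = {}
    for peer, ts in local + received:
        newest[peer] = max(ts, newest.get(peer, ts))
    return sorted(newest.items(), key=lambda e: e[1], reverse=True)[:l]

def deepest_prefix(trie: Dict[str, Table], key: str, m: int) -> str:
    """Longest prefix of key, in steps of m bits, that has a routing table in trie."""
    best = ""
    for depth in range(m, len(key) + 1, m):
        if key[:depth] in trie:
            best = key[:depth]
    return best

def handle_lookup(callee: Peer, key: str, caller_depth: int, m: int):
    """Bounded-mode callee: return the value, a strictly deeper table, or a failure."""
    if key in callee.values:
        return "value", callee.values[key]
    prefix = deepest_prefix(callee.trie, key, m)
    if len(prefix) > caller_depth:
        return "table", (prefix, callee.trie[prefix])
    return "failure", None

def lookup(caller: Peer, key: str, network: Dict[str, Peer], m: int, l: int) -> Optional[str]:
    if key in caller.values:                      # local hit, no messages sent
        return caller.values[key]
    contacted: Set[str] = set()
    prefix = deepest_prefix(caller.trie, key, m)
    while prefix:
        # "Virtual" table: this table merged with every table in the subtrie below it.
        virtual = [e for p, t in caller.trie.items() if p.startswith(prefix) for e in t]
        descended = False
        for peer, _ in sorted(virtual, key=lambda e: e[1], reverse=True):
            if peer in contacted:
                continue
            contacted.add(peer)
            callee = network.get(peer)
            if callee is None or not callee.alive:
                continue                          # addressing error: try the next entry
            kind, payload = handle_lookup(callee, key, len(prefix), m)
            if kind == "value":
                caller.values[key] = payload      # peers keep pairs they looked up
                return payload
            if kind == "table":                   # integrate the deeper table, drop down
                deeper, table = payload
                caller.trie[deeper] = merge_tables(caller.trie.get(deeper, []), table, l)
                prefix, descended = deeper, True
                break
        if not descended:
            prefix = prefix[:-m]                  # backtrack to the parent routing table
    return None                                   # backtracked past the root: failure

# Usage: A's most specific entry (D) has failed, so A backtracks and reaches C via B.
net = {"A": Peer(), "B": Peer(), "C": Peer(), "D": Peer(alive=False)}
net["C"].values["0101000000"] = "v"
net["B"].trie = {"01": [("B", 1)], "0101": [("C", 7)]}
net["A"].trie = {"01": [("B", 5)], "0101": [("D", 9)]}
print(lookup(net["A"], "0101000000", net, m=2, l=10))
```

Unbounded and full path modes would change only handle_lookup (returning the most specific table unconditionally, or every table from the root down) and would reuse merge_tables on the caller side.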

3.2 Conservative and liberal modes
Most peer-to-peer lookup algorithms are susceptible to malicious behavior. We describe one particular attack conceptually similar to DNS cache poisoning and propose a conservative mode to resist this attack.

A malicious peer can effectively suppress access to a value by falsely advertising the availability of some key (with a recent timestamp) and then dropping lookup requests. A set of l malicious peers can cause innocent peers to completely replace the entries in their routing tables with malicious peer addresses. While backtracking can help route around this problem by finding less-specific routing tables, these malicious peers have caused the system to lose efficiency.

We propose a verified-only update heuristic for a conservative mode. Namely, callers update their local trie with only the new entries that transitively led them to the desired value. This assumes that peers can verify the validity of returned data. For example, if peer B returns {C,D,E} to peer A at depth 1, peer C returns {F,G,H} to A at depth 2, and peer F returns the actual value, peer A updates his trie only with peers B, C, and F in the corresponding-depth routing tables. Conservative mode ensures that entries in our routing tables have performed useful work in the past. Therefore, we hope they will continue to be useful.

In liberal mode, callers immediately update their local tries with any piggybacked state.

4 Preliminary experimental evaluation
We implemented a simulator to compare the performance of the algorithm modes described in Section 3. We present preliminary simulation results in this section.

We simulated a system consisting of 200 peers. The tries maintained by the peers were characterized by parameters k = 10, m = 2 and l = 10. Each experiment started with an initial loading phase, when the peers inserted a total of 2,000 randomly-generated key/value pairs in the system. These pairs were distributed randomly among all peers. The keys were uniformly distributed in [0, 2^k − 1]. Subsequently, at each simulated time-step, we issued a lookup to a randomly-selected peer and dynamically changed the membership by removing peers or adding new ones with a probability 0.005.

For each lookup phase we collected the following statistics:

• Message overhead. We classify lookups as either local (i.e., those that could be satisfied by a peer locally) or remote (i.e., those that required sending lookup operations to other peers). For each remote lookup, we measured the number of lookup operations that were sent to other peers (transitively) in order to satisfy the request.

• Failure probability. For each lookup, we measured the probability of its failure. A lookup fails when the requesting peer’s trie did not contain sufficient information to locate an existing key/value pair (even after contacting other peers).

Figure 2: Probability of lookup failure. [Plot not reproduced. x-axis: average number of lookups per peer (0 to 500); y-axis: probability of lookup failure (0 to 0.012); one curve for each of bounded, unbounded, and full path mode, in conservative and liberal variants.]

Figure 3: The 90th percentile of the number of messages generated by a remote lookup. [Plot not reproduced. x-axis: average number of lookups per peer (0 to 500); y-axis: number of messages (0 to 100); same six curves as Figure 2.]

The failure probabilities of all algorithm modes are found in Figure 2. The message overheads of all algorithm modes are found in Figures 3, 4, and 5. The three figures respectively show the 90th, 50th, and 10th percentiles of the number of query/response message pairs (i.e., lookup operations) that were generated by a remote lookup. The reported values are conservative, as the message overhead explicitly excludes lookups that were satisfied locally. We also did not generate keys according to a skewed distribution, so peers could not benefit from access locality. Based on the limited test data, we make the following conjectures:

• The message overheads of all modes converge toward less than log n lookup operations per remote lookup. The failure probability of all modes converges toward zero. Thus peers seem capable of correctly learning a recent view of the distributed trie.

• There is an implicit trade-off between message size and convergence time to low message overhead. Full path mode converges the fastest, yet sends the largest messages. Bounded mode converges the slowest, but with the benefit of sending smaller messages.

• The conservative modes exhibit greater variances in message overhead, although their averages converge similarly to those of liberal modes. Similarly, the lookup-failure probability of conservative modes (for non-full path modes) also converges toward zero, but at a slower rate than others. Thus peers can reduce the risk of system infiltration by malicious peers by running in the conservative mode with only moderate performance consequences.
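The difference between the two update policies of Section 3.2, which underlies the last conjecture, can be sketched as follows. This is an illustrative sketch only: the function names, the per-depth dictionary of routing tables, and the choice of depth 0 for peer B are assumptions, and updates are applied once after the lookup rather than as each response arrives.

```python
from typing import Dict, List, Tuple

# Hypothetical, simplified model: one routing table per depth on the looked-up path.
Entry = Tuple[str, int]            # (peer address, timestamp)
Tables = Dict[int, List[Entry]]    # depth -> routing table

def install(tables: Tables, depth: int, entries: List[Entry], l: int) -> None:
    """Merge entries into the table at this depth, keeping the l most recent peers."""
    newest = dict(tables.get(depth, []))
    for peer, ts in entries:
        newest[peer] = max(ts, newest.get(peer, ts))
    tables[depth] = sorted(newest.items(), key=lambda e: e[1], reverse=True)[:l]

def apply_lookup_result(tables: Tables, piggybacked: Tables, useful_path: Tables,
                        found: bool, mode: str, l: int) -> None:
    if mode == "liberal":
        updates = piggybacked      # install every piggybacked entry
    elif found:
        updates = useful_path      # conservative: only peers that led to the verified value
    else:
        updates = {}               # conservative and no verified value: install nothing
    for depth, entries in updates.items():
        install(tables, depth, entries, l)

# The example from Section 3.2: B returns {C,D,E}, C returns {F,G,H}, F returns the value.
now = 100
piggybacked = {1: [("C", now), ("D", now), ("E", now)],
               2: [("F", now), ("G", now), ("H", now)]}
useful_path = {0: [("B", now)], 1: [("C", now)], 2: [("F", now)]}

conservative: Tables = {}
apply_lookup_result(conservative, piggybacked, useful_path, True, "conservative", l=10)
liberal: Tables = {}
apply_lookup_result(liberal, piggybacked, useful_path, True, "liberal", l=10)
print(conservative)
print(liberal)
```

Under the conservative policy an entry enters a table only after it has contributed to a successful, verified lookup, which is why a peer that advertises a key and then drops requests cannot displace existing entries in this sketch.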
In addition to these experiments with dynamic membership, we ran a similar series of experiments with static membership. The results were practically the same as those reported in the included figures (we exclude the corresponding graphs due to space constraints). Therefore, we conjecture that all of the algorithm modes are relatively resilient to system membership changes.

Figure 4: The 50th percentile of the number of messages generated by a remote lookup. [Plot not reproduced. x-axis: average number of lookups per peer (0 to 500); y-axis: number of messages (0 to 100); one curve for each of bounded, unbounded, and full path mode, in conservative and liberal variants.]

Figure 5: The 10th percentile of the number of messages generated by a remote lookup. [Plot not reproduced. Same axes and curves as Figure 4.]

5 Conclusions
We propose a new approach to key/value lookup in peer-to-peer systems based on a distributed trie. Compared to broadcast-based routing, the algorithms clearly lead to a lower message overhead.

Compared to the static-partitioning-based (or location-deterministic) approaches, these algorithms can deliver a lower message overhead, given sufficient time for peers to learn the distribution of frequently-accessed keys. Message overhead does not depend on the lexicographical distance between the peer and its looked-up key, nor are peers required to maintain state about regions of the keyspace that they do not access. We perform maintenance of the trie partitions lazily, and can thus avoid installing updates to keys that are never requested by a peer. Since there is no deterministic mapping between a key and the peers that store the corresponding value, it would be more difficult for an adversary to target all replicas of a particular key/value pair.

On the other hand, our algorithms can degenerate to a broadcast for peers with very stale views, while the static-partitioning-based approaches have logarithmic upper bounds on message overhead per lookup. We envision that our algorithms may be used in conjunction with a static-partitioning algorithm: A peer may send a limited number of lookup operations based on its local trie and, only if all failed, revert to a static-partitioning-based lookup. Symmetrically, if the static-partitioning algorithms seek to reduce their (fixed) message overhead by caching additional state, they can use our algorithms to maintain consistency of the cached state.

References
[1] Ian Clarke, Oscar Sandberg, Brandon Wiley, and Theodore Hong. Freenet: A distributed anonymous information storage and retrieval system. In Proceedings of the Workshop on Design Issues in Anonymity and Unobservability, pages 46–66, July 2000.
[2] Gnutella website. http://gnutella.wego.com.
[3] W. Litwin, M. Neimat, and D. Schneider. LH* - linear hashing for distributed files. In Proceedings of the ACM SIGMOD Conference, May 1993. Washington, DC.
[4] C. Plaxton, R. Rajaraman, and A. Richa. Accessing nearby copies of replicated objects in a distributed environment. In Proceedings of the ACM SPAA, pages 311–320, June 1997.
[5] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A scalable content-addressable network. In Proc. ACM SIGCOMM, San Diego, 2001.
[6] Antony Rowstron and Peter Druschel. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In Proceedings of the 18th IFIP/ACM International Conference on Distributed Systems Platforms (Middleware 2001), November 2001.
[7] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In Proc. ACM SIGCOMM, San Diego, 2001.
[8] R. Vingralek, Y. Breitbart, and G. Weikum. Distributed file organization with scalable cost/performance. In Proceedings of the ACM-SIGMOD International Conference on Management of Data, May 1994. Minneapolis, MN.
[9] Ben Zhao, John Kubiatowicz, and Anthony Joseph. Tapestry: An infrastructure for fault-tolerant wide-area location and routing. Technical Report UCB/CSD-01-1141, Computer Science Division, U. C. Berkeley, April 2001.
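The hybrid scheme suggested in Section 5, in which a peer spends a limited number of trie-based lookup operations before reverting to a static-partitioning lookup, might be sketched as follows. Everything here is hypothetical: hybrid_lookup, ask_peer, static_lookup, and the budget parameter are illustrative names, and the two callables stand in for whatever trie-based and static-partitioning (for example Chord-style) mechanisms a real system would provide.

```python
from typing import Callable, Iterable, Optional

def hybrid_lookup(key: str,
                  trie_candidates: Iterable[str],
                  ask_peer: Callable[[str, str], Optional[str]],
                  static_lookup: Callable[[str], Optional[str]],
                  budget: int) -> Optional[str]:
    """Try at most `budget` peers suggested by the local trie, then fall back.

    trie_candidates: peers from the local trie, most promising first (assumed input).
    ask_peer(peer, key): returns the value or None on failure (assumed callable).
    static_lookup(key): the static-partitioning lookup used as a fallback (assumed).
    """
    sent = 0
    for peer in trie_candidates:
        if sent >= budget:
            break                      # budget exhausted: stop before degenerating to a broadcast
        sent += 1
        value = ask_peer(peer, key)
        if value is not None:
            return value
    return static_lookup(key)          # logarithmic-cost fallback in the static scheme

# Usage with stub callables: only P3 holds the value.
holders = {"P3": "v"}
ask = lambda peer, key: holders.get(peer)
fallback = lambda key: "v-from-static-lookup"
print(hybrid_lookup("0101000000", ["P1", "P2", "P3"], ask, fallback, budget=2))
print(hybrid_lookup("0101000000", ["P1", "P2", "P3"], ask, fallback, budget=3))
```

The budget bounds the worst-case number of messages spent on a stale trie, so the total cost per lookup stays within the budget plus the cost of the static lookup.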
