
Chord

Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach,
Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
Google, Inc.
OSDI 2006
Introduction
 Dynamo stores objects associated with a key
through a simple interface:
 get(), put()
 It should be possible to scale Dynamo
incrementally
 This requires the ability to partition data
over the set of nodes (storage hosts)
 Dynamo relies on a concept called consistent
hashing
 The approach they used is similar to that found in
Chord.
Distributed Hash Tables (DHT)
 Operationally like standard hash tables
 Stores (key, value) pairs
 The key is like a filename
 The value can be file contents or pointer to
location
 Goal: Efficiently insert/lookup/delete
(key,value) pairs
 Each peer stores a subset of (key, value)
pairs in the system
DHT
 Core operation: Find node responsible for
a key
 Map key to node
 Efficiently route insert/lookup/delete request
to this node
 Allow for frequent node arrivals and
departures
DHT
 Introduce a hash function to map the object being searched
for to a unique global identifier:
 e.g., h(“NGC’02 Tutorial Notes”) → 8045
 Distribute the range of the hash function among all nodes in
the network

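[Figure: the hash range divided among the nodes, with overlapping ranges 0-999, 1000-1999, 1500-4999, 4500-6999, 7000-8500, 8000-8999, 9000-9500, and 9500-9999; the object with ID 8045 falls within 7000-8500 and 8000-8999.]
 Each node must “know about” at least one copy of each object that hashes within its range (when one exists)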
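As a rough illustration of the range assignment in the figure, the sketch below finds which nodes cover a given identifier. The ranges are the ones shown on the slide; the node labels A to H and the function name `responsible_nodes` are hypothetical, and the identifier 8045 is taken directly from the slide rather than computed by a real hash function.

```python
# Ranges copied from the slide's figure; the node labels are hypothetical.
RANGES = {
    "A": (0, 999),
    "B": (1000, 1999),
    "C": (1500, 4999),
    "D": (4500, 6999),
    "E": (7000, 8500),
    "F": (8000, 8999),
    "G": (9000, 9500),
    "H": (9500, 9999),
}


def responsible_nodes(object_id):
    """Return the labels of all nodes whose range covers object_id."""
    return sorted(
        label for label, (low, high) in RANGES.items() if low <= object_id <= high
    )


# The slide's example object hashes to 8045.
print(responsible_nodes(8045))
```

Because the ranges overlap, 8045 is covered by two nodes here (the ones holding 7000-8500 and 8000-8999), so either of them could hold a copy of the object.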
DHT:Desirable Properties
 Key ID space (search space) is uniformly populated
 Mapping of keys to IDs using (consistent) hashing

 A node is responsible for indexing all the keys in a certain subspace of the ID space
 Nodes have only partial knowledge of other nodes’ responsibilities
 Messages should be routed to a node efficiently
(small number of hops)
 Node arrival/departure should only affect a few
nodes.
Consistent Hashing
 The main idea: map both keys and nodes
(node IPs) to the same (metric) ID space

 The ring is just one possibility; any metric space will do.
Consistent Hashing
 With high probability, the hash function
balances load (all nodes receive roughly the
same number of keys).
 With high probability, when a node joins (or leaves) an N-node network, only an O(1/N) fraction of the keys are moved to a different location.
 This is clearly the minimum necessary to maintain a balanced load.
Consistent Hashing
 The consistent hash function assigns each node
and key an m-bit identifier using SHA-1 as a base
hash function.
 A node’s identifier is chosen by hashing the node’s
IP address.
 A key identifier is produced by hashing the key.
 For more info see:
 D. R. Karger, E. Lehman, F. Leighton, M. Levine, D. Lewin, and R. Panigrahy, “Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web,” in Proc. 29th ACM Symp. Theory of Computing, El Paso, TX, May 1997, pp. 654–663.
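To make the key-movement claim concrete, here is a minimal sketch that hashes node addresses and keys onto one identifier ring and reports which keys change owner when a node joins. It assumes SHA-1 reduced modulo 2^m as the identifier function (the slides do not say how the digest is shortened to m bits); the names `ident`, `owner`, and `moved_keys`, the value of `M`, and the sample addresses and keys are all hypothetical.

```python
import hashlib
from bisect import bisect_left

M = 32  # identifier bits; an arbitrary choice for this sketch


def ident(text, m=M):
    """Map a string to an m-bit identifier (SHA-1, reduced modulo 2**m)."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** m)


def owner(ring, key_id):
    """First node ID on the sorted ring that is >= key_id, wrapping around."""
    i = bisect_left(ring, key_id)
    return ring[i % len(ring)]


def moved_keys(node_names, new_node_name, keys):
    """Return the keys whose owner changes when new_node_name joins."""
    old_ring = sorted(ident(name) for name in node_names)
    new_ring = sorted(old_ring + [ident(new_node_name)])
    return [
        k for k in keys
        if owner(old_ring, ident(k)) != owner(new_ring, ident(k))
    ]


nodes = ["10.0.0.%d" % i for i in range(1, 11)]   # hypothetical node IPs
keys = ["object-%d" % i for i in range(1000)]     # hypothetical keys
moved = moved_keys(nodes, "10.0.0.11", keys)
print(len(moved), "of", len(keys), "keys moved")
```

Every key that moves ends up on the newly joined node; keys owned by all other nodes stay where they were, which is the property that distinguishes consistent hashing from rehashing with `hash(key) % N`.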
P2P Middleware: Differences
 P2P middleware systems differ in:
 The choice of the ID space
 The structure of their network of nodes (i.e.
how each node chooses its neighbors)
 For each object, node(s) whose range(s) cover
that object must be reachable via a “short”
path
 This is a major research topic
Chord
 m bit identifier space for both keys and
nodes
 Key identifier = SHA-1(key)
 e.g., Key = “LetItBe” → SHA-1 → ID = 50
 e.g., Key = “129.100.16.93” → SHA-1 → ID = 70
 How do we assign keys to nodes?
Chord
 Nodes are organized in an identifier circle based on their node identifiers
 Keys are assigned to their successor node on the identifier circle, i.e., the first node whose ID is equal to or follows the key’s ID.
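The successor rule can be sketched in a few lines. This assumes a global, sorted view of all node IDs, which a real Chord node does not have; the function name `successor` and the node IDs (borrowed from the join example later in these slides) are illustrative only.

```python
from bisect import bisect_left


def successor(node_ids, key_id):
    """Return the first node ID that is >= key_id, wrapping past zero."""
    ring = sorted(node_ids)
    i = bisect_left(ring, key_id)
    return ring[i % len(ring)]


nodes = [8, 14, 21, 32, 38, 48, 56]   # illustrative node IDs

print(successor(nodes, 50))   # key ID 50 (the "LetItBe" example)
print(successor(nodes, 60))   # past the highest node ID, so it wraps
print(successor(nodes, 21))   # a key equal to a node ID stays on that node
```

With these node IDs, key 50 is assigned to node 56, key 60 wraps around the circle to node 8, and key 21 is assigned to node 21 itself.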
Chord
 Hash function
ensures even
distribution of
nodes and keys on
the circle
 Range covered by
node is from
previous ID up to
its own ID
 Assume an N node
network
Chord: Search Possibilities
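 Routing table size vs. search cost
 Every peer knows every other peer: O(N) routing table size, O(1) search time
 Every peer knows only its successor: O(1) routing table size, O(N) search time
 The “compromise” is to have each peer know up to m other peers, chosen at exponentially increasing distances around the circle (the finger table).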
Finger Table
 Let m be the number of bits in the key/node identifiers
 Each node, n, maintains a routing table with at most m entries called the finger table.
 The ith entry in the table at node n contains the identity of the first node, s, that succeeds n by at least $2^{i-1}$ on the identifier circle:
$s = \mathrm{successor}\big((n + 2^{i-1}) \bmod 2^m\big)$
 s is called the ith finger of node n
Chord: Finger Table

Finger table: $\mathrm{finger}[i] = \mathrm{successor}\big((n + 2^{i-1}) \bmod 2^m\big)$, where $1 \le i \le m$

O(log N) table size
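A minimal sketch of building a finger table from the formula above. It assumes m = 6 and a ring that contains only the node IDs named in the join example later in these slides; the helper names `successor` and `finger_table` are hypothetical, and a real node would discover these entries through lookups rather than from a global list.

```python
from bisect import bisect_left


def successor(ring, ident):
    """First node ID in the sorted ring that is >= ident, wrapping around."""
    i = bisect_left(ring, ident)
    return ring[i % len(ring)]


def finger_table(n, ring, m):
    """finger[i] = successor((n + 2**(i-1)) mod 2**m) for i = 1..m."""
    return [
        successor(ring, (n + 2 ** (i - 1)) % 2 ** m)
        for i in range(1, m + 1)
    ]


ring = [8, 14, 21, 32, 38, 48, 56]   # assumed ring membership
print(finger_table(21, ring, m=6))
print(finger_table(8, ring, m=6))
```

Under these assumptions the table for node 21 comes out as N32, N32, N32, N32, N38, N56, which matches the "old" finger table for N21 shown in the join example.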


The Chord algorithm –
Scalable node localization
Chord: Search
 Assume node n is searching for key k.
 Node n does the following:
 Find the ith table entry of node n such that k ∈ [finger[i].start, finger[i+1].start), where finger[i].start = $(n + 2^{i-1}) \bmod 2^m$
 If no such entry exists then return the node in
the last entry of the finger table
 The above two steps are repeated until the
condition in the first step is satisfied.
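The routing loop can be sketched as follows. This version uses the closest-preceding-finger rule described in the Chord paper (forward to the farthest finger that still precedes the key), which may differ in detail from the interval test on the slide. It assumes m = 6, a static ring made of the node IDs from the join example, and finger tables computed from a global view; all function names are hypothetical.

```python
from bisect import bisect_left

M = 6               # identifier bits (assumed for this sketch)
SIZE = 2 ** M


def successor(ring, ident):
    """First node ID in the sorted ring that is >= ident mod 2**M."""
    i = bisect_left(ring, ident % SIZE)
    return ring[i % len(ring)]


def fingers(n, ring):
    """Finger table of node n: successor(n + 2**(i-1)) for i = 1..M."""
    return [successor(ring, n + 2 ** (i - 1)) for i in range(1, M + 1)]


def in_open(x, a, b):
    """True if x lies strictly inside the ring interval (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b


def in_half_open(x, a, b):
    """True if x lies in the ring interval (a, b]."""
    return x == b or in_open(x, a, b)


def lookup(start, key, ring):
    """Route a lookup for key from node start; return (owner, path)."""
    tables = {n: fingers(n, ring) for n in ring}
    node, path = start, [start]
    # Stop once the key falls between this node and its immediate successor.
    while not in_half_open(key, node, tables[node][0]):
        # Forward to the farthest finger that still precedes the key.
        node = next(f for f in reversed(tables[node]) if in_open(f, node, key))
        path.append(node)
    return tables[node][0], path


ring = [8, 14, 21, 32, 38, 48, 56]   # assumed ring membership
print(lookup(8, 54, ring))
```

Starting at node 8, the lookup for key 54 jumps to node 48 (the farthest finger of node 8 that precedes 54), and node 48 sees that 54 lies between itself and its successor, node 56, which is therefore responsible for the key.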
Chord: Join
 Nodes can join (and leave) at any time.
 Challenge: Preserving the ability to locate
every key in the network
 Chord must preserve the following:
 Each node’s successor correctly maintained
 For every key k, node successor(k) is
responsible for k.
 For lookups to be fast, it is desirable for
the finger tables to be correct.
Chord: Join Implementation
 Each node in Chord maintains a
predecessor pointer.
 This consists of the Chord ID and IP address
of the immediate predecessor of that node.
 It can be used to walk counterclockwise around
the identifier circle.
 The new node to be added learns the identity of an existing Chord node by some external mechanism
Chord: Join Initialization Steps
 Assume n is the node to join.
 Find any existing node, n’.
 Find successor of n from n’. Label this
successor(n).
 Ask successor(n) for its predecessor. This
is labelled as predecessor(successor(n)).
Chord: Join Example
• Assume N26 wants to join; it finds N8
• N8’s finger table suggests that N26 will be “between” N21 and N32.
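The two initialization steps (find successor(n), then ask it for its predecessor) can be checked on this example with a small sketch. It assumes a global sorted list of node IDs in place of the remote queries a joining node would actually send through n’; the function name `join_pointers` and the ring membership are hypothetical.

```python
from bisect import bisect_left


def join_pointers(ring, n):
    """Return (successor(n), predecessor(successor(n))) for a joining node n."""
    ring = sorted(ring)
    i = bisect_left(ring, n) % len(ring)
    succ = ring[i]
    pred = ring[i - 1]        # index -1 wraps to the highest node ID
    return succ, pred


ring = [8, 14, 21, 32, 38, 48, 56]   # assumed ring membership
print(join_pointers(ring, 26))
```

For N26 this yields N32 as the successor and N21 as that successor's current predecessor, so N26 is inserted between N21 and N32.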
Chord: Join (Initialize finger table)
 Node n needs to have its finger table initialized
 Node n can ask its predecessor for a copy of its finger table and use it as a starting point
Chord: Join (Changing Existing Finger Tables)
 Node n needs to be entered into the finger tables of some existing nodes.
 Node n becomes the ith finger of node p iff
 p precedes n by at least $2^{i-1}$; and
 The ith finger of node p succeeds n.
 The first node, p, that satisfies these conditions is the immediate predecessor of $n - 2^{i-1}$
 For a given n, the algorithm starts with the ith finger of node n and then continues to walk in the counterclockwise direction on the identifier circle until it encounters a node whose ith finger precedes n.
Chord: Join Example (add N26)
N21 (old finger table)      N21 (new finger table)
N21+1    N32                N21+1    N26
N21+2    N32                N21+2    N26
N21+4    N32                N21+4    N26
N21+8    N32                N21+8    N32
N21+16   N38                N21+16   N38
N21+32   N56                N21+32   N56

i=1: Does N21 precede N26 by at least 1 ($2^{i-1}$)? Yes: N21+1 becomes N26.
i=2: Does N21 precede N26 by at least 2? Yes: N21+2 becomes N26.
i=3: Does N21 precede N26 by at least 4? Yes: N21+4 becomes N26.
i=4: Does N21 precede N26 by at least 8? No: evaluate N14.
Chord: Join Example (add N26)
N14 (old finger table)      N14 (new finger table)
N14+1    N21                N14+1    N21
N14+2    N21                N14+2    N21
N14+4    N21                N14+4    N21
N14+8    N32                N14+8    N26
N14+16   N32                N14+16   N32
N14+32   N48                N14+32   N48

i=4: Does N14 precede N26 by at least 8? Yes: N14+8 becomes N26.
i=5: Does N14 precede N26 by at least 16? No: evaluate N8.
Etc.
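The update rule (n becomes the ith finger of p iff p precedes n by at least 2^(i-1) and p's current ith finger succeeds n) can be applied mechanically to the whole example. The sketch below assumes m = 6 and a ring containing only the node IDs named on the slides before N26 joins; it scans every node instead of walking counterclockwise as the real algorithm does, and the name `finger_updates` is hypothetical.

```python
from bisect import bisect_left

M = 6               # identifier bits (assumed for this sketch)
SIZE = 2 ** M


def successor(ring, ident):
    """First node ID in the sorted ring that is >= ident mod 2**M."""
    i = bisect_left(ring, ident % SIZE)
    return ring[i % len(ring)]


def finger_updates(ring, n):
    """Return the (p, i) pairs whose ith finger becomes n once n has joined."""
    updates = set()
    for p in ring:
        d_n = (n - p) % SIZE                     # clockwise distance from p to n
        for i in range(1, M + 1):
            old = successor(ring, p + 2 ** (i - 1))
            d_old = (old - p) % SIZE or SIZE     # distance to p's current ith finger
            # p precedes n by at least 2**(i-1), and the old finger succeeds n.
            if 2 ** (i - 1) <= d_n < d_old:
                updates.add((p, i))
    return updates


ring = [8, 14, 21, 32, 38, 48, 56]   # assumed ring membership before the join
print(sorted(finger_updates(ring, 26)))
```

The result agrees with the tables above: fingers 1 to 3 of N21 and finger 4 of N14 switch to N26. Under the assumed ring membership, finger 5 of N8 and finger 6 of N56 change as well, which is where the "Etc." on the slide would continue.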
Chord: Join (Transferring Keys)
 Move responsibility for all the keys for
which node n is the successor.
 Typically this involves moving data
associated with each key to the new node.
 Node n can become the successor only for keys that were previously the responsibility of the node immediately following n.
 Node n therefore only needs to contact that one node to transfer responsibility for all relevant keys.
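A minimal sketch of the key transfer, assuming each node keeps its keys in a dictionary indexed by key ID. The new node takes over exactly the keys in the ring interval (predecessor, n] from its successor; the function names and the sample store contents are hypothetical.

```python
def in_half_open(x, a, b):
    """True if x lies in the ring interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b


def transfer_keys(successor_store, pred_id, new_id):
    """Remove the keys in (pred_id, new_id] from the successor and return them."""
    moved = {
        k: v for k, v in successor_store.items()
        if in_half_open(k, pred_id, new_id)
    }
    for k in moved:
        del successor_store[k]
    return moved


# Hypothetical contents of N32's store before N26 joins between N21 and N32.
n32_store = {22: "a", 24: "b", 26: "c", 30: "d"}
n26_store = transfer_keys(n32_store, pred_id=21, new_id=26)
print(n26_store, n32_store)
```

Only N32 is contacted: keys 22, 24, and 26 move to N26, while key 30 stays on N32.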
Chord: Join
 The previous discussion on join focuses on a
single node join.
 What if there are multiple node joins?
 Join requires that each node’s successor is
correctly maintained
Chord: Stabilization Protocol
 The successor/predecessor links are
rebuilt by periodic stabilize notification
messages
 Sent by each node to its successor to inform it
of the (possibly new) identity of the
predecessor
 The successor pointers are used to verify
and correct finger table entries.
Chord: Join/Stabilize Example

• N26 joins the system
• N26 acquires N32 as its successor
• N26 notifies N32
• N32 acquires N26 as its predecessor
Chord: Join/Stabilize Example

• N26 copies keys
• N21 runs stabilize() and asks its successor N32 for its predecessor, which is N26.
Chord: Join/Stabilize Example

• N21 acquires N26 as its successor
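The example can be replayed with a small sketch of stabilize() and notify(), modeled on the pseudocode in the Chord paper. It assumes in-process objects instead of network messages and models only the three nodes involved; the class and attribute names are hypothetical.

```python
def in_open(x, a, b):
    """True if x lies strictly inside the ring interval (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b


class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self
        self.predecessor = None

    def stabilize(self):
        """Ask the successor for its predecessor and adopt it if it is closer."""
        x = self.successor.predecessor
        if x is not None and in_open(x.id, self.id, self.successor.id):
            self.successor = x
        self.successor.notify(self)

    def notify(self, candidate):
        """Accept candidate as predecessor if it is closer than the current one."""
        if self.predecessor is None or in_open(
            candidate.id, self.predecessor.id, self.id
        ):
            self.predecessor = candidate


n21, n26, n32 = Node(21), Node(26), Node(32)
n21.successor, n32.predecessor = n32, n21   # links that exist before the join
n26.successor = n32                         # N26 joins and acquires N32 as its successor
n26.stabilize()                             # N26 notifies N32
n21.stabilize()                             # N21 learns about N26 from N32
print(n21.successor.id, n26.predecessor.id, n32.predecessor.id)
```

After the two stabilize() calls, N21's successor and N32's predecessor are both N26, and N26's predecessor is N21, which is the state reached in the example.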


Chord Stabilization
 Pointers and finger tables may be in a state
of flux
 Is it possible that data will not be found?
 Yes

 Recovery: try again


Chord: Node Failure

[Figure: identifier circle with nodes N10, N80, N85, N102, N113, and N120, and a Lookup(90) request.]

N80 doesn’t know its correct successor, so the lookup is incorrect


Chord: Node Failure
 Solution: Use successor lists
 Each node knows r immediate successors
 After failure, will know first live successor
 Stabilize messages correct finger tables
 Replicas of the data associated with a key at the r
successor nodes might be used
 Application dependent
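A minimal sketch of how a successor list lets a node skip over failed successors. It assumes r = 3 and some way of telling which nodes are alive (modeled here as a set); the node IDs come from the failure figure, but which nodes have failed is an assumption made for this example, and the function name is hypothetical.

```python
def first_live_successor(successor_list, alive):
    """Return the first entry of successor_list that is still alive, or None."""
    for node_id in successor_list:
        if node_id in alive:
            return node_id
    return None


# N80's successor list with r = 3, using node IDs from the failure figure.
n80_successors = [85, 102, 113]

# Assumption for this example: N85 and N102 have failed.
alive = {10, 80, 113, 120}

print(first_live_successor(n80_successors, alive))
```

With N85 and N102 down, N80 falls back to N113, so a lookup for key 90 can still be directed to a live node. If all r successors fail at once, the function returns None and the ring is broken at that point, which is why r is chosen large enough to make this unlikely.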
Chord Properties

 In a system with N nodes and K keys, with high probability…
 each node is responsible for roughly K/N keys (at most (1 + ε)K/N)
 each node maintains info. about O(log N) other nodes
 lookups are resolved with O(log N) hops
 a node join or leave requires O(log² N) messages
 The developers of Chord validated this through
simulation studies.
 No consistency among replicas
 Hops have poor network locality
Chord: Network Locality
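 Nodes close on the ring can be far apart in the network.
[Figure: Chord ring nodes N20, N40, N41, and N80 overlaid on a map of geographically dispersed hosts (vu.nl, Lulea.se, OR-DSL, CMU, MIT, MA-Cable, Cisco, CA-T1, Cornell, CCI, NYU, Aros, Utah).]
* Figure from http://project-iris.net/talks/dht-toronto-03.ppt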
