An Efficient Failure Detector For Sparsely Connected Networks
Martin Hutle ∗
Embedded Computing Systems Group
Institute for Computer Engineering
Vienna University of Technology
Treitlstraße 3/2
A-1040 Vienna, Austria
hutle@ecs.tuwien.ac.at
• a time-stamp $last_p[q]$ that holds the last round in which $p$ received a new heartbeat from $q$.

These arrays grow dynamically with each new process that $p$ learns of. For simplicity, the algorithm uses them as if they were statically allocated. The set $detected_p$ contains all processes that the failure detector does not suspect, i.e., the failure detector's output.

The algorithm for process $p$ is presented in Figure 1. It comprises a periodic task, which sends part of its local heartbeat table to its current neighbors, and a receiver task, which updates the local table when it receives new heartbeats from neighbors.

Initially, $p$ knows and detects only itself. Every $T$ time steps, $p$ increases its own heartbeat counter, which is also used as the local round number. Every $\Delta^k$ rounds, all known processes with distance $k$ are put into the set $unsent_p$. Since only entries from this set are sent, this ensures that every heartbeat of a process with distance $k$ is broadcast at most once every $\Delta^k$ rounds. In every round, the processor id, heartbeat, and distance of the $\Delta + 1$ processes in this set with the lowest distance to $p$ are sent and removed from $unsent_p$. If $p$ does not receive a new update from a process it has previously detected for a sufficiently long time, it suspects that process.

The receiver task of $p$ increases the distance counter of each message it receives by one. If this distance counter is smaller than its own estimate, it adopts the new distance. When it receives a heartbeat newer than the one in its own table, it adopts this heartbeat and, if this is not already the case, detects the heartbeat's origin.

Theorem 1. Algorithm 1 implements strong completeness.

Proof. If $p$ and $q$ become disconnected, eventually $hbc_p[q]$ will not grow anymore, since only $q$ can increase $hbc_q[q]$,

The variable $distance_p[q]$ at time $t$ is $p$'s estimate of $D(p, q, t)$. If a longer path is faster than a shorter one, this estimate may not be exact, but it is never smaller than the initial real distance:

Lemma 2. For any $q \in P_k(p, 0)$, $distance_p[q] \ge k$.

Proof. The variable $distance_p[q]$ is only set when $p$ receives a heartbeat of $q$. At $q$, $distance_q[q] = 0$, and with every hop along the path from $q$ to $p$ this hop counter is increased. Therefore $distance_p[q]$ contains the length of the path the heartbeat took. By definition every process in $P_k(p, 0)$ is initially at distance $k$ to $p$, and the distance is monotonically increasing. Therefore the length of this path must be greater than or equal to $k$.

Lemma 3. A heartbeat of process $q$ from $P_k(p, 0)$ is sent by $p$ at most once every $\Delta^k$ rounds.

Proof. A process $q$'s heartbeat is sent by $p$ only if it was previously put into $unsent_p$. This happens only every $\Delta^{distance_p[q]}$ rounds (lines 14+15). According to Lemma 2, $distance_p[q] \ge k$ and therefore $\Delta^{distance_p[q]} \ge \Delta^k$.

The following lemma shows that, although every process sends only a constant number of messages per round, our scheduling function ensures that the heartbeat counter of every process is forwarded periodically by every process. In the following, we also call the value of $hbc_p[p]$ the current round number.

Lemma 4. If $k = distance_p[q]$ at some time $t$, the heartbeat of $q$ is broadcast by $p$ at most $2\Delta^k$ rounds after $t$.

Proof. Let $Q = \{q' \mid distance_p[q'] \le k\}$ be the set of processes whose estimated distance to $p$ is less than or equal to $k$. We first show that the maximum number $m$ of messages
 1  variables
 2    ∀q ∈ Π : hbc_p[q], distance_p[q], last_p[q] ∈ ℕ    /* heartbeat table */
 3    unsent_p ⊆ Π
 4    detected_p ⊆ Π                                      /* the failure detector output */
 5
 6  initially
 7    ∀q : hbc_p[q] = 0
 8    distance_p[p] = 0, ∀q ≠ p : distance_p[q] = ∞
 9    detected_p = {p}
10    unsent_p = ∅
11
12  every T time steps do:
13    hbc_p[p] = last_p[p] = hbc_p[p] + 1                 /* this is also the local round number */
14    for each k ≥ 0, such that ∆^k divides hbc_p[p] do:
15      add all q with distance_p[q] = k to unsent_p
16    for 1 to ∆ + 1 do:
17      q = an item from unsent_p for which distance_p[q] is minimal
18      remove q from unsent_p
19      broadcast_p(q, hbc_p[q], distance_p[q])
20    for each q ∈ detected_p do:
21      if (hbc_p[p] − last_p[q]) · T > η∆^k + kε then    /* η = 2T/(∆−1), k = distance_p[q] */
22        remove q from detected_p                        /* suspect q */
23        distance_p[q] = ∞
24
25  on deliver_p(q, new_hbc, new_dist) do:
26    if distance_p[q] > new_dist + 1 then
27      distance_p[q] = new_dist + 1
28    if new_hbc > hbc_p[q] then                          /* more recent heartbeat */
29      hbc_p[q] = new_hbc                                /* adopt heartbeat */
30      last_p[q] = hbc_p[p]                              /* set last reception to current round */
31      if q ∉ detected_p then
32        add q to detected_p                             /* detect q */

Figure 1. Failure detector algorithm for any process p. It comprises a periodic task and a message handler.
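The two tasks of Figure 1 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the author's implementation: the class name `Process`, the method names, the tie-breaking by process id, the removal of a suspected process from the unsent set, and the zero-delay helper `exchange` are all hypothetical choices that do not appear in the paper.

```python
import math
from dataclasses import dataclass, field

INF = math.inf


@dataclass
class Process:
    pid: str
    delta: int    # maximum degree (assumed >= 2 so that eta is defined)
    T: float      # length of one round in time steps
    eps: float    # jitter bound
    hbc: dict = field(default_factory=dict)
    distance: dict = field(default_factory=dict)
    last: dict = field(default_factory=dict)
    unsent: set = field(default_factory=set)
    detected: set = field(default_factory=set)

    def __post_init__(self):
        # Lines 6-10: initially p knows and detects only itself.
        self.hbc[self.pid] = 0
        self.distance[self.pid] = 0
        self.last[self.pid] = 0
        self.detected.add(self.pid)

    def timeout(self, k):
        # Right-hand side of line 21: eta * delta^k + k * eps.
        eta = 2 * self.T / (self.delta - 1)
        return eta * self.delta ** k + k * self.eps

    def round(self):
        """Periodic task (lines 12-23); returns the tuples to broadcast."""
        me = self.pid
        self.hbc[me] += 1
        self.last[me] = self.hbc[me]
        rnd = self.hbc[me]
        # Lines 14-15: schedule q whenever delta^distance[q] divides the round.
        for q, k in self.distance.items():
            if k != INF and rnd % self.delta ** k == 0:
                self.unsent.add(q)
        # Lines 16-19: send up to delta + 1 entries, closest processes first.
        out = []
        for _ in range(self.delta + 1):
            if not self.unsent:
                break
            # Tie-breaking by id is an assumption of this sketch.
            q = min(self.unsent, key=lambda x: (self.distance[x], x))
            self.unsent.remove(q)
            out.append((q, self.hbc[q], self.distance[q]))
        # Lines 20-23: suspect processes whose heartbeat is overdue.
        for q in list(self.detected):
            if (rnd - self.last[q]) * self.T > self.timeout(self.distance[q]):
                self.detected.remove(q)
                self.distance[q] = INF
                # Assumption beyond Figure 1: never forward an infinite distance.
                self.unsent.discard(q)
        return out

    def deliver(self, q, new_hbc, new_dist):
        """Receiver task (lines 25-32)."""
        if self.distance.get(q, INF) > new_dist + 1:
            self.distance[q] = new_dist + 1
        if new_hbc > self.hbc.get(q, 0):
            self.hbc[q] = new_hbc
            self.last[q] = self.hbc[self.pid]
            self.detected.add(q)


def exchange(a, b):
    """One synchronous round between two neighbors with zero delay (assumed)."""
    msgs_a, msgs_b = a.round(), b.round()
    for m in msgs_a:
        b.deliver(*m)
    for m in msgs_b:
        a.deliver(*m)
```

Calling `exchange(a, b)` a few times lets two neighboring processes detect each other. If only `b.round()` is called afterwards, `b` removes `a` from `detected` once the elapsed time since the last update exceeds `b.timeout(1)`.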
containing heartbeats from processes in $Q$ and sent in $\Delta^k$ rounds is less than or equal to $\Delta^k(\Delta + 1)$. Let $q'$ be a process from $Q$ and let $i = D(p, q', 0)$ be the initial distance of $p$ and $q'$ (that is, $q' \in P_i(p, 0)$). According to Lemma 3, $q'$ causes at most $\Delta^k / \Delta^i$ messages in $\Delta^k$ rounds. Therefore, the sum $m$ of all messages from $Q$ is less than or equal to

$$\sum_{i=0}^{k} n_i(p, 0)\,\frac{\Delta^k}{\Delta^i},$$

and using Lemma 1,

$$m \le \Delta^k \left(1 + \sum_{i=1}^{k} \frac{\Delta(\Delta - 1)^{i-1}}{\Delta^i}\right) \le \Delta^k(\Delta + 1).$$

Since $\Delta^k(\Delta + 1)$ messages can be sent in $\Delta^k$ rounds according to lines 16-19, and since at most $\Delta^k(\Delta + 1)$ heartbeats (including that of $q$) can have a priority higher than or equal to that of $q$, the heartbeat of $q$ is broadcast at least once in each period of $\Delta^k$ rounds ². Since the position of $t$ within the $\Delta^k$ period can be arbitrary, the heartbeat is broadcast after at most $2\Delta^k$ rounds.

² In fact, $q$ is broadcast exactly once in $\Delta^k$ rounds.

The guarantee that a heartbeat is forwarded after some well-defined time allows us to compute a precise bound on the time after which a process learns the current round number of another process in the system:

Lemma 5. If two processes $p$ and $q$ remain connected with distance $k$ after some time $t_0$ and $p$ increases $hbc_p[p]$ at time $t \ge t_0$ to a value $v$, then $q$ sets $hbc_q[p] = v$ and $distance_q[p] \le k$ by time $t + k\tau^+ + 2\left(\sum_{i=1}^{k-1} \Delta^i\right)T$.

Proof. By induction on $k$. For $k = 1$, $p$ broadcasts its own heartbeat immediately and therefore $q$ receives $hbc_p[p]$ after at most $\tau^+$ time steps. For $k > 1$, assume the lemma
holds for $k - 1$. Let $q'$ be a neighbor of $q$ with distance $k - 1$ from $p$ (since $\forall t' \ge t : D(p, q, t') = k$, such a process exists). Then by the induction hypothesis, $q'$ sets $hbc_{q'}[p] = v$ and $distance_{q'}[p] \le k - 1$ by time $t + (k - 1)\tau^+ + 2\left(\sum_{i=1}^{k-2} \Delta^i\right)T$. Therefore, according to Lemma 4, $q'$ forwards the heartbeat and distance of $p$ after at most $2\Delta^{distance_{q'}[p]}\,T \le 2\Delta^{k-1}T$ time steps. Consequently, including the maximum communication delay $\tau^+$, $q$ receives $v$ by time $t + k\tau^+ + 2\left(\sum_{i=1}^{k-1} \Delta^i\right)T$. According to lines 26+27 of the algorithm, after the reception of this message, $distance_q[p] \le k$.

With that result, we can compute a timeout on the time difference between two updates of a heartbeat counter. This timeout does not depend on any parameter other than the maximum degree $\Delta$ and the jitter $\varepsilon$:

Lemma 6. If any two processes $p$ and $q$ remain connected at distance $k$, the time difference between two updates of $hbc_q[p]$ at $q$ is less than or equal to $\eta\Delta^k + k\varepsilon$, where $\eta = \frac{2T}{\Delta - 1}$.

Proof. W.l.o.g., assume that $p$ sets $hbc_p[p] = v$ at time $0$ and $hbc_p[p] = v + 1$ at time $T$. Obviously, the earliest time $q$ can set $hbc_q[p] = v$ is $k\tau^-$. By Lemma 5, the latest time $q$ can set $hbc_q[p] = v + 1$ (or a higher value) is $T + k\tau^+ + 2\left(\sum_{i=1}^{k-1} \Delta^i\right)T$. Hence, the time difference is

$$2\sum_{i=1}^{k-1} \Delta^i \cdot T + T + k\varepsilon \le T \cdot \left(2\,\frac{\Delta^k - 1}{\Delta - 1} - 1\right) + k\varepsilon \le \eta\Delta^k + k\varepsilon.$$

Theorem 2. Algorithm 1 implements eventual strong accuracy.

Proof. Let $p$ and $q$ be two processes that remain connected. After some time, $D(p, q, t)$ does not change anymore. Then either $p$ never suspects $q$, in which case we are done, or there is a time where $p$ suspects $q$. In this case, $distance_p[q] = \infty$. Then eventually $p$ will learn $k$, i.e., $distance_p[q] = k$. According to Lemma 6, $p$ receives a new heartbeat from $q$ at least every $\eta\Delta^k + k\varepsilon$ time steps. Therefore the condition in line 21 is never satisfied again and the processes never suspect each other afterwards.

Corollary. Algorithm 1 implements an eventually perfect failure detector $\Diamond P$ for partitionable systems.

6 Complexity Analysis

In this section we analyze the message complexity and the failure detection time of the algorithm. As the following theorems show, we obtain the logarithmic message complexity in exchange for a failure detection time that is exponential in the distance between the nodes.

Theorem 3. In every round of $T$ time steps, every process

• sends at most $\Delta + 1$ messages
• receives at most $\Delta(\Delta + 1)$ messages

where each message is of size $O(\log n + \log t)$.

Proof. That a process sends at most $\Delta + 1$ messages per round follows immediately from lines 16 and 19 of the algorithm. Every node has at most $\Delta$ neighbors, so the number of received messages is at most $\Delta(\Delta + 1)$. The message size follows from the fact that every message is a tuple $(p, hbc, distance)$, where $p$ is of size $O(\log n)$, $hbc$ of size $O(\log t)$ and $distance$ of size $O(\log n)$.

In practice, the message size can be regarded as constant. Since $\Delta$ is also a constant, the communication traffic at each process is of constant size.

Theorem 4. If two processes $p$ and $q$ become disconnected at time $t$, they suspect each other by time ³ $t + 2\eta\Delta^k + k(2\tau^+ - \tau^-)$, where $k$ is the distance just before the partition.

Proof. According to Lemma 5, $p$ receives the last heartbeat of $q$ by time $t + k\tau^+ + 2\left(\sum_{i=1}^{k-1} \Delta^i\right)T$. According to line 21, $p$ suspects $q$ exactly $\eta\Delta^k + k\varepsilon$ time steps later. Since

$$t + k\tau^+ + T \cdot \left(2\,\frac{\Delta^k - 1}{\Delta - 1} - 1\right) + \eta\Delta^k + k\varepsilon \le t + 2\eta\Delta^k + k(2\tau^+ - \tau^-),$$

the theorem follows.

³ In [6] this is called the detection time $T_D$.

7 Conclusions and Future Work

We presented an implementation of an eventually perfect failure detector for a partitionable network with sparse topology. Processes can only communicate with their neighbors. Between neighbors, an upper bound on the communication jitter is assumed. The number of neighbors is assumed to be bounded by $\Delta$, which is an adequate model for wireless ad-hoc networks. The algorithm requires neither a priori knowledge of the number of processes in the system nor an upper bound on the communication delay between arbitrary processes. Every process broadcasts just $\Delta + 1$ messages per round to its neighbors, and under the assumption of a constant-size name space and time domain, these messages are of constant size. Processes at shorter distances get more accurate information about each other than processes farther apart.

It is possible to adapt our algorithm to systems where links can also recover. However, in such a system the definition of reachability is not obvious, since an application of the failure detector may use, e.g., a routing algorithm for communication. The application-level reachability relation
would hence also depend on the behavior of this routing algorithm. One solution would be to investigate algorithms that depend on a failure detector but operate directly on the sparse network, another to derive connectedness conditions on the topology from various routing algorithms. Part of our future work will be devoted to those extensions.

Acknowledgments

I should like to thank Josef Widder for helpful discussion and comments. The contribution of the anonymous referees and the proofreading of Bettina Weiss and Ulrich Schmid, my PhD supervisor, are also acknowledged.

References

[1] Marcos Kawazoe Aguilera, Wei Chen, and Sam Toueg. Using the heartbeat failure detector for quiescent reliable communication and consensus in partitionable networks. Theoretical Computer Science, 220(1):3–30, 1999.

[2] Özalp Babaoğlu, Renzo Davoli, and Alberto Montresor. Group communication in partitionable systems: Specification and algorithms. Software Engineering, 27(4):308–336, 2001.

[3] Marin Bertier, Olivier Marin, and Pierre Sens. Implementation and performance evaluation of an adaptable failure detector. In Proceedings of the International Conference on Dependable Systems and Networks (DSN'02), pages 354–363, Washington, DC, June 23–26, 2002.

[4] Tushar Deepak Chandra, Vassos Hadzilacos, and Sam Toueg. The weakest failure detector for solving consensus. Journal of the ACM (JACM), 43(4):685–722, 1996.

[9] Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 32(2):374–382, April 1985.

[10] Rachid Guerraoui. Non-blocking atomic commit in asynchronous distributed systems with failure detectors. In Distributed Computing, volume 15, pages 17–25. Springer-Verlag, 2002.

[11] Indranil Gupta, Tushar D. Chandra, and Germán S. Goldszmidt. On scalable and efficient distributed failure detectors. In Proceedings of the 20th ACM Symposium on Principles of Distributed Computing (PODC'01), pages 170–179, August 2001.

[12] Mikel Larrea, Antonio Fernández, and Sergio Arévalo. On the impossibility of implementing perpetual failure detectors in partially synchronous systems. In Proceedings of the 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing (PDP'02), January 2002.

[13] Ulrich Schmid. How to model link failures: A perception-based fault model. In Proceedings of the International Conference on Dependable Systems and Networks (DSN'01), pages 57–66, Göteborg, Sweden, July 1–4, 2001.

[14] Robbert van Renesse, Yaron Minsky, and Mark Hayden. A gossip-style failure detection service. Technical Report TR98-1687, 1998.

[15] Josef Widder. Booting clock synchronization in partially synchronous systems. In Proceedings of the 17th International Symposium on Distributed Computing (DISC'03), volume 2848 of LNCS, pages 121–135. Springer Verlag, October 2003.