
An Efficient Failure Detector for Sparsely Connected Networks

Martin Hutle ∗
Embedded Computing Systems Group
Institute for Computer Engineering
Vienna University of Technology
Treitlstraße 3/2
A-1040 Vienna, Austria
hutle@ecs.tuwien.ac.at

ABSTRACT

We present an implementation of an eventually perfect failure detector for sparsely connected, partitionable networks, where each process has only a bounded number of neighbors. Processes and links may fail by crashing. Regarding synchrony, our algorithm only needs to know an upper bound on the jitter ε of the communication between direct neighbors. No a priori knowledge about the number of processes in the system is required.

The algorithm uses heartbeats to determine whether a process is in the same partition. By reducing the frequency of forwards with distance, information about nearer processes is more accurate than information about farther ones, and the message size becomes constant. Since this property can be guaranteed independently of the number of processes in the system, our failure detector is very efficient in terms of communication complexity.

KEY WORDS
Fault Tolerant Distributed Systems, Unreliable Failure Detectors, Partitionable Networks, Sparse Networks

∗ Supported by the Austrian START program Y41-MAT, in the context of our W2F-project http://www.auto.tuwien.ac.at/Projects/W2F

1 Introduction

Unreliable failure detectors were introduced by Chandra and Toueg in a seminal paper [5] for solving consensus in an asynchronous system [9], and have been recognized as useful building blocks for many problems in fault-tolerant distributed computing [10, 2]. A failure detector is a module that provides a process with (possibly incorrect) information about other processes. Although the output of a failure detector is not constrained, Chandra and Toueg (and most other authors) focus on failure detectors that issue a list of processes that the failure detector module suspects to have crashed. Such failure detectors can be described by accuracy and completeness properties.

In [1] the definitions of completeness and accuracy properties were extended to partitionable networks. Informally speaking, a failure detector for partitionable networks satisfies strong completeness if the failure detector of each process eventually suspects all processes that are not in its partition. A failure detector satisfies eventual strong accuracy if the failure detector of every process eventually stops suspecting all processes that are in its partition. A failure detector that fulfills these two properties is called eventually perfect; the class of these failure detectors is denoted by ◇P.

In this paper, we present an implementation of such a failure detector, where processes are connected by a not fully connected network with a bounded number of neighbors. In wireless ad-hoc networks, where hardware limitations (e.g., receiver channels or buffers) allow processes to keep connections only to a certain number of other processes, this is a natural property of the low-level network. The total number of processes need not be known. Links and processes can fail by crashing, and due to these failures the network may partition into components. Processes can communicate only with their neighbors via a local broadcast primitive, which can also be implemented efficiently in wireless ad-hoc networks.

Processes do not need to know a bound on the communication delay between arbitrary processes, but only a bound on the jitter of the communication between neighbors. This implies synchronous communication and may appear to be a severe restriction. Synchrony is required only between direct neighbors, however, which makes communication with non-neighbors at least partially synchronous [8, 12] in case of unknown network size. In fact, in a wireless ad-hoc network, bounded communication delays can easily be achieved if a fixed part of the communication bandwidth is reserved for the failure detector, and the number of neighbors and the message size can be bounded. The algorithm of this paper fulfills these conditions.

The algorithm uses heartbeats and timeouts to determine whether there is a connection between two processes. In contrast to systems where a fully connected network is assumed (and the information is therefore routed over the partially connected network when simulating the fully connected one), every process reuses the information of its neighbors, so unnecessary traffic can be avoided: Periodically, every process increments its own heartbeat counter and exchanges heartbeats with its neighbors. Consider first a simple algorithm where every process sends all heartbeats to its neighbors in every round. Here, every process receives a new heartbeat from every connected process in each round. If a process does not receive a heartbeat from another process, it suspects it. Still, this algorithm would require every process to send O(n) messages in every round.

By contrast, our algorithm forwards heartbeats of processes that are far away less frequently than those of nearer ones. The failure detector therefore provides more accurate information about nearer processes. This blends nicely with real systems, where, due to localization, processes that work together are often situated in the same region of the network. When choosing the parameters appropriately, we can reduce the failure-detector-caused traffic of each process to constant size, independently of the number of processes in the system, yet still calculate a precise timeout value.

2 Related Work

Failure detectors for partitionable systems were defined in a similar way by various authors [1, 2, 7]. In [1], a sparsely connected communication graph is used, but the failure detector named "heartbeat" is used for quiescent reliable communication and is weaker than ◇W.¹

¹ Informally speaking, a failure detector is in ◇W if eventually no correct process is suspected by some other correct process and every faulty process is eventually suspected by some correct process (see [5] for details). Note that this definition does not include partitions. ◇W has been shown to be the weakest failure detector that solves consensus [4].

Many failure detector implementations of ◇P use heartbeats [11, 3]. To our knowledge, however, none of them operates on sparsely connected networks. They could be used in conjunction with routing or other mechanisms to implement a reliable point-to-point network on top of a sparse network, but this typically creates excessive traffic.

To reduce communication traffic, methods such as gossiping [14] are used, but these algorithms provide no deterministic solution and also require a fully connected communication graph.

3 System Model

Our distributed system comprises a set Π of n processes, connected by a not fully connected network. We assume the existence of a discrete global clock with values from a set T, which is used only for analysis and is not available to the processes. However, we assume that processes can measure time intervals with negligibly small clock drift.

The network topology is described by a simple undirected graph G(t) = (V(t), E(t)) that changes with time (t ∈ T). V(t) contains a node for each process from Π that has not crashed until t; E(t) contains an edge e = (p, q) if and only if there is a direct link between p and q which has not crashed by t. Because a link can only connect two non-crashed processes, we have ∀t ∈ T : E(t) ⊆ V(t) × V(t). As we assume persistent crashes of processes and links only, V(t+1) ⊆ V(t) and E(t+1) ⊆ E(t).

For an arbitrary but fixed time t, the distance D(p, q, t) of two nodes p and q denotes the length of the shortest path from p to q in G(t). Obviously, the distance between two nodes is monotonically increasing with time. Two nodes connected by a direct link are called neighbors; the set of all neighbors of a process p at time t is denoted by NB(p, t), and the size of this set is called the degree of the node. The maximum degree of the whole graph is denoted by ∆(t). We assume that there is a bound ∆ on ∆(t) that is known to the processes.

Due to changes in the communication graph, the network may partition into components. For a process p, the component C(p, t) is defined as the subgraph of G(t) that is induced by all nodes connected to p. Further, we group all processes in p's component by their distance from p: P_k(p, t) = {q ∈ C(p, t) | D(p, q, t) = k}, and write n_k(p, t) = |P_k(p, t)|. Note that C(p, t) and P_k(p, t) can contain only processes that have not crashed by t.

Processes can communicate with their neighbors using a local broadcast service. The service consists of two primitives, broadcast and deliver. When a process p invokes broadcast(msg) at time t, then deliver(msg) is triggered in the interval [t + τ⁻, t + τ⁺] at all processes that are in NB(p, t + τ⁺). All links are therefore reliable until they crash. Note that only a bound on the jitter ε = τ⁺ − τ⁻ needs to be known by the processes; as in [13] and [15], τ⁺ and τ⁻ are used only for analysis purposes.

4 Failure Detectors

As many other failure detectors do, our algorithm outputs a list of processes. Since we do not require that processes know all other processes in the system, or even n, a priori, our failure detector will not output all suspected processes. Instead, it outputs a list of detected processes, which is complementary and hence equivalent to the suspect list in the sense that it also allows an application to determine whether a specific process is up or down. We therefore call a process p suspected by another process q if it is not in q's list of detections. Formally, a failure detector history is a function H : Π × T → 2^Π. If a process q is in H(p, t) at some time t, we say p detects q, else p suspects q.

We call a failure detector for a partitionable system eventually perfect if it outputs a list of detections that fulfills the following properties:

Strong Completeness: For any two processes that become disconnected (this includes the case that one of them crashed), there is a time after which they permanently suspect each other. Formally,

  ∃t₀ ∀t ≥ t₀ : q ∉ C(p, t) ⇒ ∃t₁ ∀t ≥ t₁ : q ∉ H(p, t)

Eventual Strong Accuracy: For any two processes that remain permanently connected, there is a time after which they permanently do not suspect each other. Formally,

  ∃t₀ ∀t ≥ t₀ : q ∈ C(p, t) ⇒ ∃t₁ ∀t ≥ t₁ : q ∈ H(p, t)
The class of eventually perfect failure detectors is denoted by ◇P. Note that this definition is a generalization of failure detectors for non-partitionable systems [5]. If the network does not partition, the definitions are equivalent.

With such a failure detector, some a priori known subset of processes can solve consensus if a majority of them remains connected and there is a communication module that implements reliable point-to-point connections between them [1]. It can also be used to solve partitionable group membership [2].

5 The Algorithm

Every process p has, for every other process q it knows, a heartbeat table consisting of:

• a heartbeat counter hbc_p[q], which contains the most recent heartbeat of q,
• a distance counter distance_p[q], which contains p's estimate of the current distance to q,
• a time-stamp last_p[q], which holds the last round in which p received a new heartbeat from q.

These arrays grow dynamically with each new process p learns of. For reasons of simplicity, they are used in the algorithm as if they were statically allocated. The set detected_p contains all processes that the failure detector does not suspect, i.e., the failure detector's output.

The algorithm for process p is presented in Figure 1. It comprises a periodical task, which sends some part of its local heartbeat table to its current neighbors, and a receiver task, which updates the local table when it receives new heartbeats from neighbors.

Initially, p knows and detects only itself. Every T time steps, p increases its own heartbeat counter, which is also used as the local round number. Every ∆^k rounds, all known processes with distance k are put into the set unsent_p. Since only messages from this set are sent, this ensures that every heartbeat of a process with distance k is broadcast at most every ∆^k rounds. From this set, the processor id, heartbeat, and distance of the ∆ + 1 processes with lowest distance to p are sent and removed from unsent_p in every round. If p does not receive a new update from another process it has previously detected for a sufficiently long time, it suspects it.

The receiver task of p increases the distance counter of each message it receives by one. If this distance counter is smaller than its own estimate, it adopts the new distance. Once it receives a heartbeat newer than its own, it adopts this heartbeat and, if this is not already the case, detects the heartbeat's origin.

Theorem 1. Algorithm 1 implements strong completeness.

Proof. If p and q become disconnected, eventually hbc_p[q] will not grow anymore, since only q can increase hbc_q[q], and only connected processes can learn this value. Either p already suspects q, in which case we are done, or distance_p[q] is bounded by the last distance between p and q. Thus, eventually p will time out q and therefore suspect it. ∎

To show eventual strong accuracy, we need some technical lemmas. First, we derive a bound on the number of processes that are initially at a certain distance from a process:

Lemma 1. Let p be any process in a network with maximal degree ∆. Then n₀(p, t) = 1 and, for k > 0, n_k(p, t) ≤ ∆(∆ − 1)^(k−1).

Proof. Obviously, p is the only process with distance 0 to p. For k = 1, since p can have at most ∆ neighbors, the lemma also holds. Assume that the lemma is valid for k − 1 > 0. Then n_(k−1)(p, t) ≤ ∆(∆ − 1)^(k−2). Each of these processes must have a link to some process in P_(k−2). Therefore, at most ∆ − 1 links can lead to processes in P_k, yielding n_k = |P_k| ≤ ∆(∆ − 1)^(k−1). ∎

The variable distance_p[q] at time t is p's estimate of D(p, q, t). If a longer path is faster than a shorter one, this estimate may not be exact, but it is never smaller than the initial real distance:

Lemma 2. For any q ∈ P_k(p, 0), distance_p[q] ≥ k.

Proof. The variable distance_p[q] is only set when p receives a heartbeat of q. At q, distance_q[q] = 0, and with every hop along the path from q to p this hop counter is increased. Therefore, distance_p[q] contains the length of the path the heartbeat took. By definition, every process in P_k(p, 0) is initially at distance k from p, and the distance is monotonically increasing. Therefore the length of this path must be greater than or equal to k. ∎

Lemma 3. A heartbeat of a process q from P_k(p, 0) is sent by p at most once every ∆^k rounds.

Proof. A process q's heartbeat is sent by p only if it has previously been put into unsent_p. This happens only every ∆^(distance_p[q]) rounds (lines 14–15). According to Lemma 2, distance_p[q] ≥ k and therefore ∆^(distance_p[q]) ≥ ∆^k. ∎

The following lemma shows that, although our algorithm sends constant-size messages at every process, our scheduling function ensures that the heartbeat counter of every process is forwarded periodically by every process. In the following, we will also call the value of hbc_p[p] the current round number.

Lemma 4. If k = distance_p[q] at some time t, the heartbeat of q is broadcast by p at most 2∆^k rounds after t.

Proof. Let Q = {q′ | distance_p[q′] ≤ k} be the set of processes with an estimated distance to p less than or equal to k. We first show that the maximum number m of messages
 1  variables
 2    ∀q ∈ Π : hbc_p[q], distance_p[q], last_p[q] ∈ ℕ   /* heartbeat table */
 3    unsent_p ⊆ Π
 4    detected_p ⊆ Π                                    /* the failure detector output */
 5
 6  initially
 7    ∀q : hbc_p[q] = 0
 8    distance_p[p] = 0, ∀q ≠ p : distance_p[q] = ∞
 9    detected_p = {p}
10    unsent_p = ∅
11
12  every T time steps do:
13    hbc_p[p] = last_p[p] = hbc_p[p] + 1               /* this is also the local round number */
14    for each k ≥ 0 such that ∆^k divides hbc_p[p] do:
15      add all q with distance_p[q] = k to unsent_p
16    for 1 to ∆ + 1 do:
17      q = an item from unsent_p for which distance_p[q] is minimal
18      remove q from unsent_p
19      broadcast_p(q, hbc_p[q], distance_p[q])
20    for each q ∈ detected_p do:
21      if (hbc_p[p] − last_p[q])·T > η∆^k + kε then    /* η = 2T/(∆−1), k = distance_p[q] */
22        remove q from detected_p                      /* suspect q */
23        distance_p[q] = ∞
24
25  on deliver_p(q, new_hbc, new_dist) do:
26    if distance_p[q] > new_dist + 1 then
27      distance_p[q] = new_dist + 1
28    if new_hbc > hbc_p[q] then                        /* more recent heartbeat */
29      hbc_p[q] = new_hbc                              /* adopt heartbeat */
30      last_p[q] = hbc_p[p]                            /* set last reception to current round */
31      if q ∉ detected_p then
32        add q to detected_p                           /* detect q */

Figure 1. Failure detector algorithm for any process p. It comprises a periodical task and a message handler.
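For readers who prefer executable form, here is a close Python transcription of Figure 1 for a single process. This is a sketch under assumptions of ours: the transport is abstracted as a `broadcast` callback that delivers a message to all current neighbors (each of which then calls `on_deliver`), the caller invokes `round()` every T time steps, and all names other than those in the figure are our own.

```python
import math

INF = math.inf

class FailureDetector:
    """Sketch of Figure 1 for one process p; ``delta`` is the degree
    bound ∆, ``T`` the round length, ``eps`` the jitter bound ε."""

    def __init__(self, p, delta, T, eps, broadcast):
        self.p = p
        self.delta, self.T, self.eps = delta, T, eps
        self.broadcast = broadcast
        self.hbc = {p: 0}        # hbc_p[q]: most recent heartbeat of q
        self.distance = {p: 0}   # distance_p[q]: estimated distance to q
        self.last = {p: 0}       # last_p[q]: round of last new heartbeat
        self.detected = {p}      # failure detector output
        self.unsent = set()

    def round(self):
        """Periodic task, executed every T time steps (lines 12-23)."""
        self.hbc[self.p] += 1                 # also the local round number
        self.last[self.p] = self.hbc[self.p]
        rnd = self.hbc[self.p]
        # every ∆^k rounds, schedule all known processes at distance k
        for q, k in self.distance.items():
            if k != INF and rnd % (self.delta ** k) == 0:
                self.unsent.add(q)
        # broadcast the ∆+1 scheduled entries with minimal distance
        for _ in range(self.delta + 1):
            if not self.unsent:
                break
            q = min(self.unsent, key=lambda x: self.distance[x])
            self.unsent.discard(q)
            self.broadcast((q, self.hbc[q], self.distance[q]))
        # timeout of line 21: (rnd - last[q])·T > η∆^k + kε
        eta = 2 * self.T / (self.delta - 1)
        for q in list(self.detected):
            if q == self.p:
                continue
            k = self.distance[q]
            if (rnd - self.last[q]) * self.T > eta * self.delta ** k + k * self.eps:
                self.detected.discard(q)      # suspect q
                self.distance[q] = INF

    def on_deliver(self, msg):
        """Message handler (lines 25-32)."""
        q, new_hbc, new_dist = msg
        if self.distance.get(q, INF) > new_dist + 1:
            self.distance[q] = new_dist + 1
        if new_hbc > self.hbc.get(q, 0):      # more recent heartbeat
            self.hbc[q] = new_hbc             # adopt heartbeat
            self.last[q] = self.hbc[self.p]   # last reception = current round
            self.detected.add(q)              # detect q
```

Wiring two instances together by hand illustrates the flow: after `fd_p.round()`, process p has broadcast its own entry (p, 1, 0); feeding that message to a neighbor q via `fd_q.on_deliver` makes q detect p at distance 1.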

containing heartbeats from processes in Q and sent in ∆^k rounds is less than or equal to ∆^k(∆ + 1). Let q′ be a process from Q and let i = D(p, q′, 0) be the initial distance of p and q′ (that is, q′ ∈ P_i(p, 0)). According to Lemma 3, q′ causes at most ∆^k/∆^i messages in ∆^k rounds. Therefore, the sum m of all messages from Q is less than or equal to

  Σ_{i=0}^{k} n_i(p, 0) · ∆^k/∆^i,

and, using Lemma 1,

  m ≤ ∆^k · (1 + Σ_{i=1}^{k} ∆(∆ − 1)^(i−1)/∆^i) ≤ ∆^k(∆ + 1).

Since ∆^k(∆ + 1) messages can be sent in ∆^k rounds according to lines 16–19, and since at most ∆^k(∆ + 1) heartbeats (including q) can have a higher or equal priority than q, q is broadcast at least once in each period of ∆^k rounds². Since the position of t in the ∆^k period can be arbitrary, the heartbeat is broadcast at most 2∆^k rounds after t. ∎

² In fact, q is broadcast exactly once in ∆^k rounds.

The guarantee that a heartbeat is forwarded after some well-defined time allows us to compute a precise bound on the time after which a process learns the current round number of another process in the system:

Lemma 5. If two processes p and q remain connected with distance k after some time t₀ and p increases hbc_p[p] at time t ≥ t₀ to a value v, then q sets hbc_q[p] = v and distance_q[p] ≤ k by time t + kτ⁺ + 2(Σ_{i=1}^{k−1} ∆^i)T.

Proof. By induction on k. For k = 1, p broadcasts its own heartbeat immediately, and therefore q receives hbc_p[p] at the latest after τ⁺ time steps. For k > 1, assume the lemma
holds for k − 1. Let q′ be a neighbor of q with distance k − 1 from p (since ∀t′ ≥ t : D(p, q, t′) = k, such a process exists). Then, by the induction hypothesis, q′ sets hbc_q′[p] = v and distance_q′[p] ≤ k − 1 by time t + (k−1)τ⁺ + 2(Σ_{i=1}^{k−2} ∆^i)T. Therefore, according to Lemma 4, q′ forwards the heartbeat and distance of p after at most 2∆^(distance_q′[p])·T ≤ 2∆^(k−1)·T time steps. In consequence, including the maximum communication delay τ⁺, q receives v by t + kτ⁺ + 2(Σ_{i=1}^{k−1} ∆^i)T. According to lines 26–27 of the algorithm, after the reception of this message, distance_q[p] ≤ k. ∎

With that result, we can compute a timeout on the time difference between two updates of a heartbeat counter. This timeout does not depend on any parameters other than the maximum degree ∆ and the jitter ε:

Lemma 6. If any two processes p and q remain connected at distance k, the time difference between two updates of hbc_q[p] at q is less than or equal to η∆^k + kε, where η = 2T/(∆ − 1).

Proof. W.l.o.g., assume that p sets hbc_p[p] = v at time 0 and hbc_p[p] = v + 1 at time T. Obviously, the earliest time q can set hbc_q[p] = v is kτ⁻. By Lemma 5, the latest time q can set hbc_q[p] = v + 1 (or a higher value) is T + kτ⁺ + 2(Σ_{i=1}^{k−1} ∆^i)T. Hence, the time difference is

  2(Σ_{i=1}^{k−1} ∆^i)·T + T + kε ≤ T·(2·(∆^k − 1)/(∆ − 1) − 1) + kε ≤ η∆^k + kε.  ∎

Theorem 2. Algorithm 1 implements eventual strong accuracy.

Proof. Let p and q be two processes that remain connected. After some time, D(p, q, t) does not change anymore. Then either p never suspects q, in which case we are done, or there is a time at which p suspects q. In this case, distance_p[q] = ∞. Then eventually p will learn k, i.e., distance_p[q] = k. According to Lemma 6, p receives a new heartbeat from q at least every η∆^k + kε time steps. Therefore, the condition in line 21 is never satisfied again and the processes never suspect each other again. ∎

Corollary 1. Algorithm 1 implements an eventually perfect failure detector ◇P for partitionable systems.

6 Complexity Analysis

In this section we analyze the message complexity and the failure detection time of the algorithm. As the following theorems show, we obtain logarithmic message complexity in exchange for a failure detection time that is exponential in the distance between the nodes.

Theorem 3. In every round of T time steps, every process

• sends at most ∆ + 1 messages,
• receives at most ∆(∆ + 1) messages,

where each message is of size O(log n + log t).

Proof. That the number of messages a process sends per round is at most ∆ + 1 follows immediately from lines 16 and 19 of the algorithm. Every node has at most ∆ neighbors, so the number of received messages is at most ∆(∆ + 1). The message size follows from the fact that every message is a tuple (p, hbc, distance), where p is of size O(log n), hbc of size O(log t), and distance of size O(log n). ∎

In practice, the message size can be regarded as constant. Since ∆ is also a constant, the communication traffic at each process is of constant size.

Theorem 4. If two processes p and q become disconnected at time t, they suspect each other by time³ t + 2η∆^k + k(2τ⁺ − τ⁻), where k is the distance just before the partition.

³ In [6] this is called the detection time T_D.

Proof. According to Lemma 5, p receives the last heartbeat of q by time t + kτ⁺ + 2(Σ_{i=1}^{k−1} ∆^i)T. According to line 21, p suspects q exactly η∆^k + kε time steps later. Since

  t + kτ⁺ + T·(2·(∆^k − 1)/(∆ − 1) − 1) + η∆^k + kε ≤ t + 2η∆^k + k(2τ⁺ − τ⁻),

the theorem follows. ∎

7 Conclusions and Future Work

We presented an implementation of an eventually perfect failure detector for a partitionable network with sparse topology. Processes can only communicate with their neighbors. Between neighbors, an upper bound on the communication jitter is assumed. The number of neighbors is assumed to be bounded by ∆, which is an adequate model for wireless ad-hoc networks. The algorithm requires neither a priori knowledge of the number of processes in the system nor an upper bound on the communication delay between arbitrary processes. Every process broadcasts just ∆ + 1 messages per round to its neighbors, and under the assumption of a constant-size name space and time domain, these messages are of constant size. Processes at shorter distances get more accurate information about each other than processes farther away.

It is possible to adapt our algorithm to systems where links can also recover. However, in such a system the definition of reachability is not obvious, since an application of the failure detector may use, e.g., a routing algorithm for communication. The application-level reachability relation
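The two bounds above are easy to tabulate. A small sketch (parameter names are ours) of the line-21 timeout η∆^k + kε from Lemma 6 and the detection-time bound of Theorem 4:

```python
def timeout(k, delta, T, eps):
    """Line-21 timeout for a process at estimated distance k:
    η·∆^k + k·ε with η = 2T/(∆−1) (Lemma 6)."""
    eta = 2 * T / (delta - 1)
    return eta * delta ** k + k * eps

def detection_time_bound(k, delta, T, eps, tau_plus, tau_minus):
    """Worst-case suspicion latency after a partition at distance k
    (Theorem 4): 2η·∆^k + k(2τ⁺ − τ⁻), relative to the partition time."""
    eta = 2 * T / (delta - 1)
    return 2 * eta * delta ** k + k * (2 * tau_plus - tau_minus)

# e.g. ∆ = 3, T = 1, ε = 0.1: the timeout grows exponentially with distance
print([round(timeout(k, 3, 1.0, 0.1), 1) for k in (1, 2, 3)])
# → [3.1, 9.2, 27.3]
```

This makes the trade-off of Theorems 3 and 4 concrete: per-round traffic stays constant while the detection time blows up geometrically with the distance k.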
would hence also depend on the behavior of this routing algorithm. One solution would be to investigate algorithms that depend on a failure detector but operate directly on the sparse network; another would be to derive connectedness conditions on the topology from various routing algorithms. Part of our future work will be devoted to these extensions.

Acknowledgments

I would like to thank Josef Widder for helpful discussions and comments. The contributions of the anonymous referees and the proofreading of Bettina Weiss and Ulrich Schmid, my PhD supervisor, are also gratefully acknowledged.

References

[1] Marcos Kawazoe Aguilera, Wei Chen, and Sam Toueg. Using the heartbeat failure detector for quiescent reliable communication and consensus in partitionable networks. Theoretical Computer Science, 220(1):3–30, 1999.

[2] Özalp Babaoğlu, Renzo Davoli, and Alberto Montresor. Group communication in partitionable systems: Specification and algorithms. Software Engineering, 27(4):308–336, 2001.

[3] Marin Bertier, Olivier Marin, and Pierre Sens. Implementation and performance evaluation of an adaptable failure detector. In Proceedings of the International Conference on Dependable Systems and Networks (DSN'02), pages 354–363, Washington, DC, June 23–26, 2002.

[4] Tushar Deepak Chandra, Vassos Hadzilacos, and Sam Toueg. The weakest failure detector for solving consensus. Journal of the ACM, 43(4):685–722, 1996.

[5] Tushar Deepak Chandra and Sam Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 43(2):225–267, 1996.

[6] Wei Chen, Sam Toueg, and Marcos Kawazoe Aguilera. On the quality of service of failure detectors. In Proceedings of the IEEE International Conference on Dependable Systems and Networks (ICDSN / FTCS'30), 2000.

[7] D. Dolev, R. Friedman, I. Keidar, and D. Malkhi. Failure detection in omission failure environments. In Proceedings of the 16th ACM Symposium on Principles of Distributed Computing, 1997.

[8] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. Consensus in the presence of partial synchrony. Journal of the ACM, 35(2):288–323, 1988.

[9] Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 32(2):374–382, April 1985.

[10] Rachid Guerraoui. Non-blocking atomic commit in asynchronous distributed systems with failure detectors. Distributed Computing, 15:17–25, 2002.

[11] Indranil Gupta, Tushar D. Chandra, and Germán S. Goldszmidt. On scalable and efficient distributed failure detectors. In Proceedings of the 20th ACM Symposium on Principles of Distributed Computing (PODC'01), pages 170–179, August 2001.

[12] Mikel Larrea, Antonio Fernández, and Sergio Arévalo. On the impossibility of implementing perpetual failure detectors in partially synchronous systems. In Proceedings of the 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing (PDP'02), January 2002.

[13] Ulrich Schmid. How to model link failures: A perception-based fault model. In Proceedings of the International Conference on Dependable Systems and Networks (DSN'01), pages 57–66, Göteborg, Sweden, July 1–4, 2001.

[14] Robbert van Renesse, Yaron Minsky, and Mark Hayden. A gossip-style failure detection service. Technical Report TR98-1687, 1998.

[15] Josef Widder. Booting clock synchronization in partially synchronous systems. In Proceedings of the 17th International Symposium on Distributed Computing (DISC'03), volume 2848 of LNCS, pages 121–135. Springer Verlag, October 2003.
