Lecture 4 - Failure Detection and Membership


Parallel and Distributed Computing
Fall 2023
Dr. Rashid Amin
Lecture 4: Failure Detection and Membership
A Challenge
• You’ve been put in charge of a datacenter, and your manager has told you, “Oh no! We don’t have any failures in our datacenter!”

• Do you believe him/her?

• What would be your first responsibility?
  – Build a failure detector

• What are some things that could go wrong if you didn’t do this?
Failures are the Norm
… not the exception, in datacenters.

Say the rate of failure of one machine (OS/disk/motherboard/network, etc.) is once every 10 years (120 months) on average.

When you have 120 servers in the DC, the mean time to failure (MTTF) of the next machine is 1 month.

When you have 12,000 servers in the DC, the MTTF is about 7.2 hours!

Soft crashes and failures are even more frequent!
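A quick sanity check of this arithmetic (a sketch, assuming independent failures and a 30-day month):

# MTTF of the *next* failure in a datacenter of N machines,
# given that a single machine fails once every 120 months on average.
machine_mttf_months = 120

def next_failure_mttf_hours(num_servers, hours_per_month=30 * 24):
    return machine_mttf_months * hours_per_month / num_servers

print(next_failure_mttf_hours(120))    # 720.0 hours = 1 month
print(next_failure_mttf_hours(12000))  # 7.2 hours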
To build a failure detector
• You have a few options:

1. Hire 1000 people, each to monitor one machine in the datacenter and report to you when it fails.
2. Write a (distributed) failure detector program that automatically detects failures and reports to your workstation.

Which is preferable, and why?
Target Settings
• Process ‘group’-based systems
  – Clouds/datacenters
  – Replicated servers
  – Distributed databases

• Fail-stop (crash) process failures

Group Membership Service
[Figure: at each application process pi, a Membership Protocol maintains a Membership List. Application queries (e.g., gossip, overlays, DHTs, etc.) read this list. The protocol runs over unreliable communication.]
Two sub-protocols
[Figure: within each application process pi, the membership protocol consists of two sub-protocols, a Failure Detector and a Dissemination component, which maintain the group membership list (including pj) over unreliable communication.]

The membership list can be maintained as:
• A complete list at all times (strongly consistent)
  – e.g., virtual synchrony
• An almost-complete list (weakly consistent)
  – e.g., gossip-style, SWIM, … (the focus of this series of lectures)
• Or a partial-random list (other systems)
  – e.g., SCAMP, T-MAN, Cyclon, …
Large Group: Scalability, a Goal

[Figure: a process group of 1000’s of member processes (“this is us”, pi), connected by an unreliable communication network.]
Group Membership Protocol
[Figure: pj crashes (I); the Failure Detector at some process pi finds out quickly (II); Dissemination spreads the news through the group (III). All communication is over an unreliable network; fail-stop failures only.]
Next
• How do you design a group membership protocol?
I. pj crashes
• Nothing we can do about it!
• A frequent occurrence
• Common case rather than exception
• Frequency goes up linearly with the size of the datacenter
II. Distributed Failure Detectors:
Desirable Properties
• Completeness = each failure is detected
• Accuracy = there is no mistaken detection
• Speed
  – Time to first detection of a failure
• Scale
  – Equal load on each member
Distributed Failure Detectors: Properties

• Completeness and accuracy are impossible to guarantee together in lossy networks [Chandra and Toueg]
  – If both were possible, we could solve consensus, but consensus is known to be unsolvable in asynchronous systems
• Speed
  – Time to first detection of a failure
• Scale
  – Equal load on each member
  – Network message load
What Real Failure Detectors Prefer
• Completeness: guaranteed
• Accuracy: partial/probabilistic guarantee
• Speed
  – Time to first detection of a failure, i.e., time until some process detects the failure
• Scale
  – Equal load on each member
  – Network message load
  – No bottlenecks/single point of failure
Failure Detector Properties
• Completeness and accuracy must hold in spite of arbitrary simultaneous process failures
• Speed
  – Time to first detection of a failure
• Scale
  – Equal load on each member
  – Network message load
Centralized Heartbeating
• Every process pi periodically sends a heartbeat (pi, heartbeat seq. l++) to a single central process pj
• If a heartbeat is not received from pi within the timeout, pj marks pi as failed
• Downside: the central process is a hotspot (and a single point of failure)
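A minimal sketch of the central monitor’s bookkeeping (the timeout value and names such as last_seen are illustrative assumptions, not from the slides):

import time

HEARTBEAT_TIMEOUT = 5.0   # seconds of silence before a member is marked failed (assumed value)
last_seen = {}            # member id -> local time the last heartbeat arrived

def on_heartbeat(member_id, seq_no):
    # Called by the central process whenever (member_id, seq_no) is received.
    last_seen[member_id] = time.time()

def detect_failures():
    # Periodically scan the table; anyone silent for longer than the timeout is declared failed.
    now = time.time()
    return [m for m, t in last_seen.items() if now - t > HEARTBEAT_TIMEOUT]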
Ring Heartbeating
• Every process pi sends heartbeats (pi, heartbeat seq. l++) to its neighbour(s) pj around a ring
• Downside: detection becomes unpredictable when multiple processes fail simultaneously
All-to-All Heartbeating
• Every process pi sends heartbeats (pi, heartbeat seq. l++) to all other processes pj
• Advantage: equal load per member
• Downside: a single lost heartbeat leads to a false detection
Next
• How do we increase the robustness of all-to-all heartbeating?
Gossip-style Heartbeating
• Every process pi periodically sends its array of heartbeat seq. numbers to a randomly chosen subset of members
• Advantage: good accuracy properties
Gossip-Style Failure Detection
[Figure: a four-node group in which every node keeps a membership list of (address, heartbeat counter, local time last updated) entries; the example shows node 2 merging a received list into its own at local time 70 (clocks are asynchronous).]

Protocol:
• Nodes periodically gossip their membership list: pick random nodes, send them the list
• On receipt, the received list is merged with the local membership list
• When an entry times out, the member is marked as failed
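A minimal sketch of the merge step (the data-structure names are illustrative; one common merge rule is to keep the higher heartbeat counter and stamp it with local time, since clocks are asynchronous):

import time

# Local membership list: member address -> (heartbeat counter, local time of last increase)
membership = {}

def on_gossip(received):
    # Merge a gossiped list into the local one: keep the entry with the higher
    # heartbeat counter and record *local* time, ignoring the sender's timestamps.
    now = time.time()
    for member, (hb, _sender_local_time) in received.items():
        local_hb, _ = membership.get(member, (-1, 0.0))
        if hb > local_hb:
            membership[member] = (hb, now)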
Gossip-Style Failure Detection
• If the heartbeat counter has not increased for more than Tfail seconds, the member is considered failed
• After a further Tcleanup seconds, the member is deleted from the list
• Why an additional timeout? Why not delete right away?
Gossip-Style Failure Detection
• What if an entry pointing to a failed node is deleted right after Tfail (= 24) seconds?

[Figure: node 2 deletes its entry for the failed node at local time 75; a gossip message from a node that has not yet timed the entry out then re-inserts it with the old heartbeat counter, making it look like a fresh member.]
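A sketch of the two-phase removal this implies (the Tfail/Tcleanup values and the failed set are illustrative). While a member sits in the failed set, gossip about it is ignored, so stale heartbeats cannot re-insert it:

import time

T_FAIL = 24.0     # seconds without a heartbeat increase before marking a member failed
T_CLEANUP = 24.0  # additional seconds before the entry is deleted

membership = {}   # member -> (heartbeat counter, local time of last increase)
failed = {}       # member -> local time it was marked failed

def sweep():
    now = time.time()
    # Phase 1: mark stale members as failed, but keep their entries around.
    for m, (hb, t) in membership.items():
        if now - t > T_FAIL and m not in failed:
            failed[m] = now
    # Phase 2: only delete after a further T_CLEANUP seconds, by which time
    # other nodes have also timed the member out and stopped gossiping it.
    for m in [m for m, t in failed.items() if now - t > T_CLEANUP]:
        membership.pop(m, None)
        del failed[m]

def on_gossip_entry(member, hb):
    if member in failed:
        return  # ignore stale gossip about a member we have already marked failed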
Next
• Is there a better failure detector?

26
SWIM Failure Detector Protocol
[Figure: one protocol period of T’ time units at process pi.]

• pi picks a random member pj and sends it a ping, expecting an ack
• If no ack arrives, pi sends ping-req(pj) to K randomly selected members; each of them pings pj and relays any ack back to pi
• If neither a direct nor an indirect ack arrives within the protocol period, pi declares pj failed (and hands this to the dissemination component)
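A sketch of one protocol period (send_ping and send_ping_req are assumed helpers that return True iff an ack arrives within their timeout; K = 3 is an illustrative value):

import random

K = 3  # number of members used for indirect pinging (illustrative)

def protocol_period(members, self_id, send_ping, send_ping_req):
    # Returns the member to report as failed this period, or None.
    pj = random.choice([m for m in members if m != self_id])

    # 1. Direct probe.
    if send_ping(pj):
        return None

    # 2. Indirect probe through K randomly chosen members.
    others = [m for m in members if m not in (self_id, pj)]
    helpers = random.sample(others, min(K, len(others)))
    if any(send_ping_req(h, pj) for h in helpers):
        return None

    # 3. No direct or indirect ack within the protocol period.
    return pj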
Time-bounded Completeness
• Key: select each membership list element once as a ping target per traversal
  – Round-robin pinging
  – Random permutation of the list after each traversal
• Each failure is detected within 2N-1 (local) protocol periods in the worst case
• Preserves the FD properties
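A sketch of that target selection (the class name is illustrative): a round-robin pass over the membership list, reshuffled after every full traversal.

import random

class PingTargetSelector:
    def __init__(self, members):
        self.order = list(members)
        random.shuffle(self.order)
        self.index = 0

    def next_target(self):
        # Start a new traversal with a fresh random permutation once the
        # current one is exhausted; within a traversal every member is
        # selected exactly once, which bounds detection time.
        if self.index >= len(self.order):
            random.shuffle(self.order)
            self.index = 0
        target = self.order[self.index]
        self.index += 1
        return target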
Next
• How do failure detectors fit into the big picture of a group membership protocol?
• What are the missing blocks?
Group Membership Protocol
[Figure (repeated): pj crashes (I); the Failure Detector at some process pi finds out quickly (II); Dissemination spreads the news through the group (III). All communication is over an unreliable network; fail-stop failures only.]
Dissemination Options
• Multicast (hardware / IP)
  – Unreliable
  – Multiple simultaneous multicasts
• Point-to-point (TCP / UDP)
  – Expensive
• Zero extra messages: piggyback on failure detector messages
  – Infection-style dissemination
Infection-style Dissemination
[Figure: the same ping / ping-req / ack exchange as the SWIM failure detector, with a protocol period of T time units.]

• Membership information is piggybacked on the ping, ping-req, and ack messages themselves
• No extra messages are needed for dissemination
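A sketch of the piggyback buffer (the retransmission bound is an assumed constant; SWIM itself piggybacks each update on a number of messages proportional to log N):

MAX_PIGGYBACKS = 8   # how many outgoing messages each update rides on (assumed constant)

updates = []         # list of [membership update, remaining piggyback count]

def queue_update(update):
    # Called when the failure detector produces a new join/suspect/failed event.
    updates.append([update, MAX_PIGGYBACKS])

def piggyback_payload():
    # Attach all pending updates to the next outgoing ping/ping-req/ack,
    # decrementing their counters and dropping exhausted ones.
    payload = [u for u, _ in updates]
    for entry in updates:
        entry[1] -= 1
    updates[:] = [e for e in updates if e[1] > 0]
    return payload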
Suspicion Mechanism
• False detections, due to
  – Perturbed processes
  – Packet losses, e.g., from congestion
• Indirect pinging may not solve the problem
• Key: suspect a process before declaring it failed in the group
Suspicion Mechanism

[Figure: the state machine pi keeps for each member pj, with states Alive, Suspected, and Failed. The failure detector (FD) and dissemination (Dissmn) components drive the transitions: an unanswered ping or a Dissmn (Suspect pj) message moves pj from Alive to Suspected; a successful ping or a Dissmn (Alive pj) message moves it back to Alive; a timeout or a Dissmn (Failed pj) message moves it to Failed.]
Suspicion Mechanism
• Distinguish multiple suspicions of a process
  – Per-process incarnation number
  – The incarnation number for pi can be incremented only by pi
    • e.g., when pi receives a (Suspect, pi) message
  – Somewhat similar to DSDV (a routing protocol for ad-hoc networks)
• Notifications with a higher incarnation number override those with a lower one
• Within an incarnation number: (Suspect, inc #) > (Alive, inc #)
• (Failed, inc #) overrides everything else
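A sketch of these override rules (the message kinds and the terminal treatment of Failed follow the bullets above; data-structure names are illustrative):

PRECEDENCE = {"Alive": 0, "Suspect": 1}   # within the same incarnation, Suspect > Alive

state = {}   # member -> (kind, incarnation number) currently believed

def overrides(new, old):
    (new_kind, new_inc), (old_kind, old_inc) = new, old
    if old_kind == "Failed":
        return False                 # Failed is terminal
    if new_kind == "Failed":
        return True                  # (Failed, inc #) overrides everything else
    if new_inc != old_inc:
        return new_inc > old_inc     # higher incarnation numbers win
    return PRECEDENCE[new_kind] > PRECEDENCE[old_kind]

def on_notification(kind, member, incarnation):
    current = state.get(member, ("Alive", -1))
    if overrides((kind, incarnation), current):
        state[member] = (kind, incarnation)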
Wrap Up
• Failures are the norm, not the exception, in datacenters
• Every distributed system uses a failure detector
• Many distributed systems use a membership service

• Ring failure detection underlies
  – IBM SP2 and many other similar clusters/machines

• Gossip-style failure detection underlies
  – Amazon EC2/S3 (rumored!)
