Fault Tolerance (2) : Distributed Systems Principles and Paradigms
Chapter 8
Fault Tolerance (2)
Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. All rights reserved. 0-13-239227-5
Plan
• Basic Concepts
• Process Resilience
• Reliable communication
– Client-server communication Tolerating failures
– Group communication
• Distributed commit
• Recovery
What can go wrong?
• Figure: messages 1) Commit → and 2) Global-commit →, annotated with the failure cases Crash, Omission, and Cannot commit.
Distributed Commit
• Given a process group and an operation
• Either everybody commits or everybody aborts
– Consistency, validity, termination
Two-Phase Commit
• Figure: message sequence 1) Commit →, 2) Vote-request →, 3) Vote-commit ←, 4) Global-commit →
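The numbered message sequence can be sketched as a single-process simulation. This is a minimal illustration rather than a networked implementation: the names `Participant` and `two_phase_commit` are hypothetical, each participant's local decision is assumed to be a fixed flag, and failures are not modelled.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    name: str
    can_commit: bool = True   # assumption: the local decision is a fixed flag
    state: str = "INIT"

    def on_vote_request(self) -> str:
        # Phase 1: vote to commit and move to READY, or abort unilaterally.
        if self.can_commit:
            self.state = "READY"
            return "VOTE_COMMIT"
        self.state = "ABORT"
        return "VOTE_ABORT"

    def on_decision(self, decision: str) -> None:
        # Phase 2: follow the coordinator's global decision.
        self.state = "COMMIT" if decision == "GLOBAL_COMMIT" else "ABORT"


def two_phase_commit(participants: list[Participant]) -> str:
    # Phase 1: the coordinator sends Vote-request and gathers the votes.
    votes = [p.on_vote_request() for p in participants]
    # Commit only if every participant voted to commit.
    if all(v == "VOTE_COMMIT" for v in votes):
        decision = "GLOBAL_COMMIT"
    else:
        decision = "GLOBAL_ABORT"
    # Phase 2: the coordinator announces the decision to everyone.
    for p in participants:
        p.on_decision(decision)
    return decision
```

A single abort vote is enough to make every participant abort, which is the "everybody commits or everybody aborts" requirement in the failure-free case.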
Two-Phase Commit
• Figure 8-18. (a) The finite state machine for the coordinator in 2PC. (b) The finite state machine for a participant. (Legend: input event / output event.)
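The two state machines can be written as transition tables that map (state, input event) to (next state, output event). The transitions below follow the usual textbook presentation of 2PC; the dictionary layout, the `/yes` and `/no` suffixes, and the `step` helper are assumptions of this sketch.

```python
# (state, input event) -> (next state, output event)
COORDINATOR = {
    ("INIT", "Commit"): ("WAIT", "Vote-request"),
    ("WAIT", "Vote-abort"): ("ABORT", "Global-abort"),
    # Taken only once Vote-commit has arrived from every participant.
    ("WAIT", "Vote-commit"): ("COMMIT", "Global-commit"),
}

# A participant in INIT may answer a vote request either way, so the local
# choice is folded into the event name here (an assumption of this sketch).
PARTICIPANT = {
    ("INIT", "Vote-request/yes"): ("READY", "Vote-commit"),
    ("INIT", "Vote-request/no"): ("ABORT", "Vote-abort"),
    ("READY", "Global-commit"): ("COMMIT", "ACK"),
    ("READY", "Global-abort"): ("ABORT", "ACK"),
}


def step(fsm, state, event):
    """Return (next_state, output_event); raise KeyError if the move is illegal."""
    return fsm[(state, event)]
```

For example, `step(PARTICIPANT, "READY", "Global-commit")` yields the next state `COMMIT` together with the `ACK` output.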
Two-Phase Commit
• Failures
– Crash and omission
– Detect via timeouts
Coordinator Perspective
• Blocks in WAIT
– Participant may have
failed
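A coordinator waiting in WAIT can bound the wait with a timeout and decide to abort if a vote is missing. A minimal sketch, assuming votes arrive on a `queue.Queue`; the function name `collect_votes` and its parameters are hypothetical.

```python
import queue
import time


def collect_votes(inbox: queue.Queue, expected: int, timeout: float) -> str:
    """Wait for `expected` votes; abort on a timeout or on any abort vote."""
    deadline = time.monotonic() + timeout
    for _ in range(expected):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return "GLOBAL_ABORT"
        try:
            vote = inbox.get(timeout=remaining)
        except queue.Empty:
            # No vote in time: a participant may have failed.
            return "GLOBAL_ABORT"
        if vote == "VOTE_ABORT":
            return "GLOBAL_ABORT"
    return "GLOBAL_COMMIT"
```

Aborting is safe at this point because no participant can have committed before the coordinator sends a global commit.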
Coordinator Perspective
...
...
Participant Perspective
• Blocks in READY
– Coordinator may have
failed
• What to do?
– Some participants may
already have
committed…
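One commonly described answer is a cooperative termination protocol: the blocked participant P asks another participant Q for its state and acts on the reply. The sketch below encodes that decision table; the function name and the `ASK_ANOTHER` marker are hypothetical.

```python
def action_for_blocked_participant(q_state: str) -> str:
    """What P, blocked in READY, does after learning Q's state."""
    actions = {
        "COMMIT": "COMMIT",       # the global decision was commit
        "ABORT": "ABORT",         # the global decision was abort
        "INIT": "ABORT",          # Q has not voted, so nobody can have committed
        "READY": "ASK_ANOTHER",   # Q is just as uncertain as P
    }
    return actions[q_state]
```

If every reachable participant is in READY, no decision can be made and P stays blocked until the coordinator recovers.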
Participant Perspective
We know that coordinator
managed to start commit
Safe to abort?
Three-Phase Commit
• The states of the coordinator and each
participant satisfy the following two
conditions:
1. There is no single state from which it is
possible to make a transition directly to
either a COMMIT or an ABORT state.
2. There is no state in which it is not possible to
make a final decision, and from which a
transition to a COMMIT state can be made.
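The two conditions can be checked mechanically on a state graph. The sketch below compares a 2PC participant with a 3PC participant. The graphs are written from the usual textbook state machines, and treating READY as the state in which no final decision is possible is an assumption of this example; all names are hypothetical.

```python
def violates_condition_1(fsm: dict[str, set[str]]) -> set[str]:
    """States with direct transitions to both COMMIT and ABORT."""
    return {s for s, nxt in fsm.items() if {"COMMIT", "ABORT"} <= nxt}


def violates_condition_2(fsm: dict[str, set[str]], undecidable: set[str]) -> set[str]:
    """Undecidable states from which COMMIT is directly reachable."""
    return {s for s in undecidable if "COMMIT" in fsm.get(s, set())}


# state -> set of directly reachable states
PARTICIPANT_2PC = {
    "INIT": {"READY", "ABORT"},
    "READY": {"COMMIT", "ABORT"},
}
PARTICIPANT_3PC = {
    "INIT": {"READY", "ABORT"},
    "READY": {"PRECOMMIT", "ABORT"},
    "PRECOMMIT": {"COMMIT"},
}
# Assumption: READY is the state in which no final decision can be made.
UNDECIDABLE = {"READY"}
```

Under these assumptions READY breaks both conditions in 2PC, while the extra PRECOMMIT state in 3PC removes both violations.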
Recovery
• Backward recovery
– Bring the system backward to a correct, previous state
– Restore
• Forward recovery
– Bring the system forward to a correct, new state
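Backward recovery can be illustrated with an in-memory checkpoint that is restored after an error. This is a minimal sketch: the class name `CheckpointedState` is hypothetical, and a real system would write checkpoints to stable storage rather than keep them in a list.

```python
import copy


class CheckpointedState:
    def __init__(self, state: dict):
        self.state = state
        self._checkpoints: list[dict] = []

    def checkpoint(self) -> None:
        # Record a deep copy so later updates cannot alter the checkpoint.
        self._checkpoints.append(copy.deepcopy(self.state))

    def rollback(self) -> None:
        # Backward recovery: return to the most recent recorded state.
        if not self._checkpoints:
            raise RuntimeError("no checkpoint to restore")
        self.state = copy.deepcopy(self._checkpoints[-1])
```

A typical use is to call `checkpoint()` before a risky update and `rollback()` if the update leaves the state incorrect.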
Independent Checkpointing
• Figure: timelines of two processes, labelled a1 a2 a3 and b1 b2 b3.
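When processes checkpoint independently, recovery has to find a set of checkpoints that forms a consistent cut, and rollbacks can cascade (the domino effect). The sketch below computes such a recovery line by rollback propagation. The function name, the data layout, and the example timestamps are all hypothetical, and every process is assumed to have an initial checkpoint at time 0.

```python
def recovery_line(checkpoints, messages):
    """Find the latest consistent set of checkpoints.

    checkpoints: {process: sorted checkpoint times, starting with 0}
    messages:    list of (sender, send_time, receiver, receive_time),
                 with every receive_time greater than 0
    """
    # Start from the most recent checkpoint of every process.
    cut = {p: times[-1] for p, times in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        for sender, sent, receiver, received in messages:
            # Orphan message: its receipt is inside the cut, its send is not.
            if sent > cut[sender] and received <= cut[receiver]:
                # Roll the receiver back to before it received the message.
                cut[receiver] = max(
                    t for t in checkpoints[receiver] if t < received
                )
                changed = True
    return cut


# Hypothetical example in which every rollback triggers another one.
CHECKPOINTS = {"P": [0, 10, 20], "Q": [0, 15, 25]}
MESSAGES = [
    ("P", 22, "Q", 24),
    ("Q", 17, "P", 19),
    ("P", 12, "Q", 14),
    ("Q", 2, "P", 4),
]
```

In this example the rollbacks propagate until both processes are back at their initial state, which is the cost that coordinated checkpointing is meant to avoid.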
Coordinated Checkpointing
• Make sure that processes are synchronized when
doing the checkpoint
• Two-phase blocking protocol
• Ordering constraints?
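A two-phase blocking checkpoint can be sketched as follows: in phase 1 every process takes a local checkpoint and holds back outgoing application messages, and in phase 2 the coordinator tells everyone to resume. This is an in-memory illustration under those assumptions; the `Process` class, its method names, and the list used as a network are hypothetical.

```python
class Process:
    def __init__(self, name):
        self.name = name
        self.state = {}
        self.saved = None       # last local checkpoint
        self.blocked = False
        self.outbox = []        # application messages held back while blocked

    def on_checkpoint_request(self):
        self.saved = dict(self.state)   # shallow copy as the local checkpoint
        self.blocked = True             # stop sending until told to resume
        return "ACK"

    def send(self, msg, network):
        # While blocked, queue the message instead of delivering it.
        target = self.outbox if self.blocked else network
        target.append((self.name, msg))

    def on_checkpoint_done(self, network):
        self.blocked = False
        network.extend(self.outbox)     # release the queued messages
        self.outbox.clear()


def coordinated_checkpoint(processes, network):
    # Phase 1: request a checkpoint from every process and collect the ACKs.
    acks = [p.on_checkpoint_request() for p in processes]
    if not all(a == "ACK" for a in acks):
        return False
    # Phase 2: every process has checkpointed, so let them all resume.
    for p in processes:
        p.on_checkpoint_done(network)
    return True
```

Because no application message leaves a process between its checkpoint and the resume signal, no message can be received before a checkpoint yet sent after one, so the saved states form a consistent cut.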
Summary
• Looked at last pieces of fault tolerance
• Distributed commit
– 2PC: blocking; a participant in READY cannot decide on its own while the coordinator is down
– 3PC: non-blocking, but not widely used in practice
• Recovery
– Storage
– Checkpointing