CS 3700 Networks and Distributed Systems
Centralization
[Diagram: Bob sends debit_transaction(-$75) to a single central server, which updates his balance from $300 to $225 and replies OK; a later get_account_balance() returns $225.]
Advantages of centralization
Easy to set up and deploy
Consistency is guaranteed (assuming correct software implementation)
Shortcomings
No load balancing
Single point of failure
Sharding
[Diagram: accounts are partitioned by key range across two servers, <A-M> and <N-Z>. Bob's debit_account(-$75) and get_account_balance() requests are routed to the <A-M> server, which updates his balance from $300 to $225, replies OK, and then returns $225.]
Advantages of sharding
Better load balancing
If done intelligently, may allow incremental scalability
Shortcomings
Failures are still devastating
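The routing logic can be made concrete with a small sketch. The class and helper names below are illustrative (only debit_account and get_account_balance appear in the slides); it partitions accounts into the <A-M> and <N-Z> ranges by the first letter of the account name.

```python
# Hypothetical sketch of key-range sharding; ShardedBank and _shard_for are
# illustrative names, not from the slides.

class ShardedBank:
    def __init__(self):
        # Two independent shards, each holding a disjoint set of accounts.
        self.shards = {"A-M": {}, "N-Z": {}}

    def _shard_for(self, name):
        # Route by the first letter of the account name.
        return self.shards["A-M"] if name[0].upper() <= "M" else self.shards["N-Z"]

    def debit_account(self, name, amount):
        shard = self._shard_for(name)
        shard[name] = shard.get(name, 0) + amount   # amount is negative for a debit
        return "OK"

    def get_account_balance(self, name):
        return self._shard_for(name)[name]

bank = ShardedBank()
bank.shards["A-M"]["Bob"] = 300
bank.debit_account("Bob", -75)
print(bank.get_account_balance("Bob"))   # 225
```

Each shard only serves requests for its own key range, which is where the load-balancing benefit comes from; a failed shard, however, takes its entire key range offline.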
Replication
[Diagram: Bob's debit_account(-$75) is sent to every replica; with 100% agreement, each replica updates his balance from $300 to $225, the client receives OK, and a later get_account_balance() returns $225.]
Advantages of replication
[Diagram: the <A-M> shard is replicated across three servers, each holding Bob's balance ($300 → $225).]
Shortcomings
How do we maintain consistency?
Consistency Failures
[Diagram: an update reaches only some replicas. One replica never returns an ACK, a timeout fires, and too few replicas apply the write, so some replicas hold Bob at $300 while others hold $225: there is no agreement. Asynchronous networks are problematic.]
Byzantine Failures
[Diagram: one replica reports Bob's balance as $1000 while the others report $300, so there is no agreement. In some cases, replicas may be buggy or malicious.]
Challenges:
Many, many different failure modes
Theory tells us that these goals are impossible to achieve
(more on this later)
Forcing Consistency
[Diagram: Bob's debit_account(-$75) is applied by every replica and returns OK ($300 → $225). A later debit_account(-$50) is not confirmed by every replica (balances of $300, $225, and $175 appear across the replicas), so the system returns Error to Bob rather than let the replicas diverge.]
Motivating Transactions
[Diagram: transfer_money(Alice, Bob, $100) is implemented as debit_account(Alice, -$100) followed by debit_account(Bob, $100). If one call returns OK and the other returns Error, Alice ($600 → $500) and Bob ($300 → $400) are left in an inconsistent state: money is destroyed or created.]
Simple Transactions
transfer_money(Alice, Bob, $100) is implemented as:
begin_transaction()
debit_account(Alice, -$100)
debit_account(Bob, $100)
end_transaction()
At this point, if there haven't been any errors, we say the transaction is committed.
[Diagram: both debits are applied atomically (Alice: $600 → $500, Bob: $300 → $400) and the client receives OK.]
Simple Transactions
transfer_money(Alice, Bob, $100) again issues begin_transaction(), debit_account(Alice, -$100), and debit_account(Bob, $100), but this time a call returns Error.
[Diagram: the transaction aborts and Alice's tentative change ($600 → $500) is discarded, so Alice remains at $600 and Bob at $300.]
ACID Properties
Traditional transactional databases support the
following:
1. Atomicity: all or none; if transaction fails then no changes
are applied to the database
2. Consistency: there are no violations of database integrity
3. Isolation: partial results from incomplete transactions are
hidden
4. Durability: the effects of committed transactions are
permanent
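As a rough illustration of atomicity, isolation, and rollback, here is a minimal sketch in Python. The Transaction class and its commit method are hypothetical (the slides only name begin_transaction() and end_transaction()); it stages changes and applies them all or not at all.

```python
# Minimal sketch of an atomic transfer over an in-memory account store.
# The Transaction class and its method names are illustrative, not from the slides.

class Transaction:
    def __init__(self, accounts):
        self.accounts = accounts
        self.staged = {}                        # tentative balances, invisible to readers (isolation)

    def debit_account(self, name, amount):
        balance = self.staged.get(name, self.accounts[name]) + amount
        if balance < 0:
            raise ValueError("insufficient funds")
        self.staged[name] = balance

    def commit(self):
        self.accounts.update(self.staged)       # apply all staged changes at once (atomicity)

def transfer_money(accounts, src, dst, amount):
    txn = Transaction(accounts)                 # begin_transaction()
    try:
        txn.debit_account(src, -amount)
        txn.debit_account(dst, amount)
        txn.commit()                            # end_transaction(): the transaction is committed
        return "OK"
    except ValueError:
        return "Error"                          # abort: no changes are applied

accounts = {"Alice": 600, "Bob": 300}
print(transfer_money(accounts, "Alice", "Bob", 100), accounts)   # OK {'Alice': 500, 'Bob': 400}
print(transfer_money(accounts, "Alice", "Bob", 900), accounts)   # Error, balances unchanged
```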
2PC Example
[Diagram: the leader and Replicas 1-3 each hold value x, to be updated to y.]
Begin by distributing the update; the txid (678) is a logical clock.
Wait to receive "ready txid = 678" from all replicas.
Tell the replicas to commit ("commit txid = 678"); each replica replies "committed txid = 678".
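A minimal sketch of the coordinator (leader) side of 2PC, under the assumption that each replica exposes prepare, commit, and abort calls; these method names are illustrative stand-ins for the "ready" / "commit" / "committed" messages above, and the toy Replica class exists only to exercise the coordinator.

```python
# Sketch of a 2PC coordinator. Each replica is assumed to expose
# prepare(txid, value) -> bool, commit(txid), and abort(txid).

def two_phase_commit(replicas, txid, value):
    # Phase 1: distribute the update and wait for every replica to be ready.
    for replica in replicas:
        if not replica.prepare(txid, value):    # replica votes "ready" (True) or not
            for r in replicas:
                r.abort(txid)                   # any missing or negative vote aborts
            return "aborted"

    # Phase 2: everyone is ready, so tell all replicas to commit.
    # (In a real system the leader must retry lost commit messages until they succeed.)
    for replica in replicas:
        replica.commit(txid)
    return "committed"

class Replica:
    """Toy in-memory replica used only to exercise the coordinator sketch."""
    def __init__(self):
        self.value, self.pending = None, {}
    def prepare(self, txid, value):
        self.pending[txid] = value
        return True                             # vote "ready"
    def commit(self, txid):
        self.value = self.pending.pop(txid)
    def abort(self, txid):
        self.pending.pop(txid, None)

replicas = [Replica() for _ in range(3)]
print(two_phase_commit(replicas, 678, "y"), [r.value for r in replicas])
# committed ['y', 'y', 'y']
```

The blocking behavior discussed below follows directly from this structure: a replica that has answered prepare must wait for the leader's commit or abort decision.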
Failure Modes
Replica Failure
Before or during the initial promise phase
Before or during the commit
Leader Failure
Before receiving all promises
Before or during sending commits
Before receiving all committed messages
[Diagram: replica failure before or during the prepare phase. The leader distributes the update and waits for "ready txid = 678"; an abort happens if a write or a ready is dropped, a replica times out, or a replica returns an error. The leader then sends "aborted txid = 678" to the replicas.]
[Diagram: replica failure during the commit. All replicas send "ready txid = 678" and the leader sends "commit txid = 678", but Replica 3 fails before committing. The system is in a known inconsistent state, and the leader must keep retrying until all commits succeed.]
[Diagram: when Replica 3 reboots, it asks the leader about the status of the transaction ("stat txid = 678"), receives "commit txid = 678" again, applies y, and replies "committed txid = 678". Finally, the system is consistent and may proceed. Replicas attempt to resume unfinished transactions when they reboot.]
Leader Failure
What happens if the leader crashes?
Leader must constantly be writing its state to permanent
storage
It must pick up where it left off once it reboots
Allowing Progress
Key problem: what if the leader crashes and never
recovers?
By default, replicas block until contacted by the leader
Can the system make progress?
However, this only works if all the replicas are alive and
reachable
New Leader
[Diagram: the leader fails after the replicas send "ready txid = 678". Replica 2's timeout expires and it begins the recovery procedure, acting as the new leader; the transaction completes ("committed txid = 678") and the system is consistent again.]
Deadlock
[Diagram: the leader and Replica 3 both fail after "ready txid = 678". Replica 2's timeout expires and it begins the recovery procedure, but it cannot learn whether Replica 3 already committed: it cannot proceed, but it cannot abort either.]
Garbage Collection
2PC is somewhat of a misnomer: there is actually a third
phase
Garbage collection
2PC Summary
Message complexity: O(2n)
The good: guarantees consistency
The bad:
Write performance suffers if there are failures during the
commit phase
Does not scale gracefully (possible, but difficult to do)
A pure 2PC system blocks all writes if the leader fails
Smarter 2PC systems still block all writes if the leader + 1 replica fail
3PC Example
[Diagram: the leader begins by distributing the update and waits to receive "ready to commit" from all replicas. It then tells all replicas that everyone is ready to commit ("prepare txid = 678"), and each replica acknowledges ("prepared txid = 678"). At this point, all replicas are guaranteed to be up to date. Finally, the leader tells the replicas to commit ("commit txid = 678") and they reply "committed txid = 678".]
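For contrast with the 2PC sketch above, here is a rough coordinator-side sketch of 3PC with the extra "everyone is ready" phase. The replica method names (ready, prepare, commit, abort) are illustrative assumptions mirroring the message names in the diagram, and the toy Replica class is only for exercising the sketch.

```python
# Sketch of a 3PC coordinator. Each replica is assumed to expose
# ready(txid, value) -> bool, prepare(txid), commit(txid), and abort(txid).

def three_phase_commit(replicas, txid, value):
    # Phase 1: distribute the update, wait for "ready to commit" from all replicas.
    if not all(r.ready(txid, value) for r in replicas):
        for r in replicas:
            r.abort(txid)
        return "aborted"

    # Phase 2: tell all replicas that everyone is ready ("prepare txid = ...").
    # After this phase every replica knows the transaction can no longer be lost.
    for r in replicas:
        r.prepare(txid)

    # Phase 3: tell all replicas to commit.
    for r in replicas:
        r.commit(txid)
    return "committed"

class Replica:
    """Toy replica; tracks whether it has prepared before committing."""
    def __init__(self):
        self.value, self.pending, self.prepared = None, {}, set()
    def ready(self, txid, value):
        self.pending[txid] = value
        return True
    def prepare(self, txid):
        self.prepared.add(txid)
    def commit(self, txid):
        self.value = self.pending.pop(txid)
    def abort(self, txid):
        self.pending.pop(txid, None)

replicas = [Replica() for _ in range(3)]
print(three_phase_commit(replicas, 678, "y"), [r.value for r in replicas])
# committed ['y', 'y', 'y']
```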
Leader Failures
[Diagram: the leader begins by distributing the update and waits to receive "ready to commit" from all replicas, but fails before sending any prepare message. Replica 2's timeout expires and it begins the recovery procedure. Since no prepare was sent, Replica 3 cannot be in the committed state, so it is okay to abort; the system is consistent again.]
Leader Failures
[Diagram: the leader sends "prepare txid = 678" and the replicas reply "prepared txid = 678", but the leader fails before sending the commit. Replica 2's timeout expires and it begins the recovery procedure. Because prepare messages were delivered, all replicas must have been ready to commit, so the recovery can safely commit; the system is consistent again.]
Partitioning
[Diagram: the leader distributes the update (txid = 678, value = y) and receives "ready txid = 678", then the network partitions into two subnets. The leader assumes Replicas 2 and 3 have failed and moves on: it sends "prepare txid = 678" and "commit txid = 678" to Replica 1, which commits y. On the other side of the partition, leader recovery is initiated and Replicas 2 and 3 abort. The system is inconsistent.]
3PC Summary
Adds an additional phase vs. 2PC
Message complexity: O(3n)
Really four phases with garbage collection
A Moment of Reflection
Goals, revisited:
The system should be able to reach consensus
Consensus [n]: general agreement
Properties:
Agreement: all non-faulty processes ultimately choose the
same value
Either 0 or 1 in this case
Algorithm Sketch
Each replica maintains a map M of all known values
Initially, the map only contains the replica's own value
e.g. M = {replica1: 0}
Key insight: all replicas must be sure that all replicas that
did not crash have the same information (so they can
make the same decision)
Proof sketch, assuming f = 2
Worst case scenario is that replicas crash during rounds 1 and 2
During round 1, replica x crashes
All other replicas don't know if x is alive or dead
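A rough, locally simulated sketch of this flooding-style algorithm for crash failures. The round structure (f + 1 rounds of exchanging M, then deciding deterministically, here by taking the minimum value) follows the sketch above; the function and parameter names are illustrative.

```python
# Sketch of flooding consensus under crash failures, simulated in one process.
# Each replica keeps a map M of known (replica -> proposed value) pairs,
# broadcasts M every round, and merges what it receives. After f + 1 rounds,
# all replicas that never crashed hold the same M and decide deterministically.

def run_flooding_consensus(proposals, crash_round, f):
    # proposals: {replica: initial value}; crash_round: {replica: round in which it crashes}
    M = {r: {r: v} for r, v in proposals.items()}    # each replica knows only its own value
    alive = set(proposals)

    for rnd in range(1, f + 2):                      # f + 1 rounds of exchange
        alive -= {r for r in alive if crash_round.get(r) == rnd}   # crashes at start of round
        messages = [dict(M[r]) for r in alive]       # every live replica broadcasts its map M
        for r in alive:
            for msg in messages:
                M[r].update(msg)                     # merge everything received this round

    # After f + 1 rounds every surviving replica has the same M,
    # so a deterministic rule (here: the minimum value) yields agreement.
    return {r: min(M[r].values()) for r in alive}

print(run_flooding_consensus({"r1": 0, "r2": 1, "r3": 1}, crash_round={"r1": 1}, f=1))
# {'r2': 1, 'r3': 1}: r1 crashed before sending, so its value 0 never spread
```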
Impact of FLP
FLP proves that any fault-tolerant distributed algorithm
attempting to reach consensus has runs that never
terminate
These runs are extremely unlikely (probability zero)
Yet they imply that we can't find a totally correct solution
And so consensus is "impossible" (i.e., not always possible)
CAP Examples
[Diagram: A+P system. During a partition, a write of (key, 2) succeeds on the reachable replica but cannot be replicated across the partition, so a read on the other side still returns (key, 1). Impact of partitions: the system remains available but is not consistent.]
[Diagram: C+P system. During the same partition, the replicas refuse requests they cannot keep consistent and return "Error: Service Unavailable". Impact of partitions: the system remains consistent but loses availability.]
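A toy sketch contrasting the two behaviors above. The Replica class, its mode flag, and the reachable_peers parameter are illustrative assumptions, not from the slides.

```python
# Toy sketch contrasting A+P and C+P behavior during a partition.
# A replica either serves its local (possibly stale) copy, or refuses to
# answer when it cannot reach enough peers to guarantee consistency.

class Replica:
    def __init__(self, mode, total_replicas=3):
        self.mode = mode                      # "A+P" or "C+P"
        self.total = total_replicas
        self.store = {}

    def read(self, key, reachable_peers):
        # C+P: only answer when a majority of replicas (self + peers) is reachable.
        if self.mode == "C+P" and reachable_peers + 1 <= self.total // 2:
            return "Error: Service Unavailable"
        # A+P: always answer from the local copy, even if it may be stale.
        return self.store.get(key)

ap, cp = Replica("A+P"), Replica("C+P")
ap.store["key"] = 1                           # stale value; (key, 2) was written elsewhere
cp.store["key"] = 1
print(ap.read("key", reachable_peers=0))      # 1 (available, but not consistent)
print(cp.read("key", reachable_peers=0))      # Error: Service Unavailable
```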
C or A: Choose 1
Taken to the extreme, CAP suggests a binary division in
distributed systems
Your system is consistent or available
[Diagram: a spectrum of systems, from those where information (e.g. financial data) must always be correct, to those that must be always available, with many systems attempting to balance correctness and availability.]
Quorum Systems
In law, a quorum is the minimum number of members of
a deliberative body necessary to conduct the business
of that group
When quorum is not met, a legislative body cannot hold a
vote, and cannot change the status quo
E.g. Imagine if only 1 senator showed up to vote in the Senate
Advantages of Quorums
[Diagram: a write goes to a quorum of replicas, each of which stores Bob's new balance together with a logical timestamp (e.g. $300 at ts 1, $400 at ts 2, $375 at ts 3). A read also contacts a quorum of replicas and returns the value with the highest timestamp, so it observes the most recent committed write even though some replicas are stale.]
Challenges:
1. Ensuring that at least a quorum of replicas commit each update
2. Ensuring that updates have the correct logical ordering

Timestamp  Balance
1          $300
2          $400
3          $375
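A small sketch of quorum reads and writes, assuming N replicas and quorum sizes chosen so that R + W > N, which makes every read quorum overlap every write quorum. The class and parameter names are illustrative.

```python
# Sketch of a quorum-replicated register. With N replicas and quorum sizes
# chosen so that R + W > N, every read quorum overlaps every write quorum,
# so a read always sees the highest-timestamp committed write.

import random

class QuorumStore:
    def __init__(self, n=5, w=3, r=3):
        assert r + w > n, "read and write quorums must overlap"
        self.n, self.w, self.r = n, w, r
        self.replicas = [{"ts": 0, "value": None} for _ in range(n)]

    def write(self, value, ts):
        # Apply the update, tagged with a logical timestamp, to W replicas.
        for replica in random.sample(self.replicas, self.w):
            if ts > replica["ts"]:
                replica["ts"], replica["value"] = ts, value

    def read(self):
        # Contact R replicas and return the value with the highest timestamp.
        responses = random.sample(self.replicas, self.r)
        return max(responses, key=lambda rep: rep["ts"])["value"]

store = QuorumStore()
store.write("$300", ts=1)
store.write("$400", ts=2)
store.write("$375", ts=3)
print(store.read())   # always "$375", since read and write quorums intersect
```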
Paxos
Replication protocol that ensures a global ordering of
updates
All writes into the system are ordered in logical time
Replicas agree on the order of committed writes
History of Paxos
Developed by Turing award winner Leslie Lamport
First published as a tech report in 1989
The journal refused to publish it; nobody understood the protocol
Paxos at a High-Level
1. Replicas elect a leader and agree on the view number
The view is a logical clock that divides time into epochs
During each view, there is a single leader
View Selection
Prepare/Promise
[Diagram: all replicas' logical clocks are at 13. The leader sends "prepare view=5, clock=13", and each replica replies "promise view=5, clock=13".]
Commit/Accept
[Diagram: a client sends a write of value x to the leader. The leader sends "accept clock=14"; the replicas advance from clock 13 to 14 and tentatively store x. Once a quorum has accepted, the leader replies OK to the client and sends "commit clock=14" so the replicas commit x.]
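A heavily simplified acceptor-side sketch of the prepare/promise and accept rules: a replica promises a view and then refuses messages from lower views. The names are illustrative, and most of real Paxos (leader election, multiple log slots, reconciliation after failures) is omitted.

```python
# Very simplified sketch of a Paxos-style acceptor (replica).
# It promises a view on prepare and accepts a value only if it has not
# already promised a higher view.

class Acceptor:
    def __init__(self):
        self.promised_view = 0          # highest view this replica has promised
        self.accepted = None            # (view, clock, value) of the last accepted write

    def prepare(self, view):
        # Promise not to accept anything from lower views; report prior accepts
        # so a new leader can re-propose any uncommitted value.
        if view >= self.promised_view:
            self.promised_view = view
            return ("promise", view, self.accepted)
        return ("reject", self.promised_view, None)

    def accept(self, view, clock, value):
        # Accept only if no higher view has been promised in the meantime.
        if view >= self.promised_view:
            self.promised_view = view
            self.accepted = (view, clock, value)
            return "accepted"
        return "rejected"

a = Acceptor()
print(a.prepare(5))           # ('promise', 5, None)
print(a.accept(5, 14, "x"))   # 'accepted'
print(a.prepare(4))           # ('reject', 5, None): a stale leader is refused
```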
Paxos Review
Failure Modes
Bad Commit
What happens if a quorum does not accept a commit?
[Diagram: the "accept clock=14" and "commit clock=14" messages reach only some replicas; the leader re-sends them until a quorum of replicas has advanced from clock 13 to 14 and stored x.]
Partitions (1)
Partitions (2)
What happens during a partition?
What happens if there is an uncommitted update with no quorum?
[Diagram: if the new leader is unaware of the uncommitted update, it runs "prepare clock=13" / "promise clock=13" and then announces a new update with clock=14; the old "commit clock=14" is rejected. Replica 3 is desynchronized and must reconcile with another replica.]
[Diagram: if the new leader is aware of the uncommitted update, it must recommit the original clock=14 update.]
What happens if there is an uncommitted update with a quorum?
[Diagram: the uncommitted clock=14 update has already reached a quorum of replicas, so the new leader learns of it during the prepare phase and recommits it.]
Garbage collection
Replicas need to remember the exact history of updates, in case the
leader changes
Periodically, the lists need to be garbage collected
Byzantine Distributed Systems
Goals
1. All loyal lieutenants obey the same order
2. If the commanding general is loyal, then every loyal
lieutenant obeys the order he sends
Alternatives to Quorums
Gossip protocols
Replicas periodically, randomly exchange state with each
other
No strong consistency guarantees, but:
Surprisingly fast and reliable convergence to up-to-date state
Requires vector clocks or better in order to causally order
events
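A toy sketch of an anti-entropy gossip exchange: each replica periodically picks a random peer and both keep the newer entry per key. It uses a single per-key version counter instead of vector clocks for brevity, so it does not capture causal ordering; all names are illustrative.

```python
# Toy sketch of anti-entropy gossip. Each replica holds {key: (version, value)}
# and periodically exchanges state with a random peer, keeping the newer entry.
# Real systems need vector clocks (or better) to order updates causally.

import random

class GossipReplica:
    def __init__(self, name):
        self.name = name
        self.state = {}                        # key -> (version, value)

    def put(self, key, value):
        version = self.state.get(key, (0, None))[0] + 1
        self.state[key] = (version, value)

    def merge(self, other_state):
        for key, (version, value) in other_state.items():
            if version > self.state.get(key, (0, None))[0]:
                self.state[key] = (version, value)

    def gossip_with(self, peer):
        # Push-pull exchange: both sides end up with the newer entry per key.
        peer.merge(self.state)
        self.merge(peer.state)

replicas = [GossipReplica(f"r{i}") for i in range(5)]
replicas[0].put("Bob", "$225")                 # one replica learns the new balance

for _ in range(10):                            # a few random rounds spread the update
    a, b = random.sample(replicas, 2)
    a.gossip_with(b)

print([r.state.get("Bob") for r in replicas])  # most or all replicas now hold (1, '$225')
```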