Distributed Transaction Management Slides
Distributed Deadlocks
With distributed 2PL, a deadlock can occur between transactions
executing at different sites.
To illustrate this, suppose the relation
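Such a deadlock can be detected by forming the union of the sites' local waits-for graphs and checking the result for a cycle. A minimal sketch, with illustrative site and transaction names:

```python
# Minimal sketch: detect a distributed deadlock by forming the union
# of the local waits-for graphs and searching the result for a cycle.
# Edge (a, b) means "transaction a waits for a lock held by b".

def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)

    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in graph}

    def visit(node):
        colour[node] = GREY
        for succ in graph.get(node, ()):
            if colour.get(succ, WHITE) == GREY:   # back edge: cycle found
                return True
            if colour.get(succ, WHITE) == WHITE and visit(succ):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in list(graph))

# Illustrative local graphs: T1 waits for T2 at site A, T2 waits for
# T1 at site B -- no cycle at either site alone, but one globally.
site_a = [("T1", "T2")]
site_b = [("T2", "T1")]
print(has_cycle(site_a))            # False: no local deadlock
print(has_cycle(site_a + site_b))   # True: distributed deadlock
```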
Distributed Commit
Once a global transaction has completed all its operations, the
ACID properties require that it be made durable when it commits.
This means that the LTMs participating in the execution of the
transaction must either all commit or all abort their
sub-transactions.
The most common protocol for ensuring distributed atomic
commitment is the two-phase commit (2PC) protocol. It
involves two phases:
1 A voting phase, in which the coordinator asks every participant
to prepare its sub-transaction and vote commit or abort.
2 A decision phase, in which the coordinator commits the
transaction if all participants voted commit, and aborts it
otherwise, informing all participants of the decision.
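A minimal sketch of the coordinator's side of 2PC, assuming a hypothetical participant interface with prepare(), commit() and abort() calls (a real system would also write a log record before each phase so it can recover from crashes):

```python
# Minimal 2PC coordinator sketch. The Participant interface is
# hypothetical; network calls are modelled as plain method calls.

class Participant:
    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit

    def prepare(self):          # phase 1: vote
        return self.will_commit

    def commit(self):           # phase 2: decision
        print(f"{self.name}: committed")

    def abort(self):
        print(f"{self.name}: aborted")

def two_phase_commit(participants):
    # Phase 1 (voting): collect a vote from every participant.
    votes = [p.prepare() for p in participants]

    # Phase 2 (decision): commit only if the vote was unanimous.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"

print(two_phase_commit([Participant("site1"), Participant("site2")]))
```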
Distributed Consensus
The possibility of blocking in 2PC may have a negative impact on
performance. Blocking can be avoided by using the idea of
fault-tolerant distributed consensus.
Two widely used protocols for distributed consensus are Paxos and
Raft. For example, Google’s Spanner uses 2PC with Paxos.
Failure of 2PC participants could make data unavailable, in the
absence of replication. Distributed consensus can also be used to
keep replicas of a data item in a consistent state (see later).
“Soft state” refers to the fact that there may not be a single
well-defined database state, with different replicas of the same
data item having different values.
“Eventual consistency” guarantees that, once the partitioning
failures are repaired, eventually all replicas will become consistent
with each other. This may not be fully achievable by the database
system itself and may need application-level code to resolve some
inconsistencies (see later).
Many NoSQL systems do not aim to provide Consistency at all
times, aiming instead for Eventual Consistency.
Replication Protocols
With the distributed locking and commit protocols described
earlier, all data replicas are updated as part of the same global
transaction — known as eager or synchronous replication.
However, many systems, including some relational ones, support
replication with a weaker form of consistency.
A common approach is for the DBMS to update just one ‘primary’
copy of a database object (leader-based replication), and to
propagate updates to the rest of the copies (followers) afterwards
— known as lazy or asynchronous replication.
Whenever the leader writes new data to its local storage, it also
sends the change to all of its followers as part of a replication log.
Each follower takes the log and updates its local copy of the data.
When a client wants to read from the database, it can query either
the leader or any of the followers. However, writes are only
accepted on the leader.
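A minimal sketch of this flow, with in-memory nodes and a direct call standing in for the real network (all names are illustrative):

```python
# Minimal sketch of leader-based replication: the leader applies a
# write locally, appends it to a replication log, and each follower
# replays that log entry against its own copy of the data.

class Follower:
    def __init__(self):
        self.data = {}

    def apply(self, log_entry):
        key, value = log_entry
        self.data[key] = value

class Leader:
    def __init__(self, followers):
        self.data = {}
        self.log = []
        self.followers = followers

    def write(self, key, value):        # writes only go to the leader
        self.data[key] = value          # 1. update local storage
        self.log.append((key, value))   # 2. append to replication log
        for f in self.followers:       # 3. propagate to followers
            f.apply((key, value))      #    (here: a plain call)

followers = [Follower(), Follower()]
leader = Leader(followers)
leader.write("x", 42)
# Reads may go to the leader or to any follower.
print(leader.data["x"], followers[0].data["x"], followers[1].data["x"])
```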
Leader failure
The process of recovering from leader failure is called failover. It
comprises 3 steps:
1 Detect that the leader has failed. This is usually done using
timeouts, with the nodes frequently exchanging heartbeat
messages (see the sketch after this list).
2 Elect a new leader. Getting all the nodes to agree on a new
leader is an application of the distributed consensus problem.
3 Reconfigure the system to use the new leader. When the old
leader comes back online, it needs to become a follower.
There are many subtle problems which can occur in this process,
e.g., the new leader may not have received all writes from the old
leader before it failed (see Kleppmann’s book for more examples).
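As a sketch of step 1, a node can suspect the leader once no heartbeat has arrived within some timeout; the timeout value and clock source below are illustrative:

```python
# Minimal sketch of timeout-based failure detection: a follower
# records when it last heard from the leader and suspects a failure
# once that timestamp is older than a chosen timeout.

import time

class FailureDetector:
    def __init__(self, timeout_seconds=5.0):
        self.timeout = timeout_seconds
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self):
        self.last_heartbeat = time.monotonic()

    def leader_suspected(self):
        return time.monotonic() - self.last_heartbeat > self.timeout

detector = FailureDetector(timeout_seconds=0.1)
detector.on_heartbeat()
time.sleep(0.2)                       # no heartbeats arrive in this window
print(detector.leader_suspected())    # True: trigger a leader election
```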
Multi-Leader Replication
Sometimes it is advantageous to have multiple leaders, say in an
environment where there are multiple data centres, in which case
having one leader per data centre makes sense.
The biggest problem with multi-leader replication is that write
conflicts can occur, i.e., the same data can be modified in different
ways at two leaders. In this case, some form of conflict resolution
is needed.
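One simple (but lossy) resolution policy is last-write-wins, where each write carries a timestamp and the version with the highest timestamp is kept. A sketch, with illustrative timestamps and values:

```python
# Minimal sketch of last-write-wins (LWW) conflict resolution for
# multi-leader replication. Note that LWW silently discards one of
# the conflicting writes.

def resolve_lww(version_a, version_b):
    """Each version is a (timestamp, value) pair."""
    return version_a if version_a[0] >= version_b[0] else version_b

# The same key is modified concurrently at two leaders:
leader1_write = (1700000000.1, "alice@example.com")
leader2_write = (1700000000.7, "alice@work.example")

print(resolve_lww(leader1_write, leader2_write))  # the later write wins
```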
Leaderless Replication
In leaderless replication, there is no leader. Instead, writes are sent
to some number of replicas. Amazon’s original Dynamo system
used leaderless replication, as do Cassandra and Riak, for example.
To solve the problem that some nodes may have been down when
an update was applied, read requests are also sent to several nodes
in parallel. Version numbers are used to determine the most recent
value if different results are received.
Version numbers are also used to detect write conflicts.
Riak
In Riak, you can choose
how many nodes to replicate each key-value pair to (N)
the number of nodes read from before returning (R)
the number of nodes written to before the write is considered
successful (W)
where R ≤ N and W ≤ N.
When R > 1, the “most recent” item is returned. This is
determined using a logical clock, known as a “vector clock”,
maintained according to the version-vector scheme.
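A minimal sketch of a quorum read under these parameters, assuming each replica stores a (version, value) pair and using plain integer versions rather than Riak's actual vector clocks:

```python
# Minimal sketch of a leaderless quorum read: contact R of the N
# replicas and return the value with the highest version number.

def quorum_read(replicas, r):
    """Each replica is a (version, value) pair; contact the first r."""
    responses = replicas[:r]              # stand-in for r parallel RPCs
    return max(responses, key=lambda pair: pair[0])

# N = 3 replicas; one node missed the latest update (version 1 vs 2).
replicas = [(2, "new"), (1, "old"), (2, "new")]
print(quorum_read(replicas, r=2))         # (2, 'new'): most recent wins
```

When W + R > N, every read quorum overlaps every write quorum, so at least one of the contacted replicas is guaranteed to hold the latest successful write.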
Version-vector scheme
Assume that a data item d is replicated on n different nodes.
Then each node i stores an array Vi[1, ..., n] of version
numbers for d.
When node i updates d, it increments the version number
Vi[i] by one.
Its update and version vector are propagated to the other
nodes on which d is replicated.
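A minimal sketch of this scheme, using zero-based node indices and a plain method call in place of real propagation:

```python
# Minimal sketch of the version-vector scheme: each of the n replicas
# of a data item d keeps a vector of n version numbers and increments
# its own slot on every local update.

class Replica:
    def __init__(self, node_id, n):
        self.node_id = node_id        # this node's index i (zero-based)
        self.vector = [0] * n         # Vi[1..n] in the slides' notation

    def local_update(self):
        # When node i updates d, it increments Vi[i] by one.
        self.vector[self.node_id] += 1

    def receive(self, other_vector):
        # Merge a propagated vector: keep the element-wise maximum.
        self.vector = [max(a, b) for a, b in zip(self.vector, other_vector)]

# Three replicas of d; node 0 creates d, giving V1 = [1, 0, 0].
replicas = [Replica(i, 3) for i in range(3)]
replicas[0].local_update()
for r in replicas[1:]:
    r.receive(replicas[0].vector)
print([r.vector for r in replicas])   # [[1, 0, 0], [1, 0, 0], [1, 0, 0]]
```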
Example
Suppose d is replicated at nodes 1, 2 and 3.
If d is initially created at 1, the version vector V1 would be [1, 0, 0].
When this update is propagated to nodes 2 and 3, their version
vectors V2 and V3 would each be [1, 0, 0].
If d is then updated at node 2, V2 becomes [1, 1, 0].
Assume R = 2 and item d is now read from nodes 1 and 2 before
the update from 2 has been propagated.
V1 is [1, 0, 0] while V2 is [1, 1, 0], so V2 is more recent, and d from
node 2 is used.
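This comparison can be replayed directly: V2 dominates V1 because it is at least as large in every slot. A self-contained sketch of the dominance test:

```python
# Replaying the example: V1 = [1, 0, 0] from node 1, V2 = [1, 1, 0]
# from node 2. V2 dominates V1 (>= in every slot), so node 2's copy
# of d is the more recent one and is the value returned to the reader.

def dominates(v, w):
    return all(a >= b for a, b in zip(v, w))

V1 = [1, 0, 0]
V2 = [1, 1, 0]
print(dominates(V2, V1))   # True: V2 is more recent
print(dominates(V1, V2))   # False
```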
Example
Recall the example where d is replicated at nodes 1, 2 and 3.
If d is initially created at 1, the version vector V1 would be [1, 0, 0].
Say this update is only propagated to 2 where it is then updated,
giving V2 = [1, 1, 0].
Suppose now that this version of d is replicated to 3, and then
both 2 and 3 concurrently update d.
Then, V2 would be [1, 2, 0], while V3 would be [1, 1, 1].
These vectors conflict, since neither dominates the other:
V2[2] = 2 while V3[2] = 1, whereas V2[3] = 0 while V3[3] = 1.
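Because neither vector dominates the other, the two updates are recognised as concurrent, which is exactly how the write conflict is detected. A self-contained check:

```python
# Replaying the conflicting example: V2 = [1, 2, 0] and V3 = [1, 1, 1].
# Neither vector dominates the other, so the updates were concurrent
# and the write conflict must be resolved explicitly.

def dominates(v, w):
    return all(a >= b for a, b in zip(v, w))

V2 = [1, 2, 0]
V3 = [1, 1, 1]
conflict = not dominates(V2, V3) and not dominates(V3, V2)
print(conflict)   # True: concurrent updates, conflict detected
```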