Distributed Transaction Management Slides
Distributed Deadlocks
With distributed 2PL, a deadlock can occur between transactions
executing at different sites.
To illustrate this, suppose the relation
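Such a deadlock can be detected by forming the union of the sites' local waits-for graphs and checking the result for a cycle. A minimal sketch, with illustrative site and transaction names:

```python
# Minimal sketch: detect a distributed deadlock by forming the union
# of the local waits-for graphs and searching the result for a cycle.
# Edge (a, b) means "transaction a waits for a lock held by b".

def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)

    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in graph}

    def visit(node):
        colour[node] = GREY
        for succ in graph.get(node, ()):
            if colour.get(succ, WHITE) == GREY:   # back edge: cycle found
                return True
            if colour.get(succ, WHITE) == WHITE and visit(succ):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in list(graph))

# Illustrative local graphs: T1 waits for T2 at site A, T2 waits for
# T1 at site B -- no cycle at either site alone, but one globally.
site_a = [("T1", "T2")]
site_b = [("T2", "T1")]
print(has_cycle(site_a))            # False: no local deadlock
print(has_cycle(site_a + site_b))   # True: distributed deadlock
```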
Distributed Commit
Once a global transaction has completed all its operations, the
ACID properties require that it be made durable when it commits.
This means that the LTMs participating in the execution of the
transaction must either all commit or all abort their
sub-transactions.
The most common protocol for ensuring distributed atomic
commitment is the two-phase commit (2PC) protocol. It
involves two phases:
1 A voting phase, in which the coordinator asks every participant
to prepare its sub-transaction and vote commit or abort.
2 A decision phase, in which the coordinator commits the
transaction if all participants voted commit, and aborts it
otherwise, informing all participants of the decision.
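A minimal sketch of the coordinator's side of 2PC, assuming a hypothetical participant interface with prepare(), commit() and abort() calls (a real system would also write a log record before each phase so it can recover from crashes):

```python
# Minimal 2PC coordinator sketch. The Participant interface is
# hypothetical; network calls are modelled as plain method calls.

class Participant:
    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit

    def prepare(self):          # phase 1: vote
        return self.will_commit

    def commit(self):           # phase 2: decision
        print(f"{self.name}: committed")

    def abort(self):
        print(f"{self.name}: aborted")

def two_phase_commit(participants):
    # Phase 1 (voting): collect a vote from every participant.
    votes = [p.prepare() for p in participants]

    # Phase 2 (decision): commit only if the vote was unanimous.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"

print(two_phase_commit([Participant("site1"), Participant("site2")]))
```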
Distributed Consensus
The possibility of blocking in 2PC may have a negative impact on
performance. Blocking can be avoided by using the idea of
fault-tolerant distributed consensus.
Two widely used protocols for distributed consensus are Paxos and
Raft. For example, Google’s Spanner uses 2PC with Paxos.
Failure of 2PC participants could make data unavailable, in the
absence of replication. Distributed consensus can also be used to
keep replicas of a data item in a consistent state (see later).
“Soft state” refers to the fact that there may not be a single
well-defined database state, with different replicas of the same
data item having different values.
“Eventual consistency” guarantees that, once the partitioning
failures are repaired, eventually all replicas will become consistent
with each other. This may not be fully achievable by the database
system itself and may need application-level code to resolve some
inconsistencies (see later).
Many NoSQL systems do not aim to provide Consistency at all
times, aiming instead for Eventual Consistency.
Replication Protocols
With the distributed locking and commit protocols described
earlier, all data replicas are updated as part of the same global
transaction — known as eager or synchronous replication.
However, many systems, including some relational ones, support
replication with a weaker form of consistency.
A common approach is for the DBMS to update just one ‘primary’
copy of a database object (leader-based replication), and to
propagate updates to the rest of the copies (followers) afterwards
— known as lazy or asynchronous replication.
Whenever the leader writes new data to its local storage, it also
sends the change to all of its followers as part of a replication log.
Each follower takes the log and updates its local copy of the data.
When a client wants to read from the database, it can query either
the leader or any of the followers. However, writes are only
accepted on the leader.
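A minimal sketch of this flow, with in-memory nodes and a direct call standing in for the real network (all names are illustrative):

```python
# Minimal sketch of leader-based replication: the leader applies a
# write locally, appends it to a replication log, and each follower
# replays that log entry against its own copy of the data.

class Follower:
    def __init__(self):
        self.data = {}

    def apply(self, log_entry):
        key, value = log_entry
        self.data[key] = value

class Leader:
    def __init__(self, followers):
        self.data = {}
        self.log = []
        self.followers = followers

    def write(self, key, value):        # writes only go to the leader
        self.data[key] = value          # 1. update local storage
        self.log.append((key, value))   # 2. append to replication log
        for f in self.followers:       # 3. propagate to followers
            f.apply((key, value))      #    (here: a plain call)

followers = [Follower(), Follower()]
leader = Leader(followers)
leader.write("x", 42)
# Reads may go to the leader or to any follower.
print(leader.data["x"], followers[0].data["x"], followers[1].data["x"])
```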
Leader failure
The process of recovering from leader failure is called failover. It
comprises 3 steps:
1 Detect that the leader has failed. This is usually done using
timeouts, with the nodes frequently exchanging heartbeat
messages (see the sketch after this list).
2 Elect a new leader. Getting all the nodes to agree on a new
leader is an application of the distributed consensus problem.
3 Reconfigure the system to use the new leader. When the old
leader comes back online, it needs to become a follower.
There are many subtle problems which can occur in this process,
e.g., the new leader may not have received all writes from the old
leader before it failed (see Kleppmann’s book for more examples).
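As a sketch of step 1, a node can suspect the leader once no heartbeat has arrived within some timeout; the timeout value and clock source below are illustrative:

```python
# Minimal sketch of timeout-based failure detection: a follower
# records when it last heard from the leader and suspects a failure
# once that timestamp is older than a chosen timeout.

import time

class FailureDetector:
    def __init__(self, timeout_seconds=5.0):
        self.timeout = timeout_seconds
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self):
        self.last_heartbeat = time.monotonic()

    def leader_suspected(self):
        return time.monotonic() - self.last_heartbeat > self.timeout

detector = FailureDetector(timeout_seconds=0.1)
detector.on_heartbeat()
time.sleep(0.2)                       # no heartbeats arrive in this window
print(detector.leader_suspected())    # True: trigger a leader election
```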
Multi-Leader Replication
Sometimes it is advantageous to have multiple leaders, say in an
environment where there are multiple data centres, in which case
having one leader per data centre makes sense.
The biggest problem with multi-leader replication is that write
conflicts can occur, i.e., the same data can be modified in different
ways at two leaders. In this case, some form of conflict resolution
is needed.
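One simple (but lossy) resolution policy is last-write-wins, where each write carries a timestamp and the version with the highest timestamp is kept. A sketch, with illustrative timestamps and values:

```python
# Minimal sketch of last-write-wins (LWW) conflict resolution for
# multi-leader replication. Note that LWW silently discards one of
# the conflicting writes.

def resolve_lww(version_a, version_b):
    """Each version is a (timestamp, value) pair."""
    return version_a if version_a[0] >= version_b[0] else version_b

# The same key is modified concurrently at two leaders:
leader1_write = (1700000000.1, "alice@example.com")
leader2_write = (1700000000.7, "alice@work.example")

print(resolve_lww(leader1_write, leader2_write))  # the later write wins
```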
Leaderless Replication
In leaderless replication, there is no leader. Instead, writes are sent
to some number of replicas. Amazon’s original Dynamo system
used leaderless replication, as do Cassandra and Riak, for example.
To solve the problem that some nodes may have been down when
an update was applied, read requests are also sent to several nodes
in parallel. Version numbers are used to determine the most recent
value if different results are received.
Version numbers are also used to detect write conflicts.
Riak
In Riak, you can choose
how many nodes to replicate each key-value pair to (N)
the number of nodes read from before returning (R)
the number of nodes written to before the write is considered
successful (W)
where R ≤ N and W ≤ N.
When R > 1, the “most recent” item is returned. This is
determined using a logical clock, known as a “vector clock”,
maintained according to the version-vector scheme.
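A minimal sketch of a quorum read under these parameters, assuming each replica stores a (version, value) pair and using plain integer versions rather than Riak's actual vector clocks:

```python
# Minimal sketch of a leaderless quorum read: contact R of the N
# replicas and return the value with the highest version number.

def quorum_read(replicas, r):
    """Each replica is a (version, value) pair; contact the first r."""
    responses = replicas[:r]              # stand-in for r parallel RPCs
    return max(responses, key=lambda pair: pair[0])

# N = 3 replicas; one node missed the latest update (version 1 vs 2).
replicas = [(2, "new"), (1, "old"), (2, "new")]
print(quorum_read(replicas, r=2))         # (2, 'new'): most recent wins
```

When W + R > N, every read quorum overlaps every write quorum, so at least one of the contacted replicas is guaranteed to hold the latest successful write.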
Version-vector scheme
Assume that a data item d is replicated on n different nodes.
Then each node i stores an array Vi[1, ..., n] of version
numbers for d.
When node i updates d, it increments the version number
Vi[i] by one.
Its update and version vector are propagated to the other
nodes on which d is replicated.
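A minimal sketch of this scheme, using zero-based node indices and a plain method call in place of real propagation:

```python
# Minimal sketch of the version-vector scheme: each of the n replicas
# of a data item d keeps a vector of n version numbers and increments
# its own slot on every local update.

class Replica:
    def __init__(self, node_id, n):
        self.node_id = node_id        # this node's index i (zero-based)
        self.vector = [0] * n         # Vi[1..n] in the slides' notation

    def local_update(self):
        # When node i updates d, it increments Vi[i] by one.
        self.vector[self.node_id] += 1

    def receive(self, other_vector):
        # Merge a propagated vector: keep the element-wise maximum.
        self.vector = [max(a, b) for a, b in zip(self.vector, other_vector)]

# Three replicas of d; node 0 creates d, giving V1 = [1, 0, 0].
replicas = [Replica(i, 3) for i in range(3)]
replicas[0].local_update()
for r in replicas[1:]:
    r.receive(replicas[0].vector)
print([r.vector for r in replicas])   # [[1, 0, 0], [1, 0, 0], [1, 0, 0]]
```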
Example
Suppose d is replicated at nodes 1, 2 and 3.
If d is initially created at 1, the version vector V1 would be [1, 0, 0].
When this update is propagated to nodes 2 and 3, their version
vectors V2 and V3 would each be [1, 0, 0].
If d is then updated at node 2, V2 becomes [1, 1, 0].
Assume R = 2 and item d is now read from nodes 1 and 2 before
the update from 2 has been propagated.
V1 is [1, 0, 0] while V2 is [1, 1, 0], so V2 is more recent, and d from
node 2 is used.
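This comparison can be replayed directly: V2 dominates V1 because it is at least as large in every slot. A self-contained sketch of the dominance test:

```python
# Replaying the example: V1 = [1, 0, 0] from node 1, V2 = [1, 1, 0]
# from node 2. V2 dominates V1 (>= in every slot), so node 2's copy
# of d is the more recent one and is the value returned to the reader.

def dominates(v, w):
    return all(a >= b for a, b in zip(v, w))

V1 = [1, 0, 0]
V2 = [1, 1, 0]
print(dominates(V2, V1))   # True: V2 is more recent
print(dominates(V1, V2))   # False
```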
Example
Recall the example where d is replicated at nodes 1, 2 and 3.
If d is initially created at 1, the version vector V1 would be [1, 0, 0].
Say this update is only propagated to 2 where it is then updated,
giving V2 = [1, 1, 0].
Suppose now that this version of d is replicated to 3, and then
both 2 and 3 concurrently update d.
Then, V2 would be [1, 2, 0], while V3 would be [1, 1, 1].
These vectors conflict, since neither dominates the other:
V2[2] = 2 while V3[2] = 1, whereas V2[3] = 0 while V3[3] = 1.
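Because neither vector dominates the other, the two updates are recognised as concurrent, which is exactly how the write conflict is detected. A self-contained check:

```python
# Replaying the conflicting example: V2 = [1, 2, 0] and V3 = [1, 1, 1].
# Neither vector dominates the other, so the updates were concurrent
# and the write conflict must be resolved explicitly.

def dominates(v, w):
    return all(a >= b for a, b in zip(v, w))

V2 = [1, 2, 0]
V3 = [1, 1, 1]
conflict = not dominates(V2, V3) and not dominates(V3, V2)
print(conflict)   # True: concurrent updates, conflict detected
```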