Chap. 6 Consistency & Replication: Distributed Systems

BIT3263 | Distributed Systems
Prepared by Noris Bt. Ismail
FACULTY OF INFORMATION & COMMUNICATION TECHNOLOGY

Objectives
• Reasons for replication
• Relationship between replication and scalability
• Consistency of replicated data
• Managing replicas (placement of replicas and content distribution)
• Various ways that consistency can be achieved

ALL RIGHTS RESERVED


No part of this document may be reproduced without written approval from Limkokwing University of Creative Technology Worldwide

Reasons for Replication

• Data are replicated to increase the reliability of a system.
• Replication for performance:
  ▪ Scaling in numbers – increasing the number of processes that need to access a single server.
  ▪ Scaling in geographical area – placing a copy of the data in the proximity of the processes using it.
  ▪ Caveat / cautions: the gain in performance must be weighed against the cost of the increased bandwidth needed to maintain the replicas.
• Fault tolerance – correctness concerns about the freshness of the data supplied to the client and about the effects of clients' operations on the data, e.g. air traffic control (correct data is needed on a short timescale).

Figure 6.1
A basic architectural model for the management of replicated data

[Figure: clients (C) exchange requests and replies with front ends (FE), which communicate with the service's replica managers (RM).]

Instructor's Guide for Coulouris, Dollimore and Kindberg, Distributed Systems: Concepts and Design, Edn. 4, © Pearson Education 2007
Figure 6.2
Services provided for process groups

[Figure: a process group supported by multicast communication and group membership management (Join, Leave, Fail). Group address expansion lets a process outside the group send to the group without knowing the group's membership.]

Instructor's Guide for Coulouris, Dollimore and Kindberg, Distributed Systems: Concepts and Design, Edn. 4, © Pearson Education 2007

Role of Group Membership Service

• Providing an interface for group membership changes – provides operations to create and destroy process groups and to add or withdraw a process.
• Implementing a failure detector – monitors group members for crashes and for unreachability caused by communication failure.
• Notifying members of group membership changes – notifies members when a process is added or withdrawn.
• Performing group address expansion – a sender supplies the group identifier rather than a list of the processes in the group. (A minimal interface sketch follows below.)
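The following is a minimal sketch, not from the slides, of such a membership interface; all names (ProcessGroup, join, leave, fail, expand) are illustrative, and notification is modelled as local callbacks rather than real messages.

```python
from dataclasses import dataclass, field

@dataclass
class ProcessGroup:
    group_id: str
    members: set = field(default_factory=set)      # current membership view
    observers: list = field(default_factory=list)  # callbacks notified on view changes

    def join(self, pid: str) -> None:
        self.members.add(pid)
        self._notify("join", pid)

    def leave(self, pid: str) -> None:
        self.members.discard(pid)
        self._notify("leave", pid)

    def fail(self, pid: str) -> None:
        # invoked by a failure detector that suspects pid has crashed
        self.members.discard(pid)
        self._notify("fail", pid)

    def expand(self) -> set:
        # group address expansion: resolve the group id to its member list
        return set(self.members)

    def _notify(self, event: str, pid: str) -> None:
        for cb in self.observers:
            cb(event, pid, set(self.members))
```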


Passive (Primary-Backup) Replication

• There is a single primary replica manager and one or more secondary replica managers – the 'backups' or 'slaves'.
• The sequence of events is as follows (a sketch appears below):
1. Request – The front end issues the request, containing a unique identifier, to the primary replica manager.
2. Coordination – The primary takes the request and checks the unique identifier; if the request has already been executed, it simply re-sends the response.
3. Execution – The primary executes the request and stores the response.
4. Agreement – If the request is an update, the primary sends the updated state, the response and the unique identifier to all the backups. The backups send acknowledgements.
5. Response – The primary responds to the front end, which passes the response back to the client.
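A minimal sketch of this flow, not from the slides: the "network" is direct method calls, failures and failover are ignored, and all names are illustrative.

```python
class Backup:
    def __init__(self):
        self.state = {}

    def apply(self, update_id, key, value):
        self.state[key] = value                    # agreement: install the primary's state
        return "ack"

class Primary:
    def __init__(self, backups):
        self.backups = backups
        self.state = {}
        self.executed = {}                         # update_id -> stored response

    def handle(self, update_id, key, value):
        if update_id in self.executed:             # coordination: duplicate request?
            return self.executed[update_id]        # re-send the stored response
        self.state[key] = value                    # execution
        acks = [b.apply(update_id, key, value) for b in self.backups]
        assert all(a == "ack" for a in acks)       # agreement: backups acknowledged
        response = f"OK {key}={value}"
        self.executed[update_id] = response
        return response                            # response, relayed by the front end

primary = Primary([Backup(), Backup()])
print(primary.handle("req-1", "x", 42))            # OK x=42
print(primary.handle("req-1", "x", 42))            # duplicate: same stored response
```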

Figure 6.4
The passive (primary-backup) model for fault tolerance

[Figure: clients (C) connect through front ends (FE) to a single primary replica manager (RM), behind which sit two backup RMs.]

At any one time there is only a single primary replica manager and one or more secondary replica managers. The primary replica manager executes the operations and sends copies of the updated data to the backups.

Instructor's Guide for Coulouris, Dollimore and Kindberg, Distributed Systems: Concepts and Design, Edn. 4, © Pearson Education 2007
Figure 6.5
Active replication

[Figure: two clients (C), each with a front end (FE), multicast to a group of three replica managers (RM).]

Front ends multicast their requests to the group of replica managers; all the replica managers process each request independently but identically, and reply.

Instructor's Guide for Coulouris, Dollimore and Kindberg, Distributed Systems: Concepts and Design, Edn. 4, © Pearson Education 2007

Active Replication
• The sequence of events is as follows (a sketch appears below):
1. Request – The front end issues the request, containing a unique identifier, and multicasts it to the group of replica managers. It does not issue the next request until it receives a response.
2. Coordination – The group communication system delivers the request to every correct replica manager in the same order.
3. Execution – Every replica manager executes the request; correct replica managers all process it identically. The response contains the client's unique request identifier.
4. Agreement – No agreement phase is needed, because of the multicast delivery semantics.
5. Response – Each replica manager sends its response to the front end; the front end passes the first response to arrive back to the client and discards the rest.
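A minimal sketch of this flow, not from the slides: the totally ordered multicast that a real system needs is replaced by a simple loop, and all names are illustrative.

```python
class ReplicaManager:
    def __init__(self):
        self.state = {}

    def execute(self, request_id, key, value):
        self.state[key] = value                    # every correct RM executes identically
        return request_id, f"OK {key}={value}"     # reply carries the request identifier

class FrontEnd:
    def __init__(self, group):
        self.group = group

    def invoke(self, request_id, key, value):
        # "multicast" the request to every replica manager in the group
        replies = [rm.execute(request_id, key, value) for rm in self.group]
        # pass back the first matching reply; the rest are discarded
        for rid, reply in replies:
            if rid == request_id:
                return reply

fe = FrontEnd([ReplicaManager(), ReplicaManager(), ReplicaManager()])
print(fe.invoke("req-7", "balance", 100))          # OK balance=100
```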

Figure 6.6
Query and update operations in a gossip service

[Figure: front ends (FE) send Query(q, prev) and receive (Val, new); they send Update(u, prev) and receive an Update id. The replica managers (RM) exchange gossip messages among themselves.]

The service provides two basic types of operation: queries and updates. Front ends send queries and updates to any replica manager they choose – any that is available and can provide a reasonable response time. It is possible for clients to obtain stale data from the replica managers.

Instructor's Guide for Coulouris, Dollimore and Kindberg, Distributed Systems: Concepts and Design, Edn. 4, © Pearson Education 2007

Gossip Architecture
• The sequence of events is as follows (a sketch of the agreement step appears below):
1. Request – The front end normally issues the request to a single replica manager at a time; if that replica manager fails or is unreachable, it may try another.
2. Update response – If the request is an update, the replica manager replies as soon as it has received the update.
3. Coordination – A replica manager that receives a request does not process it until it can apply the request according to the required ordering constraints.
4. Execution – The replica manager executes the request.
5. Query response – If the request is a query, the replica manager replies at this point.
6. Agreement – The replica managers update one another by exchanging gossip messages, which contain the most recent updates they have received.
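A toy sketch of step 6 only, not the full gossip protocol: the ordering constraints of step 3 are ignored, updates are assumed commutative, and all names are illustrative.

```python
class GossipRM:
    def __init__(self):
        self.log = {}                    # update_id -> (key, value): updates accepted so far
        self.state = {}

    def update(self, update_id, key, value):
        # update response: accept, apply and acknowledge immediately
        if update_id not in self.log:
            self.log[update_id] = (key, value)
            self.state[key] = value
        return update_id

    def gossip_to(self, peer):
        # agreement: send the peer any updates it has not yet seen
        for uid, (key, value) in self.log.items():
            if uid not in peer.log:
                peer.update(uid, key, value)

a, b = GossipRM(), GossipRM()
a.update("u1", "x", 1)
b.update("u2", "y", 2)
a.gossip_to(b); b.gossip_to(a)
assert a.state == b.state == {"x": 1, "y": 2}      # replicas converge
```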

Figure 6.7
Front ends propagate their timestamps whenever clients communicate directly

[Figure: front ends (FE) exchange vector timestamps with the gossiping replica managers (RM), and with each other when their clients communicate directly.]

Each front end keeps a vector timestamp that reflects the version of the latest data values accessed by that front end. When a replica manager returns a value as the result of a query operation, it supplies a new vector timestamp, since the replicas may have been updated since the last operation.

Instructor's Guide for Coulouris, Dollimore and Kindberg, Distributed Systems: Concepts and Design, Edn. 4, © Pearson Education 2007
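A brief sketch of this bookkeeping: the merge rule (element-wise maximum) is the standard one for vector timestamps, while the class layout and names are illustrative.

```python
def merge(vt_a, vt_b):
    # element-wise maximum: the result reflects everything either side has seen
    return [max(x, y) for x, y in zip(vt_a, vt_b)]

class FrontEndTimestamp:
    def __init__(self, n_replicas):
        self.prev = [0] * n_replicas       # version of the latest data values accessed

    def on_reply(self, new_ts):
        # fold the timestamp returned by a query (or by another front end) into ours
        self.prev = merge(self.prev, new_ts)

fe = FrontEndTimestamp(3)
fe.on_reply([2, 0, 1])
fe.on_reply([1, 3, 0])
print(fe.prev)                             # [2, 3, 1]
```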
Figure 6.8
A gossip replica manager, showing its main state components

[Figure: a replica manager holds a value, a value timestamp, a replica (update) log, a replica timestamp, an executed operation table and a timestamp table. It exchanges gossip messages with other replica managers and receives operations from front ends.]

Reasons for keeping the update log:
- An update may be unstable (held back and not yet processed).
- The replica manager needs confirmation that an update has propagated to the other replica managers.

The executed operation table prevents an update from being applied twice when it arrives again from a front end or from another replica manager; the replica manager checks the update identifiers.

Prev – the latest values accessed by the front end.
Update – the unique update id from the replica manager.

Instructor's Guide for Coulouris, Dollimore and Kindberg, Distributed Systems: Concepts and Design, Edn. 4, © Pearson Education 2007
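As a memory aid, the same state components written out as a data structure; the field names follow the figure, while the types and defaults are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class GossipReplicaState:
    value: dict = field(default_factory=dict)        # the application state itself
    value_ts: list = field(default_factory=list)     # updates reflected in value
    update_log: list = field(default_factory=list)   # accepted (possibly unstable) updates
    replica_ts: list = field(default_factory=list)   # updates accepted into the log
    executed_ops: set = field(default_factory=set)   # update ids already applied
    ts_table: dict = field(default_factory=dict)     # last replica_ts received from each peer RM
```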
Figure 6.9
Committed and tentative updates in Bayou – data replication for high availability with weaker guarantees

Committed: c0 c1 c2 … cN    Tentative: t0 t1 t2 … ti ti+1 …

Tentative update ti becomes the next committed update and is inserted after the last committed update cN.
- Updates are marked as tentative when they are first applied to the database; while updates are tentative, the system may undo and reapply them.
- Bayou arranges the tentative updates into a canonical order and marks them as committed. (A sketch of this appears below.)
- Once committed, they remain applied in their allotted order.

Instructor's Guide for Coulouris, Dollimore and Kindberg, Distributed Systems: Concepts and Design, Edn. 4, © Pearson Education 2007
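A toy sketch of the commit step only, assuming updates are plain functions over a dictionary; Bayou's dependency checks and merge procedures are omitted, and all names are illustrative.

```python
class TentativeStore:
    def __init__(self):
        self.snapshot = {}       # state reflecting committed updates only
        self.committed = []      # updates in their final, allotted order
        self.tentative = []      # updates that may still be undone and reapplied

    def apply_tentative(self, update):
        self.tentative.append(update)

    def commit_next(self, update):
        # the tentative update becomes the next committed update, after cN
        self.tentative.remove(update)
        self.committed.append(update)
        update(self.snapshot)    # permanently applied in its allotted order

    def current_state(self):
        # "undo/reapply": recompute by replaying tentative updates on the snapshot
        state = dict(self.snapshot)
        for update in self.tentative:
            update(state)
        return state
```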
Figure 6.10
Transactions on replicated data

[Figure: transaction T (getBalance(A)) and transaction U (deposit(B,3)) run at two clients' front ends; copies of account A are held by three replica managers and copies of B by three others.]

Different replication schemes have different rules on how many replica managers are required to carry out an operation. E.g. a read request can be performed by a single replica manager, whereas a write request must be performed by all the replica managers in the group.

Instructor's Guide for Coulouris, Dollimore and Kindberg, Distributed Systems: Concepts and Design, Edn. 4, © Pearson Education 2007
Figure 6.11
Available copies

[Figure: copies of A are held by replica managers X and Y; copies of B are held by M, N and P. Transaction T performs getBalance(A) and deposit(B,3); transaction U performs getBalance(B) and deposit(A,3).]

The getBalance operation of transaction T is performed by X, whereas its deposit operation is performed by M, N and P.
Concurrency control at each replica manager affects the operations performed locally. E.g. at X, transaction T has read A, and therefore transaction U is not allowed to update A with its deposit operation until transaction T has completed.

Instructor's Guide for Coulouris, Dollimore and Kindberg, Distributed Systems: Concepts and Design, Edn. 4, © Pearson Education 2007

Quorum-Based Protocols
• Basic idea – require clients to request and acquire the permission of multiple servers before either reading or writing a replicated data item.

• Gifford's scheme chooses a read quorum of NR servers and a write quorum of NW servers, subject to the following constraints (a checker sketch appears below):

1. NR + NW > N        (N = number of replicas)
2. NW > N/2
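A small sketch that checks Gifford's constraints and verifies, by brute force on a small N, the property they guarantee: every read quorum intersects every write quorum. Function names are illustrative.

```python
from itertools import combinations

def valid_quorums(n, n_read, n_write):
    # constraint 1 prevents read-write conflicts (read and write quorums intersect);
    # constraint 2 prevents write-write conflicts (no two disjoint write quorums)
    return n_read + n_write > n and n_write > n / 2

def read_write_quorums_intersect(n, n_read, n_write):
    servers = range(n)
    return all(set(r) & set(w)
               for r in combinations(servers, n_read)
               for w in combinations(servers, n_write))

assert valid_quorums(12, 3, 10)               # the "correct choice" example below
assert read_write_quorums_intersect(4, 2, 3)  # exhaustive check on a small system
assert valid_quorums(12, 1, 12)               # ROWA: read one, write all
assert not valid_quorums(12, 7, 6)            # NW <= N/2: write-write conflicts possible
```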


A correct choice of read and write set

1. NR + NW > N
2. NW > N/2
(N = number of replicas)

The most recent write quorum consists of 10 servers (C–L). Any subsequent read quorum of 3 servers must include at least one server from this set, so every read sees the latest write.

Figure 6-22. A correct choice of read and write set.


A choice that may lead to write-write conflicts

1. NR + NW > N
2. NW > N/2
(N = number of replicas)

A write-write conflict may occur when NW ≤ N/2: two clients can each assemble a write quorum with no server in common, so both updates are accepted without the conflict being detected.

Figure 6-22. A choice that may lead to write-write conflicts.


A correct choice, known as ROWA (read one, write all)

1. NR + NW > N
2. NW > N/2
(N = number of replicas)

With NR = 1 and NW = N, it is possible to read a replicated file by finding any one copy and using it, whereas a write must update ALL copies.

Figure 6-22. A correct choice, known as ROWA (read one, write all).

Figure 6.12
Network partition

[Figure: two clients' front ends issue withdraw(B,4) and deposit(B,3) to replica managers holding copies of B on opposite sides of a network partition.]

A network partition separates a group of replica managers into two or more subgroups. Only members of the same subgroup can communicate with each other; to the rest, a server appears crashed, down or unreachable. E.g. the replica manager receiving the deposit request cannot send it to the replica manager receiving the withdraw request.
Replication schemes are designed on the assumption that partitions will eventually be repaired, and must ensure that no inconsistencies occur when a partition is repaired.

Instructor's Guide for Coulouris, Dollimore and Kindberg, Distributed Systems: Concepts and Design, Edn. 4, © Pearson Education 2007
Figure 6.13
Two network partitions

[Figure: replica managers X, V, Y and Z; a network partition separates X and V from Y and Z while transaction T runs at V.]

"Network partition" refers to the barrier that divides the replica managers into several parts.

E.g. transaction T starts by performing its read at V, at a time when V is still in contact with X, Y and Z.
- Now suppose the network partition shown above occurs, placing X and V in one part and Y and Z in another.
- When transaction T attempts to write, V will notice that it cannot contact Y and Z.
- When a replica manager cannot contact managers it could previously contact, it keeps trying until it can create a virtual partition (until one or both of them reply).

Instructor's Guide for Coulouris, Dollimore and Kindberg, Distributed Systems: Concepts and Design, Edn. 4, © Pearson Education 2007
Figure 6.14
Virtual partition

[Figure: a virtual partition comprising replica managers X, V and Y forms despite the network partition separating X and V from Y and Z.]

"Virtual partition" refers to a part formed from the replica managers themselves. A virtual partition has a creation time, a set of potential members and a set of actual members.

E.g. V keeps trying to contact Y and Z until one of them replies, e.g. when Y becomes accessible. The group of replica managers V, X and Y then comprises a virtual partition, because they are sufficient to form read and write quora.

When a new virtual partition is created during a transaction that has already performed an operation at one of the replica managers (e.g. transaction T), the transaction must be aborted and the replicas within the new partition must be brought up to date.

Instructor's Guide for Coulouris, Dollimore and Kindberg, Distributed Systems: Concepts and Design, Edn. 4, © Pearson Education 2007
Figure 6.15
Two overlapping virtual partitions

[Figure: virtual partition V1 contains Y, X and V; virtual partition V2 contains X, V and Z; the two partitions overlap at X and V.]

Case – several replica managers may attempt to create a new virtual partition simultaneously: Y and Z each keep attempting to contact the others. The partition is only partially repaired, so that Y cannot communicate with Z, but the two groups {V, X, Y} and {V, X, Z} can each be formed.

The problem with overlapping virtual partitions: a read operation of a transaction in one virtual partition might be applied at replica manager Y, where its read lock will not conflict with the write locks set by the write operations of a transaction in the other virtual partition, so conflicting transactions can go undetected.

Instructor's Guide for Coulouris, Dollimore and Kindberg, Distributed Systems: Concepts and Design, Edn. 4, © Pearson Education 2007

References

These slides are taken from Tanenbaum & van Steen, Distributed Systems: Principles and Paradigms, 2/e, © 2007 Prentice-Hall, Inc. All rights reserved. ISBN 0-13-239227-5.


KEY POINTS
• Sub Point #1 – Reasons for replication
• Sub Point #2 – Types of replication
• Sub Point #3 – Consistency
• Sub Point #4 – Network partition vs. virtual partition


Questions: Replication
Three computers together provide a replicated service. The manufacturers claim that each computer has a mean time between failures of five days; a failure typically takes four hours to fix. What is the availability of the replicated service?
Answer:

Formulas to use (n = number of replicas):

p = Probability(a single server is unreachable/failed)
  = hours of fixing / (hours between failures + hours of fixing)

Availability of the replicated service = 1 - p^n


Cont..

• Availability = 1 - Probability(all n managers failed or unreachable) = 1 - p^n

p = Probability(a single server is unreachable/failed)
  = hours of fixing / (hours between failures + hours of fixing)

The probability that an individual computer is down is p = 4/(5 × 24 + 4) ≈ 0.03.

Assuming failure-independence of the machines, the availability is therefore (a check in code appears below):

= 1 - 0.03^3
= 0.999973
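The same arithmetic as a short, runnable check; a minimal sketch.

```python
def availability(mtbf_hours: float, repair_hours: float, n: int) -> float:
    p = repair_hours / (mtbf_hours + repair_hours)   # P(one server is down)
    return 1 - p ** n                                # P(not all n are down)

# With the exact p = 4/124 this prints ~0.9999664; the slide rounds p to 0.03
# before cubing, which yields 0.999973.
print(availability(5 * 24, 4, 3))
```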


End of Lecture
