BITS Pilani: Distributed Computing Global State & Snapshot Recording Algorithms

Distributed Computing
Global State & Snapshot Recording

Algorithms
Dr. Barsha Mitra
BITS Pilani CSIS Dept, BITS Pilani, Hyderabad Campus
Hyderabad Campus
Introduction
•recording the global state of a distributed system on-the-fly is
required for analyzing, testing, or verifying properties
associated with distributed executions
•problems in recording global state
•lack of a globally shared memory
•lack of a global clock
•message transmission is asynchronous
•message transfer delays are finite but unpredictable
•problem is non-trivial
Course No.- SS ZG526, Course Title - Distributed Computing 2 BITS Pilani, Hyderabad Campus
Introduction
•every distributed system component has a local state
•state of a process is characterized by
•state of its local memory
• history of process activity
•channel state is characterized by the set of messages sent
along the channel less the set of messages received along the
channel
•global state of a distributed system is a collection of the local
states of its components
Introduction
•benefit of shared memory - up-to-date state of the entire system is
available to the processes sharing the memory
•absence of shared memory requires ways of getting a coherent and
complete view of the system based on the local states of individual
processes
•meaningful global snapshot can be obtained
•if the components of the distributed system record their local
states at the same time
• requires local clocks at processes to be perfectly synchronized
•existence of global system clock that could be instantaneously
read by the processes
Introduction
•technologically infeasible to have perfectly synchronized clocks at
various sites
•clocks are bound to drift
•reading time from a single common clock maintained at one
process will not work
•indeterminate transmission delays during read operation causes
processes to identify various physical instants as the same time
•collection of local state observations will be made at different times
•may not be meaningful
Introduction
System Model
•system consists of a collection of n processes, p1, p2,…, pn connected by

channels
•no globally shared memory
•processes communicate solely by passing messages
•no physical global clock in the system
•message send and receive are asynchronous
•message delivery is reliable, occurs within finite time but has arbitrary
time delay
System Model
•system can be
•represented as a directed graph
•vertices represent processes
•edges represent unidirectional communication channels
•both processes & channels have states
•process state consists of contents of
•processor registers
•stacks
•local memory
System Model
•process state is highly dependent on the local context of the

distributed application
•Cij: channel from process pi to process pj
•SCij:
•state of channel Cij
•consists of intransit messages
• 3 types of events
• internal events
• message send events
• message receive events
System Model
• send(mij): send event of message mij from pi to pj

• rec(mij): receive event of message mij from pi to pj
• occurrence of events
• changes the states of respective processes and channels
• causes transitions in global system state
• internal event: changes the state of the process at which it occurs
• send event changes:
• state of the process that sends the message
• state of the channel on which the message is sent
• events at a process are linearly ordered by their order of occurrence
System Model
• receive event:
• changes state of the receiving process
• state of the channel on which the message is received
• LSi: state of process pi
• at an instant t, LSi: state of pi as a result of the sequence of events
executed by pi up to t
• an event e ∈ LSi iff e belongs to the sequence of events that have
taken pi to LSi
• e  LSi iff e does not belong to the sequence of events that have taken
pi to LSi
System Model
• channel state depends on the local states of the processes on which
it is incident
• for a channel Cij , intransit messages are:
transit(LSi, LSj) = {mij | send(mij) ∈ LSi  rec(mij)  LSj}
• if a snapshot recording algorithm records the states of pi and pj as
LSi and LSj , respectively,
• then it must record the state of channel Cij as transit(LSi, LSj)
System Model
• several models of communication among processes exist
• different snapshot recording algorithms have assumed
different models of communication
• FIFO model:
• each channel acts as a first-in first-out message queue
• message ordering is preserved by a channel
• non-FIFO model:
• channel acts like a set
• sender process adds messages to the channel in a random
order
• receiver process removes messages from the channel in a
random order
A Consistent Global State
• global state GS is a consistent global stateiff it satisfies the
following two conditions:
• C1: send(mij) ∈ LSi ⇒ mij ∈ SCij⊕ rec(mij) ∈ LSj (⊕: XOR)
• C2: send(mij) LSi ⇒ mijSCij ∧ rec(mij)  LSj
A Consistent Global State
•Condition C1 states the law of conservation of messages:
•every message mij that is recorded as sent in the local state
of sender pi must be captured
•in the state of the channel Cij
•or in the collected local state of the receiver pj
•Condition C2 states that:
•in the collected global state, for every effect, its cause
must be present
•if a message mij is not recorded as sent in the local state of
pi, then it must neither be present in the state of the
channel Cij nor in the collected local state of the receiver pj
Issues in recording a Global State
I1: How to distinguish between the messages to be recorded in the
snapshot (either in a channel state or a process state) from those
not to be recorded ?
I2: How to determine the instant when a process takes its snapshot?
Snapshot Recording Algorithms
•2 types of messages:
•computation messages - exchanged by the underlying
application
•control messages - exchanged by the snapshot algorithm
Chandy–Lamport Algorithm
•uses a control message called marker
•after a site has recorded its snapshot, it sends a marker along all of its
outgoing channels before sending out any more messages
•channels are FIFO
•marker separates the messages in the channel into those to be
included in the snapshot (i.e., channel state or process state) from
those not to be recorded in the snapshot
•markers in a FIFO system act as delimiters
Course ID: SS ZG526, Title: Distributed Computing 18 BITS Pilani, Hyderabad Campus
•all messages that follow a marker on channel Cij have been

sent by pi after pi took its snapshot
•pj must record its snapshot no later than when it receives a

marker on channel Cij
•a process must record its snapshot no later than when it

receives a marker on any of its incoming channels
Marker sending rule for process pi
(1) Process pi records its state.
(2) For each outgoing channel C on which a marker has not been sent, pi sends
a marker along C before pi sends further messages along C.
Marker receiving rule for process pj
On receiving a marker along channel C:
if pj has not recorded its state then
Record the state of C as the empty set
Execute the “marker sending rule”
else
Record the state of C as the set of messages received along C after pj’s
state was recorded and before pj received the marker along C
Putting together recorded snapshots
•required for creating the global snapshot
•global snapshot creation at initiator:
•each process can send its local snapshot to the initiator of the
algorithm
•global snapshot creation at all processes:
•each process sends the information it records along all outgoing
channels
•each process receiving such information for the first time propagates
it along its outgoing channels
•all local snapshots get disseminated to all other processes
•all the processes can determine the global state
•algorithm can be initiated by any process by executing the
marker sending rule
•termination criterion - each process has received a marker on
all of its incoming channels
•if multiple processes initiate the algorithm concurrently
•each initiation needs to be distinguished by using unique
markers
•different initiations by a process are identified by a sequence
number
Chandy-Lamport Algorithm
•Markers shown by dashed-and-dotted arrows
•S1 initiates the algorithm just after t1
•S1 records its local state (account A=$550) and
sends a marker to S2
•marker is received by S2 after t4
•when S2 receives the marker, it records its
local state (account B=$170), state C12 as $0,
and sends a marker along C21
•when S1 receives this marker, it records the
state of C21 as $80
•$800 amount in the system is conserved in the
recorded global state
A = $550, B = $170, C12 = $0, C21 = $80
Chandy-Lamport Algorithm
•Markers shown using dotted arrows
•S1 initiates the algorithm just after t0 and
before sending the $50 for S2
•S1 records its local state (account A = $600)
and sends a marker to S2
•marker is received by S2 between t2 and t3
•when S2 receives the marker, it records its
local state (account B = $120), state of C12 as
$0, and sends a marker along C21
•when S1 receives this marker, it records the
state of C21 as $80
•$800 amount in the system is conserved in the
recorded global state
A = $600, B = $120, C12 = $0, C21 = $80
Variations of the Chandy–Lamport
Algorithm
• several variants were proposed

• variants refined and optimized the basic algorithm
• variant: Spezialetti and Kearns algorithm
• refinements
• optimizes concurrent initiation of snapshot collection
• efficiently distributes the recorded snapshot
Spezialetti–Kearns Algorithm
• 2 phases in obtaining a global snapshot:

• locally recording the snapshot at every process
• distributing the resultant global snapshot to all the
initiators
• Spezialetti and Kearns provided 2 optimizations to the
Chandy–Lamport algorithm
• first optimization
• combines snapshots concurrently initiated by multiple
processes into a single snapshot
• linked with the second optimization
• second optimization
• deals with the efficient distribution of the global snapshot
• any process needs to take only one snapshot, irrespective
of the number of concurrent initiators
• all processes are not sent the global snapshot
• assumes bi-directional channels in the system
Efficient snapshot recording
• marker carries the identifier of the initiator of the algorithm

• every process has a variable master to keep track of the initiator
of the algorithm
• when a process executes the “marker sending rule” on the
receipt of its first marker
• it records the initiator’s identifier carried in the received
marker in the master variable
• process that initiates the algorithm records its own identifier in
the master variable
• notion of region is used in the system

• region –
• encompasses all the processes whose master field contains
the identifier of the same initiator
• identified by the initiator’s identifier
• if there are multiple concurrent initiators, the system gets
partitioned into multiple regions
• if the initiator’s identifier in a marker received along a channel is

different from the value in the master variable
• a concurrent initiation of the algorithm is detected
• implies that the sender of the marker lies in a different region
• identifier of the concurrent initiator is recorded in a local
variable id-border-set
• process receiving the marker
• does not take a snapshot for this marker
• does not propagate this marker
• the algorithm efficiently handles concurrent snapshot initiations

• suppresses redundant snapshot collections
• process does not take a snapshot or propagate a snapshot
request initiated by a process if it has already taken a snapshot in
response to some other snapshot initiation
• state of the channel is recorded just as in the Chandy–Lamport

algorithm (including those that cross a border between regions)
• enables the snapshot recorded in one region to be merged with
the snapshot recorded in the adjacent region
• if markers arriving at a node contain identifiers of different
initiators
• they are considered part of the same instance of the
algorithm for the purpose of channel state recording
• snapshot recording completion criterion at a process:

• after it has received a marker along each of its channels
• after every process has recorded its snapshot
• the system is partitioned into as many regions as the number
of concurrent initiations of the algorithm
• variable id-border-set at a process contains the identifiers of the
neighboring regions
Efficient dissemination of recorded snapshot
• efficiently assembles the recorded snapshot in the following way:
• in the snapshot recording phase
• a forest of spanning trees is created in the system
• in this forest
• initiator of algorithm is the root of a spanning tree
• all processes in its region belong to its spanning tree
• if pi executed the “marker sending rule” in response to
receiving its first marker from pj
• then pj is the parent of pi in the spanning tree
• in this forest (contd. from previous slide)
• when a leaf process in the spanning tree has recorded the
states of all incoming channels
• the process sends the locally recorded state (local
snapshot, id-border-set) to its parent in the spanning
tree
• an intermediate process in a spanning tree

• receives the recorded states from all its children processes
• records the states of all incoming channels
• forwards its locally recorded state and the locally recorded
states of all its descendent processes to its parent
• initiator receives the locally recorded states of all its descendants
from its children processes
• initiator assembles the snapshot for all the processes in its region
and the channels incident on these processes
• initiator knows the identifiers of initiators in adjacent regions
using id-border-set information it receives from processes in its
region
• initiator exchanges the snapshot of its region with the initiators in
adjacent regions in rounds
• in each round
• an initiator sends to initiators in adjacent regions, any new
information obtained from the initiator in the adjacent region
during the previous round of message exchange
• round is complete when an initiator receives information, or the

blank message from all initiators of adjacent regions from which it
has not already received a blank message
• blank message signifies no new information is available
message complexity of assembling and disseminating the
snapshot = O(rn2)
r: number of concurrent initiations
n: number of processes in the system
Snapshot Algorithms for Non-FIFO
Channels
• FIFO system
• ensures that all messages sent after a marker on a channel will
be delivered after the marker
• in a non-FIFO system
• a marker cannot be used to differentiate messages into those to
be recorded in the global state from those not to be recorded in
the global state
Snapshot Algorithms for Non-FIFO
Channels
• in a non-FIFO system, for recording a consistent global snapshot
• capturing out-of-sequence messages is necessary
• how??
• piggybacking of control information on computation messages
• non-FIFO algorithm by Lai and Yang

• use message piggybacking to distinguish computation messages sent
after the marker from those sent before the marker
Lai–Yang Algorithm
• based on 2 observations on the role of a marker in a FIFO

system
• first observation - a marker ensures that condition C2 is
satisfied for LSi and LSj when the snapshots are recorded
at processes pi and pj respectively
• fulfills this role of a marker in a non-FIFO system by using
a coloring scheme on computation messages
Coloring Scheme:
• every process is initially white

• process turns red while taking a snapshot
• equivalent of the “marker sending rule” is executed when
a process turns red
• every message sent by a white process is colored white
• a white message is a message that was sent before the
sender of that message recorded its local snapshot
Coloring Scheme:
• every message sent by a red process is colored red

• a red message is a message that was sent after the
sender of that message recorded its local snapshot
• every white process takes its snapshot at its

convenience, but no later than the instant it receives a
red message
Coloring Scheme:
• when a white process receives a red message, it records

its local snapshot before processing the message
• ensures that
• no message sent by a process after recording its local
snapshot is processed by the destination process
before the destination records its local snapshot
• an explicit marker message is not required
• marker is piggybacked on computation messages using a
coloring scheme
• second observation –
• marker informs process pj of the value of {send(mij) |send(mij) ∈ LSi}
• compute the state of channel Cij as transit(LSi,LSj)
• Lai–Yang algorithm fulfills this role of the marker as follows:

• every white process records a history of all white messages sent or
received by it along each channel
• when a process turns red, it sends these histories along with its
snapshot to the initiator process that collects the global snapshot
• Lai–Yang algorithm fulfills this role of the marker as follows

(contd. from previous slide):
• initiator process evaluates transit(LSi,LSj) to compute the
state of a channel Cij as:
SCij
= {white messages sent by pi on Cij} − {white messages
received by pj on Cij}
= {mij | send(mij) ∈ LSi} − {mij | rec(mij) ∈ LSj}
•marker messages are not required
•each process has to record the entire message history on each
channel as part of the local snapshot
•space requirements of the algorithm may be large
•size of the local storage and snapshot recording can be reduced
•by storing only the messages sent and received since the
previous snapshot recording, assuming that the previous
snapshot is still available
•approach is useful in applications that require repeated snapshots
of a distributed system
Necessary and sufficient conditions for
consistent global snapshots
•local checkpoint – saved intermediate state of a process during its
execution
•global snapshot –
•set of local checkpoints one from each process
•represents a snapshot of the distributed computation execution at
some instant
•consistent global snapshot –
•no causal path between any two distinct checkpoints
•set of local states that occurred concurrently
•Cp,i - ith (i ≥ 0) checkpoint of process pp , assigned the sequence number i
•ith checkpoint interval of pp - all computation performed between (i−1)th
and ith checkpoints (and includes (i−1)th checkpoint but not ith).
•a causal path exists between checkpoints Ci,j and Cx,y if a sequence of
messages exists from Ci,j to Cx,y such that each message is sent after the
previous one in the sequence is received
•a zigzag path between two checkpoints is a causal path, however allows a
message to be sent before the previous one in the path is received, send and
receive are in the same checkpoint interval
•checkpoint C is involved in a zigzag cycle iff there is a zigzag path from C to itself
•necessary condition for consistent snapshot - absence of causal path between

checkpoints in a snapshot
•necessary and sufficient conditions for consistent snapshot - absence of zigzag path
between checkpoints in a snapshot
•a set of checkpoints S can be extended to a consistent snapshot if and only if no
checkpoint in S has a zigzag path to any other checkpoint in S
•a checkpoint can be a part of a consistent snapshot if and only if it is not involved in a
zigzag cycle
Reference
• Ajay D. Kshemkalyani, and Mukesh Singhal, Chapter 4,
“Distributed Computing: Principles, Algorithms, and Systems”,
Cambridge University Press, 2008.

BITS Pilani: Distributed Computing Global State & Snapshot Recording Algorithms

Uploaded by

Copyright:

Available Formats

You might also like

BITS Pilani: Distributed Computing Global State & Snapshot Recording Algorithms

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

BITS Pilani: Distributed Computing Global State & Snapshot Recording Algorithms

Uploaded by

Copyright:

Available Formats

Distributed Computing

Global State & Snapshot Recording

•system consists of a collection of n processes, p1, p2,…, pn connected by

•process state is highly dependent on the local context of the

• send(mij): send event of message mij from pi to pj

• C1: send(mij) ∈ LSi ⇒ mij ∈ SCij⊕ rec(mij) ∈ LSj (⊕: XOR)

• C2: send(mij) LSi ⇒ mijSCij ∧ rec(mij)  LSj

•all messages that follow a marker on channel Cij have been

•pj must record its snapshot no later than when it receives a

•a process must record its snapshot no later than when it

• several variants were proposed

• 2 phases in obtaining a global snapshot:

• marker carries the identifier of the initiator of the algorithm

• notion of region is used in the system

• if the initiator’s identifier in a marker received along a channel is

• the algorithm efficiently handles concurrent snapshot initiations

• state of the channel is recorded just as in the Chandy–Lamport

• snapshot recording completion criterion at a process:

• an intermediate process in a spanning tree

• round is complete when an initiator receives information, or the

message complexity of assembling and disseminating the

r: number of concurrent initiations

n: number of processes in the system

• non-FIFO algorithm by Lai and Yang

• based on 2 observations on the role of a marker in a FIFO

• every process is initially white

• every message sent by a red process is colored red

• every white process takes its snapshot at its

• when a white process receives a red message, it records

• Lai–Yang algorithm fulfills this role of the marker as follows:

• Lai–Yang algorithm fulfills this role of the marker as follows

•necessary condition for consistent snapshot - absence of causal path between

You might also like