
Distributed Deadlock

Definitions
• Resources:
– There are two types of resources in computer system
• Reusable resources
– They are fixed in number; they can neither be created nor 
destroyed
– To use a resource, a process must request it, hold it 
during usage (allocation) and release it on completion
– The released resources may be re‐allocated to other processes
– Example: Memory, CPU, Printer, Disk blocks …etc
• Consumable resources
– These resources will vanish once they are consumed
– Producer process can produce any number of consumable 
resources if it is not blocked
– Example: Messages, Interrupt signals, V operation in semaphore 
…etc
Type of Resource Accesses
• Shared
– In this mode, the resource can be accessed by any number 
of processes simultaneously
– Example: Read lock on data item
• Exclusive
– In this mode, the resource can be accessed by only one 
process at any point of time
– Example: Write lock on data item
– In the theory of deadlocks, mostly exclusive locks are 
considered
– A reusable resource can be accessed in exclusive or shared 
mode at a time
– Consumable resources are always accessed in exclusive mode
Resource Request Model
• Single unit resource request model
– In this model, a process is allowed to request only one unit 
of the resource at a time.  The process is blocked till that 
resource is allocated.
– Example: A transaction (process) requests the write lock 
on a data item X [write_lock(X)]
• AND  request model
– In this model, a process is allowed to request multiple 
resources simultaneously.  It is blocked till all the resources 
are available.
– Example: Consider a data item X replicated at N sites.  
A transaction requesting the write lock on X needs to request 
the lock at all N sites where X is located, and is blocked till all 
the write lock requests are granted
Resource Request Model – Contd.
• OR request model
– In this model, a process is allowed to request multiple resources 
simultaneously. However, it is blocked only till at least one resource is 
allocated.
– Example: Consider a data item X replicated at N sites.  A
transaction requesting the read lock on X requests the lock at all N sites 
where X is located. However, the transaction is blocked only till at least one 
of the read lock requests is granted.
• AND‐OR request model
– Here the request of the process is specified in the form of a predicate 
where its atoms / variables are the resources.
– Example: R1 AND (R2 OR R3)
• P out of Q request model
– Here, a process can simultaneously request Q resources and will be 
blocked till any P out of the Q resources are available.
– Note that if P = 1, the model is OR request model; if P = Q, the model 
is AND request model.
Deadlock ‐ General
• A set of processes is said to be in a deadlock state if each of them is 
waiting for resources to be released by another process in 
the set.
• Necessary condition for the deadlock:
– Mutual exclusion: ‐ Non sharable characteristic of the resources. Ex: 
Memory location
– No pre‐emption:‐ The allocated resources can’t be pre‐empted from 
the process before its release by the process
– Hold and wait: The process holding some resources and waiting for 
other resources
– Circular wait:‐ The processes are waiting for one another for resources 
in a circular fashion
• Sufficient condition for the deadlock:
– Note that the above‐mentioned conditions are not sufficient to say 
that a set of processes is in deadlock. However, once a set of 
processes is in deadlock, we can observe all of those conditions. 
Hence they are necessary conditions. 
Deadlock Handling Strategies
• Deadlock Prevention
– Idea: Resources are granted to requesting processes in such a way 
that there is no chance of deadlock (vaccination). For example, 
allocate the requested resources only if all of them are available; else 
wait for all of them. [So the hold‐and‐wait condition cannot arise.]
• Deadlock Avoidance
– Idea: Resources are granted as and when requested by the processes  
provided the resulting system state is safe.  The system state is said to be 
safe if there exists at least one execution sequence for all the 
processes such that all of them can run to completion without getting 
into a deadlock. 
• Deadlock Detection and Recovery 
– Idea: In this strategy, the resources are allocated to the processes as 
and when requested. However,  the deadlock is detected by deadlock 
detection algorithm. If deadlock is detected, the system recovers from 
it by aborting one or more deadlocked processes.
Distributed Deadlock Algorithms
Distributed Deadlock Prevention
• Basic Idea: 
1. Each process is assigned a globally unique 
timestamp using Lamport’s logical clock, process 
number and site number [i.e., <logical clock value, 
process id, site id>].
2. Every request from the process for the resource 
should accompany the process timestamp.
3. The timestamp of requesting process (for the 
resource) is compared with the one who is holding 
the resource and suitable decision is made to 
prevent the occurrence of deadlock.
Distributed Deadlock Prevention
• Algorithm for distributed deadlock prevention
– Suppose a resource R is held by P1 at some site, 
and the process P2 requests R. Let TS(P1) and 
TS(P2) be the timestamps of P1 and P2 respectively.
– Wait‐die method:
• If TS(P2) < TS(P1) then P2 waits      /* P2 is older */ 
• Else P2 is killed                                  /* P2 is younger */
[Figure: chain of processes Pn → … → P2 → P1 → R, each waiting for its predecessor]

TS(P1) > TS(P2) > … > TS(Pn)
POSSIBLE WAITING SEQUENCE FOR RESOURCES (assuming Pi is waiting for 
some resource held by Pi‐1)
Distributed Deadlock Prevention
– Note on Wait‐die method:
1. P2 waits if the resource holder (i.e., P1) is a younger process
2. P2 is killed if the resource holder is an older process
3. A killed process is restarted with the SAME timestamp; after 
some time it becomes one of the older processes and will not be killed
4. No circular wait condition can hold in this method
– The waiting sequence TS(P1) > TS(P2) > … > TS(Pn) leads to a 
circular wait only if P1 waits for some resource held by 
Pn, i.e., P1 → Pn. This is possible only if TS(P1) < TS(Pn), which 
contradicts the waiting sequence, i.e., TS(P1) > TS(Pn)
5. No preemption of the process (resource holder) in this 
method. Here, the requester (P2) either waits or 
dies.
Distributed Deadlock Prevention
– Wound‐wait method:
• If TS(P2) < TS(P1) then P1 is killed  /* P2 is older */ 
• Else P2 waits                                 /* P2 is younger */

[Figure: chain of processes Pn → … → P2 → P1, each waiting for its predecessor]

TS(P1) < TS(P2) < … < TS(Pn)

POSSIBLE WAITING SEQUENCE FOR RESOURCES (assuming Pi is 
waiting for some resource held by Pi‐1)
Distributed Deadlock Prevention
– Note on Wound‐wait method:
1. The older process never waits for a younger 
resource holder
2. P1 is killed if the resource requester is an older process
3. A killed process is restarted with the SAME timestamp; after 
some time it becomes one of the older processes and will not be killed
4. No circular wait condition can hold in this method
– The waiting sequence TS(P1) < TS(P2) < … < TS(Pn) leads to a 
circular wait only if P1 waits for some resource held by 
Pn, i.e., P1 → Pn. This is possible only if TS(P1) > TS(Pn), which 
contradicts the waiting sequence, i.e., TS(P1) < TS(Pn)
5. There is preemption of the process (resource holder) in 
this method. Here, the requester (P2) will wait or the 
resource holder (P1) will be wounded.
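To make the two rules concrete, a minimal sketch (in Python, with hypothetical timestamps) of the decision each scheme takes when P2 requests a resource held by P1 is given below; as above, a smaller timestamp means an older process.

# Sketch: wait-die vs wound-wait decisions on a resource conflict.
# Assumes smaller timestamp = older process (hypothetical helper, not from the slides).

def wait_die(ts_requester, ts_holder):
    """Requester waits only if it is older than the holder; otherwise it dies."""
    if ts_requester < ts_holder:
        return "REQUESTER_WAITS"   # older requester is allowed to wait
    return "REQUESTER_DIES"        # younger requester is killed (restarts with same timestamp)

def wound_wait(ts_requester, ts_holder):
    """Older requester wounds (preempts) the holder; younger requester waits."""
    if ts_requester < ts_holder:
        return "HOLDER_WOUNDED"    # older requester preempts the younger holder
    return "REQUESTER_WAITS"       # younger requester waits

# Example: P2 (ts=5) requests a resource held by P1 (ts=9).
print(wait_die(5, 9))    # REQUESTER_WAITS  (P2 is older)
print(wound_wait(5, 9))  # HOLDER_WOUNDED   (P1 is preempted)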
Distributed Deadlock Prevention
• Method to handle more than one process waiting 
on same resource (R):
– Method 1: 
• At most one process is allowed to wait for the resource 
and all other processes are killed. If another process P3
requests the same resource R, then Wound‐wait is 
applied between P2 and P3 to select the oldest. Then, either the 
Wound‐wait or the Wait‐die method can be used between the 
oldest waiting process and P1, the resource holder.
– Method 2:
• The waiting processes are ordered in increasing order of 
their timestamps. A new process requesting the resource 
is made to wait if it is not older than the resource holder. If 
it is older, then the Wound‐wait method is applied between the new 
process and the resource holder
Distributed Deadlock Detection and 
Recovery
• Two components in this strategy:
– Distributed Deadlock detection
– Distributed Deadlock recovery
• Distributed Deadlock detection
– Using Wait for graph (WFG)
• WFG is a directed graph (V, E), where the vertices are the 
processes and a directed edge eij indicates that 
process Pi is waiting for a resource held by process Pj.
• A process Pi may reside at any node of the DCS.
• All the resources are assumed to be single‐unit resources
Distributed Deadlock Detection
• In Single Unit Resource Request Model:
– A deadlock in this model is detected by the existence 
of a cycle in the WFG
– Note that a process can be involved in only one cycle

[Figure: WFG over processes P1–P4 with a cycle among P1, P2 and P3; hence P1, P2 
and P3 are in a deadlock state]
Distributed Deadlock Detection

• In AND Request Model:
– A deadlock in this model is detected by the existence 
of a cycle in the WFG
– Note that a process can be involved in more than one 
cycle

[Figure: WFG over P1–P6 with two cycles that share P3. P3 has requested the 
resources held by P1 and P5, and P3 is holding the resources requested by P2 and P6]
Distributed Deadlock Detection

• In OR Request Model:
– A cycle in the WFG is not a sufficient condition for 
the existence of a deadlock
– Note that a process can be involved in more than one 
cycle

[Figure: WFG over P1–P6; P1 has request edges to P2, P4 and P5, with further edges 
P2 → P3, P3 → P1, P5 → P6 and P6 → P1]
Distributed Deadlock Detection

• In OR Request Model (contd.):
– Cycles in the WFG do not imply a deadlock here. In the figure above, 
P1 has requested the resources held by P2, P4 and P5. Once P1 gets 
the resource held by P4, the request edges from P1 to P4, P1 to P5 
and P1 to P2 are removed; hence there is no longer any cycle and no 
deadlock.
Distributed Deadlock Detection
• OR Request Model
– The necessary and sufficient condition for detecting 
a deadlock is the presence of a knot.
– Knot: A set of processes S is said to be a knot, if
• ∀Pi ∈ S, 
– Dependency Set(Pi) ⊆ S and
– Dependecy Set(Pi) ≠ Φ
– Dependency Set of a process Pi (DS(Pi)) : Set of all 
processes from which Pi is expecting the unit of 
resources to be released.
– Knot implies deadlock in any resource request model
Distributed Deadlock Detection

• OR Request Model: An Illustrative example 
[Figure: the same WFG over P1–P6 as on the previous slide]
Here, DS(P1) = {P2, P4, P5},  DS(P2) = {P3} , DS(P3)={P1} , DS(P4)={}, DS(P5)= {P6}, DS(P6)={P1}

Note that S = {P1, P2, P3} is not a knot, because DS(P1) is not contained in S. If you include P4 and P5, 
then S = {P1, P2, P3, P4, P5} is again not a knot because DS(P4) is the null set. And so on.

Similar argument will follow for S= {P1, P5, P6}  to show S is not a knot.
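A small sketch of the knot test, using the dependency sets listed above; representing DS as Python sets is an illustrative assumption.

# Sketch: checking whether a set S of processes forms a knot in an OR-model WFG,
# using the dependency sets from the example above.

ds = {
    "P1": {"P2", "P4", "P5"},
    "P2": {"P3"},
    "P3": {"P1"},
    "P4": set(),
    "P5": {"P6"},
    "P6": {"P1"},
}

def is_knot(S, ds):
    """S is a knot iff every Pi in S has a non-empty dependency set contained in S."""
    return all(ds[p] and ds[p] <= S for p in S)

print(is_knot({"P1", "P2", "P3"}, ds))              # False: DS(P1) is not a subset of S
print(is_knot({"P1", "P2", "P3", "P4", "P5"}, ds))  # False: DS(P4) is the null set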
Distributed Deadlock Detection

• In AND‐OR Request Model:
– Presence of KNOT in WFG implies that the system 
is in deadlock
• In P out of Q Request Model:
– Presence of KNOT in WFG implies that the system 
is in deadlock
Requirements of Distributed Deadlock 
Detection Algorithm 
• If there is a deadlock, then the  algorithm 
should detect all such deadlocks. (i.e., 
algorithm should detect all deadlocks)

• If the algorithm says that there is a deadlock, 
then there definitely should be one. (i.e., no false 
deadlock detection by the algorithm)
Pseudo Deadlock in Distributed 
Environment
• Let P1, P2, …, Pn be a sequence of processes such that 
Pi is waiting for the release of resources held by Pi+1
(where 1 ≤ i ≤ n‐1).
• Suppose Pn releases the resource first and then requests 
the resource held by P1. For this, Pn sends a 
message (M1) to the resource controller to release the 
allocated resource for which Pn‐1 is waiting, and then 
sends a message (M2) to the resource controller to 
request the resource held by P1.
• If M2 reaches the deadlock detection algorithm before 
M1, then a false / pseudo deadlock is detected!
[Figure: chain P1 → P2 → P3 → … → Pn‐1 → Pn; Pn sends Release (M1) followed by Request (M2)]
Distributed Deadlock Detection 
Algorithms

• Centralized Approach

• Distributed Approach: Chandy‐Misra‐Haas Algorithm
Centralized Approach
• The single unit request resource model is assumed in this 
approach. So, a cycle in the WFG implies the deadlock.
• The local wait for graph (LWFG) is constructed at each site.
• The global wait for graph (GWFG) is constructed at 
coordinator site based on following criteria:
– Whenever an edge is added or deleted at LWFG, the local site 
will inform this to coordinator. Or
– Periodically the local site will send its LWFG.
• The coordinator will construct the GWFG which is the 
concatenation of LWFGs. 
• If there is a cycle, then the coordinator will detect that the 
system is in deadlock.
• But the problem is that there is a possibility of false deadlock, 
as demonstrated in the next slide.
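Before the example on the following slide, here is a minimal sketch (with hypothetical site and transaction names) of the coordinator's job: merging local WFG edge lists into a GWFG and checking it for a cycle. It illustrates only the idea, not the slides' exact procedure.

# Sketch: coordinator merges local wait-for graphs (edge lists) and checks for a cycle.
# Site names and edge format are hypothetical.

def merge_lwfgs(lwfgs):
    """Union of the edge sets reported by each site gives the global WFG."""
    gwfg = {}
    for edges in lwfgs.values():
        for src, dst in edges:
            gwfg.setdefault(src, set()).add(dst)
    return gwfg

def has_cycle(gwfg):
    """Standard DFS-based cycle detection on the directed global WFG."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {v: WHITE for v in set(gwfg) | {w for ws in gwfg.values() for w in ws}}

    def dfs(v):
        color[v] = GREY
        for w in gwfg.get(v, ()):
            if color[w] == GREY or (color[w] == WHITE and dfs(w)):
                return True
        color[v] = BLACK
        return False

    return any(color[v] == WHITE and dfs(v) for v in list(color))

lwfgs = {"S1": [("T1", "T2"), ("T2", "T3")], "S2": [("T3", "T1")]}
print(has_cycle(merge_lwfgs(lwfgs)))  # True: T1 -> T2 -> T3 -> T1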
Centralized Approach

[Figure: Local wait‐for graphs at sites S1 and S2 over transactions T1, T2, T3 and 
resources R1, R2, R3, and the global wait‐for graph at the coordinator site]

At S1:‐ M1 : T3 releases R2
At S2:‐ M2 : T3 requests R3        } M1 occurs before M2
M2 reaches the coordinator before M1

Because the coordinator applies M2 before M1, the GWFG shows a cycle that never 
existed: a False Deadlock.

Solution: Timestamp based messages; ordering at the Coordinator
Handling of Pseudo Deadlock
• Pseudo deadlocks can be handled using 
timestamps based on Lamport’s clock value.
• Every messages pertaining to LWFG from the 
local site to the coordinator carries the 
timestamp.
• If the coordinator observes a cycle due to a 
message (M) from a local site, then the 
coordinator broadcasts a query asking whether any 
site has a message with a timestamp smaller than that of M.
• The decision about the cycle is taken only after 
the receipt of all the acknowledgements from the 
local sites.
Handling of Pseudo Deadlock
• In the above example, since T3 released R2 and 
then requested R3, M1's timestamp is 
smaller than M2's.
• When the coordinator receives M2, it suspects a 
deadlock and sends a message asking whether 
anyone has a message with a timestamp smaller 
than that of M2. Site S1 then sends a positive 
acknowledgement regarding M1. The coordinator 
now reforms the GWFG applying M1 first and then 
M2; hence there is no deadlock.
Chandy – Misra – Haas Algorithm
• Here, processes are allowed to request multiple 
resources at a time, so a process may wait for two or 
more resources.
• A process either waits for resources held by its 
co‐processes (i.e., processes in the same system) or by 
processes on other machines.
• The algorithm is invoked when a process has to wait 
for a resource. 
• For this, the process generates a probe message and 
sends it to the process(es) it is waiting on for the 
resource
Chandy – Misra – Haas Algorithm
• The probe message consists of three components:
– probe originator‐id, sender‐id, receiver‐id
• When a process receives the probe message, it 
checks whether it is waiting for any process(es). If so, it 
updates the 2nd and 3rd fields of the probe message 
and forwards it to the process(es) for which it is 
waiting.
• If the probe message goes all the way round and comes 
back to the originator (i.e., probe originator‐id == 
receiver id), then the set of processes along the path of 
probe message are in deadlock.
• The probe initiator may identify itself as the victim and 
will commit suicide to break the deadlock.
Chandy – Misra – Haas Algorithm
The probe messages initiated by process 0 are shown below. An arrow from 
process i to j indicates that process i is waiting for the resource held by j.

[Figure: processes 0–8 spread across Site 0, Site 1 and Site 2; the probes (0,2,3), 
(0,4,6), (0,5,7) and (0,8,0) are forwarded along the wait‐for edges, e.g. the probe 
from process 0 is forwarded by process 2 as (0,2,3)]

Probe message (0,0,1) is initiated by process 0, since it is waiting for the resource held by 
process 1.

Once process 0 receives the probe (0,8,0), it realizes that it is in the set of processes under 
deadlock. So, it will identify itself to commit suicide to break the deadlock.
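A minimal sketch of the probe propagation, using wait-for edges read off the probe labels in the figure above; recursion stands in for message passing, and duplicate-probe suppression is omitted.

# Sketch: Chandy-Misra-Haas probe handling (simplified; wait-for edges assumed from the figure).

waiting_for = {0: [1], 1: [2], 2: [3], 3: [4, 5], 4: [6], 5: [7], 6: [8], 8: [0]}

def send_probe(originator, sender, receiver, deadlocked):
    """Deliver probe (originator, sender, receiver) to `receiver` and propagate it."""
    if receiver == originator:
        deadlocked.add(originator)        # probe returned: originator is in a deadlock cycle
        return
    for nxt in waiting_for.get(receiver, []):
        # receiver is itself blocked: update sender/receiver fields and forward
        send_probe(originator, receiver, nxt, deadlocked)

deadlocked = set()
for first in waiting_for[0]:
    send_probe(0, 0, first, deadlocked)   # process 0 initiates probe (0, 0, 1)
print(deadlocked)                         # {0}: process 0 detects that it is deadlocked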
Chandy – Misra – Haas Algorithm
• The problem:
– In the above example, it is possible that processes 0, 1, 2, 3, 4, 6 and 8 
each initiate probe messages and identify themselves as the victim 
to commit suicide.
– This leads to the unnecessary termination of many processes on the same 
deadlock path.
• Solution: 
– The process ids along the way are appended to the probe message in the 
form of a queue.
– When the probe message comes back to the originator, it looks at the 
process with the highest / lowest process id in the queue, selects it as the 
victim and sends a message to the victim to commit suicide by itself.
– Even though many processes may initiate probe messages, the 
same set of processes is identified in the cycle. Further, there is a 
single highest / lowest process id in that cycle. Hence, only one victim 
is selected.
Summary
• Generic resource request models are 
discussed
• Distributed deadlock prevention algorithms 
and distributed deadlock detection and 
recovery algorithms are outlined.
References
• Advanced Operating Systems
– M Singhal and N G Shivarathri, McGraw Hill, 
International

• Distributed Operating Systems
– A S Tanenbaum, Prentice Hall

• Distributed Algorithms
– Nancy A Lynch, Morgan Kaufman
Load Balancing in Distributed 
System
CPU Scheduling ‐ Conventional
• Issue: In multiprogramming environment (with 
single CPU), which job is scheduled to the 
processor NEXT?
• Need: To allocate the job for execution 

DIFFERENT SCHEDULING TECHNIQUES:

1. FIRST COME FIRST SERVE
2. PRIORITY BASED
3. ROUND ROBIN BASED
4. MULTI LEVEL FEEDBACK QUEUES
5. ETC.

[Figure: a queue of JOBS feeding a single CPU scheduler]
Load (Job) Scheduling
• Issue: In distributed environment, which job is 
scheduled to which distributed processor?
• Need: To allocate the job for execution

[Figure: a stream of JOBs waiting to be assigned, with a question mark over which 
distributed processor each should go to]
Load Balancing
• Issue: Redistribution of processes/task in the DCS.
• Redistribution: Movement of processes from the heavily 
loaded system to the lightly loaded system
• Need: To improve the Distributed Systems’ throughput

[Figure: processes being moved from heavily loaded nodes to lightly loaded nodes]
Job Scheduling
[Figure: a job scheduler dispatching a stream of jobs to the local queue(s) and CPU of 
SITE 1, SITE 2, …, SITE N]

Can be considered as a QUEUEING MODEL ‐ MULTI JOB MULTI QUEUES SYSTEM
Job Scheduling Policies
• Random:
– Simple and static policy
– The job scheduler will randomly allocate the job to the site i
with some probability pi, where Σpi = 1
– No site state information is used
• Cycle:
– The job scheduler allocates the job to site ((i‐1)+1) mod N, i.e., the site 
following the one (i‐1) that received the previous job, so sites are 
chosen in round‐robin order
– It is a semi‐static policy, wherein the job scheduler remembers 
the previously allocated site.
• Join the Shortest Queue (JSQ):
– The job scheduler remembers the size of local queue in each 
site.
– The job will be allocated to the queue which is shortest at the 
point of arrival of that job.
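A small sketch of the three dispatch policies above; the queue lengths, probabilities and number of sites are hypothetical.

# Sketch of the random, cyclic and join-the-shortest-queue dispatch policies.

import random

N = 4
queues = [3, 1, 4, 0]          # current local queue length at each site
probs = [0.25] * N             # random policy: site probabilities, sum to 1
last_site = 2                  # cyclic policy: site that received the previous job

def dispatch_random():
    return random.choices(range(N), weights=probs)[0]

def dispatch_cyclic():
    return (last_site + 1) % N

def dispatch_jsq():
    return min(range(N), key=lambda i: queues[i])

print(dispatch_cyclic())  # 3
print(dispatch_jsq())     # 3 (site 3 has the shortest queue)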
Job Scheduling Policies – Parameters 
of Interest
• Mean response time of jobs
• Mean delay of the jobs
• Throughput of the system

• Obviously, JSQ has an edge over the 
other two in terms of these parameters.
Load Balancing
• Basically two types:
– Sender Initiated Load Balancing
– Receiver Initiated Load Balancing
Sender Initiated Load Balancing
Components of Sender Initiated Load 
Balancing 
• Idea: The node with the higher load (sender) initiates 
the load balancing process.
• Transfer Policy
– Policy about whether to keep the task/process at that 
site or transfer to some other site (or node)
• Location Policy
– If decided to transfer, policy about where to transfer?
• Note that any load balancing algorithm should 
have these two components
Transfer Policy
• At each node, there is a queue
• If queue length of a node < τ (threshold)
– Originating task is processed in that node only
• Else
– Transfer to some other node
[Figure: the local queue feeding the CPU, with the threshold τ marked on the queue]
• In this policy each node uses only local   state  
information
Location Policy

• Random Policy

• Threshold Location Policy

• Shortest Location Policy
Random Policy
• Node status information is not used
• The destination node is selected at random and the task is 
transferred to that node
• On receipt of the task, the destination node will do the 
following:
– If its queue length is < τ (the threshold), then accept the task
– Else transfer it to some other random node
• If the number of transfers reaches some limit, Llimit, then the 
last recipient of the task has to execute that task 
irrespective of its load. This is to avoid unnecessary 
thrashing of jobs. 
Threshold location policy
• Uses node status information to some extent about 
the destination nodes.
• Selects a node at random. Then, probes that node to 
determine whether transferring the task to that node 
would place its load above the threshold.
– If not, the task is transferred and the destination node has 
to process that task regardless of its state when the task 
actually arrives.
– If so, select another node at random and probe it in the 
same manner as above.
• The algorithm continues until either a suitable destination is 
found or the number of probes reaches some limit, Tlimit. If 
the number of probes exceeds Tlimit, then the originating node 
should process the task 
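A minimal sketch of this threshold location policy, assuming a hypothetical probe() that reports whether the polled node would stay at or below the threshold after receiving one more task.

# Sketch of the sender-initiated threshold location policy described above.

import random

TAU = 3                                   # threshold
T_LIMIT = 5                               # probe limit
queue_len = {0: 5, 1: 4, 2: 2, 3: 6}      # hypothetical queue lengths

def probe(node):
    """Would the node still be at or below the threshold after one more task?"""
    return queue_len[node] + 1 <= TAU

def locate(sender, nodes):
    """Return a destination node, or the sender itself if probing fails."""
    candidates = [n for n in nodes if n != sender]
    for _ in range(T_LIMIT):
        node = random.choice(candidates)
        if probe(node):
            return node                   # destination must accept, regardless of later arrivals
    return sender                         # probe limit reached: process the task locally

print(locate(0, list(queue_len)))         # node 2 if probed within the limit, else the sender 0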
Shortest Location Policy
• Uses additional information about the status of other 
nodes to make the “best” choice of the destination node.
• In this policy, Lp nodes are chosen at 
random and each is polled to determine its queue 
length.
• The task is transferred to the node having the shortest 
queue length among those with queue length < τ
• If none exists with queue length < τ, then the next Lp
nodes are polled and the above step is repeated
• Once the number of groups polled reaches some limit, Ls, 
the originator should handle the task
Receiver Initiated Load Balancing
Components of Receiver Initiated Load 
Balancing 
• Idea: An under‐loaded node (receiver) initiates the 
load balancing process. The receiver tries to get a 
task from an overloaded node, the sender.
• Transfer Policy (threshold policy)
– The decision is based on the CPU queue length. If it 
falls below a certain threshold, τ, the node is identified 
as a receiver and tries to obtain a task from a sender
• Location Policy
– If decided to receive, policy about from where to 
receive?
Location Policy
• Threshold Location Policy
– A random node is probed to see whether it can 
become a potential sender. If so, the task is 
transferred from the polled node. Else, the process is 
repeated until a potential sender is found or number 
of tries reaches a PollLimit.
• Longest Location Policy
– A pool of nodes is selected and probed to find a 
potential sender with the longest queue length (greater 
than τ). If one is found, then a task is received from that 
sender. Else the above process is repeated with a new 
pool of nodes. 
Drawback of Receiver Initiated 
Algorithm
• Most of the tasks selected for transfer from 
the senders are preemptive ones.
– The reason is: the job scheduler always gives 
higher priority to allocating a fresh job to the 
processor compared to the existing processes at 
different stages of execution. So, by the time the 
receiver decides to pick a task, it has already 
undergone some execution.
Symmetrically Initiated Algorithm
• These are algorithms having both sender 
initiated and receiver initiated components.
• The idea is that at low system loads the sender 
initiated component is more successful in 
finding the under loaded nodes and at high 
system loads the receiver initiated component 
is more successful in finding the overloaded 
nodes.
Symmetrically Initiated Algorithm

• Above Average Algorithm

• Adaptive Algorithms
– Stable Symmetrically Initiated Algorithm
– Stable Sender Initiated Algorithm
Above Average Algorithm
• Idea: An ‘acceptable range (AR)’ of load is 
maintained. 
– The node is treated as a sender if its load > AR
– The node is treated as a receiver if its load < AR
– Else it is a balanced node.
• Transfer Policy: AR is obtained from two adaptive 
thresholds that are equidistant from the estimated 
average load of the system
– For example, if the estimated average load of the system = 
2, then lower threshold (LT) = 1 and upper threshold (UT) = 
3
– So, if the load of the node is <= LT then it is a receiver node. 
If the load of the node is >= UT, then it is a sender node. 
It is a balanced node otherwise.
Above Average Algorithm – Contd.
• Location Policy: Consists of two components
– Sender Initiated Component:
1. The node with load > AR is called a sender. The sender broadcasts 
a TOOHIGH message, sets a TOOHIGH timeout alarm and listens for 
an ACCEPT message
2. On receipt of TOOHIGH message, the receiver (whose load < AR) 
• cancels its TOOLOW timeout alarm 
• sends ACCEPT message to the node which has sent TOOHIGH message
• increments its load value
• set AWAITINGTASK timeout alarm
• if no task transfer within AWAITINGTASK period, then its load value is 
decremented 
3. On receipt of an ACCEPT message, the sender sends the task to the 
receiver. [Note that the broadcast TOOHIGH message will be 
received by many receivers, and the sender transfers the task to the 
node whose ACCEPT message arrives first].
4. On expiry of TOOHIGH timeout period, if no ACCEPT message is 
received by the sender, then sender infers that its estimated 
average system load is too low. To correct the problem, it 
broadcasts CHANGEAVERAGE message to increase the average 
estimated load at all other sites.
Above Average Algorithm – Contd.
– Receiver Initiated Component:
1. Receiver broadcasts TOOLOW message, sets TOOLOW 
timeout alarm and wait for TOOHIGH message.
2. On receipt of TOOHIGH message, perform the 
activities as in step 2 of Sender Initiated Component
3. If the TOOLOW timeout period expires, then it infers that 
its estimated average system load is too high and 
broadcasts a CHANGEAVERAGE message to decrease 
the estimated average load at all sites.
Stable Symmetrically Initiated 
Algorithm
• Idea: In this algorithm, the information gathered during 
polling is used to classify the node as SENDER, 
RECEIVER or BALANCED.
• Each node maintains a list for each of the class.
• Since this algorithm updates its lists based on what it 
learns from  (or by) probing, the probability of 
selecting the right candidate for load balancing is high.
• Unlike the Above Average algorithm, there is no broadcasting; 
hence the number of messages exchanged is smaller.
• Initially each node assumes that every other node is 
RECEIVER except itself. So, the SENDER and BALANCED 
lists are empty to start with.
Stable Symmetrically Initiated 
Algorithm – Contd.
• Transfer Policy:
– This policy is triggered when the task originates or 
departs.
– This policy uses two thresholds: UT (upper 
threshold) and LT (lower threshold).
– The node is sender, if its queue length > UT, a 
receiver, if its queue length < LT and balanced   if  
LT ≤ queue length ≤ UT
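A tiny sketch of this transfer policy; the threshold values are hypothetical.

# Sketch of the transfer policy above: classify a node by its queue length
# using the upper threshold UT and lower threshold LT.

UT, LT = 4, 1

def classify(queue_length):
    if queue_length > UT:
        return "SENDER"
    if queue_length < LT:
        return "RECEIVER"
    return "BALANCED"

print(classify(6), classify(0), classify(2))  # SENDER RECEIVER BALANCED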
Stable Symmetrically Initiated 
Algorithm – Contd.
• Location Policy: Has two components
– Sender Initiated Component:
1. When the node becomes sender, it polls the node at the head of its 
RECEIVER list. The polled node removes the sender under 
consideration from its RECEIVER list and puts it at the head of its 
SENDER list (i.e., it learns!). It also informs the sender 
whether it itself is a sender, a receiver or a balanced node.
2. On receipt of the reply the sender does the following:
• If the polled node is receiver, sender transfers the task to it and updates its 
list (putting the polled node at the head of RECEIVER or BALANCED list)
• Otherwise, updates the list (putting the polled node at the head of SENDER 
or BALANCED list) and again start polling the next node in its RECEIVER list  
3. The polling  process stops if
• The receiver is found      or
• RECEIVER list is empty    or
• Number  of polls reaches POLL‐LIMIT
4. If polling fails, the arrived task has to be processed locally. However, 
there is a chance of migration under preemptive category.
Stable Symmetrically Initiated 
Algorithm – Contd.
– Receiver Initiated Component:
1. When a node becomes a receiver, it polls the node at the head of 
its SENDER list. The polled node updates its lists (i.e., places this 
node at the head of its RECEIVER list). It also informs the receiver 
whether it itself is a sender, a receiver or a balanced node.
2. On receipt of the reply the receiver does the following:
• If the responded node is  a receiver or a balanced node, then its list is 
updated  accordingly.
• Otherwise (i.e., responded node is a sender), the task sent by it is 
received and the list is updated accordingly
3. The polling process stops if:
• The sender is found or
• No more entries in the SENDER list  or
• The number of polls reaches POLL‐LIMIT
• Note that at high load the receiver‐initiated polling succeeds, and at low load 
the sender‐initiated polling succeeds. So this improves the performance.
Stable Sender Initiated Algorithm
• In this algorithm, there is no transfer of tasks 
when a node becomes receiver. Instead, its status 
information is shared. Hence there are no 
preemptive task transfers.
• The sender initiated component is same as that 
of stable symmetrically initiated algorithm. (like 
list generation, learning the status via polling 
…etc)
• Stable sender initiated algorithm maintains  an 
array (at each node) called status vector of size = 
number of nodes in DCS.
Stable Sender Initiated Algorithm
Status vector

[Figure: status vector at node i, with one entry per node 1 … N]

‐ Entry j in the status vector of node i indicates node j's best guess (receiver / 
sender / balanced node) about node i's state
‐SENDER INITIATED COMPONENT 
‐ When the node becomes a sender, it polls the node (say j) at the head of its 
RECEIVER list
‐ The sender updates its jth entry of its status vector as sender.
‐ Likewise, the polled node (j) updates ith entry in its status vector based on 
its reply it sent to the sender node.
‐ Note that above two aspects are  additional information it learns along with 
the other things as in sender component of stable symmetrically initiated 
algorithm  
‐RECEIVER INITIATED COMPONENT 
‐ When the node becomes a receiver, it checks its status vector and informs all 
those nodes that are misinformed about its current state 
‐ The status vector at the receiver side is then updated to reflect these changes
Stable Sender Initiated Algorithm

• Advantages:
– No broadcasting of messages by the receiver 
about its status
– No preemptive transfer of jobs, since no task 
transfers under receiver initiated component
– Additional learning using status vector  reduces 
unnecessary polling
Challenges in Load Balancing Algorithms
• Scalability:
– Ability to make quicker decisions about task transfers with less 
effort
• Location transparency:
– Transfer of tasks for balancing are invisible to the user.
• Determinism:
– Correctness in the result inspite of task transfers
• Preemption:
– The transfer of a task to a node should not lead to degraded 
performance for tasks generated at that node. So, there is a 
need to preempt the transferred task when a local task arrives at the node
• Heterogeneity:
– Heterogeneity in terms of processors, operating systems and 
architecture should not be a hindrance to task transfers.
Summary
• Differences between CPU scheduling, Job 
scheduling and load balancing are discussed.
• Different load balancing algorithms are 
discussed. They are categorized as sender 
initiated, receiver initiated, symmetrically 
initiated and variations of symmetrically 
initiated algorithms.
References
• Advanced Operating Systems
– M Singhal and N G Shivarathri, McGraw Hill, 
International

• Distributed Operating Systems
– A S Tanenbaum, Prentice Hall

• Distributed Systems Concepts and Design
– G Coulouris and  J Dollimore, Addison Wesley 
Leader Election
Leader Election
Leader election is the process of designating a single process as the
organizer of some task distributed among several computers (nodes).
Before the task is begun, all network nodes are either unaware which
node will serve as the "leader" (or coordinator) of the task, or unable to
communicate with the current coordinator.
After a leader election algorithm has been run, however, each node
throughout the network recognizes a particular, unique node as the task
leader.
The network nodes communicate among themselves in order to decide
which of them will get into the "leader" state.
For that, they need some method in order to break the symmetry among
them.
For example, if each node has unique and comparable identities, then
the nodes can compare their identities, and decide that the node with the
highest identity is the leader.
Leader Election
The problem of leader election is for each node eventually to
decide whether it is a leader or not, subject to the constraint that
exactly one node decides that it is the leader.

-- An algorithm solves the leader election problem if:


States of nodes are divided into elected and not-elected states.
Once elected, it remains as elected (similarly if not elected).
In every execution, exactly one node becomes elected and the rest
determine that they are not elected.
-- A valid leader election algorithm must meet the following conditions:
Termination: the algorithm should finish within a finite time once the
leader is selected.
Uniqueness: there is exactly one node that considers itself as
leader.
Agreement: all other nodes know who the leader is.
Leader Election
An algorithm for leader election may vary in following aspects:
Communication mechanism: the nodes are either synchronous
in which processes are synchronized by a clock signal or
asynchronous where processes run at arbitrary speeds.
Process names: whether processes have a unique identity or are
indistinguishable (anonymous).
Network topology: for instance, ring, acyclic graph or complete
graph.
Size of the network: the algorithm may or may not use knowledge
of the number of nodes in the system.
Ring Network

-- A ring network is a connected-graph topology in which each node is
connected to exactly two other nodes, i.e., for a graph with n nodes, there
are exactly n edges connecting the nodes.
-- A ring can be
unidirectional, means nodes only communicate in one direction (a
node could only send messages to the left or only send messages to
the right), or
bidirectional, meaning nodes may transmit and receive messages in
both directions (a node could send messages to the left and right).
Leader Election in Rings
Models
 Synchronous or Asynchronous
 Anonymous (no unique id) or Non-anonymous (unique ids)
• A ring is said to be anonymous if every node is identical.

• There is no deterministic algorithm to elect a leader in anonymous


rings, even when the size of the network is known to the processes.
• This is due to the fact that there is no possibility of breaking symmetry
in an anonymous ring if all processes run at the same speed and their
states remain identical at any instant.
 Uniform (no knowledge of ‘n’, the number of nodes in the ring
network) or non-uniform (knows ‘n’)
Leader Election in Rings

Known Impossibility Result:


 There is no synchronous, non-uniform leader election
protocol for anonymous rings (i.e., where the processes are anonymous).
• Implies that there are no uniform algorithms as well
• Implies that there are no asynchronous algorithms as well
Election in Asynchronous Rings
Lelann-Chang-Robert’s Algorithm

– The algorithm assumes that each node has a Unique


Identification (UID) and that the nodes can arrange
themselves in a unidirectional ring with a communication
channel going from each process to the clockwise neighbour.
– The algorithm can be described as follows:
 Initially each node in the ring is marked as non-participant.
 A node that notices a lack of leader starts an election.
 It creates an election message containing its UID and then sends
this message clockwise to its neighbour.
 Every time a node sends or forwards an election message, the node
also marks itself as a participant.
Lelann-Chang-Robert’s Algorithm

– When a node receives an election message it compares the UID in


the message with its own UID.
If the UID in the election message is larger, the node unconditionally
forwards the election message in a clockwise direction.
If the UID in the election message is smaller, and the node is not yet
a participant, the node replaces the UID in the message with its own
UID, sends the updated election message in a clockwise direction.
If the UID in the election message is smaller, and the node is already
a participant (i.e., the node has already sent out an election message
with a UID at least as large as its own UID), the node discards the
election message.
If the UID in the incoming election message is the same as the UID
of the node, that node starts acting as the leader.
Lelann-Chang-Robert’s Algorithm

– When a node starts acting as the leader, it begins the second


stage of the algorithm.
The leader node marks itself as non-participant and sends an elected
message to its neighbour announcing its election and UID.
When a node receives an elected message, it marks itself as non-
participant, records the elected UID, and forwards the elected
message unchanged.
When the elected message reaches the newly elected leader, the
leader discards that message, and the election is over.

– Assuming there are no failures this algorithm will finish.


– Algorithm works for any number of processes N, and does not
require any node to know how many nodes are in the ring.
Lelann-Chang-Robert’s Algorithm
In Summary
 send own id to node on left
 if an id received from right, forward id to left node
only if received id greater than own id, else ignore
 if own id received, declares itself “leader”
works on unidirectional rings
worst-case message complexity = Θ(n²)
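A minimal round-by-round simulation of the algorithm summarized above, on a hypothetical unidirectional ring of UIDs.

# Sketch: Lelann-Chang-Roberts election simulated synchronously on a unidirectional ring.
# All nodes are treated as participants from the start, so smaller incoming UIDs are discarded.

uids = [3, 7, 2, 9, 5]           # uids[i] is the UID of the node at ring position i

def lcr(uids):
    n = len(uids)
    messages = list(uids)        # each node starts by sending its own UID clockwise
    leader = None
    while leader is None:
        nxt = [None] * n
        for i in range(n):
            msg = messages[i]
            if msg is None:
                continue
            j = (i + 1) % n      # clockwise neighbour
            if msg > uids[j]:
                nxt[j] = msg     # larger UID: forward unchanged
            elif msg == uids[j]:
                leader = uids[j] # own UID came back: node j declares itself leader
            # smaller UID: discarded
        messages = nxt
    return leader

print(lcr(uids))  # 9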
Hirschberg-Sinclair Algorithm
Algorithm
 Operates in multiple phases, requires bidirectional ring
 In the kth phase, send own id up to a distance of 2^k processes on both
sides of yourself (messages are relayed hop by hop, carrying the
id and the phase number k)
 if id received, forward if received id greater than own id,
else ignore
 last process in the chain sends a reply to originator if its
id less than received id
 replies are always forwarded
 A process goes to (k+1)th phase only if it receives a
reply from both sides in kth phase
 process receiving its own id – declare itself “leader”
Note:
Message Complexity: O(n log n) [check the
lower bound of comparison-based sorting]
Lots of other algorithms exist for rings
Lower Bound Result:
 Any comparison-based leader election algorithm in
a ring requires Ω(n log n) messages
Leader Election in Arbitrary Networks
FloodMax
Theorem-- FloodMax algorithm solves the leader-election problem in
a synchronous general network.
 synchronous, round-based
 at each round, each process sends the max. id seen so far (not
necessarily its own) to all its neighbors
 after diameter no. of rounds, if max. id seen = own id, declares itself
leader
 Time Complexity = diameter rounds
 Communication Complexity = O(d × m), where d is the diameter and
m = no. of edges
 does not extend to asynchronous model trivially
 Variations of building different types of spanning trees with no pre-
specified roots.
 Chosen root at the end is the leader (Ex., the DFS spanning tree)
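A minimal sketch of synchronous FloodMax on a hypothetical graph; node ids double as process ids.

# Sketch: every round, each process sends the maximum id seen so far to all neighbours;
# after `diameter` rounds, a node whose own id equals the maximum declares itself leader.

graph = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3, 5], 5: [4]}
diameter = 3

def floodmax(graph, diameter):
    max_seen = {v: v for v in graph}
    for _ in range(diameter):                         # synchronous rounds
        outgoing = {v: max_seen[v] for v in graph}
        for v in graph:
            for u in graph[v]:
                max_seen[u] = max(max_seen[u], outgoing[v])
    return [v for v in graph if max_seen[v] == v]     # exactly one leader

print(floodmax(graph, diameter))  # [5]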
Mutual Exclusion
Mutual Exclusion

Very well-understood in shared memory systems

Requirements:
 at most one process in critical section (safety)
 if more than one requesting process, someone
enters (liveness)
 a requesting process enters within a finite time (no
starvation)
 requests are granted in order (fairness)
Classification of Distributed Mutual Exclusion
(DME) Algorithms

Non-token based/Permission based


 Permission from all processes:
 e.g.
• Lamport, Ricart-Agarwala,
• Roucairol-Carvalho etc.
 Permission from a subset: e.g. Maekawa

Token based
 e.g. Suzuki-Kasami
Some Complexity Measures

No. of messages/critical section entry


Synchronization delay
Response time
Throughput
Lamport’s Algorithm
Every node i has a request queue qi, keeps requests
sorted by logical timestamps (total ordering enforced by
including process id in the timestamps)

To request critical section:


 send timestamped REQUEST (tsi, i) to all other
nodes
 put (tsi, i) in its own queue

On receiving a request (tsi, Pi):


 send timestamped REPLY to the requesting node Pi
 put request (tsi, Pi) in the queue
To enter critical section:
– Pi enters critical section if (tsi, Pi) is at the top of its own
queue, and
– Pi has received a message (any message) with timestamp
larger than (tsi, Pi) from ALL other nodes.

To release critical section:


– Pi removes its request from its own queue and sends a
timestamped RELEASE message to all other nodes
– On receiving a RELEASE message from Pi, a node removes Pi's 
request from its local request queue
Some points to note:

Purpose of REPLY messages from node Pi to Pj is to
ensure that Pj knows of all requests of Pi made prior to
sending the REPLY (and therefore, of any possible
request of Pi with timestamp lower than Pj's request)
Requires FIFO channels.
3(n – 1) messages per critical section invocation
Synchronization delay = max. message
transmission time
requests are granted in order of increasing
timestamps
Ricart-Agarwala Algorithm
Improvement over Lamport’s
Main Idea:
 node Pj need not send a REPLY to node Pi
 if Pj has a request with timestamp lower than the request of Pi
 (since Pi cannot enter before Pj anyway in this case)
Does not require FIFO
2(n – 1) messages per critical section invocation
Synchronization delay = max. message transmission
time
requests granted in order of increasing timestamps
To request critical section:
 send timestamped REQUEST message (tsi, Pi) to all other nodes

On receiving request (tsi, Pi) at Pj:


 send REPLY to Pi
• if Pj is neither requesting nor executing critical section or
• if Pj is requesting and Pi’s request timestamp is smaller
than Pj’s request timestamp.
 Otherwise, defer the request.

To enter critical section:


 Pi enters critical section on receiving REPLY from all nodes

To release critical section:


 send REPLY to all deferred requests
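A sketch of one node's Ricart-Agarwala bookkeeping; the send() transport and the class layout are assumptions, not part of the original algorithm description. Timestamps are (clock, node id) pairs, so the smaller tuple has priority.

# Sketch of Ricart-Agarwala at a single node (message transport is a supplied stub).

class RANode:
    def __init__(self, node_id, all_ids, send):
        self.id = node_id
        self.others = [i for i in all_ids if i != node_id]
        self.send = send                  # send(dst, msg) -- assumed transport stub
        self.clock = 0
        self.requesting = False           # also treated as "in critical section" here
        self.my_ts = None
        self.replies = set()
        self.deferred = []

    def request_cs(self):
        self.clock += 1
        self.my_ts = (self.clock, self.id)
        self.requesting = True
        self.replies = set()
        for j in self.others:
            self.send(j, ("REQUEST", self.my_ts, self.id))

    def on_request(self, ts, src):
        self.clock = max(self.clock, ts[0]) + 1
        if self.requesting and self.my_ts < ts:
            self.deferred.append(src)     # our request has priority: defer the reply
        else:
            self.send(src, ("REPLY", self.id))

    def on_reply(self, src):
        self.replies.add(src)
        if self.replies == set(self.others):
            pass                          # all replies received: enter the critical section

    def release_cs(self):
        self.requesting = False
        for src in self.deferred:         # answer every deferred request
            self.send(src, ("REPLY", self.id))
        self.deferred = []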
Roucairol-Carvalho Algorithm
--- Improvement over Ricart-Agarwala

Main idea
 once Pi has received a REPLY from Pj,
 it does not need to send a REQUEST to Pj again
 unless it sends a REPLY to Pj (in response to a
REQUEST from Pj)
 no. of messages required varies between 0 and
2(n – 1) depending on request pattern
 worst case message complexity still the same
Maekawa’s Algorithm

Permission obtained from only a subset of other


processes, called the Request Set (or Quorum)
Separate Request Set Ri for each process Pi
Requirements:
 for all i, j: Ri ∩ Rj ≠ Φ
 for all i: i Є Ri
 for all i: |Ri| = K, for some K
 any node i is contained in exactly D Request Sets,
for some D
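One well-known way to satisfy these requirements is a grid construction: arrange N = K×K nodes in a grid and let Ri be node i's row plus its column. The sketch below is only an illustration; its quorum size is about 2√N − 1, not the smaller quorums used for Maekawa's message-count figures later.

# Sketch: grid quorums for N = K*K nodes; any two quorums intersect and each
# quorum contains its own node (simple, non-optimal construction).

import math

def grid_quorums(n):
    k = math.isqrt(n)
    assert k * k == n, "sketch assumes N is a perfect square"
    quorums = {}
    for i in range(n):
        row, col = divmod(i, k)
        row_members = {row * k + c for c in range(k)}
        col_members = {r * k + col for r in range(k)}
        quorums[i] = row_members | col_members    # includes i itself
    return quorums

q = grid_quorums(9)
print(q[0])             # {0, 1, 2, 3, 6}
print(q[4] & q[8])      # non-empty: any two quorums intersect (Ri ∩ Rj ≠ Φ)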
A simple version

To request critical section:


 Pi sends REQUEST message to all processes in Ri

On receiving a REQUEST message:


 send a REPLY message if no REPLY message has been
sent since the last RELEASE message is received.
 Update status to indicate that a REPLY has been sent.
Otherwise, queue up the REQUEST

To enter critical section:


 Pi enters critical section after receiving REPLY from all
nodes in Ri
To release critical section:
 send RELEASE message to all nodes in Ri
 On receiving a RELEASE message, send REPLY
to next node in queue and delete the node from
the queue.
 If queue is empty, update status to indicate no
REPLY message has been sent.
Message Complexity: 3*sqrt(N)

Synchronization delay =
2 *(max message transmission time)

Major problem: DEADLOCK possible

Need three more types of messages (FAILED,


INQUIRE, YIELD) to handle deadlock.

Message complexity can be 5*sqrt(N)


Token based Algorithms

Single token circulates, enter CS when token is


present
No FIFO required
Mutual exclusion obvious
Algorithms differ in how to find and get the token
Uses sequence numbers rather than timestamps to
differentiate between old and current requests
Suzuki Kasami Algorithm

Broadcast a request for the token


Process with the token sends it to the requestor if it
does not need it

Issues:

 Current vs. outdated requests


 determining sites with pending requests
 deciding which site to give the token to
The token:
 Queue (FIFO) Q of requesting processes
 LN[1..n] : LN[j] is the sequence number of the request that node j executed most
recently

The request message:


 REQUEST(i, k): request message from node i for its k th
critical section execution

Other data structures


 RNi[1..n] for each node i, where RNi[j] is the largest
sequence number received so far by i in a REQUEST
message from j.
To request critical section:
 If i does not have the token, increment RNi[i] and send
REQUEST(i, RNi[i]) to all nodes
 if i has token already, enter critical section if the token is
idle (no pending requests), else follow rule to release
critical section

On receiving REQUEST(i, sn):


 set RNj[i] = max(RNj[i], sn)
 if j has the token and the token is idle, send it to i if RNj[i] =
LN[i] + 1.
 If token is not idle, follow rule to release critical section
To enter critical section:
 enter CS if token is present

To release critical section:


 set LN[i] = RNi[i]
 For every node j which is not in Q (in the token), add
node j to Q if RNi[j] = LN[j] + 1
 If Q is non empty after the above, delete first node
from Q and send the token to that node
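A sketch of the Suzuki-Kasami bookkeeping at one node, following the rules above; broadcast() and send_token() are assumed transport stubs, and the critical section itself is omitted.

# Sketch of Suzuki-Kasami at a single node.

from collections import deque

class SKNode:
    def __init__(self, i, n, broadcast, send_token):
        self.i, self.n = i, n
        self.RN = [0] * n                     # highest request sequence number seen per node
        self.token = None                     # {"LN": [...], "Q": deque()} when held here
        self.in_cs = False
        self.broadcast, self.send_token = broadcast, send_token

    def request_cs(self):
        if self.token is not None:
            self.in_cs = True                 # already holding the idle token: enter CS
            return True
        self.RN[self.i] += 1
        self.broadcast(("REQUEST", self.i, self.RN[self.i]))
        return False                          # wait for the token to arrive

    def on_request(self, j, sn):
        self.RN[j] = max(self.RN[j], sn)
        if self.token is not None and not self.in_cs and self.RN[j] == self.token["LN"][j] + 1:
            t, self.token = self.token, None  # token idle and j has an outstanding request
            self.send_token(j, t)

    def release_cs(self):
        self.in_cs = False
        t = self.token
        t["LN"][self.i] = self.RN[self.i]
        for j in range(self.n):               # append every node with a pending request
            if j not in t["Q"] and self.RN[j] == t["LN"][j] + 1:
                t["Q"].append(j)
        if t["Q"]:
            j = t["Q"].popleft()
            self.token = None
            self.send_token(j, t)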
Points to note:

 No. of messages: 0 if node holds the token


already, n otherwise

 Synchronization delay: 0 (node has the token) or


max. message delay (token is elsewhere)

 No starvation
Raymond’s Algorithm

Forms a directed tree (logical) with the token-holder


as root

Each node has variable “Holder” that points to its


parent on the path to the root. Root’s Holder variable
points to itself

Each node i has a FIFO request queue Q i


To request critical section:
 Send REQUEST to parent on the tree, provided i
does not hold the token currently and Q i is empty.
Then place request in Qi

When a non-root node j receives a request from i


 place request in Qj
 send REQUEST to parent if no previous
REQUEST sent
When the root receives a REQUEST:
 send the token to the requesting node
 set Holder variable to point to that node
When a node receives the token:
 delete first entry from the queue
 send token to that node
 set Holder variable to point to that node
 if queue is non-empty, send a REQUEST message to the
parent (node pointed at by Holder variable)
To execute critical section:
 enter if token is received and own entry is at the top of the
queue; delete the entry from the queue

To release critical section


 if queue is non-empty, delete first entry from the queue,
send token to that node and make Holder variable point to
that node
 If queue is still non-empty, send a REQUEST message to
the parent (node pointed at by Holder variable)
Points to note:

Avg. message complexity O(log n)

Sync. delay (T log n)/2, where T = max.


message delay
Capturing Global State
Global snapshot is global state
 Each distributed application has a number of processes (leaders)
running on a number of physical servers
• These processes communicate with each other via channels (text
messaging)
• A snapshot captures the local states of each process (e.g., program
variables) along with the state of each communication channel

Why do we need snapshots?


 Checkpointing: restart if the application fails

• Collecting garbage: remove objects that don’t have any references


• Detecting deadlocks: can examine the current application state
• Other debugging: a little easier to work with than printf...
Causal consistency
 Related to the Lamport clock partial ordering
• An event is presnapshot if it occurs before the local snapshot on a
process
• Postsnapshot if afterwards
• If event A happens causally before event B, and B is presnapshot, then A
is too
Proof
 If A and B happen on the same process, then this is trivially true
• Consider when A is the send and B is the corresponding receive event on
processes p and q, respectively
– Since B is presnapshot, q can’t have received a marker and p can’t have sent
a marker
– A must also happen presnapshot
• Similar logic for A happening postsnapshot
Global State Collection

Applications:
 Checking “stable” properties, checkpoint &
recovery

Issues:
 Need to capture both node and channel states
 System cannot be stopped
 No global clock
Some notations:

 LSi : local state of process Pi


 send(mij) : send event of message mij from
process Pi to process Pj
 rec(mij) : similar, receive instead of send
 time(x) : time at which state x was recorded
 time(send(m)) : time at which send(m) occurred
send(mij) ∈ LSi iff
time(send(mij)) < time(LSi)

rec(mij) ∈ LSj iff
time(rec(mij)) < time(LSj)

transit(LSi, LSj) = { mij | send(mij) ∈ LSi and rec(mij) ∉ LSj }

inconsistent(LSi, LSj) = { mij | send(mij) ∉ LSi and rec(mij) ∈ LSj }
Global state: collection of local states
GS = {LS1, LS2,…, LSn}

1. GS is consistent iff
for all i, j, 1 ≤ i, j ≤ n,
inconsistent(LSi, LSj) = Ф

2. GS is transitless iff
for all i, j, 1 ≤ i, j ≤ n,
transit(LSi, LSj) = Ф

3. GS is strongly consistent if it is consistent and


transitless.
Chandy-Lamport’s Algorithm
Uses special marker messages.

One process acts as initiator, starts the state


collection by following the marker sending rule below.

Marker sending rule for process P:


 P records its state;
 then for each outgoing channel C from P on which
a marker has not been sent already,
 P sends a marker along C before any further
message is sent on C
When Q receives a marker along a channel C:

 If Q has not recorded its state then Q records the


state of C as empty; Q then follows the marker
sending rule

 If Q has already recorded its state, it records the


state of C as the sequence of messages received
along C after Q’s state was recorded and before Q
received the marker along C
Points to Note:

Markers sent on a channel distinguish messages sent


on the channel before the sender recorded its states
and the messages sent after the sender recorded its
state
The state collected may not be any state that actually
happened in reality, rather a state that “could have”
happened
Requires FIFO channels
Network should be strongly connected (works
obviously for connected, undirected also)
Message complexity O(|E|), where E = no. of links
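A sketch of the two marker rules at a single process; send() and record_local_state() are assumed stubs, and FIFO channels are assumed as the algorithm requires.

# Sketch of the Chandy-Lamport marker rules at one process.

class CLProcess:
    def __init__(self, out_channels, send, record_local_state):
        self.out_channels = out_channels        # channels this process sends on
        self.send = send
        self.record_local_state = record_local_state
        self.recorded = False                   # has the local state been recorded?
        self.closed = set()                     # incoming channels whose marker has arrived
        self.channel_state = {}                 # incoming channel -> recorded messages
        self.local_state = None

    def _record_and_send_markers(self):
        self.local_state = self.record_local_state()
        self.recorded = True
        for c in self.out_channels:             # marker before any further message on c
            self.send(c, "MARKER")

    def start_snapshot(self):                   # initiator follows the marker sending rule
        self._record_and_send_markers()

    def on_marker(self, channel):
        if not self.recorded:
            self._record_and_send_markers()
            self.channel_state[channel] = []    # state of this channel recorded as empty
        self.closed.add(channel)                # stop recording this channel

    def on_message(self, channel, msg):
        if self.recorded and channel not in self.closed:
            # sent before the sender recorded its state: belongs to the channel state
            self.channel_state.setdefault(channel, []).append(msg)
        # ... then the message is processed by the application as usual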
Lai and Young’s Algorithm

Similar to Chandy-Lamport’s, but does not require


FIFO
Boolean value X at each node, False indicates state
is not recorded yet, True indicates recorded
Value of X piggybacked with every application
message
Value of X distinguishes pre-snapshot and post-
snapshot messages, similar to the Marker
Ordering of Events

Lamport's “Happened Before” Relationship:
For two events a and b, a -> b if
 a and b are events in the same process and a occurred before b
 a is a send event of a message m and b is the corresponding receive
event at the destination process
 a->c and c->b for some event c
 a->b implies a is a potential cause of b


Causal Ordering: the potential-dependency (happened before)
relationship causally orders events.
 If a->b then a causally affects b
 If neither a->b nor b->a then a and b are concurrent (a || b)
Logical Clock

A mechanism for capturing chronological and causal relationships in a
distributed system.
– Distributed systems may have no physically synchronous global
clock, so a logical clock allows global order


In logical clock systems each process has two data structures: logical
local time and logical global time.
– Logical local time is used by the process to mark its own events,
and logical global time is the local information about global time.
– A special protocol is used to update logical local time after each
local event, and logical global time when processes exchange
data


Logical clocks are useful in 1) computation analysis, 2) distributed
algorithm design, 3) individual event tracking, and 4) exploring
computational progress.
Lamport's clock

Each process Pi keeps a clock Ci

Each event a in Pi is timestamped C(a), the value of Ci
when a occurred.

Ci is incremented by 1 for each event in Pi.

If a is a send event of message m from process Pi to Pj,
then on receipt of m,
Cj = max(Cj, C(a)+1)
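A minimal sketch of these clock rules for one process; the message format is hypothetical.

# Sketch of Lamport clock updates, following the rules on this slide.

class LamportClock:
    def __init__(self):
        self.c = 0

    def local_event(self):
        self.c += 1
        return self.c

    def send_event(self):
        self.c += 1
        return self.c                      # C(a): timestamp carried on the outgoing message

    def receive_event(self, msg_ts):
        self.c = max(self.c, msg_ts + 1)   # Cj = max(Cj, C(a)+1)
        return self.c

p, q = LamportClock(), LamportClock()
ts = p.send_event()                        # p sends m timestamped C(a)
print(q.receive_event(ts))                 # q's clock moves past C(a)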

Points to note:

If a->b , then C(a) < C(b)

-> is irreflexive partial order

Total ordering possible by arbitrarily ordering concurrent
events by process numbers
Limitation:
a-> b implies C(a) < C(b)
BUT
C(a) < C(b) doesn't imply a->b !!
So not a true clock!
Solution: Vector Clocks
An algorithm for generating a partial ordering of events in a distributed
system and detecting causality violations.

A system of N processes is a vector of N logical clocks, one clock per


process; a local "smallest possible values" copy of the global clock-
array is kept in each process, with the following rules for clock
updates:
1. Initially all clocks are zero.
2. Each time a process experiences an internal event, it increments its own
logical clock in the vector by one.

3. Each time a process prepares to send a message, it sends its entire vector
along with the message being sent.
 Each time a process receives a message,
 It increments its own logical clock in the vector by one and
 updates each element in its vector by taking the maximum of the value in
its own vector clock and
 the value in the vector in the received message (for every element).
Ci is a vector of size n, where n is no. of processes
C(a) is similarly a vector of size n

Update rules:

Ci[i] = Ci[i] + 1 for every event at process Pi


if a is send event of message m from Pi to Pj with vector
timestamp tm, then on receive of m:
Cj[k] = max(Cj[k], tm[k]) for all k
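A small sketch of the vector clock update rules, plus a less_than helper matching the partial order defined on the next slide; the class layout is an assumption.

# Sketch of vector clock updates for a system of n processes.

class VectorClock:
    def __init__(self, i, n):
        self.i, self.v = i, [0] * n

    def tick(self):                        # any local event, including a send
        self.v[self.i] += 1
        return list(self.v)                # timestamp to attach to an outgoing message

    def on_receive(self, tm):
        self.v = [max(a, b) for a, b in zip(self.v, tm)]
        self.v[self.i] += 1                # the receive is itself an event

def less_than(ta, tb):
    return all(a <= b for a, b in zip(ta, tb)) and ta != tb

p0, p1 = VectorClock(0, 2), VectorClock(1, 2)
t_send = p0.tick()                         # event a: send at P0
p1.on_receive(t_send)                      # event b: receive at P1
print(less_than(t_send, p1.v))             # True, consistent with a -> b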
Partial Order between Timestamps
For events a and b with vector timestamps ta and tb:

Equal: ta = tb iff ∀i, ta[i] = tb[i]
Not Equal: ta ≠ tb iff ∃i, ta[i] ≠ tb[i]

Less or equal: ta ≤ tb iff ∀i, ta[i] ≤ tb[i]

Not less or equal: ta ≰ tb iff ∃i, ta[i] > tb[i]

Less than: ta < tb iff (ta ≤ tb and ta ≠ tb)

Not less than: ta ≮ tb iff ¬(ta ≤ tb and ta ≠ tb)

Concurrent: ta || tb iff (ta ≮ tb and tb ≮ ta)


Properties
– a -> b iff ta < tb

– Events a and b are causally related


iff ta < tb or
tb < ta, else
they are concurrent
– Still not a total order, i.e., a partial ordering

– Antisymmetry: if ta < tb then ¬(tb < ta)


– Transitivity: if ta < tb and tb < tc
then ta < tc
» Or if a ->b and b->c
then a->c
Causal ordering of messages:
Application of vector clocks

If send(m1)-> send(m2), then every recipient of both


message m1 and m2 must “deliver” m1 before m2.

“deliver” – when the message is actually given to the


application for processing
Birman-Schiper-Stephenson Protocol
Assumes broadcast communication channels that do not lose or
corrupt messages (i.e., everyone talks to everyone).

Use vector clocks to "count" number of messages ( i.e. set d = 1 ).


n processes.

1. To broadcast m from process Pi, increment Ci(i), and timestamp m


with tm = Ci
2. When Pj (j≠i) receives m with timestamp tm, Pj delays
delivery of m until both
 Cj[i] = tm[i] – 1, and
 Pj has received all messages that Pi had received before sending m,
i.e., Cj[k] ≥ tm[k] for k = 1, 2, 3, …, n and k ≠ i

 Delayed messages are queued in Pj sorted by vector time.
Concurrent messages are sorted by receive time.

3. When m is delivered at Pj, Cj is updated according to vector clock


rule.
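A minimal sketch of the delivery test in step 2 and the clock update in step 3; the vector values are hypothetical.

# Sketch of the Birman-Schiper-Stephenson delivery condition at process Pj
# for a broadcast m sent by Pi with vector timestamp tm (d = 1 counting).

def can_deliver(Cj, tm, i):
    """Deliver m from Pi iff Cj[i] == tm[i] - 1 and Cj[k] >= tm[k] for all k != i."""
    if Cj[i] != tm[i] - 1:
        return False                          # an earlier broadcast from Pi is still missing
    return all(Cj[k] >= tm[k] for k in range(len(Cj)) if k != i)

def deliver(Cj, tm):
    """Update Pj's vector clock when m is actually delivered."""
    return [max(a, b) for a, b in zip(Cj, tm)]

Cj = [2, 0, 1]
tm = [3, 0, 1]                                # third broadcast from P0
print(can_deliver(Cj, tm, 0))                 # True
print(deliver(Cj, tm))                        # [3, 0, 1]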
Schiper-Eggli-Sandoz Protocol

 The goal of this protocol is to ensure that messages are given to the
receiving processes in order of sending.
 Unlike the Birman-Schiper-Stephenson protocol, it does not require
using broadcast messages.
 Each message has an associated vector that contains information for
the recipient to determine if another message preceded it.
 Clocks are updated only when messages are sent.
Schiper-Eggli-Sandoz Protocol...
Sending a message:
• All messages are timestamped and sent out together with a list of the
  timestamps of messages previously sent to other processes.
• Locally store the timestamp that the message was sent with.

Receiving a message:
• A message cannot be delivered if the list of timestamps mentions a message
  destined for this process that predates it.
• Otherwise, the message can be delivered, performing the following steps:
  1. Merge in the list of timestamps from the message:
     – Add knowledge of messages destined for other processes to our local
       list if we did not already know about a message destined for that
       process.
     – If the new list has a timestamp greater than one we already had
       stored, update our timestamp to match.
  2. Update the local logical clock.
  3. Check all locally buffered messages to see if they can now be delivered.
Distributed Computing
Introduction
What is Distributed Computing/System?

• Distributed computing
  – A field of computing science that studies distributed systems.
  – The use of distributed systems to solve computational problems.
• Distributed system
  – Wikipedia:
    » There are several autonomous computational entities, each of which has
      its own local memory.
    » The entities communicate with each other by message passing.
    » The components interact with each other in order to achieve a common goal.
  – Operating System Concepts:
    » The processors communicate with one another through various
      communication lines, such as high-speed buses or telephone lines.
    » Each processor has its own local memory.
What is a distributed system?
Distributed program
• A computer program that runs in a distributed system
Distributed programming
• The process of writing distributed programs

Autonomous: able to act independently
Communication: shared memory or message passing
"Concurrent system": probably a better term

A very broad definition:
A set of autonomous processes communicating among themselves to perform
a task.
What is Distributed Computing/System?
Common properties
• Fault tolerance
  – When one or more nodes fail, the system as a whole can keep working,
    possibly with degraded performance.
  – Need to check the status of each node.
• Each node plays only a partial role
  – Each computer has only a limited, incomplete view of the system.
  – Each computer may know only one part of the input.
• Resource sharing
  – Each user can share the computing power and storage resources in the
    system with other users.
• Load sharing
  – Dispatching tasks across the nodes helps spread the load over the whole
    system.
• Easy to expand
  – Adding nodes should take little effort, ideally none.
What is a distributed system? Cont..
A more restricted definition:
A network of autonomous computers that communicate by message passing to
perform some task.

A practical "distributed system" will probably have both
• Computers that communicate by messages
• Processes/threads on a computer that communicate by messages or shared
  memory
Why Distributed Computing?
The nature of the application
• Performance
  – Computing intensive: the task consumes a lot of computing time.
  – Data intensive: the task deals with a large amount of data or very large
    files.
• Robustness
  – No SPOF (Single Point Of Failure)
  – Other nodes can re-execute a task that was running on a failed node.
Advantages & Issues
Advantages
 Resource Sharing
 Higher Performance
 Fault Tolerance
 Scalability

Why is it hard to design them?



Un-reliability of communication and Unpredictable communication delays

Lack of global knowledge/clock

Lack of synchronization and No globally shared memory

Concurrency control

Failure and recovery
Common Architectures
Communication and coordination of work among concurrent processes
– Processes communicate by sending/receiving messages

Common Architectures
– Synchronous/Asynchronous
  1. In a synchronous system, operations (instructions, calculations, logic,
     etc.) are coordinated by one or more centralized clock signals.
  2. An asynchronous system, in contrast, has no global clock. Asynchronous
     systems do not depend on strict arrival times of signals or messages
     for reliable operation.
Common Architectures
Master/Slave architecture
• Master/slave is a model of communication where one device or process has
  unidirectional control over one or more other devices.
Database replication
• The source database can be treated as the master and the destination
  database can be treated as a slave.
Client-server
• Web browsers and web servers.
Common Architectures
Data-centric architecture
• Using a standard, general-purpose relational database management system
  instead of customized in-memory or file-based data structures and access
  methods
• Using dynamic, table-driven logic instead of logic embodied in previously
  compiled programs
• Using stored procedures instead of logic running in middle-tier application
  servers
• Using shared databases as the basis for communication between parallel
  processes instead of direct inter-process communication via message passing
Best Practice
Data Intensive or Computing Intensive
• Data size and the amount of data
  – The attributes of the data you consume
• Computing intensive
  – We can move data to the nodes where the jobs will execute.
• Data intensive
  – We can partition/replicate data across different nodes, then execute our
    tasks on those nodes.
  – Reduce data replication while tasks are executing.
    » Master nodes need to know the data location.
• No data loss when incidents happen
  – SAN (Storage Area Network)
  – Data replication on different nodes
Best Practice
Synchronization
• When splitting a task across different nodes, how can we make sure the
  sub-tasks stay synchronized?
Robustness
• The system should still be safe when one or several nodes fail.
• Failed nodes should be reintegrated when they come back online, with
  little or no extra action needed.
• Failure detection
  – When any node fails, the master nodes can detect the situation.
• Apps/users should not need to know that a partial failure has happened.
  – Restart the affected tasks on other nodes for the users.
Best Practice
Network issues
• Bandwidth
  – Bandwidth must be considered when copying files from one node to
    another, e.g. when we want to execute a task on nodes that do not yet
    hold the data it needs.
Scalability
• Easy to expand
Optimization
• What can we do if the performance of some nodes is poor?
  – Monitor the performance of each node.
  – Resume the same task on other nodes.
Best Practice
App/User
• Applications and users should not need to know how the nodes communicate
  with one another.
• User mobility – a user can access the system from a fixed access point or
  from anywhere.
Models for Distributed Algorithms
• Topology: completely connected, ring, tree, etc.
• Communication: shared memory / message passing
  (reliable? delay? FIFO/causal? broadcast/multicast?)
• Synchronous/asynchronous
• Failure models: fail-stop, crash, omission, Byzantine, ...
• An algorithm needs to specify the model on which it is supposed to work.
Complexity Measures
• Message complexity: number of messages
• Communication complexity / bit complexity: number of bits
• Time complexity:
  – For synchronous systems, number of rounds
  – For asynchronous systems, several different definitions exist.
Some Fundamental Problems
• Ordering events in the absence of a global clock
• Capturing the global state
• Mutual exclusion
• Leader election
• Clock synchronization
• Termination detection
• Constructing spanning trees
• Agreement protocols
