
Distributed Deadlock

Definitions
• Resources:
– There are two types of resources in computer system
• Reusable resources
– They are fixed in number; they can neither be created nor 
destroyed
– To use a resource, a process must request it, hold it 
during usage (allocation) and release it on completion
– The released resources may be re‐allocated to other processes
– Example: Memory, CPU, Printer, Disk blocks …etc
• Consumable resources
– These resources will vanish once they are consumed
– Producer process can produce any number of consumable 
resources if it is not blocked
– Example: Messages, Interrupt signals, V operation in semaphore 
…etc
Type of Resource Accesses
• Shared
– In this mode, the resource can be accessed by any number 
of processes simultaneously
– Example: Read lock on data item
• Exclusive
– In this mode, the resource can be accessed by only one 
process at any point of time
– Example: Write lock on data item
– In the theory of deadlocks, mostly exclusive locks are 
considered
– A reusable resource can be accessed in exclusive or shared 
mode at a time
– Consumable resources are always accessed in exclusive mode
Resource Request Model
• Single unit resource request model
– In this model, a process is allowed to request only one unit 
of the resource at a time.  The process is blocked till that 
resource is allocated.
– Example: A transaction (process) requests the write lock 
on a data item X [write_lock(X)]
• AND  request model
– In this model, a process is allowed to request multiple 
resources simultaneously.  It is blocked till all the resources 
are available.
– Example: Consider a data item X replicated at N sites.  
A transaction requesting the write lock on X needs to request 
the lock at all N sites where X is located, and is blocked till all 
the write lock requests are granted
Resource Request Model – Contd.
• OR request model
– In this model, a process is allowed to request multiple resources 
simultaneously. However, it is blocked only till at least one resource is 
allocated.
– Example: Consider a data item X replicated at N sites.  A
transaction requesting the read lock on X requests the lock at all N sites 
where X is located. However, the transaction is blocked only till at least one 
of the read lock requests is granted.
• AND‐OR request model
– Here the request of the process is specified in the form of a predicate 
where its atoms / variables are the resources.
– Example: R1 AND (R2 OR R3)
• P out of Q request model
– Here, a process can simultaneously request Q resources and will be 
blocked till any P out of the Q resources are available.
– Note that if P = 1, the model is OR request model; if P = Q, the model 
is AND request model.
Deadlock ‐ General
• A set of processes is said to be in a deadlock state if each of them is 
waiting for resources to be released by another process in 
the set.
• Necessary condition for the deadlock:
– Mutual exclusion: ‐ Non sharable characteristic of the resources. Ex: 
Memory location
– No pre‐emption:‐ The allocated resources can’t be pre‐empted from 
the process before its release by the process
– Hold and wait: The process holding some resources and waiting for 
other resources
– Circular wait:‐ The processes are waiting for one another for resources 
in a circular fashion
• Sufficient condition for the deadlock:
– Note that the above‐mentioned conditions are not sufficient to say 
that a set of processes is in deadlock. However, once a set of 
processes is in deadlock, we can observe all of those conditions. 
Hence they are necessary conditions. 
Deadlock Handling Strategies
• Deadlock Prevention
– Idea: Resources are granted to requesting processes in such a way 
that there is no chance of deadlock (vaccination). For example, 
allocate the requested resources only if all of them are available; else 
wait for all of them. [So the hold‐and‐wait condition cannot arise.]
• Deadlock Avoidance
– Idea: Resources are granted as and when requested by the processes  
provided the resulting system state is safe.  The system state is said to be 
safe if there exists at least one execution sequence for all the 
processes such that all of them can run to completion without getting 
into a deadlock. 
• Deadlock Detection and Recovery 
– Idea: In this strategy, the resources are allocated to the processes as 
and when requested. However,  the deadlock is detected by deadlock 
detection algorithm. If deadlock is detected, the system recovers from 
it by aborting one or more deadlocked processes.
Distributed Deadlock Algorithms
Distributed Deadlock Prevention
• Basic Idea: 
1. Each process is assigned a globally unique 
timestamp using Lamport’s logical clock, process 
number and site number [i.e., <logical clock value, 
process id, site id>].
2. Every request from the process for the resource 
should accompany the process timestamp.
3. The timestamp of requesting process (for the 
resource) is compared with the one who is holding 
the resource and suitable decision is made to 
prevent the occurrence of deadlock.
Distributed Deadlock Prevention
• Algorithm for distributed deadlock prevention
– Suppose a resource R is held by P1 at some site, 
and the process P2 requests R. Let TS(P1) and 
TS(P2) be the timestamps of P1 and P2 respectively.
– Wait‐die method:
• If TS(P2) < TS(P1) then P2 waits      /* P2 is older */ 
• Else P2 is killed                                  /* P2 is younger */
[Figure: chain of processes Pn → … → P2 → P1 → R, each waiting for its predecessor]

TS(P1) > TS(P2) > … > TS(Pn)
POSSIBLE WAITING SEQUENCE FOR RESOURCES (assuming Pi is waiting for 
some resource held by Pi‐1)
Distributed Deadlock Prevention
– Note on Wait‐die method:
1. P2 waits if the resource holder (i.e., P1) is a younger process
2. P2 is killed if the resource holder is an older process
3. A killed process is restarted with the SAME timestamp; after 
some time it becomes one of the older processes and will not be killed
4. No circular wait condition can hold in this method
– The waiting sequence TS(P1) > TS(P2) > … > TS(Pn) leads to a 
circular wait only if P1 waits for some resource held by 
Pn, i.e., P1 → Pn. This is possible only if TS(P1) < TS(Pn), which 
contradicts the waiting sequence, i.e., TS(P1) > TS(Pn)
5. No preemption of the process (resource holder) in this 
method. Here, the requester (P2) either waits or 
dies.
Distributed Deadlock Prevention
– Wound‐wait method:
• If TS(P2) < TS(P1) then P1 is killed  /* P2 is older */ 
• Else P2 waits                                 /* P2 is younger */

[Figure: chain of processes Pn → … → P2 → P1, each waiting for its predecessor]

TS(P1) < TS(P2) < … < TS(Pn)

POSSIBLE WAITING SEQUENCE FOR RESOURCES (assuming Pi is 
waiting for some resource held by Pi‐1)
Distributed Deadlock Prevention
– Note on Wound‐wait method:
1. The older process never waits for a younger 
resource holder
2. P1 is killed if the resource requester is an older process
3. A killed process is restarted with the SAME timestamp; after 
some time it becomes one of the older processes and will not be killed
4. No circular wait condition can hold in this method
– The waiting sequence TS(P1) < TS(P2) < … < TS(Pn) leads to a 
circular wait only if P1 waits for some resource held by 
Pn, i.e., P1 → Pn. This is possible only if TS(P1) > TS(Pn), which 
contradicts the waiting sequence, i.e., TS(P1) < TS(Pn)
5. There is preemption of the process (resource holder) in 
this method. Here, the requester (P2) will wait or the 
resource holder (P1) will be wounded.
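To make the two rules concrete, a minimal sketch (in Python, with hypothetical timestamps) of the decision each scheme takes when P2 requests a resource held by P1 is given below; as above, a smaller timestamp means an older process.

# Sketch: wait-die vs wound-wait decisions on a resource conflict.
# Assumes smaller timestamp = older process (hypothetical helper, not from the slides).

def wait_die(ts_requester, ts_holder):
    """Requester waits only if it is older than the holder; otherwise it dies."""
    if ts_requester < ts_holder:
        return "REQUESTER_WAITS"   # older requester is allowed to wait
    return "REQUESTER_DIES"        # younger requester is killed (restarts with same timestamp)

def wound_wait(ts_requester, ts_holder):
    """Older requester wounds (preempts) the holder; younger requester waits."""
    if ts_requester < ts_holder:
        return "HOLDER_WOUNDED"    # older requester preempts the younger holder
    return "REQUESTER_WAITS"       # younger requester waits

# Example: P2 (ts=5) requests a resource held by P1 (ts=9).
print(wait_die(5, 9))    # REQUESTER_WAITS  (P2 is older)
print(wound_wait(5, 9))  # HOLDER_WOUNDED   (P1 is preempted)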
Distributed Deadlock Prevention
• Method to handle more than one process waiting 
on same resource (R):
– Method 1: 
• At most one process is allowed to wait for the resource 
and all other processes are killed. If another process P3
requests the same resource R, then Wound‐wait is 
applied between P2 and P3 to select the oldest. Then, either the 
Wound‐wait or the Wait‐die method can be used between the 
oldest waiting process and P1, the resource holder.
– Method 2:
• The waiting processes are ordered in increasing order of 
their timestamps. A new process requesting the resource 
is made to wait if it is not older than the resource holder. If 
it is older, then the Wound‐wait method is applied between the new 
process and the resource holder
Distributed Deadlock Detection and 
Recovery
• Two components in this strategy:
– Distributed Deadlock detection
– Distributed Deadlock recovery
• Distributed Deadlock detection
– Using Wait for graph (WFG)
• WFG is a directed graph (V, E), where the vertices are the 
processes and a directed edge eij indicates that 
process Pi is waiting for a resource held by process Pj.
• A process Pi may reside at any node of the DCS.
• All the resources are assumed to be single‐unit resources
Distributed Deadlock Detection
• In Single Unit Resource Request Model:
– A deadlock in this model is detected by the existence 
of a cycle in the WFG
– Note that a process can be involved in only one cycle

[Figure: WFG over processes P1–P4 with a cycle among P1, P2 and P3; hence P1, P2 
and P3 are in a deadlock state]
Distributed Deadlock Detection

• In AND Request Model:
– A deadlock in this model is detected by the existence 
of a cycle in the WFG
– Note that a process can be involved in more than one 
cycle

[Figure: WFG over P1–P6 with two cycles that share P3. P3 has requested the 
resources held by P1 and P5, and P3 is holding the resources requested by P2 and P6]
Distributed Deadlock Detection

• In OR Request Model:
– A cycle in the WFG is not a sufficient condition for 
the existence of a deadlock
– Note that a process can be involved in more than one 
cycle

[Figure: WFG over P1–P6; P1 has request edges to P2, P4 and P5, with further edges 
P2 → P3, P3 → P1, P5 → P6 and P6 → P1]
Distributed Deadlock Detection

• In OR Request Model (contd.):
– Cycles in the WFG do not imply a deadlock here. In the figure above, 
P1 has requested the resources held by P2, P4 and P5. Once P1 gets 
the resource held by P4, the request edges from P1 to P4, P1 to P5 
and P1 to P2 are removed; hence there is no longer any cycle and no 
deadlock.
Distributed Deadlock Detection
• OR Request Model
– The necessary and sufficient condition for detecting 
a deadlock is the presence of a knot.
– Knot: A set of processes S is said to be a knot, if
• ∀Pi ∈ S, 
– Dependency Set(Pi) ⊆ S and
– Dependecy Set(Pi) ≠ Φ
– Dependency Set of a process Pi (DS(Pi)) : Set of all 
processes from which Pi is expecting the unit of 
resources to be released.
– Knot implies deadlock in any resource request model
Distributed Deadlock Detection

• OR Request Model: An Illustrative example 
[Figure: the same WFG over P1–P6 as on the previous slide]
Here, DS(P1) = {P2, P4, P5},  DS(P2) = {P3} , DS(P3)={P1} , DS(P4)={}, DS(P5)= {P6}, DS(P6)={P1}

Note that S = {P1, P2, P3} is not a knot, because DS(P1) is not contained in S. If you include P4 and P5, 
then S = {P1, P2, P3, P4, P5} is again not a knot because DS(P4) is the null set. And so on.

Similar argument will follow for S= {P1, P5, P6}  to show S is not a knot.
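A small sketch of the knot test, using the dependency sets listed above; representing DS as Python sets is an illustrative assumption.

# Sketch: checking whether a set S of processes forms a knot in an OR-model WFG,
# using the dependency sets from the example above.

ds = {
    "P1": {"P2", "P4", "P5"},
    "P2": {"P3"},
    "P3": {"P1"},
    "P4": set(),
    "P5": {"P6"},
    "P6": {"P1"},
}

def is_knot(S, ds):
    """S is a knot iff every Pi in S has a non-empty dependency set contained in S."""
    return all(ds[p] and ds[p] <= S for p in S)

print(is_knot({"P1", "P2", "P3"}, ds))              # False: DS(P1) is not a subset of S
print(is_knot({"P1", "P2", "P3", "P4", "P5"}, ds))  # False: DS(P4) is the null set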
Distributed Deadlock Detection

• In AND‐OR Request Model:
– Presence of KNOT in WFG implies that the system 
is in deadlock
• In P out of Q Request Model:
– Presence of KNOT in WFG implies that the system 
is in deadlock
Requirements of Distributed Deadlock 
Detection Algorithm 
• If there is a deadlock, then the  algorithm 
should detect all such deadlocks. (i.e., 
algorithm should detect all deadlocks)

• If the algorithm says that there is a deadlock, 
then there definitely should be one. (i.e., no false 
deadlock detection by the algorithm)
Pseudo Deadlock in Distributed 
Environment
• Let P1, P2, …, Pn be a sequence of processes such that 
Pi is waiting for the release of resources held by Pi+1
(where 1 ≤ i ≤ n‐1).
• Suppose Pn releases the resource first and then requests 
the resource held by P1. For this, Pn sends a 
message (M1) to the resource controller to release the 
allocated resource for which Pn‐1 is waiting, and then 
sends a message (M2) to the resource controller to 
request the resource held by P1.
• If M2 reaches the deadlock detection algorithm before 
M1, then a false / pseudo deadlock is detected!
[Figure: chain P1 → P2 → P3 → … → Pn‐1 → Pn; Pn sends Release (M1) followed by Request (M2)]
Distributed Deadlock Detection 
Algorithms

• Centralized Approach

• Distributed Approach: Chandy‐Misra‐Haas Algorithm
Centralized Approach
• The single unit request resource model is assumed in this 
approach. So, a cycle in the WFG implies the deadlock.
• The local wait for graph (LWFG) is constructed at each site.
• The global wait for graph (GWFG) is constructed at 
coordinator site based on following criteria:
– Whenever an edge is added or deleted at LWFG, the local site 
will inform this to coordinator. Or
– Periodically the local site will send its LWFG.
• The coordinator will construct the GWFG which is the 
concatenation of LWFGs. 
• If there is a cycle, then the coordinator will detect that the 
system is in deadlock.
• But the problem is that there is a possibility of false deadlock, 
as demonstrated in the next slide.
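Before the example on the following slide, here is a minimal sketch (with hypothetical site and transaction names) of the coordinator's job: merging local WFG edge lists into a GWFG and checking it for a cycle. It illustrates only the idea, not the slides' exact procedure.

# Sketch: coordinator merges local wait-for graphs (edge lists) and checks for a cycle.
# Site names and edge format are hypothetical.

def merge_lwfgs(lwfgs):
    """Union of the edge sets reported by each site gives the global WFG."""
    gwfg = {}
    for edges in lwfgs.values():
        for src, dst in edges:
            gwfg.setdefault(src, set()).add(dst)
    return gwfg

def has_cycle(gwfg):
    """Standard DFS-based cycle detection on the directed global WFG."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {v: WHITE for v in set(gwfg) | {w for ws in gwfg.values() for w in ws}}

    def dfs(v):
        color[v] = GREY
        for w in gwfg.get(v, ()):
            if color[w] == GREY or (color[w] == WHITE and dfs(w)):
                return True
        color[v] = BLACK
        return False

    return any(color[v] == WHITE and dfs(v) for v in list(color))

lwfgs = {"S1": [("T1", "T2"), ("T2", "T3")], "S2": [("T3", "T1")]}
print(has_cycle(merge_lwfgs(lwfgs)))  # True: T1 -> T2 -> T3 -> T1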
Centralized Approach

[Figure: Local wait‐for graphs at sites S1 and S2 over transactions T1, T2, T3 and 
resources R1, R2, R3, and the global wait‐for graph at the coordinator site]

At S1:‐ M1 : T3 releases R2
At S2:‐ M2 : T3 requests R3        } M1 occurs before M2
M2 reaches the coordinator before M1

Because the coordinator applies M2 before M1, the GWFG shows a cycle that never 
existed: a False Deadlock.

Solution: Timestamp based messages; ordering at the Coordinator
Handling of Pseudo Deadlock
• Pseudo deadlocks can be handled using 
timestamps based on Lamport’s clock value.
• Every messages pertaining to LWFG from the 
local site to the coordinator carries the 
timestamp.
• If the coordinator observes a cycle due to a 
message (M) from a local site, then the 
coordinator broadcasts a query asking whether any 
site has a message with a timestamp smaller than that of M.
• The decision about the cycle is taken only after 
the receipt of all the acknowledgements from the 
local sites.
Handling of Pseudo Deadlock
• In the above example, since T3 released R2 and 
then requested R3, M1's timestamp is 
smaller than M2's.
• When the coordinator receives M2, it suspects a 
deadlock and sends a message asking whether 
anyone has a message with a timestamp smaller 
than that of M2. Site S1 then sends a positive 
acknowledgement regarding M1. The coordinator 
now reforms the GWFG applying M1 first and then 
M2; hence there is no deadlock.
Chandy – Misra – Haas Algorithm
• Here, processes are allowed to request multiple 
resources at a time, so a process may wait for two or 
more resources.
• A process either waits for resources held by its 
co‐processes (i.e., processes in the same system) or by 
processes on other machines.
• The algorithm is invoked when a process has to wait 
for a resource. 
• For this, the process generates a probe message and 
sends it to the process(es) it is waiting on for the 
resource
Chandy – Misra – Haas Algorithm
• The probe message consists of three components:
– probe originator‐id, sender‐id, receiver‐id
• When a process receives the probe message, it 
checks whether it is waiting for any process(es). If so, it 
updates the 2nd and 3rd fields of the probe message 
and forwards it to the process(es) for which it is 
waiting.
• If the probe message goes all the way round and comes 
back to the originator (i.e., probe originator‐id == 
receiver id), then the set of processes along the path of 
probe message are in deadlock.
• The probe initiator may identify itself as the victim and 
will commit suicide to break the deadlock.
Chandy – Misra – Haas Algorithm
The probe messages initiated by process 0 are shown below. An arrow from 
process i to j indicates that process i is waiting for the resource held by j.

[Figure: processes 0–8 spread across Site 0, Site 1 and Site 2; the probes (0,2,3), 
(0,4,6), (0,5,7) and (0,8,0) are forwarded along the wait‐for edges, e.g. the probe 
from process 0 is forwarded by process 2 as (0,2,3)]

Probe message (0,0,1) is initiated by process 0, since it is waiting for the resource held by 
process 1.

Once process 0 receives the probe (0,8,0), it realizes that it is in the set of processes under 
deadlock. So, it will identify itself to commit suicide to break the deadlock.
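A minimal sketch of the probe propagation, using wait-for edges read off the probe labels in the figure above; recursion stands in for message passing, and duplicate-probe suppression is omitted.

# Sketch: Chandy-Misra-Haas probe handling (simplified; wait-for edges assumed from the figure).

waiting_for = {0: [1], 1: [2], 2: [3], 3: [4, 5], 4: [6], 5: [7], 6: [8], 8: [0]}

def send_probe(originator, sender, receiver, deadlocked):
    """Deliver probe (originator, sender, receiver) to `receiver` and propagate it."""
    if receiver == originator:
        deadlocked.add(originator)        # probe returned: originator is in a deadlock cycle
        return
    for nxt in waiting_for.get(receiver, []):
        # receiver is itself blocked: update sender/receiver fields and forward
        send_probe(originator, receiver, nxt, deadlocked)

deadlocked = set()
for first in waiting_for[0]:
    send_probe(0, 0, first, deadlocked)   # process 0 initiates probe (0, 0, 1)
print(deadlocked)                         # {0}: process 0 detects that it is deadlocked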
Chandy – Misra – Haas Algorithm
• The problem:
– In the above example, it is possible that processes 0, 1, 2, 3, 4, 6 and 8 
each initiate probe messages and identify themselves as the victim 
to commit suicide.
– This leads to the unnecessary termination of many processes on the same 
deadlock path.
• Solution: 
– The process ids along the way are appended to the probe message in the 
form of a queue.
– When the probe message comes back to the originator, it looks at the 
process with the highest / lowest process id in the queue, selects it as the 
victim and sends a message to the victim to commit suicide by itself.
– Even though many processes may initiate probe messages, the 
same set of processes is identified in the cycle. Further, there is a 
single highest / lowest process id in that cycle. Hence, only one victim 
is selected.
Summary
• Generic resource request models are 
discussed
• Distributed deadlock prevention algorithms 
and distributed deadlock detection and 
recovery algorithms are outlined.
References
• Advanced Operating Systems
– M Singhal and N G Shivarathri, McGraw Hill, 
International

• Distributed Operating Systems
– A S Tanenbaum, Prentice Hall

• Distributed Algorithms
– Nancy A Lynch, Morgan Kaufman
Load Balancing in Distributed 
System
CPU Scheduling ‐ Conventional
• Issue: In multiprogramming environment (with 
single CPU), which job is scheduled to the 
processor NEXT?
• Need: To allocate the job for execution 

DIFFERENT SCHEDULING TECHNIQUES:

1. FIRST COME FIRST SERVE
2. PRIORITY BASED
3. ROUND ROBIN BASED
4. MULTI LEVEL FEEDBACK QUEUES
5. ETC.

[Figure: a queue of JOBS feeding a single CPU scheduler]
Load (Job) Scheduling
• Issue: In distributed environment, which job is 
scheduled to which distributed processor?
• Need: To allocate the job for execution

[Figure: a stream of JOBs waiting to be assigned, with a question mark over which 
distributed processor each should go to]
Load Balancing
• Issue: Redistribution of processes/task in the DCS.
• Redistribution: Movement of processes from the heavily 
loaded system to the lightly loaded system
• Need: To improve the Distributed Systems’ throughput

[Figure: processes being moved from heavily loaded nodes to lightly loaded nodes]
Job Scheduling
[Figure: a job scheduler dispatching a stream of jobs to the local queue(s) and CPU of 
SITE 1, SITE 2, …, SITE N]

Can be considered as a QUEUEING MODEL ‐ MULTI JOB MULTI QUEUES SYSTEM
Job Scheduling Policies
• Random:
– Simple and static policy
– The job scheduler will randomly allocate the job to the site i
with some probability pi, where Σpi = 1
– No site state information is used
• Cycle:
– The job scheduler allocates the job to site ((i‐1)+1) mod N, i.e., the site 
following the one (i‐1) that received the previous job, so sites are 
chosen in round‐robin order
– It is a semi‐static policy, wherein the job scheduler remembers 
the previously allocated site.
• Join the Shortest Queue (JSQ):
– The job scheduler remembers the size of local queue in each 
site.
– The job will be allocated to the queue which is shortest at the 
point of arrival of that job.
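A small sketch of the three dispatch policies above; the queue lengths, probabilities and number of sites are hypothetical.

# Sketch of the random, cyclic and join-the-shortest-queue dispatch policies.

import random

N = 4
queues = [3, 1, 4, 0]          # current local queue length at each site
probs = [0.25] * N             # random policy: site probabilities, sum to 1
last_site = 2                  # cyclic policy: site that received the previous job

def dispatch_random():
    return random.choices(range(N), weights=probs)[0]

def dispatch_cyclic():
    return (last_site + 1) % N

def dispatch_jsq():
    return min(range(N), key=lambda i: queues[i])

print(dispatch_cyclic())  # 3
print(dispatch_jsq())     # 3 (site 3 has the shortest queue)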
Job Scheduling Policies – Parameters 
of Interest
• Mean response time of jobs
• Mean delay of the jobs
• Throughput of the system

• Obviously, JSQ has an edge over the 
other two in terms of these parameters.
Load Balancing
• Basically two types:
– Sender Initiated Load Balancing
– Receiver Initiated Load Balancing
Sender Initiated Load Balancing
Components of Sender Initiated Load 
Balancing 
• Idea: The node with the higher load (sender) initiates 
the load balancing process.
• Transfer Policy
– Policy about whether to keep the task/process at that 
site or transfer to some other site (or node)
• Location Policy
– If decided to transfer, policy about where to transfer?
• Note that any load balancing algorithm should 
have these two components
Transfer Policy
• At each node, there is a queue
• If queue length of a node < τ (threshold)
– Originating task is processed in that node only
• Else
– Transfer to some other node
[Figure: the local queue feeding the CPU, with the threshold τ marked on the queue]
• In this policy each node uses only local   state  
information
Location Policy

• Random Policy

• Threshold Location Policy

• Shortest Location Policy
Random Policy
• Node status information is not used
• The destination node is selected at random and the task is 
transferred to that node
• On receipt of the task, the destination node will do the 
following:
– If its queue length is < τ (the threshold), then accept the task
– Else transfer it to some other random node
• If the number of transfers reaches some limit, Llimit, then the 
last recipient of the task has to execute that task 
irrespective of its load. This is to avoid unnecessary 
thrashing of jobs. 
Threshold location policy
• Uses node status information to some extent about 
the destination nodes.
• Selects a node at random. Then, probes that node to 
determine whether transferring the task to that node 
would place its load above the threshold.
– If not, the task is transferred and the destination node has 
to process that task regardless of its state when the task 
actually arrives.
– If so, select another node at random and probe it in the 
same manner as above.
• The algorithm continues until either a suitable destination is 
found or the number of probes reaches some limit, Tlimit. If 
the number of probes exceeds Tlimit, then the originating node 
should process the task 
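A minimal sketch of this threshold location policy, assuming a hypothetical probe() that reports whether the polled node would stay at or below the threshold after receiving one more task.

# Sketch of the sender-initiated threshold location policy described above.

import random

TAU = 3                                   # threshold
T_LIMIT = 5                               # probe limit
queue_len = {0: 5, 1: 4, 2: 2, 3: 6}      # hypothetical queue lengths

def probe(node):
    """Would the node still be at or below the threshold after one more task?"""
    return queue_len[node] + 1 <= TAU

def locate(sender, nodes):
    """Return a destination node, or the sender itself if probing fails."""
    candidates = [n for n in nodes if n != sender]
    for _ in range(T_LIMIT):
        node = random.choice(candidates)
        if probe(node):
            return node                   # destination must accept, regardless of later arrivals
    return sender                         # probe limit reached: process the task locally

print(locate(0, list(queue_len)))         # node 2 if probed within the limit, else the sender 0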
Shortest Location Policy
• Uses additional information about the status of other 
nodes to make the “best” choice of the destination node.
• In this policy, Lp nodes are chosen at 
random and each is polled to determine its queue 
length.
• The task is transferred to the node having the shortest 
queue length among those with queue length < τ
• If none exists with queue length < τ, then the next Lp
nodes are polled and the above step is repeated
• Once the number of groups polled reaches some limit, Ls, 
the originator should handle the task
Receiver Initiated Load Balancing
Components of Receiver Initiated Load 
Balancing 
• Idea: An under‐loaded node (receiver) initiates the 
load balancing process. The receiver tries to get a 
task from an overloaded node, the sender.
• Transfer Policy (threshold policy)
– The decision is based on the CPU queue length. If it 
falls below a certain threshold, τ, the node is identified 
as a receiver and tries to obtain a task from a sender
• Location Policy
– If decided to receive, policy about from where to 
receive?
Location Policy
• Threshold Location Policy
– A random node is probed to see whether it can 
become a potential sender. If so, the task is 
transferred from the polled node. Else, the process is 
repeated until a potential sender is found or number 
of tries reaches a PollLimit.
• Longest Location Policy
– A pool of nodes is selected and probed to find a 
potential sender with the longest queue length (greater 
than τ). If one is found, then a task is received from that 
sender. Else the above process is repeated with a new 
pool of nodes. 
Drawback of Receiver Initiated 
Algorithm
• Most of the tasks selected for transfer from 
the senders are preemptive ones.
– The reason is: the job scheduler always gives 
higher priority to allocating a fresh job to the 
processor compared to the existing processes at 
different stages of execution. So, by the time the 
receiver decides to pick a task, it has already 
undergone some execution.
Symmetrically Initiated Algorithm
• These are algorithms having both sender 
initiated and receiver initiated components.
• The idea is that at low system loads the sender 
initiated component is more successful in 
finding the under loaded nodes and at high 
system loads the receiver initiated component 
is more successful in finding the overloaded 
nodes.
Symmetrically Initiated Algorithm

• Above Average Algorithm

• Adaptive Algorithms
– Stable Symmetrically Initiated Algorithm
– Stable Sender Initiated Algorithm
Above Average Algorithm
• Idea: An ‘acceptable range (AR)’ of load is 
maintained. 
– The node is treated as a sender if its load > AR
– The node is treated as a receiver if its load < AR
– Else it is a balanced node.
• Transfer Policy: AR is obtained from two adaptive 
thresholds that are equidistant from the estimated 
average load of the system
– For example, if the estimated average load of the system = 
2, then lower threshold (LT) = 1 and upper threshold (UT) = 
3
– So, if the load of the node is <= LT then it is a receiver node. 
If the load of the node is >= UT, then it is a sender node. 
It is a balanced node otherwise.
Above Average Algorithm – Contd.
• Location Policy: Consists of two components
– Sender Initiated Component:
1. The node with load > AR is called a sender. The sender broadcasts 
a TOOHIGH message, sets a TOOHIGH timeout alarm and listens for 
an ACCEPT message
2. On receipt of TOOHIGH message, the receiver (whose load < AR) 
• cancels its TOOLOW timeout alarm 
• sends ACCEPT message to the node which has sent TOOHIGH message
• increments its load value
• set AWAITINGTASK timeout alarm
• if no task transfer within AWAITINGTASK period, then its load value is 
decremented 
3. On receipt of an ACCEPT message, the sender sends the task to the 
receiver. [Note that the broadcast TOOHIGH message will be 
received by many receivers, and the sender transfers the task to the 
node whose ACCEPT message arrives first].
4. On expiry of TOOHIGH timeout period, if no ACCEPT message is 
received by the sender, then sender infers that its estimated 
average system load is too low. To correct the problem, it 
broadcasts CHANGEAVERAGE message to increase the average 
estimated load at all other sites.
Above Average Algorithm – Contd.
– Receiver Initiated Component:
1. Receiver broadcasts TOOLOW message, sets TOOLOW 
timeout alarm and wait for TOOHIGH message.
2. On receipt of TOOHIGH message, perform the 
activities as in step 2 of Sender Initiated Component
3. If the TOOLOW timeout period expires, then it infers that 
its estimated average system load is too high and 
broadcasts a CHANGEAVERAGE message to decrease 
the estimated average load at all sites.
Stable Symmetrically Initiated 
Algorithm
• Idea: In this algorithm, the information gathered during 
polling is used to classify the node as SENDER, 
RECEIVER or BALANCED.
• Each node maintains a list for each of the class.
• Since this algorithm updates its lists based on what it 
learns from  (or by) probing, the probability of 
selecting the right candidate for load balancing is high.
• Unlike the Above Average algorithm, there is no broadcasting; 
hence the number of messages exchanged is smaller.
• Initially each node assumes that every other node is 
RECEIVER except itself. So, the SENDER and BALANCED 
lists are empty to start with.
Stable Symmetrically Initiated 
Algorithm – Contd.
• Transfer Policy:
– This policy is triggered when the task originates or 
departs.
– This policy uses two thresholds: UT (upper 
threshold) and LT (lower threshold).
– The node is sender, if its queue length > UT, a 
receiver, if its queue length < LT and balanced   if  
LT ≤ queue length ≤ UT
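A tiny sketch of this transfer policy; the threshold values are hypothetical.

# Sketch of the transfer policy above: classify a node by its queue length
# using the upper threshold UT and lower threshold LT.

UT, LT = 4, 1

def classify(queue_length):
    if queue_length > UT:
        return "SENDER"
    if queue_length < LT:
        return "RECEIVER"
    return "BALANCED"

print(classify(6), classify(0), classify(2))  # SENDER RECEIVER BALANCED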
Stable Symmetrically Initiated 
Algorithm – Contd.
• Location Policy: Has two components
– Sender Initiated Component:
1. When the node becomes sender, it polls the node at the head of its 
RECEIVER list. The polled node removes the sender under 
consideration from its RECEIVER list and puts it at the head of its 
SENDER list (i.e., it learns!). It also informs the sender 
whether it itself is a sender, a receiver or a balanced node.
2. On receipt of the reply the sender does the following:
• If the polled node is receiver, sender transfers the task to it and updates its 
list (putting the polled node at the head of RECEIVER or BALANCED list)
• Otherwise, updates the list (putting the polled node at the head of SENDER 
or BALANCED list) and again start polling the next node in its RECEIVER list  
3. The polling  process stops if
• The receiver is found      or
• RECEIVER list is empty    or
• Number  of polls reaches POLL‐LIMIT
4. If polling fails, the arrived task has to be processed locally. However, 
there is a chance of migration under preemptive category.
Stable Symmetrically Initiated 
Algorithm – Contd.
– Receiver Initiated Component:
1. When a node becomes a receiver, it polls the node at the head of 
its SENDER list. The polled node updates its lists (i.e., places this 
node at the head of its RECEIVER list). It also informs the receiver 
whether it itself is a sender, a receiver or a balanced node.
2. On receipt of the reply the receiver does the following:
• If the responded node is  a receiver or a balanced node, then its list is 
updated  accordingly.
• Otherwise (i.e., responded node is a sender), the task sent by it is 
received and the list is updated accordingly
3. The polling process stops if:
• The sender is found or
• No more entries in the SENDER list  or
• The number of polls reaches POLL‐LIMIT
• Note that at high load the receiver‐initiated polling succeeds, and at low load 
the sender‐initiated polling succeeds. So this improves the performance.
Stable Sender Initiated Algorithm
• In this algorithm, there is no transfer of tasks 
when a node becomes receiver. Instead, its status 
information is shared. Hence there are no 
preemptive task transfers.
• The sender initiated component is same as that 
of stable symmetrically initiated algorithm. (like 
list generation, learning the status via polling 
…etc)
• Stable sender initiated algorithm maintains  an 
array (at each node) called status vector of size = 
number of nodes in DCS.
Stable Sender Initiated Algorithm
Status vector

[Figure: status vector at node i, with one entry per node 1 … N]

‐ Entry j in the status vector of node i indicates node j's best guess (receiver / 
sender / balanced node) about node i's state
‐SENDER INITIATED COMPONENT 
‐ When the node becomes a sender, it polls the node (say j) at the head of its 
RECEIVER list
‐ The sender updates its jth entry of its status vector as sender.
‐ Likewise, the polled node (j) updates ith entry in its status vector based on 
its reply it sent to the sender node.
‐ Note that above two aspects are  additional information it learns along with 
the other things as in sender component of stable symmetrically initiated 
algorithm  
‐RECEIVER INITIATED COMPONENT 
‐ When the node becomes a receiver, it checks its status vector and informs all 
those nodes that are misinformed about its current state 
‐ The status vector at the receiver side is then updated to reflect these changes
Stable Sender Initiated Algorithm

• Advantages:
– No broadcasting of messages by the receiver 
about its status
– No preemptive transfer of jobs, since no task 
transfers under receiver initiated component
– Additional learning using status vector  reduces 
unnecessary polling
Challenges in Load Balancing Algorithms
• Scalability:
– Ability to make quicker decisions about task transfers with less 
effort
• Location transparency:
– Transfer of tasks for balancing are invisible to the user.
• Determinism:
– Correctness in the result inspite of task transfers
• Preemption:
– The transfer of a task to a node should not lead to degraded 
performance for tasks generated at that node. So, there is a 
need to preempt the transferred task when a local task arrives at the node
• Heterogeneity:
– Heterogeneity in terms of processors, operating systems and 
architecture should not be a hindrance to task transfers.
Summary
• Differences between CPU scheduling, Job 
scheduling and load balancing are discussed.
• Different load balancing algorithms are 
discussed. They are categorized as sender 
initiated, receiver initiated, symmetrically 
initiated and variations of symmetrically 
initiated algorithms.
References
• Advanced Operating Systems
– M Singhal and N G Shivarathri, McGraw Hill, 
International

• Distributed Operating Systems
– A S Tanenbaum, Prentice Hall

• Distributed Systems Concepts and Design
– G Coulouris and  J Dollimore, Addison Wesley 
Leader Election
Leader Election
Leader election is the process of designating a single process as the
organizer of some task distributed among several computers (nodes).
Before the task is begun, all network nodes are either unaware which
node will serve as the "leader" (or coordinator) of the task, or unable to
communicate with the current coordinator.
After a leader election algorithm has been run, however, each node
throughout the network recognizes a particular, unique node as the task
leader.
The network nodes communicate among themselves in order to decide
which of them will get into the "leader" state.
For that, they need some method in order to break the symmetry among
them.
For example, if each node has unique and comparable identities, then
the nodes can compare their identities, and decide that the node with the
highest identity is the leader.
Leader Election
The problem of leader election is for each node eventually to
decide whether it is a leader or not, subject to the constraint that
exactly one node decides that it is the leader.

-- An algorithm solves the leader election problem if:


States of nodes are divided into elected and not-elected states.
Once elected, it remains as elected (similarly if not elected).
In every execution, exactly one node becomes elected and the rest
determine that they are not elected.
-- A valid leader election algorithm must meet the following conditions:
Termination: the algorithm should finish within a finite time once the
leader is selected.
Uniqueness: there is exactly one node that considers itself as
leader.
Agreement: all other nodes know who the leader is.
Leader Election
An algorithm for leader election may vary in following aspects:
Communication mechanism: the nodes are either synchronous
in which processes are synchronized by a clock signal or
asynchronous where processes run at arbitrary speeds.
Process names: whether processes have a unique identity or are
indistinguishable (anonymous).
Network topology: for instance, ring, acyclic graph or complete
graph.
Size of the network: the algorithm may or may not use knowledge
of the number of nodes in the system.
Ring Network

-- A ring network is a connected-graph topology in which each node is
connected to exactly two other nodes, i.e., for a graph with n nodes, there
are exactly n edges connecting the nodes.
-- A ring can be
unidirectional, means nodes only communicate in one direction (a
node could only send messages to the left or only send messages to
the right), or
bidirectional, meaning nodes may transmit and receive messages in
both directions (a node could send messages to the left and right).
Leader Election in Rings
Models
 Synchronous or Asynchronous
 Anonymous (no unique id) or Non-anonymous (unique ids)
• A ring is said to be anonymous if every node is identical.

• There is no deterministic algorithm to elect a leader in anonymous


rings, even when the size of the network is known to the processes.
• This is due to the fact that there is no possibility of breaking symmetry
in an anonymous ring if all processes run at the same speed and their
states remain identical at any instant.
 Uniform (no knowledge of ‘n’, the number of nodes in the ring
network) or non-uniform (knows ‘n’)
Leader Election in Rings

Known Impossibility Result:


 There is no synchronous, non-uniform leader election
protocol for anonymous rings (i.e., where the processes are anonymous).
• Implies that there are no uniform algorithms as well
• Implies that there are no asynchronous algorithms as well
Election in Asynchronous Rings
Lelann-Chang-Robert’s Algorithm

– The algorithm assumes that each node has a Unique


Identification (UID) and that the nodes can arrange
themselves in a unidirectional ring with a communication
channel going from each process to the clockwise neighbour.
– The algorithm can be described as follows:
 Initially each node in the ring is marked as non-participant.
 A node that notices a lack of leader starts an election.
 It creates an election message containing its UID and then sends
this message clockwise to its neighbour.
 Every time a node sends or forwards an election message, the node
also marks itself as a participant.
Lelann-Chang-Robert’s Algorithm

– When a node receives an election message it compares the UID in


the message with its own UID.
If the UID in the election message is larger, the node unconditionally
forwards the election message in a clockwise direction.
If the UID in the election message is smaller, and the node is not yet
a participant, the node replaces the UID in the message with its own
UID, sends the updated election message in a clockwise direction.
If the UID in the election message is smaller, and the node is already
a participant (i.e., the node has already sent out an election message
with a UID at least as large as its own UID), the node discards the
election message.
If the UID in the incoming election message is the same as the UID
of the node, that node starts acting as the leader.
Lelann-Chang-Robert’s Algorithm

– When a node starts acting as the leader, it begins the second


stage of the algorithm.
The leader node marks itself as non-participant and sends an elected
message to its neighbour announcing its election and UID.
When a node receives an elected message, it marks itself as non-
participant, records the elected UID, and forwards the elected
message unchanged.
When the elected message reaches the newly elected leader, the
leader discards that message, and the election is over.

– Assuming there are no failures this algorithm will finish.


– Algorithm works for any number of processes N, and does not
require any node to know how many nodes are in the ring.
Lelann-Chang-Robert’s Algorithm
In Summary
 send own id to node on left
 if an id received from right, forward id to left node
only if received id greater than own id, else ignore
 if own id received, declares itself “leader”
works on unidirectional rings
worst-case message complexity = Θ(n²)
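A minimal round-by-round simulation of the algorithm summarized above, on a hypothetical unidirectional ring of UIDs.

# Sketch: Lelann-Chang-Roberts election simulated synchronously on a unidirectional ring.
# All nodes are treated as participants from the start, so smaller incoming UIDs are discarded.

uids = [3, 7, 2, 9, 5]           # uids[i] is the UID of the node at ring position i

def lcr(uids):
    n = len(uids)
    messages = list(uids)        # each node starts by sending its own UID clockwise
    leader = None
    while leader is None:
        nxt = [None] * n
        for i in range(n):
            msg = messages[i]
            if msg is None:
                continue
            j = (i + 1) % n      # clockwise neighbour
            if msg > uids[j]:
                nxt[j] = msg     # larger UID: forward unchanged
            elif msg == uids[j]:
                leader = uids[j] # own UID came back: node j declares itself leader
            # smaller UID: discarded
        messages = nxt
    return leader

print(lcr(uids))  # 9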
Hirschberg-Sinclair Algorithm
Algorithm
 Operates in multiple phases, requires bidirectional ring
 In the kth phase, send own id up to a distance of 2^k processes on both
sides of yourself (messages are relayed hop by hop, carrying the
id and the phase number k)
 if id received, forward if received id greater than own id,
else ignore
 last process in the chain sends a reply to originator if its
id less than received id
 replies are always forwarded
 A process goes to (k+1)th phase only if it receives a
reply from both sides in kth phase
 process receiving its own id – declare itself “leader”
Note:
Message Complexity: O(n log n) [check the
lower bound of comparison-based sorting]
Lots of other algorithms exist for rings
Lower Bound Result:
 Any comparison-based leader election algorithm in
a ring requires Ω(n log n) messages
Leader Election in Arbitrary Networks
FloodMax
Theorem-- FloodMax algorithm solves the leader-election problem in
a synchronous general network.
 synchronous, round-based
 at each round, each process sends the max. id seen so far (not
necessarily its own) to all its neighbors
 after diameter no. of rounds, if max. id seen = own id, declares itself
leader
 Time Complexity = diameter rounds
 Communication Complexity = O(d × m), where d is the diameter and
m = no. of edges
 does not extend to asynchronous model trivially
 Variations of building different types of spanning trees with no pre-
specified roots.
 Chosen root at the end is the leader (Ex., the DFS spanning tree)
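A minimal sketch of synchronous FloodMax on a hypothetical graph; node ids double as process ids.

# Sketch: every round, each process sends the maximum id seen so far to all neighbours;
# after `diameter` rounds, a node whose own id equals the maximum declares itself leader.

graph = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3, 5], 5: [4]}
diameter = 3

def floodmax(graph, diameter):
    max_seen = {v: v for v in graph}
    for _ in range(diameter):                         # synchronous rounds
        outgoing = {v: max_seen[v] for v in graph}
        for v in graph:
            for u in graph[v]:
                max_seen[u] = max(max_seen[u], outgoing[v])
    return [v for v in graph if max_seen[v] == v]     # exactly one leader

print(floodmax(graph, diameter))  # [5]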
Mutual Exclusion
Mutual Exclusion

Very well-understood in shared memory systems

Requirements:
 at most one process in critical section (safety)
 if more than one requesting process, someone
enters (liveness)
 a requesting process enters within a finite time (no
starvation)
 requests are granted in order (fairness)
Classification of Distributed Mutual Exclusion
(DME) Algorithms

Non-token based/Permission based


 Permission from all processes:
 e.g.
• Lamport, Ricart-Agarwala,
• Roucairol-Carvalho etc.
 Permission from a subset: e.g. Maekawa

Token based
 e.g. Suzuki-Kasami
Some Complexity Measures

No. of messages/critical section entry


Synchronization delay
Response time
Throughput
Lamport’s Algorithm
Every node i has a request queue qi, keeps requests
sorted by logical timestamps (total ordering enforced by
including process id in the timestamps)

To request critical section:


 send timestamped REQUEST (tsi, i) to all other
nodes
 put (tsi, i) in its own queue

On receiving a request (tsi, Pi):


 send timestamped REPLY to the requesting node Pi
 put request (tsi, Pi) in the queue
To enter critical section:
– Pi enters critical section if (tsi, Pi) is at the top of its own
queue, and
– Pi has received a message (any message) with timestamp
larger than (tsi, Pi) from ALL other nodes.

To release critical section:


– Pi removes its request from its own queue and sends a
timestamped RELEASE message to all other nodes
– On receiving a RELEASE message from Pi, a node removes Pi's 
request from its local request queue
Some points to note:

Purpose of REPLY messages from node Pi to Pj is to
ensure that Pj knows of all requests of Pi made prior to
sending the REPLY (and therefore, of any possible
request of Pi with timestamp lower than Pj's request)
Requires FIFO channels.
3(n – 1) messages per critical section invocation
Synchronization delay = max. message
transmission time
requests are granted in order of increasing
timestamps
Ricart-Agarwala Algorithm
Improvement over Lamport’s
Main Idea:
 node Pj need not send a REPLY to node Pi
 if Pj has a request with timestamp lower than the request of Pi
 (since Pi cannot enter before Pj anyway in this case)
Does not require FIFO
2(n – 1) messages per critical section invocation
Synchronization delay = max. message transmission
time
requests granted in order of increasing timestamps
To request critical section:
 send timestamped REQUEST message (tsi, Pi) to all other nodes

On receiving request (tsi, Pi) at Pj:


 send REPLY to Pi
• if Pj is neither requesting nor executing critical section or
• if Pj is requesting and Pi’s request timestamp is smaller
than Pj’s request timestamp.
 Otherwise, defer the request.

To enter critical section:


 Pi enters critical section on receiving REPLY from all nodes

To release critical section:


 send REPLY to all deferred requests
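A sketch of one node's Ricart-Agarwala bookkeeping; the send() transport and the class layout are assumptions, not part of the original algorithm description. Timestamps are (clock, node id) pairs, so the smaller tuple has priority.

# Sketch of Ricart-Agarwala at a single node (message transport is a supplied stub).

class RANode:
    def __init__(self, node_id, all_ids, send):
        self.id = node_id
        self.others = [i for i in all_ids if i != node_id]
        self.send = send                  # send(dst, msg) -- assumed transport stub
        self.clock = 0
        self.requesting = False           # also treated as "in critical section" here
        self.my_ts = None
        self.replies = set()
        self.deferred = []

    def request_cs(self):
        self.clock += 1
        self.my_ts = (self.clock, self.id)
        self.requesting = True
        self.replies = set()
        for j in self.others:
            self.send(j, ("REQUEST", self.my_ts, self.id))

    def on_request(self, ts, src):
        self.clock = max(self.clock, ts[0]) + 1
        if self.requesting and self.my_ts < ts:
            self.deferred.append(src)     # our request has priority: defer the reply
        else:
            self.send(src, ("REPLY", self.id))

    def on_reply(self, src):
        self.replies.add(src)
        if self.replies == set(self.others):
            pass                          # all replies received: enter the critical section

    def release_cs(self):
        self.requesting = False
        for src in self.deferred:         # answer every deferred request
            self.send(src, ("REPLY", self.id))
        self.deferred = []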
Roucairol-Carvalho Algorithm
--- Improvement over Ricart-Agarwala

Main idea
 once Pi has received a REPLY from Pj,
 it does not need to send a REQUEST to Pj again
 unless it sends a REPLY to Pj (in response to a
REQUEST from Pj)
 no. of messages required varies between 0 and
2(n – 1) depending on request pattern
 worst case message complexity still the same
Maekawa’s Algorithm

Permission obtained from only a subset of other


processes, called the Request Set (or Quorum)
Separate Request Set Ri for each process Pi
Requirements:
 for all i, j: Ri ∩ Rj ≠ Φ
 for all i: i Є Ri
 for all i: |Ri| = K, for some K
 any node i is contained in exactly D Request Sets,
for some D
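One well-known way to satisfy these requirements is a grid construction: arrange N = K×K nodes in a grid and let Ri be node i's row plus its column. The sketch below is only an illustration; its quorum size is about 2√N − 1, not the smaller quorums used for Maekawa's message-count figures later.

# Sketch: grid quorums for N = K*K nodes; any two quorums intersect and each
# quorum contains its own node (simple, non-optimal construction).

import math

def grid_quorums(n):
    k = math.isqrt(n)
    assert k * k == n, "sketch assumes N is a perfect square"
    quorums = {}
    for i in range(n):
        row, col = divmod(i, k)
        row_members = {row * k + c for c in range(k)}
        col_members = {r * k + col for r in range(k)}
        quorums[i] = row_members | col_members    # includes i itself
    return quorums

q = grid_quorums(9)
print(q[0])             # {0, 1, 2, 3, 6}
print(q[4] & q[8])      # non-empty: any two quorums intersect (Ri ∩ Rj ≠ Φ)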
A simple version

To request critical section:


 Pi sends REQUEST message to all processes in Ri

On receiving a REQUEST message:


 send a REPLY message if no REPLY message has been
sent since the last RELEASE message is received.
 Update status to indicate that a REPLY has been sent.
Otherwise, queue up the REQUEST

To enter critical section:


 Pi enters critical section after receiving REPLY from all
nodes in Ri
To release critical section:
 send RELEASE message to all nodes in Ri
 On receiving a RELEASE message, send REPLY
to next node in queue and delete the node from
the queue.
 If queue is empty, update status to indicate no
REPLY message has been sent.
Message Complexity: 3*sqrt(N)

Synchronization delay =
2 *(max message transmission time)

Major problem: DEADLOCK possible

Need three more types of messages (FAILED,


INQUIRE, YIELD) to handle deadlock.

Message complexity can be 5*sqrt(N)


Token based Algorithms

Single token circulates, enter CS when token is


present
No FIFO required
Mutual exclusion obvious
Algorithms differ in how to find and get the token
Uses sequence numbers rather than timestamps to
differentiate between old and current requests
Suzuki Kasami Algorithm

Broadcast a request for the token


Process with the token sends it to the requestor if it
does not need it

Issues:

 Current vs. outdated requests


 determining sites with pending requests
 deciding which site to give the token to
The token:
 Queue (FIFO) Q of requesting processes
 LN[1..n] : LN[j] is the sequence number of the request that node j executed most
recently

The request message:


 REQUEST(i, k): request message from node i for its k th
critical section execution

Other data structures


 RNi[1..n] for each node i, where RNi[j] is the largest
sequence number received so far by i in a REQUEST
message from j.
To request critical section:
 If i does not have the token, increment RNi[i] and send
REQUEST(i, RNi[i]) to all nodes
 if i has token already, enter critical section if the token is
idle (no pending requests), else follow rule to release
critical section

On receiving REQUEST(i, sn):


 set RNj[i] = max(RNj[i], sn)
 if j has the token and the token is idle, send it to i if RNj[i] =
LN[i] + 1.
 If token is not idle, follow rule to release critical section
To enter critical section:
 enter CS if token is present

To release critical section:


 set LN[i] = RNi[i]
 For every node j which is not in Q (in the token), add
node j to Q if RNi[j] = LN[j] + 1
 If Q is non empty after the above, delete first node
from Q and send the token to that node
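A sketch of the Suzuki-Kasami bookkeeping at one node, following the rules above; broadcast() and send_token() are assumed transport stubs, and the critical section itself is omitted.

# Sketch of Suzuki-Kasami at a single node.

from collections import deque

class SKNode:
    def __init__(self, i, n, broadcast, send_token):
        self.i, self.n = i, n
        self.RN = [0] * n                     # highest request sequence number seen per node
        self.token = None                     # {"LN": [...], "Q": deque()} when held here
        self.in_cs = False
        self.broadcast, self.send_token = broadcast, send_token

    def request_cs(self):
        if self.token is not None:
            self.in_cs = True                 # already holding the idle token: enter CS
            return True
        self.RN[self.i] += 1
        self.broadcast(("REQUEST", self.i, self.RN[self.i]))
        return False                          # wait for the token to arrive

    def on_request(self, j, sn):
        self.RN[j] = max(self.RN[j], sn)
        if self.token is not None and not self.in_cs and self.RN[j] == self.token["LN"][j] + 1:
            t, self.token = self.token, None  # token idle and j has an outstanding request
            self.send_token(j, t)

    def release_cs(self):
        self.in_cs = False
        t = self.token
        t["LN"][self.i] = self.RN[self.i]
        for j in range(self.n):               # append every node with a pending request
            if j not in t["Q"] and self.RN[j] == t["LN"][j] + 1:
                t["Q"].append(j)
        if t["Q"]:
            j = t["Q"].popleft()
            self.token = None
            self.send_token(j, t)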
Points to note:

 No. of messages: 0 if node holds the token


already, n otherwise

 Synchronization delay: 0 (node has the token) or


max. message delay (token is elsewhere)

 No starvation
Raymond’s Algorithm

Forms a directed tree (logical) with the token-holder


as root

Each node has variable “Holder” that points to its


parent on the path to the root. Root’s Holder variable
points to itself

Each node i has a FIFO request queue Q i


To request critical section:
 Send REQUEST to parent on the tree, provided i
does not hold the token currently and Q i is empty.
Then place request in Qi

When a non-root node j receives a request from i


 place request in Qj
 send REQUEST to parent if no previous
REQUEST sent
When the root receives a REQUEST:
 send the token to the requesting node
 set Holder variable to point to that node
When a node receives the token:
 delete first entry from the queue
 send token to that node
 set Holder variable to point to that node
 if queue is non-empty, send a REQUEST message to the
parent (node pointed at by Holder variable)
To execute critical section:
 enter if token is received and own entry is at the top of the
queue; delete the entry from the queue

To release critical section


 if queue is non-empty, delete first entry from the queue,
send token to that node and make Holder variable point to
that node
 If queue is still non-empty, send a REQUEST message to
the parent (node pointed at by Holder variable)
Points to note:

Avg. message complexity O(log n)

Sync. delay (T log n)/2, where T = max.


message delay
Capturing Global State
Global snapshot is global state
 Each distributed application has a number of processes (leaders)
running on a number of physical servers
• These processes communicate with each other via channels (text
messaging)
• A snapshot captures the local states of each process (e.g., program
variables) along with the state of each communication channel

Why do we need snapshots?


 Checkpointing: restart if the application fails

• Collecting garbage: remove objects that don’t have any references


• Detecting deadlocks: can examine the current application state
• Other debugging: a little easier to work with than printf...
Causal consistency
 Related to the Lamport clock partial ordering
• An event is presnapshot if it occurs before the local snapshot on a
process
• Postsnapshot if afterwards
• If event A happens causally before event B, and B is presnapshot, then A
is too
Proof
 If A and B happen on the same process, then this is trivially true
• Consider when A is the send and B is the corresponding receive event on
processes p and q, respectively
– Since B is presnapshot, q can’t have received a marker and p can’t have sent
a marker
– A must also happen presnapshot
• Similar logic for A happening postsnapshot
Global State Collection

Applications:
 Checking “stable” properties, checkpoint &
recovery

Issues:
 Need to capture both node and channel states
 System cannot be stopped
 No global clock
Some notations:

 LSi : local state of process Pi


 send(mij) : send event of message mij from
process Pi to process Pj
 rec(mij) : similar, receive instead of send
 time(x) : time at which state x was recorded
 time(send(m)) : time at which send(m) occurred
send(mij) ∈ LSi iff
time(send(mij)) < time(LSi)

rec(mij) ∈ LSj iff
time(rec(mij)) < time(LSj)

transit(LSi, LSj) = { mij | send(mij) ∈ LSi and rec(mij) ∉ LSj }

inconsistent(LSi, LSj) = { mij | send(mij) ∉ LSi and rec(mij) ∈ LSj }
Global state: collection of local states
GS = {LS1, LS2,…, LSn}

1. GS is consistent iff
for all i, j, 1 ≤ i, j ≤ n,
inconsistent(LSi, LSj) = Ф

2. GS is transitless iff
for all i, j, 1 ≤ i, j ≤ n,
transit(LSi, LSj) = Ф

3. GS is strongly consistent if it is consistent and


transitless.
Chandy-Lamport’s Algorithm
Uses special marker messages.

One process acts as initiator, starts the state


collection by following the marker sending rule below.

Marker sending rule for process P:


 P records its state;
 then for each outgoing channel C from P on which
a marker has not been sent already,
 P sends a marker along C before any further
message is sent on C
When Q receives a marker along a channel C:

 If Q has not recorded its state then Q records the


state of C as empty; Q then follows the marker
sending rule

 If Q has already recorded its state, it records the


state of C as the sequence of messages received
along C after Q’s state was recorded and before Q
received the marker along C
Points to Note:

Markers sent on a channel distinguish messages sent


on the channel before the sender recorded its states
and the messages sent after the sender recorded its
state
The state collected may not be any state that actually
happened in reality, rather a state that “could have”
happened
Requires FIFO channels
Network should be strongly connected (works
obviously for connected, undirected also)
Message complexity O(|E|), where E = no. of links
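A sketch of the two marker rules at a single process; send() and record_local_state() are assumed stubs, and FIFO channels are assumed as the algorithm requires.

# Sketch of the Chandy-Lamport marker rules at one process.

class CLProcess:
    def __init__(self, out_channels, send, record_local_state):
        self.out_channels = out_channels        # channels this process sends on
        self.send = send
        self.record_local_state = record_local_state
        self.recorded = False                   # has the local state been recorded?
        self.closed = set()                     # incoming channels whose marker has arrived
        self.channel_state = {}                 # incoming channel -> recorded messages
        self.local_state = None

    def _record_and_send_markers(self):
        self.local_state = self.record_local_state()
        self.recorded = True
        for c in self.out_channels:             # marker before any further message on c
            self.send(c, "MARKER")

    def start_snapshot(self):                   # initiator follows the marker sending rule
        self._record_and_send_markers()

    def on_marker(self, channel):
        if not self.recorded:
            self._record_and_send_markers()
            self.channel_state[channel] = []    # state of this channel recorded as empty
        self.closed.add(channel)                # stop recording this channel

    def on_message(self, channel, msg):
        if self.recorded and channel not in self.closed:
            # sent before the sender recorded its state: belongs to the channel state
            self.channel_state.setdefault(channel, []).append(msg)
        # ... then the message is processed by the application as usual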
Lai and Young’s Algorithm

Similar to Chandy-Lamport’s, but does not require


FIFO
Boolean value X at each node, False indicates state
is not recorded yet, True indicates recorded
Value of X piggybacked with every application
message
Value of X distinguishes pre-snapshot and post-
snapshot messages, similar to the Marker
Ordering of Events

Lamport's “Happened Before” Relationship:
For two events a and b, a -> b if
 a and b are events in the same process and a occurred before b
 a is a send event of a message m and b is the corresponding receive
event at the destination process
 a->c and c->b for some event c
 a->b implies a is a potential cause of b


Causal Ordering: the potential-dependency (happened before)
relationship causally orders events.
 If a->b then a causally affects b
 If neither a->b nor b->a then a and b are concurrent (a || b)
Logical Clock

A mechanism for capturing chronological and causal relationships in a
distributed system.
– Distributed systems may have no physically synchronous global
clock, so a logical clock allows global order


In logical clock systems each process has two data structures: logical
local time and logical global time.
– Logical local time is used by the process to mark its own events,
and logical global time is the local information about global time.
– A special protocol is used to update logical local time after each
local event, and logical global time when processes exchange
data


Logical clocks are useful in 1) computation analysis, 2) distributed
algorithm design, 3) individual event tracking, and 4) exploring
computational progress.
Lamport's clock

Each process Pi keeps a clock Ci

Each event a in Pi is timestamped C(a), the value of Ci
when a occurred.

Ci is incremented by 1 for each event in Pi.

If a is a send event of message m from process Pi to Pj,
then on receipt of m,
Cj = max(Cj, C(a)+1)
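A minimal sketch of these clock rules for one process; the message format is hypothetical.

# Sketch of Lamport clock updates, following the rules on this slide.

class LamportClock:
    def __init__(self):
        self.c = 0

    def local_event(self):
        self.c += 1
        return self.c

    def send_event(self):
        self.c += 1
        return self.c                      # C(a): timestamp carried on the outgoing message

    def receive_event(self, msg_ts):
        self.c = max(self.c, msg_ts + 1)   # Cj = max(Cj, C(a)+1)
        return self.c

p, q = LamportClock(), LamportClock()
ts = p.send_event()                        # p sends m timestamped C(a)
print(q.receive_event(ts))                 # q's clock moves past C(a)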

Points to note:

If a->b , then C(a) < C(b)

-> is irreflexive partial order

Total ordering possible by arbitrarily ordering concurrent
events by process numbers
Limitation:
a-> b implies C(a) < C(b)
BUT
C(a) < C(b) doesn't imply a->b !!
So not a true clock!
Solution: Vector Clocks
An algorithm for generating a partial ordering of events in a distributed
system and detecting causality violations.

A system of N processes is a vector of N logical clocks, one clock per


process; a local "smallest possible values" copy of the global clock-
array is kept in each process, with the following rules for clock
updates:
1. Initially all clocks are zero.
2. Each time a process experiences an internal event, it increments its own
logical clock in the vector by one.

3. Each time a process prepares to send a message, it sends its entire vector
along with the message being sent.
 Each time a process receives a message,
 It increments its own logical clock in the vector by one and
 updates each element in its vector by taking the maximum of the value in
its own vector clock and
 the value in the vector in the received message (for every element).
Ci is a vector of size n, where n is no. of processes
C(a) is similarly a vector of size n

Update rules:

Ci[i] = Ci[i] + 1 for every event at process Pi


if a is send event of message m from Pi to Pj with vector
timestamp tm, then on receive of m:
Cj[k] = max(Cj[k], tm[k]) for all k
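A small sketch of the vector clock update rules, plus a less_than helper matching the partial order defined on the next slide; the class layout is an assumption.

# Sketch of vector clock updates for a system of n processes.

class VectorClock:
    def __init__(self, i, n):
        self.i, self.v = i, [0] * n

    def tick(self):                        # any local event, including a send
        self.v[self.i] += 1
        return list(self.v)                # timestamp to attach to an outgoing message

    def on_receive(self, tm):
        self.v = [max(a, b) for a, b in zip(self.v, tm)]
        self.v[self.i] += 1                # the receive is itself an event

def less_than(ta, tb):
    return all(a <= b for a, b in zip(ta, tb)) and ta != tb

p0, p1 = VectorClock(0, 2), VectorClock(1, 2)
t_send = p0.tick()                         # event a: send at P0
p1.on_receive(t_send)                      # event b: receive at P1
print(less_than(t_send, p1.v))             # True, consistent with a -> b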
Partial Order between Timestamps
For events a and b with vector timestamps ta and tb:

Equal: ta = tb iff ∀i, ta[i] = tb[i]
Not Equal: ta ≠ tb iff ∃i, ta[i] ≠ tb[i]

Less or equal: ta ≤ tb iff ∀i, ta[i] ≤ tb[i]

Not less or equal: ta ≰ tb iff ∃i, ta[i] > tb[i]

Less than: ta < tb iff (ta ≤ tb and ta ≠ tb)

Not less than: ta ≮ tb iff ¬(ta ≤ tb and ta ≠ tb)

Concurrent: ta || tb iff (ta ≮ tb and tb ≮ ta)


Properties
– a -> b iff ta < tb

– Events a and b are causally related


iff ta < tb or
tb < ta, else
they are concurrent
– Still not a total order, i.e., a partial ordering

– Antisymmetry: if ta < tb then ¬(tb < ta)


– Transitivity: if ta < tb and tb < tc
then ta < tc
» Or if a ->b and b->c
then a->c
Causal ordering of messages:
Application of vector clocks

If send(m1)-> send(m2), then every recipient of both


message m1 and m2 must “deliver” m1 before m2.

“deliver” – when the message is actually given to the


application for processing
Birman-Schiper-Stephenson Protocol
Assumes broadcast communication channels that do not lose or
corrupt messages (i.e., everyone talks to everyone).

Use vector clocks to "count" number of messages ( i.e. set d = 1 ).


n processes.

1. To broadcast m from process Pi, increment Ci(i), and timestamp m


with tm = Ci
2. When Pj (j≠i) receives m with timestamp tm, Pj delays
delivery of m until both
 Cj[i] = tm[i] – 1, and
 Pj has received all messages that Pi had received before sending m,
i.e., Cj[k] ≥ tm[k] for k = 1, 2, 3, …, n and k ≠ i

 Delayed messages are queued in Pj sorted by vector time.
Concurrent messages are sorted by receive time.

3. When m is delivered at Pj, Cj is updated according to vector clock


rule.
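A minimal sketch of the delivery test in step 2 and the clock update in step 3; the vector values are hypothetical.

# Sketch of the Birman-Schiper-Stephenson delivery condition at process Pj
# for a broadcast m sent by Pi with vector timestamp tm (d = 1 counting).

def can_deliver(Cj, tm, i):
    """Deliver m from Pi iff Cj[i] == tm[i] - 1 and Cj[k] >= tm[k] for all k != i."""
    if Cj[i] != tm[i] - 1:
        return False                          # an earlier broadcast from Pi is still missing
    return all(Cj[k] >= tm[k] for k in range(len(Cj)) if k != i)

def deliver(Cj, tm):
    """Update Pj's vector clock when m is actually delivered."""
    return [max(a, b) for a, b in zip(Cj, tm)]

Cj = [2, 0, 1]
tm = [3, 0, 1]                                # third broadcast from P0
print(can_deliver(Cj, tm, 0))                 # True
print(deliver(Cj, tm))                        # [3, 0, 1]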
Schiper-Eggli-Sandoz Protocol

 The goal of this protocol is to ensure that messages are given to the
receiving processes in order of sending.
 Unlike the Birman-Schiper-Stephenson protocol, it does not require
using broadcast messages.
 Each message has an associated vector that contains information for
the recipient to determine if another message preceded it.
 Clocks are updated only when messages are sent.
Schiper-Eggli-Sandoz Protocol...
Sending a message:
• All messages are timestamped and sent out together with a list of the
  timestamps of messages previously sent to other processes.
• Locally store the timestamp that the message was sent with.

Receiving a message:
• A message cannot be delivered if the list of timestamps mentions a message
  destined for this process that predates it.
• Otherwise, the message can be delivered, performing the following steps:
  1. Merge in the list of timestamps from the message:
     – Add knowledge of messages destined for other processes to our local
       list if we did not already know about a message destined for that
       process.
     – If the new list has a timestamp greater than one we already had
       stored, update our timestamp to match.
  2. Update the local logical clock.
  3. Check all locally buffered messages to see if they can now be delivered.
Distributed Computing
Introduction
What is Distributed Computing/System?

• Distributed computing
  – A field of computing science that studies distributed systems.
  – The use of distributed systems to solve computational problems.
• Distributed system
  – Wikipedia:
    » There are several autonomous computational entities, each of which has
      its own local memory.
    » The entities communicate with each other by message passing.
    » The components interact with each other in order to achieve a common goal.
  – Operating System Concepts:
    » The processors communicate with one another through various
      communication lines, such as high-speed buses or telephone lines.
    » Each processor has its own local memory.
What is a distributed system?
Distributed program
• A computer program that runs in a distributed system
Distributed programming
• The process of writing distributed programs

Autonomous: able to act independently
Communication: shared memory or message passing
"Concurrent system": probably a better term

A very broad definition:
A set of autonomous processes communicating among themselves to perform
a task.
What is Distributed Computing/System?
Common properties
• Fault tolerance
  – When one or more nodes fail, the system as a whole can keep working,
    possibly with degraded performance.
  – Need to check the status of each node.
• Each node plays only a partial role
  – Each computer has only a limited, incomplete view of the system.
  – Each computer may know only one part of the input.
• Resource sharing
  – Each user can share the computing power and storage resources in the
    system with other users.
• Load sharing
  – Dispatching tasks across the nodes helps spread the load over the whole
    system.
• Easy to expand
  – Adding nodes should take little effort, ideally none.
What is a distributed system? Cont..
A more restricted definition:
A network of autonomous computers that communicate by message passing to
perform some task.

A practical "distributed system" will probably have both
• Computers that communicate by messages
• Processes/threads on a computer that communicate by messages or shared
  memory
Why Distributed Computing?
The nature of the application
• Performance
  – Computing intensive: the task consumes a lot of computing time.
  – Data intensive: the task deals with a large amount of data or very large
    files.
• Robustness
  – No SPOF (Single Point Of Failure)
  – Other nodes can re-execute a task that was running on a failed node.
Advantages & Issues
Advantages
 Resource Sharing
 Higher Performance
 Fault Tolerance
 Scalability

Why is it hard to design them?



Un-reliability of communication and Unpredictable communication delays

Lack of global knowledge/clock

Lack of synchronization and No globally shared memory

Concurrency control

Failure and recovery
Common Architectures
Communication and coordination of work among concurrent processes
– Processes communicate by sending/receiving messages

Common Architectures
– Synchronous/Asynchronous
  1. In a synchronous system, operations (instructions, calculations, logic,
     etc.) are coordinated by one or more centralized clock signals.
  2. An asynchronous system, in contrast, has no global clock. Asynchronous
     systems do not depend on strict arrival times of signals or messages
     for reliable operation.
Common Architectures
Master/Slave architecture
• Master/slave is a model of communication where one device or process has
  unidirectional control over one or more other devices.
Database replication
• The source database can be treated as the master and the destination
  database can be treated as a slave.
Client-server
• Web browsers and web servers.
Common Architectures
Data-centric architecture
• Using a standard, general-purpose relational database management system
  instead of customized in-memory or file-based data structures and access
  methods
• Using dynamic, table-driven logic instead of logic embodied in previously
  compiled programs
• Using stored procedures instead of logic running in middle-tier application
  servers
• Using shared databases as the basis for communication between parallel
  processes instead of direct inter-process communication via message passing
Best Practice
Data Intensive or Computing Intensive
• Data size and the amount of data
  – The attributes of the data you consume
• Computing intensive
  – We can move data to the nodes where the jobs will execute.
• Data intensive
  – We can partition/replicate data across different nodes, then execute our
    tasks on those nodes.
  – Reduce data replication while tasks are executing.
    » Master nodes need to know the data location.
• No data loss when incidents happen
  – SAN (Storage Area Network)
  – Data replication on different nodes
Best Practice
Synchronization
• When splitting a task across different nodes, how can we make sure the
  sub-tasks stay synchronized?
Robustness
• The system should still be safe when one or several nodes fail.
• Failed nodes should be reintegrated when they come back online, with
  little or no extra action needed.
• Failure detection
  – When any node fails, the master nodes can detect the situation.
• Apps/users should not need to know that a partial failure has happened.
  – Restart the affected tasks on other nodes for the users.
Best Practice
Network issues
• Bandwidth
  – Bandwidth must be considered when copying files from one node to
    another, e.g. when we want to execute a task on nodes that do not yet
    hold the data it needs.
Scalability
• Easy to expand
Optimization
• What can we do if the performance of some nodes is poor?
  – Monitor the performance of each node.
  – Resume the same task on other nodes.
Best Practice
App/User
• Applications and users should not need to know how the nodes communicate
  with one another.
• User mobility – a user can access the system from a fixed access point or
  from anywhere.
Models for Distributed Algorithms
• Topology: completely connected, ring, tree, etc.
• Communication: shared memory / message passing
  (reliable? delay? FIFO/causal? broadcast/multicast?)
• Synchronous/asynchronous
• Failure models: fail-stop, crash, omission, Byzantine, ...
• An algorithm needs to specify the model on which it is supposed to work.
Complexity Measures
• Message complexity: number of messages
• Communication complexity / bit complexity: number of bits
• Time complexity:
  – For synchronous systems, number of rounds
  – For asynchronous systems, several different definitions exist.
Some Fundamental Problems
• Ordering events in the absence of a global clock
• Capturing the global state
• Mutual exclusion
• Leader election
• Clock synchronization
• Termination detection
• Constructing spanning trees
• Agreement protocols
