
INDEX

S. No   Topic                                                                     Page No.

Week 1
1    Lecture 01 - Introduction to Distributed Systems                                  1
2    Lecture 02 - Basic Algorithms in Message Passing Systems                         25
3    Lecture 03 - Leader Election in Rings                                            55
4    Lecture 04 - Distributed Models of Computation, Causality & Logical Time         87
Week 2
5    Lecture 05 - Size of Vector Clock, Matrix Clocks, Virtual Time and
     Physical Clock Synchronization                                                  124
6    Lecture 06 - Global State and Snapshot Recording Algorithms                     154
7    Lecture 07 - Distributed Mutual Exclusion and Non-Token Based Approaches        182
8    Lecture 08 - Quorum Based Distributed Mutual Exclusion Approaches               198
Week 3
9    Lecture 09 - Token Based Distributed Mutual Exclusion Approaches                227
10   Lecture 10 - Consensus and Agreement Algorithms                                 260
11   Lecture 11 - Checkpointing & Rollback Recovery                                  298
Week 4
12   Lecture 12 - Deadlock Detection in Distributed Systems                          335
13   Lecture 13 - Distributed Shared Memory                                          364
14   Lecture 14 - Distributed Minimum Spanning Tree                                  395
Week 5
15   Lecture 15 - Termination Detection in Distributed Systems                       428
16   Lecture 16 - Message Ordering and Group Communication                           453
17   Lecture 17 - Self-Stabilization                                                 480
Week 6
18   Case Study 01 - Distributed Randomized Algorithms                               512
19   Case Study 02 - Peer-to-Peer Computing and Structured Overlay Networks          533
20   Case Study 03 - The Google File System (GFS)                                    555
Week 7
21   Case Study 04 - MapReduce                                                       586
22   Case Study 05 - HDFS                                                            620
23   Case Study 06 - Spark                                                           633
24   Case Study 07 - Distributed Algorithms for Sensor Networks                      665
Week 8
25   Case Study 08 - Authentication in Distributed Systems                           689
26   Case Study 09 - Bitcoin: A Peer-to-Peer Electronic Cash System                  717
27   Case Study 10 - Blockchain Technology                                           736
Distributed Systems
Prof. Rajiv Misra
Department of Computer Science and Engineering
Indian Institute of Technology, Patna

Lecture - 01
Introduction to Distributed Systems

So, this is the first lecture: Introduction to Distributed Systems.

(Refer Slide Time: 00:21)

In this introductory lecture on distributed systems, we are going to discuss the different requirements of distributed computing systems, the topics this course will cover, the textbooks, and so on.

So, before that, let us begin with the preface. The explosive growth of distributed computing systems makes understanding them imperative, yet difficult, because of the uncertainties introduced by asynchrony, limited local knowledge, and partial failures. Nature handles these intricacies effortlessly; think of a flock of birds, where the birds act as mobile agents that communicate with each other to achieve a common goal.

In the field of distributed computing, however, we must cope with all these intricacies, that is, asynchrony, limited local knowledge, and partial failures. To that end, this course will provide a theoretical underpinning for the design and analysis of many distributed systems, covering concepts such as communication, coordination, synchronization and uncertainty, up to lower-bound techniques. These will be discussed together in this course, and the course will be quite useful for many different applications.

The structure of this course on distributed systems goes like this.

(Refer Slide Time: 02:00)

So, this particular course, as you can see, is divided into two parts. The first part covers distributed systems from a systems perspective; the second part covers distributed systems from an algorithms perspective. The algorithms run on a model of a distributed system, so this model also has to be understood, and then the algorithms themselves: how to design them, how to analyze them, and the different intricacies of algorithm design in this particular problem setting.

The main topics we are going to focus on from the algorithms perspective are how to build spanning trees using flooding algorithms, and then the leader election algorithm; these are among the most important algorithmic design techniques and are the basic building blocks of distributed systems. From the systems perspective, we are going to cover global state recording, mutual exclusion, consensus, shared memory, checkpointing, rollback, and distributed hash tables. The case studies of distributed systems which we will cover in this part of the course are peer-to-peer systems, the Google File System, HDFS and an introduction to Spark.

(Refer Slide Time: 03:19)

In this particular course, we will use two textbooks. The first deals with the systems perspective; it is the one mentioned here by the authors Kshemkalyani and Singhal. The other textbook deals with the algorithms perspective and is by Jennifer Welch. We also have a reference book, Distributed Algorithms by Nancy Lynch.

(Refer Slide Time: 03:47)

Let us begin with the definition of a distributed system. A distributed system is a collection of independent entities that cooperate to solve a problem that cannot be solved individually. Basically, it is a collection of computers. These computers do not share a common memory and do not have a common physical clock; the only way they can communicate is through message passing, and for that they require a communication network.

The computers used in distributed systems are semi-autonomous and loosely coupled, while they cooperate to address the problem collectively.

(Refer Slide Time: 04:33)

So, before we understand distributed systems in more detail, let us note some properties of distributed systems to keep in mind at this point. Heterogeneity is one of the properties, because the system comprises different autonomous computers, which may have heterogeneous hardware and software components.

Concurrency is another property of a distributed system, and shared data is another. No global clock is also one of the important properties, and there are interdependencies: the components depend on each other.

(Refer Slide Time: 05:17)

Now, to understand the distributed system from the system perspective, let us look at this figure. In this diagram, you can see that the computers are autonomous computers, each shown with a processor, memory, an operating system and the communication protocol stack.

These different computers can communicate through the network, that is, the communication network. As far as the software that builds the distributed system is concerned, this software is called the middleware, and the part of the middleware that runs on each computer is written as software components.

So, this distributed system software uses the existing computers, their operating systems and the underlying computer network; each computer runs its part of the middleware, and together this forms a distributed system. The middleware binds the distributed system together.

(Refer Slide Time: 06:25)

So, to explain further: a distributed system connects autonomous processors by a communication network, and the software components that run on each of the computers use the local operating system and network protocol stack. The distributed software is termed middleware.

A distributed execution is the execution of processes across the distributed system to collectively achieve a common goal. An execution is also sometimes termed a computation or a run of the distributed system.

(Refer Slide Time: 06:57)

Furthermore, the distributed software, also called middleware, is designed in a layered architecture to simplify the complexity of the distributed software. This middleware, the distributed software that drives the distributed system, also provides transparency of heterogeneity at the platform level.

In the diagram you can see the distributed application mentioned here. This distributed application uses the distributed software, which is the middleware; the middleware runs on the operating system of each of the computers and also uses the underlying network protocol stack for communication. Several standards have also evolved over time for this middleware layer for distributed software development, such as OMG CORBA, RPC, DCOM, RMI, MPI, and so on.

(Refer Slide Time: 08:02)

So, that was the overview of the distributed system from a system perspective. Now, we are going to touch upon the motivations for distributed systems. The first is inherently distributed computation: in many applications, such as money transfer in banking or reaching a consensus among parties that are geographically distant, the computation is inherently distributed, so the distributed system model is required for such applications.

The next motivation is resource sharing: the sharing of resources such as peripherals, complete data sets and so on is a basic motivation behind building distributed systems. Another motivation is access to geographically remote data and resources, such as a bank database or a supercomputer. Then there is enhanced reliability: the possibility of replicating resources and execution enhances reliability, since geographically distributed resources are not likely to crash at the same time. That is another motivation for building distributed systems.

(Refer Slide Time: 09:19)

So, reliability entails several aspects. These are: availability, meaning the resources should be accessible at all times; integrity, meaning the value or state of the resource should be correct in the face of concurrent access; and fault-tolerance, the ability to recover from system failures. A further benefit is an increased performance/cost ratio obtained by accessing geographically remote data and by resource sharing. These are the aspects that reliability entails.

(Refer Slide Time: 10:02)

Another advantage of distributed systems is scalability: adding more processors to the communication network does not create a bottleneck in the communication network. The next advantage is modularity and incremental expandability: heterogeneous processors can be added without any bottleneck problems.

(Refer Slide Time: 10:18)

Now, we are going to discuss the design issues and challenges in distributed system design.

From the systems perspective of distributed system design, we are going to see what the intricacies are, and we have to understand the theoretical basis for this design. Another angle is the algorithmic perspective of distributed system design. Further design issues and challenges arise from recent technology advances and are driven by new applications, which become both the motivation for and the challenges in evolving distributed systems.

First, we are going to look at the design challenges from the systems perspective of distributed systems.

(Refer Slide Time: 11:14)

Here, the components involved from the systems perspective are, first, communication: the communication network through which the processors communicate with each other. Next are processes; some of the issues involved are the management of processes and threads at clients and servers, code migration, and the design of software and mobile agents. Synchronization is the most important part.

Synchronization, or coordination among the processes, is essential. Mutual exclusion is an example of synchronization, but many other forms of synchronization, such as leader election, physical clocks, logical clocks and global state recording algorithms, each require a different form of coordination, which we are going to cover in this part of the course in more detail.

(Refer Slide Time: 11:59)

Now, another system-level challenge is fault tolerance.

Fault tolerance requires maintaining correctness in spite of failures of links, nodes and processes. Fault tolerance is achieved using process resilience, reliable communication, distributed commit, checkpointing and recovery, agreement and consensus, failure detection, and self-stabilization; these are some of the techniques which we are going to cover when we discuss the design from the systems perspective.

(Refer Slide Time: 12:38)

Another design aspect from the systems perspective is transparency. Transparency means hiding the implementation policies from the user, and there are different kinds of transparency. The first is access transparency, which hides differences in data representation on different systems; location transparency hides the location of the resources.

(Refer Slide Time: 13:13)

Migration transparency allows relocating resources without changing their names. Relocation transparency is the ability to relocate resources even while they are being accessed. Replication transparency does not let the user become aware of any replication.

Concurrency transparency deals with masking the concurrent use of shared resources from the user. Failure transparency refers to the system being reliable and fault-tolerant without the failures being visible to the user at any point of time.

(Refer Slide Time: 13:38)

Now, that was the distributed system from the systems perspective. Next, we are going to touch upon another important component of distributed computing systems, namely distributed algorithms.

So, algorithms have to be developed. We are going to cover the fundamental algorithms which are the basic building blocks for developing distributed applications. In distributed systems, different complexity measures are of interest, such as time and space; these were used in classical sequential algorithms as well, but now communication is also involved, so communication cost is another complexity measure.

Communication cost includes the number of messages, the size of the messages, and the number of shared variables. Another component used in the complexity analysis is the number of faulty versus non-faulty components. The complications faced by distributed systems increase the scope for negative results, lower bounds and impossibility results.

All these things will be covered in distributed algorithm design, and we will discuss these distributed algorithms in detail. The fundamental issues in the design of distributed algorithms are the following three factors.

(Refer Slide Time: 15:22)

Asynchrony, limited knowledge and failures: these are the three important fundamental design issues in distributed algorithms. Asynchrony means that the absolute and relative timing of events cannot be known precisely.

So, the question is how algorithms are to be developed in this setting. Local view means that a computing entity can only be aware of the information it acquires, so it has only a local view of the global situation. The third issue is failures: the computing entities can fail independently, leaving some components operational while others are not. These three factors add complications to the design of distributed algorithms and make it challenging to develop algorithms in this problem setting. With asynchrony, we do not know when events are going to occur; with only a local view, we do not know the complete picture of the global situation, yet we have to come up with an algorithm; and failures mean that the components involved in the distributed system can fail independently, while the applications are expected to keep running in spite of those failures.

(Refer Slide Time: 16:55)

Distributed computing systems have been studied since around 1967, starting with Dijkstra and Lamport. Dijkstra received the Turing Award in 1972 and did foundational work on distributed algorithms and distributed systems. Leslie Lamport more recently received the Turing Award for his work on distributed algorithms and systems.

(Refer Slide Time: 17:13)

Leslie Lamport deserves special mention, because most of the work he has done we are going to cover as far as distributed systems fundamentals are concerned.

Leslie Lamport devised important algorithms and developed formal modeling and verification protocols to improve the quality of real distributed systems. He made fundamental contributions to theory and practice, notably the invention of concepts such as causality and logical clocks, safety and liveness, replicated state machines, and sequential consistency, to name a few. Lamport was the winner of the 2013 Turing Award for his work in distributed computing.

(Refer Slide Time: 17:56)

Now we are going to touch upon the algorithmic challenges in distributed systems; previously we saw the design challenges from the systems perspective, and now we look at the algorithmic challenges in designing distributed systems. The first important challenge is time and global state in a distributed system; let us see what this is. The processes in the system are spread across three-dimensional physical space. Another dimension, time, has to be superimposed uniformly across that space.

The challenges pertain to providing accurate physical time, because there is no common clock, and to providing a variant of time called logical time. Logical time is relative time; it eliminates the overhead of providing physical time for the different applications. Logical time can capture the logic and the inter-process dependencies within the distributed program, and also track the relative progress of each process.

So, instead of having a common physical clock, or implementing one, we are going to see how logical clocks can solve these problems without a physical clock.
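As a concrete illustration of how such a logical clock can be realized, the sketch below shows a Lamport-style scalar clock in Python. This is a minimal sketch under assumed names (the class LamportClock is illustrative, not from the lecture); vector and matrix clocks, covered in a later lecture, refine the same idea.

```python
class LamportClock:
    """Minimal Lamport-style scalar logical clock (illustrative sketch)."""

    def __init__(self):
        self.time = 0

    def local_event(self):
        # Every internal step advances the local logical time.
        self.time += 1
        return self.time

    def send_event(self):
        # A send is also an event: advance the clock and piggyback the
        # timestamp on the outgoing message.
        self.time += 1
        return self.time

    def receive_event(self, msg_timestamp):
        # On receipt, jump past both the local time and the sender's
        # timestamp, so causally later events get strictly larger timestamps.
        self.time = max(self.time, msg_timestamp) + 1
        return self.time
```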

(Refer Slide Time: 19:19)

The next algorithmic challenge is synchronization and coordination mechanisms. The processes must be allowed to execute concurrently, except when they need to synchronize to exchange information, that is, to communicate about shared data; synchronization is essential for the distributed processes to overcome their limited observation of the system state. The following mechanisms are used for synchronization and coordination. First, leader election, which deals with the asymmetry among processes. Then mutual exclusion: access to critical resources has to be coordinated, and that is done through mutual exclusion.

Then termination detection: through cooperation, the processes must be able to detect the required global state, called the termination state; this is termination detection in a distributed system. The next important mechanism is garbage collection; detecting garbage requires another form of coordination. These are the synchronization and coordination mechanisms which will be used in designing distributed applications.

(Refer Slide Time: 20:46)

Another important notion is a reliable and fault-tolerant distributed system. A reliable and fault-tolerant environment has multiple requirements and aspects, and these can be addressed by various strategies which we are going to cover in this part of the course.

The first is consensus algorithms; then come replication and replica management, voting and quorum systems, distributed databases and distributed commit, self-stabilizing systems, checkpointing and recovery algorithms, and failure detectors. Together, these strategies provide fault tolerance, that is, even if components such as nodes and links are failing, the distributed application can keep working without disruption; and that is a main motivation for a distributed system, to be reliable.

Consensus algorithms are required because, in the presence of failures, the remaining non-faulty processes have to agree on values so that the applications can continue to run. Replication and replica management matter because, when replicas of the data are available, the application can run despite failures.

Voting and quorums are also important: for example, when some important system has failed, the remaining processes have to proceed through a voting and quorum mechanism to keep running those applications. Distributed databases and distributed commit are also important applications, where the processes have to decide through discussion and synchronization with each other whether the commit that is taking place has to be completed or aborted.

Self-stabilization concerns how the system evolves when components fail and how much time it takes to stabilize itself; we are going to cover self-stabilizing systems briefly in this part of the course. Checkpointing and recovery are very important as far as fault tolerance is concerned: if there is a failure, the operations that were performed must be resumed with minimal loss. We will cover checkpointing and rollback recovery algorithms in this part of the course. Failure detectors are also very important, for detecting that nodes or links are not working or have failed.

(Refer Slide Time: 23:48)

Another important algorithmic challenge is group communication, multicast, and ordered message delivery. There are applications where group communication is required, so this paradigm is also useful for developing such applications.

(Refer Slide Time: 24:03)

Next is the distributed shared memory abstraction. The middleware, that is, the distributed system software, can provide an abstraction called distributed shared memory. Although there is no common memory, distributed shared memory is realizable on top of message passing systems, as we are going to see.

(Refer Slide Time: 24:26)

Now let us look at the applications of distributed computing systems and the newer challenges they bring. Mobile systems: mobile systems typically use wireless communication, which is based on electromagnetic waves and utilizes a shared broadcast medium. A mobile system is an application of distributed computing in which different elements cooperate to provide a service, the mobile service. Sensor networks are another application.

A sensor is a processor equipped with an electro-mechanical interface that is capable of sensing physical parameters such as temperature, velocity, pressure, humidity, and chemicals. Such nodes, called sensor nodes, are deployed to build cyber-physical systems that monitor physical activities or events, for example, to monitor whether a volcanic eruption or some other situation is occurring.

So, sensors have a lot of use in cyber-physical systems, which are a kind of large-scale distributed system built on the principles of distributed systems. Another application is ubiquitous or pervasive computing: ubiquitous systems represent a class of computing where processors embedded seamlessly in the environment perform application functions in the background. Smart environments, smart buildings and smart cities are all examples of ubiquitous and pervasive computing.

(Refer Slide Time: 26:18)

Another application is peer-to-peer computing. Peer-to-peer represents computing over an application-level network where all interactions among the processors are at the peer level, without any hierarchy among the processors. Thus, all processors are equal and play a symmetric role in the computation. Peer-to-peer systems are used for providing resources and services. Peer-to-peer networking is quite challenging as far as developing distributed applications is concerned; it is technically difficult, which is why the simpler client-server paradigm, which is not purely peer-to-peer and not purely distributed, is used by many applications, since industry feels comfortable with the client-server model.

Peer-to-peer computing models for different applications have been evolving over a period of time; a recent addition is Bitcoin, which is based on a peer-to-peer distributed computing design. We are going to touch upon it later in this course.

(Refer Slide Time: 27:39)

Distributed data mining is another application of distributed computing. Distributed data mining algorithms examine large amounts of data to detect patterns and trends, to mine or extract useful information. The traditional example is examining the purchasing patterns of customers in order to profile them and enhance the efficiency of directed marketing schemes.

Another application of distributed computing is grid computing. Analogous to the electrical power distribution grid, it is envisaged that information and computing grids will become a reality someday. Very simply stated, the idle CPU time of machines connected to the network will be made available to others.

(Refer Slide Time: 28:33)

Another application area is security in distributed systems. The traditional challenges of security in a distributed setting include: confidentiality, meaning only authorized persons can access the information; authentication, meaning ensuring that the received information came from the claimed source and that the identity of the sending process is genuine; and availability, meaning maintaining allowed access to services despite malicious actions.

Security in distributed systems is needed in payment systems, online purchasing, and also the recent Bitcoin, that is, digital money; realizing digital money involves a lot of security algorithms in a distributed setting. We are going to cover this aspect also, and it is important because in financial markets it is one of the most important factors in designing such applications on distributed systems.

So, in a nutshell, what we have seen here is that distributed systems have a wide variety of real-world applications, some of which we have covered in this introduction. This will be the basis for understanding the theoretical intricacies, for understanding the reasoning behind how things are designed and have evolved, for verifying that a design works correctly, and for appreciating its contribution; for all of this, it is necessary to be familiar with the fundamental principles.

(Refer Slide Time: 30:17)

To summarize, this lecture first characterized distributed systems and distributed algorithms by looking at various informal definitions, and then discussed the design issues and challenges from both theoretical and systems aspects.

In the upcoming lectures, we will try to give insight into the detailed concepts that will provide a good understanding of the further material.

Thank you.

Distributed Systems
Prof. Rajiv Misra
Department of Computer Science and Engineering
Indian Institute of Technology, Patna

Lecture – 02
Basic Algorithms in Message Passing Systems

Lecture 2 is about basic algorithms in message passing systems.

(Refer Slide Time: 00:20)

As a preface, a recap of the previous lecture: we have seen that distributed algorithms must cope with asynchrony, limited local knowledge, and failures. In this lecture, we are going to discuss these issues further, and how, under these constraints, we design distributed algorithms. The content of this lecture is as follows: a formal model of a distributed message passing system will be used to deal with asynchrony and local knowledge; the two main timing models we are going to cover are synchronous and asynchronous; and a few simple algorithms for message passing systems with arbitrary topology, under both the synchronous and asynchronous models, will be discussed.

The algorithms we are going to cover today broadcast information in the network, collect information, and construct a spanning tree of the network. Today's lecture thus provides the basic building blocks of distributed algorithms in the message passing model.

(Refer Slide Time: 01:33)

In the message passing model, processors communicate by sending messages over communication channels, where each channel provides bidirectional communication between two processors. The pattern of connections provided by the channels describes the topology of the system, and the collection of channels is referred to as the network.

(Refer Slide Time: 01:55)

Under the message passing model, the system consists of n processors P0 to Pn-1; the processors are indexed by i and become the nodes of the topology graph, which we are going to discuss now. The next component in the model is the bidirectional point-to-point channels, which are the undirected edges of the topology graph. Each processor labels its incident channels with the numbers 1, 2, and so on, up to its degree in the graph. Because of limited local knowledge, a processor does not know who is at the other end of each channel.

(Refer Slide Time: 02:57)

This is the model we assume, and it can be explained using this diagram.

Here you can see that node P1 has numbered its two channels as 1 and 2; similarly, the channel connecting P0 and P1 is numbered 3 by P0, because P0 does not know what channel number P1 has assigned to it. The labels are based only on local knowledge, and this forms the message passing model. You can also see that for the channel between P0 and P3, both P0 and P3 happen to give it channel id 1; similarly, for the channel connecting P0 and P2, both P0 and P2 have incidentally given it the same number.

Likewise, for the channel connecting P1 and P2, both P1 and P2 give it the same number.

(Refer Slide Time: 04:03)

Now, modeling processors and channels: a processor is modeled as a state machine that includes the local state of the processor. For the channels, the channel directed from processor Pi to Pj is modeled in two pieces: an outbuf variable of Pi and an inbuf variable of Pj. Every channel, being a connection between two processors Pi and Pj, requires these two buffers. The outbuf corresponds to the physical channel, that is, the messages which have been sent on the channel but not yet delivered sit in the outbuf, while the inbuf is the incoming message queue and is associated with the processor.

(Refer Slide Time: 05:07)

The relationship between these two variables, inbuf and outbuf, can be explained using this diagram, where we model the processors and channels as follows. You can see P1's local variables and its inbuf, shown in pink: the local variables together with the inbuf form the accessible state of a processor. Similarly, P2 has its own local variables and its inbuf, denoted with index 2. The white boxes indicate the outbufs, which represent the communication channels on which messages are sent.

(Refer Slide Time: 06:00)

The configuration is the vector of processor states, one per processor, including the outbufs, that is, the channels. A configuration captures the current snapshot of the entire system: the accessible processor states (local variables and incoming message queues) as well as the communication channels.

(Refer Slide Time: 06:24)

The next component in the message passing model is events. Two types of events occur in the system: deliver events and computation events.

(Refer Slide Time: 06:46)

A deliver event in a message passing system moves a message from the sender's outbuf to the receiver's inbuf, so that the message will be available the next time the receiver takes a step. This deliver event can be understood from the illustrative diagram: here the sender P1 has a message m1 in its outbuf that it wants to send to the receiver P2, and in the next step m1 will be delivered to the inbuf of P2.

(Refer Slide Time: 07:38)

So, from sender to receiver, the message is delivered using this event called a deliver event. The other event possible in a message passing system is called a computation event.

A computation event at processor Pi is denoted comp(i) and occurs at one processor. We start with the old accessible state, which consists of the local variables and the incoming messages, and apply the transition function of the processor's state machine, which handles all the incoming messages; we end with a new accessible state, an empty inbuf, and new outgoing messages. This computation event can be understood using the following illustrative diagram.

(Refer Slide Time: 08:23)

Here you can see the first state, representing the old local state: the processor has some messages in its inbuf and three outgoing buffers associated with its three channels. The computation event makes a transition in the state: it produces a new state, the associated inbuf becomes empty, and new messages appear in the outbufs associated with the channels of that process.

All three things happen in a computation event. Pink indicates the accessible state, that is, the local variables and incoming messages, and white indicates the outgoing message buffers.
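To make the inbuf/outbuf model and the two kinds of events concrete, here is one possible encoding in Python. It is only a sketch; the class names (Processor, MessagePassingSystem) and the shape of the transition function are assumptions for illustration, not part of the formal model.

```python
from collections import defaultdict, deque

class Processor:
    """A node of the model: its accessible state is the local state plus inbuf."""
    def __init__(self, pid, state="init"):
        self.pid = pid
        self.state = state
        self.inbuf = defaultdict(deque)          # sender id -> queued messages

class MessagePassingSystem:
    def __init__(self):
        self.outbuf = defaultdict(deque)         # (sender id, receiver id) -> in-transit messages

    def send(self, sender, receiver_id, msg):
        # A message placed in the sender's outbuf is "sent but not yet delivered".
        self.outbuf[(sender.pid, receiver_id)].append(msg)

    def deliver_event(self, sender, receiver):
        # del(i, j, m): move one message from Pi's outbuf to Pj's inbuf.
        msg = self.outbuf[(sender.pid, receiver.pid)].popleft()
        receiver.inbuf[sender.pid].append(msg)

    def computation_event(self, proc, transition):
        # comp(i): apply the transition function to the accessible state and all
        # queued messages, empty the inbuf, and queue the new outgoing messages.
        incoming = {ch: list(q) for ch, q in proc.inbuf.items()}
        proc.inbuf.clear()
        proc.state, outgoing = transition(proc.state, incoming)
        for receiver_id, msg in outgoing:
            self.outbuf[(proc.pid, receiver_id)].append(msg)
```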

(Refer Slide Time: 09:40)

That was the computation event; now, the execution. An execution in a message passing system is represented as an alternating sequence of configurations and events: configuration, event, configuration, event, and so on. In the first configuration, every processor is in its initial state and all the inbufs are empty. For each consecutive configuration-event-configuration triple, the new configuration is the same as the old one except that, if the event is a deliver event, the specified message is transferred from the sender's outbuf to the receiver's inbuf, and if it is a computation event, the specified processor's state changes according to the transition function.

(Refer Slide Time: 10:21)

That is called an execution. Now, admissibility. The definition of an execution gives some basic syntactic conditions. Usually, safety conditions are also specified; a safety condition states that nothing bad has happened yet. An example of a safety condition is that a step of P1 never immediately follows a step by any processor other than P0, that is, every step of P1 is immediately preceded by a step of P0.

(Refer Slide Time: 10:56)

This ensures a safety condition on the execution. Further, we can impose an additional constraint called a liveness condition. A liveness condition means that eventually something good happens.

A liveness condition is a condition that must hold a certain number of times, possibly an infinite number of times. Any sequence that satisfies all the required safety conditions for a particular system is called an execution. If an execution also satisfies the liveness conditions, it is called admissible. So, an execution satisfying the additional constraints is admissible.

(Refer Slide Time: 11:39)

These admissible executions are the ones that must solve the problem of interest. Now, the types of message passing systems: there are two types, asynchronous and synchronous.

A system is said to be asynchronous if there is no fixed upper bound on how long it takes for a message to be delivered or on how much time elapses between consecutive steps of a processor. That means asynchrony concerns both message delays and computation delays, which may be unknown or unbounded. An example of an asynchronous system is the Internet, where a message can take days to arrive, although it often takes only seconds.

(Refer Slide Time: 12:46)

The second type of system is the synchronous system. In the synchronous model, processors execute in lockstep: the execution is partitioned into rounds, and in each round every processor can send a message to each neighbor, the messages are delivered, and every processor computes based on the messages just received. For asynchronous message passing systems, an execution is admissible in the asynchronous model if every message in an outbuf is eventually delivered and every processor takes an infinite number of steps. There are no constraints on when these steps and events take place, so arbitrary message delays and relative processor speeds are not ruled out.

This models a reliable system in which no message is lost and no processor stops. This is an assumption we make when we use these two models for designing distributed algorithms.

(Refer Slide Time: 13:24)

The second is the synchronous message passing system. Here, a new definition of admissible captures the lockstep, in-unison feature of the synchronous model. This definition also implies that every message sent is delivered and every processor takes an infinite number of steps. Time is measured as the number of rounds until termination. We have now introduced the basic models for distributed algorithms, synchronous and asynchronous, without any failures; using these models, we are going to discuss some basic algorithms which serve as basic building blocks.

(Refer Slide Time: 14:18)

The first basic algorithms we are going to cover are broadcasting and convergecasting on a spanning tree. Let us begin with the design of broadcast over a rooted spanning tree. Broadcast is used to send information to all the nodes. Suppose the processors already have information about a rooted spanning tree of the communication topology. What is a spanning tree? A tree is a connected graph with no cycles; a tree that spans all the processors is called a spanning tree, and rooted means there is a unique root node of the tree, giving a rooted spanning tree.

In this algorithm we assume that a rooted spanning tree is already provided as an infrastructure facility, and then we look at the broadcast algorithm. The rooted spanning tree can be implemented using two local variables at each processor, parent and children, which indicate which incident channels lead to the parent and to the children in the rooted spanning tree.

(Refer Slide Time: 15:27)

These two variables form the rooted spanning tree structure over the network.

Now, the algorithm: the root initially sends the message M to its children; when processor Pi receives M from its parent, it sends M to its children and terminates. This forms the basic algorithm for broadcasting.

(Refer Slide Time: 15:55)

Let us understand Algorithm 1, the spanning tree broadcast algorithm. Initially, the message M is in transit from Pr, where Pr is the root, the processor that initiates the broadcast, to all its children in the spanning tree. The code for the root Pr is: upon receiving no message, terminate (since M is already in transit to its children). The code for every processor other than the root is: upon receiving M from the parent, send M to all children and then terminate.
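The sketch below simulates this broadcast on a given rooted spanning tree in Python, one tree level per round; it is a sketch of the message flow rather than the per-processor code, and the dictionary-based tree representation is an assumption for illustration.

```python
def broadcast(children, root, message):
    """Simulate the spanning-tree broadcast: `children` maps each process to
    the list of its children in the rooted spanning tree."""
    received = {root: message}            # the root already holds M
    frontier = [root]
    while frontier:                       # one tree level per synchronous round
        next_frontier = []
        for p in frontier:
            for child in children.get(p, []):
                received[child] = message    # deliver M over tree edge (p, child)
                next_frontier.append(child)
        frontier = next_frontier
    return received

# Example: a root Pr with children P1, P2 and grandchild P3; after at most
# depth-many rounds every process holds M, using n - 1 messages in total.
# broadcast({"Pr": ["P1", "P2"], "P1": ["P3"]}, "Pr", "M")
```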

(Refer Slide Time: 16:54)

This is the broadcast algorithm over a rooted spanning tree; let us understand it through the illustrative figure. In the figure, the root node Pr wants to send the message M to all the other nodes, so it sends M on its outgoing tree channels. The spanning tree edges are shown as solid lines, and the dotted lines are channels that are not in the spanning tree; M is sent only along the solid edges.

After M is received by a node Pi from its parent, Pi sends it on to its own children, shown as the dark nodes, and so on. In this way, step by step, the message is eventually delivered to all the nodes; this is broadcasting.

(Refer Slide Time: 17:57)

Looking at the complexity analysis of this broadcast algorithm on a rooted spanning tree: in the synchronous model, the time is the depth d of the spanning tree; that is, the time the algorithm takes to terminate at all nodes is exactly the depth of the spanning tree.

The depth of a spanning tree can be at most n - 1, which happens when the topology is a chain. The number of messages is n - 1, because only one message is sent over each edge of the spanning tree and the total number of edges in any tree on n nodes is n - 1. The asynchronous model has the same complexity: the time required is the depth d of the tree and the message complexity is n - 1.

(Refer Slide Time: 18:56)

That was the simple algorithm for broadcasting; a broadcast algorithm is used in a distributed system or network to transmit information to all the nodes.

The other simple algorithm we are going to see is convergecast. Convergecast is the opposite of broadcast: it is used to collect information. Again we assume a rooted spanning tree, implemented using the parent and children variables at each node. The leaves send their information to their parents, and a non-leaf node waits to get a message from each child.

(Refer Slide Time: 19:47)

It then aggregates, or combines, this information with its own, and the combined information is transmitted to its parent. Take this example.

In this example we follow the same convention: the dotted lines are non-tree edges and the solid lines are the tree edges. The convergecast starts from the leaf nodes, so all the leaf nodes send their information to their parents; here you see the nodes d, g and h send their information to their parent nodes.

(Refer Slide Time: 20:42)

After receiving the information from all its children, a node aggregates it and sends the result to its parent. Here you see that node e aggregates its own information and the incoming information, and sends the combined result to its parent.

(Refer Slide Time: 21:10)

This continues until all the nodes have communicated their information to the root node; this is the convergecast. In another example, shown in the picture, the combining or aggregation function is the maximum of the values. Here P1 holds x1 and P3 holds x3, so the maximum of x1 and x3 is computed and only that value is sent to the parent; similarly, further up, the maximum of x3, x1 and x2 is aggregated and sent towards the root.

The same is done from the other side of the tree, and at the root everything is aggregated again to give the final result; this is convergecast. The aggregation function differs from application to application. You can also combine broadcast and convergecast in applications where you want to disseminate a request and then gather the responses: the broadcast sends the information out, and the convergecast collects the information and delivers it to the root node.
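As a sketch of the convergecast with maximum as the aggregation function, the Python function below walks the rooted spanning tree and combines each node's value with whatever its children report; the tree and value dictionaries are illustrative assumptions.

```python
def convergecast_max(children, values, node):
    """Each process forwards max(its own value, values reported by its children)
    to its parent; calling this at the root yields the global maximum."""
    best = values[node]
    for child in children.get(node, []):
        best = max(best, convergecast_max(children, values, child))
    return best

# Example: the root ends up with the maximum of all values (9 here).
# convergecast_max({"Pr": ["P1", "P2"], "P1": ["P3"]},
#                  {"Pr": 4, "P1": 7, "P2": 2, "P3": 9}, "Pr")
```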

Now, the two previous algorithms assume an infrastructure, namely a spanning tree, already in place. Next we are going to see how to construct a spanning tree over the network using a distributed algorithm.

(Refer Slide Time: 22:34)

So, the first algorithm which we are going to discuss is: finding a spanning tree, when a
root node is given.

(Refer Slide Time: 22:41)

This algorithm is based on flooding. Flooding is a very simple algorithm in which a particular node initiates the process with a message m that it sends on all its outgoing channels; the nodes which receive this message send it further on their other channels, and so on, and after transmitting, those nodes terminate. This is the flooding algorithm, and using a modification of it we are going to construct a spanning tree with a given root.

Let us see the algorithm. The root sends message m to its neighbors. When a non-root node first gets m, it sets the sender as its parent, sends a parent message back to that sender, sends m to all its neighbors other than the one it received from, and then terminates. Otherwise, that is, when m is received by a node that has already set its parent, the node sends a reject (already) message to the sender. So there are two kinds of messages, parent and reject, that can be sent back after receiving m, and after handling them the nodes terminate.

(Refer Slide Time: 24:23)

This is the algorithm for finding a spanning tree with a given root in more detail. The algorithm uses the variables parent, children and other; this is the code for a processor Pi, and the same code runs on all processors P0 to Pn-1. Parent is initialized to null, children is initialized to the empty set, and other is also initialized to the empty set.

Upon receiving no message, if the processor is the root and it has not yet sent anything, it sets its parent to itself (because it is the root) and sends the message M to all its neighbors. Upon receiving the message M from a neighbor Pj, if its parent is still empty, then, as described above, it sets Pj as its parent, sends a parent message to Pj, and sends M to all its neighbors except Pj.

Else, meaning the parent is already set, it sends an already message to Pj. Upon receiving a parent message from a neighbor Pj, it adds Pj to children; if children union other contains all the neighbors except the parent, the node terminates. Upon receiving an already message from Pj, it adds Pj to other; again, if children union other contains all the neighbors except the parent, it terminates. When all the nodes have terminated, the algorithm completes and a spanning tree has been constructed.
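A compact way to see the algorithm's behaviour is to simulate it, as in the Python sketch below. A single FIFO queue stands in for all the channels, so this models one particular (roughly breadth-first) execution order; the adjacency-dictionary representation and function name are assumptions for illustration.

```python
from collections import deque

def flooding_spanning_tree(adjacency, root):
    """Simulate the flooding-based spanning tree construction with a given root.
    The first copy of M a process receives fixes its parent; later copies are
    answered with an 'already' message."""
    parent = {root: root}
    children = {p: set() for p in adjacency}
    in_transit = deque(("M", root, nbr) for nbr in adjacency[root])
    while in_transit:
        kind, sender, receiver = in_transit.popleft()
        if kind == "M":
            if receiver not in parent:                 # first time M arrives
                parent[receiver] = sender
                in_transit.append(("parent", receiver, sender))
                for nbr in adjacency[receiver]:
                    if nbr != sender:
                        in_transit.append(("M", receiver, nbr))
            else:                                      # parent already set
                in_transit.append(("already", receiver, sender))
        elif kind == "parent":
            children[receiver].add(sender)
        # 'already' messages only matter for local termination, ignored here
    return parent, children

# Example on a square with one diagonal:
# adjacency = {"a": ["b", "c"], "b": ["a", "c", "d"],
#              "c": ["a", "b", "d"], "d": ["b", "c"]}
# flooding_spanning_tree(adjacency, "a")
```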

Let us see the example of the algorithm for a spanning tree construction for a given root.

(Refer Slide Time: 26:10)

Here we can see that if the algorithm runs with a given root, say a, it constructs the spanning tree shown by the dark lines; the dotted lines are non-tree edges that are not part of the tree. This tree is constructed when the model is synchronous, and you can see that it is a BFS tree. If the model is synchronous, the algorithm always constructs a spanning tree which is also a BFS tree. On the other hand, if the timing model is asynchronous, the resulting tree is not necessarily a BFS tree, because the messages and the computation at the processors may incur different delays.

That is why the asynchronous model does not necessarily give a BFS tree with this algorithm. As far as the complexity is concerned, the total number of messages is O(m), where m is the number of channels or edges in the graph, and the time taken is of the order of the diameter.

(Refer Slide Time: 27:29)

We can now ask what kind of spanning tree we get in an asynchronous execution; by kind, we mean the two kinds of spanning tree we are talking about, DFS and BFS. As we have seen, in the synchronous mode this algorithm always gives a BFS tree, but if it is asynchronous it may sometimes give a BFS tree, sometimes a DFS tree, and sometimes neither.

So, in an asynchronous execution the resulting tree may be neither a BFS nor a DFS tree; it is simply a spanning tree, without the BFS or DFS property, when the model is asynchronous.

(Refer Slide Time: 28:23)

Now, if the root is given, how are we going to find a DFS spanning tree? This is the algorithm we are going to discuss next.

(Refer Slide Time: 28:29)

This algorithm definitely gives a DFS spanning tree, that is, the spanning tree will have the DFS property. The algorithm requires a root to be specified, that is, an initiating node must be known, and it uses the variables parent, initialized to nil, children, initialized to the empty set, and unexplored, initialized to all the neighbors of processor Pi.

So, three variables are required in this algorithm; let us see how they are used. Initially all the parent variables are null. The root starts by setting its parent variable to itself and then explores: if its unexplored set is non-empty, it selects the first processor Pk in unexplored, removes Pk from unexplored, and sends the message M to Pk. Upon receiving the message M from Pj, if the parent is null, meaning this is the first time the node receives the message, it sets its parent to the sender Pj, removes Pj from its own unexplored set, and then explores in the same way, sending M to one of its unexplored neighbors, and so on.

If instead the parent is not null, the node sends an already message back to Pj, because it has already set its parent variable. Upon receiving an already message from Pj, a node resumes exploring, that is, it moves on to the next unexplored neighbor, and so on. Termination occurs when a node's unexplored set is exhausted: if its parent is not itself, it sends a parent message to its parent and terminates; when the root receives the final reply and its own unexplored set is exhausted, the entire algorithm terminates. Thus Algorithm 3 constructs a DFS spanning tree; it is almost the same as the sequential DFS algorithm, but it runs in the distributed setting.
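The recursive Python sketch below mirrors this distributed DFS in a sequential simulation: the single message M acts as a token that moves to one unexplored neighbour at a time, and an already-visited neighbour plays the role of a node replying 'already'. The representation (adjacency dictionary, function names) is assumed for illustration.

```python
def dfs_spanning_tree(adjacency, root):
    """Sequential simulation of the DFS spanning tree algorithm with a given root."""
    parent = {root: root}
    children = {p: set() for p in adjacency}
    unexplored = {p: list(adjacency[p]) for p in adjacency}

    def explore(p):
        while unexplored[p]:
            q = unexplored[p].pop(0)          # pick the next unexplored neighbour
            if q in parent:
                continue                      # q would answer 'already'
            parent[q] = p                     # q's first M fixes its parent
            children[p].add(q)
            if p in unexplored[q]:
                unexplored[q].remove(p)       # q never sends M back to its parent
            explore(q)                        # the token (message M) moves to q
        # unexplored exhausted: p reports 'parent' to its parent and terminates

    explore(root)
    return parent, children
```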

(Refer Slide Time: 31:15)

The previous algorithm ensures that the spanning tree always has the DFS property, and as I told you it is analogous to the sequential DFS algorithm. About the message complexity: it is of the order of m, since a constant number of messages is sent over each edge. The time complexity is also of the order of m, since the edges are explored one after another in series. So this algorithm has O(m) message complexity and O(m) time complexity. The next challenge is how to construct a DFS spanning tree when the root is not given.

(Refer Slide Time: 32:01)

The next algorithm finds a spanning tree without a given root node. We assume that the processors have unique identifiers; the slide notes that otherwise this is impossible. Impossible refers to an impossibility result, which says that if the processors do not have unique identifiers then this approach will not work. So, this algorithm assumes that the processors have unique identifiers. The idea is that each processor that wakes up runs a copy of the DFS algorithm we have seen earlier, with itself as the root, and tags each message with the initiator's id to differentiate the copies.

When copies collide, the copy with the larger id wins. Let us first see the algorithm and then discuss its complexity.

(Refer Slide Time: 33:02)

So, this particular algorithm is about the DFS when the root is not given. So, basically
some of the nodes which will wake up they will start constructing DFS as per the
algorithm number 3 so; that means, several DFS tree construction will be initiated if
more than one node they wake up simultaneously and after that they will send the
message they will explore so; that means, they will set its leader and its parent Pi and
then it will explore.

So, in the explore; it will send the leader to one of the neighbor node through the channel
and upon receiving this leader message from Pj the node has different possibilities. So,

52
for example, if the node is already part of the construction of another DFS tree, then the two constructions collide, and this case deals with the collision. If the node which has received this leader message is already in the same DFS tree, that is, its leader equals the new id, then it sends an already message back to Pj. Otherwise, it does nothing and thereby stalls that DFS construction from growing further. In other words, when two constructions collide, if the incoming leader id is higher than the node's current leader, the incoming construction survives and grows further; otherwise its growth is stalled.

So, there are 3 possibilities, which are explained over here, and the rest of the algorithm steps are basically the same as before. Now, the complexity of this particular algorithm: the message complexity is of the order n times m, because O(m) is the complexity of the DFS with a given root and n different processors can initiate it, so it becomes O(nm) message complexity; and the time complexity is of the order m, where m is the total number of edges or channels in the graph. A sketch of the per-process handlers for this rootless variant is given below.
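The following Python sketch shows how the collision rule fits into the same handler structure as before. As above, send(dest, msg) and on_message are assumed names for a hypothetical runtime; message kinds follow the slide (leader, already, parent), each tagged with the initiator's id.

class RootlessDFSProcess:
    def __init__(self, my_id, neighbours, send):
        self.my_id = my_id
        self.leader = None               # id of the DFS construction we currently belong to
        self.parent = None
        self.children = set()
        self.neighbours = set(neighbours)
        self.unexplored = set(neighbours)
        self.send = send                 # send(dest, message): assumed runtime primitive

    def wake_up(self):                   # spontaneous wake-up: start an own construction
        if self.leader is None:
            self.leader, self.parent = self.my_id, self.my_id
            self.explore()

    def on_message(self, sender, msg):
        kind, new_id = msg
        if kind == "leader":
            if self.leader is None or new_id > self.leader:
                # collision with a larger id: join that construction and restart locally
                self.leader, self.parent = new_id, sender
                self.children = set()
                self.unexplored = self.neighbours - {sender}
                self.explore()
            elif new_id == self.leader:
                self.send(sender, ("already", self.leader))   # already in this tree
            # else new_id < leader: swallow the message, stalling the smaller construction
        elif kind == "already" and new_id == self.leader:
            self.explore()
        elif kind == "parent" and new_id == self.leader:
            self.children.add(sender)
            self.explore()

    def explore(self):
        if self.unexplored:
            self.send(self.unexplored.pop(), ("leader", self.leader))
        elif self.parent != self.my_id:
            self.send(self.parent, ("parent", self.leader))
        # else: this process's construction has covered the graph; it is the root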

(Refer Slide Time: 35:55)

So, in the conclusion, we can see that this particular lecture has introduced a formal model of a distributed message passing system, with synchronous and asynchronous timing models and no failures. We have not assumed any failures, and in this particular model we have seen some basic algorithms.

53
For these basic algorithms in this particular model, we have also seen how to do the analysis, that is, the time complexity analysis and the message complexity analysis. The algorithms which we have seen here solve the problems of broadcast, convergecast and construction of a spanning tree, that is, a DFS tree. These are used as the basic building blocks of distributed algorithms.

Now, in the upcoming lectures, we will give a more detailed discussion of more complex algorithms, namely the leader election algorithm and the minimum cost spanning tree construction algorithm.

Thank you.

54
Distributed Systems
Prof. Rajiv Misra
Department of Computer Science and Engineering
Indian Institute of Technology, Patna

Lecture - 03
Leader Election in Rings

Lecture 3: Leader Election in Rings. Preface: recap of previous lecture.

(Refer Slide Time: 00:24)

In the previous lecture, we have discussed the formal model of a distributed message passing system, that is, synchronous and asynchronous timing models with no failures. We have also seen a few simple algorithms for message passing systems; these algorithms were meant to illustrate the concepts and the complexity measures used in distributed algorithm design. The algorithms which we have seen in the previous lecture solve the problems of broadcast, convergecast and DFS, and are used as the basic building blocks for distributed algorithms.

55
(Refer Slide Time: 01:06)

Content of this lecture, in this lecture we will discuss the leader election problem in a
message passing system for a ring topology in which the group of processors must
choose one among them to be a leader.

(Refer Slide Time: 01:37)

We will present the different algorithms for leader election problems by taking cases like
anonymous or a non-anonymous rings uniform or non uniform rings and synchronous
and asynchronous rings.

56
So, let us begin with the introduction of the leader election problem. In this particular lecture, we are considering a message passing system whose topology is a ring. A ring is a convenient topology: it resembles a physical token ring and corresponds to the structure of such a network, and it is easy to understand and design algorithms in this particular setting, that is, the ring structure.

So, the leader election problem has several variants. The leader election problem is for each processor to decide that it is either a leader or not a leader, subject to the constraint that exactly one processor decides to be the leader. The leader election problem represents a general class of symmetry breaking problems; for example, consider a deadlock that is created because the processors are waiting in a cycle for each other.

This particular deadlock can be broken by electing one of the waiting processors as the leader and removing it from the cycle, thus breaking the deadlock. So, this particular example is an example where symmetry breaking is applied in the form of a leader election algorithm.

(Refer Slide Time: 03:05)

So, the leader election definition: each processor has a set of elected and non-elected states. Once an elected state is entered, the processor is always in an elected state, and similarly for the non-elected states. In every admissible execution, every processor eventually enters an elected or a not-elected state, and exactly

57
one processor, that is the leader, enters an elected state. Next, let us see the uses of a leader election algorithm.

(Refer Slide Time: 03:47)

We are going to stress upon this use to motivate basically the construction of the leader
election algorithm.

So, leader election can be used to coordinate activities in a distributed system; for example, finding a spanning tree using the leader as the root. We have seen that it becomes easy to construct a spanning tree, that is a DFS tree, if the root is given, and to identify that root in a network a leader election algorithm can be used.

Similarly, in a token ring network, if the token is lost, then to recreate the lost token a leader election algorithm can be of great help. The leader election algorithm will identify one of the nodes, and that particular node will recreate the token and resume the operation of the token ring. So, in this lecture, we will study leader election in a message passing system with a ring structure.

58
(Refer Slide Time: 04:50)

So, we are now going to describe the ring topology. In an oriented ring, processors have a consistent notion of left and right. For example, in the diagram illustrated here you can see the processors p0, p1, p2, p3 and p4, and p0 and p1 share a channel. At p0 this channel is numbered 1, the left side, and at p1 the same channel is numbered 2, the right side. So, if we keep navigating over the channels numbered 1, the orientation of the ring is clockwise.

Similarly, if we navigate over the channels numbered 2, that is, p0 communicates with p4, p4 with p3, and so on through channel number 2, then we go counter-clockwise. This is called an oriented ring. So, the orientation of a ring can be formed using this numbering of the channels, and that numbering is done by each processor at its own local level, with only local knowledge.

59
(Refer Slide Time: 06:31)

So, for example, if messages are always forwarded on channel 1, they will cycle clockwise around the ring. Why do we study the ring? Because a message passing system with a ring structure or ring topology gives a good starting point, and it is easy to analyze and design algorithms in this particular setting, as we are going to see with the leader election algorithm.

This ring is also an abstraction of a token ring. Moreover, the lower bounds and impossibility results for the ring topology also apply to arbitrary topologies. So, the results we obtain in this particular setting are relevant to arbitrary topologies as well.

60
(Refer Slide Time: 07:13)

So, we now start discussing the different kinds of rings. The first type is called anonymous rings; in anonymous rings, the processors do not have unique identifiers; that means, all the processors are anonymous, and a ring formed out of such processors without unique ids is called an anonymous ring.

In this particular setting, each processor has the same state machine. A further distinction is whether the algorithm relies on knowing the ring size, that is, on how many processors there are.

(Refer Slide Time: 08:03)

61
There is another distinction among ring algorithms, namely whether they are uniform.

A uniform ring has n nodes, but this number of nodes is not known to the processors; an algorithm which works without this information, that is, without knowing the ring size, is called a uniform algorithm. A uniform algorithm does not use the ring size: formally, every processor in a ring of every size is modelled with the same state machine. A non-uniform algorithm uses the size of the ring in the algorithm design: formally, for each value of n, every processor in a ring of size n is modelled with the same state machine A_n; that means, for different ring sizes n, different state machines A_n may be used.

(Refer Slide Time: 09:17)

Now, we are going to see what happens if the ring is anonymous, that is, if unique ids are not given to the processors: how would a leader election algorithm work in this setting? There is a theorem which says that there is no leader election algorithm for anonymous rings, even if the algorithm knows the ring size (that is, even if it is non-uniform) and even if the timing model is synchronous.

So, this result is called an impossibility result; that means, no leader election is possible if the ring is anonymous, that is, if the ids are not given. We are going to see the proof, and

62
then this impossibility result will be used as information to develop the leader election algorithm.

So, the proof sketch goes like this: every processor begins in the same state with the same outgoing message, since they are anonymous. Therefore every processor receives the same message, does the same state transition and sends the same message in round one. Now you may ask: if a processor is sending a message, how does it know to whom it is sending, that is, who the destination is?

We have seen that the channel numbers can be used for this: if it is an oriented ring, that is, clockwise, then the processor always sends on channel 1, and so on. So, the channel numbers are used when ids are not known, and this structure still yields identical state machines. Hence the same message is transmitted and the same state transition happens after the receipt of a message.

So, eventually some processor is supposed to enter an elected state, but then all of them would enter an elected state.

(Refer Slide Time: 11:24)

So, in this particular setting, if some processor is elected then every node is elected as a leader, which violates the safety property that never more than one leader is elected; otherwise the liveness property, that eventually at least one leader is elected, is violated. So, with this proof we have seen that there is no leader election algorithm for anonymous rings.

63
(Refer Slide Time: 11:41)

Now, this theorem was proved for non-uniform and synchronous rings.

(Refer Slide Time: 11:44)

The same result also holds for weaker models: weaker in the sense that the algorithm is uniform, meaning the value of n is not known, or the timing model is the weaker asynchronous one. In these cases too the theorem holds, and leader election is not possible for anonymous rings.

64
(Refer Slide Time: 12:13)

Now, we are going to discuss rings with ids: since leader election is not possible in anonymous rings, we consider the ring structure with unique ids. The ids are assigned out of the natural numbers, and each processor now has a unique id.

(Refer Slide Time: 12:45)

So, now we are going to discuss how the ids are assigned. A non-anonymous ring, where each processor is assigned a unique id, is illustrated in this particular diagram, and I will explain it through this particular

65
example. You start with the smallest id, which here is 3, write the ring down in this manner, and list the ids in clockwise order. The clockwise order goes like this: the first id is 3, then the next id is 37, then 19, then 4 and finally 25. So, if this rule is followed, it gives an oriented ring, and this is the non-anonymous oriented ring structure which we have seen as an example; this particular structure will be used to design the leader election algorithm.

(Refer Slide Time: 13:54)

So, for a non-anonymous algorithm: if it is uniform, there is one state machine for every id, no matter what the size of the ring; that means, the algorithm in this setting does not know the size of the ring, that is, n is not known. If it is non-uniform, it knows the value of n: there is one state machine for every id and every different ring size, and this gives a non-uniform non-anonymous algorithm.

66
(Refer Slide Time: 14:42)

So, these definitions are tailored for leader election in a ring. Now we are going to discuss the first leader election algorithm, with message complexity O(n²); this algorithm is called the LeLann-Chang-Roberts algorithm, that is, the LCR algorithm. It is the simplest algorithm and gives a good starting point to understand leader election algorithm design in distributed systems. Here, every processor sends the value of its id to the left in the form of a message. We can see this in the example: every processor has a unique id, because it is a non-anonymous ring. Each one sends the value of its id; for instance, the processor with id 0 sends a message with id 0 to the left.

Similarly, every other processor sends its value. Now, when a processor receives from the right a message carrying an id j, there are three cases: if j, the id in the incoming message, is greater than the id of the receiving processor, then it forwards the message to the left; if j equals its own id, the processor is elected, which does not happen yet in this step of the example; and if j is less than its own id, it does nothing, that means the message is swallowed. So, you can see in this structure that the messages carrying smaller ids, such as the one with id 0, get swallowed, while the message carrying the largest id, here 2, keeps being forwarded around the ring and eventually comes back to the processor where it originated.
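As a compact illustration of these rules, here is a small round-based Python simulation of the LCR idea. It is a sketch under simplifying assumptions (synchronous rounds, unidirectional forwarding to one fixed neighbour, no termination message), not the asynchronous algorithm itself; the function name is illustrative only.

def lcr_leader(ids):
    """Return (leader_id, message_count) for a ring of unique ids."""
    n = len(ids)
    # pending[i]: ids currently held at processor i, about to be forwarded
    pending = [[i_d] for i_d in ids]          # every processor first sends its own id
    messages, leader = 0, None
    while leader is None:
        new_pending = [[] for _ in range(n)]
        for i in range(n):
            for j in pending[i]:
                messages += 1
                recv = (i + 1) % n            # the message travels one hop
                if j == ids[recv]:
                    leader = j                # own id came back: elected
                elif j > ids[recv]:
                    new_pending[recv].append(j)   # forward the larger id
                # else: swallow the smaller id
        pending = new_pending
    return leader, messages

# Ids decreasing along the direction of travel give the Theta(n^2) case:
print(lcr_leader([4, 3, 2, 1, 0]))            # -> (4, 15), i.e. n(n+1)/2 messages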

67
(Refer Slide Time: 17:13)

So, the correctness of the algorithm goes like this: it always elects the processor with the largest id, because the message containing the largest id passes through every processor and comes back to its originator. As I told you, the message with the highest id goes along the entire ring and comes back to it, and that processor is elected as the leader; after electing itself, it sends a message called the termination message, and all other nodes become non-leaders. The time taken by this algorithm is O(n), and the message complexity, if we analyze it, depends on how the ids are arranged: the largest id travels all around the ring, that is, it results in n message propagations; the second largest id travels until reaching the largest one; the third largest id travels until reaching the largest or the second largest; and so on.

68
(Refer Slide Time: 18:31)

So, we can understand the O(n²) message complexity using this particular example. Just see that the highest id, here the id 4, is propagated all around the ring and comes back again; similarly, the second largest id, that is 3, propagates but it does not pass through p4, because p4 has the highest id and swallows this particular message.

(Refer Slide Time: 19:12)

So, these are the messages that flow. If we count the total number of messages, you can see that the highest id will

69
result in n message transmissions, the next largest id in at most (n−1), the next in (n−2), and so on. If we sum these up, it becomes of the order n². So, this particular arrangement incurs on the order of n² messages, and that is why the message complexity of this algorithm is O(n²). That means we have seen an example arrangement where about n² messages are indeed required; other arrangements of the ids may require fewer than n² messages, but in no situation will the number of messages exceed O(n²), and that is why this particular algorithm guarantees at most O(n²) messages.

(Refer Slide Time: 20:22)

So, again, this analysis can be seen from this particular diagram: in any admissible execution, the algorithm sends no more than O(n²) messages, as written over here. Moreover, there is an admissible execution in which the algorithm sends Θ(n²) messages, as we have seen in the previous example; here also, in this example, you can see: consider a ring

where the identifiers of the processors are 0, 1, 2 and so on, numbered and ordered as in this particular figure. In this configuration, the message of the processor with identifier i is sent exactly (i+1) times, and the total number of messages, including the n termination messages, is n plus the messages sent by the processors to elect the leader. In this particular ring you can see this

70
particular structure: the processor with id (n−1) causes its message to be propagated n times. In general, the message of a processor with identifier i is propagated exactly (i+1) times. So, if we sum up this particular formula, it again gives Θ(n²); that means, in this particular arrangement the algorithm incurs Θ(n²) messages.
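As a quick check of this count, summing the (i+1) forwarding steps over all processors and adding the n termination messages gives:

\[ \sum_{i=0}^{n-1}(i+1) \;+\; n \;=\; \frac{n(n+1)}{2} + n \;=\; \Theta(n^{2}). \]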

(Refer Slide Time: 22:12)

So, this particular algorithm never incurs more than O(n²) messages in any admissible execution. Now the question is: can we develop an algorithm which uses fewer than this number of messages, that is, fewer than O(n²) messages? That is the idea, and we are going to see another algorithm. The next algorithm is based on asking why so many messages are used and whether we can reduce the number of messages; the idea is to have messages containing smaller ids travel smaller distances in the ring.

That means, since the smaller ids are not going to be elected as the leader, why should they travel long distances? They can be contained in a smaller region, and only the larger ids are allowed to travel across the whole ring. In this way we can save on the number of messages.

71
(Refer Slide Time: 23:13)

With this idea we are going to see another algorithm, the O(n log n)-message leader election algorithm given by Hirschberg and Sinclair, the HS algorithm; it is a well known algorithm, and we are now going to describe it in more detail. To describe this algorithm, we first define the k-neighbourhood of a processor pi in the ring to be the set of processors that are at a distance of at most k from pi in the ring, either to the left or to the right.

Note that the k-neighbourhood of a processor includes exactly 2k+1 processors. To


understand this through a diagram, you can see a small ring: the 1-neighbourhood of a processor pi is nothing but pi together with its left and right neighbours, that is, 2 other processors. Similarly, if instead of 1 it is the k-neighbourhood, then k processors lie on the left and k processors on the right. This is why the k-neighbourhood of a processor includes 2k+1 processors; here the 1-neighbourhood includes 3 processors. The O(n log n)-message leader election algorithm works with the k-neighbourhoods of the processors.

So, the algorithm operates in phases, and it is convenient to start numbering the phases with 0. In the k-th phase, a processor tries to

72
become a winner of that phase; to be the winner it must have the largest id in its 2^k-neighbourhood. Only the processors that are winners of the k-th phase continue to compete in the (k+1)-th phase; thus fewer and fewer processors proceed to higher phases, until at the end only one processor is the winner and is elected as the leader of the whole ring.

(Refer Slide Time: 25:48)

So, let us see phase 0 in more detail, and then we will see phase k. In phase 0, each processor attempts to become a phase 0 winner; take the same example.

So, in phase 0, every processor initiates and tries to become a phase 0 winner; let us say the ids are 0, 1 and 2. Each processor attempts to become a phase 0 winner and sends a probe message containing its id to its 1-neighbourhood, that is, to each of its 2 neighbours, left and right. If the identifier of the neighbour receiving the probe is greater than the id in the probe, it swallows the probe; otherwise it sends back a reply message. Here, the neighbours with ids 1 and 2 receive the probe message with id 0; having higher ids, they swallow it and do not give back a reply. But when node 2 sends its probe messages,

both neighbours send back a reply. If a processor receives a reply from both its neighbours, then it becomes a phase 0 winner. So,

73
here processor 2 becomes the phase 0 winner, because it has received the replies from both of its neighbours.

(Refer Slide Time: 27:47)

And it will continue to phase 1. In general, if a processor is the winner of phase k−1, it is now eligible to participate in phase k. So, in phase k, a processor pi that is a phase k−1 winner sends probe messages carrying its id to its 2^k-neighbourhood, one in each direction. Each such message traverses 2^k processors, one by one. A probe is swallowed by a processor if it contains an id that is smaller than its own id, as we have also seen in phase 0.

If a probe arrives at the last processor of the neighbourhood without being swallowed, then that last processor sends back a reply message to pi. If pi receives a reply from both directions, it becomes a phase k winner and continues to phase k+1. The processor that eventually receives its own probe message terminates the algorithm as the leader and sends a termination message around the ring. So, looking at the ring, every node has a 2^k-neighbourhood for phase k: 2^k neighbours on the left and 2^k neighbours on the right. If its probe message goes through up to the last node of each of these 2^k-neighbourhoods on the left and on the right,

then those last nodes send back the replies, the processor becomes the winner of phase k of the algorithm, and it is eligible to proceed to the (k+1)-th phase.

74
(Refer Slide Time: 29:53)

So, I have explained phase k of this particular algorithm; the entire algorithm is shown over here. You can see that a probe message carries 3 different pieces of information: the first is the id of the node which has initiated the probe in a particular phase, and the phase in which it is initiating.

So, that phase number is given, and then the hop count, because the probe has to travel through the 2^k-neighbourhood, that is, up to 2^k hops: how many hops it has taken, which phase the message belongs to, and which id it carries. In the initial send you can see that the probe has 3 arguments: the id of that particular processor,

which has initiated the leader election in phase 0, and it sends the probe to its one-hop neighbourhood, to the left and to the right, as we have seen. Now, upon receiving a probe with (j, k, d): j is the id carried by the probe message, k is the phase, and d is the number of hops it has traversed so far. Upon receiving this probe from the left (and similarly from the right), the processor handles 3 different cases. If j = id, then the message has traversed all the nodes or processors of the ring and come back to the same point;

then that node terminates and is elected as the leader. Now, if j, the id in the received message, is greater than the id of the node which is receiving the message, and also the number of hops d < 2^k, then it forwards the probe to the

75
left or to the right, depending upon from where the message was received. Now, if the number of hops has reached 2^k, on the left side or on the right side, it sends back a reply message, because the probe has reached the last node of the 2^k-neighbourhood. So, the node decides based on whether the probe has reached the last node of the 2^k-neighbourhood:

if it has, then it sends a reply; otherwise it keeps forwarding the probe. So, all 3 cases are given. Now, upon receiving a reply from the left or from the right: if j is not equal to id, then it forwards the reply onward, that means it keeps forwarding replies that do not belong to it; and if j equals id, then the reply has reached its originator. If the originator has already received the corresponding reply from the other direction as well, then it is the winner of phase k and it starts the probe for the (k+1)-th phase. So, the structure is the same: the node which is the winner of the k-th phase initiates the (k+1)-th phase, and the hop count is initialized to 1 and keeps incrementing until it reaches 2^k, and there

these two conditions are checked: whether the probe is to be forwarded further or whether a reply is to be returned to the originator. So, this is the O(n log n) leader election algorithm in a ring; a compact sketch of the handlers is given below.
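The following Python sketch mirrors the probe/reply handling just described, for a single processor on a bidirectional ring. send(direction, msg) and on_message are assumed names for a hypothetical runtime, probes are tuples of the form ("probe", id, phase, hops), and the termination message around the ring is omitted; this is a sketch of the idea, not the slide's exact pseudo-code.

class HSProcess:
    def __init__(self, my_id, send):
        self.my_id = my_id
        self.send = send                      # send(direction, message), direction in {"left","right"}
        self.replies = set()                  # directions that replied in the current phase
        self.elected = False

    @staticmethod
    def _other(direction):
        return "left" if direction == "right" else "right"

    def start(self):                          # phase 0: probe one hop in each direction
        for d in ("left", "right"):
            self.send(d, ("probe", self.my_id, 0, 1))

    def on_message(self, from_dir, msg):
        if msg[0] == "probe":
            _, j, k, d = msg
            if j == self.my_id:
                self.elected = True           # own probe came all the way around: leader
            elif j > self.my_id and d < 2 ** k:
                self.send(self._other(from_dir), ("probe", j, k, d + 1))   # forward, one more hop
            elif j > self.my_id and d >= 2 ** k:
                self.send(from_dir, ("reply", j, k))   # last node of the 2^k-neighbourhood
            # else j < my_id: swallow the probe
        elif msg[0] == "reply":
            _, j, k = msg
            if j != self.my_id:
                self.send(self._other(from_dir), ("reply", j, k))   # pass the reply on
            else:
                self.replies.add(from_dir)
                if self.replies == {"left", "right"}:   # winner of phase k
                    self.replies = set()
                    for d in ("left", "right"):
                        self.send(d, ("probe", self.my_id, k + 1, 1))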

(Refer Slide Time: 34:32)

76
So, in the pseudo-code which appears as algorithm 5, which we have just seen, phase k for a processor corresponds to the period between its sending of a probe message in line 4 or line 15 with the third parameter k, and its sending of a probe message in line 4 or 15 with the third parameter (k+1). The details of sending the termination message around the ring have been left out of the code, and only the leader terminates. The correctness of the algorithm follows in the same manner as in the simple algorithm, that is, the O(n²) algorithm, because they have the same swallowing rules.

It is clear that the probes of the processor with the maximal id are never swallowed; therefore, this processor will terminate the algorithm as the leader. On the other hand, it is also clear that no other probe can traverse the whole ring without being swallowed; therefore, the processor with the maximal identifier is the only leader elected by this particular algorithm.

(Refer Slide Time: 35:48)

So, some more description of this algorithm: each processor tries to probe successively larger neighbourhoods in both directions, that is, the size of the neighbourhood doubles in each phase. If the probe reaches a node with a larger id, the probe stops; if the probe reaches the end of the neighbourhood, then a reply is sent back to the initiator, as I have explained in the algorithm.

77
(Refer Slide Time: 36:29)

If the initiator gets back the replies from both directions, then it goes on to the next phase, as we also explained in the algorithm; and if a processor receives a probe with its own id, then it elects itself the leader. The same thing is explained in this particular picture. You can easily identify phase 0: phase 0 means the one-hop neighbourhood, because 2^0 = 1. Then comes phase 1: here 2^1 = 2, so this neighbourhood is the 2-hop neighbourhood in both directions. Similarly, this is phase 2, and in phase 2, 2^2 = 4.

So, here you can see 4 neighbours on the left and 4 neighbours on the right; the probe goes 4 hops and the replies have to come back, and only then is pi the winner. If pi is to be the leader, then it has to be elected as a winner in phase 0, then in phase 1, then in phase 2. So, if the ring is of this size, then pi will be the leader elected by this particular algorithm; this is explained over here.

78
(Refer Slide Time: 38:05)

Now, the correctness is similar to the O(n²) algorithm, as I have explained. Next, the message complexity of this algorithm: each message belongs to a particular phase and is initiated by a particular processor. So, now we are going to count how many messages are used to elect a leader in this case, and how it becomes O(n log n).

Now, the probe distance in a particular phase k is 2^k, as you already know. So, the number of messages initiated by a processor in phase k is at most 4·2^k, because there are 2·2^k probe messages and 2·2^k reply messages; counted together they give 4·2^k messages in phase k initiated by a processor pi. So, next we have to count how many such processors there are in phase k.

79
(Refer Slide Time: 39:15)

So, how many processors initiate in a particular phase k? That is what we are going to compute now. For k = 0, that is, for phase 0, every processor initiates, so it is n: all n processors are there. Now, if the phase is greater than 0, then only a processor that is the winner of phase (k−1) initiates.

Here, being the winner means having the largest id in its 2^(k−1)-neighbourhood.

(Refer Slide Time: 39:59)

So, that is what we are going to compute. The maximum number of phase (k−1) winners occurs when these phase

80
(k−1) winners are packed as densely as possible, in the following manner. What does that mean? Between two consecutive phase (k−1) winners, say pi and the next winner pj to its right, there must be at least 2^(k−1) other processors, because each winner has the largest id in its 2^(k−1)-neighbourhood. If this kind of packing is done all around the ring structure, then we can count how many phase (k−1) winners there can be: the total number of phase (k−1) winners is at most n/(2^(k−1) + 1), where the +1 accounts for the winner itself in each block of 2^(k−1) + 1 processors. So, the total number of winners in phase (k−1) is given by this formula, which is explained over here.

(Refer Slide Time: 41:27)

Now, the next thing is: how many phases are there? That is what we are going to find out now. At each phase, the number of phase winners is cut approximately in half, from n/(2^(k−1) + 1) to n/(2^k + 1).

So, after approximately log n phases only one winner is left; more precisely, the maximum number of phases is about log(n−1) + 1. To understand how many phases there are, let us go back to the count: the total number of phase (k−1) winners is at most n divided by 2^(k−1) + 1.

81
(Refer Slide Time: 42:53)

So, the total number of phase (k−1) winners is at most n/(2^(k−1) + 1). In the end there will be only one winner, so setting this expression equal to 1 gives n = 2^(k−1) + 1; moving the 1 to the other side gives (n−1) = 2^(k−1), and if we take log on both sides,

that gives k = log(n−1) + 1. So, the total number of phases is at most about log(n−1) + 1. We have explained how this expression came about, and it represents the total number of phases in the algorithm.

82
(Refer Slide Time: 44:16)

Now, we are going to count how many messages are required in this particular algorithm, that is, how many messages in total flow to decide a leader. We can see that at most 4n messages are required in phase 0; then, when the leader is decided, it sends a termination message around the ring, which is another n messages. For all the other phases it is a summation: in each phase k there are at most 4·2^k messages per initiator, counting the probe and reply messages in both directions.

So, this 4·2^k, which we have seen in the previous formula, is multiplied by the number of phase (k−1) winners, n/(2^(k−1) + 1), and the summation runs over the phases k = 1 up to about log(n−1) + 1, as we have seen. If we add up the total number of messages, it comes out to 5n plus at most 8n per phase, that is, roughly 8n(log n + 2) + 5n, which is O(n log n).
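Written out, the count sketched above is:

\[ \underbrace{4n}_{\text{phase }0} \;+\; \underbrace{n}_{\text{termination}} \;+\; \sum_{k=1}^{\lceil\log(n-1)\rceil+1} \underbrace{4\cdot 2^{k}\cdot\frac{n}{2^{k-1}+1}}_{<\,8n} \;\le\; 5n \;+\; 8n\bigl(\lceil\log(n-1)\rceil+1\bigr) \;=\; O(n\log n). \]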

83
(Refer Slide Time: 45:52)

Now, the O(n log n) algorithm is, as we have seen, more complicated than the O(n²) algorithm, but it uses fewer messages in the worst case.

It works both in the synchronous and in the asynchronous case. Can we reduce the number of messages even further, below n log n? We can see that it is not possible, because a lower bound of n log n is proved in this model, the asynchronous model. So, in the asynchronous model this algorithm is optimal, because n log n is the lower bound proved for that model.

(Refer Slide Time: 46:36)

84
So, the theorem says that any leader election algorithm for asynchronous rings whose size is not known a priori has an Ω(n log n) lower bound on message complexity, and this holds also for unidirectional rings. Both LCR, the O(n²) algorithm, and HS, the O(n log n) algorithm, are comparison-based algorithms, in that they use the identifiers only for comparisons.

In synchronous networks, O(n) message complexity can be achieved if general arithmetic operations are permitted and if the time complexity is unbounded.

(Refer Slide Time: 47:20)

So, to summarize leader election in a message passing system on a ring, for non-anonymous rings with distinct ids assigned, we can see the complete scenario as an overview: algorithms exist when the nodes have unique ids, as we have seen after the impossibility result, and we have evaluated them according to their message complexities. If it is an asynchronous ring, it takes Θ(n log n) messages; if it is a synchronous ring,

then it takes Θ(n) messages under certain conditions, otherwise it takes Θ(n log n) messages. All these bounds are asymptotically tight.

85
(Refer Slide Time: 48:13)

So, the conclusion: this particular lecture provided an in-depth study of the leader election problem in a message passing system for a ring topology. We have presented different algorithms for the leader election problem by taking different cases, like anonymous and non-anonymous rings, uniform and non-uniform rings, synchronous and asynchronous rings. In the upcoming lectures we will discuss causality and the concept of time in a distributed system, because we have already seen that a distributed system does not have a common global clock, and yet the events have to be ordered; we are going to see more details about this in the next lecture.

Thank you.

86
Distributed Systems
Prof. Rajiv Misra
Department of Computer Science and Engineering
Indian Institute of Technology, Patna

Lecture - 04
Models of Distributed Computation, Causality and Logical Time

Lecture 4: Models of Distributed Computation, Causality and Logical Time. Preface:


recap of previous lecture.

(Refer Slide Time: 00:26)

In the previous lecture, we have discussed the leader election problem in a message passing system for a ring topology. We have also seen different algorithms for the leader election problem by taking different cases of the topology, like anonymous versus non-anonymous rings, uniform versus non-uniform rings, and synchronous and asynchronous rings.

87
(Refer Slide Time: 00:52)

Content of this lecture: in this lecture we will discuss the models of distributed computation, causality and a general framework of logical clocks in a distributed system. Also, in the absence of a global physical time in distributed systems, we present three systems of logical time, namely scalar, vector and matrix time, to capture the causality between the events of a distributed system.

Before we start, I should mention that the concept of causality is fundamental to the design of distributed systems. Usually causality is tracked using physical time; since, as you know, distributed systems do not have a global physical time, it is only possible to realize an approximation of it. A logical clock is able to capture the fundamental monotonicity property associated with causality in a distributed system. That is what we are going to cover in this part of the lecture: models of distributed computation.

88
(Refer Slide Time: 02:10)

Introduction; distributed system consists of a set of processors that are connected by a


communication network. The communication network provides the facility of
information exchange among the processors. The processors do not share a common
global memory and communicates only by passing the messages over the communication
network.

There is no physical global clock in the system to which the processes have instantaneous access. The communication medium may deliver messages out of order; messages may be lost, garbled or duplicated due to timeout and retransmission; processors may fail; and communication links may go down.

89
(Refer Slide Time: 02:57)

So, these are the characteristics of the distributed systems for which we are going to discuss how to write the applications and programs.

Now, the distributed program definition: a distributed program is composed of a set of n asynchronous processes p1 to pn; the process executions and the message transfers are asynchronous. Without loss of generality, we assume that each process is running on a different processor, so we may say either process or processor; both signify the same thing in this part of the discussion.

Now, the channels: let Cij denote the channel from a process pi to a process pj, and let mij denote a message sent by process pi to process pj.

90
(Refer Slide Time: 03:53)

We assume that the message transmission delay is finite but unpredictable. Next, models of distributed execution: the execution of a process consists of a sequential execution of its actions. The actions are atomic, and the actions of a process are modeled as 3 different types of events, namely internal events, message send events and message receive events. Let e_i^x denote the x-th event at a process pi; for a message m, let send(m) and rec(m) denote its send and receive events respectively.

The occurrence of an event changes the state of the respective process and channel, thus causing a transition in the global system state. An internal event changes the state of the process at which it occurs; a send event or a receive event changes the state of the process that sends or receives the message and the state of the channel on which the message is sent.

91
(Refer Slide Time: 05:00)

The events at a process are linearly ordered by their order of occurrence. The execution of a process pi produces a sequence of events e_i^1, e_i^2, ..., e_i^x, ..., and this execution is denoted by H_i. H_i is nothing but the pair (h_i, →_i), where h_i is the set of events produced by pi and →_i is the binary relation on this set of events that denotes their linear order of occurrence. The relation →_i, which linearly orders them, expresses the causal dependencies among the events of pi.

(Refer Slide Time: 05:58)

92
The send and receive events signify the flow of information between processes and establish a causal dependency from the sender process to the receiver process.

The relation →_msg, which captures the causal dependency due to message exchange, is defined as follows: for each message m that is exchanged between two processes, we have send(m) →_msg rec(m). So, the relation →_msg between a send event and the corresponding receive event denotes the causal dependency between that pair of send and receive events.

(Refer Slide Time: 06:45)

Now, the evolution of distributed execution is depicted by a space time diagram, the
horizontal line in this space time diagram represents the progress of the process and the
dot represents the event and a slant arrow indicates the message transfer between two
processes.

Since, we have assumed that an event execution is atomic, that is individual and
instantaneous it is justified to denote it as dot on the process line.

93
(Refer Slide Time: 07:26)

In the figure, for process p1 the second event is a message send event, the third event is an internal event and the fourth event is a message receive event. You can see in this illustrative diagram that for process p1 the event e2 is the message send event, because a slanted arrow begins at it. Event three of process p1 is the internal event, and event number 4 of process p1 is the message receive event, because the slanted message arrow is heading towards that particular dot; the dots are the events and they are atomic.

(Refer Slide Time: 08:07)

94
Now, one preliminary which we are going to use in the further discussion: the partial order relation. Let me briefly define it. The definition of a partial order relation goes like this: a binary relation R on a set A is a partial order if and only if it is reflexive, antisymmetric and transitive. The ordered pair (A, R) is called a poset, or partially ordered set, where R is a partial order. An example: the relation less than or equal to on the set of integers I forms a partial order, and the pair (I, ≤) is a poset.

(Refer Slide Time: 08:52)

Another preliminary is the definition of a total order relation: a binary relation R on a set A is a total order if and only if it is a partial order and, for any pair of elements a and b of A, either the pair <a, b> is in R or the pair <b, a> is in R; that is, every element is related to every other element one way or the other. If both conditions are satisfied, then it is called a total order.

A total order is also called a linear order. An example of a total order is the less than or equal to relation on the set of integers.

95
(Refer Slide Time: 09:35)

Now, with these two definitions, we are going to define the causal precedence relation. The execution of a distributed application results in a set of distributed events produced by the processes. Let H be the union of all the sets h_i; it denotes the set of events executed in the distributed computation.

Now, let us define a binary relation, shown as an arrow →, on this set H, which expresses the causal dependencies between the events in a distributed execution. The causal precedence relation induces an irreflexive partial order on the events of a distributed computation, which is denoted by (H, →), that is, the set H paired with the binary relation. It is described as follows: for two events e_i^x and e_j^y, we have e_i^x → e_j^y if one of the following holds. Either i = j and x < y, that is, both events occur at the same process and e_i^x occurs earlier; this covers, in particular, the internal events. Or the events are connected, that is, related causally, by the sending of a message:

e_i^x precedes e_j^y via the relation established by the message exchange, with e_i^x a message send and e_j^y the corresponding message receive; this is the send-and-receive case. Or there is a transitive relation that establishes the causal precedence: e_i^x happened before some e_k^z and e_k^z happened before e_j^y, which indicates that e_i^x has happened before e_j^y. So, there are 3 ways in which the causal precedence relation holds, and it induces an irreflexive partial order on

96
the set of events happening in the distributed system, which is represented by H; the set H together with this relation denotes the irreflexive partial order of the distributed computation.
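Written compactly, the causal precedence (happened-before) relation just described is:

\[ e_i^{x} \rightarrow e_j^{y} \iff \bigl(i=j \wedge x<y\bigr)\;\vee\;\bigl(e_i^{x} \rightarrow_{\mathrm{msg}} e_j^{y}\bigr)\;\vee\;\bigl(\exists\, e_k^{z}:\; e_i^{x} \rightarrow e_k^{z} \,\wedge\, e_k^{z} \rightarrow e_j^{y}\bigr). \]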

(Refer Slide Time: 12:20)

And this establishes the causal precedence relation. Note that the relation →, this binary relation, is nothing but Lamport's happened-before relation. For any two events ei and ej, if they are connected by this relation, ei → ej, then event ej is directly or transitively dependent on ei. Graphically, it means that there exists a path, consisting of message arrows and process-line segments along increasing time in the space-time diagram, that starts at ei and ends at ej.

97
(Refer Slide Time: 12:55)

So, for example, in the figure, e_1^1 → e_3^3, and you can see that this relation is established using a path; this path passes through 5 different events to establish the causal precedence relation between e_1^1 and e_3^3.

The binary relation →, the causal precedence relation and the happened-before relation are all the same thing; this relation denotes the flow of information in a distributed computation, and ei → ej dictates that all the information available at ei is potentially accessible at ej. Here you can see, for example, that e_2^6 is causally preceded by, and can potentially access the information of, all the other events which happened before the occurrence of e_2^6 in this particular diagram.

98
(Refer Slide Time: 14:17)

Now, concurrent events: for any two events ei and ej, if ei has not happened before ej and ej also has not happened before ei, then the events ei and ej are said to be concurrent, and this is denoted by ei || ej.

(Refer Slide Time: 14:42)

So, in the execution, e_1^3 and e_3^3 are concurrent, because e_1^3 has not happened before e_3^3, nor has e_3^3 happened before e_1^3. There is no causal precedence relation between them, hence they are called concurrent events and are represented using the vertical bars. Now, the concurrency relation, which is shown by the vertical bars, is not

99
transitive; that is, ei || ej and ej || ek do not imply ei || ek.

Now, for any two events ei and ej in a distributed system, exactly one of these 3 relations holds: either ei happened before ej, or ej happened before ei, or ei and ej are concurrent.

(Refer Slide Time: 15:49)

Now, logical versus physical concurrency: in a distributed computation, two events are logically concurrent if they do not causally affect each other. Physical concurrency, on the other hand, has the connotation that the events occur at the same instant in physical time. Note that two or more events may be logically concurrent even though they do not occur at the same instant in physical time.

100
(Refer Slide Time: 16:21)

For example, in figure 4.1, the events in the set {e_1^3, e_2^4, e_3^3} are logically concurrent, but they occurred at different instants in physical time; however, note that if the processor speeds and the message delays had been different, the execution of these events could very well have coincided in physical time.

(Refer Slide Time: 16:47)

So, although they are logically concurrent, physical concurrency is a different way of expressing it. Whether or not these events incidentally get the same physical time makes no difference; they still represent logically concurrent
101
events. So, whether a set of logically concurrent events coincides in physical time, or in what order in physical time they occur, does not change the outcome of the computation.

(Refer Slide Time: 17:35)

Now, another part of the model of a distributed computation is the model of the communication network, which we discuss briefly. There are several models of service provided by the communication network, namely the FIFO model, the non-FIFO model and the causal ordering model. In the FIFO model, each channel acts as a first-in first-out message queue, and thus the message ordering is preserved by the channel itself. In the non-FIFO model,

102
(Refer Slide Time: 18:16)

the channel acts like a set, into which the sender adds messages and from which the receiver removes messages in a random order. The causal ordering model is based on Lamport's happened-before relation; a system that supports the causal ordering model satisfies the following condition. Causal order: for any two messages mij and mkj which are going to the same destination, if send(mij) happened before send(mkj), then rec(mij) happens before rec(mkj). This property ensures that causally related messages destined to the same destination are delivered in an order that is consistent with their causality relation.

So, causally ordered delivery of messages implies FIFO message delivery. The causal ordering model considerably simplifies the design of distributed algorithms, because it provides built-in synchronization, and it is used in various applications. Next: causality and logical time.

103
(Refer Slide Time: 19:29)

The concept of causality between events is fundamental to the design and analysis of parallel and distributed computing and operating systems. Usually causality is tracked using physical time, as we do in daily life; for example, a queue for purchasing a ticket, or a line of people standing and waiting for the arrival of a bus, and so on.

These are all tracked using physical time: whosoever has come first, where coming first is the causality of joining the queue, is determined and tracked using physical time. With this explanation, we are now going to see how causality is used in distributed systems. In a distributed system it is not possible to have a global physical time; it is possible to realize only an approximation of it. As asynchronous distributed computations make progress in spurts, logical time is sufficient to capture the fundamental monotonicity property associated with causality in a distributed system.

Meaning to say that, though we do not have a global physical time, we are going to use an approximation of it, because that is all that is required to capture the causality between the events in distributed systems.

104
(Refer Slide Time: 21:13)

So, this lecture discusses 3 ways to implement the logical time, which is an
approximation of a global physical time and which will capture the causality of events in
the distributed system without having the common physical clock. So, there are 3
different ways to implement the logical time we are going to discuss in this part of the
lecture they are the scalar time, vector time and matrix time.

Now, causality among the events in a distributed system is a powerful concept for reasoning, analyzing and drawing inferences about a computation. The knowledge of the causal precedence relation among the events of processes helps solve various problems in a distributed system. For example, when we study distributed algorithm design for mutual exclusion, you will see that this time concept is used for fairness. Similarly, in replicated databases this time concept is used to apply updates consistently; in deadlock detection it is used to detect deadlocks correctly and to avoid phantom deadlocks; and it is likewise used for tracking dependent events, for knowledge about the progress of a computation, and for concurrency measures.

105
(Refer Slide Time: 22:58)

So, you just see that causality is one of the important means, concept in the design of
distributed algorithm in the system that we are going to discuss in this course in this
lecture.

Now, the framework for a system of logical clocks: a system of logical clocks consists of a time domain T and a logical clock C. The elements of T form a partially ordered set over a relation <, which is called the happened-before or causal precedence relation. Intuitively, this relation is analogous to the earlier-than relation provided by physical time. So, we are mirroring the properties of physical time using an alternative concept, namely logical time or logical clocks. The logical clock C is a function that maps an event e in a distributed system to an element in the time domain T; this value is denoted C(e) and is also called the timestamp of the event e. So, C is a function which takes the events of the distributed computation, the set denoted by H, and maps them onto the time domain, and this function gives the timestamp C(e) of each event in the distributed system.

This mapping is such that the following property is satisfied: for any two events ei and ej, if ei happened before ej, that is, if they are related by the causal precedence relation ei → ej, then the timestamp of ei is

106
less than the timestamp of ej. This monotonicity property is called the clock consistency condition.

(Refer Slide Time: 25:20)

When T and C, that is, the time domain and the clock, satisfy the following condition: for any two events ei and ej, ei happened before ej if and only if the timestamp of ei is less than the timestamp of ej, that is, ei → ej implies C(ei) < C(ej) and also C(ei) < C(ej) implies ei → ej, then, since the relation holds both ways, the system of clocks is called a strongly consistent system of clocks.

So, we have seen two conditions: one is the clock consistency condition, and the other is the strong consistency condition of the clocks. In the first case, only the one-way implication holds: if the events are related by happened-before, then the timestamp of one is less than the timestamp of the other. If the system is strongly consistent, then it is a bi-implication; that means, given the timestamps, we can infer whether the two events are related by happened-before or not. We are going to see this in more detail further on.
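In symbols, the two conditions just discussed are:

\[ \text{consistency:}\quad e_i \rightarrow e_j \;\Rightarrow\; C(e_i) < C(e_j), \qquad \text{strong consistency:}\quad e_i \rightarrow e_j \;\Longleftrightarrow\; C(e_i) < C(e_j). \]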

Now, implementing the logical clocks: implementation of a logical clock requires addressing two issues, first the data structure local to every process to represent logical time, and second the protocol to update this data structure so as to ensure the consistency condition. Each process pi maintains a data structure that provides the following two capabilities. The first one is a local logical clock, denoted lci, which helps

107
process pi measure it is own progress or you can also say that this will ensure the
progress of internal events.

(Refer Slide Time: 27:27)

The second capability is the logical global clock, denoted gci, which is process pi's local view of the logical global time; typically lci is a part of gci. The protocol ensures that a process's logical clock, and thus its view of the global time, is managed consistently. The protocol consists of the following two rules. Rule R1 governs how the local logical clock is updated by a process when it executes an event. Rule R2 governs how a process updates its global logical clock, that is, its view of the global time and global progress.

Systems of logical clocks differ in their representation of logical time and also in the protocol used to update the logical clocks. That was the general framework.

108
(Refer Slide Time: 28:31)

Now, we are going to see the first type of logical time, called scalar time. Scalar time was proposed by Lamport in 1978 as an attempt to totally order the events in a distributed system. Here the time domain is the set of non-negative integers; the local logical clock of a process pi and its local view of the global time are squashed into one integer variable Ci.

The rules R1 and R2 to update the clocks are as follows. Rule R1: before executing an event (send, receive or internal), process pi executes Ci = Ci + d, where d > 0. In general, every time rule R1 is executed, d can have a different value; however, typically d is kept as 1. Rule R2: each message piggybacks the clock value of its sender at sending time.

109
(Refer Slide Time: 29:40)

When a process pi receives a message with timestamp Cmsg, it executes the following actions: first, it updates its clock to the maximum of its internal clock value and the piggybacked timestamp, that is, Ci = max(Ci, Cmsg); then it executes rule R1; and then it delivers the message. Figure 4.2 shows an illustration of scalar time.
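
To make rules R1 and R2 concrete, the following is a minimal Python sketch of a scalar clock with d = 1; the class and method names here, such as ScalarClock and receive_event, are purely illustrative and not part of the lecture material.

```python
class ScalarClock:
    """Minimal sketch of Lamport's scalar clock rules R1 and R2 with d = 1."""

    def __init__(self):
        self.c = 0                       # local clock C_i, a non-negative integer

    def internal_event(self):
        self.c += 1                      # R1: C_i = C_i + d before executing the event
        return self.c

    def send_event(self):
        self.c += 1                      # R1 on the send event
        return self.c                    # R2: this value is piggybacked on the outgoing message

    def receive_event(self, c_msg):
        self.c = max(self.c, c_msg)      # take the maximum of the local and piggybacked clocks
        self.c += 1                      # then execute R1 and deliver the message
        return self.c
```

For instance, in the walkthrough that follows, a receive_event(2) on a local clock value of 1 yields max(1, 2) + 1 = 3.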

(Refer Slide Time: 30:20)

110
Let me explain scalar time through this example. Process p1's first event is timestamped 1 using its clock C1, and its second event is timestamped 2. The second event is the send of a message, so this timestamp is piggybacked on the message. When the message arrives at process p2, p2 takes the maximum of its own clock C2, which is 1, and the piggybacked timestamp, which is 2; the result is 2. Then it applies rule R1, which says C2 = C2 + d with d = 1, so the clock becomes 3.

So, 3 is the timestamp of this event, which is the receive of the message.

(Refer Slide Time: 31:54)

Scalar time satisfies the following basic properties. The first property is the consistency property: scalar clocks satisfy the monotonicity property and hence the consistency property, that is, for any two events e_i and e_j, if e_i happened before e_j, then the timestamp of e_i is less than the timestamp of e_j. This matches everyday intuition: if one person is standing at the front of a queue, the person who arrives afterwards gets a later time than the person who arrived before him. In the same way, the order in which events happen and the clock timestamps assigned to them respect the less-than relation.

111
The next property of scalar time is total ordering: scalar clocks can be used to totally order the events in a distributed system. The main problem in totally ordering events is that two or more events at different processes may have identical timestamps, because the clocks at different processes, which are just non-negative integer variables, are incremented independently.

(Refer Slide Time: 33:39)

For example, in figure 4.2 the third event of process p1 and the second event of process p2 have identical scalar timestamps; how are we going to totally order these events in that case?

112
(Refer Slide Time: 33:48)

To obtain a total order in the case where the scalar times are the same, a tiebreaking mechanism is needed. The tie is broken as follows: the process identifiers are linearly ordered, and a tie among events with identical scalar timestamps is broken on the basis of their process ids.

That is, if two events x and y have the same timestamp, the ids of their processes are used to order them. The lower the process id in the ranking, the higher the priority. The timestamp of an event is then denoted by a tuple (t, i), where t is the time of occurrence and i is the identity of the process where it occurred, as already explained.

The total order relation ≺ on two events x and y with timestamps (h, i) and (k, j) respectively is defined as follows: x ≺ y if and only if (h < k) or (h = k and i < j). That is, either x has a smaller scalar timestamp than y, or their timestamps are equal and the id of x's process is smaller than the id of y's process; this totally orders the events.
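
A small sketch of this tiebreaking rule in Python may help; the function name is only illustrative, and each event is represented as a (timestamp, process id) pair.

```python
def total_order_less(x, y):
    """x = (h, i), y = (k, j): scalar timestamps with process ids as tiebreakers."""
    h, i = x
    k, j = y
    # x precedes y if its timestamp is smaller, or the timestamps tie and its process id is smaller
    return h < k or (h == k and i < j)

# Example: two events with equal timestamp 2 at processes 1 and 3 -> the lower id wins,
# so total_order_less((2, 1), (2, 3)) evaluates to True.
```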

So, the partial order together with this tiebreaking mechanism forms a total ordering of the events, which matches the definition of a total order we saw in the previous slides. A further property is that scalar time can be used for event counting.

113
(Refer Slide Time: 36:08)

If the increment value d is always 1, scalar time has the following interesting property: if an event e has a timestamp h, then (h − 1) represents the minimum logical duration, counted in units of events, required before producing event e. We call this the height of event e. In other words, (h − 1) events have been produced sequentially before event e, regardless of the processes that produced these events.

So, we can count how many events happened before a particular event. For example, in the figure, five events precede event b on the longest causal path ending at b.

(Refer Slide Time: 36:55)

114
In this figure you can see that for event b to occur, five different events must have preceded it, and that is why the number is 5. In this illustration event b has the timestamp 6, which means that five events happened before event b occurred; you can trace these five events along the longest causal path in the figure.

(Refer Slide Time: 37:34)

Another property is the lack of strong consistency: the system of scalar clocks is not strongly consistent, that is, for two events e_i and e_j, comparing the clock values does not tell us whether e_i happened before e_j. For example, in figure 4.2 the third event of process p1 has a smaller scalar timestamp than the third event of process p2; however, the former did not happen before the latter.

115
(Refer Slide Time: 38:12)

So, in this example you can see that a smaller scalar timestamp does not mean that the events are related by the happened-before relation.

(Refer Slide Time: 38:24)

Since scalar time does not have the strong consistency property, we are motivated to look for another system of clocks that does have it, because then, by looking at the timestamps alone, we can infer whether one event happened before another. For applications that need this we require another kind of clock. This is the motivation to study another notion of time,

116
which is called vector time. The system of vector clocks was developed independently by Fidge, Mattern and Schmuck. In the system of vector clocks the time domain is represented by a set of n-dimensional non-negative integer vectors.

Each process pi maintains a vector vti[1..n] of size n, where vti[i] is the local logical clock of pi and describes the logical time progress at pi; it is equivalent to the scalar time of pi. The component vti[j] represents process pi's latest knowledge of process pj's local time; that is, pi's latest knowledge of pj's local time is stored in the j-th component of pi's vector. If vti[j] = x, then process pi knows that the local time at process pj has progressed till x. The entire vector vti constitutes pi's view of the global logical time and is used to timestamp the events.

(Refer Slide Time: 40:38)

Process pi uses the following two rules to update its vector clock. Rule R1 says that before executing an event, process pi updates its local logical time just like a scalar clock, that is, vti[i] = vti[i] + d.

Rule R2 says that each message m is piggybacked with the vector clock vt of the sender process at sending time. On receipt of such a message (m, vt), process pi executes the following actions. First, it updates its global logical time: it takes the component-wise maximum of its own vector time and the vector time piggybacked in the message, that is, vti[k] = max(vti[k], vt[k]) for 1 ≤ k ≤ n; it updates the global logical time according to this

117
formula. Having done that, it executes R1 and then delivers the message.

(Refer Slide Time: 42:22)

So, the timestamp of an event is the value of the vector clock of its process when the event is executed.
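
As an illustration of rules R1 and R2 for vector clocks, here is a minimal Python sketch with d = 1; the class and method names are only illustrative.

```python
class VectorClock:
    """Minimal sketch of the vector clock of process pi among n processes, with d = 1."""

    def __init__(self, i, n):
        self.i = i                        # index of this process
        self.vt = [0] * n                 # vt_i, initialized to all zeroes

    def internal_or_send(self):
        self.vt[self.i] += 1              # R1: vt_i[i] = vt_i[i] + d
        return list(self.vt)              # R2: a copy of vt_i is piggybacked on outgoing messages

    def receive(self, vt_msg):
        # R2: component-wise maximum of the local vector and the piggybacked vector
        self.vt = [max(a, b) for a, b in zip(self.vt, vt_msg)]
        self.vt[self.i] += 1              # then execute R1 and deliver the message
        return list(self.vt)
```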

(Refer Slide Time: 42:46)

Figure 4.3 shows an example of vector clock progress with the increment value d = 1; initially every vector clock is initialized to all zeroes. The example of a vector

118
clock can be seen in this space-time diagram. For p1, the first component of its vector behaves just like a scalar clock: you can see it growing linearly.

Similarly, for p2 the second component of the vector time grows linearly, and for p3 the third component grows linearly. Whenever there is a send of a message, that is, a message exchange, these global views are updated at the receiving process.

(Refer Slide Time: 43:54)

Comparing vector timestamps: the following relations are defined to compare two vector timestamps vh and vk. These are the operations by which we can compare timestamps and infer the causal precedence relation between events:

vh = vk  if and only if  ∀x : vh[x] = vk[x]
vh ≤ vk  if and only if  ∀x : vh[x] ≤ vk[x]
vh < vk  if and only if  vh ≤ vk and ∃x : vh[x] < vk[x]
vh ∥ vk  if and only if  ¬(vh < vk) ∧ ¬(vk < vh), in which case the events are concurrent.

This set of comparison relations is used to compare the timestamps of two events and infer the causal precedence relation between them. If the processes at which the events occurred are known, the test to compare two timestamps can be simplified as follows: if events x and y occurred at processes pi and pj and are assigned timestamps vh and vk respectively, then x

119
happened before y if and only if vh[i] ≤ vk[i]. This is a very important observation: if vh[i] ≤ vk[i], then x happened before y.

Similarly, x and y are concurrent events if vh[i] > vk[i] ∧ vh[j] < vk[j].
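
The comparison relations and the simplified test can be written down directly; the following is a rough Python sketch (function names are illustrative), assuming vh and vk are equal-length lists of integers.

```python
def vleq(vh, vk):
    """vh <= vk: every component of vh is at most the corresponding component of vk."""
    return all(a <= b for a, b in zip(vh, vk))

def happened_before(vh, vk):
    """vh < vk: vh <= vk and the two vectors differ in at least one component."""
    return vleq(vh, vk) and vh != vk

def concurrent(vh, vk):
    """vh || vk: neither vector is less than the other."""
    return not happened_before(vh, vk) and not happened_before(vk, vh)

def happened_before_known(vh, vk, i):
    """Simplified test when x occurred at process pi: x happened before y iff vh[i] <= vk[i]."""
    return vh[i] <= vk[i]
```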

(Refer Slide Time: 46:37)

Another property of vector clocks is isomorphism. If the events in a distributed system are timestamped using a system of vector clocks, we have the following property: if two events x and y have timestamps vh and vk respectively, then x happened before y if and only if vh < vk, and x and y are parallel (concurrent) if and only if their vector timestamps are incomparable. Thus there is an isomorphism between the set of partially ordered events produced by the distributed computation, that is H, and their vector timestamps; vector clocks hold the isomorphism property.

120
(Refer Slide Time: 47:28)

Vector time satisfies the strong consistency property: the system of vector clocks is strongly consistent, that is, by examining the vector timestamps of two events we can determine whether the events are causally related. However, Charron-Bost showed that, for this property to hold, the dimension of the vector clocks cannot be less than n, where n is the total number of processes in the distributed computation; we will see this discussion in more detail in the next class.

Another property of a vector clock is event counting. When d = 1 in rule R1, the i-th component of the vector clock at process pi, that is vti[i], denotes the number of events that have occurred at pi up to that time. So, we can do event counting just as we did with the scalar clock.

If an event e has a timestamp vh, then vh[j] indicates the number of events executed by pj that causally precede e.

121
(Refer Slide Time: 48:48)

Clearly, the sum ∑_j vh[j] − 1 represents the total number of events that causally precede e in the distributed computation. Conclusion: in a distributed system, a set of processes communicate by exchanging messages over a communication network, and the distributed computation is spread geographically over the processes. The processes do not share a common global memory or a common physical clock to which they have instantaneous access.

Thus, in this lecture we have seen how to overcome these difficulties of the model called a distributed system. Instead of a physical clock, we have seen that a logical clock captures the causal relations between distributed events. In this lecture we presented the idea of the logical clock, which captures the causal relation between events, proposed by Lamport in 1978 in an attempt to order the events in a distributed system. We discussed two systems of logical clocks, namely scalar and vector clocks, to capture causality between the events of a distributed computation, satisfying the clock consistency and strong consistency properties respectively.

In the upcoming lectures we will discuss the size of the vector clock, matrix clocks, virtual time and physical clock synchronization; all these important

122
aspects build on the concept of causality in a distributed system, which is fundamental to the design of distributed systems.

Thank you.

123
Distributed Systems
Prof. Rajiv Misra
Department of Computer Science and Engineering
Indian Institute of Technology, Patna

Lecture - 05
Size of vector clock, Matrix clocks, Virtual time and Physical clock
Synchronization

Lecture 5: Size of vector clock, Matrix clock, Virtual time and Physical clock
Synchronization. These are the topics of lecture 5. Preface: recap of previous lecture.

(Refer Slide Time: 00:28)

In the previous lecture we discussed the models of distributed computation and presented the idea of causality and logical time, proposed by Lamport in 1978 in an attempt to order the events in a distributed system. We discussed two systems of logical clocks, namely scalar and vector clocks, to capture causality between events in a distributed system.

124
(Refer Slide Time: 00:57)

Content of this lecture: in this lecture we will discuss the size of vector clocks, matrix clocks, virtual time and physical clock synchronization.

In the last class we saw the concept of causality, which is fundamental to the design of distributed systems. Usually, causality is tracked using physical time. In day-to-day life we use physical time in the form of loosely synchronized clocks such as wristwatches and wall clocks, and our activities are timed using access to these clocks. In a computer, that is, in a centralized system, there is a single clock, and processes access it by issuing system calls; a process obtains the current time, the next process that demands the time gets a later time, and so on.

So, in a centralized system there is no need for clock synchronization. In a distributed system, however, processors have their own clocks that are only loosely synchronized, and synchronizing these clocks becomes a significant problem. Today we are going to see all these aspects of causality: how to obtain causality in a distributed system, and whether physical clocks or logical clocks are more useful in the design of distributed systems.

In the last class we saw logical clocks in two forms: the scalar clock and the vector clock. The scalar clock provides the clock consistency property and the vector clock provides the strong consistency property; the

125
applications that need these properties use them accordingly. Today we are going to see the size of vector clocks: why is the size required to be n, the number of processes? Is a vector of size n always required, or can we work with fewer than n entries and still have the strong consistency property? Let us go ahead.

(Refer Slide Time: 03:40)

Size of vector clocks: an important question to ask is whether vector clocks of size n are necessary in a computation consisting of n processes. To answer this question we have to examine the usage of vector clocks; the applications in which we use them answer this question. A vector clock provides the latest known local time of every other process, so if this information is to be used to explicitly track the progress of every other process, then a vector clock of size n is necessary.

A popular use of vector clocks is to determine the causality between a pair of events: given any two events e and f, the test e ≺ f if and only if T(e) < T(f) requires a comparison of the vector clocks of e and f. Although it appears that size n is necessary, that is not quite accurate, as we are going to see here. It can be shown that a size equal to the dimension of the partial order (E, ≺), that is, of the happened-before relation, is necessary, where n is only an upper bound on this dimension.

126
(Refer Slide Time: 05:04)

To understand this result on the size of clocks used for determining causality between a pair of events, we first need some definitions. A linear extension of a partial order (E, ≺) is a linear ordering of E that is consistent with the partial order, that is, if two events are ordered in the partial order, they are also ordered the same way in the linear order. A linear extension can be viewed as projecting all the events from the different processes onto a single time axis. The dimension of a partial order is the minimum number of linear extensions whose intersection gives exactly the partial order.

Having given this definition, observe that a linear order will necessarily introduce orderings between pairs of events, and some of these orderings are not in the partial order; also observe that different linear extensions are possible in general. Let P denote the set of tuples in the partial order defined by the causality relation.

So, there is a tuple (e, f) in P for each pair of events e and f such that e happened before f. Let L1, L2, and so on denote the sets of tuples in different linear extensions of this partial order. The set P is contained in the set obtained by taking the intersection of any such collection of linear extensions L1, L2, and so on. This is because each Li contains all the tuples, that is, the causality dependencies, that are in P.

127
(Refer Slide Time: 06:40)

Let us take an example: a client-server interaction application. Here the interaction happens between a pair of processes; the queries to the server and the responses to the client occur in strictly alternating sequence. Although n = 2 (client and server), all the events are strictly ordered and there is only one linear extension of all the events that is consistent with the partial order; hence the dimension of this partial order is 1.

So, a scalar clock, such as one implemented by Lamport's scalar clock rules, is adequate to determine e ≺ f for any two events. Here, although n = 2, a clock of size 1 is enough to solve this application.

128
(Refer Slide Time: 07:30)

Similarly, consider another application: concurrent send and receive. Consider an execution on processes P1 and P2 such that each sends a message to the other before receiving the other's message. The two send events are concurrent, as are the two receive events. To determine the causality between the send events or between the receive events, it is not sufficient to use a single integer; a vector clock of size 2 is necessary. This execution exhibits a graphical property called a crown, and a crown of n messages has dimension n.

(Refer Slide Time: 08:15)

129
Now we look at another example, a more complex execution, where determining the dimension of the partial order is not straightforward. Figure 5.1 shows an execution of 4 processes; the dimension of this partial order turns out to be 2, that is, a vector clock of size 2 is good enough to provide the strong consistency property for the four-process interaction in this example.

To see this informally, consider the longest chain <a, b, d, g, h, i, j>. There are events outside this chain, which can yield multiple linear extensions, so the dimension is more than one. The right side of figure 5.1 shows the earliest possible and latest possible occurrences of the events not in this chain, with respect to the events in the chain; see the example.

(Refer Slide Time: 09:13)

In this example the longest chain of the linear extension is <a, b, d, g, h, i, j>, as marked in the figure. The events outside this chain are c, e and f. Using them we form the linear extensions: two linear extensions are possible in this case, so obviously the dimension of the partial order is not 1, but more than one.

130
The two linear extensions are obtained using the earliest and the latest possible placements of c, e and f.

(Refer Slide Time: 10:34)

So, two different linear extensions are possible in this case, and if we take their intersection, it gives exactly the partial order; that is what is shown in this illustrative example.

L1 is one linear extension and L2 is another, and we have (L1 \ P) ∩ L2 = ∅ and similarly (L2 \ P) ∩ L1 = ∅. Hence the intersection of the two linear extensions L1 and L2 generates exactly the partial order.

Hence the dimension of this execution is 2. That means two linear extensions are good enough to generate this partial order, so a vector clock of size 2 is good enough to ensure the strong consistency property and hence to solve this application. Thus, even though the total number of processes is four, only a vector clock of size 2 is required in this case.

Finding the dimension of a partial order is not easy; it is computationally difficult. An a posteriori analysis is required to identify the size of the vector clock, and it can then be optimized; different algorithms use this kind of concept to reduce the size of the vector clock.

131
(Refer Slide Time: 12:15)

The next clock we are going to see is called matrix time, or the matrix clock. In a system of matrix clocks, time is represented by a set of n x n matrices of non-negative integers.

Let us see the details of matrix time. A process pi maintains a matrix mti[1..n, 1..n], where mti[i, i] denotes the local logical clock of pi and tracks the progress of its computation; it works like the scalar time we saw with the scalar clock. The component mti[i, j] denotes the latest knowledge that process pi has about the local logical clock of process pj; that is, the i-th row gives the notion of vector time.

The third aspect of matrix time is that mti[j, k] represents the knowledge that pi has about the latest knowledge that process pj has about the local logical clock of pk. This global information is stored in the local view of process pi. So, the entire matrix mti denotes pi's local view of the global logical time in the matrix clock.

132
(Refer Slide Time: 14:09)

The matrix clock uses two rules, R1 and R2, to update the clock. Rule R1: before executing an event, process pi updates its local logical time as mti[i, i] = mti[i, i] + d, just like the scalar clock we have already seen. Rule R2: each message m is piggybacked with the matrix time mt; when pi receives such a message carrying a matrix timestamp from pj, it executes the following sequence of actions to update its global logical time.

First, pi updates its own i-th row, that is, its vector, using the row of pj in the received matrix: mti[i, k] = max(mti[i, k], mt[j, k]) for 1 ≤ k ≤ n. This is like updating a vector clock. Then the remaining portion of pi's matrix is updated as follows.

For all rows, that is, for the rows corresponding to the other processes as well, mti[k, l] = max(mti[k, l], mt[k, l]) for 1 ≤ k, l ≤ n, using the information obtained from the message that carried the matrix timestamp. After doing this update according to rule R2, pi executes R1 and then delivers the message m.
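
The two rules can be sketched in Python roughly as follows; this is only an illustrative sketch with assumed names, not an implementation from the lecture.

```python
class MatrixClock:
    """Minimal sketch of the matrix clock of process pi among n processes, with d = 1."""

    def __init__(self, i, n):
        self.i = i
        self.n = n
        self.mt = [[0] * n for _ in range(n)]      # mt_i[1..n, 1..n], initialized to all zeroes

    def tick(self):
        self.mt[self.i][self.i] += 1               # R1: advance the local scalar component

    def send(self):
        self.tick()
        return [row[:] for row in self.mt]         # R2: piggyback a copy of the whole matrix

    def receive(self, mt_msg, j):
        # R2(a): merge sender pj's row into pi's own row (the vector clock part)
        for k in range(self.n):
            self.mt[self.i][k] = max(self.mt[self.i][k], mt_msg[j][k])
        # R2(b): merge every row, i.e. pi's knowledge of what each process knows
        for k in range(self.n):
            for l in range(self.n):
                self.mt[k][l] = max(self.mt[k][l], mt_msg[k][l])
        self.tick()                                # then execute R1 and deliver the message
```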

133
(Refer Slide Time: 16:25)

In example one, you can see that event e1 is represented by a matrix in which the element at the first row and first column represents the local clock, that is, the scalar time of process P1. When P1 sends a message at this event, this matrix is piggybacked. Similarly, event e3 has a matrix of the same kind, where the element at the third row and third column indicates the scalar time of process P3. Now let us see in detail how the clock is updated when these messages are received at event e2 of process P2.

Process P2 first modifies its scalar clock, the element at the second row and second column, making it event number 1. Then, the knowledge it has received is updated based on the local views of the other processes: for example, process P1 has clock value 1, so the corresponding entry is updated, and similarly process P3 has clock value 1, so that entry is updated as well. This is the vector-clock-style update we have seen. The remaining parts, that is, the rows of P1 and P3, are copied in, so that P2 has P3's view of time and P1's view of time. This update is done according to the matrix clock rules we have seen.

134
(Refer Slide Time: 18:44)

In another example, when this message is received at P2, P2 first updates its scalar clock in the second row, its own event entry, and then updates its vector time: the time learned from P1 is updated in the corresponding entry, while the entry for P3 remains 0 because P2 has not yet seen any event from P3. The third update copies the vectors (rows) of the other processes, which gives P2 the knowledge it has of their views. The other interactions proceed similarly. Finally, you can see that the last event at P3 has a complete view of all the events that have happened; let us look at its vector.

That event belongs to process P3 and is its second event, so its scalar time is 2. Its vector is updated from the message coming from P2, so the entries it has seen so far through that message are brought up to date; both the scalar clock and the vector clock parts are updated. As far as the vectors (rows) of the other processes are concerned, they are copied over from the received matrix. This is matrix time, and we have seen the working of the matrix clock; we will now look at its basic properties.

135
(Refer Slide Time: 20:54)

The vector mti[i, .], that is, the i-th row, has all the properties of a vector clock. In addition, the matrix clock has the following property: if min_k(mti[k, i]) ≥ t, then process pi knows that every other process pk knows that pi's local time has progressed until t. Many applications require this information: it means that the other processes will no longer require information from pi that is older than time t, and hence pi can discard that obsolete information.
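
This property amounts to a one-line computation; the following tiny sketch (the function name is only an illustration) returns the time up to which pi's progress is known to everyone and can therefore be discarded.

```python
def discardable_up_to(mt_i, i):
    """min over k of mt_i[k][i]: every process is known to have seen pi's progress
    up to this value, so information about pi older than this can safely be discarded."""
    return min(row[i] for row in mt_i)
```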

136
(Refer Slide Time: 22:07)

The next time system we are going to consider is called virtual time. A virtual time system is a paradigm for organizing and synchronizing distributed systems. In this section we describe virtual time and its implementation using the time warp mechanism. The implementation of virtual time using the time warp mechanism works on the basis of an optimistic assumption: time warp relies on a general lookahead-rollback mechanism, where each process executes without regard to whether other processes have synchronization conflicts with it.

(Refer Slide Time: 22:47)

137
If a conflict is discovered, the offending processes are rolled back to the time just before the conflict and executed forward along the revised path. Detection of conflicts and rollback are transparent to the user. The implementation of virtual time using the time warp mechanism makes the optimistic assumption that synchronization conflicts and rollbacks occur rarely; that is, rollback due to a conflict is not a frequent operation.

(Refer Slide Time: 23:28)

Virtual time is a global, one-dimensional temporal coordinate system on a distributed computation, used to measure computational progress and to define synchronization. A virtual time system is a distributed system executing in coordination with an imaginary virtual clock that uses virtual time. Virtual times are real values that are totally ordered by the less-than relation.

Virtual time is implemented as a collection of several loosely synchronized local virtual clocks. These local virtual clocks normally move forward to higher virtual times; however, occasionally they may move backwards whenever there are conflicts.

138
(Refer Slide Time: 24:11)

In virtual time, processes communicate concurrently with each other by exchanging messages. Every message is characterized by four values: the name of the sender, the virtual send time, the name of the receiver and the virtual receive time. The virtual send time is the virtual time of the sender when the message is sent, whereas the virtual receive time specifies the virtual time at which the message must be received by the receiver.

(Refer Slide Time: 24:41)

139
A problem arises when a message arrives at a process late, that is, when the virtual receive time of the message is less than the local virtual time at the receiver process when the message arrives; this is what virtual time has to handle. Virtual time systems are subject to two semantic rules, similar to Lamport's clock conditions. Rule 1 says that the virtual send time of each message must be less than the virtual receive time of that message; otherwise there is a conflict that must be resolved.

Rule 2 says that the virtual time of each event in a process is less than the virtual time of the next event in that process. The above two rules imply that a process sends all messages in increasing order of virtual send time and a process receives all messages in increasing order of virtual receive time.

(Refer Slide Time: 25:45)

Importantly, an event that causes another should be completely executed before the caused event can be processed; this is already implied by rule 1 and rule 2.

140
(Refer Slide Time: 25:59)

We now look at the characteristics of virtual time. Virtual time systems are not all isomorphic; virtual time may be either discrete or continuous, and it may be only partially ordered. Virtual time may be related to real time or may be independent of it. Virtual time may be visible to programmers and manipulated explicitly as values, or it may be hidden and manipulated implicitly according to some system-defined discipline. The virtual times associated with events may be explicitly calculated by the user program or may be assigned by fixed rules.

(Refer Slide Time: 26:35)

141
Now we compare virtual time with Lamport's logical clocks, to see how virtual time differs. In Lamport's logical clocks, an artificial clock is created, one for each process, with unique labels from a totally ordered set, in a manner consistent with the partial order that we have seen. In virtual time, the reverse is done: it is assumed that every event is already labeled with a clock value from a totally ordered virtual time scale satisfying Lamport's conditions.

Thus the time warp mechanism is an inverse of Lamport's scheme. In Lamport's scheme all clocks are conservatively maintained so that they never violate causality: a process advances its clock as soon as it learns of a new causal dependency. In virtual time, clocks are optimistically advanced and corrective actions are taken whenever violations are detected.

(Refer Slide Time: 27:29)

The time warp mechanism: in the implementation of virtual time using the time warp mechanism, the virtual receive time of a message is considered its timestamp. The necessary and sufficient condition for a correct implementation of virtual time is that each process must handle incoming messages in timestamp order.

142
(Refer Slide Time: 27:53)

The time warp mechanism consists of two major parts: the local control mechanism and the global control mechanism. The local control mechanism ensures that events are executed and messages are processed in the correct order; the global control mechanism takes care of global issues such as global progress, termination detection, I/O handling, flow control, etc. Rules 1 and 2, which we have seen, are enforced through these mechanisms in the time warp implementation.

(Refer Slide Time: 28:27)

143
In the next part of the discussion we consider the case where different processors use their own physical clocks in the distributed system, and ask how we are going to synchronize them. For this we will look at a protocol called the Network Time Protocol, which performs physical clock synchronization. Let us begin this discussion. In a centralized system there is only one clock; a process gets the time simply by issuing a system call to the kernel.

The next process that asks for the time will always get a higher time, because it is reading the same clock, so there is no problem of clock synchronization in a centralized system. However, in a distributed system there is no global clock or common memory; each processor has its own internal clock and its own notion of time. These clocks can easily drift by seconds per day, accumulating a significant error over time. Also, because different clocks tick at different rates, they may not remain synchronized even though they might have been synchronized when they started. This clearly poses serious problems for applications that depend on a synchronized notion of time.

(Refer Slide Time: 29:55)

For most practical applications and algorithms that run in a distributed system, we need to know the time in one or more contexts. Unless the clocks in each machine have a common notion of time, time-based queries cannot be answered. Clock synchronization therefore has a significant effect on many problems like secure systems, fault diagnosis, recovery and so on.

144
(Refer Slide Time: 30:19)

Clock synchronization is the process of ensuring that physically distributed processors have a common notion of time. Due to different clock rates, the clocks at various sites may diverge with time, and periodically clock synchronization must be performed to correct this clock skew in a distributed system. In day-to-day life our wristwatches and wall clocks also lose synchronization, and we resynchronize them with Coordinated Universal Time (UTC).

So, clocks are synchronized to an accurate real-time standard like UTC: they must not only be synchronized with each other, but also with the global clock, that is, with physical time, and that is done through UTC. Clocks that must not only be synchronized with each other but must also adhere to physical time are termed physical clocks. A physical clock, therefore, means not only a synchronized set of clocks, but clocks that are also coordinated with physical time, so that they all give the same physical time.

145
(Refer Slide Time: 31:35)

Now we look at clock synchronization in more detail, and finally we will see how these concepts are used in the Network Time Protocol. Clock inaccuracies: physical clocks are synchronized to an accurate real-time standard like Coordinated Universal Time, also called UTC.

However, due to the clock inaccuracies discussed above, a timer (clock) is said to be working within its specification if its rate satisfies 1 − ρ ≤ dC/dt ≤ 1 + ρ, where the constant ρ is the maximum skew rate specified by the manufacturer; that is, the clock rate is upper bounded by 1 + ρ and lower bounded by 1 − ρ. If the clock satisfies this inequality (equation 1), then the clock is working within the specification.
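
Equation 1 can be checked directly from two readings of the clock against a reference time source; the following is a rough sketch under that assumption, with purely illustrative names.

```python
def within_spec(c1, c2, t1, t2, rho):
    """Check 1 - rho <= dC/dt <= 1 + rho for the observed clock rate over the real-time interval [t1, t2]."""
    rate = (c2 - c1) / (t2 - t1)          # observed clock rate over the interval
    return (1 - rho) <= rate <= (1 + rho)
```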

146
(Refer Slide Time: 32:34)

Otherwise we have the following situations: if the clock rate dC/dt = 1 it is a perfect clock, if dC/dt < 1 it is a slow clock, and if dC/dt > 1 it is a fast clock. The behaviors of fast, slow and perfect clocks are shown in this diagram of clock inaccuracies.

(Refer Slide Time: 32:55)

The offset delay estimation method is used in the Network Time Protocol; we are going to see what this method is. The Network Time Protocol, which is widely used for clock synchronization on the Internet, uses the offset delay estimation method. The design of the

147
Network Time Protocol involves a hierarchical tree of time servers; that is, synchronization with UTC is organized in the form of a hierarchical tree. The primary server at the root synchronizes with UTC; the next level contains the secondary servers, which act as backups to the primary; at the lowest level is the synchronization subnet, which has the clients.

(Refer Slide Time: 33:43)

Clock offset and delay estimation: in practice, a source node cannot accurately estimate the local time on the target node due to varying message and network delays between the nodes. Therefore the protocol employs the common practice of performing several trials and choosing the trial with the minimum delay. Figure 5.4 shows the Network Time Protocol timestamps, which are numbered and exchanged between peers A and B. Let T1, T2, T3 and T4 be the values of the four most recent timestamps, as shown in the figure. Assume clocks A and B are stable and running at the same speed.

148
(Refer Slide Time: 34:29)

These are the four timestamps shown, and we are going to use them for offset and delay estimation.

Let a = T1 − T3 and b = T2 − T4; these two quantities are used to calculate the offset and the delay.

(Refer Slide Time: 35:12)

The clock offset, denoted θ, is (a + b)/2 and the round-trip delay, denoted δ, is a − b. Each Network Time Protocol message

149
includes the latest three timestamps T1, T2 and T3, while T4 is determined upon arrival. Thus both peers A and B can independently calculate the delay and offset using a single bidirectional message stream, as shown in figure 5.5.

(Refer Slide Time: 35:41)

(Refer Slide Time: 35:49)

With figure 5.5 we now look at the Network Time Protocol, that is, the time synchronization protocol. A pair of servers in symmetric mode exchange pairs of timing messages, and a store of data is built up about the relationship between the two servers. Specifically, assume that each pair maintains

150
a pair (Oi, Di), the offset and the delay: Oi is a measure of the offset θ, and Di is the transmission delay of the two messages, that is, δ. The offset corresponding to the minimum delay is chosen; that is, we choose the Oi whose Di is minimum.

Specifically, the delay and offset are calculated as follows. Assume that message m takes time t and message m' takes time t' to transfer.

(Refer Slide Time: 36:49)

Now let O be the offset between A's clock and B's clock. If A's local clock time is A(t) and B's local clock time is B(t), we have A(t) = B(t) + O. Then equation 4 says that Ti−2 = Ti−3 + t + O, with reference to the figure.

This is the same calculation we saw in the previous slides. Assuming t = t', the offset Oi can be estimated as Oi = (Ti−2 − Ti−3 + Ti−1 − Ti)/2 (equation 6), and similarly the round-trip delay Di is estimated as Di = (Ti−2 − Ti−3) + (Ti − Ti−1) (equation 7). Having calculated equations 6 and 7, the eight most recent pairs (Oi, Di) are retained, and the value of Oi that corresponds to the minimum Di is chosen as the estimate of O, as we have already seen.
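
Equations 6 and 7, together with the minimum-delay selection, can be sketched as follows; this is only a rough illustration under the assumptions above, not the actual NTP implementation, and the function names are assumptions.

```python
def offset_and_delay(ti_minus3, ti_minus2, ti_minus1, ti):
    """One estimate from the four timestamps T(i-3), T(i-2), T(i-1), T(i) (equations 6 and 7)."""
    offset = ((ti_minus2 - ti_minus3) + (ti_minus1 - ti)) / 2.0   # O_i
    delay = (ti_minus2 - ti_minus3) + (ti - ti_minus1)            # round-trip delay D_i
    return offset, delay

def best_offset(samples):
    """Among the retained (O_i, D_i) pairs, choose the offset whose delay is minimum."""
    return min(samples, key=lambda pair: pair[1])[0]
```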

151
(Refer Slide Time: 38:31)

Now the question arises: having seen physical clock synchronization and several logical clocks, should a physical clock or a logical clock be used to capture causality in a distributed system? Which of these methods is better? In day-to-day life, the global (physical) time used to deduce the causality relation is obtained from loosely synchronized clocks like wristwatches and wall clocks. However, in a distributed computing system, the rate of occurrence of events is several orders of magnitude higher and the event execution time is several orders of magnitude smaller.

Consequently, if the physical clocks are not precisely synchronized, the causality relation between events may not be accurately captured. This is the most important point: it turns out that in a distributed computation, the causality relation between the events produced by the program execution, and its fundamental monotonicity property, can be accurately captured by logical clocks.

If physical clocks are used, then even with the Network Time Protocol they can typically be synchronized only to within tens of milliseconds. If the number of events is several orders of magnitude higher and the event execution time is several orders of magnitude smaller than that accuracy allows, then physical clocks, even with the Network Time Protocol, will not be useful for the distributed system application. So we have to consider what kind of application we are going to run and whether

152
the physical clock with the Network Time Protocol can solve the problem or not. But in all cases, logical clocks, which capture the fundamental monotonicity property, can be used in a distributed system.

Obviously, the logical clock works fine for distributed applications in all situations, and that is why we have seen so many different kinds of logical clocks in this part of the discussion.

(Refer Slide Time: 40:41)

Conclusions: in this lecture we have discussed the size of the vector clock, the matrix clock and virtual time to capture causality between the events in a distributed system. We have also seen how physical clock synchronization can be performed and how virtual time can be used as a paradigm for organizing and synchronizing a distributed system.

In the upcoming lecture we will discuss global state and snapshot recording algorithms.

Thank you.

153
Distributed Systems
Prof. Rajiv Misra
Department of Computer Science and Engineering
Indian Institute of Technology, Patna

Lecture – 06
Global State and Snapshot Recording Algorithms

Lecture 6: Global State and Snapshot Recording Algorithms. Preface: recap of the previous lecture.

(Refer Slide Time: 00:23)

In the previous lecture we discussed the models of distributed computation, causality, and the general framework of logical clocks, that is, the scalar, vector and matrix clocks in distributed systems, and also how virtual time and physical clock synchronization can be used for organizing and synchronizing distributed systems. Content of this lecture:

154
(Refer Slide Time: 00:50)

In this lecture we will discuss further details about global states, that is, consistent and inconsistent states, cuts in a space-time diagram, models of communication in distributed systems, and a snapshot algorithm, namely the Chandy-Lamport algorithm, to record a global snapshot.

(Refer Slide Time: 01:20)

Global State Introduction: recording the global state of a distributed system on-the-fly is an important paradigm. The lack of globally shared memory, the lack of a global clock, and unpredictable message delays in a distributed system make this problem non-trivial.

155
This lecture first defines a consistent global state and discusses the issues to be addressed to compute a consistent distributed snapshot; then an algorithm to determine such a snapshot on the fly is presented in this part of the lecture.

(Refer Slide Time: 02:03)

The system model we consider in this algorithm design is as follows: the system consists of a collection of n processes p1 to pn that are connected by communication channels. There is no globally shared memory and no physical global clock, and the processes communicate by passing messages through the communication channels. The channel from process pi to process pj is denoted by Cij, and its state is represented by SCij. The actions performed by processes are modeled as three different types of events: internal events, message send events and message receive events.

So, each process has these three event types: internal events, message send events and message receive events. For a message mij that is sent by process pi to process pj, let send(mij) and receive(mij) denote its send and receive events. With these notations we will see some more details.

156
(Refer Slide Time: 03:22)

At any instant, the state of process pi, denoted by LSi, is the result of the sequence of all the events executed by pi till that time. For an event e and a process state LSi, e ∈ LSi if and only if e belongs to the sequence of events that have taken process pi to state LSi; likewise, e ∉ LSi if and only if e does not belong to that sequence of events.

For a channel Cij, the following set of messages can be defined based on the local states of processes pi and pj: transit(LSi, LSj) = {mij | send(mij) ∈ LSi ∧ receive(mij) ∉ LSj}, that is, the messages mij whose send is recorded in the local state of pi but whose receive is not recorded in the local state of pj.

157
(Refer Slide Time: 04:59)

Such a message is in transit, that is, in the channel. Consistent global state definition: the global state of a distributed system is a collection of the local states of the processes and the states of the channels. Notationally, the global state GS is defined as GS = {∪i LSi, ∪i,j SCij}, that is, the collection of all the local states and all the channel states.

A global state GS is a consistent global state if and only if it satisfies the following two conditions, C1 and C2. Condition C1 says that if send(mij) ∈ LSi, that is, the send event is recorded in the local state of process pi, then either the message mij is in the channel state SCij or receive(mij) ∈ LSj, that is, the message has been received and recorded in the local state of pj (exactly one of the two holds). Condition C2 says that if send(mij) ∉ LSi, then the message mij is not in the recorded channel state SCij and receive(mij) ∉ LSj.

158
(Refer Slide Time: 06:45)

We have seen the global state definitions. The global state of a distributed system execution can be seen in figure 6.2, given ahead. Before that, consider the global state GS1 consisting of the local states LS1^1, LS2^3, LS3^3, LS4^2. Let me explain what this means: LS1^1 indicates that process p1's events up to its first event are covered in LS1. In general, LSi^x denotes the local state of pi after it has executed or progressed up to its x-th event. This is the notation used here.

This global state GS1, which comprises this set of local states, is inconsistent, because here the state of p2 has recorded the receive of a message m12; however, the state of p1 has not recorded its send.
159
(Refer Slide Time: 08:22)

Let us see this in the diagram. GS1, shown in red, is inconsistent. LS1^1 starts here, LS2^3 goes like this, LS3^3 comes over here, and LS4^2 comes over here. You can see that the message m12 is what makes this global state inconsistent: in GS1 the receive of the message is recorded but the send of the message is not recorded. So this is an example of an inconsistent global state.

(Refer Slide Time: 09:23)

160
On the contrary, the global state GS2, consisting of its local states, is consistent; all the channels are empty here except C21, which contains the message m21, as we will see once we come to the figure. A global state GS, which is the collection of all the local states and all the channel states, is transitless if and only if all the channel states are empty, that is, no message is in transit; thus all channels are recorded as empty in a transitless global state. A global state is strongly consistent if it is transitless as well as consistent, whose definition we have seen.

Note that in figure 6.2 one of the global states is strongly consistent; let us see this in the example. GS2 comprises LS1^2, which is here, LS2^4, which comes over here, LS3^4, which comes over here, and LS4^2, which comes over here. This global state GS2 is a consistent global state, because there is a message crossing this global state whose send is recorded but whose receive is not recorded; so it falls under the definition of a consistent global state. For the strongly consistent global state GS3, LS1^2 is here, LS2^3 is here, then LS3^4 is here, then LS4^2 is here.

This example does not have any message crossing the line of the global state. Hence, when this global state GS3 is recorded, all the channels are transitless, that is, the channel states are all empty, and it is also consistent because no message crosses it; hence it is called a strongly consistent global state. Recording the global state of a distributed system is an important paradigm when one is interested in analyzing, monitoring, testing and verifying properties of distributed applications and algorithms, and the design of efficient methods for recording the global state of a distributed system is an important problem.

161
(Refer Slide Time: 12:57)

Now, cuts of a distributed system. In the space-time diagram of a distributed computation, a zigzag line joining an arbitrary point on each process line is termed a cut. Such a line slices through the space-time diagram, and thus the set of events of the distributed computation is partitioned into PAST events and FUTURE events: the PAST contains all the events to the left of the cut and the FUTURE contains all the events to the right of the cut. For a cut C, let PAST(C) and FUTURE(C) denote the set of events in the past and the future of C, respectively.

Every cut corresponds to a global state, and every global state can be graphically represented as a cut in the computation's space-time diagram. Cuts in a space-time diagram provide a powerful graphical aid for representing and reasoning about the global state of a computation.

162
(Refer Slide Time: 13:59)

So, this is an example of cuts in a distributed system. The first cut is a strongly consistent cut, because there is no message crossing it and all the channels are transitless; a cut that is both consistent and transitless is called strongly consistent. A consistent cut is one where, for any message crossing it, the send is recorded within the cut but the receipt is not. An inconsistent cut is one where a message crosses from the future into the past; this cannot happen in a real execution of a distributed system, and that is why such a cut is called inconsistent.

So, whenever there is a cut, the events on the left side become the past events and the events on the right side become the future events. In the space-time diagram, the cut divides the computation into two parts, the set of events that happened in the past and the set of events that happen in the future, and this is important for reasoning about a distributed system.

163
(Refer Slide Time: 15:15)

So, these examples are self explanatory after all this discussion. This one is an inconsistent cut, because a future event is crossing into the past: a message receive is recorded even though its send is in the future of the cut. Similarly, this one is a consistent cut, because only a message sent in the past crosses into the future. Hence the global state defined by this cut is consistent, whereas the global state defined by the cut C1 is inconsistent. To restate the explanation of cuts in a distributed system: in a consistent cut, every message received in the past of the cut was also sent in the past of that cut, as I have already explained.

(Refer Slide Time: 16:12)

164
So, all the messages that cross the cut from the past to the future are in transit in the corresponding consistent global state. A cut is inconsistent if a message crosses the cut from the future to the past.
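As a small illustration (the event and message encodings below are assumptions made for this sketch, not notation from the lecture), a cut can be checked for consistency by verifying that every message received in its PAST was also sent in its PAST, and for strong consistency by additionally checking that no message is in transit across the cut:

# Events are identified as (process, event_index); a cut is given by the last
# event index included on each process; messages are (send_event, receive_event).

def in_past(cut, event):
    proc, idx = event
    return idx <= cut[proc]

def is_consistent(cut, messages):
    # consistent: receive in PAST implies send in PAST
    return all(in_past(cut, snd) for snd, rcv in messages if in_past(cut, rcv))

def is_strongly_consistent(cut, messages):
    # strongly consistent: consistent and no message is in transit across the cut
    no_transit = all(not (in_past(cut, snd) and not in_past(cut, rcv))
                     for snd, rcv in messages)
    return is_consistent(cut, messages) and no_transit

messages = [(("p1", 2), ("p2", 3))]                          # one message from p1 to p2
print(is_consistent({"p1": 1, "p2": 3}, messages))           # False: receive recorded, send not
print(is_consistent({"p1": 2, "p2": 2}, messages))           # True: message still in transit
print(is_strongly_consistent({"p1": 2, "p2": 3}, messages))  # True: send and receive both recorded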

(Refer Slide Time: 16:43)

Issues in recording a global state. In recording a global state, the following two issues need to be addressed. Issue I1: how to distinguish between the messages to be recorded in the snapshot and those not to be recorded. Any message that is sent by a process before recording its snapshot must be recorded in the global snapshot; this is condition C1. Any message that is sent by a process after recording its snapshot must not be recorded in the global snapshot; this is condition C2.

Issue I2: how to determine the instant when a process takes its snapshot. A process pj must record its snapshot before processing a message mij that was sent by process pi after recording its own snapshot.

165
(Refer Slide Time: 17:51)

So, now we are going to see an example of money transfer, and this will motivate why recording a global state is not trivial. Let S1 and S2 be two distinct sites of a distributed system which maintain bank accounts A and B, respectively; a site refers to a process in this example. Let the communication channels from site S1 to site S2 and from site S2 to site S1 be denoted by C12 and C21, respectively.

Consider the following sequence of actions, which are also illustrated in figure 6.3. At time t0, initially account A holds $600 and account B holds $200; channel C12 is empty and channel C21 is empty. At time t1, S1 initiates a transfer of $50 from account A to account B: account A is decremented by $50 to $550, and the request to credit $50 to account B is sent on channel C12 to site S2. So now account A = $550, account B = $200, C12 = $50, and C21 = $0.

166
(Refer Slide Time: 19:15)

So, let us see this complete scenario of transactions between sites S1 and S2 in figure 6.3, the money transfer example.

Here you can see that at time t0 site S1 has $600 in account A. After t0 it sends a request to transfer $50 from A to B; just after sending this request the amount in A is decremented, and the request reaches site S2 after time t2, at this instant. Now, after time t2 the scenario at S2 is as follows; you can also see that before time t3 site S1's transfer message was received and the account was updated.

167
(Refer Slide Time: 20:48)

Now, at time t2, site S2 receives the message to credit $50 to account B. Suppose the local state of A is recorded at time t0, showing A = $600, and the local state of B is then recorded showing B = $120, with the channel states recorded as C12 = $50 and C21 = $80. If you sum all the money, it comes to $850; that is, $50 extra.

So, $50 extra appears in the system. You can see that initially A plus B equals $800 to begin with, but the message to transfer $50 from A to B is in transit.

So, in this global state recording, B's amount is recorded as $120 at time t2 together with channel states of $50 and $80; if you sum them, it becomes $850, that is, $50 extra. Such a recording, done without any coordination, is inconsistent. The reason for the inconsistency is that account A's state was recorded before the $50 transfer to B using channel C12 was initiated, whereas channel C12's state was recorded after the $50 transfer was initiated. This simple example shows that recording a consistent global state of a distributed system is not trivial: the recording activities of the individual components must be coordinated appropriately, which requires coordination in time and depends on the model of communication.
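As a back-of-the-envelope check of the figures quoted above (the variable names are just for this sketch):

# Uncoordinated recording from the example: A recorded before the transfer was
# initiated, the channel states recorded after it was initiated.
A_recorded, B_recorded = 600, 120
C12_recorded, C21_recorded = 50, 80
print(A_recorded + B_recorded + C12_recorded + C21_recorded)   # 850, i.e. $50 extra

# A consistent recording must conserve the initial total of $800, for example
print(550 + 170 + 0 + 80)   # 800, the values recorded in the algorithm runs shown later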

168
(Refer Slide Time: 23:15)

So, the model of communication very much affects the design of a global snapshot recording algorithm; that is why we again stress upon, or look back at, the models of communication. Recall that there are three different models of communication assumed in a distributed system: FIFO, non-FIFO, and causal order (CO). In the FIFO model, each channel acts as a first-in first-out message queue, and thus message ordering is preserved by the channel. In the non-FIFO model, a channel acts like a set in which the sender process adds messages and the receiver process removes messages from it in random order.

The causal order model requires the system to support causal delivery of messages, satisfying the following property: for any two messages mij and mkj sent to the same destination, if send(mij) precedes send(mkj), that is, the two sends are causally related, then this relation is respected when the messages are delivered, that is, receive(mij) precedes receive(mkj). This is called the causal order model of communication.

169
(Refer Slide Time: 24:58)

Now, we are going to see a snapshot algorithm which assumes FIFO channels. Let me point out that this is an assumption: the same algorithm will not work for the other two models, non-FIFO and causal order; it assumes that the model is FIFO, so for non-FIFO channels it will not work. The Chandy-Lamport algorithm uses a control message, called a marker, whose role in a FIFO system is to separate the messages in a channel. After a site has recorded its snapshot, it sends a marker along all of its outgoing channels before sending out any more messages. A marker separates the messages in a channel into those to be included in the snapshot and those not to be recorded in the snapshot, which addresses issue I1 that we saw earlier about global snapshots.

A process must record its snapshot no later than when it receives a marker on any of its incoming channels, and this addresses issue I2.

170
(Refer Slide Time: 26:21)

So, the algorithm can be initiated by any process by executing the marker sending rule, by which it records its local state and sends a marker on each of its outgoing channels. A process executes the marker receiving rule on receiving a marker: if the process has not yet recorded its local state, it records the state of the channel on which the marker was received as empty and executes the marker sending rule to record its local state.

The algorithm terminates after each process has received a marker on all of its incoming channels. All the local snapshots then get disseminated to all other processes, and every process can determine the global state.

171
(Refer Slide Time: 27:10)

Let us see the algorithm in more detail. The marker sending rule for a process i has two steps: process i records its state, and then, for each outgoing channel C on which a marker has not been sent, i sends a marker along C before sending any further messages along C. The marker receiving rule for a process j, on receiving a marker along a channel C, is: if j has not recorded its state, then it records the state of C as empty and follows the marker sending rule; else it records the state of C as the set of messages received along C after j's state was recorded and before j received the marker along C.

So, the Chandy-Lamport algorithm has two rules: the marker sending rule for a process i and the marker receiving rule for a process j. The marker sending rule makes process i record its state and send a marker on each outgoing channel. The marker receiving rule for process j has two cases on receiving a marker along a channel C: if j has not yet recorded its state, it records the state of channel C as empty and also records its own state according to the marker sending rule; if process j has already recorded its state, it records the state of channel C as the set of messages received along C after j's state was recorded and before j received the marker along C.
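A minimal sketch of the two rules in Python (illustrative only; the class name, the send helper, and the channel bookkeeping are assumptions for this sketch, not the lecture's code):

class ChandyLamportProcess:
    def __init__(self, pid, out_channels, in_channels, send):
        self.pid = pid
        self.out_channels = out_channels        # channel ids this process sends on
        self.in_channels = in_channels          # channel ids this process receives on
        self.send = send                        # assumed transport: send(channel, msg)
        self.recorded_state = None
        self.channel_state = {}                 # channel id -> messages recorded for that channel
        self.recording = {}                     # channel id -> still recording that channel?

    def marker_sending_rule(self):
        self.recorded_state = self.snapshot_local_state()   # record own state first
        for c in self.in_channels:
            self.channel_state[c] = []
            self.recording[c] = True
        for c in self.out_channels:
            self.send(c, "MARKER")              # marker goes out before any further message on c

    def on_message(self, channel, msg):
        if msg == "MARKER":
            if self.recorded_state is None:
                self.marker_sending_rule()      # first marker: record state; this channel is empty
            self.recording[channel] = False     # the state of this channel is now fixed
        else:
            self.deliver(msg)                   # normal application message
            if self.recorded_state is not None and self.recording.get(channel):
                self.channel_state[channel].append(msg)   # message sent before the marker

    def snapshot_local_state(self):
        return {}                               # placeholder for the real local state

    def deliver(self, msg):
        pass                                    # placeholder for application processing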

172
(Refer Slide Time: 28:56)

So, first let us look at the correctness and complexity of this algorithm, and then we will explain it through an example. Correctness: due to the FIFO property of the channels (this is very important, the algorithm assumes FIFO communication channels), no message sent after the marker on a channel is recorded in that channel's state; thus condition C2 is satisfied. When a process pj receives a message mij that precedes the marker on channel Cij, it acts as follows: if process pj has not taken its snapshot yet, then it includes mij in its recorded snapshot; otherwise it records mij in the state of channel Cij. Thus condition C1 is satisfied.

So, this follows from the marker receiving rule and the marker sending rule, and it shows the correctness of the snapshot recording: the algorithm records a consistent global snapshot, or global state. Complexity: the recording part of a single instance of the algorithm requires O(e) messages, where e is the number of edges, and the time complexity is O(d), where d is the diameter of the network.

173
(Refer Slide Time: 30:33)

Now, let us see the properties of the recorded global state. The recorded global state may not correspond to any of the global states that actually occurred during the computation. Consider two possible executions of the snapshot algorithm, shown in the next figure, for the previous money transfer example.

(Refer Slide Time: 30:52)

So, here in this particular example we will see two different cases of snapshot initiation, indicated by a circle for the first one and a square for the other, of the

174
same global snapshot recording algorithm. The first snapshot initiation is shown by the red dots.

(Refer Slide Time: 31:32)

The second initiation is shown by the green dots. We will see that in both cases the algorithm records a consistent global state in the same previous example.

So, first we will see this example, and then the other one next. Site S1 initiates the algorithm just after t1: site S1 records its state as A = $550 over here and sends a marker to site S2. The marker is received by site S2 after t4. When site S2 receives the marker, it records its local state, because it is receiving a marker for the first time; so the first case of the marker receiving rule applies. It records its state as B = $170 and the state of channel C12 as empty, and then, following the marker sending rule, it sends a marker along channel C21. Now, when site S1 receives this marker sent by site S2, it records the state of channel C21 as $80, because it has already recorded its own state; so it takes the second option in the marker receiving rule, and only the channel state is recorded, which is shown as $80. If you sum them, it comes to $800, which is the amount of money the system began with. Hence the $800 in the system is conserved in the global state recorded by this algorithm.

175
(Refer Slide Time: 33:29)

In the second example, which is shown over here as I have explained, the global state recorded is A = $550, recorded by site S1 before it sends the marker according to the marker sending rule. When the marker is received, site S2 initiates the marker receiving rule; it is recording for the first time, so it records the state of B as $170 after time t4, records the state of channel C12 as $0, and then sends a marker to site S1. Site S1, on receiving this marker, takes the second case of the marker receiving rule, because it has already recorded its state in the snapshot.

So now it only records the state of this channel, and that is nothing but C21 carrying the $80; that portion is captured over here. If you sum them, it becomes $800, which is nothing but the starting amount in the system.

176
(Refer Slide Time: 35:05)

Hence the system correctly records a consistent global state. Another example we are going to see is the one where the markers are shown by the green dotted arrows.

(Refer Slide Time: 35:18)

In this example I will explain the working of the Chandy-Lamport algorithm for the green initiation. Here you can see that the state is recorded at S1 before it sends the transfer message: by the marker sending rule S1 records its state, and then the marker is sent on the outgoing channel C12.

177
Now, when the marker is received by site S2, S2 has not yet recorded its state, so it records its state as B = $120, which is the current value of B at S2, and it also records channel C12 as empty; then it sends a marker back to S1. The marker is received by site S1 at this end; since S1 has already recorded its state, it only records the state of channel C21. What does it record in the channel state? It records the $80 that is in transit on C21. If you sum them, it again becomes $800, and this is a consistent global state recorded by the algorithm; the two examples have clearly shown this.

(Refer Slide Time: 37:09)

Now, the properties of the recorded global state: in both possible runs of the algorithm, the recorded global state never actually occurred during the distributed execution. This happens because a process can change its state asynchronously before the markers it sent are received by the other sites and those sites record their states; that is, it is due to the asynchrony of the sender and receiver processes.

So, the snapshot recorded here may not coincide with any state of the distributed execution actually taking place, but the system could have passed through the recorded global state in some equivalent execution: the recorded global state is a valid state of an equivalent execution. Moreover, if a stable property (a property that, once it holds, persists) holds in the system before the snapshot algorithm begins, it holds in the recorded global snapshot.

178
Therefore, the recorded global state is useful for detecting stable properties of a distributed system.

(Refer Slide Time: 38:15)

Now, we are going to see the variants of global snapshot algorithms that are available. As I told you, global snapshot algorithms are very sensitive to the communication channel model, and the snapshot algorithm given by Chandy and Lamport uses the FIFO channel model; as already stated, the Chandy-Lamport algorithm requires the channels to be FIFO. The Chandy-Lamport algorithm uses O(e) messages to record a snapshot and takes O(d) time. Spezialetti and Kearns improved this algorithm to support concurrent initiations, still under the FIFO model.

Then there are algorithms for the non-FIFO model, and designing a snapshot algorithm for a non-FIFO model is not trivial; it is considerably harder. Lai-Yang, Li et al., and Mattern gave three different algorithms that are discussed in the literature. These algorithms either delay the sending of further messages until the messages already sent by a process have been acknowledged by the receiver, or piggyback control information on the messages, in order to know whether the messages that were sent have been delivered

179
at the other end, because in a non-FIFO model one cannot assume anything about the order in which sent messages are delivered.

So, the various algorithms use various techniques: some use acknowledgements, delaying further sends until the acknowledgement comes, or piggyback the acknowledgement, to deal with the asynchrony of the communication channels. Then there are algorithms that use the causal order (CO) model. Causal order is a proper subset of the FIFO model, so if you assume causal order the algorithm becomes even simpler than the Chandy-Lamport algorithm: the algorithm need not send markers on all the outgoing channels, though it requires causal delivery support.

With causal delivery, the channel message contents need not be known, and hence if this model is assumed the global snapshot algorithm becomes considerably simpler. So these are the different variants, where the variations come from the communication model, and we have now seen how the Chandy-Lamport algorithm is adapted or improved under them.

(Refer Slide Time: 42:05)

Conclusion: recording the global state of a distributed system is an important paradigm in the design of distributed systems, and the design of efficient methods for recording the global state is an important issue. This lecture first discussed the formal definition of the global state of a distributed system, and then we discussed the Chandy-Lamport

180
algorithm to record a snapshot of a distributed system. In upcoming lectures we will discuss distributed mutual exclusion algorithms.

Thank you.

181
Distributed Systems
Prof. Rajiv Misra
Department of Computer Science and Engineering
Indian Institute of Technology, Patna

Lecture - 07
Distributed Mutual Exclusion Algorithms and Non-Token Based Approaches

Lecture 7: distributed mutual exclusion algorithms and non-token based approaches.

(Refer Slide Time: 00:18)

(Refer Slide Time: 00:21)

182
Preface; recap of the previous lecture. In the previous lecture we discussed the global state, that is, consistent, strongly consistent, and inconsistent global states, cuts in the space-time diagram, and models of communication. We also saw the snapshot algorithm given by Chandy and Lamport, called the Chandy-Lamport algorithm, to record a global snapshot.

Content of this lecture: in this lecture, we will discuss mutual exclusion algorithms for distributed computing systems, covering the non-token based approach, the quorum based approach, and the token based approach to distributed mutual exclusion.

(Refer Slide Time: 01:12)

Introduction: mutual exclusion in a distributed system. Concurrent accesses by processes to a shared resource or data must be executed in a mutually exclusive manner; that is, only one process at a time is allowed to execute in the critical section. This must be achieved under the constraints of a distributed system, where there is no shared memory, there is no common physical clock, and message delays are unpredictable, though finite.

In that scenario, designing distributed mutual exclusion is a challenging task. So, in this introduction we note that only one process is allowed to execute the critical section at any given time, and that in a distributed system the semaphores, shared variables, and local kernel of a shared-memory system cannot be used to implement mutual exclusion. Hence we are going to discuss how distributed mutual exclusion is implemented with

183
message passing only: message passing is the sole means for implementing the approaches to distributed mutual exclusion.

(Refer Slide Time: 00:31)

Distributed mutual exclusion algorithms must deal with unpredictable message delays and incomplete knowledge of the system state. Three basic approaches for distributed mutual exclusion are covered in this part of the lecture: the non-token based approach, the quorum based approach, and the token based approach.

(Refer Slide Time: 03:01)

184
The non-token based approach consists of two or more successive rounds of message exchanges among the sites to determine which site is allowed to enter the critical section next. For example, we are going to cover two algorithms of this class, known as Lamport's algorithm and the Ricart-Agrawala algorithm.

(Refer Slide Time: 03:26)

Quorum based approach: each site requests permission to execute the critical section from only a subset of the sites, not all of them, and that subset is called a quorum. Any two quorums contain a common site, and this common site is responsible for making sure that only one request executes the critical section at any time. Examples are Maekawa's algorithm and the Agrawal-El Abbadi algorithm, the quorum based algorithms which we are going to cover.

Then next is the token based approach for designing distributed algorithms.

185
(Refer Slide Time: 04:05)

So, here a unique token, also known as the privileged message, is shared among the sites. A site is allowed to enter the critical section if it possesses that token. Mutual exclusion is ensured because there is only one token, and it is unique.

(Refer Slide Time: 04:34)

So, the token based approaches include the Suzuki-Kasami broadcast algorithm and Raymond's tree based algorithm. Now let us cover the preliminaries of the system model. The system consists of N sites, S1 to SN. We assume that a single process runs on each site; the process at site Si is denoted by pi. So, instead of saying sites, we

186
may equivalently speak of the process running on that particular site.

A site can be in one of the following three states: requesting the critical section, executing the critical section, or neither requesting nor executing the critical section, that is, idle. In the requesting-the-critical-section state, the site is blocked and cannot make further requests for the critical section. In the idle state, the site is executing outside the critical section. In token based algorithms, a site can also be in a state where the site holds the token but is executing outside the critical section; this is called the idle token state. At any instant, a site may have several pending requests for the critical section; a site queues up these requests and serves them one at a time.

(Refer Slide Time: 05:47)

Requirements of mutual exclusion algorithms. The primary requirement is the safety property: at any instant, only one process can execute the critical section. This is the most important, essential property.

The next requirement is the liveness property. This property states that deadlock and starvation must be absent; that is, two or more sites should not endlessly wait for messages that will never arrive. This is an important property for ensuring mutual exclusion.

187
The third property is fairness: each process gets a fair chance to execute the critical section. Fairness generally means that critical section requests are executed in the order of their arrival in the system, where the order of arrival is determined by a system of logical clocks. This is also an important property for distributed mutual exclusion algorithms. So, the first one is the essential property, and the others are important properties.

(Refer Slide Time: 07:25)

Now, we are going to see the performance metrics, which are used to compare distributed mutual exclusion algorithms. The first metric is the message complexity: the number of messages required per critical section execution by a site.

The second is the synchronization delay: after a site leaves the critical section, it is the time required before the next site enters the critical section. So, the delay between one site exiting the critical section and the next site entering it is the synchronization delay.

When a site exits the critical section, certain rounds of message exchange are needed to prepare the system so that the next requesting process can be allowed into the critical section, and that is why a synchronization delay is seen in some of the distributed algorithms.

188
(Refer Slide Time: 08:29)

The third performance metric is the response time: the time interval a request waits for its critical section execution to be over after its request messages have been sent out. So, from the instant the request messages are sent out until the site exits the critical section, the entire duration is the response time.

The fourth metric is the system throughput: the rate at which the system executes requests for the critical section. The system throughput can be calculated using the formula throughput = 1 / (SD + E), where SD is the synchronization delay and E is the average critical section execution time. These are the four performance metrics we have seen, and we are going to use them to compare different distributed mutual exclusion algorithms.
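As a quick illustration (the numbers here are assumed purely for the sake of example and are not from the lecture): if the synchronization delay SD is 2 time units and the average critical section execution time E is 8 time units, then the system throughput is 1 / (2 + 8) = 0.1 critical section executions per time unit.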

189
(Refer Slide Time: 09:44)

Now, another performance consideration is the high load and low load situations. We often study the performance of mutual exclusion algorithms under two special loading conditions: low load and high load. The load is determined by the arrival rate of critical section requests. Under low load conditions there is seldom more than one request for the critical section present in the system simultaneously. Under heavy load conditions, there is always a pending request for the critical section at a site.

(Refer Slide Time: 10:25)

190
Non-token based approaches: Lamport's algorithm. Lamport's algorithm is the first in this class of non-token based algorithms. Lamport's distributed mutual exclusion algorithm uses the system of logical clocks designed by Leslie Lamport.

(Refer Slide Time: 10:43)

So, this algorithm is a demonstration of the use of Lamport's clocks in designing a distributed mutual exclusion algorithm. In Lamport's algorithm, requests for the critical section are executed in increasing order of their timestamps, and time is determined by the system of logical clocks. This ensures that the algorithm is fair; that is, it satisfies the fairness property: requests are served in the order in which they arrive, and that order is determined by the logical clocks, as we will see in the algorithm.

Every site Si keeps a queue, called the request queue of process i, which contains the mutual exclusion requests ordered by their timestamps. The algorithm requires the communication channels to deliver messages in FIFO order. Three types of messages are used in this algorithm: REQUEST, REPLY, and RELEASE messages. These messages, carrying timestamps, also update the logical clocks, as we will see in the algorithm.

191
So, the algorithm: Lamport's algorithm, a non-token based approach for distributed mutual exclusion. The first action is requesting the critical section. When a site Si wants to enter the critical section, it broadcasts a REQUEST(tsi, i) message, where tsi is the timestamp of the request and i is the process id; this message is broadcast to all other sites, and Si also places the request on its own request queue.

(Refer Slide Time: 12:50)

When a site Sj receives this REQUEST message from Si, it places Si's request on its own request queue (the request queue of j) and returns a timestamped REPLY message back to Si. These two actions make up the request phase for a process i.

Now, the second part of the algorithm is about executing the critical section. Site Si enters the critical section when the following two conditions hold. Condition L1 says that Si has received a message with a timestamp larger than tsi from all other sites; that means site Si's request has the lowest timestamp and hence the highest priority, which is why it may execute the critical section. Condition L2 says that site Si's request is at the top of its request queue; that is, if you look at the request queue, Si's request will be at the head of the queue. These two conditions together ensure that process i may execute the critical section.

192
The third part of the algorithm says that once a process finishes executing the critical section, it must release it. Releasing the critical section: a site Si, upon finishing the execution of the critical section, removes its request from the top of its request queue and broadcasts a timestamped RELEASE message to all other sites. When a site Sj receives the RELEASE message from Si, it removes Si's request from its own request queue. When a site removes a request from its request queue, its own request may come to the top of the queue, enabling it to enter the critical section if it has a pending request.
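The following is a minimal single-site sketch of these rules in Python, written only to make them concrete; the class name, the send_to transport callback, and the data structures are assumptions for this sketch and are not part of the lecture. The request queue is kept as a min-heap ordered by (timestamp, site id).

import heapq

class LamportSite:
    def __init__(self, my_id, all_ids, send_to):
        self.my_id = my_id
        self.others = [j for j in all_ids if j != my_id]
        self.send_to = send_to            # assumed transport: send_to(dest, msg), FIFO channels
        self.clock = 0                    # Lamport logical clock
        self.queue = []                   # min-heap of (timestamp, site_id) requests
        self.last_seen = {j: 0 for j in self.others}  # latest timestamp seen from each site
        self.my_request = None

    def _tick(self, received_ts=0):
        self.clock = max(self.clock, received_ts) + 1

    def request_cs(self):
        self._tick()
        self.my_request = (self.clock, self.my_id)
        heapq.heappush(self.queue, self.my_request)
        for j in self.others:
            self.send_to(j, ("REQUEST", self.clock, self.my_id))

    def on_message(self, kind, ts, sender):
        self._tick(ts)
        self.last_seen[sender] = ts
        if kind == "REQUEST":
            heapq.heappush(self.queue, (ts, sender))
            self.send_to(sender, ("REPLY", self.clock, self.my_id))
        elif kind == "RELEASE":
            self.queue = [r for r in self.queue if r[1] != sender]   # drop sender's request
            heapq.heapify(self.queue)
        # a REPLY only updates the clock and last_seen

    def can_enter_cs(self):
        # L1: a message with timestamp larger than our request seen from every other site
        # L2: our own request is at the head of the request queue
        return (self.my_request is not None
                and self.queue and self.queue[0] == self.my_request
                and all(self.last_seen[j] > self.my_request[0] for j in self.others))

    def release_cs(self):
        self.queue = [r for r in self.queue if r != self.my_request]
        heapq.heapify(self.queue)
        self.my_request = None
        self._tick()
        for j in self.others:
            self.send_to(j, ("RELEASE", self.clock, self.my_id))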

Now, correctness. Theorem: Lamport's algorithm achieves mutual exclusion. The proof is by contradiction. Suppose two sites Si and Sj are executing in the critical section concurrently at the same instant. For this to happen, conditions L1 and L2 must hold at both sites. This implies that at some instant in time, say t, both Si and Sj have their own requests at the top of their request queues, and condition L1 holds at both of them.

Without loss of generality, assume that Si's request has a smaller timestamp than Sj's request. From condition L1 and the FIFO property of the communication channels, it is clear that at instant t the request of Si must be present in the request queue of Sj. This implies that Sj's request is at the top of its own request queue while a request with a smaller timestamp, namely Si's request, is present in the same queue, which is a contradiction. Hence it is proved that Lamport's algorithm achieves mutual exclusion.

Another theorem about Lamport's algorithm concerns fairness: Lamport's algorithm is fair. I told you in the beginning that this is a fair algorithm. The proof is again by contradiction. Suppose site Si's request has a smaller timestamp than the request of another site Sj, and yet Sj is able to execute the critical section before Si. For Sj to execute the critical section, it has to satisfy conditions L1 and L2. This implies that at some instant in time t, Sj has its own request at the top of its queue and has received a message with a timestamp larger than the timestamp of its request from all other sites. But the request queue at a site is ordered by timestamp, and according to the assumption Si

193
has the lower timestamp, so Si's request must be placed ahead of Sj's request in the request queue, which is a contradiction. This proves that Lamport's algorithm is fair.

Lamport's algorithm, example. Let there be three sites S1, S2, and S3 participating in the distributed mutual exclusion. Sites S1 and S2 are requesting the critical section, so S1 broadcasts its request along these arrows, and S2 also broadcasts its request to all other sites.

Now, S2, after receiving the request from S1, compares it with the timestamp of its own request and finds that the incoming request has the lower timestamp, so it sends a REPLY back to S1; similarly, S3, which is neither requesting nor executing the critical section, also sends a REPLY. S1 thus receives replies from all other sites and enters the critical section. When site S1 exits the critical section, it has to send the RELEASE message; the RELEASE message is sent to all other sites, and after receiving it, site S2 finds its own request at the top of the queue.

Having also received the reply messages, including the one from S1, S2 can now go into the critical section. Looking at the performance of Lamport's algorithm: for each critical section execution, it requires (N - 1) REQUEST messages, (N - 1) REPLY messages, and (N - 1) RELEASE messages, so Lamport's algorithm requires 3(N - 1) messages per critical section invocation.

The synchronization delay of this algorithm is T: you can see that once a process such as S1 exits the critical section, only the RELEASE message needs to flow, which takes time T, before S2 can enter the critical section. So the synchronization delay in this algorithm is T.

Optimization: in Lamport's algorithm, REPLY messages can be omitted under certain conditions. For example, if site Sj receives a REQUEST message from Si after it has sent its own REQUEST message with a timestamp higher than the timestamp of site Si's request, then site Sj need not send a REPLY message to site Si. This is because when Si

194
receives site Sj's request with a timestamp higher than its own, it can conclude that site Sj does not have any smaller-timestamp request which is still pending. With this optimization, Lamport's algorithm requires between 3(N - 1) and 2(N - 1) messages per critical section invocation.

Now, the next algorithm we are going to discuss in the non-token based approach is the Ricart-Agrawala algorithm. The Ricart-Agrawala algorithm also assumes that the communication channels are FIFO; the same assumption is made here. The algorithm uses only two types of messages, REQUEST and REPLY; it does not use a RELEASE message. A process sends a REQUEST message to all other processes to request their permission to enter the critical section, and a process sends a REPLY message to a process to give it its permission.

Processes use Lamport-style logical clocks to assign timestamps to critical section requests, and the timestamps are used to decide the priority of requests. Each process pi maintains a data structure called the request-deferred array, RDi, whose size is the same as the number of processes (sites) in the system.

Initially, for all i and j, RDi[j] is initialized to zero. Whenever pi defers the request sent by pj, it sets RDi[j] = 1, and after it has sent a REPLY back to pj, it resets RDi[j] = 0. Let us see the use of these deferred replies in the Ricart-Agrawala algorithm.

Description of the algorithm. The first step is requesting the critical section. When a site Si wants to enter the critical section, it broadcasts a timestamped REQUEST message to all other sites. When Sj receives a REQUEST message from site Si, it sends a REPLY to site Si if site Sj is neither requesting nor executing the critical section, or if site Sj is requesting and Si's request timestamp is smaller than site Sj's own request timestamp; otherwise the reply is deferred, and Sj sets RDj[i] = 1.

The second step of the algorithm is executing the critical section: site Si enters the critical section after it has received a REPLY message from every site to which it sent a REQUEST message. The third step is releasing the critical section: when site Si exits the critical section, it sends all the deferred REPLY messages; for all j, if RDi[j] = 1, it sends a REPLY message to Sj and sets RDi[j] = 0. Note that when a site receives a message, it updates its clock using the

195
timestamp piggybacked in the message, and when Si takes up a request for the critical section for processing, it updates its local clock and assigns a timestamp to the request.
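A minimal single-site sketch in Python of the rules just described, written as an illustration; the class name, the send_to callback, and the message tuple formats are assumptions for this sketch, not part of the lecture.

class RASite:
    def __init__(self, my_id, all_ids, send_to):
        self.my_id = my_id
        self.others = [j for j in all_ids if j != my_id]
        self.send_to = send_to                   # assumed transport: send_to(dest, msg)
        self.clock = 0                           # Lamport-style logical clock
        self.rd = {j: 0 for j in self.others}    # request-deferred array RD_i
        self.my_request_ts = None                # timestamp of our pending request
        self.replies_pending = set()
        self.in_cs = False

    def request_cs(self):
        self.clock += 1
        self.my_request_ts = (self.clock, self.my_id)
        self.replies_pending = set(self.others)
        for j in self.others:
            self.send_to(j, ("REQUEST", self.my_request_ts))

    def on_request(self, sender, ts):
        self.clock = max(self.clock, ts[0]) + 1
        requesting = self.my_request_ts is not None
        # reply if we are idle, or if the incoming request has higher priority
        if not self.in_cs and (not requesting or ts < self.my_request_ts):
            self.send_to(sender, ("REPLY", self.my_id))
        else:
            self.rd[sender] = 1                  # defer the reply

    def on_reply(self, sender):
        self.replies_pending.discard(sender)
        if self.my_request_ts is not None and not self.replies_pending:
            self.in_cs = True                    # all replies received: enter the critical section

    def release_cs(self):
        self.in_cs = False
        self.my_request_ts = None
        for j in self.others:
            if self.rd[j] == 1:                  # send the deferred replies
                self.rd[j] = 0
                self.send_to(j, ("REPLY", self.my_id))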

Now, correctness: the Ricart-Agrawala algorithm achieves mutual exclusion. The proof is by contradiction. Suppose two sites Si and Sj are executing the critical section concurrently, and Si's request has higher priority than the request of Sj. Clearly, Si received Sj's request after it had made its own request. Thus Sj can execute the critical section concurrently with Si only if Si returns a REPLY to Sj, in response to Sj's request, before Si exits the critical section. However, this is impossible as per the algorithm, because Sj's request has the lower priority, so Si defers it. This contradicts the assumption that Si and Sj are both executing the critical section; therefore, the Ricart-Agrawala algorithm achieves mutual exclusion.

Ricart-Agrawala algorithm, example. Assume that sites S1 and S2 are requesting the critical section, and each sends its request message to all other processes.

You can see that when S1 receives the request from S2, the timestamp of S2's request is higher than that of S1's own request, so S1 has the higher priority; it therefore defers S2's request and sets the corresponding entry of its request-deferred array to 1.

After receiving the replies from all other sites (these are the REPLY messages), site S1 goes into the critical section. When site S1 exits the critical section, it sends replies to the deferred requests; that is, it sends a REPLY for the request it had deferred.

Now S2, which also wants to enter the critical section, has at this point received replies from all other sites, so it enters the critical section. Performance of this algorithm: for each critical section execution, the Ricart-Agrawala algorithm requires only two types of messages, REQUEST and REPLY, with (N - 1) REQUEST messages and (N - 1) REPLY messages; so the total number of messages required per critical section execution in the Ricart-Agrawala algorithm is only 2(N - 1).

196
The synchronization delay is the same as before: you can see in this picture that once site S1 has come out of the critical section, it takes time T for the deferred REPLY messages to reach the next process that has requested the critical section, which can then enter it. So the synchronization delay of this algorithm is T, as expressed here.

So, the conclusion: mutual exclusion is a fundamental problem in distributed computing systems, where concurrent accesses to a shared resource or data must be serialized. Mutual exclusion in a distributed system requires not only that one process at a time be allowed to execute the critical section, but also that fairness be ensured. In this lecture, we discussed the concepts of distributed mutual exclusion, and we saw the non-token based approaches, namely Lamport's algorithm and the Ricart-Agrawala algorithm, for achieving distributed mutual exclusion. In the upcoming lecture, we will discuss the other schemes, the quorum based and token based approaches.

Thank you.

197
Distributed Systems
Prof. Rajiv Misra
Department of Computer Science and Engineering
Indian Institute of Technology, Patna

Lecture – 08
Quorum Based Distributed Mutual Exclusion Algorithms

(Refer Slide Time: 00:20)

Lecture 8, quorum based distributed mutual exclusion algorithms. Preface: recap of the previous lecture. In the previous lecture, we presented the concept of distributed mutual exclusion algorithms for distributed computing systems. We also discussed the non-token based approaches, that is, Lamport's algorithm and the Ricart-Agrawala algorithm. We have seen that the mutual exclusion problem is fundamental to distributed systems.

Mutual exclusion ensures that concurrent accesses of processes to shared resources or data are serialized. The problem states that only one process is allowed to execute the critical section at any point of time. In a distributed system, shared variables, semaphores, or local variables managed through the kernel cannot be used, because there is no common memory in a distributed system.

198
(Refer Slide Time: 01:21)

Content of this lecture: in this lecture we will discuss quorum based approaches for distributed mutual exclusion. In particular, we will discuss two such approaches, namely Maekawa's algorithm and the Agrawal-El Abbadi quorum based algorithm.

(Refer Slide Time: 01:42)

Introduction: in the quorum based approach, each site requests permission to execute the critical section from only a subset of the sites, called the quorum; that is, permission does not have to be obtained from all the sites. This is the

199
deviation from the algorithms of the previous lecture, where all the sites had to grant permission before a request could execute the critical section.

The quorums are formed in such a way that when two sites concurrently request access to the critical section, at least one site receives both requests, and this site is responsible for making sure that only one request executes the critical section at any point of time.

(Refer Slide Time: 02:35)

Quorum based mutual exclusion algorithms. Quorum based mutual exclusion algorithms differ from the earlier ones in the following two ways.

First, a site does not request permission from all other sites, but only from a subset of the sites, called the request set of that site. The request sets are chosen in such a way that for any i and j, 1 <= i, j <= N, the intersection of Ri and Rj is non-empty; consequently, every pair of sites has a common site which mediates conflicts between that pair.

Second, a site can send out only one REPLY message at a time: it can send a REPLY message only after it has received a RELEASE message for the previous REPLY message. The algorithms are based on the notions of coteries and quorums.

200
(Refer Slide Time: 03:30)

We next describe the idea of coteries and quorums. A coterie C is defined as a set of sets, where each set g in C is called a quorum. The following properties hold for the quorums in a coterie.

The first is the intersection property: for every pair of quorums g, h in the coterie, g ∩ h ≠ ∅. For example, the sets {1, 2, 3}, {2, 5, 7}, and {5, 7, 9} cannot be quorums in a coterie, because the first and third sets do not have a common element. The second is the minimality property: there should be no quorums g and h in coterie C such that g is a superset of h. For example, the sets {1, 2, 3} and {1, 3} cannot both be quorums in a coterie, because the first set is a superset of the second one.
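As a quick illustrative check (not from the lecture; the sample coteries below are made up), the two properties can be tested mechanically:

from itertools import combinations

def is_coterie(quorums):
    quorums = [set(q) for q in quorums]
    for g, h in combinations(quorums, 2):
        if not (g & h):                     # intersection property violated
            return False
        if g <= h or h <= g:                # minimality property violated
            return False
    return True

print(is_coterie([{1, 2, 3}, {1, 4, 5}, {3, 5, 6}]))   # True for this example
print(is_coterie([{1, 2, 3}, {1, 3}]))                 # False: a superset violates minimality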

201
(Refer Slide Time: 04:59)

So, these are the two properties, the intersection property and the minimality property, required for the sets of a coterie to qualify as quorums. Coteries and quorums can be used to develop algorithms that ensure mutual exclusion in a distributed environment. A simple protocol works as follows: let a be a site in a quorum A; if site a wants to invoke mutual exclusion, it requests permission from all the sites in its quorum A.

Every site does the same to invoke mutual exclusion. Due to the intersection property, quorum A contains at least one site that is common to the quorum of every other site, and these common sites send permission to only one site at a time. Thus mutual exclusion is guaranteed. Note that the minimality property ensures efficiency rather than correctness. You can see this aspect as follows: suppose sites Si and Sj are both requesting to enter the critical section, with request sets Ri and Rj respectively. By the intersection property there must be some site, say Sk, common to both request sets, and this Sk mediates between Si and Sj when both want to enter the critical section: it releases only one reply, that is, one permission, at a time. So the intersection property guarantees that only one site at a time is allowed to execute the critical section.

202
Now, the other property, the minimality property, ensures efficiency rather than correctness: the size of the request set Ri of a process i is kept small, so that mutual exclusion is achieved efficiently, with a small number of messages.

(Refer Slide Time: 07:22)

To achieve mutual exclusion in a simple and efficient manner, we look at Maekawa's algorithm. Maekawa's algorithm was the first quorum based mutual exclusion algorithm. The request sets of the sites, that is, the quorums in Maekawa's algorithm, are constructed to satisfy the following four conditions. Condition M1 says that the request sets of any two sites have a non-empty intersection; that is the intersection property.

Condition M2 says that every site is a member of its own request set. Condition M3 says that the size of every request set is K. Condition M4 enforces that any site Sj is contained in exactly K request sets Ri. Maekawa used the properties of projective planes and showed that N = K(K - 1) + 1, which gives a request set of size K, approximately the square root of N.

So, here we can see that the size of the request set is reduced in Maekawa's algorithm, which is why a site needs to send its request to, and obtain permission from, only a subset of the sites in order to enter the critical section.
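As a small illustration, one possible set of Maekawa request sets for N = 7 and K = 3 is checked below against conditions M1-M4 (the specific sets are assumed for illustration, corresponding to the lines of a projective plane of order 2; they are not given in the lecture):

R = {1: {1, 2, 3}, 2: {2, 4, 6}, 3: {3, 4, 7}, 4: {1, 4, 5},
     5: {2, 5, 7}, 6: {3, 5, 6}, 7: {1, 6, 7}}
N, K = 7, 3

assert all(R[i] & R[j] for i in R for j in R)                         # M1: pairwise intersection
assert all(i in R[i] for i in R)                                      # M2: Si is in Ri
assert all(len(R[i]) == K for i in R)                                 # M3: |Ri| = K
assert all(sum(j in R[i] for i in R) == K for j in range(1, N + 1))   # M4: each site in K sets
assert N == K * (K - 1) + 1                                           # N = K(K - 1) + 1
print("All Maekawa conditions hold for this construction.")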

203
(Refer Slide Time: 09:00)

Conditions M1 and M2 are necessary for correctness, whereas M3 and M4 provide other desirable features of the algorithm. Condition M3 states that the sizes of the request sets of all the sites must be equal, implying that all sites have to do an equal amount of work to invoke mutual exclusion.

Condition M4 enforces that exactly the same number of sites should request permission from any given site, implying that all sites have equal responsibility in granting permission to other sites. These conditions guarantee the other desirable features of the mutual exclusion algorithm given by Maekawa.

204
(Refer Slide Time: 09:59)

Now we describe the algorithm. A site Si executes the following steps to execute the critical section. The first step is requesting the critical section: a site Si requests access to the critical section by sending REQUEST(i) messages to all the sites in its request set Ri. When a site Sj receives the REQUEST(i) message, it sends a REPLY message to Si provided it has not sent a REPLY message to any site since the receipt of the last RELEASE message; otherwise, it queues up the REQUEST(i) for later consideration. That concludes part one, requesting the critical section.

The second part of the algorithm is executing the critical section: site Si executes the critical section only after it has received a REPLY message from every site in Ri.

205
(Refer Slide Time: 11:03).

Releasing the critical section: after the execution of the critical section is over, site Si sends a RELEASE(i) message to every site in Ri. When a site Sj receives the RELEASE(i) message from site Si, it sends a REPLY message to the next site waiting in its queue and deletes that entry from the queue; if the queue is empty, then the site updates its state to reflect that it has not sent out any REPLY message since the receipt of the last RELEASE message.
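A minimal sketch of these basic rules for one site in Python (illustrative only; the class name and the send_to transport are assumptions, and the deadlock handling with FAILED, INQUIRE, and YIELD messages described later is deliberately omitted here):

from collections import deque

class MaekawaSite:
    def __init__(self, my_id, request_set, send_to):
        self.my_id = my_id
        self.Ri = request_set               # this site's quorum (includes my_id)
        self.send_to = send_to              # assumed transport: send_to(dest, msg)
        self.voted = False                  # granted a REPLY since the last RELEASE?
        self.waiting = deque()              # queued requests awaiting our REPLY
        self.replies = set()                # REPLYs collected for our own request

    def request_cs(self):
        self.replies.clear()
        for j in self.Ri:
            self.send_to(j, ("REQUEST", self.my_id))

    def on_request(self, sender):
        if not self.voted:
            self.voted = True               # grant our single outstanding permission
            self.send_to(sender, ("REPLY", self.my_id))
        else:
            self.waiting.append(sender)     # queue the request for later consideration

    def on_reply(self, sender):
        self.replies.add(sender)
        if self.replies == set(self.Ri):
            pass                            # all of Ri have replied: enter the critical section

    def release_cs(self):
        for j in self.Ri:
            self.send_to(j, ("RELEASE", self.my_id))

    def on_release(self, sender):
        if self.waiting:
            nxt = self.waiting.popleft()    # pass the permission to the next waiter
            self.send_to(nxt, ("REPLY", self.my_id))
        else:
            self.voted = False              # no one waiting; the permission is free again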

(Refer Slide Time: 11:37)

206
Now, correctness: Maekawa's algorithm achieves mutual exclusion. The proof is by contradiction. Suppose two sites Si and Sj are executing the critical section concurrently. This means that site Si received a REPLY message from all the sites in its request set Ri, and concurrently Sj received REPLY messages from all the sites in Rj.

By the intersection property, there is a common site Sk in Ri ∩ Rj. For both executions to proceed, Sk must have sent REPLY messages to both Si and Sj concurrently. But Sk, being the common site, is responsible for ensuring mutual exclusion and sends only one REPLY at a time between RELEASEs; sending replies to both sites would mean allowing two sites into the critical section concurrently, which is a contradiction.

(Refer Slide Time: 12:50)

Now, the performance: since the size of a request set is root N, an execution of the critical section requires root N REQUEST messages, root N REPLY messages, and root N RELEASE messages, so a total of 3 root N messages per critical section execution is required by Maekawa's algorithm. The synchronization delay of this algorithm is 2T: after site Si exits the critical section, it first sends RELEASE messages to all the sites in Ri, and then one of those sites sends a REPLY message to the next site.

So a RELEASE message followed by a REPLY message each take T time, that is, 2T in total, before the next site is allowed in.

207
(Refer Slide Time: 13:48)

That is, 2T elapse before the next waiting process can go into the critical section; hence the synchronization delay is 2T. Maekawa's algorithm, however, can suffer from deadlock. Maekawa's algorithm can deadlock because a site can be exclusively locked by other sites and requests are not prioritized by their timestamps. Assume three sites Si, Sj, and Sk simultaneously invoke mutual exclusion.

Suppose the request sets of Si and Sj intersect in the common site Sij, those of Sj and Sk in the common site Sjk, and those of Sk and Si in the common site Ski. Now consider the following situation: Sij has been locked by Si, forcing Sj to wait at Sij; Sjk has been locked by Sj, forcing Sk to wait at Sjk; and Ski has been locked by Sk, forcing Si to wait at Ski.

So there is a set of processes waiting for each other to unlock, and this state represents a deadlock involving the three sites Si, Sj, and Sk.

208
(Refer Slide Time: 15:22).

So, how is deadlock handled in Maekawa's algorithm? Maekawa's algorithm handles deadlock by requiring a site to yield a lock if the timestamp of its request is larger than the timestamp of some other request waiting for the same lock.

A site suspects a deadlock, and initiates message exchanges to resolve it, whenever a higher priority request arrives and has to wait at a site because that site has already sent a REPLY message to a lower priority request.

(Refer Slide Time: 15:57)

209
Deadlock is detected and broken using three kinds of messages. The first is the FAILED message: a FAILED message from site Si to site Sj indicates that Si cannot grant Sj's request because it has currently granted permission to a site with a higher priority request. In that case the FAILED message indicates that it is not possible to grant the permission.

The second type of message for handling deadlock is INQUIRE: an INQUIRE message from Si to Sj indicates that Si would like to find out from site Sj whether it has succeeded in locking all the sites in its request set. The third kind is the YIELD message: a YIELD message from site Si to Sj indicates that Si is returning the permission to Sj, so that Sj can yield it to a higher priority request.

(Refer Slide Time: 17:00)

Now, handling the deadlock: Maekawa's algorithm handles deadlock as follows. When a REQUEST(ts, i) from site Si blocks at site Sj, because Sj has currently granted permission to another site

210
(Refer Slide Time: 17:29)

Sk, then Sj sends a FAILED message to Si if Si's request has lower priority; otherwise, Sj sends an INQUIRE message to site Sk. We can see this in the illustrating diagram: when a request from Si blocks at site Sj because Sj has currently granted permission to site Sk, Sj sends a FAILED message to Si if Si's request has the lower priority, and otherwise Sj sends an INQUIRE message to site Sk.

That is, if the timestamp of Si's request is higher than the timestamp of Sk's request, then Si's is the lower priority request, and a FAILED message is generated in this case. If instead the incoming request of Si has higher priority than Sk's request, that is, the timestamp of Si's request is smaller than the timestamp of Sk's request, then Sj sends an INQUIRE message to Sk, asking it to review the previously sent REPLY.

211
(Refer Slide Time: 19:08)

In response to the INQUIRE message from site Sj, site Sk sends a YIELD message to Sj, provided it has not yet succeeded in locking all the sites in its request set. After receiving the YIELD message, Sj assumes that the lock has been released by Sk.

(Refer Slide Time: 19:22)

Consequently, site Sj sends a grant (reply) message to the site whose request is at the top of its request queue, which here is Si.
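To make the flow of the FAILED, INQUIRE and YIELD messages concrete, here is a minimal Python sketch, under stated assumptions, of the decision an arbiter site Sj takes when a request arrives while its lock is already granted. The names (ArbiterSite, on_request, granted_to, the send callback) are illustrative assumptions for this sketch, not part of Maekawa's original presentation.

```python
# Sketch of Maekawa's deadlock-handling decision at one arbiter site Sj.
# A request is a (timestamp, site_id) pair; a smaller pair has higher priority.
import heapq

class ArbiterSite:
    def __init__(self, site_id, send):
        self.site_id = site_id
        self.send = send            # send(msg_type, dest) supplied by the runtime (assumed)
        self.granted_to = None      # request currently holding this site's lock
        self.waiting = []           # min-heap of waiting (timestamp, site) requests

    def on_request(self, ts, requester):
        """Handle REQUEST(ts, requester) arriving at this arbiter."""
        req = (ts, requester)
        if self.granted_to is None:
            self.granted_to = req
            self.send("REPLY", requester)
        else:
            heapq.heappush(self.waiting, req)
            if req > self.granted_to:
                # Lower priority than the granted request: it must wait.
                self.send("FAILED", requester)
            else:
                # Higher priority: ask the current lock holder whether it has
                # already locked its whole quorum, so it can YIELD if not.
                self.send("INQUIRE", self.granted_to[1])

    def on_yield(self, _from_site):
        """The current holder yielded: regrant the lock to the best waiter."""
        heapq.heappush(self.waiting, self.granted_to)
        self.granted_to = heapq.heappop(self.waiting)
        self.send("REPLY", self.granted_to[1])
```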

212
(Refer Slide Time: 19:46)

And hence the deadlock is resolved using these three messages. The next algorithm we are going to cover is the Agarwal-El Abbadi quorum-based algorithm. Agarwal and El Abbadi developed a simple and efficient mutual exclusion algorithm by introducing tree quorums.

They gave a novel algorithm for constructing tree-structured quorums, so here we are going to understand what a tree-structured quorum is. The mutual exclusion algorithm is independent of the underlying topology of the network, and there is no need for any additional communication facility such as multicasting; however, such a facility would improve the performance. The mutual exclusion algorithm given by Agarwal and El Abbadi assumes that the sites in the distributed system are organized into a structure such as a tree, a grid, or a binary tree.

It is also assumed that there exists a routing mechanism to exchange messages between the different sites in the system.

213
(Refer Slide Time: 21:13)

In this structured organization of the topology, the Agarwal-El Abbadi quorum-based algorithm uses tree-structured quorums. All the sites in the system are logically organized into a complete binary tree. For a complete binary tree of level K, we have 2^(K+1) - 1 sites, with the root at the top level K and the leaves at level 0. We can understand this with the example here: this is a complete binary tree in which the leaves are at level 0, the next level is level 1, and the root is at level 2.

Now here we have how many sites? 2k +1 – 1 sites. So, here the total number of level is 2,
2 + 1 that is 3, 3 - 1 is there is total 7. 7 different nodes are participating in this particular
case. And the root is at the level K. The number of sites in a path from root to the leaf is
equal to the level of the tree (K+1).

So, this means that the number of sites on a path from the root to a leaf is equal to K + 1; K here is 2, so K + 1 is 3, and there are 3 sites on any root-to-leaf path. That is nothing but O(log n). A path in the binary tree is a sequence a1, a2, ..., ai, ai+1, ..., ak such that ai is the parent of ai+1.

214
(Refer Slide Time: 23:33)

Algorithm for constructing the tree-structured quorum: we have seen that a quorum is nothing but a root-to-leaf path in the tree. How such a quorum is constructed in the Agarwal-El Abbadi algorithm is specified through the algorithm given here. The algorithm uses two functions: the first is called GrantsPermission and the second is the GetQuorum function.

The GetQuorum function gives you a path, and that path is nothing but the quorum, a tree-based quorum. It is a recursive process: starting from the root of the tree, it goes to the left child, then to its left child, and so on, recursively building the path; or it may go down the right side instead. If the path lies on the right side of the tree, it traverses that side in the same recursive manner and fetches the path.

While the algorithm is traversing the tree and collecting the nodes of a path, every node has to grant permission, returning true; that is why GrantsPermission is required. If the permission is not granted, that is, it returns false because of a failure or some other reason, then that node is considered a failed node, and the algorithm instead collects a path around the failed node, that is, a path in each of its subtrees. If a leaf node has failed, the quorum formation is unsuccessful. We are going to see this in the following explanation.

215
(Refer Slide Time: 25:36)

The algorithm for constructing the tree-structured quorum uses the two functions GetQuorum(tree) and GrantsPermission(site), and assumes that there is a well-defined root for the tree. GetQuorum is a recursive function: as I told you, it takes a tree node x as a parameter and calls GetQuorum on a child of x, provided that GrantsPermission grants access to x; that is what we have just explained.

(Refer Slide Time: 26:15)

Now the algorithm tries to construct the quorums in a way that each Quorum represents
any path from root to the leaf. That is in this case no failure Quorum is any set a 1 a 2

216
..., ak}, where a1 is the root, ak is a leaf, and between any two consecutive nodes ai is the parent of ai+1. If the algorithm fails to find such a path, control goes to the else block of the algorithm, which specifies that the failed node x is substituted by two paths, one starting at its left child and one at its right child, as we will explain. The sets constructed using this algorithm are called tree quorums.
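A minimal Python sketch of this construction, under the assumption that the tree is represented as linked nodes and that GrantsPermission simply reports whether a site has failed, is given below. It mirrors the GetQuorum/GrantsPermission structure described above; the exact representation is illustrative.

```python
# Sketch of tree-quorum construction (Agarwal-El Abbadi).  A quorum is a
# root-to-leaf path; a failed internal node is replaced by one path from each
# of its two subtrees, and a failed leaf makes the attempt fail.

class Node:
    def __init__(self, site, left=None, right=None):
        self.site, self.left, self.right = site, left, right

def grants_permission(node, failed):
    # Assumed stand-in for GrantsPermission(site): true if the site is up.
    return node.site not in failed

def get_quorum(tree, failed):
    """Return a tree quorum as a set of sites, or raise if none can be formed."""
    if tree is None:
        return set()
    is_leaf = tree.left is None and tree.right is None
    if grants_permission(tree, failed):
        if is_leaf:
            return {tree.site}
        for child in (tree.left, tree.right):     # prefer the left path
            try:
                return {tree.site} | get_quorum(child, failed)
            except RuntimeError:
                continue
        raise RuntimeError("quorum cannot be formed")
    if is_leaf:                                    # a failed leaf is fatal
        raise RuntimeError("quorum cannot be formed")
    # Failed internal node: substitute it by a path in each of its subtrees.
    return get_quorum(tree.left, failed) | get_quorum(tree.right, failed)
```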

(Refer Slide Time: 27:05)

Now, the analysis of this algorithm. In the best case scenario, the algorithm takes O(log n) sites to form a tree quorum, compared with the O(√N)-sized quorums we saw in Maekawa's algorithm. In certain cases, even in the event of failures, O(log n) sites are sufficient to form a tree quorum. If a site that is the parent of a leaf node fails, the number of sites necessary for the quorum is still O(log n); thus the algorithm requires few messages. In a relatively fault-free environment it can tolerate the failure of up to n - O(log n) sites and still form a tree quorum.

In the worst case, the algorithm requires a majority of the sites to construct the tree quorum, and this number is the same in all cases, with or without faults. The worst case tree quorum size is shown to be O((n+1)/2) by induction.

217
(Refer Slide Time: 28:06)

Now, we are going to see the example of a tree based quorum. So, here this is a complete
binary tree consisting of 15 sites.

(Refer Slide Time: 28:12)

Now, any root-to-leaf path here is a tree quorum. The first tree quorum is {1, 2, 4, 8}, the second tree quorum is the next such path, the third the next, and so on. All these cases are listed, and there are 8 different root-to-leaf paths, each of which is a tree quorum in this example.

218
(Refer Slide Time: 28:52)

Now, consider the case when a particular node fails; that is, failures are also introduced while constructing the quorums.

Now, if node 3 fails, its subtree is partitioned into two subtrees, rooted at 6 and at 7. The quorums whose paths are not affected by the partition remain the same, so the 4 quorums on the left side are the same as we have seen previously. On the right side of the tree, where node 3 has failed, a quorum is constructed as node 1 followed by the root of the left split, that is 6.

So, we take the root of the main tree, then a root-to-leaf path in each of the two partitioned subtrees: {1, 6, 12, 7, 14}, {1, 6, 12, 7, 15}, {1, 6, 13, 7, 14} and {1, 6, 13, 7, 15}. All these cases are listed here. So, even when node 3 fails, quorums can still be formed; 8 quorums are formed in this particular case.
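Reusing the Node and get_quorum sketch given earlier (still under the same illustrative assumptions), the 15-site example can be reproduced as follows; node i has children 2i and 2i+1, matching the figure.

```python
# Build the 15-node complete binary tree of the example (root 1, leaves 8..15).
def build(i, n=15):
    return None if i > n else Node(i, build(2 * i, n), build(2 * i + 1, n))

tree = build(1)
print(get_quorum(tree, failed=set()))   # one of the 8 quorums, e.g. {1, 2, 4, 8}
print(get_quorum(tree, failed={3}))     # node 3 down: a left-side quorum still works
print(get_quorum(tree, failed={2}))     # node 2 down: substituted by two paths,
                                        # e.g. {1, 4, 8, 5, 10}
```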

219
(Refer Slide Time: 30:26)

Now, the property of graceful degradation. Graceful degradation means that even if some nodes have failed, you can still form quorums. Since the number of nodes from the root to a leaf in an N-node complete tree is log N, the best case for quorum formation, that is, the least number of nodes needed for a quorum, is log N. So, when the number of node failures is greater than or equal to log N, the algorithm may not be able to form a tree-structured quorum.

So, as long as the number of site failures is less than log N, the tree quorum algorithm guarantees the formation of a quorum, and it thus exhibits the property called graceful degradation. We now turn to the mutual exclusion algorithm itself.

220
(Refer Slide Time: 31:15)

We now explain the mutual exclusion algorithm: a site s enters the critical section as follows. Site s sends a REQUEST message to all the other sites in the tree-structured quorum it belongs to. Each site in the quorum stores incoming requests in a request queue ordered by their timestamps. A site sends a REPLY message, indicating its consent to enter the critical section, only to the request at the head of its request queue, which has the lowest timestamp.

If site s gets a REPLY message from all the sites in the structured quorum it belongs to, it enters the critical section.

(Refer Slide Time: 32:00)

221
We can understand this using the example illustrated here. There are three different sites, S1, S2 and S3, and we assume that S2 and S3 are part of the tree quorum of site S1.

In the next step, site S1 sends a REQUEST message to its tree quorum sites, that is, S2 and S3. When these requests arrive, they are recorded in the request queues of site S2 and site S3, and in this case site S1's request is at the head of each request queue, having the lowest timestamp.

Each site then sends a REPLY back for the request at the head of its queue. When site S1 receives the replies from all the sites in its tree quorum, S1 enters the critical section, because it has been granted permission to do so. When S1 exits the critical section, it sends a RELINQUISH message to its tree quorum sites, that is, to S2 and S3. After getting the RELINQUISH message, S2 and S3 remove S1's request from their request queues.

(Refer Slide Time: 33:47)

Now, if a new request arrives with a timestamp smaller than that of the request at the head of the queue, an INQUIRE message is sent to the process whose request is at the head of the queue, and the site waits for a YIELD or RELINQUISH message. This is how deadlocks are dealt with, just as we have seen in Maekawa's algorithm. So, the INQUIRE

222
message is sent to the process whose request is at the head of the queue, and the site waits for the YIELD or the RELINQUISH message.

(Refer Slide Time: 34:22)

Now, when site S3 receives the INQUIRE message, it acts as follows. If S3 has already acquired all the replies it needs to access the critical section, then it simply ignores the INQUIRE message, proceeds normally, and sends a RELINQUISH message after it exits the critical section.

(Refer Slide Time: 34:39)

223
If S3 has not yet collected all the replies from its quorum, then it sends a YIELD message to the inquiring site.

(Refer Slide Time: 34:49)

When a site gets a YIELD message, it puts the pending request on behalf of which the INQUIRE message was sent at the head of its queue, as shown here, and sends a REPLY message to that requester.

(Refer Slide Time: 35:12)

224
So, after getting all the replies, site S1 enters the critical section. Now, the correctness proof of this algorithm: mutual exclusion is guaranteed because the set of quorums satisfies the intersection property.

Consider a coterie C which consists of the quorums {1, 2, 3}, {2, 4, 5} and {4, 1, 6}. Suppose sites 3, 5 and 6 want to enter the critical section. Site 3 sends requests to sites 1 and 2, the other members of its quorum; site 5 sends requests to sites 2 and 4; and site 6 sends requests to sites 1 and 4.

Now, suppose site 3's request arrives at site 2 before the request of site 5. Then site 2 grants permission to site 3's request and rejects site 5's request. Similarly, site 3's request arrives at site 1 before site 6's request, so site 1 grants permission to site 3 and rejects site 6's request.

Here we can see that sites 5 and 6 did not get consent from all the sites in their quorums; only site 3 gets consent from all the sites in its quorum. So, site 3 alone acquires its quorum and enters the critical section; hence mutual exclusion is achieved.
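As a quick, purely illustrative sanity check of the intersection property used in this argument, the small snippet below verifies that every pair of quorums in the coterie C shares at least one site, which is exactly why two sites can never both obtain full consent at the same time.

```python
from itertools import combinations

C = [{1, 2, 3}, {2, 4, 5}, {4, 1, 6}]    # the coterie of the example

# Intersection property: any two quorums overlap in at least one site.
assert all(q1 & q2 for q1, q2 in combinations(C, 2))
print([sorted(q1 & q2) for q1, q2 in combinations(C, 2)])   # [[2], [1], [4]]
```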

(Refer Slide Time: 38:12)

225
Conclusion: In this lecture we have discussed quorum-based approaches for solving distributed mutual exclusion, namely Maekawa's algorithm and the Agarwal-El Abbadi quorum-based algorithm. There exists a variety of quorums and a variety of ways to construct quorums. For example, Maekawa used the theory of projective planes to develop quorums of size √N, and the Agarwal-El Abbadi algorithm uses tree-structured quorums. In upcoming lectures, we will discuss token-based approaches, namely Suzuki-Kasami's broadcast algorithm and Raymond's tree-based algorithm.

Thank you.

226
Distributed Systems
Prof. Rajiv Misra
Department of Computer Science and Engineering
Indian Institute of Technology, Patna

Lecture - 09
Token Based Distributed Mutual Exclusion Algorithms

Lecture 9: Token Based Distributed Mutual Exclusion Algorithms.

(Refer Slide Time: 00:23)

Preface, recap of previous lecture: in the previous lecture we discussed quorum-based approaches for distributed mutual exclusion, namely Maekawa's algorithm and the Agarwal-El Abbadi quorum-based algorithm. Content of this lecture: in this lecture we will discuss token-based approaches, namely Suzuki-Kasami's broadcast algorithm and Raymond's tree-based algorithm.

227
(Refer Slide Time: 00:52)

Token-based algorithms use a unique token, and this unique token is shared among the sites; the site holding the token is allowed to enter the critical section repeatedly until it sends the token to some other site.

Numerous token-based algorithms are available, and they differ in the method by which a site carries out the search for the token in the network. Token-based algorithms use sequence numbers instead of timestamps; a sequence number distinguishes between old and current requests. The proof of correctness of token-based algorithms is trivial, but the challenging issues in the design of token-based algorithms for distributed mutual exclusion are freedom from starvation, freedom from deadlock, and the detection of token loss and regeneration of the lost token.

228
(Refer Slide Time: 02:11)

In token-based algorithms, a unique token is shared among the sites, and a site is allowed to enter the critical section only if it possesses the token. Token-based algorithms use sequence numbers instead of timestamps; the sequence number is used to distinguish between an old and a current request for the token. The correctness proof that token-based algorithms enforce mutual exclusion is trivial, because the algorithm guarantees mutual exclusion so long as a site holds the token during the execution of its critical section.

In this lecture we will discuss two token-based approaches, namely Suzuki-Kasami's broadcast algorithm and Raymond's tree-based algorithm.

229
(Refer Slide Time: 02:58)

Suzuki-Kasami's broadcast algorithm: in Suzuki-Kasami's broadcast algorithm, if a site that wants to enter the critical section does not have the token, then it broadcasts a REQUEST message for the token to all other sites. A site which possesses the token sends it to the requesting site upon receipt of the REQUEST message. If the site receives a REQUEST message while it is executing the critical section, it sends the token only after it has completed the execution of the critical section.

(Refer Slide Time: 03:36)

230
This algorithm efficiently addresses the following two design issues. The first is how to distinguish an outdated REQUEST message from a current REQUEST message; as I told you, it uses sequence numbers to answer this question.

The second issue is how to determine which sites have an outstanding REQUEST for the critical section: after a site finishes the execution of its critical section, it must determine which sites have an outstanding REQUEST for the critical section,

(Refer Slide Time: 04:14)

so that the token can be dispatched to them. To do this, the algorithm uses data structures and variables that we now describe in detail.

The first issue, how to distinguish an outdated REQUEST message from a current REQUEST message, is addressed in the following manner. A REQUEST message of site Sj has the form REQUEST(j, n), where n is the sequence number which indicates that site Sj is requesting its nth critical section (CS) execution. A site Si keeps an array of integers RNi[1..N], where RNi[j] denotes the largest sequence number received so far in a REQUEST message from site Sj.

All the sites keep this variable RN, with an entry for every site. When a site Si receives a REQUEST(j, n) from site Sj, it sets RNi[j] := max(RNi[j], n). So, when site Si receives the REQUEST(j, n) message, the REQUEST is outdated if RNi[j] > n; that is, if the largest sequence number already recorded for Sj's critical

231
section requests exceeds the sequence number n of the current message, then the message is designated as an old REQUEST for the critical section.

This inequality checks whether a REQUEST is a current one or an outdated one, using the sequence number n of the requesting message. The second issue is addressed in the following manner.

(Refer Slide Time: 07:07)

Let me remind you, the second issue is how to determine which sites have an outstanding REQUEST for the critical section; that means, when the critical section becomes idle, or a site exits the critical section, it has to know which sites have pending requests. This second issue is addressed in the following manner. The token consists of a queue Q of requesting sites and an array of integers LN[1..N], where LN[j] is the sequence number of the REQUEST that site Sj executed most recently.

After executing its critical section, a site Si updates LN[i] := RNi[i] to indicate that its REQUEST corresponding to the sequence number RNi[i] has been executed. Now, at a site Si, if RNi[j] = LN[j] + 1, then site Sj is currently requesting the token. To summarize these two points: first, the token is nothing but a queue Q of requesting sites plus an array of integers LN, where LN[j] is the sequence number of the request that site Sj executed most recently.

232
So, array of requests executed that is indicated by LN, so, these 2 information is
basically makes this particular token. So, when if RN that is the REQUEST coming from
j if it RNi [j]=LN[j]+1, then it has a pending than the site is currently requesting for a
token. So, this will be the indication that a site that these are the set of sites, which are
basically having a pending REQUEST for the token.
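Written out directly, the two tests are just the following predicates; this tiny snippet is only a restatement of the two conditions above, with illustrative names.

```python
# The two bookkeeping tests of Suzuki-Kasami as predicates.

def is_outdated(RN_i, j, n):
    # REQUEST(j, n) is outdated if a strictly larger sequence number from
    # site j has already been recorded at site i.
    return RN_i[j] > n

def has_outstanding_request(RN_i, LN, j):
    # Site j has a pending request if its latest request number is exactly
    # one more than the request of j that was executed most recently.
    return RN_i[j] == LN[j] + 1
```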

(Refer Slide Time: 10:08)

Now come the steps of the algorithm. The first step of the algorithm is about requesting the critical section. If the requesting site Si does not have the token, then it increments its sequence number RNi[i] and sends a REQUEST(i, sn) message to all other sites, where the sequence number sn is the updated value of RNi[i]. When a site Sj receives this message, it sets RNj[i] to max(RNj[i], sn). If Sj has the idle token, then it sends the token to Si if RNj[i] = LN[i] + 1, that is, if Si has a pending REQUEST for the token.

The next step of the algorithm is executing the critical section: Si executes the critical section after it has received

233
(Refer Slide Time: 11:43)

the token. The third step is about releasing the critical section: having finished the execution of the critical section, site Si takes the following actions. It sets the LN[i] element of the token array equal to RNi[i]. For every site Sj whose id is not in the token queue, it appends Sj's id to the token queue if RNi[j] = LN[j] + 1. If the token queue is nonempty after the above update, Si deletes the top site id from the token queue and sends the token to the site indicated by that id. A compact sketch of these steps is given below, and we can also see them through the following illustrative example.
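The following is a compact, single-file sketch of these three steps for one site, assuming a broadcast/send_token transport is supplied from outside and that the site initially holding the token is constructed with has_token=True; it is meant to mirror the description above rather than reproduce the original pseudocode exactly.

```python
from collections import deque

class SuzukiKasamiSite:
    """Sketch of one Suzuki-Kasami site; sites are numbered 0..N-1 (assumed)."""

    def __init__(self, i, N, broadcast, send_token, has_token=False):
        self.i, self.N = i, N
        self.broadcast = broadcast                  # broadcast(("REQUEST", i, sn))
        self.send_token = send_token                # send_token(dest, (queue, LN))
        self.RN = [0] * N                           # largest request number seen per site
        self.token = (deque(), [0] * N) if has_token else None
        self.in_cs = False

    def request_cs(self):                           # step 1: requesting the CS
        if self.token is None:
            self.RN[self.i] += 1
            self.broadcast(("REQUEST", self.i, self.RN[self.i]))
        else:
            self.in_cs = True                       # idle token is already here

    def on_request(self, j, n):                     # REQUEST(j, n) arrives
        self.RN[j] = max(self.RN[j], n)
        if self.token is not None and not self.in_cs:
            queue, LN = self.token
            if self.RN[j] == LN[j] + 1:             # j has an outstanding request
                self.token = None
                self.send_token(j, (queue, LN))

    def on_token(self, token):                      # step 2: token received, enter CS
        self.token, self.in_cs = token, True

    def release_cs(self):                           # step 3: releasing the CS
        self.in_cs = False
        queue, LN = self.token
        LN[self.i] = self.RN[self.i]
        for j in range(self.N):                     # append sites with pending requests
            if j != self.i and j not in queue and self.RN[j] == LN[j] + 1:
                queue.append(j)
        if queue:
            nxt = queue.popleft()
            self.token = None
            self.send_token(nxt, (queue, LN))
```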

(Refer Slide Time: 12:30)

234
Initially there are 5 sites and none of them is requesting; RN[j] denotes the sequence number of the latest REQUEST from process j, and LN[j] denotes the sequence number of the latest visit to the critical section by process j.

(Refer Slide Time: 12:57)

Just see that site 0 here is requesting critical section entry; it updates its sequence number, and the REQUEST message flows to sites 1 and 2.

(Refer Slide Time: 13:13)

235
Sites 1 and 2 have received this REQUEST and record that site 0 has a pending request. In this way the algorithm proceeds, and the token is finally delivered to the requester.

(Refer Slide Time: 13:32)

(Refer Slide Time: 13:33)

Now, the correctness of this algorithm: mutual exclusion is guaranteed because there is only one token in the system, and a site holds the token during its critical section execution.

236
(Refer Slide Time: 13:35)

Theorem: a requesting site enters the critical section in finite time. Proof: the token REQUEST message of a site Si reaches the other sites in a finite amount of time. Since one of these sites will have the token within finite time, site Si's REQUEST will be placed in the token queue in a finite amount of time. Since there can be at most N - 1 requests in front of this REQUEST in the token queue, site Si will get the token eventually and execute the critical section in a finite amount of time.

(Refer Slide Time: 14:28)

237
Performance: no message is needed and the synchronization delay is 0 if the site holds the idle token at the time of its request. If a site does not hold the token when it makes a REQUEST, the algorithm requires N messages to obtain the token, and the synchronization delay in that case is at most T. Next, Raymond's tree-based algorithm.

(Refer Slide Time: 15:03)

This algorithm uses a spanning tree of the network to reduce the number of messages exchanged per critical section execution; hence it is an optimized algorithm as far as the number of messages is concerned, but it relies on a structure, namely the spanning tree.

It assumes that a spanning tree of the network is available. The network is viewed as a graph, and a spanning tree of the network is a tree that contains all N nodes. The algorithm assumes that the underlying network guarantees message delivery, that is, it assumes reliable communication channels. All the nodes of the network are assumed to be completely reliable, so there is no assumption of failures. The algorithm operates on a minimal spanning tree of the network.

238
(Refer Slide Time: 15:56)

So, by spanning tree we mean the minimal spanning tree of the network topology, which is what the algorithm requires.

The algorithm assumes the network nodes to be arranged in an unrooted tree structure; the next figure shows a spanning tree of 7 nodes, and messages between the nodes traverse along the undirected edges of the tree.

(Refer Slide Time: 16:24)

A node needs to hold information about, and communicate with, only its immediate neighbors. Similar to the concept of the token used in token-based algorithms, this algorithm

239
uses the concept of a privilege. Only one node can be in possession of the privilege, called the privileged node, at any time, except when the privilege is in transit from one node to another in the form of a PRIVILEGE message.

So, at any point of time at most one node holds the privilege and is called the privileged node. When there are no nodes requesting the privilege, it remains in the possession of the node that last used it.

(Refer Slide Time: 17:07)

Now, the variables used in Raymond's algorithm are as follows. The first variable is called the HOLDER variable; the HOLDER variables impose a directed tree on the undirected tree we have seen. Each node maintains a HOLDER variable that provides information about the placement of the privilege in relation to the node itself. A node stores in its HOLDER variable the identity of the node that it thinks has the privilege or leads to the node having the privilege.

So, for two nodes X and Y, if HOLDERX = Y, then we can redraw the undirected edge between X and Y as a directed edge from X to Y. The HOLDER variables thus impose a directed structure on top of the tree. For instance, if node G holds the privilege, the figure can be redrawn with logically directed edges, as shown in figure 9.3.

240
(Refer Slide Time: 18:19)

Here we can see the previous figure, which shows the undirected tree. Using the HOLDER variables, a directed structure is induced on the tree, with the directions pointing towards the node that is currently holding the token. C is a neighbor of the node holding the privilege, so its edge is directed towards that node, and D directs its edge towards C, which in turn knows the way to the privileged node. In this way all the nodes set their HOLDER variables; as far as G is concerned, which is holding the token and is the privileged node, it puts self in HOLDER of G.

So, HOLDER of G is self, HOLDER of D is C, HOLDER of C is G, HOLDER of B is C, HOLDER of F is B, HOLDER of E is A, and HOLDER of A is B. These HOLDER variables induce a directed structure on the undirected spanning tree.
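In code, this directed overlay is just a table from each node to its HOLDER; for the figure's seven nodes it can be written as below (node names as in the figure, with "self" marking the privileged node). The helper function is only an illustrative way of reading the table.

```python
# HOLDER variables for the 7-node example; G currently holds the privilege.
HOLDER = {"G": "self", "C": "G", "D": "C", "B": "C", "F": "B", "A": "B", "E": "A"}

def path_to_privilege(node):
    """Follow HOLDER pointers from a node until the privileged node is reached."""
    path = [node]
    while HOLDER[path[-1]] != "self":
        path.append(HOLDER[path[-1]])
    return path

print(path_to_privilege("E"))   # ['E', 'A', 'B', 'C', 'G']
```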

241
(Refer Slide Time: 20:19)

Now, suppose node B, which does not hold the privilege, wants to execute the critical section. B sends a REQUEST message to HOLDER of B, that is C, which in turn forwards the REQUEST message to its own HOLDER, that is G. The privileged node G, which no longer needs the privilege, sends the PRIVILEGE message to its neighbor C, which made the REQUEST for the privilege on behalf of B, and resets its HOLDER variable: HOLDER of G is set to C. Node C in turn forwards the privilege to node B, since it had requested the privilege on behalf of B, and node C also resets its HOLDER variable to B. The tree will then look as shown.

(Refer Slide Time: 21:01)

242
Here, node G no longer needs the privilege, so it sends the privilege to C and changes its HOLDER variable: HOLDER of G now becomes C. Node C then forwards the PRIVILEGE message on behalf of G to the requesting node B, and HOLDER of C becomes B. The arrows are changed accordingly, and B now becomes the privileged node.

(Refer Slide Time: 21:58)

What I have explained is illustrated in this particular example.

(Refer Slide Time: 22:03)

243
Another example: the node currently holding the token, the token holder, is the root, that is S1. Now site S5 sends a token request; it sends it first to S2, and S2, on behalf of S5, sends a token REQUEST to S1.

(Refer Slide Time: 22:43)

If S1 is not using the privilege at that point of time and the REQUEST from S2 is at the top of its request queue, then it sends the token to S2, and HOLDER of S1 is changed to S2.

S2 then sends the token to S5, HOLDER of S2 is changed to S5, and S5 becomes the new token holder.

244
(Refer Slide Time: 23:14)

The data structures used in Raymond's algorithm: at each node the following variables are used, which are summarized as follows.

(Refer Slide Time: 23:36)

The variable HOLDER contains either self, if the token or privilege is with the node itself, or the identity of one of its immediate neighbors. The variable USING is true or false; it indicates whether the current node is executing the critical section, so if the node is currently executing the critical section, its USING variable is flagged as true. REQUEST_Q is a FIFO queue that contains self or the identities of the immediate

245
neighbors as its elements. The REQUEST_Q of a node consists of the identities of those immediate neighbors that have requested the privilege but have not yet been sent the privilege. ASKED is true or false and indicates whether the node has sent a REQUEST for the privilege.

(Refer Slide Time: 24:32)

Regarding the data structures: the value self is placed in the REQUEST_Q if the node makes a REQUEST for the privilege for its own use. The maximum size of the REQUEST_Q of a node is the number of its immediate neighbors + 1 for self. ASKED prevents the sending of duplicate requests for the privilege and also makes sure that the REQUEST_Qs of the various nodes do not contain any duplicate elements.

246
(Refer Slide Time: 25:04)

The algorithm consists of the following routines: ASSIGN_PRIVILEGE and MAKE_REQUEST. ASSIGN_PRIVILEGE is a routine that sends a PRIVILEGE message; a privileged node sends a PRIVILEGE message if it holds the privilege but is not using it, its REQUEST_Q is not empty, and the element at the head of its REQUEST_Q is not self.

(Refer Slide Time: 25:35)

ASSIGN_PRIVILEGE: the situation where self is at the head of the REQUEST_Q may occur immediately after the node receives a PRIVILEGE message. In that case, the node will enter the

247
critical section after removing self from the head of the REQUEST_Q. If the id of another node is at the head of the REQUEST_Q, then it is removed from the queue and a PRIVILEGE message is sent to that node. Also, the variable ASKED is set to false, since the currently privileged node will not have sent a REQUEST for the privilege.

(Refer Slide Time: 26:15)

MAKE_REQUEST is a routine which sends a REQUEST message. An unprivileged node sends a REQUEST message if it does not hold the privilege, its REQUEST_Q is not empty, that is, it requires the privilege for itself or on behalf of one of its immediate neighbor nodes, and it has not sent a REQUEST message already. In MAKE_REQUEST, the variable ASKED is set to true to reflect the sending of the REQUEST message; the MAKE_REQUEST routine makes no change to any other variable.

248
(Refer Slide Time: 26:38)

The variable ASKED will be true at a node when it has sent a REQUEST message to an immediate neighbor and has not yet received a response. A node does not send any REQUEST message if ASKED is true at that node. Thus the variable ASKED makes sure that unnecessary REQUEST messages are not sent from an unprivileged node. This keeps the REQUEST_Q of every node bounded, even when operating under heavy load.

(Refer Slide Time: 27:16)

Below we mention the four events that constitute the algorithm. The first event is that a node wishes to execute the critical section; this aspect is

249
handled algorithmically as Enqueue(REQUEST_Q, self), followed by ASSIGN_PRIVILEGE and MAKE_REQUEST. Second, a node receives a REQUEST message from one of its immediate neighbors X: Enqueue(REQUEST_Q, X), then ASSIGN_PRIVILEGE and MAKE_REQUEST. Third, a node receives a PRIVILEGE message: HOLDER becomes self, then ASSIGN_PRIVILEGE and MAKE_REQUEST. Fourth, a node exits the critical section: USING becomes false, then ASSIGN_PRIVILEGE and MAKE_REQUEST. A compact sketch of these routines is given below.
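A minimal sketch of the two routines and the four events, assuming an external send(message, destination) transport, is given below; the names follow the text, but the structure is an illustrative assumption rather than Raymond's original code.

```python
from collections import deque

class RaymondNode:
    """Sketch of one node of Raymond's algorithm (transport assumed external)."""

    def __init__(self, name, holder, send):
        self.name = name
        self.holder = holder        # "self" or the neighbor towards the privilege
        self.send = send            # send(("REQUEST"|"PRIVILEGE", self.name), dest)
        self.using = False
        self.request_q = deque()
        self.asked = False

    def assign_privilege(self):
        if self.holder == "self" and not self.using and self.request_q:
            if self.request_q[0] != "self":
                nxt = self.request_q.popleft()      # pass the privilege on
                self.holder, self.asked = nxt, False
                self.send(("PRIVILEGE", self.name), nxt)
            else:
                self.request_q.popleft()            # own request at the head:
                self.using = True                   # enter the critical section

    def make_request(self):
        if self.holder != "self" and self.request_q and not self.asked:
            self.asked = True
            self.send(("REQUEST", self.name), self.holder)

    # The four events, each followed by ASSIGN_PRIVILEGE and MAKE_REQUEST.
    def wish_cs(self):
        self.request_q.append("self"); self.assign_privilege(); self.make_request()

    def on_request(self, neighbor):
        self.request_q.append(neighbor); self.assign_privilege(); self.make_request()

    def on_privilege(self):
        self.holder = "self"; self.assign_privilege(); self.make_request()

    def exit_cs(self):
        self.using = False; self.assign_privilege(); self.make_request()
```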

(Refer Slide Time: 28:05)

Events: a node wishes critical section entry; if it is the privileged node, it can enter the critical section using the ASSIGN_PRIVILEGE routine, otherwise it sends a REQUEST message using the MAKE_REQUEST routine. A node receives a REQUEST message from one of its immediate neighbors; if the node is the current holder, it may send the privilege to the requesting node using the ASSIGN_PRIVILEGE routine, otherwise it forwards the REQUEST using the MAKE_REQUEST routine. A node receives the PRIVILEGE message:

250
(Refer Slide Time: 28:31)

the ASSIGN_PRIVILEGE routine could result in the execution of the critical section at the node, or it may forward the privilege to another node. After the privilege is forwarded, the MAKE_REQUEST routine could send a REQUEST message to reacquire the privilege for a pending REQUEST at this node. A node exits the critical section: on exit from the critical section, this node may pass the privilege on to a requesting node using the ASSIGN_PRIVILEGE routine, for a pending request.

(Refer Slide Time: 29:09)

251
This picture illustrates message overtaking. The algorithm does away with the use of sequence numbers, and it works such that the message flow between any two neighboring nodes sticks to a logical pattern. If message overtaking occurs between nodes A and B at all, it can occur only when a PRIVILEGE message is sent from node A to node B, which is then very closely followed by a REQUEST message from node A to node B. Such message overtaking will not affect the operation of the algorithm. You can see this through the example: node A has sent the privilege to B, and then immediately A wants the privilege for its own use again, so it sends a REQUEST, and this REQUEST may overtake the PRIVILEGE message in transit.

At node B, the REQUEST arrives first, although B does not yet have the privilege, so the request is put into B's REQUEST_Q. Once the privilege arrives at B, B becomes the privileged node; it then checks its REQUEST_Q, finds A at the head of the queue, and sends the privilege back to A. So, even this kind of message overtaking does not affect the working of the algorithm.

(Refer Slide Time: 31:18)

252
So, when node B receives the PRIVILEGE message from A after having received the REQUEST message, it could enter the critical section, or it could send the PRIVILEGE message to the immediate neighbor at the head of its REQUEST_Q, which need not be node A. So, message overtaking does not affect the algorithm, as I explained.

(Refer Slide Time: 31:34)

Now the correctness. The algorithm provides the following guarantees: mutual exclusion is guaranteed, deadlock is impossible, and starvation is impossible. Mutual exclusion: at any instant of time, not more than one node holds the privilege, hence mutual exclusion is guaranteed. Whenever a node receives a PRIVILEGE message, it becomes privileged; similarly, whenever a node sends a PRIVILEGE message, it becomes unprivileged. Between the instants one node becomes unprivileged and another node becomes privileged, there is no privileged node. Thus, there is at most one privileged node at any point of time in the network. Next, deadlock is impossible.

253
(Refer Slide Time: 32:22)

When the critical section is free and one or more nodes want to enter the critical section but are not able to do so, a deadlock may occur. This could happen due to any of the following scenarios.

First, the privilege cannot be transferred to a node because no node holds the privilege. Second, a node in possession of the privilege is unaware that there are other nodes requiring the privilege. Third, the PRIVILEGE message does not reach the requesting unprivileged node. All three conditions that could possibly lead to a deadlock cannot arise here, as they are taken care of by the algorithm; hence deadlock is impossible.

254
(Refer Slide Time: 33:04)

Let us see why none of these three conditions, each of which could be a reason for deadlock, can arise in the algorithm.

Scenario 1 can never occur in this algorithm because nodes do not fail and messages are not lost; these are assumptions of the algorithm. The logical pattern established using the HOLDER variables ensures that a node that needs the privilege sends a REQUEST message either to a node holding the privilege or to a node that has a path leading to the privileged node. Scenario 2 also cannot happen, that is, a node in possession of the privilege being unaware that there are other nodes requiring the privilege: the series of REQUEST messages are enqueued in the REQUEST_Qs of the various nodes, and the REQUEST_Qs of those nodes collectively provide the logical path for the transfer of the PRIVILEGE message from the privileged node to the requesting unprivileged nodes. For the same reason, scenario 3 can never occur either. Hence deadlock is impossible in this algorithm.

255
(Refer Slide Time: 34:09)

The third aspect is starvation. Starvation is also impossible: when a node A holds the privilege and another node B requests the privilege, the identity of node B or the ids of the proxy nodes of node B will be present in the REQUEST_Qs of the various nodes on the path connecting the requesting node to the currently privileged node. So, depending upon the position of node B's id in those REQUEST_Qs, node B will sooner or later receive the privilege. Thus, once node B's REQUEST message reaches the privileged node A, node B is sure to receive the privilege. Hence starvation is also impossible.

(Refer Slide Time: 34:51)

256
Now, the cost and performance analysis of the algorithm. In the worst case, the algorithm requires 2 * (longest path length of the tree) messages per critical section entry. This happens when the privilege is to be passed between nodes at either end of the longest path of the minimal spanning tree. The worst possible network topology for this algorithm is one where all the nodes are arranged in a linear fashion; the longest path is then N - 1, and thus the algorithm will exchange 2 * (N - 1) messages per critical section entry.

However, if all the nodes generate an equal number of REQUEST messages for the privilege, the average number of messages needed per critical section entry will be approximately 2N/3, because the average distance between a requesting node and the privileged node is (N + 1)/3.
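As an illustrative instance of these formulas, take a linear chain of N = 7 nodes: the longest path is N - 1 = 6, so the worst case is 2 * (N - 1) = 12 messages per critical section entry, while the average distance is (N + 1)/3 = 8/3, roughly 2.7 hops, giving approximately 2N/3, that is, around 4 to 5 messages per entry on average.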

(Refer Slide Time: 35:44)

The best topology for the algorithm is the radiating star topology. The worst case cost of this algorithm for this topology is O(log_(K-1) N). Trees with higher fan-outs are preferred over radiating star topologies. The longest path length of such trees is typically O(log N); thus, on average, this algorithm involves the exchange of O(log N) messages per critical section execution.

Under heavy load, the algorithm exhibits an interesting property: as the number of nodes requesting the privilege increases, the number of messages exchanged per

257
critical section entry decreases. Hence, under heavy load, the algorithm requires an exchange of only about 4 messages per critical section entry.

(Refer Slide Time: 36:32)

Now, a comparison of the different distributed mutual exclusion algorithms. We have seen the approaches of Lamport and of Ricart-Agrawala; they are non-token-based algorithms, their synchronization delay is T, and the number of messages required is 3(N-1) for Lamport's algorithm, while Ricart-Agrawala completes in 2(N-1) messages. In Lamport's algorithm the priority, or fairness, is achieved through timestamps, and the Ricart-Agrawala algorithm uses an implicit release message strategy.

Another class of algorithms we have seen is the quorum-based algorithms; here the number of messages is reduced, because permission is not taken from everyone but from a quorum, that is, a subset of the sites. We have seen two such algorithms, Maekawa's algorithm and the Agarwal-El Abbadi algorithm. Both algorithms have a synchronization delay of 2T, but the messages are considerably reduced: about 3√N messages at low load in Maekawa's algorithm, and on the order of log N in the Agarwal-El Abbadi algorithm.

Another class of algorithms we have seen for mutual exclusion is the token-based algorithms; in this class we have seen two different algorithms today, Suzuki-Kasami's algorithm and Raymond's tree-based algorithm. Their synchronization delays are T and roughly (T log N)/2 respectively, and they require different numbers of messages. Both Suzuki-Kasami's algorithm and Raymond's

258
tree-based algorithm are based on a token. The node holding the token enters the critical section, and since there is only a single token, the mutual exclusion guarantee is trivial. But we have seen the other aspects required for such algorithms to work: they should be deadlock free and starvation free, token loss has to be detected and the lost token regenerated, and how the token is searched for depends upon the structure; in the tree structure, finding the token is quite easy. Suzuki-Kasami's algorithm, on the other hand, is a broadcast algorithm.

So, there the request messages are broadcast to all sites. All these algorithms exist, and which algorithm is going to be useful depends upon the application.

(Refer Slide Time: 39:10)

Based on these performance parameters, these algorithms are selected and used for different applications. Now the conclusion: mutual exclusion is a fundamental problem in distributed computing systems, where concurrent access to a shared resource or data is serialized. For distributed mutual exclusion we have discussed three types of algorithms: non-token-based, quorum-based and token-based approaches.

In upcoming lectures we will discuss about consensus and agreement algorithms.

Thank you.

259
Distributed Systems
Prof. Rajiv Misra
Department of Computer Science and Engineering
Indian Institute of Technology, Patna

Lecture – 10
Consensus and Agreement Algorithms

Lecture 10, Consensus and Agreement Algorithms.

(Refer Slide Time: 00:21)

Preface, recap of previous lecture: in the previous lecture we discussed mutual exclusion algorithms for distributed computing systems, namely non-token-based algorithms, quorum-based algorithms and token-based algorithms.

260
(Refer Slide Time: 00:37)

Content of this lecture: in this lecture we will discuss consensus and agreement algorithms. The lecture first covers different aspects of the consensus problem and then gives an overview of which forms of consensus are solvable under different failure models and different assumptions on synchrony and asynchrony. It also covers agreement in the categories of synchronous message passing systems with failures and asynchronous message passing systems with failures.

(Refer Slide Time: 01:14)

261
Introduction: agreement among the processes in a distributed system is a fundamental requirement for a wide range of applications. Many forms of coordination require the processes to exchange information, to negotiate with one another, and eventually reach a common understanding or agreement, before taking application-specific actions. A classical example is the commit decision in database systems, wherein the processes collectively decide whether to commit or abort a transaction that they participate in.

In this lecture we will study the feasibility of designing algorithms to reach agreement under various system models and failure models, and, where possible, examine some representative algorithms for reaching agreement.

(Refer Slide Time: 02:12)

Now, we are going to see some of the fault models, so let us classify the faults. Based on the component that failed, the components that can fail in a distributed system are classified as a program or process, a processor or machine, a link, or storage. They are all prone to failures, and in this discussion we examine how, when one of these components fails, the algorithms can still lead to the consensus or agreement required to run the applications. Based on the behavior of the faulty component, the different fault models are classified as follows.

Crash fault: we assume that the component simply halts after the problem. Fail-stop: a crash with some additional conditions, namely that others can learn of the failure; crash means that it will

262
halt. Omission means it will fail to perform some steps; that is, some of the steps will be omitted. A Byzantine fault behaves arbitrarily and is the most general fault model. A timing fault violates timing constraints. These are the behaviors of faulty components, and they are used to define the failure models.

(Refer Slide Time: 03:39)

Now, the classification of tolerance. Masking tolerance: the system always behaves as per the specification in the presence of faults; that means the faults are masked. Non-masking tolerance: the system may violate the specification in the presence of faults, but should at least behave in a well-defined manner.

A fault-tolerant system should specify the class of faults tolerated and what tolerance is given for each class. These are some of the requirements.

263
(Refer Slide Time: 04:10)

Now, as far as the different models are concerned, several assumptions are used in the design of these algorithms. We have to assume some failure model, so we have to see what the different possible failure models are. Then we have to make an assumption about synchronous versus asynchronous communication.

Further assumptions have to be made about network connectivity, sender identification, channel reliability, authenticated versus non-authenticated messages, and the agreement variable. These are the different assumptions to be made in the algorithms, and we are going to discuss them one by one in the coming slides.

Before that, let us see the problem of agreement through the figure shown here. This is the Byzantine generals problem, which is why each node is represented as G. There are four generals leading their troops, G1, G2, G3 and G4, and these generals are located at different places around the Byzantine city, whose hill is located in today's Istanbul.

At four different sites, where they are not able to see each other, these four Byzantine generals have to take the decision either to attack or not to attack. If they take the decision simultaneously they are going to win; if they do not take the decision simultaneously, they are going to

264
lose. Now, some of these Byzantine generals are traitors, and the generals have to communicate whether to attack or not to attack with the help of messages. The messages are carried by messengers, and a message is lost if its messenger is caught by the enemy camp. So, G1, G2, G3 and G4 are the Byzantine generals representing their troops at four different locations, the communication links are nothing but the messengers through which the generals communicate, and once they have communicated they have to follow the agreed action, whether to attack or not to attack.

Here you can see that some of the generals are traitors. Depending on how many generals are traitors, they have to take the decisions 0 or 1, where 0 means do not attack and 1 means attack. Let us assume that, after the set of message exchanges, G4 receives the messages 0, 1, 1; G3 receives the messages 1, 0, 0 from the other generals; G2 receives 1, 0, 0; and G1 receives 1, 0, 0. Looking at the messages they received, it is difficult to take the decision; here we see 1, 0, 0 at three of the generals, so G1, G2 and G3 have reached the same values, but as far as G4 is concerned, G4 has a different value.

If, say, this is interpreted by majority, G1, G2 and G3 have the same view, and the majority of the values they received is 0, that is, not to attack. As far as G4 is concerned, if it goes by its own view it will have 1, and let us say its decision is to attack. So, we can conclude that one general is a traitor, while the other non-traitor generals arrive at a common value, which is not to attack. These values are arrived at after several message rounds. Still, the problem is not solved here, because G1, G2 and G3 reach a common value while G4 does not, so not everyone agrees on a common value. Most applications depend upon how the agreement is arrived at, whether agreement on a single value or on a set of values. These are the problems, and they have wide applications; we are going to see how this agreement problem is solved using algorithms, and under which system models and

265
failure models where we are going to solve them. So, we are going to see these
assumptions one by one.

(Refer Slide Time: 10:14)

A failure model specifies the manner in which a component of the system may fail. There exists a rich class of well-studied failure models; the various failure models are fail-stop, crash, receive omission, send omission, general omission, and Byzantine or malicious failures.

In these failure models, the number of processes in the system is denoted by n, and we also assume that at most f of these processes can be faulty. A faulty process can behave in any manner allowed by the failure model assumed in the particular problem setting.

266
(Refer Slide Time: 11:07)

The types of failure models: the first type is called fail-stop. In this model, a properly functioning process may fail by stopping execution from some instant thenceforth; additionally, other processes can learn that the process has failed. This is called fail-stop.

Crash failure model: in this model, a properly functioning process may fail by stopping to function from some instant thenceforth; unlike the fail-stop model, other processes do not learn of this crash. The third one is the receive omission failure model: a properly functioning process may fail by intermittently receiving only some of the messages sent to it, or by crashing. Send omission: a properly functioning process may fail by intermittently sending only some of the messages it is supposed to send, or by crashing.

267
(Refer Slide Time: 12:07)

General omission: a properly functioning process may fail by exhibiting either or both of send omission and receive omission failures. Byzantine or malicious failure: in this model, a process may exhibit any arbitrary behavior, and no authentication techniques are applicable to verify any claims it makes.

(Refer Slide Time: 12:27)

The next assumption in these algorithms is about synchronous versus asynchronous computation. In a synchronous computation, a process runs in a lock-step

268
manner: a process receives the messages sent to it earlier, performs computation using those messages, and sends messages to other processes.

So, this lock-step is the discrete sequence in which the processes run; a step of a synchronous computation is called a round. An asynchronous computation does not proceed in this strict lock-step manner: a process can send and receive messages and perform computation at any point of time, and no bound on message delays is assumed in asynchronous communication.

(Refer Slide Time: 13:22)

The third type of assumption is about network connectivity: the system has full logical connectivity, that is, each process can communicate with any other process by direct message passing. The fourth assumption is about sender identification: a process that receives a message always knows the identity of the sender. The fifth is channel reliability: the channels are reliable, and only the processes may fail; this is a very important assumption. The failures of processes are assumed under the different failure models, and this simplifies our study of the different algorithms in this setting.

269
(Refer Slide Time: 14:04)

Authenticated versus non-authenticated messages: in this part of the discussion we will be dealing only with unauthenticated messages. With unauthenticated messages, when a faulty process relays a message to other processes, it can forge the message and claim that it was received from another process, and it can also tamper with the contents of a received message before relaying it. Using authentication, via techniques such as digital signatures, it is easier to solve the agreement problem, because if some process forges a message or tampers with the contents of a received message before relaying it, the recipient can detect the forgery or tampering. So, as far as the algorithms are concerned, we are going to use only unauthenticated messages.

270
(Refer Slide Time: 14:53)

Agreement variable: the agreement variable may be boolean or multi-valued, and need not be an integer. When studying some of the more complex algorithms, we will assume it to be a boolean variable. This simplifying assumption does not affect the results for other data types, but helps in abstraction while presenting the details and the insight of the algorithms.

(Refer Slide Time: 15:19)

Performance aspects of agreement protocols: a few performance metrics for agreement protocols are as follows. The first is time, that is, the number of rounds needed to

271
reach an agreement; then message traffic, that is, the number of messages exchanged to reach an agreement; and storage overhead, the amount of information that needs to be stored at the processors during the execution of the protocol.

(Refer Slide Time: 15:48)

Problem specifications: the problem specifications for agreement and consensus algorithms are as follows. The first is the Byzantine agreement problem, where a single source has the initial value. It has three conditions. Agreement: all non-faulty processes must agree on the same value. Validity: if the source is non-faulty, then the value agreed upon by all non-faulty processes must be the same as the initial value of the source. Termination: each non-faulty process must eventually decide on a value.

Consensus problem: all processes have an initial value. Agreement: all non-faulty processes must agree on the same single value. Validity: if all the non-faulty processes have the same initial value, then the value agreed upon by all non-faulty processes must be that same value. Termination: each non-faulty process must eventually decide on a value. Maybe you have seen the difference; if you have not noticed it, I am underlining it again.

In Byzantine agreement only one source has the initial value, whereas in the consensus problem not just a single process but all the processes have initial values, and in both cases they have to decide on a particular value. So, only the problem setting is different: in consensus all the processes have their initial values, whereas in Byzantine agreement only the single source has the initial value.

(Refer Slide Time: 17:36)

The third type of problem is called the interactive consistency problem; here all the processes have initial values. Agreement: all non-faulty processes must agree on the same array of values. You have to notice the difference: in the earlier two problems the processes have to agree on a single value, whereas here they agree on a set of values, that is, an array of values. Validity: if a process i is non-faulty and its initial value is vi, then all non-faulty processes agree on vi as the ith element of the array; if a process j is faulty, then the non-faulty processes can agree on any value for A[j]. Termination: each non-faulty process must eventually decide on an array.

So, an element of the array shows that if the corresponding process is non-faulty then the processes have to agree on its initial value, and if it is faulty then they can agree on any value. This array is built for the n different processes 1 up to n, and at termination all non-faulty processes must decide on this array; the array will be the same at all the non-faulty processes. This is called the interactive consistency problem.

(Refer Slide Time: 19:10)

Among the three problems there is an equivalence. The three problems defined above are equivalent in the sense that a solution to any one of them can be used as a solution to the other two problems. This equivalence can be shown using a reduction of each problem to the other two problems: for example, if a problem A is reduced to a problem B, then a solution to problem B can be used to solve problem A in conjunction with the reduction.

Formally, the difference between agreement and consensus is that in the agreement problem a single process has the initial value, as in Byzantine agreement, whereas in the consensus problem all the processes have initial values. However, they are equivalent; that means, if you know how to solve the Byzantine agreement problem, then that solution can be used to solve the consensus problem and the interactive consistency problem.

(Refer Slide Time: 20:14)

So, an overview of the results: the following table gives an overview of the results and the lower bounds on solving the consensus problem under different assumptions. It is worth understanding the relation between the consensus problem and the problem of attaining common knowledge of the agreed-upon value. For the no-failure case consensus is attainable, as we will see in the next table.

Further, in a synchronous system common knowledge of the consensus value is also attainable, whereas in the asynchronous case concurrent common knowledge of the consensus value is attainable in the no-fault situation, as we can see in this figure.

(Refer Slide Time: 20:50)

So, if the failure model is no failure, then in the synchronous model agreement is attainable and common knowledge is also attainable; in asynchronous message passing and shared memory systems, agreement is attainable and concurrent common knowledge is also attainable. If the failure model is a crash fault or crash failure, then agreement is attainable in the synchronous system provided f, the number of faulty processors, is less than the total number of processors, and agreement can be attained with a lower bound of (f + 1) rounds; that means a minimum of (f + 1) rounds is required to achieve the agreement.

However, in the asynchronous model, even for the crash fault or crash failure model, agreement is not attainable at all. If the fault model is Byzantine, then agreement is attainable in the synchronous system where f ≤ ⌊(n - 1)/3⌋, that is n ≥ 3f + 1, and this agreement is achieved with a lower bound of (f + 1) rounds, where f is the number of faulty nodes.

However, in the asynchronous model agreement under Byzantine failures is also not attainable, because if it is not attainable under crash faults then obviously it will not be attainable under Byzantine faults.

(Refer Slide Time: 22:38)

So, consensus is not solvable in an asynchronous system even if one process can fail by crashing. The next figure shows how asynchronous message passing systems and shared memory systems deal with trying to solve the consensus problem.

(Refer Slide Time: 22:57)

Now, since we know the impossibility of the consensus problem in an asynchronous system, what we can do is solve weaker variants of the consensus problem in this model, that is, in an asynchronous system, and that is shown here. So, under the message passing system the variants of the consensus problem are called k-set consensus, epsilon-approximate agreement, renaming and reliable broadcast; we will see them in more detail in the next few slides.

(Refer Slide Time: 23:33)

Now, the weaker consensus problems in asynchronous systems: let us try to understand what these problems are. Terminating reliable broadcast: it states that a correct process always gets a message, even if the sender crashes while sending it. k-set consensus: it is solvable as long as the number of crash failures f < k; the parameter k indicates that the non-faulty processes may agree on different values, as long as the size of the set of values agreed upon is bounded by k; that is called k-set consensus. Approximate agreement: like k-set consensus, approximate agreement also assumes the consensus value is from a multi-valued domain; however, rather than restricting the set of consensus values to a set of size k, epsilon-approximate agreement requires that the values agreed upon by the non-faulty processors be within epsilon of each other.

Renaming problem: it requires the processes to decide on necessarily distinct values (new names). Reliable broadcast: a weaker version of reliable terminating broadcast, namely the reliable broadcast in which the termination condition is dropped, is solvable under crash faults.

(Refer Slide Time: 24:54)

To circumvent the impossibility results the weaker variants are defined in the next table.

(Refer Slide Time: 25:01)

So, the conditions are as follows: for reliable broadcast under crash failures the condition is n > f; for k-set consensus under crash failures it is f < k < n; for epsilon-approximate agreement under crash failures it is n ≥ 5f + 1; for renaming, up to f fail-stop processes can be sustained with n ≥ 2f + 1, and under crash failures f ≤ n - 1.

(Refer Slide Time: 26:01)

Agreement in synchronous message passing systems with failures: a consensus algorithm for crash failures in a synchronous message passing system. The algorithm given in the next slide is a consensus algorithm for n processes, where up to f processes, f < n, may fail in the fail-stop failure model. Here the consensus variable x is an integer value, and each process has an initial value xi. If up to f failures are to be tolerated, then the algorithm has f + 1 rounds; in each round a process i sends the value of its variable xi to all other processes if that value has not been sent before.

Of all the values received within that round and its own value xi at the start of the round, the process takes the minimum and updates xi. After f + 1 rounds, the local value xi is guaranteed to be the consensus value.
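
To make the round structure concrete, here is a minimal simulation of this minimum-flooding idea in Python. It is only an illustrative sketch: the function name run_consensus, the dictionary-based "network", and the way crashes are modelled (a crashed process simply stops sending from its crash round onwards, so crashes in the middle of a send step are not captured) are assumptions made for this example, not part of the lecture.

def run_consensus(initial_values, f, crashed_in_round=None):
    """initial_values: dict pid -> int; f: max crash faults tolerated.
    crashed_in_round: dict pid -> round in which that pid stops sending (optional)."""
    crashed_in_round = crashed_in_round or {}
    x = dict(initial_values)                 # current estimate x_i of each process
    already_sent = {p: set() for p in x}     # values each process has already broadcast

    for rnd in range(1, f + 2):              # f + 1 rounds
        inbox = {p: [] for p in x}
        for p in x:
            if crashed_in_round.get(p, f + 2) < rnd:
                continue                     # a crashed process sends nothing any more
            if x[p] not in already_sent[p]:  # send only values not sent before
                already_sent[p].add(x[p])
                for q in x:
                    if q != p:
                        inbox[q].append(x[p])
        for p in x:                          # take the minimum of own and received values
            x[p] = min([x[p]] + inbox[p])

    alive = [p for p in x if p not in crashed_in_round]
    return {p: x[p] for p in alive}          # all surviving processes hold the same value

# Example: 3 processes, at most f = 1 crash, so 2 rounds suffice.
print(run_consensus({1: 7, 2: 3, 3: 5}, f=1))    # {1: 3, 2: 3, 3: 3}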

(Refer Slide Time: 27:06)

Let us understand it using the algorithm explained earlier. Say we have three different processors and one of them is faulty, so f = 1 in this case, and the agreement requires f + 1 = 2 rounds. The faulty process, say, sends 0 to one process and 1 to another among i, j and k. Now, the process receiving 0 broadcasts 0, and the process receiving 1 broadcasts 1; this completes one round. Within this round, one process on receiving 1 sends 1 onward, and the other on receiving 0 sends 0 onward. So, each of the processes receives both 0 and 1, and if we take the minimum the values become 0, 0 and 0.

If, say, the faulty process sends 1 back to one of them, then after round number 1 the values are 0, 0 and 1. Similarly, if we take another round of these messages, with one process holding 1 and the others holding 0, and they exchange with each other and take the minimum, then all of them will have the value 0 and they will reach a consensus after two rounds.

(Refer Slide Time: 29:27)

So, the complexity of this algorithm: it requires f + 1 rounds, where f < n, and the number of messages is O(n^2) in each round, with each message carrying one integer. Hence, since there are (f + 1) rounds and each round requires O(n^2) messages, the total number of messages is O((f + 1) · n^2).

(Refer Slide Time: 29:57)

Now, the next and most important algorithm in today's lecture is the consensus algorithm for Byzantine failures. This algorithm is due to Lamport, Shostak and Pease and is called the Lamport-Shostak-Pease algorithm; it is for Byzantine failures, it assumes a synchronous system, and the failure model is Byzantine. We are going to see this particular algorithm.

So, in this algorithm the model which is assumed says that there are n processes in total, of which up to f may be faulty; the communication is reliable; it is a fully connected topology; the receiver always knows the identity of the sender; and the fault model assumed is the very general Byzantine model. The system is synchronous, that is, in each round a processor receives messages, performs computation and sends messages, so the synchronous computation model is assumed here.

(Refer Slide Time: 31:04)

So, the Byzantine agreement problem was first defined and solved by Lamport, Shostak and Pease; hence the Lamport-Shostak-Pease algorithm. Pease et al. showed that in a fully connected network it is impossible to reach an agreement if the number of faulty processors f exceeds (n - 1)/3; that is, agreement requires f ≤ ⌊(n - 1)/3⌋, equivalently n ≥ 3f + 1.

(Refer Slide Time: 31:36)

So, let us take this particular example, which shows the situation where the condition f ≤ ⌊(n – 1)/3⌋ is violated; that is, with f = 1 and n = 3 we would need n ≥ 3f + 1 = 4 processes, so as per the previous condition Byzantine agreement is not possible, and we can see this in the example.

Here, in one case P0, the source, is faulty, and in the other case the source P0 is non-faulty but some other process is faulty, let us say P2. When the source is non-faulty it sends, because it is non-faulty, the same value to P1 and P2, but as far as P2 is concerned, it relays a different value to P1 because it is faulty.

(Refer Slide Time: 32:59)

So, agreement will not be reached in this particular example; from the same example we can see that agreement is not possible.

(Refer Slide Time: 33:07)

Now, agreement is possible when f = 1 and the total number of processors is 4. Let us see how agreement is possible by looking at the commander Pc. This is the source and it is faulty, so it sends the value 0 to Pd and 0 to Pb, but 1 to Pa, as in the first column. So, Pa, after receiving this 1, sends 1 to both of its neighbours; similarly Pb, after receiving 0, sends 0, because it is not faulty; similarly Pd, after receiving 0, sends 0 at both ends.

So, if we take the values received at one lieutenant, they are 1, 0 and 0, so the majority is 0 in this case; here also, if you see, the values are 1, 0 and 0, so the majority is 0; and here also the majority is 0. So, in this particular case, even if the source is faulty, the processes will reach an agreement, and that agreed-upon value, the agreement variable, will be equal to 0.
will be agreed upon value or agreement variable will be equal to 0.

Similarly, in this example let us assume instead that the source is non-faulty but some other node is faulty. For example, the faulty node, after receiving 0, will send 1 to both of its neighbours. We will see that, in the same manner, this case also reaches an agreement. So, as far as the Lamport-Shostak-Pease condition is concerned, if the condition on the number of faulty processors is satisfied then the agreement is possible.

(Refer Slide Time: 35:32)

So, the algorithm is the Lamport-Shostak-Pease algorithm, also known as the Oral Messages algorithm OM(f), where f is the number of faulty processors and the number of processors is n ≥ 3f + 1. The algorithm is recursive, and the base of the recursion, OM(0), says that the source process sends its value to every other process; each process then uses the value it receives from the source, and if no value is received the default 0 is assumed.

(Refer Slide Time: 36:12)

Now, the recursive procedure of this algorithm, OM(f) for f > 0: the first step is that the source process sends its value to every other process. Then, for each i, let vi be the value process i receives from the source (default 0 if no value is received); process i acts as the new source and initiates algorithm OM(f - 1), in which it sends the value vi to each of the other n - 2 processes.

Finally, after this loop of step 2 is over, the fourth step says: for each i, and for each j ≠ i, let vj be the value process i received from process j in step 3 (default 0 if no value is received); process i then applies the majority function over v1, ..., v(n-1), where the function majority computes the majority value if it exists, and otherwise uses the default value 0. This majority function is application dependent.
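
A compact, hypothetical Python sketch of this recursion is given below. The helper name om, the representation of processes as integers, and in particular the way a faulty process lies (a faulty commander sends inconsistent bits, a faulty lieutenant flips the bit it relays) are simplifications chosen only for illustration; real Byzantine behaviour can be arbitrary.

from collections import Counter

def om(value, commander, lieutenants, faulty, m):
    """Return dict: lieutenant -> value it decides for this commander's broadcast."""
    # Step 1: the commander sends its value to every lieutenant.
    received = {}
    for p in lieutenants:
        v = value
        if commander in faulty:
            v = 1 - value if p % 2 else value     # a faulty commander sends inconsistent values
        received[p] = v
    if m == 0:                                    # OM(0): use the value received from the source
        return received
    # Steps 2-3: every lieutenant j relays its value to the others via OM(m - 1).
    relayed = {}                                  # relayed[j][p] = value p obtained from j's sub-broadcast
    for j in lieutenants:
        v_j = received[j] if j not in faulty else 1 - received[j]   # a faulty lieutenant lies
        relayed[j] = om(v_j, j, [q for q in lieutenants if q != j], faulty, m - 1)
    # Step 4: lieutenant p takes the majority of its own value and the relayed values.
    decided = {}
    for p in lieutenants:
        votes = [received[p]] + [relayed[j][p] for j in lieutenants if j != p]
        decided[p] = Counter(votes).most_common(1)[0][0]   # majority (a default would break ties)
    return decided

# n = 4, f = 1: commander 0 is faulty, yet lieutenants 1, 2, 3 agree on a common value.
print(om(0, commander=0, lieutenants=[1, 2, 3], faulty={0}, m=1))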

(Refer Slide Time: 37:35)

So, this is the algorithm which is applied under the Byzantine fault model and under the synchronous system model.

As we have seen, for f = 1 and n = 4 the same algorithm was applied in the previous slide. Now let us see how many messages and how many rounds there will be. As you know, two rounds are required, and the messages in each round are as follows: in the first round the source sends to its 3 neighbours, and in the second round each of these in turn sends to its corresponding neighbours, so 6 messages in the second round and 3 messages in the first round, a total of 9 messages. The total number of rounds is 2, that is f + 1 = 2, and the number of messages is (n – 1) + (n – 1)(n – 2), that is 3 + 6 = 9 messages in total.

Now let us generalize this total number of messages. If f = 2, then n = 3f + 1 = 7.

(Refer Slide Time: 39:20)

So, 7 processes are required when the number of faulty processes is 2, and in this particular case Byzantine agreement is possible. How many rounds is it going to take? f + 1 = 2 + 1 = 3 rounds, and in this particular picture 3 rounds are shown. Let us understand it: in the first round the source sends the message to all the other processes except itself, that is, to 2, 3, 4, 5, 6 and 7.

Now every node will act this way; for example, let us see how node 4 proceeds. Node 4, after receiving the value v1, sends it to its neighbours, other than the processes from which it has previously received the message; so it sends the message to 2, 3, 5, 6 and 7. Among them, node 2 in turn sends to its neighbours other than those from which it has previously received the message, that is, to 3, 5, 6 and 7. When every node has completed the third round the algorithm finishes, and let us analyze how many messages are communicated here.

(Refer Slide Time: 41:02)

So, the total number of rounds, as we have seen, is 3, and the messages required are (n − 1) + (n − 1)(n − 2) + (n − 1)(n − 2)(n − 3), because these are the messages of round 1, round 2 and round 3 respectively. The total number of messages is the sum of the messages communicated in each round, and if you sum them it comes out to be 6 + 30 + 120 = 156 messages. So, we have seen two examples, where f = 1 and f = 2; now we are going to see a more complicated example where f = 3.

(Refer Slide Time: 42:02)

When f = 3 the total number of nodes is 10, since n = 3f + 1, and the number of rounds is f + 1 = 3 + 1 = 4.

So, in this particular figure 4 rounds are shown, and as we have seen in the previous example up to round 3, similarly round 4 will also continue: every node will send the message to its neighbours, and will not send to those neighbours which have previously sent the message to that particular node. As written here, only one branch of the tree is shown for simplicity; you can explore the complete example at your end.

(Refer Slide Time: 43:05)

Similarly, if we look at the message complexity: the number of rounds, as I told you, is f + 1 = 3 + 1 = 4, and the total number of messages is (n − 1) + (n − 1)(n − 2) + (n − 1)(n − 2)(n − 3) + (n − 1)(n − 2)(n − 3)(n − 4), the last term being the messages of the fourth round. So, the total number of messages, if you count them, comes out to be 9 + 72 + 504 + 3024 = 3609 messages in this particular example.
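
These per-round counts can be checked with a few lines of Python; the helper om_messages below simply sums one term (n − 1)(n − 2)···(n − k) per round k, and is only an illustrative calculation.

def om_messages(n, f):
    """Total messages of OM(f): sum over rounds k = 1 .. f+1 of (n-1)(n-2)...(n-k)."""
    total = 0
    for k in range(1, f + 2):
        term = 1
        for j in range(1, k + 1):
            term *= (n - j)
        total += term
    return total

print(om_messages(4, 1))    # 3 + 6                 = 9
print(om_messages(7, 2))    # 6 + 30 + 120          = 156
print(om_messages(10, 3))   # 9 + 72 + 504 + 3024   = 3609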

(Refer Slide Time: 43:52)

So, we can generalize this formula: the number of rounds is a very simple calculation, f + 1 rounds, but as far as the messages are concerned, the total (n − 1) + (n − 1)(n − 2) + ... + (n − 1)(n − 2)···(n − f − 1) grows exponentially; that is why it is called an exponential algorithm, because an exponential number of messages is required, as in this particular example.

(Refer Slide Time: 44:18)

An exponential amount of space and an exponential number of messages are required.

(Refer Slide Time: 44:25)

Now, agreement in asynchronous message passing systems with failures: the impossibility result for the consensus problem by Fischer, Lynch and Paterson, called the FLP impossibility result. In 1985 Fischer, Lynch and Paterson showed this fundamental result on the impossibility of reaching agreement in an asynchronous system: it states that it is impossible to reach consensus in an asynchronous message passing system even if a single process has a crash failure. This result, popularly known as the FLP impossibility result, has had a significant impact on the field of designing distributed algorithms in failure-susceptible systems.

(Refer Slide Time: 45:10)

So, that is why weaker versions of the consensus problem are devised for the asynchronous model of communication, and these are called terminating reliable broadcast, k-set consensus, epsilon-approximate agreement, the renaming problem and reliable broadcast. Terminating reliable broadcast problem: a correct process always gets a message, even if the sender crashes while sending it.

(Refer Slide Time: 45:31)

So, validity: if the sender of a broadcast message m is non-faulty, then all the correct processes eventually deliver m.

(Refer Slide Time: 45:51)

The next problem is called the reliable broadcast problem; reliable broadcast is RTB (reliable terminating broadcast) without the termination condition. RTB requires the eventual delivery of a message even if the sender fails before sending it, in which case a null message needs to be delivered; in reliable broadcast this condition is not there, whereas RTB requires the recognition of the failure even if no message is sent. So, reliable broadcast is solvable under crash failures with O(n^2) messages.
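
As an illustration of why dropping the termination condition makes the problem solvable under crash failures with O(n^2) messages, here is a minimal relay-based sketch in Python: every process re-broadcasts a message the first time it sees it and then delivers it, so even if the sender crashes after reaching a single correct process, that process's relay brings the message to everyone. The classes, the shared list standing in for the channels, and the sequential delivery loop are simulation conveniences for this example only, not the lecture's code.

class Process:
    def __init__(self, pid, peers, network):
        self.pid, self.peers, self.network = pid, peers, network
        self.delivered = set()

    def broadcast(self, m):
        for q in self.peers:                 # send m to every other process
            self.network.append((q, m))

    def on_receive(self, m):
        if m not in self.delivered:          # first time this message is seen
            self.broadcast(m)                # relay it before delivering
            self.delivered.add(m)

network = []                                 # a simple list standing in for the channels
procs = {i: Process(i, peers=[j for j in range(4) if j != i], network=network)
         for i in range(4)}
procs[0].on_receive("m1")                    # process 0 rb-broadcasts m1 (modelled as receiving its own message)
while network:
    dest, msg = network.pop(0)
    procs[dest].on_receive(msg)
print({p: sorted(procs[p].delivered) for p in procs})   # every process delivers m1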

(Refer Slide Time: 46:21)

Now, the applications of these agreement algorithms are many, and some of them are listed here. The first application is fault-tolerant clock synchronization: distributed systems require the physical clocks to be synchronized, and an agreement protocol may help them reach a common clock value.

Another is atomic commit in distributed database systems: the distributed database sites must agree on whether to commit or to abort a transaction, and an agreement protocol may help them reach a consensus.

(Refer Slide Time: 46:54)

Conclusion: consensus problems are a fundamental aspect of distributed computing because they require inherently distributed processes to reach an agreement, or a consensus, which is essential in many applications. So, this lecture covered the different forms of the consensus problem and then gave an overview of which forms of consensus are solvable under different failure models and different computation models.

(Refer Slide Time: 47:26)

Then we have covered agreement in the following categories: synchronous message passing systems with failures, using the fail-stop and Byzantine fault models, where we have seen two different algorithms; and asynchronous message passing systems with failures, where we have shown the FLP impossibility result, namely that in this problem setting it is impossible to reach consensus. Hence several weaker versions of the consensus problem, like terminating reliable broadcast and reliable broadcast, were considered. In the upcoming lecture we will discuss checkpointing and rollback recovery.

Thank you.

Distributed Systems
Prof. Rajiv Misra
Department of Computer Science and Engineering
Indian Institute of Technology, Patna

Lecture – 11
Checkpointing and Roll back Recovery

Lecture 11: Checkpointing and Rollback Recovery. Preface: recap of the previous lecture.

(Refer Slide Time: 00:28)

In the previous lecture, we discussed the different forms of the consensus problem and gave an overview of which forms of consensus are solvable under the different failure models and different assumptions on synchrony and asynchrony. We also covered agreement in the categories of synchronous message passing systems with failures and asynchronous message passing systems with failures. Now, the content of this lecture.

(Refer Slide Time: 00:57)

In this lecture, we will discuss the basic fundamentals and underlying principles of checkpointing and rollback recovery, and also discuss the different rollback recovery schemes, that is, checkpointing based rollback recovery and log based rollback recovery schemes.

(Refer Slide Time: 01:14)

Introduction: fault tolerance. An error is a manifestation of a fault that can lead to a failure in any system, as shown here.

So, failure recovery is either backward recovery or forward recovery. In forward recovery we repair the erroneous part of the system state, which is very difficult to predict and maintain at any point of time. The other way of failure recovery is backward recovery: in backward recovery you restore the system state to a previous error-free state, and it is done using either an operation based method or a state based method; the state based method is also called checkpointing or logging.

So, here in our discussion we are looking at the backward recovery procedures for failure recovery: rollback recovery algorithms restore the system back to a consistent state after a failure.

(Refer Slide Time: 02:21)

Now, they achieve fault tolerance by periodically saving the state of a process during the failure-free execution, and they treat the distributed application as a collection of processes that communicate over the network.

So, in the rollback recovery schemes we will see how to restore the system using the states which we save at particular instants of time, either through checkpointing or through logging, or using both of them. A checkpoint is a saved state of a process. Now, why is rollback recovery of a distributed system complicated? The messages in a distributed system induce inter-process dependencies during the failure-free operation.

So, these dependencies need to be maintained at the time of recovery, and that is why it is not so easy and becomes complicated. Another aspect here in rollback recovery is called rollback propagation: the dependencies may force some of the processes that did not fail to roll back, and this is called rollback propagation. So, it is not only the rollback of the affected process; all the processes which are dependent on that affected process also have to roll back, and this rolling back of different processes cascades, which is called rollback propagation.

This phenomenon of rollback propagation is also called the domino effect: if each process takes its checkpoints independently, then the system cannot avoid the domino effect.

(Refer Slide Time: 04:20)

This scheme, where each process checkpoints on its own, is called independent or uncoordinated checkpointing. A technique that avoids the domino effect is coordinated checkpointing rollback recovery: here the processes coordinate their checkpoints such that the checkpoints form a system-wide globally consistent state.

Another technique is called communication induced checkpointing rollback recovery; here each process is forced to take checkpoints based on information piggybacked on the application messages, which is why it is called communication induced. So, it has two different types of checkpoints: one is the normal (autonomous) checkpoints and the other is the forced checkpoints.

(Refer Slide Time: 05:35)

Now, another type of rollback recovery is called log based rollback recovery; it combines checkpointing with logging of nondeterministic events and relies on the piecewise deterministic assumption. Next come some preliminaries, which are useful to see before we discuss the details of checkpointing and rollback recovery. The first definition in these preliminaries is the local checkpoint.

So, all the processes save their local states at certain instants of time. A local checkpoint is nothing but a snapshot of the state of a process at a given instant. Assumption: a process stores all local checkpoints on stable storage.

(Refer Slide Time: 06:05)

So, a process is able to roll back to any of its existing local checkpoints. The notation used here to represent a local checkpoint is Ci,k, where i is the process and k is the index of the checkpoint; that is, the kth local checkpoint taken at process i is represented by Ci,k. Ci,0 is the checkpoint that process i takes before it starts the execution.
(Refer Slide Time: 06:41)

Now, consistent states: a global state of a distributed system is nothing but a collection of the individual states of all participating processes and the states of the communication channels. A consistent global state is a global state that may occur during a failure-free execution of the distributed computation: if a process's state reflects a message receipt, then the state of the corresponding sender must reflect the sending of that message; that means, in a consistent global state, if a message's receive is recorded then its send is also recorded. If that is maintained, then it is called a consistent global state.

A global checkpoint is a set of local checkpoints, one from each process, and a consistent global checkpoint has a definition similar to the consistent global state; that means, it is a global checkpoint such that no message sent by a process after taking its local checkpoint is received by another process before taking its checkpoint.
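
The consistency condition just stated can be phrased as a small check; the sketch below, with its assumed representation of each local checkpoint as sets of sent and received message identifiers, is only an illustration of the definition.

def is_consistent(checkpoints):
    """checkpoints: dict pid -> {'sent': set(msg_ids), 'received': set(msg_ids)}."""
    all_sent = set()
    for state in checkpoints.values():
        all_sent |= state['sent']
    for state in checkpoints.values():
        for m in state['received']:
            if m not in all_sent:      # received but its send is not recorded -> orphan
                return False
    return True

# Message m1 sent by P1 and received by P2: consistent.
print(is_consistent({'P1': {'sent': {'m1'}, 'received': set()},
                     'P2': {'sent': set(), 'received': {'m1'}}}))   # True
# P2 records the receipt of m2 whose send is recorded nowhere: inconsistent.
print(is_consistent({'P1': {'sent': set(), 'received': set()},
                     'P2': {'sent': set(), 'received': {'m2'}}}))   # False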

(Refer Slide Time: 07:52)

These concepts can be expressed and understood using these examples.

So, here we can see that this recovery line, shown as the dotted line, forms a set of consistent checkpoints and is also a consistent state. In this consistent state you see that for one message the send is recorded but the receive is not recorded, and there is no message whose receive is recorded but whose send is not recorded; hence it is a consistent state. In the inconsistent state, you can see that the receipt of a message is recorded in a checkpoint, but its send is not recorded in any of the checkpoints in the collection; hence this collection of checkpoints is an inconsistent state.

(Refer Slide Time: 09:12)

Now, interaction with the outside world: a distributed system often interacts with the outside world to receive input data or deliver the outcome of a computation. The outside world process (OWP) is a special process that interacts with the rest of the system through message passing. A common approach is to save each input message on stable storage before allowing the application to process it; parallel bars in the figures denote an interaction with the outside world to deliver the outcome of the computation.

(Refer Slide Time: 09:45)

So, there are different types of messages which we have to see under the preliminaries. In-transit messages: messages that have been sent but not yet delivered; we will explain this in the next figure. Lost messages: messages whose send is done but whose receive is undone due to a rollback; those are called lost messages. Delayed messages: messages whose receive is not recorded because the receiving process was either down or the message arrived after the rollback; that we will also see. Orphan messages: messages with the receive recorded but the send not recorded; these orphan messages do not arise if the processes roll back to a consistent global state.

(Refer Slide Time: 10:47)

Duplicate messages arise due to message logging and replaying during process recovery. Let us understand these in this particular diagram. The first case is about the messages which are in transit: here m1 and m2 are in transit, that is, sent but not yet delivered, and m5 is also in transit.

The lost message is m1; why is m1 a lost message? This message is lost because there is a failure, and due to the failure, if the process rolls back to the previous consistent state or consistent checkpoint, then the receipt of this particular message will not take place; the process will not be there to receive it. Hence it is called a lost message: the process is not available to receive it due to the failure or maybe some other reason.

Now, another type of message is called a delayed message. Say a message is in transit and P1 recovers from checkpoint number eight onwards and then receives that message; then that particular message is called a delayed message: if the message arrives late, it is a delayed message.

Those are the examples of delayed messages; m4 and m5, under different assumptions, are duplicated messages. Why? You see the dotted line is the recovery line when the system rolls back. Take the example of m4 first: m4 is sent and its receive is also recorded. Now, when the system rolls back, the sending of message m4 will be replayed again from the log, resulting in another copy of the message m4 being generated, and at process P4 that same copy will be received again, resulting in duplicate messages. Similarly, there is a failure while m5 is in transit; it will be received after some time, and then the send of message m5 is replayed again, so m5 will be duplicated. So, these are the examples of the different kinds of messages here.

(Refer Slide Time: 14:04)

So, there are different issues in failure recovery. The setup is like this: there are three different processes Pi, Pj and Pk, and they take their local checkpoints; for example, Pi takes its local checkpoints, similarly the checkpoints taken by Pj are represented here, similarly the process Pk takes its own local checkpoints, and there are messages flowing among the processes.

Now, when the system fails it has to be restored to a consistent state. So, a consistent state would be obtained by drawing a recovery line, that is, by finding a consistent global checkpoint. Here, after the failure, the failed process has to roll back from its latest checkpoint. If this rollback takes place, then the send of this particular message H will be undone, and if it is undone then message H becomes what is called an orphan message.

So, because of this orphan message the process Pj also has to roll back, and it will not be rolling back to Cj,2 but has to roll back to Cj,1. The recovery line is now forming for Pi and Pj; similarly, as far as Pk is concerned, Pk will not be able to stop its rollback at Ck,2 because of the same effect: another message would become an orphan message, and based on that it has to roll back to an earlier checkpoint. So, this forms a recovery line, and this is shown here in this example as the restored globally consistent state, which is also called the recovery line.

So, the issue here is to find, among the set of checkpoints, that particular consistent global checkpoint which restores the system with minimum loss; this is quite a challenging task, and we are going to see through an algorithm how to achieve it.

(Refer Slide Time: 16:47)

So, the rollback of process Pi to its checkpoint created an orphan message H, as I have explained, and orphan message I is created due to the rollback of Pj. Messages C, D, E and F are also potentially problematic: message C is a delayed message, and message D is a lost message.

The lost messages can be handled by having the processes keep a message log of all the sent messages. Messages E and F are called delayed orphan messages and they are more difficult to handle: after resuming execution from their checkpoints, the processes will generate both of these messages again. So, these are the different issues in failure recovery: it requires coordinated or uncoordinated checkpointing, it also requires logging to replay the messages, and how to recover from all these issues is quite intricate and complex and needs to be understood in this particular context.

(Refer Slide Time: 17:53)

Now, we are going to explain the domino effect. The domino effect is nothing but rollback propagation, or a cascaded rollback, which causes the system to roll back too far in the computation, even to the beginning, in spite of all the checkpoints. Here, if the process P2 fails at this point, then it has to roll back to its previous checkpoint, and this previous checkpoint will trigger further rollbacks: this particular message m6 becomes an orphan message once P2 is rolled back, which triggers the process P1 also to roll back.

So, the rollback of P2 ensures that P1 also has to roll back; now, since P1 has rolled back, another message becomes an orphan, and hence P0 also has to roll back, and this forms the recovery line. So, this rolling back of P2, which triggers the rollback of P1 as well as the rollback of P0 from their checkpoints, is called a cascaded rollback, and this is the domino effect; you can see that it is due to the orphan messages.

So, orphan messages basically trigger the domino effect, which is nothing but a cascaded rollback, as we have seen through this particular example. Now, another problem is called the problem of livelock.

(Refer Slide Time: 19:54)

So, livelock is a case where a single failure can cause an infinite number of rollbacks. The livelock problem may arise when a process rolls back to its checkpoint after a failure and requests all other affected processes also to roll back.

(Refer Slide Time: 20:16)

So, in such a situation, if the rollback mechanism has no synchronization, it will lead to a livelock problem, as described in this particular example. The difference between the domino effect and livelock is that the domino effect is a cascaded rollback, whereas livelock is an infinite sequence of rollbacks. So, let us take an example of the livelock problem. Here, in case one, the process fails at this point; if it fails, then it has to roll back to its previous checkpoint, which is shown as y1. If it rolls back to the previous checkpoint, then this results in an orphan message, m1, and this orphan message in turn will trigger process x also to roll back.

So, process x will roll back to its previous checkpoint. Now, once process x rolls back, the message n1 becomes an orphan, and the previously sent message may arrive here as a duplicate, because if the process restarts from here the sending of the message m1 is replayed and that event becomes n2. So, this message will arrive here as an orphan message, because its send is undone while it is going to be received over here, and this will trigger a second rollback.

If this second rollback happens, then step number one will be repeated again, and both step number one and step number two will be repeated infinitely, which leads to the livelock problem, as explained here. The above sequence can repeat indefinitely, and this is called livelock. So, how the recovery algorithms are going to handle the domino and livelock problems is going to be a difficult task; that we will see in the further slides.

(Refer Slide Time: 22:37)

Different rollback recovery schemes: rollback recovery schemes can be classified into two different types; one is the checkpointing based rollback recovery schemes, the other is the log based rollback recovery schemes. The checkpointing based rollback recovery schemes are further classified into three different types: the first one is called uncoordinated checkpointing, the second one is called coordinated checkpointing, and the third one is called communication induced checkpointing. Coordinated checkpointing is further classified into blocking versus non-blocking checkpointing, and communication induced checkpointing is also divided into two types, called model based and index based checkpointing.

Log based recovery is also classified into three different types: pessimistic logging, optimistic logging and causal logging. Checkpointing based recovery schemes overview: the checkpointing based recovery schemes are of different types, as we have seen in the previous slides.

(Refer Slide Time: 23:32)

Now, we are going to touch upon each of them. The first one is called uncoordinated checkpointing: here each process takes its checkpoints independently. Second, coordinated checkpointing: the processes coordinate their checkpoints in order to save a system-wide consistent state; this consistent set of checkpoints can be used to bound the rollback.

Third, communication induced checkpointing: it forces each process to take checkpoints based on the information piggybacked on the application messages it receives from the other processes. Besides the normal checkpoints, whether coordinated or uncoordinated, it also includes the forced checkpoints, and this further optimizes the checkpointing in communication induced checkpointing.

(Refer Slide Time: 24:47)

So, uncoordinated checkpointing: each process has autonomy in deciding when to take checkpoints. The advantage is the lower runtime overhead during the normal course of execution. Now, it has many disadvantages. The first disadvantage is that uncoordinated checkpointing may lead to the domino effect during recovery; that means the domino effect is not avoided in uncoordinated checkpointing. The second disadvantage is that recovery from a failure is slow, because the processes need to iterate to search for a consistent set of checkpoints, and that is going to take a lot of time.

Also, each process maintains multiple checkpoints and must periodically invoke a garbage collection algorithm, which is another disadvantage of uncoordinated checkpointing, and it is not suitable for applications with frequent output commits, which is yet another disadvantage. So, the processes record the dependencies among their checkpoints caused by the message exchanges during failure-free operation, and tracking these dependencies is very important for restoring the system state after the failure.

(Refer Slide Time: 26:16)

Now, how this dependency is to be tracked we can see in this illustrative example of the direct dependency tracking technique. Assume each process Pi starts execution with an initial checkpoint Ci,0; here Ci,0 and Cj,0 are the initial checkpoints, and there are two processes in this example, each taking its initial checkpoint.

Then we define a checkpoint interval: the checkpoint interval is represented by Ii,x, that is, Ii,x is the interval between the two checkpoints Ci,x-1 and Ci,x of process i, between the (x-1)th and xth checkpoints.

Now, when a process Pj receives a message m during its checkpoint interval Ij,y, it records the dependency from Ii,x to Ij,y, where Ii,x is the checkpoint interval of the sender Pi in which m was sent. So, the flow of a message between these intervals creates a dependency between the two processes, and this dependency is later saved onto stable storage when Pj takes its checkpoint Cj,y.
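
A minimal sketch of this direct dependency tracking is given below; the class and field names, the piggybacked interval index, the dependency set and the list standing in for stable storage are all illustrative assumptions made for this example.

class Process:
    def __init__(self, pid):
        self.pid = pid
        self.interval = 1               # current interval index (after C_{pid,0})
        self.dependencies = set()       # dependencies recorded in this interval

    def send(self, msg):
        return (self.pid, self.interval, msg)    # piggyback (i, x) on the message

    def receive(self, tagged_msg):
        sender, sender_interval, msg = tagged_msg
        # record the dependency I_{sender, x} -> I_{self, y}
        self.dependencies.add(((sender, sender_interval), (self.pid, self.interval)))

    def take_checkpoint(self, stable_storage):
        stable_storage.append((self.pid, self.interval, frozenset(self.dependencies)))
        self.interval += 1              # a new interval starts after the checkpoint
        self.dependencies = set()

stable = []
pi, pj = Process('Pi'), Process('Pj')
pj.receive(pi.send('m'))                # creates the dependency I_{Pi,1} -> I_{Pj,1}
pj.take_checkpoint(stable)              # the dependency is saved with C_{Pj,1}
print(stable)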

So, these dependencies are tracked and stored so that at the time of recovery in uncoordinated checkpointing they can be used to eliminate or avoid the domino effect. Another type of checkpointing is called coordinated checkpointing.

(Refer Slide Time: 28:54)

Coordinated checkpointing is of two types, as we have seen in the previous slides. The first one is blocking checkpointing.

After a process takes a local checkpoint, to prevent orphan messages it remains blocked until the entire checkpointing activity is complete; that means, while the checkpointing is taking place it is not allowed to send further messages, hence it is called blocking. The disadvantage is that the computation is blocked during checkpointing. The other type is non-blocking checkpointing: in this coordinated checkpointing the processes need not stop their execution while taking the checkpoints. The fundamental problem here in coordinated checkpointing is to prevent a process from receiving application messages that could make the checkpoint inconsistent.

(Refer Slide Time: 29:47)

So, here is an example of checkpoint inconsistency: message m is sent by P0 after receiving a checkpoint request from the checkpoint coordinator; assume m reaches P1 before the checkpoint request does. This situation results in inconsistent checkpoints.

(Refer Slide Time: 30:07)

This we can see in this illustrative example. As we were saying, this is the initiator of a checkpoint, so it sends the checkpoint request; the request reaches P0 at this instant, and P0 takes its checkpoint, that is, C0,x.

After that P0 sends a message m to P1, and the initiator's request arrives at P1 only at a later instant; the clocks are not synchronized and the communication channels have delays, so all the checkpoints are not taken simultaneously, and between the checkpoints of these two processes you can see that there is a flow of a message.

So, what is to be done here is that, as soon as the checkpoint request is received and P0 is taking its checkpoint, the first message P0 sends to the next process P1 must itself trigger a checkpoint: this message from P0 reaches P1 before the initiator's message could reach, and it makes P1 take a checkpoint at that instant.

So, the message which P0 sends afterwards will be received after P1 has taken its checkpoint, and this is the correct state. The solution, since the channels are FIFO, is that this problem can be avoided by preceding the first post-checkpoint message on each channel by a checkpoint request, forcing each process to take a checkpoint before receiving the first post-checkpoint message, and that is what is explained here.
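
This rule can be illustrated with a small, assumed event-loop sketch in Python: on a FIFO channel the checkpoint request is queued before the first post-checkpoint application message, so the receiver checkpoints first and only then processes the message. The class, event names and the log are illustrative only.

from collections import deque

class Process:
    def __init__(self, pid):
        self.pid = pid
        self.checkpointed = False
        self.log = []

    def handle(self, event):
        kind, payload = event
        if kind == 'CKPT_REQUEST' and not self.checkpointed:
            self.checkpointed = True
            self.log.append('took checkpoint')
        elif kind == 'APP_MSG':
            when = 'after' if self.checkpointed else 'before'
            self.log.append(f'processed {payload} ({when} checkpoint)')

# FIFO channel from P0 to P1: the checkpoint request precedes the first
# post-checkpoint message m, so P1 checkpoints before processing m.
channel_p0_to_p1 = deque([('CKPT_REQUEST', None), ('APP_MSG', 'm')])
p1 = Process('P1')
while channel_p0_to_p1:
    p1.handle(channel_p0_to_p1.popleft())
print(p1.log)    # ['took checkpoint', 'processed m (after checkpoint)']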

(Refer Slide Time: 32:23)

Now, the third type of checkpointing is called communication induced checkpointing. There are two types of checkpoints in this category, the autonomous and the forced checkpoints, as explained before. In communication induced checkpointing, protocol-related information is piggybacked on each application message. The receiver of each application message uses this piggybacked information to determine whether it has to take a forced checkpoint to advance the global recovery line; the forced checkpoint must be taken before the application may process the contents of the message. In contrast with coordinated checkpointing, no special coordination messages are exchanged.

So, the autonomous checkpoints are the normal checkpoints, and the forced checkpoints are induced through the information carried on the messages, which, as mentioned here, advances the global recovery line.

(Refer Slide Time: 33:32)

So, it further optimizes the recovery process. There are two types of communication induced checkpointing: one is called model based checkpointing, and the other is called index based checkpointing. In model based checkpointing the system maintains checkpoint and communication structures that prevent the domino effect or achieve some even stronger properties. In index based checkpointing the system uses an indexing scheme for the local and forced checkpoints, such that the checkpoints of the same index at all processes form a consistent state.

(Refer Slide Time: 34:09)

Log based rollback recovery schemes: log based rollback recovery combines checkpointing with logging of nondeterministic events. It relies on the piecewise deterministic assumption, which postulates that all the nondeterministic events that a process executes can be identified, and that the information necessary to replay each event during recovery can be logged in the event's determinant.

So, by logging and replaying the nondeterministic events in their exact original order, a process can deterministically recreate its pre-failure state even if this state has not been checkpointed. A nondeterministic event is, for example, the receipt of a message: in this particular example, for the message which is received at this end it cannot be determined or known when it is going to be received; but the send of a message, or the start of a process P0, is called a deterministic event.

So, the important thing is that for these nondeterministic events their determinants are to be found out and stored. The receipt of a message is nondeterministic and the send of a message is a deterministic event.

(Refer Slide Time: 35:53)

So, log based rollback recovery makes use of deterministic and nondeterministic events in the computation. As I explained in the previous slide, a nondeterministic event can be the receipt of a message from another process, as here between process i and process j: the receipt of a message is nondeterministic.

A message send event is not a nondeterministic event; the send is deterministic. Now consider the execution of the process P0 here.

(Refer Slide Time: 36:47)

From its start, the execution of P0 is a sequence of four deterministic intervals, as shown here: this is deterministic interval one, this is interval two, this is three and this is four. The first one starts with the creation of the process, while the remaining three start with the receipt of the messages m0, m3 and m7 respectively.

So, the send event of the message m2 is uniquely determined by the initial state of the process P0 and by the receipt of the message m0, and therefore it is not a nondeterministic event; that means, in relation to the start of the process you can determine the instant when this particular message is sent.

(Refer Slide Time: 38:12)

Hence, it is not nondeterministic, it is a deterministic event. No-orphans consistency condition: let e be a nondeterministic event that occurs at a process p. Depend(e) is the set of processes that are affected by the nondeterministic event e; this set consists of p and any process whose state depends on the event e according to Lamport's happened-before relation.

Then Log(e) is the set of processes that have logged a copy of e's determinant in their volatile memory, and Stable(e) is a predicate that is true if e's determinant is logged on stable storage. The always-no-orphans condition is then: for every nondeterministic event e, not Stable(e) implies Depend(e) ⊆ Log(e).

(Refer Slide Time: 38:58)

So, that means, if this condition is satisfied then there is no possibility of any orphan message: it says that if the determinant of an event is not yet stored on stable storage, then every process that depends on the event has logged a copy of the event's determinant in its volatile log.

Log based recovery schemes differ in the way the determinants are logged onto stable storage, and they are of three types: pessimistic logging, optimistic logging and causal logging. Pessimistic logging: the application has to block, waiting for the determinant of each nondeterministic event to be stored on stable storage, before the effects of that event can be seen by the other processes or the outside world.

Optimistic logging: the application does not block, and the determinants are spooled to stable storage asynchronously. Causal logging: no failure-free overhead.

(Refer Slide Time: 40:04)

And simpler recovery: these are combined in causal logging by striking a balance between optimistic and pessimistic logging. Pessimistic logging protocols assume that a failure can occur after any nondeterministic event in the computation; however, in reality failures are rare. Synchronous logging: if an event has not been logged on stable storage, then no process can depend on it; this is stronger than the always-no-orphans condition.

(Refer Slide Time: 40:30)

So, here is an example of pessimistic logging. Suppose P1 and P2 fail at this particular instant; then they will restart from checkpoints B and C and roll forward, using their determinant logs to deliver again the same sequence of messages as in the pre-failure execution. Once the recovery is complete, both processes will be consistent with the state of P0, which includes the receipt of message m7 from P1.
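
A minimal sketch of this pessimistic (synchronous) logging idea is given below: the determinant of every message receipt is appended to (simulated) stable storage before the message is processed, so after a crash the logged receipts can be replayed in their original order. The class, the determinant format and the replay loop are illustrative assumptions, not the lecture's code.

class PessimisticProcess:
    def __init__(self, pid, stable_log):
        self.pid = pid
        self.stable_log = stable_log         # stands in for stable storage
        self.receive_seq = 0

    def receive(self, sender, msg):
        self.receive_seq += 1
        determinant = (self.pid, self.receive_seq, sender, msg)
        self.stable_log.append(determinant)  # blocking write BEFORE processing the event
        self.process(msg)

    def process(self, msg):
        print(f'{self.pid} processed {msg}')

    def replay(self):
        # After a crash, re-deliver the logged receipts in their original order.
        for _, _, sender, msg in [d for d in self.stable_log if d[0] == self.pid]:
            self.process(msg)

stable_log = []
p1 = PessimisticProcess('P1', stable_log)
p1.receive('P0', 'm1')
p1.receive('P2', 'm2')
p1.replay()     # deterministically recreates the pre-failure state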

(Refer Slide Time: 41:17)

So, that means, using the logging, this will be taken care of. Optimistic logging: the processes log determinants asynchronously to stable storage, optimistically assuming that the logging will be complete before a failure occurs; they do not implement the always-no-orphans condition. To perform the rollback correctly, optimistic logging protocols track causal dependencies during the failure-free execution. Optimistic logging protocols also require non-trivial garbage collection schemes: pessimistic protocols need only keep the most recent checkpoint of each process, whereas optimistic protocols need to keep track of multiple checkpoints, and that requires more memory for storing the checkpoints.

(Refer Slide Time: 42:04)

So, in optimistic logging consider this example: suppose P2 fails before the determinant for m5 is logged onto stable storage. Process P1 then becomes an orphan process and must roll back to undo the effects of receiving the orphan message m6. The rollback of P1 further forces P0 to roll back to undo the effects of receiving the message m7, because the send of m7 is no longer recorded once P1 rolls back to this point. So, this is called optimistic logging.

(Refer Slide Time: 42:59)

Causal logging combines the advantages of both pessimistic and optimistic logging at the expense of a more complex recovery protocol.

(Refer Slide Time: 43:09)

So, this particular example shows causal logging. Now, checkpointing and recovery algorithms.

(Refer Slide Time: 43:18)

Koo-Toueg coordinated checkpointing algorithm: Koo and Toueg in 1987 proposed a coordinated checkpointing and recovery technique that takes a consistent set of checkpoints and avoids the domino effect and livelock problems during recovery.

So, this is the algorithm which we will use to solve the problems of the domino effect and livelock through coordinated checkpointing, and it includes two parts; the first is called the checkpointing algorithm.

(Refer Slide Time: 43:54)

The other part is called the recovery algorithm. In the checkpointing algorithm the assumptions are that the channels are FIFO, that end-to-end protocols are assumed, that communication failures do not partition the network, that there is a single process initiating the algorithm, and that no process fails during the execution of the algorithm.

With these assumptions the checkpointing algorithm is as follows. There are two kinds of checkpoints in this algorithm; the first is called permanent and the other is called tentative checkpoints. A permanent checkpoint is a local checkpoint that is part of a consistent global checkpoint; a tentative checkpoint is a temporary checkpoint that becomes a permanent checkpoint when the algorithm terminates successfully.

(Refer Slide Time: 44:45)

Now, the checkpointing algorithm has two phases. In phase one, the initiating process
takes a tentative checkpoint and requests all other processes to take tentative
checkpoints; no process may send a message after taking its tentative checkpoint. All
processes will finally arrive at a single common decision, do or discard. In the second
phase, all processes receive this final decision from the initiating process and act
accordingly.

Now, correctness rests on two properties: either all or none of the processes take
permanent checkpoints, and no process sends a message after taking a permanent
checkpoint; these two conditions together ensure correctness. As an optimization, not all
of the processes need to take checkpoints: if nothing has changed at a process since its
last checkpoint, that process can skip the checkpointing phase. A sketch of the two phases
follows below.
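A minimal sketch of this two-phase structure, assuming a reliable FIFO transport and hypothetical helper methods (take_tentative_checkpoint and so on); it illustrates the flow only, not the full Koo-Toueg protocol:

    def initiate_checkpoint(initiator, others):
        # Phase 1: take a tentative checkpoint and ask everyone else to do the same;
        # until the decision arrives, no process sends application messages.
        initiator.take_tentative_checkpoint()
        willing = [p.request_tentative_checkpoint() for p in others]
        decision = "make_permanent" if all(willing) else "discard"

        # Phase 2: broadcast the single common decision; everyone acts on it.
        for p in [initiator] + others:
            if decision == "make_permanent":
                p.promote_tentative_to_permanent()
            else:
                p.discard_tentative_checkpoint()
        return decision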

329
(Refer Slide Time: 45:51)

Now, the other part of this algorithm is the rollback recovery algorithm, which restores
the system state to a consistent state after a failure. The assumptions are that there is
a single initiator and that the checkpointing and rollback recovery algorithms are not
invoked concurrently. Rollback recovery has two phases. In the first phase, the initiating
process sends a message to all other processes and asks whether they are willing to
restart from a previous checkpoint; all need to agree on either do or do not.

(Refer Slide Time: 46:43)

330
In the second phase, the initiating process sends the final decision to all processes,
and all processes act accordingly after receiving the final decision for the rollback. We
can explain what this algorithm does through this example: suppose process P wants to
establish a checkpoint P 3; this checkpoint will record that message Q 1 was received from
process Q. To prevent Q 1 from being orphaned, Q must also take a checkpoint after sending
Q 1. Thus, establishing the checkpoint P 3 forces Q to take a checkpoint to record that
Q 1 was sent.

So, we can say that the checkpoint P 3 forces process Q to take the checkpoint Q 2. An
algorithm for such coordinated checkpointing has two types of checkpoints, the tentative
checkpoint and the permanent checkpoint, as explained earlier. Process P first records its
current state in a tentative checkpoint and then sends a message to every other process
from which it has received a message since taking its last checkpoint.

This set of processes is exactly the set of processes that are required to take a
checkpoint, and that is how the checkpointing is coordinated as far as process P is
concerned.

(Refer Slide Time: 48:16)

So, the message tells each process the last message that P received from it before the
tentative checkpoint was taken. If that particular message was not recorded in a

331
checkpoint by Q, then to prevent it from being orphaned, Q is asked to take a tentative
checkpoint, and this is what is done here in this case.

(Refer Slide Time: 48:44)

Now, as far as rollback recovery is concerned, there is a possibility of optimization, and
correctness means that the system resumes from a consistent state. Take this example: if
there is a failure, the failed process rolls back and selects x 2 as its checkpoint, a
consistent checkpoint. This makes the message shown orphan, which in turn forces process Y
to roll back to its own checkpoint. So, the recovery line is formed like this.

In this case the algorithm, using its rollback, is able to find the recovery line; but you
can see here that at process Z no event or activity has happened since its previous
checkpoint. So, there is no need to roll back process Z, and that is the optimization: we
need not recover all the processes, since some of the processes did not change anything
after their last checkpoint. The processes at which events have happened since the last
checkpoint have to roll back, but process Z does not have to roll back.

So, there is a possibility of optimizing this algorithm in terms of the number of
processes that are required to roll back. A sketch of the two-phase rollback recovery
follows below.
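A corresponding sketch of the two-phase rollback recovery, again with hypothetical helper names, including the optimization of skipping processes with no activity since their last checkpoint:

    def initiate_rollback(initiator, processes):
        # Phase 1: ask every process whether it agrees to restart from its
        # last permanent checkpoint; all must agree on do or do not.
        willing = [p.ask_restart_preference() for p in processes]
        decision = "rollback" if all(willing) else "abort"

        # Phase 2: broadcast the final decision; every process acts on it.
        for p in processes:
            if decision == "rollback":
                # optimization: a process with no events since its last checkpoint
                # (like process Z in the example) need not roll back
                if p.has_activity_since_last_checkpoint():
                    p.restore_last_permanent_checkpoint()
            else:
                p.continue_normally()
        return decision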

332
(Refer Slide Time: 50:34)

Now, a few other algorithms for checkpointing and rollback recovery are summarized here:
the Juang and Venkatesan algorithm for asynchronous checkpointing and recovery; the
Manivannan and Singhal quasi-synchronous checkpointing algorithm; the Peterson and Kearns
algorithm, which is based on vector time; and the Helary and Mostefaoui
communication-induced checkpointing algorithm.

(Refer Slide Time: 51:11)

333
Conclusion: rollback recovery achieves fault tolerance by periodically saving the state of
a process during failure-free execution and restarting from the saved state on a failure,
to reduce the amount of lost computation; we have seen this happening in several
algorithms. There are three basic approaches to checkpoint-based failure recovery, namely
uncoordinated, coordinated and communication-induced checkpointing, and for log-based
recovery we have seen pessimistic, optimistic and causal logging.

Over the last two decades, checkpointing and failure recovery has been a very active area
of research, and several checkpointing and failure recovery algorithms have been proposed
for designing distributed applications. In this lecture, we have described the Koo-Toueg
coordinated checkpointing algorithm and given an overview of other algorithms. In the
upcoming lecture, we will discuss deadlock detection in distributed systems.

Thank you.

334
Distributed Systems
Prof. Rajiv Misra
Department of Computer Science and Engineering
Indian Institute of Technology, Patna

Lecture – 12
Deadlock Detection in Distributed Systems

Lecture 12; deadlock detection in distributed systems.

(Refer Slide Time: 00:19)

(Refer Slide Time: 00:21)

335
Preface; recap of previous lecture: in the previous lecture, we discussed checkpointing
and rollback recovery, and also discussed different rollback recovery schemes, that is,
checkpoint-based and log-based rollback recovery.

Content of this lecture: in this lecture, we will discuss the basic fundamentals of
deadlock detection in distributed systems, and also discuss different classes of
algorithms, such as path-pushing, edge-chasing, diffusion computation and global state
detection, as methods for designing distributed deadlock detection algorithms.

(Refer Slide Time: 01:03)

Introduction: deadlocks are a fundamental problem in distributed systems. A process may
request resources in any order, which may not be known a priori, and a process can request
a resource while holding others. If the sequence of allocation of resources to the
processes is not controlled, a deadlock can occur. So, a deadlock is a state where a set
of processes requests resources that are held by other processes in the same set.

336
(Refer Slide Time: 01:43)

We now define the system model which we are going to use in the deadlock detection
algorithms. A distributed program is composed of a set of n asynchronous processes p 1 to
p n that communicate by message passing over the communication network; without loss of
generality, we assume that each process is running on a different processor.

The processors do not share a common global memory and communicate only by passing
messages over the communication network; also, there is no physical global clock in the
system to which the processes have instantaneous access. The communication medium may
deliver messages out of order; messages may be lost, garbled or duplicated due to timeout
and retransmission; processors may fail; and communication links may go down. All these
different possibilities can happen.

337
(Refer Slide Time: 02:50)

This is because there are different sets of processors and processes which are separated
geographically and connected by the communication network, and they communicate by passing
messages through the communication network, which may have these kinds of problems.

So, the following assumptions are made: first, the systems have only reusable resources;
second, processes are allowed to make only exclusive access to resources; third, there is
only one copy of each resource. A process can be in one of two states, that is, running or
blocked.

In the running state, a process has all the needed resources and is either executing or
ready for execution. In the blocked state, a process is waiting to acquire some resource.
Next, we define a data structure called the wait-for graph, which is used here for
deadlock detection.

338
(Refer Slide Time: 04:20)

The state of the system can be modeled by a directed graph called the wait-for graph
(WFG). In a wait-for graph, the nodes are the processes, and there is a directed edge from
node P i to node P j if P i is blocked and is waiting for P j to release some resource.
So, we can express the wait-for graph using P i and P j: there is a directed edge from P i
to P j if process P i is blocked and is waiting for P j to release the resource. This kind
of graph is called a wait-for graph.

A system is deadlocked if and only if there exists a directed cycle or a knot in the
wait-for graph. For example, we can have another process P k such that P j is waiting for
P k and P k is waiting for P i; then there is a cycle in this graph. Here the cycle means
a set of waiting processes. This cycle is also a knot, because there is no outgoing edge
leaving the cycle; and if some edge, say from P l, points into it, it is still a knot.

So, this is an example of a cycle or a knot, and whether this set of waiting processes
puts the system in deadlock depends upon the model of resources for which they are
waiting. These models we are going to consider in the next slides; a small sketch of cycle
and knot detection on a wait-for graph follows.

Here is an example of a wait-for graph: process P 11 of site 1 has an edge to P 21 and an
edge to P 32. So, process P 11 is simultaneously waiting for two resources, which can be
released, or given, by process P 21 and process P 32.

339
(Refer Slide Time: 07:18)

Now, P 21 is also waiting for P 24, P 24 is waiting for P 54, and P 54 is waiting for
P 11. So, transitively, P 21 is also waiting for P 11, and this forms a cycle; all four
processes involved in the cycle may be in deadlock, depending upon the request model. This
is a cycle, but since one edge goes out of it, it cannot be a knot.

Now, preliminaries: deadlock handling strategies. There are three strategies for handling
deadlock, that is, deadlock prevention, deadlock avoidance and deadlock detection.
Handling of deadlock becomes complicated in a distributed system because no site has
accurate knowledge of the current state of the entire system, and because every inter-site
communication involves a finite and unpredictable delay.

340
(Refer Slide Time: 08:20)

So, this scenario of a distributed network in a distributed system makes deadlock handling
a bit difficult.

Now, one strategy for handling deadlock is called deadlock prevention. It is commonly
achieved either by having a process acquire all the needed resources simultaneously before
it begins execution, or by preempting a process which holds the needed resources. This
approach is highly inefficient, because it is not always possible to acquire all the
resources needed for the execution in advance, and it can be very costly to preempt a
process, release its resources and allocate them to the requesting process in order to
continue. So, deadlock prevention is highly inefficient, inapplicable or impractical for
the distributed environment.

341
(Refer Slide Time: 09:38)

Another method of handling deadlock is deadlock avoidance. In deadlock avoidance in a
distributed system, a resource is granted to a process only if the resulting global system
state is safe; note that the global state includes all the processes and resources of the
distributed system. So, resources are allocated in such a way that the resulting global
state is safe. However, due to several problems, deadlock avoidance is also impractical in
a distributed system.

Deadlock detection requires examination of the status of process-resource interactions for
the presence of a cyclic wait. Deadlock detection in a distributed system seems to be the
best approach to handle deadlocks, because it is the most convenient in this model.

342
(Refer Slide Time: 10:38)

Issues in deadlock detection: that is why we are going to discuss only one strategy,
deadlock detection, because it is the most practical way to handle deadlocks in the
distributed system scenario. Deadlock handling using the approach of deadlock detection
entails addressing two basic issues: the first deals with the detection of deadlocks, and
the second deals with the resolution of detected deadlocks. Detection of deadlocks in turn
involves addressing two issues: the first is the maintenance of the wait-for graph (WFG),
and the second is searching this wait-for graph for the presence of a cycle or a knot.

(Refer Slide Time: 11:28)

343
Different algorithms will follow different schemes for these two tasks. Correctness
criteria: a deadlock detection algorithm must satisfy the following two conditions to
ensure correctness.

The first is progress: the algorithm must detect all existing deadlocks in a finite amount
of time. In other words, after all the wait-for dependencies for a deadlock have formed,
the algorithm should not wait for any more events to occur in order to detect the
deadlock.

The second condition for correctness is called safety: the algorithm should not report
deadlocks which do not exist; such reported deadlocks are called phantom or false
deadlocks. So, safety ensures that the algorithm always reports a correct stable state,
that is, a real deadlock, while progress means that all existing deadlocks must be
identified in a finite amount of time.

(Refer Slide Time: 12:44)

344
(Refer Slide Time: 12:50)

Now, resolution of the detected deadlocks: deadlock resolution involves breaking the
existing wait-for dependencies between the processes to resolve the deadlock. It involves
rolling back one or more deadlocked processes and assigning their resources to the other
blocked processes so that they can resume execution in the distributed system.

Models of deadlocks: distributed systems allow different kinds of resource requests, which
are represented by different models. A process might require a single resource or a
combination of resources for its execution.

A hierarchy of request models is described as follows. The first is the single resource
model: in the single resource model, a process can have at most one outstanding request
for only one unit of a resource, so the maximum out-degree of a node in the wait-for graph
is one. The presence of a cycle in the wait-for graph then indicates that there is a
deadlock; a cycle in the single resource model means the system is in deadlock.

AND model: in the AND model, a process can request more than one resource simultaneously,
and the request is satisfied only after all the requested resources are granted to the
process; that is why it is called the AND model. The out-degree of a node in the wait-for
graph for the AND model can be more than one. The presence of a cycle in the wait-for
graph indicates a deadlock in the AND model. Since in the single resource model a process can

345
have at most one outstanding request, the AND model is more general than the single
resource model.

(Refer Slide Time: 14:22)

Let us consider the example of the wait-for graph in figure one, treating it as an AND
model. P 11 has two outstanding resource requests, as we have seen here; so P 11 shall
become active from idle only after both the resources are granted to it.

(Refer Slide Time: 15:01)

346
So, there is a cycle, which corresponds to a deadlock situation, because in the AND model
a cycle means a deadlock. Moreover, a process may not be part of the cycle and still be
deadlocked.

Another example here shows that a process may not be in a cycle yet still be in deadlock:
for example, P 44 is not in a cycle, but P 44 is in a deadlock situation. So, it is not
part of the cycle, but it is deadlocked.

The OR model: in the OR model, a process can make requests for numerous resources
simultaneously, and the request is satisfied if any one of the requested resources is
granted. The presence of a cycle in the wait-for graph does not imply a deadlock in the OR
model.

Let us consider the same example where all the nodes are OR nodes; then process P 11 is
not in a deadlock, because P 33 can finish its execution and release its resources.

(Refer Slide Time: 16:15)

Once the resources are released by P 33, they can be allocated to P 32, and P 32 will also
become active after the resources are allocated.

After P 32 finishes, its resource can be given, or allocated, to P 11. Since it is an OR
model, once this resource is allocated, P 11 will become active; it can start its

347
execution, and that will break the cycle. So, in the OR model it is the presence of a knot
that indicates a deadlock; here this particular condition is only a cycle, not a knot. The
AND-OR model is a mixed model: it is a generalization of the two models, AND and OR. In
the AND-OR model, a request may specify any combination of AND and OR in the requested
resources.

(Refer Slide Time: 17:45)

For example, in the AND-OR model a request for multiple resources can be of the form x
AND (y OR z). To detect the presence of a deadlock in such a model, there is no familiar
construct of graph theory that can be applied to the wait-for graph; hence the deadlock is
detected using its stable property.

A deadlock in the AND-OR model can be detected by repeated application of the test for
OR-model deadlock to check the stable property, because a deadlock is, in the end, nothing
but a condition satisfying the stable property.

Now, another model is called the P out of Q model. This form of the AND-OR model allows a
request to obtain any k available resources from a pool of n resources. It has the same
expressive power as the AND-OR model we have seen earlier; however, the P out of Q model
lends itself to a much more compact formulation of a request. Every request in the P out
of Q model can be expressed in the form of an AND-OR graph and vice versa. Note that an
AND request for P resources can be stated as P out of P, meaning all P resources are
required (that is the AND model), and an OR

348
model request for P resources can be stated as 1 out of P (that is the OR model). So, P
out of Q requests can be expressed in these two forms, the OR and the AND model.

(Refer Slide Time: 18:45)

Unrestricted model: in the unrestricted model, no assumptions are made regarding the
underlying structure of the resource requests; the only assumption made is that deadlock
is stable. Hence this is the most general model.

(Refer Slide Time: 19:29)

This model helps separate concerns: the properties of the problem, that is, stability and
deadlock, are separated from the underlying distributed computation.

349
(Refer Slide Time: 19:55)

Classification of distributed deadlock detection algorithms; Knapp's classification: Knapp
has classified the distributed deadlock detection algorithms into four classes. The first
is path-pushing, the second is edge-chasing, the third is diffusion computation, and the
fourth is global state detection. For each class, different algorithms that use the
corresponding strategy are listed here; these four strategies thus classify the various
distributed deadlock detection algorithms. Let us see these strategies one by one.

Path-pushing algorithms: in path-pushing algorithms, distributed deadlocks are detected by
maintaining an explicit global wait-for graph. The basic idea is to build the global
wait-for graph at each site of the distributed system.

350
(Refer Slide Time: 21:12)

In this class of algorithms, at each site, whenever a deadlock computation is performed,
the site sends its local wait-for graph to all the neighboring sites. After the local data
structure of each site is updated, this updated wait-for graph is then passed along to the
other sites, and this procedure is repeated until some site has a sufficiently complete
picture of the global state to announce a deadlock or to establish that no deadlocks are
present.

The name path-pushing is used because the local data structure is sent along the paths to
different processes connected by the communication network. That means the local wait-for
graph constructed by a particular node is sent along the paths, and once these pieces are
collected, some site will get a complete picture of the global state and will announce a
deadlock or establish that no deadlock is present. This feature of sending around the
paths of the global wait-for graph has led to the term path-pushing algorithms. The
algorithms which use this strategy are classified as path-pushing algorithms, which we
will see later on.

351
(Refer Slide Time: 22:56)

Edge-chasing algorithms: in edge-chasing algorithms, the presence of a cycle in a
distributed graph structure is verified by propagating special messages called probes
along the edges of the graph. Probe messages are different from the request and reply
messages of the computation.

The formation of a cycle can be detected by a site if it receives a matching probe sent by
it previously. Whenever a process that is executing receives a probe message, it discards
the message and continues; that is, if an active process receives a probe, it simply
discards the message, because a deadlock involves only a set of blocked processes, and a
process that is still working is not part of the deadlock; hence it discards the message
and continues its current execution.

Only the blocked processes propagate the probe messages along their outgoing edges. The
main advantage of edge-chasing algorithms is that probes are of fixed and small size;
hence the message-size overhead is minimal, and that is the advantage of edge-chasing
algorithms.

Diffusion computation based algorithms: in diffusion computation based distributed
deadlock detection algorithms, the deadlock detection computation is diffused through the
wait-for graph of the system. These algorithms make use of echo algorithms to detect the

352
deadlock. This computation is superimposed on the underlying distributed computation;
hence no separate execution for deadlock detection takes place.

(Refer Slide Time: 25:01)

So, if the computation terminates, the initiator declares a deadlock. To detect a
deadlock, a process sends out query messages along all the outgoing edges in the wait-for
graph; these queries are successively propagated, that is, diffused, through the edges of
the wait-for graph. When a blocked process receives the first query message for a
particular deadlock detection initiation, it does not send a reply message until it has
received a reply for every query it sent; for all subsequent queries of this deadlock
detection initiation, it immediately sends back a reply message. The initiator of the
deadlock detection detects a deadlock when it receives a reply for every query it has sent
out. A small sketch of this query-reply bookkeeping follows below.

353
(Refer Slide Time: 25:56)

Global state detection based algorithms: global state detection based deadlock detection
algorithms exploit the following facts. First, a consistent snapshot of a distributed
system can be obtained without freezing the underlying computation, as we have seen in the
Chandy-Lamport algorithm. Second, if a stable property holds in the system before snapshot
collection is initiated, this property will still hold after the snapshot is available,
that is, it will be captured in the snapshot. Therefore, distributed deadlocks can be
detected by taking a snapshot of the system and examining it for the deadlock condition.

Deadlock detection algorithms: we are now going to discuss a deadlock detection algorithm
given by Mitchell and Merritt, called the Mitchell-Merritt algorithm, which is based on
the edge-chasing approach discussed in the previous slides. Mitchell and Merritt's
algorithm, given in 1984, assumes a single resource model and detects local and global
deadlocks. Each process is assumed to have two different labels, private and public; each
label incorporates the process id, which guarantees uniqueness, and the algorithm
guarantees that only one process will detect a deadlock, which is why this method is
popular.

354
(Refer Slide Time: 27:25)

Tokens and control information are sent on the same channel, making use of FIFO
guarantees, so no separate synchronization mechanism is required in this algorithm. The
algorithm belongs to the class of edge-chasing algorithms, where the probes are sent in
the opposite direction of the edges of the wait-for graph. When the probe initiated by a
process comes back to it, the process declares a deadlock. Only one process in the cycle
detects the deadlock, which simplifies deadlock resolution: this process can abort itself
to resolve the deadlock.

(Refer Slide Time: 27:58)

355
(Refer Slide Time: 28:07)

Each node in the wait-for graph has two variables, called labels: a private label and a
public label. The private label is unique to the node at all times, though it is not a
constant, while the public label can be read by other processes and may not be unique.
Each process is represented as u / v, where u and v are the public and private labels
respectively. Initially the private and public labels are equal. A global wait-for graph
is maintained, and it defines the entire state of the system. The algorithm is defined by
four state transitions, shown in the next figure, where z = inc(u, v) yields a unique
label greater than both u and v; labels that are not shown do not change.

356
(Refer Slide Time: 29:28)

The first transition is block, which creates an edge in the wait-for graph, as we will see
in the next slide. Two messages are needed: one resource request, and one message back to
the blocked process to inform it of the public label of the process it is waiting for.

Another transition is activate, which denotes that a process has acquired the resource
from the process it was waiting for. The next transition is transmit: transmit propagates
larger labels in the opposite direction of the edges by sending probe messages.

357
(Refer Slide Time: 30:28)

Let us see all four state transitions in this picture: block, activate, transmit and
detect. Whenever a process receives a probe which is less than its public label, it simply
ignores that probe. Detect means that the probe carrying the private label of some process
has returned to it, indicating a deadlock. The above algorithm can be easily extended to
include priorities, so that whenever a deadlock occurs the lowest priority process gets
aborted, which resolves the deadlock.

Now let us see the algorithm in more detail. Every node has a public and a private label,
and both are non-decreasing; additionally, no two nodes ever have the same private label.
When node x begins to wait on node y, node x updates its public label to be the maximum of
the two public labels plus one. When a node discovers that a node it is waiting on has a
larger public label than its own, it replaces the value of its public label with that
larger one.

358
(Refer Slide Time: 31:25)

This algorithm has the effect of circulating successively larger public labels in the
reverse direction of the edges of the wait-for graph. If a deadlock truly exists, then a
node will eventually see its own public label on the process for which it waits.

(Refer Slide Time: 32:01)

To repeat the key rules from the four state transitions: whenever a process receives a
probe which is less than its public label, it simply ignores that probe; and the algorithm
can be modified so that, upon detection, additional information needed for deadlock
resolution is also revealed.

359
(Refer Slide Time: 32:28)

Let us understand this algorithm through an example. Initially a node has equal public and
private labels u and v. When a node blocks, its public label increases to the maximum of
the two public labels plus one: earlier the labels were 1 and 3, so the new label is
max(1, 3) + 1 = 4, and 4 is the new label of the blocked process as this edge is added to
the wait-for graph. Similarly, if we add another edge, we take the maximum of 4 and 5,
plus one, which is 6; so in the next slide this node is shown as 6 5.

Now, if we add another edge like this, then max(3, 6) + 1 = 7, so 7 is assigned here.
These labels then move in the opposite direction of the edges of the wait-for graph, so
that the labels get updated in the next steps of the algorithm: the label 7 is transmitted
and the next node gets updated, then 7 is transmitted again, going in the opposite
direction, so that nodes take the higher value of the label. Finally, when that same label
7 comes back to the process that generated it, this process understands that the system is
in a state of deadlock, because it has detected a cycle. A small sketch of these label
transitions follows below.
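A small sketch of the label bookkeeping in these transitions, following the u/v (public/private) notation of the slides. Message passing and the activate transition are omitted, the helper names are illustrative, and the choice of setting both labels on a block follows one common formulation of the algorithm:

    class Node:
        def __init__(self, label):
            self.public = label        # u: readable by others, may be overwritten
            self.private = label       # v: stays unique to this node
            self.waiting_on = None

    def block(p, q):
        # p starts waiting on q: p takes a fresh label greater than both public
        # labels, e.g. max(1, 3) + 1 = 4 in the example; a real implementation
        # also pairs z with the process id so that z stays globally unique
        z = max(p.public, q.public) + 1
        p.public = p.private = z       # common formulation: both labels become z
        p.waiting_on = q

    def transmit(p):
        # the larger public label travels opposite to the wait-for edge: a blocked
        # process adopts a bigger public label seen on the process it waits for
        q = p.waiting_on
        if q is not None and q.public > p.public:
            p.public = q.public

    def detect(p):
        # deadlock: p sees its own label (public equal to private) come back on
        # the process it is waiting for
        q = p.waiting_on
        return q is not None and q.public == p.public == p.private

    p, q = Node(1), Node(3)
    block(p, q)                        # p is now labelled 4/4, as in the example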

360
(Refer Slide Time: 34:32)

Now, the message complexity of this algorithm: if we assume that a deadlock persists long
enough to be detected, the worst-case complexity of the algorithm is s(s-1)/2 transmit
steps, where s is the number of processes in the cycle.
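For instance, with this bound a cycle of s = 4 processes needs at most 4(4 - 1)/2 = 6 transmit steps before the largest label travels around the cycle and its owner detects the deadlock.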

Other algorithms which deal with distributed deadlock detection are summarized here in
this table. As you can see, all these algorithms use the different design strategies we
discussed in the previous slides: they are either path-pushing, edge-chasing, diffusion
computation or global state detection algorithms.

361
(Refer Slide Time: 35:15)

All these algorithms follow the principles which we have covered. For example, in
path-pushing, a site generates a local wait-for graph and pushes it along the paths, so
that a single node ends up with the complete information, that is, the complete global
state, and then decides using that global state; these algorithms are designed based on
that scenario. Similarly, for edge-chasing, we have seen one example of an edge-chasing
algorithm, and similar algorithms are also used with different resource request models.

(Refer Slide Time: 36:29)

362
Diffusion computation based algorithms are typically defined on the OR request model, and
finally, a global state detection algorithm is given for the resource request model which
is the P out of Q model. Conclusion: out of the three approaches to handle deadlocks,
deadlock detection is the most promising in distributed systems. Detection of a deadlock
requires performing two tasks: first, maintaining the wait-for graph, and second,
searching the wait-for graph for a cycle or a knot, which, depending on the model, tells
us whether there is a deadlock or not. Distributed deadlock detection algorithms can be
classified into four different classes, which we have seen: path-pushing, edge-chasing,
diffusion computation and global state detection.

In this lecture, we have discussed one algorithm, the Mitchell and Merritt algorithm for
the single resource model, which is based on the technique called edge-chasing. In the
upcoming lecture, we will discuss distributed shared memory.

Thank you.

363
Distributed Systems
Prof. Rajiv Misra
Department of Computer Science and Engineering
Indian Institute of Technology, Patna

Lecture – 13
Distributed Shared Memory

(Refer Slide Time: 00:13)

Lecture 13 distributed shared memory; preface, Recap of previous lecture.

(Refer Slide Time: 00:22)

364
In the previous lecture we discussed the basic fundamentals of distributed deadlock
detection and different classes of algorithms, such as path-pushing, edge-chasing,
diffusion computation, and global state detection, for detecting deadlocks in distributed
systems. Content of this lecture: In this lecture, we will discuss the concept of
distributed shared memory, provide different ways to classify distributed shared memories,
cover consistency models of distributed shared memory, and also discuss Lamport’s Bakery
algorithm for shared memory mutual exclusion.

(Refer Slide Time: 01:06)

Introduction: Distributed shared memory is an abstraction provided to the programmer of a
distributed system. It gives the impression of a single monolithic memory, as in the
traditional von Neumann architecture. Programmers access data across the network using
only read and write primitives, as they would do in a uniprocessor system. Programmers do
not have to deal with send and receive communication primitives, nor with the complexity
of dealing explicitly with synchronization and consistency as in the message passing
model. So, all these intricacies are bypassed if the programmers are given a high-level
abstraction, which is called distributed shared memory.

365
(Refer Slide Time: 02:01)

Distributed shared memory abstractions: they provide the abstraction to programmers so
that they can communicate using read and write operations in a shared virtual space. No
send and receive primitives are to be used by the application; under the covers, send and
receive are used by the distributed shared memory manager. Locking is too restrictive
here, since we also need concurrent access. With replica management, the problem of
consistency arises; so a consistency model weaker than that of the von Neumann
architecture is required in this scenario.

Let us look at this picture to understand the placement of distributed shared memory in
the system architecture. In this figure, every processor has its own memory. Out of the
memory available with each processor, some part is assigned to the distributed shared
memory and the remainder is used as local memory. The memory that is now shared by the
different processors is managed by a module called the memory manager, and this memory
manager gives a single view of a monolithic memory, called the shared memory, which is
realized using the memory managers in the system.

So, that was the architecture; the placement of the shared virtual memory, as you can see,
lies here within the distributed shared memory and its memory manager, and

366
the processes of the application communicate with the memory manager through two different
constructs: one is called invocation, the other is called response.

(Refer Slide Time: 04:38)

So, invocation and response are the primitives used to access the memory as if it were
memory in the von Neumann architecture.

The advantages of distributed shared memory: it shields the programmer from the send and
receive primitives; that is, the programmers have to use only the read and write
primitives, and the realization of read and write on the distributed system, using send
and receive underneath, is completely abstracted away. There is a single address space,
and it simplifies passing by reference and passing of complex data structures. Once a
single address space is available, programming becomes easier, and constructs like passing
by reference and passing complex data structures become convenient for the programmer
using distributed shared memory.

It exploits locality of reference when a block is moved. Distributed shared memory uses
simpler software interfaces and cheaper off-the-shelf hardware, and hence is cheaper than
dedicated multiprocessor systems. There is no memory access bottleneck, as there is no
single bus, and a large virtual memory space is available. Distributed shared memory
programs are portable as they use a common distributed shared memory programming
interface.

367
(Refer Slide Time: 06:04)

The disadvantages of distributed shared memory: the programmer needs to understand the
different consistency models to write correct programs.

Distributed shared memory implementations use asynchronous message passing, and hence
cannot be more efficient than message passing implementations. By yielding control to the
distributed shared memory manager, programmers cannot use their own message passing
solutions. So, it is an abstraction, and the programmer has to work with the APIs or
interfaces provided by the distributed shared memory.

368
(Refer Slide Time: 06:46)

Issues in implementing distributed shared memory software: there are several
implementation issues, which we will touch upon in more detail in these slides.

Some of the issues listed here are the following. The semantics for concurrent access must
be clearly specified. The semantics and location of replication must be chosen for
optimization, so as to reduce the delays and the number of messages needed to implement
the constructs. Whether the data is replicated or cached is a design decision. Whether
remote access is done by hardware or by software is another design aspect and a major
issue. Whether caching/replication is controlled by hardware or software is also a design
issue, and this will differ for different applications. Finally, distributed shared memory
may be controlled by the memory management software, the operating system, or language
compilers.

369
(Refer Slide Time: 08:04)

This chart, or comparison matrix, gives a comparison of early distributed shared memory
systems. The types of shared memory systems we see in this matrix are single-bus
multiprocessors and other multiprocessors, page-based distributed shared memory,
shared-variable distributed shared memory, and shared-object distributed shared memory.
These different types of systems use different kinds of caching methods: some use hardware
and some use software, and remote access likewise is done by hardware in some and realized
in software in others.

(Refer Slide Time: 08:58)

370
All these comparisons show that distributed shared memory involves a lot of system-level
intricacies, and different applications differ in how efficiently they exploit hardware
versus software and replication versus caching. These are the major issues we are going to
see, along with the different consistency models under which the distributed memory is
realized as a single monolithic memory accessed using read and write. To see how all of
that is done, we now turn to the memory consistency models, because such a model is given
to the programmer, and the programmer will use this model to write programs. How these
models are implemented we will also see in this discussion. Memory consistency model:

(Refer Slide Time: 09:58)

Memory coherence: memory coherence is the ability of the system to execute memory
operations correctly. Assume n processes, and let s i be the number of memory operations
of process P i. Also assume that all the operations issued by a process are executed
sequentially, that is, pipelining is disallowed. The figure shows sequential invocations
and responses in a distributed shared memory. In this model, one thing we have to
understand is that there is interaction between the process and the local memory manager,
whose placement we showed in the previous slide.

371
The process issues its operations as invocations on the shared memory, and each invocation
in turn makes a call to the local memory manager. The local memory manager handles these
invocations, through internal details that we are going to see, and provides the response
to the operation. The rest of the internal intricacies are hidden from the programmer,
abstracted only in the form of invocation and response. Every invocation thus leads to a
memory operation.

Now, many operations are issued simultaneously across the distributed system: each
processor has its own memory operations, many of these operations overlap or do not
overlap, and many permutations of them are possible. Which interleavings are correct and
which are not allowed depends upon the memory model, as we are going to see; and this
memory model is what the programmer uses to design correct applications or programs.

(Refer Slide Time: 12:09)

Now we look at memory coherence. Observe that, with s i memory operations at process P i,
the total number of possible interleavings is (s 1 + s 2 + ... + s n)! / (s 1! s 2! ...
s n!), since the order within each process is fixed. The memory coherence model defines
which of these interleavings are permitted: as you can see, not all permutations, that is,
not all interleavings, are allowed in the system. Some are allowed, and the

372
interleavings which are permitted are exactly the ones captured by the model. So, the
memory coherence model defines those interleavings which are permitted.

Traditionally, a read returns the value written by the most recent write. However, "most
recent write" is ambiguous in the presence of replicas and concurrent accesses. So, a
distributed shared memory consistency model is a contract between the distributed shared
memory system and the application programmer.

(Refer Slide Time: 13:09)

Different consistency models have been proposed by different researchers, and we can list
them here: the sequential consistency model by Lamport, the linearizability model, also by
Lamport, the PRAM model, slow memory, weak consistency, release consistency, causal
consistency, and so on. These consistency models are important because they give an
abstraction to the programmer, and the programmer will program against this model.

373
(Refer Slide Time: 13:48)

Let us go into more detail on these consistency models, because they are the most
important features of distributed shared memory. The first model is called strict
consistency; it is also called linearizability, or atomic consistency. The strict
consistency model says that any read to a location is required to return the value written
by the most recent write to that location, as per a global time reference. So, there are
two important points here: first, whenever a read is issued, it must return the value of
the most recent write; second, this dependency has to be defined on a global time scale,
or global time frame.

All the operations appear to be executed atomically and sequentially. All the processors
see the same ordering of events, which is equivalent to a global-time occurrence of
non-overlapping events. So, in strict consistency, the association of a read with the most
recent write, and the global time reference, are the two very important notions.

374
(Refer Slide Time: 15:43)

Now, the conditions for linearizability. More formally, a sequence Seq of invocations and
responses is linearizable if there is a permutation Seq' of adjacent pairs of corresponding
invocation and response events satisfying the following. The first condition: for every
variable v, the projection of Seq' on v, denoted Seq'_v, is such that every read returns
the most recent write that immediately preceded it. This is the condition we have seen,
that the read has to be associated with the most recent write, on a global scale or in a
global reference. The second condition says that if the response of operation 1 occurred
before the invocation of operation 2 in Seq, then operation 1 occurs before operation 2 in
Seq'; that is, on the global scale.

So, if operation 1 happened before operation 2, then the common order should reflect this
happened-before relation; this is condition number 2, and it has to be with reference to
the global time frame. Condition 1 specifies that every processor sees a common order Seq'
of events, and that in this order the semantics is that a read returns the most recent
completed write value. Condition 2 specifies that the common order must satisfy the global
time order of events, that is, the order of non-overlapping operations in Seq must be
preserved.

375
(Refer Slide Time: 17:30)

Strict consistency, or linearizability: in the example we see here, in this figure the
execution is not linearizable, because the read by P 2 returns the value 0 although the
most recent write of x is 4; it takes not the most recent value but the old value, where x
was 0. We can see that the read of x by P 2 begins after the write(x, 4), so this read
happens after the write, and yet the read is not associated with the most recent write.

Hence, this is not linearizable. This example shows that it is not linearizable; however,
it is sequentially consistent. What sequential consistency is, we will explain in a
minute, in the next slide. A permutation, that is, an ordering Seq', satisfying condition
2 above on the global time order does not exist: out of the two conditions, condition 1
and condition 2 defined earlier, the execution violates condition number 2, and hence it
is not linearizable.

376
(Refer Slide Time: 19:02)

In this example, in figure 13.5, the execution is linearizable. We can see that the read
of x returns the value of the most recent write: the value 4 is written and the read
fetches that same value, which is available after the recent write. Similarly for y: the
most recent write to y has written 2, and that value is available to the read which
follows the write. Both read operations behave this way; hence it is linearizable.

It is also sequentially consistent. It is consistent with the real-time occurrence, namely
write(y, 2), write(x, 4), read(x, 4) and read(y, 2) as the sequence, and that is why it is
linearizable: this sequence follows the global time frame, or the real-time occurrence.
Hence this permutation Seq' satisfies condition 1 and condition 2, and so it is strictly
consistent and linearizable.

377
(Refer Slide Time: 20:31)

The implementation of linearizability requires two aspects to be taken into account. The
first aspect, which we have seen, is how to associate a read with the most recent write;
the second is how to evolve a global time reference. There is no global clock in the
distributed system and also no common memory; in spite of these two absences, we have to
provide a global time frame of reference for all the events that occur. The implementation
of linearizability is therefore a challenge.

So, let us see how linearizability is implemented. As mentioned, simulating a global time
axis is expensive. Assume full replication is available, and that total order broadcast
support is also available; total order broadcast will be used here in the implementation
of linearizability. When the memory manager receives a read or write from the application,
it issues a total order broadcast of the read or write request to all the processors. It
then awaits its own request that was broadcast and performs the pending response as
follows: if it is a read, it returns the value from the local replica; if it is a write, it
writes the value to the local replica and returns an acknowledgment to the application.

Now, when the memory manager receives a total order broadcast of write(x, value) from the
network, it writes the value to the local replica of x. When the memory manager receives a
total order broadcast of read(x) from the network, it does

378
not perform any response. So, you can see that for both read and write operations a total
order broadcast is issued; the reason the total order broadcast is issued is to evolve a
global time reference in the implementation. A minimal sketch of this scheme follows
below.
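A minimal sketch of the memory-manager logic, as a toy single-threaded simulation in Python: the TotalOrderBus class stands in for a real total order broadcast service by delivering every message to all replicas immediately and in one global order, and all names are illustrative:

    class LinearizableMM:
        def __init__(self, pid):
            self.pid = pid
            self.replica = {}            # full replication of the shared variables

        def on_deliver(self, op, x, value):
            if op == "write":
                self.replica[x] = value  # every replica applies writes in the same order
            # a read carries no update, but its position in the total order fixes
            # which write it follows, giving the global time reference
            return self.replica.get(x)

    class TotalOrderBus:
        def __init__(self, managers):
            self.managers = managers

        def broadcast(self, sender, op, x, value=None):
            result = None
            for m in self.managers:      # same delivery order at every replica
                r = m.on_deliver(op, x, value)
                if m.pid == sender:      # sender's own delivery yields the response
                    result = r if op == "read" else "ack"
            return result

    mms = [LinearizableMM(i) for i in range(3)]
    bus = TotalOrderBus(mms)
    bus.broadcast(0, "write", "x", 4)
    assert bus.broadcast(1, "read", "x") == 4   # the read is ordered after the write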

(Refer Slide Time: 22:47)

For a read operation, whenever a memory manager anywhere in the system receives the total
order broadcast, it does not perform any action, as we have seen in the algorithm. Then
why is the broadcast necessary? The reason is this: if the read operations did not
participate in the total order broadcast, they would not get totally ordered with respect
to the write operations, nor with respect to the other read operations. Hence, associating
the read with the most recent write is realized because of this total order broadcast of
read as well as write operations.

379
(Refer Slide Time: 23:24)

In the example in this figure, when a write issues a total order broadcast, the message
reaches P k earlier than P j. Since it reaches P k earlier, when a read is then issued at
P k for the variable x written by the most recent write, the recently written value of x
is available.

However, at P j the read happens before the broadcast is received, because the total order
broadcast arrives at a later point in time. Hence P j reads the old value: although the
read happens after the write, it can only see the old value because the new value is not
yet available, and hence this would be a violation of linearizability. That is why the
read operations also have to participate in the total order broadcast, as I explained.

380
(Refer Slide Time: 24:36)

The next consistency model is called sequential consistency. Sequential consistency is
specified as follows: the result of any execution is the same as if all the operations of
all the processors were executed in some sequential order, and the operations of each
individual processor appear in this sequence in the local program order. So, any
interleaving of the operations from different processors is possible, but all processors
must see the same interleaving. Even if two operations from different processors do not
overlap on a global timescale, they may appear in reverse order in the common sequential
order seen by all the processors.

One thing we have to understand here is that the sequential consistency model evolves some
sequence, and that same sequence should be visible to all the processors.

381
(Refer Slide Time: 25:43)

We are now going to see how that order is evolved in sequential consistency. Here is an
implementation of the sequential consistency model, which is weaker than the
linearizability, or strict consistency, model. You can see that only the writes
participate in a total order broadcast, and the reads do not, because all consecutive
operations by the same processor are already ordered in program order, and the read
operations of different processors are independent of each other and need to be ordered
only with respect to the write operations.

(Refer Slide Time: 26:30)

382
We now show a direct simplification of the linearizability algorithm: sequential
consistency using local reads. When the memory manager at P i receives a read or write
from the application in the sequential consistency model, it handles two cases. If it is a
read, it returns the value from the local replica. For a write of a value to variable x,
it issues a total order broadcast of the write to all the processors, including itself.
When the memory manager at P i receives the total order broadcast of a write from some P j
over the network, it writes the value to the local replica; and if the broadcast is its
own, it also sends the acknowledgment to the application.

Here we see that only the writes issue the total order broadcast, and the reads do not
need to, because this is a weaker model than the strict consistency, or linearizability,
model. In a further variant, locally issued writes get acknowledged locally, reads are
delayed until the locally preceding writes have been acknowledged, and locally issued
writes are pipelined. A minimal sketch of the local-read version follows below.
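A corresponding sketch of the local-read variant, reusing the toy single-threaded broadcast idea from the linearizability sketch (again a stand-in for a real total order broadcast service, with illustrative names):

    class SeqConsistentMM:
        def __init__(self, pid, bus):
            self.pid = pid
            self.bus = bus
            self.replica = {}

        def read(self, x):
            return self.replica.get(x)        # reads are served locally, no broadcast

        def write(self, x, value):
            self.bus.broadcast(x, value)      # only writes go through the broadcast
            return "ack"

        def on_deliver(self, x, value):
            self.replica[x] = value           # applied in the same order everywhere

    class WriteBus:
        def __init__(self):
            self.managers = []

        def broadcast(self, x, value):
            for m in self.managers:
                m.on_deliver(x, value)

    bus = WriteBus()
    mms = [SeqConsistentMM(i, bus) for i in range(3)]
    bus.managers = mms
    mms[0].write("x", 4)
    assert mms[2].read("x") == 4              # all replicas saw the write in the same order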

(Refer Slide Time: 28:05)

So, that is an improvement using local writes. The next consistency model for distributed
shared memory is called causal consistency. Causal consistency is a weaker model than the
sequential consistency model.

In sequential consistency, all write operations should be seen in a common order; we have
seen that after issuing a write, a total order broadcast is performed, giving a total

383
order over the write operations. So, all the write operations are seen in a common order
in sequential consistency. For causal consistency, only causally related writes need to be
seen in a common order. The causal relation for a shared memory system is as follows: at a
processor, the local order of events is part of the causal order; and a write causally
precedes a read issued by another process if the read returns the value written by that
write. The transitive closure of the above two orders is the causal order. Total order
broadcast, used for sequential consistency, can also provide the causal order in the
shared memory.

(Refer Slide Time: 29:21)

In this example, the execution is sequentially consistent, and hence it is causally
consistent, because causal consistency is weaker than sequential consistency. Here both
P 3 and P 4 see the operations of P 1 and P 2 in a sequential order, and hence in causal
order. The operations at P 3 and P 4 are read operations on x: at P 3, the value 2 written
to x is read first and then the value 7, and this order is followed because it respects
both the sequential order and the causally related order.

Similarly, at P 4 the values 4 and 7 are read in that order. So they are causally related
as well, and hence both the sequential consistency and the causal consistency models are
satisfied in this example.

384
(Refer Slide Time: 30:44)

So, this example shows an execution that is not sequentially consistent, but is causally
consistent; that is, it demonstrates that causal consistency is the weaker model, since
it holds here while sequential consistency does not. Both P3 and P4 see the operations of
P1 and P2 in a causal order, because the lack of a causality relation between the writes
by P1 and P2 allows the values written by the two processors to be seen in different
orders in the system.

The execution is not sequentially consistent because there is no single global order
satisfying the contradictory ordering requirements of the reads and writes. Here, looking
at the read operations of P3 and of P4, you can see that at one processor the first read
returns 7 and the second read returns the value 2, while the other processor reads 4 and
then 7. Since these are different processors and the conflicting writes are not causally
related, causal consistency is still satisfied. But at one processor 7 is read before 2,
while at the other 4 is read before 7, and these views cannot be arranged into one common
sequential order of the writes.

So the ordering cannot be organized as per sequential consistency. Hence the execution is
causally consistent, but not sequentially consistent.

385
(Refer Slide Time: 32:42)

So, this example shows an execution that is not even causally consistent. We will see a
weaker model than causal consistency, called the PRAM model, under which this execution
is PRAM consistent but not causally consistent. Why is it not causally consistent? You
can see that x is written as 2 and then x is read as 7 here, while at the other processor
x is read in the opposite order, so the causal relation between the writes is violated by
this read order.

So, 4 precedes 7 in one view while the causally required order is violated in the other;
hence this execution is not causally consistent, but we will see that it satisfies the
weaker PRAM consistency.

386
(Refer Slide Time: 33:43)

So, the full form of PRAM is the pipelined RAM model, also called the processor
consistency (PC) model; it is consistency at the local level. Only the write operations
issued by the same processor are seen by the others in the order in which they were
issued, while writes from different processors may be seen by other processors in
different orders.

So, here only the ordering of writes by the same processor is preserved, unlike in causal
ordering, where the ordering of writes between different processors is also enforced if
they are causally related. Hence processor consistency, or pipelined RAM, or PRAM
consistency, is a weaker form of the causal consistency model. PRAM can be implemented
using a FIFO broadcast.

387
(Refer Slide Time: 34:45)

Another consistency model is slow memory: only the write operations issued by the same
processor to the same memory location must be seen by the others in that order. An
execution can be slow-memory consistent but not PRAM consistent; so, obviously, slow
memory is yet another, weaker consistency model.

(Refer Slide Time: 35:02)

Now, we have seen a number of consistency models, starting from the very strict model,
called the linearizability or strict consistency model. By weakening this model we
obtained the sequential

388
consistency model. Weakening it further we obtained the causal consistency model;
weakening causal consistency we obtained the PRAM (pipelined RAM) model; weakening the
PRAM model we obtained the slow memory model; and beyond slow memory there is the case
where no consistency model is assumed at all. This successive weakening induces a strict
hierarchy of memory consistency models, which is shown here in the picture.

(Refer Slide Time: 36:04)

Synchronization-based consistency models. Now we are going to see the
synchronization-based consistency models. The first one is called weak consistency:
consistency conditions apply only to special synchronization instructions, for example
barrier synchronization, while non-sync statements may be executed in any order by the
various processors. Examples are weak consistency, release consistency, and entry
consistency. In weak consistency, all writes are propagated to the other processes, and
all writes done elsewhere are brought in locally, at a sync instruction. Accesses to the
sync variables are sequentially consistent. Access to a sync variable is not permitted
until all writes elsewhere have completed. No data access is allowed until all previous
accesses to synchronization variables have been performed.

Drawback: one cannot tell whether an access is the beginning access to the shared
variables (entering the critical section) or the finishing access to the shared variables
(exiting the critical section).

389
(Refer Slide Time: 37:08)

Release consistency uses two types of synchronization variables: acquire and release.
Acquire indicates that the critical section is about to be entered; hence all writes from
the other processes should be locally reflected at this instruction. Release indicates
that access to the critical section is being completed. Acquire and release can be
defined on a subset of the variables. Lazy release consistency propagates the updates on
demand, rather than eagerly at the release.

In entry consistency, each ordinary shared variable is associated with a synchronization
variable such as a lock or a barrier.

(Refer Slide Time: 37:56)

390
Now, we are going to see a shared memory mutual exclusion algorithm given by Leslie
Lamport, called the Bakery algorithm. Lamport proposed this classical algorithm for
n-process mutual exclusion in a shared memory system. The algorithm is so called because
it mimics the actions that customers follow in a bakery store. A process wanting to enter
the critical section picks a token number that is one greater than the maximum of the
token numbers currently in the array, indexed 1 to n.

Processes then enter the critical section in increasing order of their token numbers. In
case of concurrent accesses to the token array by multiple processes, several processes
may obtain the same token number. In this case, a unique lexicographic order is defined
on the tuple (token, pid); this gives a total order and dictates the order in which the
processes enter the critical section. The algorithm for process i is given in the next
slide. The algorithm can be shown to satisfy the three requirements of the critical
section problem: mutual exclusion, bounded waiting, and progress.

(Refer Slide Time: 39:07)

This is Lamport's n-process Bakery algorithm for shared memory mutual exclusion. Here we
can see that it introduces timestamps for ordering. For mutual exclusion, the role of the
first waiting step is to wait for the other processes' timestamp choices to stabilize,
and the timestamps are then used to order the processes according to priority.
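
Since the slide itself is not reproduced in the transcript, here is a minimal sketch of
Lamport's n-process Bakery algorithm; the variable names `choosing` and `number` follow
the usual presentation and are not taken verbatim from the slide.

```python
# A minimal sketch of Lamport's Bakery algorithm for N processes.
# choosing[i] is True while process i is picking its token number;
# number[i] is the token (0 means "not interested").

N = 4
choosing = [False] * N
number = [0] * N

def lock(i):
    choosing[i] = True
    number[i] = 1 + max(number)              # pick a token one larger than any seen
    choosing[i] = False
    for j in range(N):
        while choosing[j]:                   # wait for j's token choice to stabilize
            pass
        # wait while j has higher priority: a smaller (number, pid) tuple
        while number[j] != 0 and (number[j], j) < (number[i], i):
            pass

def unlock(i):
    number[i] = 0                            # leave the critical section
```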

391
(Refer Slide Time: 39:30)

So, the highest-priority process is allowed to go into the critical section, and this
ensures mutual exclusion. Bounded waiting holds because P i can be overtaken by any other
process at most once. Progress holds because the lexicographic order is a total order:
the process with the lowest timestamp, which has the highest priority, enters the
critical section.

For space complexity, there is a lower bound of n registers; the time complexity of the
Bakery algorithm is of the order of n. Lamport's fast mutex algorithm takes O(1) time in
the absence of contention; however, it compromises by allowing unbounded waiting. It uses
two writes and two reads: write(x) followed by read(y), then write(y) followed by
read(x). This sequence is a necessary and sufficient condition to check for contention
and safely enter the critical section.
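
The following is a minimal sketch of the fast-path structure just described (Lamport's
fast mutual exclusion), in which the write(x)/read(y) and write(y)/read(x) pattern is
visible; this is an illustrative reconstruction under the usual presentation, not the
lecture's slide.

```python
# Sketch of Lamport's fast mutual exclusion algorithm for N processes.
# x and y are shared registers; b[i] are per-process contention flags.

N = 4
x, y = 0, 0            # shared registers; 0 means "no owner"
b = [False] * N        # b[i] is True while process i is contending

def lock(i):           # process ids are 1..N so that 0 can mean "free"
    global x, y
    while True:
        b[i - 1] = True
        x = i                          # write(x)
        if y != 0:                     # read(y): contention detected
            b[i - 1] = False
            while y != 0:
                pass
            continue
        y = i                          # write(y)
        if x != i:                     # read(x): someone else wrote x after us
            b[i - 1] = False
            for j in range(N):
                while b[j]:
                    pass
            if y != i:
                while y != 0:
                    pass
                continue
        return                         # fast path taken: enter critical section

def unlock(i):
    global y
    y = 0
    b[i - 1] = False
```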

392
(Refer Slide Time: 40:41)

There are a few other algorithms for this particular problem: the fast mutual exclusion
algorithm that I mentioned, two-process mutual exclusion, a modified mutual exclusion
algorithm for two processes, and the concept of wait freedom.

(Refer Slide Time: 40:58)

Conclusion: Distributed shared memory is an abstraction whereby distributed programs can
communicate through memory operations, that is, through reads and writes, as opposed to
dealing with message-passing intricacies. In this lecture we have discussed the concept
of distributed shared memory and have seen several consistency models, namely

393
linearizability, sequential consistency, causal consistency, pipelined RAM, and slow
memory. We have also discussed the fundamental problem of shared memory mutual exclusion
with the help of Lamport's Bakery algorithm. In the upcoming lecture, we will discuss the
distributed minimum spanning tree.

Thank you

394
Distributed Systems
Prof. Rajiv Misra
Department of Computer Science and Engineering
Indian Institute of Technology, Patna

Lecture – 14
Distributed Minimum Spanning Tree

(Refer Slide Time: 00:15)

In the previous lecture we discussed the basic concepts of distributed shared memory, its
consistency models, and an algorithm for shared memory mutual exclusion.

Content of this lecture: in this lecture we will discuss the distributed minimum spanning
tree and the well-known GHS algorithm, from the following reference: Gallager, Humblet,
and Spira, "A Distributed Algorithm for Minimum-Weight Spanning Trees", ACM Transactions
on Programming Languages and Systems, 1983.

395
(Refer Slide Time: 00:52)

Introduction: the distributed minimum spanning tree (MST) problem involves the
construction, by a distributed algorithm, of a spanning tree of minimum weight in a
network where the nodes communicate by passing messages. One important application of
this problem is to find a tree that can be used for broadcasting. In particular, when the
cost for a message to pass through an edge of the graph is significant, an MST minimizes
the total cost for a source process to communicate with all the other processes in the
network.

(Refer Slide Time: 01:36)

396
Preliminaries, weighted graph: this algorithm requires a weighted graph with n vertices
and m edges, where the weights assigned to the edges are non-negative and, moreover,
distinct.

(Refer Slide Time: 02:00)

So, a spanning tree is a tree induced in the graph, that is, a connected acyclic subgraph
spanning all the vertices of G. Here the red colored edges form such a tree out of the
graph shown in the previous slide.

(Refer Slide Time: 02:25)

397
Now, the spanning tree whose total weight is minimized is called the minimum spanning
tree. The weight of a tree is nothing but the sum of the weights of all the edges of that
tree, and the spanning tree whose weight is the minimum over all possible spanning trees
is called the minimum weight spanning tree.

(Refer Slide Time: 02:55)

Spanning tree construction algorithms in the classical (sequential) setting are well
known in the form of Prim's algorithm and Kruskal's algorithm. But the system model here
is a distributed system, where the nodes communicate through messages and the messages
are transmitted with unpredictable, though finite, delay.

In that model, the construction of a minimum spanning tree becomes a challenging task,
and here we are going to see an algorithm for constructing the minimum spanning tree,
that is, the minimum weight spanning tree, in a distributed system. Now let us see some
of the terminologies used in this lecture. The first of them is the spanning tree
fragment: any connected subtree of a minimum spanning tree is called a fragment.

So, here you can see that the subtree shown with the green edges is a fragment of the
spanning tree.

398
(Refer Slide Time: 04:27)

Now, a spanning tree fragment will have various edges that go out of that subtree,
connecting it to other fragments or nodes. An edge adjacent to the fragment with the
smallest weight that does not create a cycle is called the minimum weight outgoing edge.
Take this fragment, for example: out of this fragment we can see several outgoing edges,
and among these edges the one shown in red has the minimum weight.

Hence, by the definition, the edge of weight nine becomes the minimum weight outgoing
edge, popularly abbreviated as MWOE in this algorithm.
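
As a small illustration of the MWOE definition, the following hypothetical helper picks
the minimum weight outgoing edge of a fragment from a centrally known edge list; in the
distributed algorithm this quantity is of course computed cooperatively, as we will see.

```python
# Minimum weight outgoing edge (MWOE) of a fragment, computed centrally
# for illustration.  edges is a list of (weight, u, v); fragment is a set
# of node ids belonging to the fragment.

def mwoe(fragment, edges):
    outgoing = [(w, u, v) for (w, u, v) in edges
                if (u in fragment) != (v in fragment)]   # exactly one endpoint inside
    return min(outgoing) if outgoing else None

fragment = {1, 2, 3}
edges = [(4, 1, 2), (6, 2, 3), (9, 3, 5), (12, 2, 6), (15, 1, 4)]
print(mwoe(fragment, edges))   # -> (9, 3, 5): the edge of weight 9 is the MWOE
```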

399
(Refer Slide Time: 05:47)

Now, an MST fragment is a connected subtree of the minimum spanning tree; as you can see
in this example, several different fragments are possible.

(Refer Slide Time: 05:56)

Minimum spanning tree properties. MST property 1 says: given a fragment of an MST, let e
be the minimum weight outgoing edge of the fragment; then joining e and its adjacent
non-fragment node to the fragment yields another fragment of the minimum spanning tree.
Take this example, where this is a fragment with one minimum weight outgoing edge shown
as e. Here we can see from the

400
animation that joining e and its adjacent non-fragment node to the fragment yields
another fragment of the MST, which is shown here.

(Refer Slide Time: 06:52)

Now, MST property 2 says that if all the edges of a connected graph have different
weights, then the MST is unique. The animation sketches the proof: suppose there were two
different MSTs, T and T'. Let e be the minimum weight edge that is not in both of them;
without loss of generality e is in T. Adding e to T' creates a cycle, and at least one
edge e' of that cycle is not in T. Since the weight of e is less than the weight of e',
the tree T' with e added and e' removed is a spanning tree smaller than T', a
contradiction. So this animated proof shows that if all the edges of the connected graph
have different weights, then the MST is unique.

401
(Refer Slide Time: 08:06)

With these two properties, the idea of the MST algorithm is as follows: start with
fragments that are singleton nodes, as shown here.

The next step is to enlarge the fragments, in any order, by using property 1, and to
combine fragments that share a common node, which relies on property 2. By applying
properties 1 and 2 iteratively, the fragments keep growing until everything is covered by
a single fragment, and that fragment is the minimum weight spanning tree.

402
(Refer Slide Time: 09:09)

Minimum spanning tree in the message passing model: the message passing model is one of
the most commonly used models in distributed computing. In this model each process is
modeled as a node of a graph, and the communication channel between two processes is an
edge of the graph.

Two commonly used algorithms for the classical minimum spanning tree problem are Prim's
algorithm and Kruskal's algorithm; however, it is difficult to apply these two algorithms
in a distributed message passing model.

The main challenges are: both Prim's algorithm and Kruskal's algorithm process one vertex
or edge at a time, making it difficult to run them in parallel; and both require the
processes to know the state of the whole graph, which is very difficult to discover in
the message passing model. Due to these difficulties, new techniques were needed for
designing distributed algorithms for the minimum spanning tree in this problem setting,
that is, in the message passing model.

403
(Refer Slide Time: 10:17)

The GHS algorithm of Gallager, Humblet, and Spira (1983) is one of the best known
algorithms in distributed computing theory. GHS is a distributed algorithm, in the spirit
of Kruskal's algorithm, that constructs the minimum weight spanning tree in a connected
undirected graph with distinct edge weights. A processor exists at each node of the
graph, initially knowing only the weights of its adjacent edges; all processors run the
same algorithm and exchange messages with their neighbors until the tree is constructed.
This algorithm can construct the minimum spanning tree in the asynchronous message
passing model.

(Refer Slide Time: 10:58)

404
The GHS algorithm can run in synchronous as well as asynchronous modes of communication
and computation. Synchronous GHS works with the non-uniform model with distinct weights.
Its steps are: initially each node is a fragment; then, repeatedly and in parallel in
each synchronous phase, each fragment, coordinated by the fragment root node, finds its
minimum weight outgoing edge and merges with the fragment adjacent to that minimum weight
outgoing edge, until there is only one fragment.

Asynchronous GHS simulates the synchronous version and works with both the uniform and
non-uniform models, with distinct weights. In asynchronous GHS, every fragment F has a
level L(F) ≥ 0; at the beginning each node is a fragment of level 0, and there are two
types of merges: absorption and join.

(Refer Slide Time: 12:07)

Overview: the input graph is considered to be the network, and the links are given
weights as in the classical problem. Edges represent the communication links. At the
beginning of the algorithm, nodes know only the weights of the links connected to them.
As the output of the algorithm, every node knows which of its links belong to the minimum
spanning tree and which do not.

405
(Refer Slide Time: 12:37)

Preconditions: the algorithm runs on a connected undirected graph; distinct finite
weights are assigned to each edge; each node initially knows the weight of each edge
incident to it. Initially each node is in a quiescent state, and it either awakens
spontaneously or is awakened by the receipt of any message from another node. Messages
can be transmitted independently in both directions on an edge and arrive after an
unpredictable but finite delay, without error. Each edge delivers messages in FIFO order.

(Refer Slide Time: 13:20)

406
The idea of the distributed MST, GHS notations: first, fragments; every node starts as a
single-node fragment.

(Refer Slide Time: 13:25)

(Refer Slide Time: 13:32)

Each fragment finds its minimum outgoing edge and then tries to combine with the adjacent
fragment. Every fragment has an associated level, which has an impact on combining
fragments; a fragment with a single node is defined to be at level 0. The combination of
two fragments depends on the levels of the fragments.

407
(Refer Slide Time: 13:42)

So, if a fragment F wishes to connect to a fragment F' (this is F and this is F'), and
the level of fragment F satisfies L < L', then F is absorbed into F' and the resulting
fragment is at level L', as shown here. So, the resulting fragment will be F' and its
level will remain the same as the level of F', that is L' = 2.

(Refer Slide Time: 14:28)

Now, if the fragments F and F' have the same minimum outgoing edge and L = L', that is,
their levels are the same, then the fragments combine (join) into a new fragment F'' and
the level

408
of this new fragment will be one more than the level L of the earlier fragments; so it
becomes L + 1.
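
The two combination rules just described can be summarized by the following small sketch;
the fragment record fields and the function name are illustrative, not part of the GHS
message format.

```python
# Sketch of the two GHS combination rules for fragments connecting over an
# edge e: absorption when the levels differ, join when they are equal and
# both fragments chose the same minimum weight outgoing edge.

def combine(F, F_prime, e):
    # F and F_prime are dicts: {"level": int, "id": edge or None, "mwoe": edge}
    if F["level"] < F_prime["level"]:
        # Absorption: F is absorbed into F'; level and identity are unchanged.
        return {"level": F_prime["level"], "id": F_prime["id"], "mwoe": None}
    if F["level"] == F_prime["level"] and F["mwoe"] == F_prime["mwoe"] == e:
        # Join: a new fragment F'' at level L+1 whose core (and id) is e.
        return {"level": F["level"] + 1, "id": e, "mwoe": None}
    return None   # otherwise the connect request stays pending (is delayed)

F  = {"level": 1, "id": (3, 1, 2), "mwoe": (9, 2, 5)}
Fp = {"level": 1, "id": (4, 5, 6), "mwoe": (9, 2, 5)}
print(combine(F, Fp, (9, 2, 5)))   # join -> level 2 fragment with core (9, 2, 5)
```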

(Refer Slide Time: 15:07)

Now, if the fragments F and F' with the same level were combined, then the combining edge
is called the core of the new fragment.

So, here you can see the core: this edge is the core because it was instrumental in
combining the two fragments F and F', and it becomes the core of the new fragment.

(Refer Slide Time: 15:29)

409
The next thing in the GHS notation is the node states. Each node has a state, and there
are three states a node can be in at any instant of time: sleeping, the initial state;
find, during the fragment's search for a minimal outgoing edge; and found, otherwise,
i.e., when a minimal outgoing edge has been found. So there are three different node
states: sleeping, find, and found.

(Refer Slide Time: 16:06)

Description of the algorithm: the GHS algorithm assigns a level to each fragment, which
is a non-decreasing integer with initial value 0. Each nonzero-level fragment has an ID,
which is the ID of the core edge of the fragment, selected when the fragment is
constructed. During the execution of the algorithm, each node classifies each of its
incident edges into three categories. The first category, the branch edges, are those
that have already been determined to be part of the MST.

The second type, the rejected edges, are those that have already been determined not to
be part of the MST. Third, the basic edges are those that are neither branch nor rejected
edges.

410
(Refer Slide Time: 16:53)

For level 0 fragments, each awakened node will do the following: first, choose its
minimum weight incident edge and mark that edge as a branch edge; then send a message via
that branch to notify the node on the other side; and wait for a message from the other
end of the edge.

(Refer Slide Time: 17:26)

The edge chosen by both of the nodes it connects becomes the core, with level 1. For
nonzero-level fragments, the execution of the algorithm can be separated into three
stages at each level. The first is broadcast: the two nodes adjacent to the core
broadcast messages to

411
the rest of the nodes in the fragment; the messages are sent via the branch edges, not
via the core. Each broadcast message contains the ID and the level of the fragment, and
at the end of this stage each node has received the new fragment ID and level.

(Refer Slide Time: 17:58)

Convergecast: at this stage all the nodes in the fragment cooperate to find the minimum
weight outgoing edge of the fragment; outgoing edges are the edges connecting to other
fragments. The messages sent in this stage travel in the direction opposite to the
broadcast stage, initiated by all the leaves, i.e., the nodes that have only one branch
edge; a message is sent through that branch edge.

The message contains the minimum weight of the incident outgoing edges the leaf found.
For each non-leaf node, let the number of its branch edges be n; after receiving n-1
convergecast messages it picks the minimum weight from those messages, compares it with
the weights of its own incident outgoing edges, and sends the smallest weight towards the
branch it received the broadcast from.
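
A minimal sketch of this convergecast computation at a single non-leaf node, with
hypothetical local state; in the real algorithm these values arrive as report messages
over branch edges.

```python
# Convergecast step at a non-leaf node: combine the minimum weights reported
# by the children with the node's own incident outgoing edges, and forward
# the smallest one towards the core.

def convergecast_step(child_reports, own_outgoing_weights):
    # child_reports: min outgoing-edge weights received over n-1 branch edges
    # own_outgoing_weights: weights of this node's own candidate outgoing edges
    candidates = list(child_reports) + list(own_outgoing_weights)
    return min(candidates) if candidates else float("inf")   # inf = no outgoing edge

# Example: two children reported 9 and 12, and the node itself has edges 11 and 15.
print(convergecast_step([9, 12], [11, 15]))   # -> 9 is sent towards the core
```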

412
(Refer Slide Time: 18:55)

Change core: after the completion of the previous stage, the two nodes connected by the
core inform each other of the best edge they received, and can then identify the minimum
outgoing edge of the entire fragment. A message is sent from the core to the minimum
outgoing edge along a path of branch edges. Finally, a connect message is sent out via
the chosen outgoing edge to request combining the two fragments that the edge connects.
So, that was a brief overview of the algorithm.

(Refer Slide Time: 19:29)

413
Now, let us see how these steps are incorporated in the execution of the algorithm. In
this execution example, the discovery of the minimum outgoing edge of a fragment is
shown, starting with the special case of level 0 fragments, which are initially sleeping.
When a node awakens from the sleeping state, it finds its minimum weight incident edge,
marks it as a branch of the MST, sends a connect message over this edge, and goes into
the found state.

So, we have seen that the node makes a transition among the three states of the process,
from sleeping to find or found, and also sends a connect message over the minimum weight
outgoing edge, which is shown in red here. Now this connect message will try to combine
the fragments.

(Refer Slide Time: 21:09)

So, take a fragment at level L that has just been combined out of two level L-1
fragments; the weight of the core is the identity of the fragment, and the core acts as
the root of the fragment tree.

414
(Refer Slide Time: 21:22)

So, the nodes adjacent to the core send an initiate message towards the borders, relayed
by the intermediate nodes in the fragment; it puts each node into the find state.

(Refer Slide Time: 21:37)

So, the basic edges, which are yet to be classified, can be either inside the fragment or
outgoing edges.

415
(Refer Slide Time: 21:47)

A rejected edge will always be inside the fragment, and a branch is an MST edge.

(Refer Slide Time: 21:52)

416
(Refer Slide Time: 21:56)

So, on receiving the initiate message, a node tries to find its minimum outgoing edge; it
sends a test message on its basic edges, minimal weight first.

(Refer Slide Time: 22:11)

On receiving the test message, in case of the same identity a reject message is sent and
the edge is rejected; the same identity means it is in the same fragment, and within the
same fragment the connection would lead to a cycle.

417
(Refer Slide Time: 21:51)

In case the test was sent in both directions, the edge is rejected automatically without
a reject message. In case the tested node has a lower level, it delays the response until
its identity (level) rises sufficiently.

(Refer Slide Time: 23:03)

In case of an accept message, the edge is accepted as a candidate.

418
(Refer Slide Time: 23:11)

So, the nodes send report messages along the branches of the MST; if no outgoing edge was
found, the algorithm is complete. After sending, they go into the found state.

(Refer Slide Time: 23:25)

Every leaf sends its report once it has resolved its outgoing edge.

419
(Refer Slide Time: 23:33)

And its children send theirs.

(Refer Slide Time: 23:37)

Every node remembers the branch leading to the minimal outgoing edge of its subtree,
denoted the best edge.

420
(Refer Slide Time: 23:47)

The core-adjacent nodes exchange reports and decide on the minimal outgoing edge.

(Refer Slide Time: 23:53)

Once decided, a change-core message is sent over the branches towards the minimal
outgoing edge, and the tree branches are re-pointed towards the new core. Finally, a
connect message is sent over the minimal edge.

421
(Refer Slide Time: 24:08)

So, when connecting fragments of the same level, both core-adjacent nodes send a connect
message, which causes the level to be increased; as a result the core is changed and new
initiate messages are sent.

(Refer Slide Time: 24:29)

When a lower level fragment F' at a node n' joins some fragment F at a node n before n
has sent its report, we can send n' an initiate message with the state find, so it joins
the search. When the lower level fragment F' at a node n' joins some fragment F at a node
n

422
after n has sent its report, it means that n has already found a lower edge; therefore,
we can send n' an initiate message with the state found, so it does not join the search.

(Refer Slide Time: 25:07)

Forwarding the initiate message at level L: when forwarding an initiate message to the
leaves, it is also forwarded to any pending fragments at level L-1, as they might be
delaying their response.

(Refer Slide Time: 25:21)

423
(Refer Slide Time: 25:34)

So, we can see that log N is an upper bound on the fragment levels, and connect messages
are sent on the minimal outgoing

(Refer Slide Time: 25:39)

edges only; hence there is no deadlock.

424
(Refer Slide Time: 25:46)

(Refer Slide Time: 25:52)

For the communication complexity: at every level but the zeroth and the last, each node
can receive at most one initiate and one accept message, and it can transmit at most one
test (with its reject), one report, and one change-root or connect message. Since the
number of levels is bounded by log N, the number of such messages is at most
5N(log N - 1).

425
(Refer Slide Time: 26:19)

At level 0, each node receives at most one initiate and transmits at most one connect; at
the last level, a node can send at most one report message. As a result there are at most
3N such messages.

(Refer Slide Time: 26:33)

So, as a result the upper bound on the number of messages is 5N log2 N + 2E. The time
complexity, under the assumption that all nodes are awakened initially, is O(N log N).

426
(Refer Slide Time: 26:41)

(Refer Slide Time: 26:51)

Conclusion: distributed minimum spanning tree algorithms are useful in communication
networks when one wishes to broadcast information from one node to all other nodes and
there is a cost associated with each channel of the network. In addition to the broadcast
application, there are many potential control problems for the network whose
communication complexities are reduced by having such a spanning tree available
(Refer Time: 27:13).

427
Distributed Systems
Prof. Rajiv Misra
Department of Computer Science and Engineering
Indian Institute of Technology, Patna

Lecture - 15
Termination Detection

Lecture 15: Termination Detection. Preface: recap of the previous lecture.

(Refer Slide Time: 00:19)

In the previous lecture we discussed the basic concepts of distributed minimum spanning
trees and also the GHS algorithm. Content of this lecture: in this lecture we will
discuss 'Termination Detection' and a set of representative algorithms for termination
detection, based on the concepts of snapshots, weight throwing, and spanning trees; these
are the different algorithms we are going to see in this lecture.

428
(Refer Slide Time: 00:55)

Introduction: the termination detection problem was brought to prominence in 1980 by
Francez and by Dijkstra and Scholten. It is a fundamental problem: to determine whether a
distributed computation has terminated, where the distributed computation consists of the
cooperation of many processors and their communication channels.

A distributed application that runs on different processors needs to know when the entire
computation is completed; this is a fundamental problem as far as distributed computation
is concerned, and the task of termination detection is non-trivial.

This is because there does not exist a global time, there does not exist a global state,
and no process has complete knowledge of the global state. In this setting, finding out
whether a distributed application has terminated is a non-trivial task. A distributed
computation is globally terminated if every process is locally terminated and there is no
message in transit between any processes; here, the locally terminated state is a state
in which a process has finished its execution and will not restart any action unless it
receives a message. In the termination detection problem, a particular process must infer
when the underlying computation has terminated, and this has wide applications in
distributed systems.

429
(Refer Slide Time: 03:03)

A termination detection algorithm is used for this purpose. The messages used in the
underlying computation are called basic messages, and the messages used for the purpose
of termination detection are called control messages.

So, in a termination detection algorithm we use two different types of messages: the
normal computational messages, called 'basic messages', and the messages that lead to
termination detection, called the 'control messages'.

A termination detection algorithm must ensure the following: first, the execution of the
termination detection algorithm cannot indefinitely delay the underlying computation;
second, the termination detection algorithm must not require additional new communication
channels between the processes. That is, how the termination detection algorithm works
and detects termination without disturbing the underlying computation is an important
issue in designing such algorithms.

430
(Refer Slide Time: 04:22)

Now, the system model assumed in our discussion for designing the termination detection
algorithm: at any given time a process can be in one of two states, active or idle.
Active means it is doing local computation, and idle means the process has temporarily
finished the execution of its local computation and will be reactivated only on the
receipt of a message from another process. These states are not fixed; they keep
changing.

An active process can become idle at any point of time, and an idle process can become
active only on the receipt of a message from another process; only an active process can
send messages. A message can be received by a process when the process is either active
or idle, and on the receipt of a message an idle process becomes active, as shown here.

The sending of a message and the receipt of a message occur as atomic actions.
431
(Refer Slide Time: 05:48)

Now we define termination formally. Let pi(t) denote the state (active or idle) of
process i at time instant t, and let ci,j(t) denote the number of messages in transit in
the channel from pi to pj at instant t.

A distributed computation is said to be terminated at a time instant t0 if and only if
(for all i: pi(t0) = idle) and (for all i, j: ci,j(t0) = 0); that is, every channel is
empty. Thus, the distributed computation is terminated if and only if all the processes
have become idle and there is no message in transit in any communication channel; this
global state is defined as the terminated state.

432
(Refer Slide Time: 07:06)

Termination detection algorithms: there are various termination detection algorithms
available in the literature. These algorithms differ in the topology they assume, in how
they handle the basic and control message flows, and in how termination is detected
without disturbing the basic computation.

Dijkstra's ring-based termination detection algorithm was given first, in 1983; after
that Topor gave a spanning-tree based termination detection algorithm for distributed
computations, which we are going to study.

So, there are different kinds of algorithms, and Huang has also given a termination
detection algorithm based on weight throwing, which we are going to cover in this part of
the lecture.

433
(Refer Slide Time: 08:03)

First we study the termination detection algorithm using distributed snapshots, by Huang
(1989). The algorithm assumes that there is a logical bidirectional communication channel
between every pair of processes. The communication channels are reliable but non-FIFO,
and message delays are arbitrary but finite. The main idea of this algorithm is that when
a process goes from active to idle, it issues a request to all the other processes to
take a local snapshot, and also requests itself to take a local snapshot.

When a process receives this request, if it agrees that the requester became idle before
it did itself, then it grants the request by taking a local snapshot for the request. A
request is successful if all the processes have taken a local snapshot for it; the
requester or an external agent may collect all the local snapshots of the request.

If a request is successful, a global snapshot for the request can thus be obtained, and
the recorded global state will indicate the termination of the computation.

434
(Refer Slide Time: 09:15)

The formal description of this algorithm goes like this. Every process i maintains a
logical clock, denoted x, initialized to 0 at the start of the computation; a process
increments its x by 1 each time it becomes idle. A basic message sent by a process at its
logical time x is represented by B(x). A control message that requests the processes to
take a local snapshot, issued by process i at logical time x, is represented by R(x, i).
Each process synchronizes its logical clock x loosely with the logical clocks of the
other processes, in such a way that it is the maximum of the clock values ever received
or sent in any message.

A process also maintains a variable k such that, when the process is idle, (x, k) is the
maximum of the values (x, k) on all the messages R(x, k) ever received or sent by it.
Logical times are compared as follows: (x, k) > (x', k') iff (x > x') or ((x = x') and
(k > k')); that is, if there is a tie between x and x', the process ids are used to break
the tie in this comparison.

435
(Refer Slide Time: 10:57)

Now, the algorithm has four rules. They are guarded actions, and at any particular time
whichever rule's guard is satisfied is activated.

Rule 1 says that when a process i is active, it may send a basic message to a process j
at any time by sending B(x) to j. Rule 2: on receiving a basic message B(x'), process i
updates its clock to the maximum of x and x'; and if process i was idle, then on
receiving this message it becomes active.

Rule 3 says that when process i goes idle, it does the following: it increments its
clock, sends the control message R(x, k) with k = i to all other processes, and also
takes a local snapshot for the request R(x, k). Rule 4 says that when a control
(snapshot-request) message R(x', k') is received at process i, it checks whether the
timestamp of the incoming message is larger than its own (x, k) and whether process i is
idle; if so, it updates its clock and its (x, k) pair and takes a local snapshot for the
request R(x', k').

Now, if the clock value carried inside the message has a smaller value and i is idle,
then it does nothing, because it has already taken a snapshot for a more recent request,
so nothing has to be done.

436
The third case in Rule 4 says that if i is active on receipt of this message, i.e., it
has not terminated, then it only updates its clock, because it is still active and not
terminated. So, the last process to terminate will have the largest clock value;
therefore, every process will take a snapshot for it, but it will not take a snapshot for
any other process.
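
The four rules can be summarized in the following sketch of the per-process state
machine; the method names and the broadcast_control / take_local_snapshot hooks are
assumptions for illustration, not part of the original presentation.

```python
# Sketch of Huang's snapshot-based termination detection, one object per process.

class SnapshotTDProcess:
    def __init__(self, pid, n):
        self.pid, self.n = pid, n
        self.active = False
        self.x, self.k = 0, pid            # logical clock and id of latest request

    def send_basic(self, net, j):          # Rule 1: only an active process sends B(x)
        assert self.active
        net.send_basic(j, self.x)

    def on_basic(self, x_msg):             # Rule 2: receive B(x'), become active
        self.x = max(self.x, x_msg)
        self.active = True

    def go_idle(self, net):                # Rule 3: increment clock, request snapshots
        self.active = False
        self.x += 1
        self.k = self.pid
        net.broadcast_control(self.x, self.k)      # R(x, k) to everybody
        self.take_local_snapshot(self.x, self.k)

    def on_control(self, x_msg, k_msg):    # Rule 4: maybe grant the snapshot request
        if (x_msg, k_msg) > (self.x, self.k) and not self.active:
            self.x, self.k = x_msg, k_msg
            self.take_local_snapshot(x_msg, k_msg)
        elif self.active:
            self.x = max(self.x, x_msg)    # still active: only update the clock
        # else: an older request received while idle -- ignore it

    def take_local_snapshot(self, x, k):
        pass                               # record local state for request R(x, k)
```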

(Refer Slide Time: 14:43)

The second algorithm for termination detection is known as the weight throwing algorithm
and was given by Huang in 1989.

We will first understand the system model used in this algorithm. A process called the
controlling agent monitors the computation. Say this is the computation model and this is
the controlling process: it has a connection with all the other processes, and this
controlling agent monitors the computation.

A communication channel exists between each process and the controlling agent, and also
between every pair of processes, as we have seen in this figure. Initially all the
processes are in the idle state, the weight at each process is 0, and the weight at the
controlling agent is 1. The computation starts when the controlling agent sends a basic
message to one of the processes. A non-zero weight W, between 0 and 1, is assigned to
each process in the active state and to each message in transit, in the following manner.

437
(Refer Slide Time: 16:05)

When a process sends a message, it sends a part of its weight in the message; when a
process receives a message, it adds the weight received in the message to its own weight.
For example, if B(W1) is carried on the message, then the sender's weight W is updated to
W2, where W1 + W2 equals the original W.

So, W was split into two parts: W2 is retained as the sender's current weight and W1 is
sent. The process receiving the message adds the weight in the message to its own weight;
thus the sum of the weights on all the processes and on all the messages in transit is
always 1. That is, if we sum the weights of all the processes and the weights on all the
channels, the total is 1. When a process becomes passive, it sends its weight to the
controlling agent in a control message, denoted C(W), and this weight is thereby returned
to the controlling agent.

The controlling agent adds the received weight to its own weight. If the controlling
agent's weight becomes 1, then it concludes that the computation has terminated.

438
(Refer Slide Time: 18:19)

Notations: the weight on the controlling agent and on a process is represented by W;
B(DW) is a basic message carrying a weight DW, and C(DW) is a control message carrying a
weight DW from a process to the controlling agent.

(Refer Slide Time: 18:34)

This is the algorithm, defined by four rules. Rule 1 says that the controlling agent or
an active process may send a basic message to one of the processes, say P, by splitting
its weight W into W1 and W2, both non-zero; it then assigns its own weight W := W1 and
sends the basic message carrying the

439
other part of the weight, B(DW := W2), to P. Rule 2: on receipt of this message B(DW),
the process P adds DW to its weight; if the receiving process was in the idle state, it
becomes active.

Rule 3 says that a process switches from the active state to the idle state at any time
by sending the control message C(DW := W) to the controlling agent and setting its own
weight W := 0; that means it returns its weight to the controlling agent and keeps weight
0.

Rule 4: on receipt of such a message C(DW), as described in Rule 3, the controlling agent
adds the DW coming from the various processes to its weight. If, after receiving the
weights from the processes, its weight becomes equal to 1, then the controlling agent
concludes that the computation has terminated.
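
A minimal sketch of these four rules follows; the class and method names and the choice
of splitting the weight in half are illustrative assumptions, not part of Huang's
presentation.

```python
# Sketch of Huang's weight-throwing termination detection.

class ControllingAgent:
    def __init__(self):
        self.weight = 1.0

    def on_control(self, dw):              # Rule 4: absorb returned weight
        self.weight += dw
        if abs(self.weight - 1.0) < 1e-12:
            print("termination detected")

class WTProcess:
    def __init__(self, agent):
        self.agent = agent
        self.weight = 0.0
        self.active = False

    def send_basic(self, other):           # Rule 1: split weight, send one part
        assert self.active and self.weight > 0
        dw = self.weight / 2
        self.weight -= dw
        other.on_basic(dw)

    def on_basic(self, dw):                # Rule 2: add received weight, activate
        self.weight += dw
        self.active = True

    def go_idle(self):                     # Rule 3: return all weight to the agent
        self.active = False
        dw, self.weight = self.weight, 0.0
        self.agent.on_control(dw)

# Tiny usage example with one process started by the controlling agent.
agent = ControllingAgent()
p = WTProcess(agent)
agent.weight -= 0.5        # the agent sends a basic message carrying weight 0.5
p.on_basic(0.5)
p.go_idle()                # the weight flows back; the agent's weight returns to 1.0
```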

(Refer Slide Time: 20:19)

Now, the correctness of this algorithm. Let A be the set of all active processes, B the
set of weights on all basic messages in transit, C the set of weights on all control
messages in transit, and Wc the weight on the controlling agent. Two invariants, I1 and
I2, are defined for the algorithm.

Invariant I1 says that the weight on the controlling agent, plus the sum of the weights
on all the active processes, on all the basic messages in transit, and on all the control
messages in transit, is equal to 1; that means the entire

440
weight in the system is equal to 1: Wc + Σ_{W ∈ (A∪B∪C)} W = 1. The second invariant, I2,
says that for all W ∈ (A∪B∪C), W > 0.

(Refer Slide Time: 21:19)

So, invariant I1 states that the sum of the weights at the controlling agent, at all the
active processes, on all the basic messages in transit, and on all the control messages
in transit is always equal to 1. Invariant I2 states that the weight at each active
process, on each basic message in transit, and on each control message in transit is
non-zero.

Hence, if the weight on the controlling agent equals 1, then by invariant I1 the
remaining part satisfies Σ_{W ∈ (A∪B∪C)} W = 0, because the total is equal to 1.

By I2 this implies that A∪B∪C = φ, because all those weights are non-zero. In particular,
the communication channels are empty and A∪B = φ, which means that the computation has
terminated; therefore, the algorithm never detects a false termination. Conversely, once
A∪B∪C = φ we have Wc = 1 by I1; since message delays are finite, after the computation
has terminated, eventually Wc becomes equal to 1.

Thus the algorithm detects termination in a finite amount of time, meaning that the
control messages carrying the weights from the terminated processes eventually reach the
controlling agent, and the sum of all the weights becomes 1 within a finite time.

441
(Refer Slide Time: 23:18)

The third algorithm is the spanning-tree-based termination detection algorithm. In this
algorithm we consider N processes Pi, modeled as the nodes of a fixed connected
undirected graph. The edges of the graph represent the communication channels. The
algorithm uses a fixed spanning tree of the graph with process P0 at the root, which is
responsible for termination detection. Process P0 communicates with the other processes
to determine their states through signals.

All the leaf nodes report to their parents when they have terminated (this is the inward
flow of signals); a parent node similarly reports to its parent when it has completed its
processing and all of its immediate children have terminated, and so on. The root
concludes that termination has occurred if it has terminated and all of its immediate
children have terminated.

442
(Refer Slide Time: 24:23)

Now, two waves of signals are generated, one moving inwards and one moving outwards.
Moving inwards means: this is a tree structure, these are the leaves, and when the leaves
terminate they send messages towards the root, and this direction is called inwards.
Moving outwards means the root sends messages down towards the leaves. Initially a
contracting wave of signals, called tokens, moves inwards from the leaves to the root.

Now, if this token wave reaches the root without discovering that termination has
occurred, the root initiates a second, outward wave of repeat (request) signals. As this
repeat wave reaches the leaves, the token wave gradually forms again and starts moving
inwards. This sequence of events is repeated until termination is detected.

443
(Refer Slide Time: 25:44)

We will first understand a simple algorithm, then see its problems, and using those
problems we will arrive at the final algorithm.

The simple strategy is that initially the leaf nodes are given tokens, say t1 and t2.
Each leaf process, after terminating, sends its token to its parent; when the parent
process has terminated and has received a token from each of its children, it sends a
token to its own parent. In this way each process indicates to its parent that the
subtree below it is idle. In a similar manner the tokens get propagated to the root; when
the root holds the tokens, has become idle, and has received a token from each of its
children, it concludes that the entire computation has terminated.

444
(Refer Slide Time: 26:56)

Now, there is a problem with this algorithm: the simple algorithm fails in the situation
where a process, after it has sent its token to its parent (indicating that it is idle),
again receives a message from some other process.

Once it has received such a message, it goes from the idle state back to the active
state, but its parent still believes that the process is idle, which can cause a false
detection of termination; this is shown in figure 15.1.

(Refer Slide Time: 27:36)

445
In figure 15.1 you can see this situation: after node 1 has given its token to its
parent, node 5 sends a message back to node 1. So node 1 goes from idle back to active,
but its parent believes that node 1 is idle, which creates the contradiction.

(Refer Slide Time: 28:17)

This kind of problem is corrected in the final algorithm, given by Topor in 1984. The
main idea is to color the processes and the tokens, and to change the color when messages
such as the one shown in figure 15.1 are involved. The algorithm works as follows:
initially each leaf is provided with a token; the set S is used for bookkeeping, to know
which processes have a token, hence initially S is the set of all leaves in the tree.
Initially, all processes and tokens are colored white. When a leaf terminates, it sends
the token it holds to its parent process.

A parent process collects the tokens sent by each of its children; after it has received
a token from all of its children and has itself terminated, the parent process sends a
token to its own parent. A process turns black when it sends a message to some other
process. When a process terminates, if its color is black it sends a black token to its
parent; a black process turns white after it has sent a black token to its parent.
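
A compact sketch of this colored-token idea on a tree given as a parent map; the data
structures and function names are chosen for illustration and do not follow Topor's
original notation exactly.

```python
# Sketch of the spanning-tree termination detection with colored tokens.
# parent[i] is the parent of node i in the spanning tree; parent[root] is None.

WHITE, BLACK = "white", "black"

class TreeNode:
    def __init__(self, nid, children):
        self.nid = nid
        self.children = set(children)
        self.active = True
        self.color = WHITE
        self.tokens = {}          # child id -> color of the token received

    def send_message(self, other):
        # Sending a basic message turns the sender black (rule from the lecture).
        self.color = BLACK
        other.active = True       # an idle receiver becomes active again

    def terminate(self, nodes, root_result):
        # Called when this node is idle and holds tokens from all its children.
        token = BLACK if self.color == BLACK or BLACK in self.tokens.values() else WHITE
        self.color = WHITE        # a black node turns white after sending its token
        p = parent[self.nid]
        if p is None:
            root_result.append(token)           # at the root: white => terminated
        else:
            nodes[p].tokens[self.nid] = token

# Example tree: 0 is the root with children 1 and 2; 1 has children 3 and 4.
parent = {0: None, 1: 0, 2: 0, 3: 1, 4: 1}
nodes = {0: TreeNode(0, [1, 2]), 1: TreeNode(1, [3, 4]),
         2: TreeNode(2, []), 3: TreeNode(3, []), 4: TreeNode(4, [])}
result = []
for nid in (3, 4, 2, 1, 0):       # leaves terminate first, then inner nodes, then root
    nodes[nid].terminate(nodes, result)
print(result)                      # ['white']: no basic message was sent, so terminated
```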

446
(Refer Slide Time: 29:31)

Let us understand the entire algorithm through this example. As we have seen in the
algorithm, all the leaf nodes are given tokens initially (tokens to the leaves); when
these leaves finish their computation, they send their tokens to their parents.

(Refer Slide Time: 30:11)

So, here you can see that this node and this leaf have finished their computation and
have sent their tokens to their parents, so they become idle.

447
The set S, which initially contained 3, 4, 5, 6 because those nodes held tokens, now
contains process 1, which holds their tokens, together with 5 and 6, which still hold
theirs; nodes 3 and 4 have become idle as far as the working of the algorithm is
concerned.

(Refer Slide Time: 30:43)

Now node 1 has received the tokens from both of its children, so it sends its token t1 to
its parent. The parent now knows that the underlying subtree has terminated, that is, it
is idle, and the set S becomes {node 0, node 5, node 6}, which the algorithm continues to
monitor.
448
(Refer Slide Time: 31:12)

Meanwhile node 5 generates a message for node 1; on receiving this message, node 1 goes
from the idle state back to the active state, although its token already indicated to the
parent that node 1 is idle.

So, the node that sends this message, node 5, becomes black; and once this black node
finishes its execution and becomes idle, it sends a black token to its parent and then
turns white.

(Refer Slide Time: 32:01)

449
So, here node 5 becomes white, but the token it has sent, t5, is a black token; similarly
node 6 has sent its token to its parent. Now the set S contains nodes 0 and 2 in this
case.

(Refer Slide Time: 32:23)

Now the black token finally reaches node 0 and node 2 becomes idle, so the set S contains
only node 0. Node 0, the root, understands on receiving the black token that there has
been some message communication, hence the distributed computation may not be terminated.
So, on receiving this black token, node 0 (the root) again sends tokens outwards; this
outward wave propagates the tokens down to the leaves and restarts the entire process.

In the second round, if the computation has terminated without any further message
exchange, the root gets all white tokens back, and this indicates the terminated state.
If in the second round it again receives a black token, the process iterates until the
root receives only white tokens, and that is the indication that the computation has
terminated.

450
(Refer Slide Time: 34:02)

Now, the performance: the best case message complexity of this algorithm is O(N), where N
is the number of processes in the computation; this occurs when all the computation
messages are sent in the first round. The worst case complexity of this algorithm is
O(N*M), where M is the number of computation messages exchanged: if in each round only
one more message is exchanged, the wave has to be repeated 2 times, 3 times, and so on,
so the number of times this algorithm iterates outwards and inwards depends upon the
number of messages.

(Refer Slide Time: 34:45)

451
Conclusion: determining whether a distributed computation has terminated is a fundamental
problem in distributed systems. The detection of termination of a distributed computation
is a non-trivial task, since no process has complete knowledge of the global state and
the distributed system does not have a common clock. A number of algorithms have been
developed to detect the termination of a distributed computation, because it is used in
many applications; the representative algorithms we have seen here are based on the
concepts of snapshot collection, weight throwing, and spanning trees.

So, in this lecture we have described a set of representative termination detection
algorithms. In the upcoming lecture we will discuss message ordering and group
communication.

452
Distributed Systems
Prof. Rajiv Misra
Department of Computer Science and Engineering
Indian Institute of Technology, Patna

Lecture – 16
Message Ordering and Group Communication

Lecture 16 Message Ordering and Group Communication.

(Refer Slide Time: 00:21)

Preface: recap of the previous lecture. In the previous lecture we discussed 'Termination
Detection' and a set of representative termination detection algorithms based on the
concepts of snapshot collection, weight throwing, and spanning trees. Content of this
lecture: in this lecture we will discuss message ordering, group communication, and
multicast, that is, multicasting.

453
(Refer Slide Time: 00:49)

Introduction: at the core of distributed computing is communication by message passing
among the processes participating in the application. In this lecture we will study
several message ordering paradigms for communication, such as asynchronous, synchronous,
FIFO, causally ordered, and non-FIFO; these orders form a hierarchy, and we will examine
a few algorithms to implement these orderings.

Group communication is an important aspect of communication in a distributed system.
Causal order and total order are popular forms of ordering when doing group multicasts
and broadcasts, and algorithms to implement these orderings will also be discussed.
Before we go ahead, let me give you a few examples where message ordering and
multicasting are useful. As you know, in a distributed database the data is normally
replicated at more than one site.

Now, whenever there is an update to the database, all the replicas have to be updated.
So, if two messages are sent for updates at different points of time, the updates have to
be applied in the same order in which the two messages were sent, irrespective of how
much delay the messages incur. How this is done is the subject of message ordering;
another form of communication in a group is called multicasting.

454
Multicasting comes in two flavors: one is called closed group and the other is called
open group. An open group is like an online railway reservation system, where any
customer can enter the system at any point of time and perform operations; when a client,
an external entity, uses a set of nodes for its application, it is called an open system,
and open-group communication is done through multicasting.

If a large number of users are allowed in such a reservation system, it is not an easy
task to ensure correctness while running the application. This will be discussed under
group communication and multicasting: how this paradigm is implemented in a message
passing system.

(Refer Slide Time: 03:43)

That is what is called a distributed system. There are a few notations; some of these
notations have been explained earlier, so we skip them.

455
(Refer Slide Time: 03:55)

Now, message ordering paradigms. The order of messages in a distributed system is an
important aspect of system executions because it determines the messaging behavior that
can be expected by the distributed program. Distributed program logic greatly depends on
this order of delivery; to simplify the programmer's task, programming languages in
conjunction with the middleware provide well-defined message delivery behavior.

So, to reiterate the same point, the order of delivery of messages is most important as
far as distributed applications are concerned; the required ordering comes from the
application, and the implementation is governed by the message ordering paradigms. The
programmer can then write the program logic with respect to this particular behavior.
There are several orderings on messages, which are defined as non-FIFO, FIFO, causal
order and synchronous order.

456
(Refer Slide Time: 05:09)

Let us see a few definitions of these different kinds of message orders. An asynchronous
execution, as you know, is an execution for which the causality relation is a partial
order, as we have seen. Next is FIFO: a first-in-first-out execution is an asynchronous
execution in which, for all pairs of corresponding send and receive events (s, r) and
(s', r') such that the two sends happen at the same process and the two receives happen
at the same process, if one send precedes the other then the corresponding receives are
ordered in the same way.

So, if the send s precedes the other send s', then the receive r must also precede r';
that is the FIFO property being preserved, and the communication channel has to ensure
that delivery follows this FIFO manner. Similarly, causal order is defined for executions
in which, for all pairs (s, r) and (s', r'), the senders s and s' send to the same
destination process, so that r and r' happen at the same process.

And if there is a precedence between the send operations, that is, if s causally precedes
s', then the receive r also precedes r'; r and r' happening at the same process is what
this particular symbol indicates. The diagram reflects the definition of causal order: if
there is a precedence between two sends happening at different sites, then when the
corresponding messages are received at the common destination,

457
that precedence relation is also ensured at the time of delivery, and this ordering of
messages is called the causal ordering of messages.

(Refer Slide Time: 08:02)

Now, another kind of ordering is called synchronous order (sync order). A synchronous
execution is one in which each send is matched by an instantaneous receive, so that a
send and its receive can be viewed as a single atomic event; it is an execution for which
this modified causality relation is a partial order. This is how synchronous ordering is
realised on top of an asynchronous communication channel.

That means the sender and the receiver work in synchronization, and whatever the sender
sends, the receiver receives within that synchronization; this is called synchronous
ordering. It is also called instantaneous communication, and it uses the modified
definition of causality.

458
(Refer Slide Time: 09:06)

These kinds of ordering that we have defined, namely non-FIFO order, FIFO order, causal
order and synchronous order, form a hierarchy of execution classes, shown in the figure
at (a). In this figure you can see that the following relation holds: synchronous order
is a proper subset of causal order, causal order is a proper subset of FIFO, and FIFO is
a proper subset of non-FIFO, where A means asynchronous (non-FIFO). An asynchronous
execution does not have to follow any order, and that is why it is non-FIFO.

So, from this hierarchy we can see that the sync (synchronous) order is a proper subset
of the causal order, the causal order is a proper subset of FIFO, and FIFO is a proper
subset of the non-FIFO (asynchronous) executions. In the examples we can see that in a
sync execution the sender and receiver work in synchronization, which is represented by
the corresponding arrows in the figure.

In a causal order execution we can see that if two sends have a precedence relation, that
is, one send happened before the other, then this order is also ensured at the time the
messages are delivered; hence if s precedes s', then their deliveries also respect that
order, because the receives follow the causal order.

In FIFO we can see that the order in which the messages are sent over a channel is
preserved at the time of delivery of these messages. An asynchronous execution does not
have to follow any order, so the order at the time of sending the messages is not
necessarily preserved at the time of

459
delivery of the messages. These classes thus form a hierarchy. Now, on top of an
underlying asynchronous mode of message passing, where message delivery has unpredictable
delays, how these particular orderings are going to be ensured, implemented and used is
what we are going to see in the further slides.

(Refer Slide Time: 12:10)

Now, group communication: processes across a distributed system cooperate to solve joint
tasks, and often they need to communicate with each other as a group; therefore, there
needs to be support for group communication. An example of such an application is a
railway reservation system which comprises several nodes: a person outside can directly
communicate with this set of systems, which represents the railway reservation system and
includes the bank, the railways and so on, and this system requires group communication.

This is because the client or user has to communicate only with a set of processors,
which is called a group of processors, and this kind of communication is called group
communication. Group communication has to be supported for such applications. A message
broadcast is the sending of a message to all members in the distributed system; the
notion of the system can be confined to only those sites or processes participating in
the joint application. Refining the notion of broadcasting, there is multicasting, wherein

460
the message is sent to a certain subset, identified as a group, of the processes in the
system.

At the other extreme is unicasting, which is the familiar point-to-point message
communication. Let me repeat: sending a message to all is called broadcasting; sending a
message not to all, but to a subset or group, is called multicasting; and sending a
message to one other node is point-to-point communication, called unicasting.

(Refer Slide Time: 14:32)

Now, the network layer or hardware-assisted multicast cannot easily provide the
following: application-specific semantics on the order of message delivery, adapting
groups to dynamic membership, multicasting to an arbitrary set of processes at each send,
and multiple fault-tolerance semantics. So, these are a few properties that have to be
implemented at the application layer, because the network layer's hardware-assisted
multicast cannot provide them.

I told you about the closed group, where the source is itself a part of the group to
which it has to communicate; if the source is outside the group, then it is called an
open group. The following example illustrates the need for application-specific semantics
on the message delivery order.

461
For example, there are two processes P1 and P2, which are causally related through the
message M shown in the figure, and there are three sites R1, R2 and R3 which maintain
replicas. Suppose message M1 is multicast by P1 to all of R1, R2 and R3, and after that
P2 multicasts another message M2 to update the replicas; let us see what happens in this
scenario.

Now, the multicast of M1 happens before the multicast of M2, and the message M ensures
this precedence relation between P1 and P2. As far as R1 is concerned, R1 receives M2
first and then M1, so it does not follow this precedence relation; it violates it. R2
receives M1 first and then M2, and R3 also receives the two updates in some order.

So, in this first scenario the updates satisfy neither causal order nor total order.
Total order, which I will explain later, means that each site sees the same ordering.
Here you see that at R1 the update requests arrive in the order M2 followed by M1, while
R2 has another view, M1 followed by M2; so total order does not hold either, and both
causal order and total order are violated.

In another scenario, at all the sites M2 occurs first and then M1, so all sites see the
same order, but causal order is still violated. We will discuss all these things in
greater detail in this part of the lecture.

(Refer Slide Time: 18:34)

462
Raynal-Schiper-Toueg algorithm (1991): let us discuss this algorithm for group
communication. Intuitively it seems logical that a message M should carry a log of all
other messages, or their identifiers, sent causally before M's send event and sent to the
same destination as M. This log can then be examined to ensure whether it is safe to
deliver the message. All such algorithms aim to reduce this log overhead, and the space
and time overhead of maintaining the log information at the processes.

So, using this algorithm, the ordering required for causal order can be ensured before
delivery: if messages arrive that do not follow the causal order, they have to be
buffered, and by applying the Raynal-Schiper-Toueg algorithm before delivery they are
held back and delivered in the required order. That is why each message carries a log of
the other messages.

So, the message becomes heavy, because it contains all this ordering information; when
the message reaches a site it will be buffered, and based on the information contained in
the log the destination will try to order

(Refer Slide Time: 20:27)

the messages in its buffer, and then accordingly they will be delivered as per that
order. So, let us see this algorithm, which ensures such delivery; the algorithm assumes
FIFO channels. There are two properties, safety and liveness. The safety property is
ensured by a delivery condition on each message m received at Pi,

463
where this particular condition ensures that all messages sent causally earlier to Pi
have been delivered before this message is delivered; otherwise the message is stored in
the buffer. Now, you can see that this algorithm requires message space in the form of a
log, and also local space, because the data structures, an array of sent counts and an
array of delivered counts, have to be maintained.

So, we will see different techniques for reducing this, so that the overhead gets reduced
and the algorithm becomes efficient and affordable to implement.
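
The idea can be sketched compactly. Below is a simplified Python sketch (my own naming
and structure, not the lecture's pseudo-code) of the Raynal-Schiper-Toueg approach as
described above: each process piggybacks a SENT matrix (its log of how many messages each
process has sent to each destination) and delivers a message only when everything sent to
it causally earlier has already been delivered.

    # A simplified sketch (assumed names) of the Raynal-Schiper-Toueg idea:
    # SENT[j][k] = number of messages this process knows Pj has sent to Pk;
    # DELIV[j]   = number of messages from Pj delivered locally.
    class RSTProcess:
        def __init__(self, pid, n):
            self.pid, self.n = pid, n
            self.SENT = [[0] * n for _ in range(n)]
            self.DELIV = [0] * n
            self.pending = []                     # (sender, piggybacked SENT, message)

        def send(self, dest, msg, transport):
            piggy = [row[:] for row in self.SENT] # log of messages sent causally before msg
            self.SENT[self.pid][dest] += 1
            transport(dest, self.pid, piggy, msg)

        def on_receive(self, sender, piggy, msg, deliver):
            self.pending.append((sender, piggy, msg))
            self._try_deliver(deliver)

        def _try_deliver(self, deliver):
            progressed = True
            while progressed:
                progressed = False
                for entry in list(self.pending):
                    sender, piggy, msg = entry
                    # Safe to deliver only if every message sent to us causally
                    # before this one has already been delivered here.
                    if all(self.DELIV[k] >= piggy[k][self.pid] for k in range(self.n)):
                        self.pending.remove(entry)
                        deliver(sender, msg)
                        self.DELIV[sender] += 1
                        for j in range(self.n):   # merge the piggybacked log into ours
                            for k in range(self.n):
                                self.SENT[j][k] = max(self.SENT[j][k], piggy[j][k])
                        progressed = True

The n-by-n piggybacked matrix is exactly the overhead that later algorithms try to
reduce.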

(Refer Slide Time: 21:51)

Now, we are going to see total order. Total order requires that all the messages are
received in the same order by the recipients of the messages. It is formally defined as
follows: for each pair of processes Pi and Pj, and for each pair of messages Mx and My
that are delivered to both processes, Pi is delivered Mx before My if and only if Pj is
delivered Mx before My.

464
(Refer Slide Time: 22:46)

So, this is called total order. For the figure, I have already explained that the first
scenario does not satisfy total order, while the second one does, because every process
has the same view; if every process has the same view of the message delivery order, then
it is total order. How is total ordering of messages implemented? We are going to see the
first algorithm, which is called the centralized algorithm for total ordering of
messages.

So, if Pi wants to multicast a message M to a group G, then Pi will send it to the group
coordinator, and the group coordinator in turn will send it to the group members; hence
it is called a centralized algorithm. Considering that these channels are FIFO, the order
in which all the messages are received at the coordinator is the order in which the
coordinator sends them to the group members, and so the messages will be delivered in
that same order, because the channels are assumed to be FIFO.

So, assuming all the processes broadcast messages, the centralized solution shown in this
algorithm enforces the order in the system using FIFO channels: each process sends the
message it wants to broadcast to a centralized process, which simply relays all the
messages it receives to every other process in the group over FIFO channels.

465
(Refer Slide Time: 24:37)

So, it is straightforward to see that total order is satisfied, because the order in
which the messages are relayed by the coordinator is the same order in which they are
delivered. This algorithm also satisfies causal message order, provided the processes
communicate only through these coordinator-relayed broadcasts. The complexity of this
algorithm is that each transmission takes two message hops and exactly n messages in a
system of n processes. The drawback is that the centralized scheme has a single point of
failure, and there will also be congestion at the coordinator.
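
The centralized (fixed sequencer) scheme is simple enough to sketch directly. The
following is a minimal Python sketch (my own names; not from the lecture) of the idea:
every process forwards its multicast to one coordinator, which relays messages to all
group members over FIFO channels in the order it received them.

    # A minimal sketch (assumed names) of the centralized total-ordering approach.
    class Coordinator:
        def __init__(self, members, fifo_send):
            self.members = members
            self.fifo_send = fifo_send   # assumed to preserve per-destination FIFO order
            self.seq = 0

        def on_multicast_request(self, sender, msg):
            self.seq += 1                # one global arrival order for all messages
            for m in self.members:
                self.fifo_send(m, (self.seq, sender, msg))

    def member_on_receive(stamped_msg, deliver):
        seq, sender, msg = stamped_msg
        # With FIFO channels from the coordinator, messages already arrive in the
        # coordinator's order, so they can be delivered directly.
        deliver(sender, msg)

Every member sees exactly the coordinator's arrival order, which is why total order
holds; the single coordinator is also exactly where the failure and congestion problems
mentioned above arise.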

(Refer Slide Time: 24:57)

466
So, another algorithm is called the three-phase algorithm; it ensures total order, and it
also ensures causal order. A distributed algorithm that enforces total and causal order
for closed groups is given by this three-phase algorithm. The three phases of the
algorithm are first described from the viewpoint of the sender and then from the
viewpoint of the receiver.

So, this algorithm has two parts: one is the sender part of the algorithm, the other is
the receiver part. First we will see all three phases of the sender, and then for the
receiver we will see the three phases corresponding to the sender's. Sender, phase 1: in
the first phase, a process multicasts (line 1a of the algorithm that we will see) the
message M, with a locally unique tag and a local timestamp, to the group members.

Phase 2: in the second phase, the sender process awaits a reply from all the group
members, who respond with their tentative proposals for a revised timestamp for that
message M. The await call in line 1d is non-blocking, that is, any other messages
received in the meanwhile are processed. Once all the expected replies are received, the
process computes the maximum of the proposed timestamps for M and uses this maximum as
the final timestamp.

Phase 3: in the third phase, the process multicasts the final timestamp to the group, in
line 1f. So, the sender is involved three times: in the first phase it sends the message
M with a tentative timestamp; in the second phase it awaits the replies from all members
carrying their tentative proposals for the revised timestamp; and in the third phase it
sends the final timestamp. On the other side, the receiver receives these messages,
processes them and replies in the corresponding phases, as we are going to see here.

467
(Refer Slide Time: 27:42)

So, in phase 1 the receiver receives the message with the tentative timestamp sent by the
sender. It updates the variable priority, which tracks the highest proposed timestamp,
revises the proposed timestamp to this priority, and places the message, with its tag and
the revised timestamp, at the tail of the queue; in the queue, the entry is marked as
undeliverable.

That means that after receiving the message from the sender, the receiver fixes a
tentative proposed timestamp, and accordingly the message is placed in the queue and
marked as undeliverable. It will not be delivered yet, because it is stored in the queue;
according to the timestamps, the messages will be ordered and then delivered.

468
(Refer Slide Time: 28:46)

So, to do that, it goes to the second phase: the receiver sends the revised timestamp
back to the sender, and then waits in a non-blocking manner for the final timestamp. In
phase 3 the final timestamp is received from the multicaster (the sender); the
corresponding message entry in the queue is identified using the tag and is marked as
deliverable after the revised timestamp is overwritten by the final timestamp. The queue
is then resorted using the timestamp field of the entries.

As the queue is already sorted except for the modified entry of the message under
consideration, only that entry has to be placed into its sorted position. If the message
entry is at the head of the queue, then that entry, and all consecutive subsequent
entries that are also marked as deliverable, are dequeued from the queue and delivered to
the process.
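
The following is a compact Python sketch (my own structure and names, not the lecture's
pseudo-code) of the three-phase idea just described, for a single sender multicasting to
a fixed group of receivers; message loss is ignored and timestamp ties are only noted in
a comment.

    # A compact sketch (assumed names) of the three-phase total ordering idea.
    class ThreePhaseSender:
        def __init__(self, pid, group, send):
            self.pid, self.group, self.send = pid, group, send
            self.clock = 0
            self.proposals = {}                   # tag -> list of proposed timestamps

        def multicast(self, tag, msg):
            self.clock += 1
            self.proposals[tag] = []
            for q in self.group:                  # phase 1: tentative timestamp
                self.send(q, ('REVISE_TS', self.pid, tag, self.clock, msg))

        def on_proposal(self, tag, proposed_ts):
            self.proposals[tag].append(proposed_ts)
            if len(self.proposals[tag]) == len(self.group):   # phase 2 complete
                final_ts = max(self.proposals[tag])
                for q in self.group:              # phase 3: commit the final timestamp
                    self.send(q, ('FINAL_TS', self.pid, tag, final_ts))

    class ThreePhaseReceiver:
        def __init__(self, pid, send, deliver):
            self.pid, self.send, self.deliver = pid, send, deliver
            self.priority = 0
            self.queue = []                       # [ts, sender, tag, msg, deliverable]

        def on_revise_ts(self, sender, tag, ts, msg):
            self.priority = max(self.priority + 1, ts)        # propose a revised timestamp
            self.queue.append([self.priority, sender, tag, msg, False])
            self.send(sender, ('PROPOSED_TS', self.pid, tag, self.priority))

        def on_final_ts(self, sender, tag, final_ts):
            for entry in self.queue:
                if entry[1] == sender and entry[2] == tag:
                    entry[0], entry[4] = final_ts, True       # overwrite ts, mark deliverable
            # A real implementation also breaks timestamp ties, e.g. by sender id.
            self.queue.sort(key=lambda e: (e[0], e[1]))
            while self.queue and self.queue[0][4]:            # deliver from the head
                ts, s, t, m, _ = self.queue.pop(0)
                self.deliver(s, m)

The worked example on the next slide follows exactly this flow, with tentative timestamps
7 and 9 and final timestamps 10 and 9.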

469
(Refer Slide Time: 29:32)

So, this particular code I have already explained; let us go through an illustrative
example of the working, or execution, of the three-phase algorithm.

(Refer Slide Time: 29:37)

Now, here you can see that sites A and B want to send a message to the two replicas C and
D. A sends its message with tag 7 to C and D, and B sends its message with tag 9 to C and
D. Now, when D receives the message with tag 7, it assigns its own proposed timestamp and
positions the message in the temporary queue; it will

470
not be delivered yet. In fact D receives the message from B first; that message is queued
with the proposed timestamp 9, and the next message, the one with tag 7, is likewise not
delivered. Since D's priority is already 9, the time is incremented and this message is
given the proposed timestamp 10. So, at the head of D's queue is the message that came
from B; it is not delivered, it is just queued. The revised (proposed) timestamp 10 is
sent back from D to A in the second phase, and similarly the timestamp 9 is sent from D
to B. At C, the message with tag 7 is received first; it is enqueued at the head of the
queue and 7 is returned to A. Once the message with tag 9 is received, it is enqueued in
order and the timestamp 9 is returned to B.

(Refer Slide Time: 31:34)

Now, the senders process these replies. After receiving the values 7 and 10, A takes the
maximum, 10, as the final value and sends the final timestamp 10 to both replicas;
similarly, after receiving its two proposed values, B takes the maximum, 9, and sends 9
back to both of them. So, in this scenario, 9 is at the head of the queue at D; the entry
is moved from the temporary queue to the delivery queue, and that message is delivered to
the process.

At C the same ordering results: since 9 is at the head of the queue, it is put on the
delivery queue and the message with timestamp 9 is delivered. So, first the message with
timestamp 9 is delivered at both ends, and then the one with timestamp 10; hence the
algorithm achieves total order.

471
(Refer Slide Time: 32:36)

So, all these steps of the algorithm have been explained through the example. The
complexity of this algorithm: it uses three phases, and in every phase a message exchange
takes place.

In every phase (n - 1) messages are exchanged, so over the three phases 3(n - 1) messages
are exchanged per multicast. The delay is three message hops, because the messages go
back and forth three times. The algorithm also implements causal order, as we have
already seen. The three-phase algorithm is closely structured along the lines of
Lamport's algorithm for mutual exclusion, which we have studied.

Lamport's mutual exclusion algorithm has the property that when a process's request is at
the head of its own queue and it has received a reply from every other process, the
request of that process is at the head of all the queues; this can be exploited to
deliver a message at all the processes in the same order, instead of entering a critical
section. Exactly along the same lines this particular algorithm was designed and
structured.

472
(Refer Slide Time: 33:33)

So, we have seen the algorithm for total order, and that order is achieved through the
algorithm because the underlying network we are assuming is asynchronous; the ordering
required by the application is ensured through the algorithm we have just covered.

Now the next topic is multicasting. As I told you, unicasting, multicasting and
broadcasting are three different ways of communication used by applications in a
distributed system.

So, now we are talking about multicasting, that is, communication to a group of
processes. There are four classes of source-destination relations for open groups; the
previous algorithm was for closed groups, and this nomenclature is for open groups. The
first class is SSSG, that is, single source and single destination group, where there is
a single source and a single group as the destination.

MSSG is multiple sources and a single destination group: here you see there are multiple
sources and a single destination group. Then comes SSMG, meaning single source and
multiple groups; here the multiple destination groups may also overlap. And then MSMG,
that is, multiple sources (here you see two sources) and multiple groups, which may also
overlap. So, there are four different ways in which these open-group relations can be
organized or classified.

473
Now, before we go ahead, let us see that it is quite easy to implement SSSG, because it
can be done using the centralized approach: there is a coordinator, the source sends to
the coordinator, and the coordinator sequences the messages in order, so this case is
quite trivial. MSSG can also be handled using the centralized approach, because the
multiple sources can be reduced to a single-source single-group communication: the
sources send to the coordinator, and the coordinator in turn sends to the one group.
Multiple sources and multiple groups, however, is very difficult; it cannot be
straightforwardly implemented in the SSSG way.

(Refer Slide Time: 37:17)

So, now we are going to see an implementation for multiple sources and multiple groups.
One way to implement multicast for multiple sources and multiple groups is through
propagation trees; propagation trees for multicasting serve this purpose.

474
(Refer Slide Time: 37:46)

So, let us see how the propagation trees are constructed. Given the set of nodes, we form
meta-groups out of the groups, and these meta-groups can be organized in the form of a
tree; this is called a propagation tree.

In this example illustrating the propagation tree, the meta-groups are shown in boldface.
You can see that (ABC) is a primary meta-group; under this primary meta-group are the
meta-groups (AB), (AC), (A), (B), (C), (BC) and also (BCD). So (ABC) becomes the primary
meta-group and the root of all those meta-groups.

In turn, (BCD) becomes the primary meta-group for the meta-groups (BD), (CD) and (D) as
well as (DE); (BC) has already been absorbed under (ABC), so it is not repeated here.
Next, (DE) becomes the primary meta-group for (E), (CE) and (EF), and (EF) in turn
becomes the primary meta-group for (F).

So, just see that the complete set of groups of nodes is organized in the form of a
propagation tree; the implementation of multicast over a propagation tree can now be
handled easily. Once the propagation tree is there and you want to send messages in a
multicast way, the messages are first sent to the relevant primary meta-group node, which
in turn communicates them to its meta-groups; and if a message has to

475
reach a meta-group such as (F), it has to traverse through the chain of primary
meta-groups, each containing the next, until it reaches that meta-group, where it is
delivered; delivery thus proceeds in this tree fashion.
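
The construction of the meta-groups follows the figure; the forwarding step itself can be
sketched generically. Below is a minimal Python sketch (the structure and names are my
own assumptions, not the lecture's algorithm) of delivering a multicast down a
propagation tree: the message enters at the primary meta-group and each node forwards it
only into subtrees that still contain addressed destination groups.

    # A generic sketch (assumed structure) of forwarding a multicast along a
    # propagation tree of meta-groups.
    class MetaGroup:
        def __init__(self, name, groups, members):
            self.name = name
            self.groups = set(groups)        # groups whose common members form this meta-group
            self.members = members           # processes in this meta-group
            self.children = []               # child meta-groups in the propagation tree
            self.subtree_groups = set(groups)  # all groups reachable via this subtree

    def forward_multicast(node, dest_groups, msg, deliver_local):
        # Deliver locally if this meta-group belongs to one of the addressed groups.
        if node.groups & dest_groups:
            for p in node.members:
                deliver_local(p, msg)
        # Forward only into subtrees that still contain addressed destinations.
        for child in node.children:
            if child.subtree_groups & dest_groups:
                forward_multicast(child, dest_groups, msg, deliver_local)

Because every multicast addressed to a given set of overlapping groups flows along the
same path from the primary meta-group downward, all those destinations see the messages
in one consistent order, which is the point of using a propagation tree.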

(Refer Slide Time: 40:42)

Now, there are various classes of application-level multicast algorithms. One class is
privilege-based algorithms: in a privilege-based algorithm a token rotates, and the node
holding the token is allowed to send to the destinations; the processes deliver messages
in the order of the sequence numbers. Typically closed groups are assumed, and causal
order and total order are provided. Two algorithms implemented based on this
classification are Totem and On-Demand multicast.

476
(Refer Slide Time: 41:33)

The next class is called moving sequencer. Here there are sequencer nodes, and ordinary
nodes send their messages to the sequencers. A token carrying a sequence number, and the
list of messages for which sequence numbers have already been assigned, circulates among
the sequencers; on receiving the token, a sequencer assigns sequence numbers to the
received but as yet unsequenced messages and sends the newly sequenced messages to the
destinations.

Another class is called fixed sequencer. A fixed sequencer is like the centralized
approach we have already seen: there is one single coordinator. The last class is called
destination agreement. In destination agreement, the ordering is decided among the
destinations: the destinations receive limited ordering information, and using, for
example, timestamps as in the Lamport-style three-phase algorithm, an agreement is
reached.

477
(Refer Slide Time: 42:44)

A few other algorithms for message ordering and group communication available in the
literature are listed here. Conclusion:

(Refer Slide Time: 42:53)

Interprocess communication via message passing is at the core of any distributed system.
In this lecture we have discussed the non-FIFO (asynchronous), FIFO, causal order and
synchronous communication paradigms for message ordering, and then examined several
algorithms to implement these orderings. Group communication is an important aspect of
communication in a distributed system.

478
Causal order and total order are the popular forms of ordering when doing group
multicasting and broadcasting. We then explained propagation trees for multicast and the
classification of application-level multicast algorithms. In the upcoming lecture we will
discuss self-stabilization.

Thank you.

479
Distributed Systems
Prof. Rajiv Misra
Department of Computer Science and Engineering
Indian Institute of Technology, Patna

Lecture - 17
Self-stabilization

Lecture 17: Self-stabilization. Preface: recap of previous lecture.

(Refer Slide Time: 00:22)

In the previous lecture we discussed message ordering, group communication and
application-level multicast algorithms. Content of this lecture: in this lecture we will
discuss the concept of self-stabilization, related issues in the design of
self-stabilizing distributed algorithms and systems, and Dijkstra's self-stabilizing
token ring system.

480
(Refer Slide Time: 00:50)

Introduction: the concept of self-stabilization. The idea of self-stabilization in
distributed computing was first proposed by Dijkstra in 1974. The concept of
self-stabilization is that, regardless of its initial state, the system is guaranteed to
converge to a legitimate state in a bounded amount of time, by itself, without any
outside intervention. A non-self-stabilizing system, in contrast, may never reach a
legitimate state, or may reach a legitimate state only temporarily. The main complication
in designing a self-stabilizing distributed system is that nodes do not have global
memory that they can access instantaneously. Each node must make decisions based on the
local knowledge available to it, and the actions of all the nodes together must achieve
the global objective.

481
(Refer Slide Time: 01:54)

The definition of legitimate and illegitimate states depends upon the particular
application. Generally, all illegitimate states are defined to be those states which are
not legitimate. Dijkstra also gave an example of the concept of self-stabilization using
a self-stabilizing token ring system, called Dijkstra's self-stabilizing token ring
system.

For any token ring, when there are multiple tokens or there is no token, such global
states are known as illegitimate states. When we consider distributed systems, where a
large number of systems are widely distributed and communicate with each other using
message passing or a shared memory approach, there is a possibility for these systems to
go into an illegitimate state, for example if a message is lost; the concept of
self-stabilization can help us recover from such situations in a distributed system. So,
again, before going ahead: Dijkstra gave the example of a token ring system. In a token
ring system, as you know, only one token, called the privilege, circulates, and as long
as exactly one token circulates, the system is in a legitimate state.

Now, in contrast to this, if the token ring has two privileges (two tokens), or more than
one token, or no token at all (the token is lost), that situation is an illegitimate
state as far as this definition is concerned. If the system moves from such an
illegitimate state to a legitimate state automatically, without external intervention,
then it is called a self-stabilizing system; if it is a token ring system, then it is
called a self-stabilizing token ring system.

482
In distributed systems, as you have seen, many processors are connected through a
communication network and exchange information through message communication. So, there
is a possibility that nodes may fail or messages may be lost; obviously, these conditions
can lead to an illegitimate state in the distributed system. How a self-stabilizing
distributed system automatically recovers to a legitimate state is the design issue that
Dijkstra opened up through this system, which is called a self-stabilizing system.

So, the concept of self-stabilization is very useful for understanding self-stabilizing
systems and for designing different distributed systems based on this concept.

(Refer Slide Time: 05:01)

So, let us explain the concept of self-stabilization using an example. Take a group of
children and ask them to stand in a circle. After a few minutes you will see an almost
perfect circle, without having to take any further action. In addition, you will also
discover that the shape of this circle is stable, at least until you ask the children to
disperse. If you force one of the children out of position, the others will move
accordingly and they will form a bigger circle.

483
(Refer Slide Time: 05:44)

So, the shape of the circle is kept unchanged in all these cases. In this example the
group of children builds a self-stabilizing circle: if something goes wrong with the
circle, they are able to rebuild it by themselves, without any external intervention. So,
we have seen an example of a self-stabilizing circle built by a group of children, along
with its legitimate and illegitimate states.

So, this particular example motivates us to understand self-stabilization in different
types of systems, especially in distributed systems. Now, the time required for
stabilization varies from experiment to experiment, depending on the random initial
positions; however, if the field size is limited, as in the case of the children building
a circle, this time will be bounded.

The algorithm does not define the position of the circle in the field, so it will not
always be the same; the position of each child relative to the others will also vary.

484
(Refer Slide Time: 07:06)

So, there are several factors which determine how much time the system takes to
stabilize. The self-stabilization principle applies to any system built from a
significant number of components which evolve independently from one another, but which
cooperate or compete to achieve a common goal; a distributed system is an example.

So, this applies in particular to large distributed systems, which tend to result from
the integration of many subsystems and components developed separately, at earlier times
or by different people.

485
(Refer Slide Time: 07:54)

So, in this lecture we will first present the system model of a distributed system and
give the definitions of self-stabilization. Next we will discuss Dijkstra's seminal work
and use it to motivate the topic. Then we will discuss issues arising from Dijkstra's
original presentation, as well as several related issues in the design of
self-stabilizing algorithms and systems.

(Refer Slide Time: 08:20)

Let us see the system model. The term distributed system is used to describe a set of
computers that communicate over a network; variants of distributed systems have

486
similar fundamental coordination requirements among the communicating entities, whether
they are computers, processors or processes.

Thus an abstract model that ignores the specific setting and captures the important
characteristics of a distributed system is usually used. In a distributed system each
computer runs a program composed of executable statements. Each execution changes the
contents of the computer's memory. An abstract way to model a computer that executes a
program is to use the state machine model.

(Refer Slide Time: 09:06)

A distributed system model comprises n state machines, called processors, that
communicate with each other. Each processor is nothing but a state machine; these
processors communicate with each other, and each processor makes transitions among its
states. So the model consists of a set of n state machines, called processors,
communicating with each other. The i-th processor is usually denoted by Pi, and the
neighbors of a processor are the processors it can communicate with directly; here we
call them the left and the right neighbors, and the ring we consider is directed
anticlockwise.

A processor can directly communicate only with its neighbors. A distributed system can
be conveniently represented by a graph in which each processor is represented by a node
and every pair of neighboring nodes is connected by a link.

487
(Refer Slide Time: 10:46)

That is shown in the figure. The communication between neighboring processors can be
carried out either by message passing or by shared memory. Communication by writing into
and reading from shared memory usually fits systems with processors that are
geographically close together, such as multiprocessor computers.

The message passing distributed model fits both processors that are located close to each
other and processors that are widely geographically distributed and connected over a
network. In the message passing model, neighbors communicate by sending and receiving
messages. The message passing communication model contains a queue Qi,j for the messages
sent from Pi to Pj.

488
(Refer Slide Time: 11:36)

It is convenient to identify the state of a computer or a distributed system at a given
time so that no additional information about the past of the computation is needed in
order to predict the future behavior of the computer or the distributed system. A full
description of a message passing distributed system at a particular time consists of the
state of every processor and the contents of every queue.

(Refer Slide Time: 12:01)

The term configuration is used for such a description. A configuration is denoted by a
tuple c = (s1, s2, ..., sn, q1,2, q1,3, ..., qn,n-1), which contains the state si of
every processor Pi and the state qi,j of every queue Qi,j,

489
where i ≠ j; the state of Qi,j is the set of messages sent by Pi to Pj but not yet
received. The behavior of a system consists of a set of states, a transition relation
between those states, and a set of fairness criteria on the transition relation.

(Refer Slide Time: 12:34)

The system is usually modeled as a graph of processing elements, where edges between the
elements model the unidirectional or bidirectional communication links, as I have already
explained. Let N be an upper bound on n, the number of nodes in the system; communication
is usually restricted to the neighbors of a node. The diameter of the network and an
upper bound on this diameter are denoted by d and ∆ respectively.

490
(Refer Slide Time: 13:04)

A network is static if the topology remains fixed, and dynamic if links and nodes can go
down and recover later on; in the dynamic case, self-stabilization guarantees recovery
eventually, in spite of such faults. In the shared memory model, two neighboring nodes
have access to a common data structure (a shared variable); we will not dwell on that
model here, since a globally shared variable is not available in a distributed system.

(Refer Slide Time: 13:23)

So, the algorithms are modeled as state machines performing a sequence of steps. A step
consists of reading the input and the local state,

491
(Refer Slide Time: 13:35)

and then performing a state transition and writing the output. Communication can be by
exchanging messages over a communication channel. A related characteristic of the system
model is the execution semantics; in self-stabilization, this has been encapsulated
within the notion of a scheduler or daemon (also spelled demon). Under a central daemon,
at most one processing element is allowed to take a step at any point of time.

(Refer Slide Time: 14:11)

So, we will see that these particular assumptions are well defined for a given
self-stabilizing system. Definition of self-stabilization: we have seen an informal
definition of self-stabilization at the beginning. Formally, we define
492
informal definition of a self-stabilization at the beginning. Formally, we define self-
self-stabilization for a system S with respect to a predicate P over its set of global
states, where P is intended to identify its correct execution. The states satisfying the
predicate P are called legitimate states, and those not satisfying P are called
illegitimate states. We use the terms safe and unsafe interchangeably with legitimate and
illegitimate, respectively. A system S is self-stabilizing with respect to the predicate
P if it satisfies the following two properties, which are the most important ones.

The first property is called closure: P is closed under the execution of S; that is, once
P is established in S, it cannot be falsified. The second is called convergence: starting
from an arbitrary global state, the system S is guaranteed to reach a global state
satisfying P within a finite number of transitions. So, closure and convergence are the
two important properties of self-stabilization.
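
For small finite-state systems, the two properties can be checked mechanically. The
following is a generic Python sketch (my own formulation, not from the lecture) that
tests closure and bounded convergence given the set of states, a function returning the
possible next states under the demon's choices, and the legitimacy predicate P.

    # A generic sketch for checking the two self-stabilization properties on a
    # small finite-state system: step(state) returns all possible next states,
    # legitimate(state) is the predicate P.
    def check_closure(states, step, legitimate):
        # Once P holds, every possible move must keep it holding.
        return all(all(legitimate(s2) for s2 in step(s))
                   for s in states if legitimate(s))

    def check_convergence(states, step, legitimate, max_moves):
        # Check that every execution of at most max_moves steps, starting from any
        # (possibly illegitimate) state, passes through a legitimate state,
        # whichever privileged machine the demon picks at each step.
        def converges(s, depth):
            if legitimate(s):
                return True
            if depth == 0:
                return False
            return all(converges(s2, depth - 1) for s2 in step(s))
        return all(converges(s, max_moves) for s in states)

A check of this kind only works for small systems, but it is handy for experimenting with
the token ring algorithms discussed next.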

(Refer Slide Time: 15:36)

So, we have defined self-stabilization for a system S; stabilization can be defined more
generally, and self-stabilization is a special case of stabilization.

493
(Refer Slide Time: 15:54)

Next, there is the notion of a reachable set. Often, when a programmer writes a program,
he or she does not have a particular definition of safe and unsafe states in mind, but
develops the program to function from a particular set of start states. In such a
situation it is reasonable to define as safe states those states that are reachable under
normal program execution from the set of legitimate start states.

(Refer Slide Time: 16:21)

These are referred to as reachable sets. Transient failures: a transient failure is
temporary or short-lived and does not persist. A transient failure may be caused by

494
corruption of the local state of a process, or by corruption of a channel or of shared
memory. Transient failures may change the state of a system, but not its behavior.

(Refer Slide Time: 16:35)

Issues in the design of self-stabilization algorithms: a distributed system comprises
many individual units, and many issues arise in the design of self-stabilization
algorithms in distributed systems. Some of the main issues are: the number of states in
each of the individual units, uniform and non-uniform algorithms, central and distributed
demons, reducing the number of states in a token ring, shared memory models, mutual
exclusion, and the cost of self-stabilization.

495
(Refer Slide Time: 17:10)

The mentioned issues can be explained with the help of Dijkstra's landmark
self-stabilizing token ring system. His token ring system consists of a set of n
finite-state machines connected in the form of a ring, as I told you previously, and he
defines the privilege of a machine to be its ability to change its state. So the
privileged node has the ability to change its state; Dijkstra considers a legitimate
state to be one in which there is only one privilege at a particular point of time, and
we will see more than one of his solutions in this part of the discussion.

This ability is based on a Boolean predicate over the machine's current state and the
states of its neighbors. When a machine has a privilege, it is able to change its current
state, which is referred to as a move. Furthermore, when multiple machines enjoy a
privilege at the same time, the choice of the machine that is entitled to make a move is
made by a central demon, which arbitrarily decides which privileged machine will make the
move. These are the important concepts, which I will highlight again before going ahead.

So, the first thing is that in a given self-stabilizing system there must be some set of
privileges, defined by Boolean predicates, and moves by privileged machines take the
system from an illegitimate state to a legitimate state, following the two rules which we
have seen, closure and convergence. Furthermore, when more than one machine enjoys a
privilege at the same point of time, the central

496
demon comes in and decides which one among the many privileged machines will be allowed
to make a move at that particular point of time.

(Refer Slide Time: 19:38)

A legitimate state must satisfy the following constraints. There must be at least one
privilege in the system; that is liveness, or no deadlock. Every move from a legal state
must again put the system into a legal state;

that is, from one legal state the system makes a move and goes to another legal state,
which is the closure property. During an infinite execution, each machine should enjoy a
privilege an infinite number of times.

497
(Refer Slide Time: 20:21)

So, that is the no-starvation condition. Finally, given any two legal states, there is a
series of moves that changes one legal state into the other; that is called reachability.
Dijkstra considered a legitimate or legal state as one in which exactly one machine
enjoys the privilege; as I told you, only one privilege is initially considered by
Dijkstra. This corresponds to a form of mutual exclusion, because the privileged process
is the only process that is allowed into the critical section; once the process leaves
the critical section, it passes the privilege on to another node. Next, the number of
states in each of the individual units.

(Refer Slide Time: 20:51)

498
So, the number of states that each machine must have for self-stabilization is an
important issue, and not only an important issue but a design issue as well: we would
like to minimize this number of states, and that is one of the most important design
issues in self-stabilization. Dijkstra offered three solutions for a directed ring with n
machines, each having K states. In the first solution the number of states is assumed to
be at least the number of machines, that is, K ≥ n; the second solution assumes K = 4;
and the third assumes K = 3. Ghosh later proved that a minimum of three states is
required in a self-stabilizing ring.

So, Ghosh proved that a minimum of three states is required in a self-stabilizing ring,
and all three of Dijkstra's algorithms assume the existence of at least one exceptional
machine that behaves differently from the others.

(Refer Slide Time: 22:23)

Let us see the first solution, where the number of states is assumed to be greater than
or equal to the number of nodes (processors) in the system; this is a very loose bound on
the number of states. For any machine we use the symbols S, L and R: S means its own
state, L is the state of its left neighbor, and R is the state of its right neighbor on
the ring, respectively. That is, if this is the ring and this is the current machine,
then its state is S, the state of its left neighbor is L, and the state of its right
neighbor is R.

499
Now, Dijkstra assumed one machine, called the exceptional machine, whose code is as
follows: if L = S, that is, if the state of the left neighbor is equal to the current
state, then the current state is modified as S := (S + 1) mod K. For the other machines,
which are not the exceptional machine, the rule is: if the state of the left neighbor is
not equal to the current state, that is L ≠ S, then the current state is set to the left
neighbor's state, S := L.
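
These two rules can be simulated directly. The following small Python sketch (the
simulation harness and names are mine, not Dijkstra's presentation) runs this K-state
algorithm on a ring under a central demon that picks an arbitrary privileged machine at
each step, starting from an arbitrary initial state.

    import random

    # A small simulation (assumed harness) of Dijkstra's first solution with K >= n.
    def privileged(S, i, K):
        L = S[(i - 1) % len(S)]               # state of the anticlockwise (left) neighbour
        return (L == S[i]) if i == 0 else (L != S[i])

    def move(S, i, K):
        L = S[(i - 1) % len(S)]
        if i == 0:
            S[0] = (S[0] + 1) % K             # exceptional machine: if L = S then S := (S+1) mod K
        else:
            S[i] = L                          # other machines: if L != S then S := L

    def run(n, K, steps=1000):
        S = [random.randrange(K) for _ in range(n)]   # arbitrary, possibly illegitimate, start
        for _ in range(steps):
            priv = [i for i in range(n) if privileged(S, i, K)]
            i = random.choice(priv)           # the central demon's arbitrary choice
            move(S, i, K)
        return sum(privileged(S, i, K) for i in range(n))

    print(run(n=6, K=6))                      # prints 1 once the ring has stabilized

After enough moves, exactly one machine remains privileged at any time, which is the
legitimate (single-token) state described below.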

(Refer Slide Time: 24:06)

Now, in this algorithm, except for the exceptional machine 0, all other machines follow
the same rule: in the ring topology, each machine compares its state with the state of
its anticlockwise neighbor, and if they are not the same, it updates its state to be the
same as that of its anticlockwise neighbor.

So, if there are n machines and each of them is initially in a random state drawn from
the possible set of states, then all machines, except the exceptional machine 0, whose
states are not the same as their anticlockwise neighbor's are said to be privileged, and
there is a central demon which decides which among those privileged machines will be
allowed by the system to make a move.

500
(Refer Slide Time: 25:00)

Suppose machine 6, in a system which has more than six nodes, makes the first move. Its
state is not the same as that of machine 5, and hence it had the privilege to make the
move; it then sets its state to be the same as that of machine 5. Now machine 6 loses the
privilege, as its state is the same as that of its anticlockwise neighbor.

So, let us see an example of this description. If machine 6 is going to make a move, its
left neighbor is machine 5, and it can move only if the current state of 6 is not equal
to the current state of 5; it will then change its state accordingly. Let us say that the
state of 5 is 0 and the state of 6 is 1. Then the state of 6 will be changed to 0, so
that it becomes the same as that of machine 5. Now machine 6 loses its privilege, as its
state is the same as that of its anticlockwise neighbor, machine 5, as I have explained.

Next, suppose machine 7, whose state is different from the state of machine 6, say its
state is 1, is given the privilege. It has a privilege because its left neighbor's state
is not the same as its current state: 0 and 1 are different, and hence it is privileged.
Its move results in the state of machine 7 becoming that of machine 6, so it will become
0 in this case. Now machines 5, 6 and 7 are all in the same state, which is 0 in the
above example.

501
So, if you follow the progress, eventually all the machines will make such transitions
and will have the same state, namely 0; then what happens? At this point only the
exceptional machine, machine 0, is privileged, as its condition L = S is satisfied: there
is a machine 0 whose left neighbor's state and its own current state are the same. Then,
according to the exceptional machine's code, since they are equal, it makes a move and
changes its state.

(Refer Slide Time: 28:23)

Now, there exists only one privilege, or token, in the system: machine 0 makes a move and
changes its state from S to (S + 1) mod K. So machine 0 makes a move and its state
becomes 1, according to this formula. This triggers its clockwise neighbor, whose left
neighbor is now 1 while its own state is still 0, so they are not the same, and so on;
in this way the token (the move) circulates around the ring.

So this makes the next machine, machine 1, as shown, privileged, since its state is not
the same as that of its anticlockwise neighbor; thus the token can be interpreted as
currently being with machine 1. Machine 1 changes its state to 1, and so on; then the
privilege goes to machine 2, which still has state 0, so it becomes privileged, and so
on. The moves thus proceed in the clockwise direction, although the ring is identified in
an anticlockwise manner.

502
So, machine 1, as per the algorithm, changes its state to the same state as machine 0,
and the privilege moves on to machine 2, and so on. This is a simple algorithm, but it
requires a number of states that depends on the size of the ring, which may be awkward
for some applications.

(Refer Slide Time: 30:02)

So, as I explained, suppose node 6 is going to make a move; it can move only if it is
privileged, and privileged means that its left neighbor's state L is not equal to its
current state S. Let us assume the left neighbor's state is 0 and this machine's state is
1, so they are not equal. If they are not equal, then S is assigned L; that means 1 is
changed to 0, and both now have the same state.

Now consider machine 7, which has state 1. These two states (0 and 1) are not the same,
so machine 7 becomes privileged, while machine 6 is no longer privileged because its left
neighbor's state is the same as its current state. Since machine 7 is privileged, it
again sets its state to the same as the left one; that is, S is assigned L, which is 0.
In this way, eventually all the nodes change to the same state, but there is an
exception.

According to the convention, there is an exceptional node; let us say node 0 is the
exceptional node. The code of the exceptional node says that if its current state S is
equal to its left neighbor's state L, then it changes its state. So the exceptional
machine's rule says that if both are equal, then S will be

503
incremented by 1 (mod K). So, if both are equal to 0, its state is changed to 1 in this
case. Now the next non-exceptional node sees that its left neighbor is 1, so it is going
to change; it changes its state, and node 0 is no longer privileged, while the privilege
is now here, because S is not equal to L, and so on.

So, just see that the states are updated one after the other and changed to 1, until the
token finally comes back around, the exceptional node finds itself privileged again, and
it changes its state once more, and so on. The token keeps rotating in this manner, and
this is the legitimate behavior: whenever there is a move, the system goes from one
legitimate state to another legitimate state, according to the programs defined above.

(Refer Slide Time: 33:29)

Now there is another solution. The previous solution considered K ≥ n, which means the
number of states grows with the number of nodes in the system.

This solution instead assumes K = 3, so the states are 0, 1 and 2, and a three-state
machine is assigned to every node of the system. In the first algorithm there is only one
exceptional machine; by reducing the number of states from n to 3, the number of
exceptional machines increases from one to two.

504
So, there are two machines which are exceptional machines, each with its own exceptional
machine code. One is machine 0, which is also called the bottom machine. If you look at
the ring structure, the node numbered 0 is the bottom machine, and the machine numbered
n - 1 is called the top machine; these are the two exceptional machines.

(Refer Slide Time: 35:02)

So, let us see the program in this setting where K = 3. The bottom machine, as I
explained, has the following code: if (S + 1) mod 3 = R, where R is the state of the
neighbor on the right side, then S := (S - 1) mod 3. The top machine, which is machine
number (n - 1): if the left neighbor's state equals the right neighbor's state, L = R,
and (L + 1) mod 3 is not equal to the current state S of the machine, then the state of
the machine becomes S := (L + 1) mod 3.

For all the other machines, either of the following two rules is applied, whichever is
applicable: if (S + 1) mod 3 equals the left neighbor's state L, then S := L; otherwise,
if (S + 1) mod 3 equals the right neighbor's state R, then S := R.
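
The three rules can be written down directly. The sketch below (my own coding of the
rules as stated; in particular, the top machine's right neighbour is taken to be the
bottom machine 0, closing the ring, which is an assumption of this sketch) computes, for
a given configuration, which machines are privileged and what each would move to.

    # The three-state rules as stated above (harness names are mine).
    # Machine 0 is the bottom machine, machine n-1 is the top machine.
    def privileged_moves(S):
        n = len(S)
        moves = {}                               # machine index -> new state if it moves
        # bottom machine: if (S+1) mod 3 == R then S := (S-1) mod 3
        if (S[0] + 1) % 3 == S[1]:
            moves[0] = (S[0] - 1) % 3
        # top machine: if L == R and (L+1) mod 3 != S then S := (L+1) mod 3
        L, R = S[n - 2], S[0]                    # right neighbour of the top is machine 0
        if L == R and (L + 1) % 3 != S[n - 1]:
            moves[n - 1] = (L + 1) % 3
        # other machines: if (S+1) mod 3 == L then S := L; else if (S+1) mod 3 == R then S := R
        for i in range(1, n - 1):
            L, R = S[i - 1], S[i + 1]
            if (S[i] + 1) % 3 == L:
                moves[i] = L
            elif (S[i] + 1) % 3 == R:
                moves[i] = R
        return moves

Under a central demon, one entry of `moves` is chosen and applied at each step; repeating
this from any initial assignment of states 0, 1 and 2 drives the ring toward
configurations with exactly one privilege, as the table discussed below illustrates.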

505
(Refer Slide Time: 36:13)

So, in this algorithm the bottom machine 0 behaves as we have seen: the move of the
bottom machine depends upon its current state and the state of its right neighbor. The
condition on (S + 1) mod 3 covers the three possible states S = 0, 1 and 2, for which
(S + 1) mod 3 = 1, 2 and 0 respectively; this results in the following three
possibilities. When (S + 1) mod 3 = 1, 2 or 0 and it is the same as R, then S becomes 2,
0 or 1 respectively.

So, when S = 0 and R = 1, the state S is changed to (S - 1) mod 3, that is 2; when S = 1
and R = 2, S is changed to 0, because S - 1 = 1 - 1 = 0; and similarly, when S = 2 and
R = 0, S - 1 gives 1. This is how the bottom machine, machine number 0, behaves.

506
(Refer Slide Time: 37:48)

Similarly, we can look at the top machine, machine number (n - 1), which behaves
according to its rule. The move of the top machine depends upon both its left and right
neighbors. The condition specifies that the left neighbor L and the right neighbor R
should be in the same state, and that (L + 1) mod 3 ≠ S; the two conditions are connected
by an AND. Note that (L + 1) mod 3 = 1, 2 and 0 when L = 0, 1 and 2 respectively. Thus
the state of the top machine will be set to 1, 2 and 0 when L = 0, 1 and 2 respectively,
according to this rule.
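
The case analysis in the last two paragraphs can be checked mechanically; the small
snippet below (mine, not from the lecture) simply enumerates the bottom-machine and
top-machine transitions for the three possible state values.

    # Enumerate the bottom-machine rule: if (S+1) mod 3 == R then S := (S-1) mod 3.
    for S in range(3):
        R = (S + 1) % 3
        print(f"bottom: S={S}, R={R} -> S'={(S - 1) % 3}")

    # Enumerate the top-machine rule: if L == R and (L+1) mod 3 != S then S := (L+1) mod 3.
    for L in range(3):
        print(f"top: L=R={L} -> S'={(L + 1) % 3} (whenever S != {(L + 1) % 3})")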

(Refer Slide Time: 38:46)

507
Now, all other machines behave as follows, as I explained. While finding the next state
of one of the other machines, say machine 1 or machine 2 in the example below, we first
compare its state with that of its left neighbor: for example, when S = 0 and L = 1, we
have (S + 1) mod 3 = 1, which equals L, so S is assigned L. If this rule does not apply,
then the second rule, which compares with the right neighbor, is considered, and so on.

(Refer Slide Time: 39:30)

If the given conditions are not satisfied, then the machine compares its state with its right neighbour, as I explained in the previous example. A sample execution of Dijkstra's three-state algorithm for a ring of 4 processors is shown in the next table. Machine 0 is the bottom machine, and machine 3 is the top machine. The last column of the table gives the machine chosen to make a move. Initially 3 privileges exist in the system, and the number of privileges decreases until only 1 privilege is left.

508
(Refer Slide Time: 39:59)

So initially you see that there are three privileged machines; this is then reduced to two privileges, and finally to one privilege at a time from then onwards. Now, as far as the machines are concerned, let us go back again: machine 0 is the bottom machine and machine 3 is the top machine. These two run different code, and all the other machines follow the third piece of code.

In the last column you can see the machine chosen to make a move; these are the privileged machines. If more than one privilege exists, the central demon selects one machine and allows it to make a move. So here, out of two privileged machines, this one is allowed to move; in the other cases the single privileged machine moves.

509
(Refer Slide Time: 41:08)

The same node as before is allowed to make a move in this diagram. Observations: we can make the following observations. There are no deadlocks in any state, the closure property is satisfied, there is no starvation, and reachability also holds.

(Refer Slide Time: 41:24)

So, all four constraints for a legitimate state are satisfied. So, the system is stabilized.
Few other works in the field of self-stabilization are mentioned here for further reading.

510
(Refer Slide Time: 41:33)

Conclusion: self-stabilization has been used in many areas, and the area of study continues to grow. Algorithms have been developed using central and distributed demons, and uniform and non-uniform algorithms. An algorithm that assumes a central demon can usually be extended easily to support a distributed demon.

So, these algorithms are still useful when applied to the distributed system. In this lecture
we have discussed the concept of self-stabilization, system model, related issues in the
design of self-stabilizing algorithms and systems, and also discussed Dijkstra’s self-
stabilizing token ring system.

Thank you.

511
Distributed Systems
Prof. Rajiv Misra
Department of Computer Science and Engineering
Indian Institute of Technology, Patna

Lecture - 18
Case Studies
Randomized Distributed Algorithm

Randomized Distributed Algorithm.

(Refer Slide Time: 00:18)

Introduction: this lecture focuses on a specific type of distributed algorithm, namely those that employ randomization. Randomization has proved to be a very powerful tool for designing distributed algorithms, as in many other areas. It often simplifies algorithms and, more importantly, allows us to solve problems in situations where they cannot be solved by deterministic algorithms, or to solve them with fewer resources than the best deterministic algorithm requires.

512
(Refer Slide Time: 00:57)

So in this lecture we will see the power of randomization when it is used in distributed algorithms. It lets us get around some impossibility results as well, and we are going to see how they are overcome, whether with the same problem statement or with a modified problem statement.

So if randomization is available as an extra source of information to the distributed algorithm, we are going to see its intricacies in this part of the lecture. This lecture extends the formal model to include randomization, and describes a randomized algorithm for the leader election problem. Leader election will be our case study: we will use the randomized approach and discuss a randomized algorithm for it. Randomization allows us to overcome impossibility results and lower bounds by relaxing the termination condition or the individual liveness properties.

513
(Refer Slide Time: 02:12)

Distributed mutual exclusion algorithm; a case study. Now, for the problem definition, we will take the same problem definition that was shown to be impossible, weaken it, and see how, using randomization, we are going to solve it. A randomized algorithm is an algorithm that has access to some source of random information, such as that provided by flipping a coin or rolling a die. More formally, we extend the transition function of a processor to take as an additional input a random number, drawn from a bounded range under some fixed distribution.

(Refer Slide Time: 02:55)

514
What is important here is that the addition of random information alone typically does not remove an impossibility or improve a worst-case bound. For instance, even if the processors have access to random numbers, they will not be able to elect a leader in an anonymous ring, or solve the consensus problem in fewer than f + 1 rounds in all admissible executions.

(Refer Slide Time: 03:27)

So randomization combined with a weakening of the problem statement becomes a powerful tool to overcome these limitations, which were established through impossibility results and lower bounds. Usually the weakening involves the termination condition. For instance, requiring only that a leader is elected with a certain probability is such a weakening of the problem statement.

So "a leader must be elected with a certain probability" becomes the new problem statement, and we will see how to solve it using randomization. Randomization and weakening of the problem statement together serve the purpose of getting around the impossibilities and lower bounds that have been proven.

515
(Refer Slide Time: 04:43)

Now, this use of randomization differs from the average-case analysis of a deterministic algorithm. In average-case analysis there are several choices as to what is being averaged over; one natural choice is the inputs. There are two difficulties with this averaging. First, determining an accurate probability distribution on the inputs is often not practical. Second, even if such a distribution can be chosen with some degree of confidence, it gives very little guarantee about the behavior of the algorithm on a particular input. For instance, even if the average running time over all inputs is determined to be small, there could still be some input for which the running time is enormous.

516
(Refer Slide Time: 05:35)

In the randomized approach, more stringent guarantees can be made, because the random numbers introduce another dimension of variability even for the same input: there are many different executions for the same input. A good randomized algorithm will guarantee good performance with some probability for each individual input. Typically the performance of a randomized algorithm is defined to be the worst, over all inputs, of the probabilistic performance on each input.

(Refer Slide Time: 06:09)

517
Now, the randomized leader election problem. The simplest use of randomization is to create an initial asymmetry in situations that are inherently symmetric. For example, consider an anonymous ring, where ids are not uniquely assigned, so the processors do not have distinct ids. We have seen the impossibility result that there is no non-uniform anonymous algorithm for leader election in a synchronous ring. That impossibility result tells us that if the nodes do not have distinct ids, this creates a symmetry, and under that symmetry the state machines of all the processes pass through the same states in every execution; hence no leader is elected, even in the non-uniform case, that is, even when the processors know how many nodes are in the ring.

So the simplest use of randomization is to create an initial asymmetry in situations that are inherently symmetric; one such situation is the anonymous ring, where the processors do not have distinct ids and it is impossible to elect a leader. From the earlier theorems we have also seen that there is no non-uniform anonymous leader election algorithm in a synchronous ring. This impossibility also holds for randomized algorithms; however, a randomized algorithm can work with a modified formulation of leader election, in which a leader is elected with some probability, that is, with this weakening of the problem statement.

Now let us see whether this weakening of the problem statement, together with randomization, gets around the impossibility result and gives us a randomized leader election problem that we can solve. We consider a variant of the leader election problem that relaxes the condition that eventually a leader must be elected in every admissible execution. This relaxed version of the leader election problem requires two properties to be satisfied.

The first is called the safety property. The safety property says that in every configuration of every admissible execution, at most one processor is in the elected state. Meaning that in no admissible execution can more than one leader be elected; an algorithm that does so violates the safety

518
condition. The safety condition ensures that in every admissible execution, the algorithm produces at most one leader.

The other condition is called the liveness condition. The liveness condition says that at least one processor is elected with some nonzero probability. Here is where the weakening of the problem statement comes in: the weakening affects only the liveness condition. The modified liveness condition says that a leader is elected with some nonzero probability.

(Refer Slide Time: 10:38)

So the safety property has to hold with certainty, as I explained; that is, the algorithm should never elect two or more leaders. There is no weakening in the safety property; it has to hold with certainty, as we have seen earlier. The second property, the liveness condition, is relaxed: the algorithm need not always terminate with a leader, rather it is required to do so with nonzero probability. An algorithm that satisfies this weakened liveness condition can fail to elect a leader, either by not terminating at all or by terminating without electing a leader.

519
(Refer Slide Time: 11:38)

So we are going to take up this issue and see an algorithm designed with this weakened liveness condition to elect a leader using the randomized approach. We will also see the remedy: if no leader is elected when the algorithm terminates, how are we going to resolve this?

The first algorithm we are going to discuss is the synchronous one-shot algorithm, which is a randomized algorithm. Let us first consider the synchronous ring. On this anonymous ring there is only one admissible execution for a deterministic algorithm; for a randomized algorithm, however, there can be many different executions, depending on the random choices.

As you know, in an anonymous ring a deterministic algorithm has only one admissible execution, in which all the states and state transitions are the same; hence there is no asymmetry and a leader cannot be elected. For randomized algorithms there are many different executions, because they depend on the random choices of the ids, so asymmetry can be introduced with some probability.

The approach used to devise randomized leader election is to use randomization to create asymmetry, by having processors choose random pseudo-identifiers drawn from some range, and then execute a deterministic leader election algorithm. In this problem setting the range of values is just 1 and 2;

520
that means the pseudo-identifiers come from a set of only two numbers, 1 and 2, and then a deterministic leader election is executed. Let us see whether, after choosing a number out of this range, the deterministic leader election can exploit the resulting asymmetry and elect a leader, or whether it fails to elect a leader because of symmetry.

(Refer Slide Time: 14:36)

So, a deterministic leader election with these properties is the following.

Each processor sends a message around the ring to collect all the pseudo-identifiers. Consider the ring: every processor sends its id, which is drawn from the given range of numbers, here only 1 and 2. A process tosses a coin to select one of these two numbers, or randomly picks one of them, and that becomes its id; this id is sent on the edge to its left. The next process connected by an edge receives it from its right side, appends its own id, and the message keeps collecting pseudo-identifiers as it circulates around the ring.

After n rounds, where n is the number of processes, every processor knows all the ids selected by the processors, and the one holding the unique highest id is elected as the leader. So, when the message

521
returns after collecting the n pseudo-identifiers, the processor knows whether its own identifier is the unique maximum or not.

(Refer Slide Time: 16:28)

Now, coming to the algorithm built on that idea: the pseudo-identifier is chosen to be 2 with probability 1/n, and 1 with probability (1 - 1/n). So the range consists of only two numbers, and ids are chosen by the n different processors: the probability of choosing the number 2 is 1/n, and the probability of choosing the number 1 is (1 - 1/n), where n is the number of processes in the ring.

This is the first step of the algorithm; thus each processor makes use of its source of randomness exactly once, and the random numbers are drawn from the range I explained. The set of all possible admissible executions of this algorithm for a fixed ring of size n contains exactly one execution for each element of the set R of vectors of size n whose elements are either 1 or 2. That is, by specifying which random number, 1 or 2, is obtained by each of the n processors in its first step, we have completely determined the execution.

So, given an element R of this set, the corresponding execution is denoted exec(R).

522
(Refer Slide Time: 18:43)

That means a particular admissible execution can be denoted by exec(R) for that particular instance. The algorithm, a leader election algorithm using randomization in an anonymous ring, is given for a particular process pi, and the same code runs on all n processors, each node of the topology, which is organized in the form of a ring.

Initially, spontaneously or upon receiving the first message, the ids are picked; that means a processor picks id 1 with probability (1 - 1/n) and id 2 with probability 1/n. After picking one of these numbers randomly, it sends this number, as the id of the process, to its left on the ring; that is, idi is sent to the left. Upon receiving a message S from the right, the process checks whether the size of S equals n.

That would mean the message has already travelled around the whole ring of n processes and has collected the ids of all n processors. If so, and if idi is the unique maximum of S, then that node is elected; otherwise the node becomes non-elected. If the message has not yet circulated n

523
times, the processor appends its own id to the message, so that it now contains idi together with the ids collected so far, and sends the message to the left, and so on.

In this way the size of the message grows; by the time it comes back to the originating processor it contains all n ids, and the processor then checks whether its own id is not just a maximum but the unique maximum among them. That means if, say, two nodes pick the id 2, there cannot be two leaders elected; in fact no leader is elected at all. The leader is elected only in one case: when exactly one node has the id 2 and all other nodes have the id 1.
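As a rough illustration of the algorithm described above, the following Python sketch simulates one "shot" on an anonymous ring; the message passing is abstracted away, since after n rounds every processor sees the full multiset of pseudo-identifiers, and the ring size and trial count below are arbitrary choices for this experiment.

import random

def one_shot_election(n):
    """One round of the randomized leader-election algorithm on an anonymous ring of n nodes.

    Each node picks pseudo-identifier 2 with probability 1/n and 1 otherwise; after the ids
    circulate around the ring, a node is elected only if it holds the unique maximum.
    Returns the index of the leader, or None if no leader is elected.
    """
    ids = [2 if random.random() < 1.0 / n else 1 for _ in range(n)]
    if ids.count(2) == 1:
        return ids.index(2)          # the unique maximum is elected
    return None                      # zero or several nodes drew 2: no leader this time

# Estimate the success probability; it should stay above 1/e ~ 0.368 for any ring size n.
n, trials = 50, 10000
successes = sum(one_shot_election(n) is not None for _ in range(trials))
print("empirical success probability:", successes / trials)

The empirical success rate stays above 1/e, matching the analysis that follows.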

(Refer Slide Time: 22:15)

Definition: let P be a predicate on executions, for example "at least one leader is elected"; then the probability of P is the probability of the set of random choices R whose executions satisfy that predicate. Analysis: what is the probability that the algorithm terminates with a leader?

524
(Refer Slide Time: 22:33)

This happens when a single processor has the maximum id 2; that is, only when exactly one processor has id 2 and all other processors have id 1 is a leader elected. The probability that exactly one processor draws 2 is the probability that n - 1 processors draw 1 and one processor draws 2, times the number of possible choices of which processor draws 2. One processor draws 2 with probability 1/n, each of the other n - 1 processors draws 1 with probability (1 - 1/n), and the number of possible choices of the processor drawing 2 is nC1 = n.

If you simplify, the factor n cancels the 1/n, so the probability becomes (1 - 1/n)^(n-1). This is the probability that the algorithm terminates with a leader. It is larger than (1 - 1/n)^n, the probability that none of the processors chooses 2, and as the number of nodes grows large it converges to 1/e, which is a constant. So the probability that the algorithm terminates with a leader is at least the constant 1/e, as shown in the next slide.

The same value appears there: the probability that the algorithm terminates with a leader is at least 1/e.
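Written out in LaTeX, the computation sketched above is just a binomial calculation:

\Pr[\text{exactly one processor draws } 2]
  = \binom{n}{1}\cdot\frac{1}{n}\cdot\Bigl(1-\frac{1}{n}\Bigr)^{n-1}
  = \Bigl(1-\frac{1}{n}\Bigr)^{n-1} > \frac{1}{e},
\qquad
\lim_{n\to\infty}\Bigl(1-\frac{1}{n}\Bigr)^{n-1} = \frac{1}{e}.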

525
(Refer Slide Time: 25:09)

Now, it is simple to show that every processor terminates after sending exactly n messages; moreover at most one processor terminates in an elected state. In some executions, for example when two or more processors choose the pseudo-identifier 2, no processor terminates as the leader, because the safety condition allows at most one leader. However, the above analysis shows that this failure happens with probability less than 1 - 1/e.

(Refer Slide Time: 25:46)

526
The message complexity: the theorem for this algorithm says that there is a randomized algorithm that, with probability c > 1/e, elects a leader in a synchronous ring, and the algorithm sends O(n^2) messages, because we have taken the O(n^2) deterministic algorithm and converted it into a randomized algorithm with the relaxed liveness condition.

(Refer Slide Time: 26:14)

Now, we have seen that the algorithm just presented may not terminate with a leader elected: it may end without any leader, because if two or more nodes draw the maximum value 2, no leader is elected. A leader is elected only when exactly one node has the maximum value, and all other nodes, having the other value, end up non-elected. So we now take the next step: we want the algorithm to be iterated over more than one round, until it finally elects a leader.

This gives the synchronous iterated algorithm and its expected complexity. We are going to discuss how the previous algorithm is modified and iterated round after round so that it elects a leader and then terminates, and we will calculate its expectation. It is pleasing to note that the probability of termination in the previous algorithm

527
does not decrease with the ring size; as we increase the ring size, the probability does not go down, which is good news for scalability.

However, we may wish to increase the probability of termination at the expense of more time and messages. The algorithm can be modified so that each processor receiving a message with n pseudo-identifiers checks whether a unique leader exists; that means in the previous algorithm, in lines 5 to 7, after seeing the ids carried by the message, if it is found that the maximum number 2 appears more than once, the current iteration can be abandoned, because it is not going to elect a leader. This is an early detection: in lines 5 to 7 we can insert such a condition, so the algorithm comes out of the current iteration, reassigns new pseudo-ids to the nodes, and restarts; this is the iterated algorithm.

We discuss this iterated algorithm because we want the algorithm to terminate with a leader; if an iteration fails, new pseudo-ids are chosen and the algorithm is run again. We will now show that this approach amplifies the probability of success.
(Refer Slide Time: 29:25)

Let us consider this option in more detail. In order to repeat Algorithm 1, each processor will need to access the random number source multiple times, in fact potentially infinitely often. The pseudo-random id, which is 1

528
or 2, was drawn by each node exactly once in Algorithm 1, but now, when iterating, the same processor may access this source multiple times, because the ids keep changing until, in some admissible execution, a leader is elected. That is why the processors need to access the random number source multiple times, in fact potentially infinitely often.

To completely specify an execution of the algorithm, we need to specify for each processor the sequence of random numbers that it obtains; every processor, drawing repeatedly from this range, obtains such a sequence. Let R now be the set of all n-tuples, each element of which is a possibly infinite sequence over {1, 2}, the range of numbers.

(Refer Slide Time: 31:08)

Now, the analysis of this iterated algorithm. The probability that the algorithm terminates at the end of the k-th iteration equals the probability that the algorithm fails to terminate in the first k - 1 iterations and succeeds in the k-th iteration; that is, terminating in the k-th iteration means it was unsuccessful in the previous k - 1 iterations and successfully terminates with a leader elected in the k-th iteration.

529
The analysis of Algorithm 1 shows that the probability of success in a single iteration is at least 1/e; call this constant c. Because success or failure in each iteration is independent, the probability of terminating at the k-th iteration is (1 - c)^(k-1) · c: here (1 - c) is the probability of not terminating with a leader in one iteration, this factor is repeated for the first k - 1 iterations, and c is the probability of electing a leader at the k-th attempt.

Correspondingly, the probability of still having no leader after k iterations is (1 - c)^k, which tends to 0 as k tends to infinity. Thus the probability that the algorithm eventually terminates with an elected leader is 1. This is good news: after a sequence of iterations, the algorithm elects a leader with probability 1.
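In LaTeX, the iterated analysis is the standard geometric-distribution argument, with c denoting the single-iteration success probability from the previous theorem:

\Pr[\text{terminate at iteration } k] = (1-c)^{k-1}c,
\qquad
\Pr[\text{no leader after } k \text{ iterations}] = (1-c)^{k}\xrightarrow[k\to\infty]{}0,

so \sum_{k\ge 1}(1-c)^{k-1}c = 1, while the expected number of iterations is
E[T] = \sum_{k\ge 1}k\,(1-c)^{k-1}c = \frac{1}{c} < e.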

(Refer Slide Time: 33:26)

The time complexity of this iterated algorithm is as follows. The worst-case number of iterations is infinite, but the expected number of iterations is 1/c < e, where e is about 2.7; you can say that on average fewer than three iterations suffice before some leader is elected, and that is not a bad number. Let T be the random variable that, for a given execution, gives the value of the complexity measure of interest for that run, for instance the number of iterations until termination. We use an expectation because the worst-case number of iterations is infinite, so that value is meaningless to us.

530
So in this iterated randomized algorithm we take the expected value of the total number of iterations, and the message complexity is likewise considered in terms of this average rather than the worst case. Let E[T] be the expected value of T, where T is the random variable that, for a given execution, gives the value of the complexity measure of interest for that run, namely the number of iterations until termination.

(Refer Slide Time: 35:02)

This leads to the theorem that there is a randomized algorithm that elects a leader in a synchronous ring with probability 1 in (1/c)·n < e·n expected rounds; that is, within about e times n expected rounds the algorithm terminates with a leader, and it sends O(n^2) expected messages.

531
(Refer Slide Time: 35:46)

So, the conclusion: randomization has proved to be a very powerful tool for designing distributed algorithms. In this part of the lecture we have taken the case study of the leader election problem in an anonymous ring, and we have seen that by weakening the problem definition, so that a leader is to be elected with a certain probability, randomization is able to solve the leader election problem in an anonymous ring. It allows us to solve problems in situations where they cannot be solved by deterministic algorithms.

This lecture extended the formal model to include randomization and described randomized algorithms for the leader election problem. We presented two randomized algorithms: the synchronous one-shot algorithm with O(n^2) messages, and the synchronous iterated algorithm with O(n^2) expected messages.

Thank you.

532
Distributed Systems
Prof. Rajiv Misra
Department of Computer Science and Engineering
Indian Institute of Technology, Patna

Lecture - 19
Peer-to-Peer Computing and Structured Overlay Network

Peer-to-Peer Computing and Structured Overlay Network.

(Refer Slide Time: 00:22)

Introduction: a peer-to-peer network is an application-level organization of the network, an overlay, for flexible sharing of resources such as files and multimedia documents that are stored across network-wide computers. So a peer-to-peer network provides sharing of resources using an application-level organization called an overlay.

In peer-to-peer networks all the nodes, called peers, are equal; each works as a client as well as a server, and peers communicate directly with each other. The network allows finding the location of an arbitrary object without DNS. In the internet, DNS servers are needed to find the location of a node, that is, its IP address: given a node name, a DNS lookup returns the IP address so that the node can be reached directly.

533
Here we are going to see how a peer-to-peer network finds the location of an arbitrary object without DNS. A peer-to-peer network is essentially a sharing of large combined storage, combined CPU power, and other combined resources, without a scalability cost. Dynamic insertion and deletion of nodes, called churn, as well as of resources, is supported at low cost. Overlay networks refer to networks constructed at the application level on top of another network; that is the overlay, an application-level organization. A peer-to-peer overlay network is an overlay constructed by internet peers in the application layer on top of an IP network.

So one desirable characteristic of peer-to-peer networks is that they are self-organizing, and they provide a large combined storage, CPU power, and other resources.

(Refer Slide Time: 03:04)

A second feature of peer-to-peer is purely distributed control, which enables fast search for machines and objects. The nodes are symmetric as far as roles are concerned: all the nodes work in the same manner, and hence they provide scalability without extra overhead. Other features are anonymity, efficient management of churn, and the naming mechanism, for example selection of geographically close servers. Security, authentication and trust are supported through redundancy in storage and in paths. So these are the desirable

534
characteristics and performance features of peer-to-peer networks and their application-layer overlays.

The core mechanism in a peer-to-peer network is searching for data, and this mechanism depends on how the data and the network are organized: to search for data, we have to store the data in such a way that the lookup, or search, becomes efficient and fast.

(Refer Slide Time: 04:18)

Search algorithms for peer-to-peer networks tend to be data-centric, as opposed to the host-centric algorithms of the internet. Data-centric means the query can be made directly for a particular piece of data, object, or file of interest, and peer-to-peer supports this.

Peer-to-peer search uses the peer-to-peer overlay network, which is nothing but a logical graph among the peers, used by the object search, object storage, and management algorithms. Note that above the peer-to-peer overlay is the application overlay, where communication between peers is point-to-point, representing logical all-to-all connectivity once a connection is established.

535
(Refer Slide Time: 05:27)

The classification of peer-to-peer overlay networks.

(Refer Slide Time: 05:33)

Peer-to-peer overlay networks can be classified into two types: structured and unstructured. Under structured peer-to-peer networks we will see case studies such as the distributed hash table, CAN, Chord, Tapestry, Pastry, and so on; a non-DHT-based structured overlay network is provided by the system called Mercury.

536
Unstructured peer-to-peer overlay networks are of two types, deterministic and non-deterministic. Examples with deterministic search are Napster, BitTorrent and JXTA; examples with non-deterministic search are Gnutella and KaZaA.

So peer-to-peer overlay networks are classified into structured, in the sense that they maintain a fixed topology such as hypercubes, meshes, butterfly networks, or de Bruijn graphs, and unstructured, meaning there is no fixed structure: no particular graph structure is used or assumed in an unstructured overlay network.

(Refer Slide Time: 06:22)

Structured overlays use rigid organizational principles, based on the properties of the peer-to-peer overlay graph structure, for the object storage and object search algorithms. In a structured overlay you know deterministically how the nodes are organized and on which node a given data item is stored.

Unstructured overlays use very loose guidelines for object storage; as there is no definite structure to the overlay graph, the search mechanisms are more ad hoc and typically use some form of flooding or random-walk strategy. These unstructured overlays in turn evolve some random structure, which we will see in more detail when unstructured overlays are discussed in further lectures. Thus object storage and

537
search strategies are intricately linked to the overlay structure as well as to the data
organization mechanism.

So let us see the differences between structured and unstructured overlays. In structured overlays the placement of a file is highly deterministic, and file insertions and deletions have some overhead because of that structure. There is fast lookup via a hash mapping based on a single characteristic, the file name; range queries and keyword queries are difficult to support in structured overlays, which is why unstructured overlays exist. Examples of structured overlays are Chord, the Content Addressable Network (CAN), and Pastry.

(Refer Slide Time: 08:24)

Unstructured overlays do not impose any structure on the overlay, and data or file placement does not require any structure in this kind of peer-to-peer system. Node joins and departures are easy, with the local overlay simply adjusted; only local indexing is used; file search entails high message overheads and high delays; but complex keyword and range queries are supported here, unlike in structured overlays.

Data indexing: data is identified by indexing, which allows physical data independence from the application. This is one of the most important tasks in a peer-to-peer network: how the data is organized and accessed through indexes that make the physical data independent of the different applications.

538
(Refer Slide Time: 09:31)

We are going to touch upon 3 types of indexing. Centralized indexing: versions of this can be seen in Napster and DNS. Distributed indexing: the indexes to the data are scattered across the peers, and data is accessed through mechanisms such as distributed hash tables; these differ in hash mapping, search algorithms, diameter for lookup, fault tolerance, and churn resilience.

Local indexing: each peer indexes only its local objects, and remote objects need to be searched for; a typical DHT uses a flat key structure. Local indexing is commonly used in unstructured overlays like Gnutella, along with flooding search or random-walk search. These two are going to be discussed in more detail.

Another classification is semantic indexing: human-readable file names, keywords, and database keys are supported. It supports keyword searches, range queries, and approximate searches; as you know, these kinds of features are supported in unstructured overlays. Semantic-free indexing is not human readable and corresponds to indexes obtained by the use of hash functions; semantic-free indexing is used in structured overlays.

Now we are going to cover structured overlays, and in particular distributed hash tables. In the distributed hash table scheme, the node address space and the object space are mapped onto a common key space; in a simple DHT, the highly deterministic placement of files or data allows fast lookup, but file

539
insertion and deletion under churn incur some cost, and attribute search, range search, keyword search, etcetera are not possible.

(Refer Slide Time: 11:54)

Let us understand this illustrative figure, which explains how the distributed hash table organizes its common key, or identifier, space. Every node is given an id, and the set of all ids forms an address space for the nodes, also called the id-space. This id-space is mapped onto a flat structure of identifiers called the common key space.

The mapping is done through a consistent hashing function f, which takes the id of a node and maps it onto a common key in this common space. Similarly, objects or files are also mapped to ids using the same kind of consistent hashing function: given an object or file by name, the function maps it to an id in the common key space.

So the common key space is the most important part, and equally important is the mapping, the consistent hashing, which supports the lookup operation. Lookup means that if you look up a particular file by name, it generates a particular key, and this key is mapped further to the node where that file is stored.

540
So we will be looking at how the mapping from the node address space and the mapping from the object address space are supported. Normally this is done through a hash table, but in a distributed environment this hash table is itself distributed, which is why it is called a distributed hash table. As you might have seen, hashing is supported by a hash function, and consistent hashing is what supports this hash table over the distributed network; that is why it is called a distributed hash table.
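As a small illustration of the common key space, the sketch below hashes both a node address and a file name into the same m-bit id space; the use of SHA-1, the value m = 7, and the example strings are assumptions made only for this sketch, not something fixed by the lecture.

import hashlib

M = 7                                   # illustrative: a 7-bit common key space (0..127)

def to_key(name: str) -> int:
    """Hash a node address or an object name into the common m-bit id space."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** M)

print(to_key("10.0.0.1:4000"))          # key for a node (e.g. derived from its IP:port)
print(to_key("lecture-notes.pdf"))      # key for a data object / file name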

Chord distributed hash table: Chord is an example that uses a distributed hash table; it is a protocol that supports the distributed hash table.

(Refer Slide Time: 15:26)

Overview of the Chord protocol: the Chord protocol, proposed by Stoica et al. in 2003, uses a flat key space to associate a mapping between the network nodes and the data objects or files. There is a flat key space; on one side the nodes are mapped to these keys, and similarly the files and objects are also mapped. That is what is shown.

So the Chord protocol uses a flat key space to associate this mapping with the network nodes; that means a network node, identified say by its IP address or by a SHA-1 hash of its IP address, is mapped into a flat key id, say an m-bit id.

541
The hashing function performs this mapping, and the data objects and files are also mapped to the same flat key space. This is the most important part of the Chord protocol.

The node addresses as well as the object values are mapped to logical identifiers in a common flat key space using the consistent hashing function I have told you about. The property of a consistent hashing function is that these identifiers are distributed uniformly across all the nodes, so when a node joins or leaves a network of N nodes, only about a (1/N) fraction of the keys has to be moved from one location to another, because the keys are distributed uniformly. Using this property, node joining and leaving becomes easy, with very little overhead of moving keys from one location to another.

There are two steps involved in the Chord protocol. The first is to map the object values to keys: the mapping is done using a function that takes an object value, maybe the name of a file or a variable, and gives a particular key belonging to the flat key space; an m-bit value is returned, and these keys are uniformly distributed over the nodes.

The other step is to map the keys to the nodes: using again a function f, each node id is also turned into an m-bit key in the same space, and keys are assigned to nodes via a lookup. The design of this lookup operation becomes very important in the Chord protocol.

Now, the common address space uses m-bit identifiers, as I have been telling you; an m-bit identifier gives 2^m different addresses, and this space is arranged in a logical ring structure, that is, it supports mod 2^m arithmetic. A ring structure forms if you work mod 2^m. You obtain the same kind of ring structure on a wall clock, where the numbers are arranged 0, 1, and so on, and after 11 it becomes 0 again; mod 12 gives the clock structure, and similarly mod 2^m gives numbers that can be organized in the form of a logical ring, which is why it is called a logical ring structure. Now a key K gets assigned to the first node such that the node id equals or is greater than the key identifier of K in the common address space.

542
This node is the successor of K, denoted successor(K). Let us take an example: suppose the nodes present have ids 2, 7, 10 and 20, and the keys to be stored are 18, 5, 6, 9 and 15. As you have seen, each key is to be stored on the node that is the successor of that key value.

The successor rule says that a key is assigned to the first node whose id is equal to or greater than the key id. For example, key 5 is stored on node 7, because 7 is greater than 5; similarly key 6 is also stored on 7, key 9 is stored on 10, and keys 15 and 18 are stored on 20.

So storage of keys is performed using successor(K), and both the nodes and the keys use the same common flat key space.

(Refer Slide Time: 22:26)

The idea is illustrated here with a Chord ring of m = 7 bits, which means keys lie in the range 0 to 127, and these values are organized in a ring structure, depicted by the logical ring. You can see the nodes, say N5, N18, N23, N28 and so on; each is given an id, the ids are drawn from the set of values 0 to 127, and the nodes are arranged in the form of a ring.

543
Now six keys are given which are to be stored on these nodes: K8, K15, K28, K53, K87 and K121. They are stored among the nodes using the successor operation as follows. What is the successor of key 8? The successor of key 8 is 18, so K8 is stored on N18. Can it be stored on N5? No, because a key gets assigned to the first node whose identifier equals or is greater than the key identifier under the successor function.

So successor(K8) = 18 and K8 is stored there; successor(15) is again 18, so two keys are stored on this same node. Then successor(28) is 28, since there is a node with id 28, so K28 is stored on that node. Successor(53) is node 63, because it cannot be stored on 28, 28 being less than 53.

Similarly, the successor of 87 is 99, so K87 is stored on N99. K121 is stored on N5, because the mod 2^7 wrap-around applies, and so key 121 is stored on node number 5.
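A quick way to check these successor assignments is the following sketch; the node set is taken only from the node ids actually named in this example, which is an assumption, since the slide may show further nodes.

from bisect import bisect_left

M = 7
nodes = sorted([5, 18, 23, 28, 63, 73, 99, 104])    # node ids named in the example

def successor(key):
    """First node whose id is >= key, wrapping around the 2^M ring."""
    i = bisect_left(nodes, key % (2 ** M))
    return nodes[i] if i < len(nodes) else nodes[0]

for k in [8, 15, 28, 53, 87, 121]:
    print("K%-3d -> N%d" % (k, successor(k)))
# prints 8->18, 15->18, 28->28, 53->63, 87->99, 121->5 (wrap-around), matching the text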

(Refer Slide Time: 25:24)

Now, having stored the keys on the nodes, the next requirement is to look up the values stored on them. When an application wants to look up a key or an object by name, the hash and lookup functions locate the node which stores that key, and this is done in the following
544
manner. Let us understand the aspect called simple lookup in Chord. Each node tracks its successor on the ring, so a query for key x is forwarded around the ring until it reaches the first node whose identifier is greater than or equal to the key x mod 2^m, as I explained in the previous example.

This gives a simple algorithm. Say you want to locate the successor for a particular key: you check whether the key lies between the id of node i and the id of its successor; if it does, the successor is returned, because the successor is known to node i and is the node on which that key is stored. Otherwise, if the key does not lie between i and its successor, node i asks its successor to locate the successor for that key.

This operation is called again and again, performing the lookup recursively. In the example you can see the routing path that is formed to reach the node where key x is stored. The result, the node holding key K, is returned to the querying node along the reverse of the path followed by the query. This mechanism requires O(1) local space, because every node stores only its successor, but it requires O(N) hops, because the query has to traverse successor by successor until it reaches the node that contains the value.

Let us take the example of looking up the key value 8 at node 28. Here i is 28 and its successor is 63, and the key K8 does not lie between i and its successor; it lies beyond the successor. So the query moves to that successor and performs the same test: now i is 63 and the next successor is 73, and K8 again lies beyond the successor, so the query moves on again, and it continues in this manner until it reaches N5.

When it reaches N5, i is 5 and the next successor is 18, and the key being searched is 8, which is greater than 5 and at most 18. So the successor of this node stores the key; here the routing path ends, and the node holding key 8 is returned as the result of the lookup.
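A minimal Python sketch of this simple, successor-only lookup follows, again over the illustrative node set used above; the `between` helper models the half-open arc (a, b] on the identifier circle, and the ring itself is an assumption made only for this example.

M = 7
ring = sorted([5, 18, 23, 28, 63, 73, 99, 104])      # illustrative node ids

def succ(node_id):
    """The next node clockwise on the ring."""
    i = ring.index(node_id)
    return ring[(i + 1) % len(ring)]

def between(key, a, b):
    """True if key lies in the half-open arc (a, b] on the mod-2^M ring."""
    key, a, b = key % 2**M, a % 2**M, b % 2**M
    return (a < key <= b) if a < b else (key > a or key <= b)

def simple_lookup(start, key):
    """Forward the query along successor pointers; O(1) state per node, O(N) hops."""
    node, hops = start, 0
    while not between(key, node, succ(node)):
        node, hops = succ(node), hops + 1
    return succ(node), hops                 # the successor stores the key

print(simple_lookup(28, 8))                 # lookup of K8 started at N28 ends at N18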

545
Now there is a possibility of improvement: the storage was only O(1), because only the successor information was stored, but the number of hops required was O(N). We can reduce the number of hops by increasing the storage size, and hence we are going to see an improved version called scalable lookup.

(Refer Slide Time: 30:35)

Scalable lookup increases the storage from O(1) to O(m), where m is O(log n); in return, the routing path, that is, the number of hops needed, comes down to O(log n). So the search cost is reduced from O(n) to O(log n), while the storage is increased from O(1) to O(m); that is the trade-off of scalable lookup, and we are going to see how it works.

Node i maintains a routing table, called a finger table, of m = O(log n) entries, such that the x-th entry of the finger table is the node id of the node successor(i + 2^(x-1)), denoted i.finger[x] = successor(i + 2^(x-1)). This is the first node whose key is greater than the key of node i by at least 2^(x-1), mod 2^m. The complexity here is O(log n) message hops, at a cost of O(log n) space in the routing table, as I have already explained. Due to the logarithmic structure of the finger table, there is more information about nodes that are closer than about nodes that are further away.

546
That means m values are stored in the finger table of a node (the finger table is also called a routing table), and there are more entries for nodes that are close by, while nodes that are far away get very few entries; this is the logarithmic structure of the table organization.

If the target is located among the close successors, it can be identified in the table itself; if it is further away, the table provides a link to another node whose own table is closer to the target, so the target is identified in a very small number of hops, and this structure is called logarithmic. Now consider a query for a key at a node i: if the key lies between i and its successor, the key resides at the successor and its address is returned, as in the simple lookup.

If the key lies beyond the successor, then node i searches through the m entries of its finger table; here it differs from the simple lookup, because it has more possibilities, having m different successors stored in the finger table.

(Refer Slide Time: 34:15)

These m entries are searched to identify the node j that most immediately precedes the key k among all the entries in the finger table. Since j is the closest known node that precedes key k, j is the most likely to have the most information for locating the key, that is, for locating the immediate successor node to

547
which the key has been mapped. This scalable lookup procedure can be seen in the algorithm.

(Refer Slide Time: 35:02)

The algorithm says that if the key lies between i and its successor, the successor is returned; this is the same as the simple lookup. Otherwise it tries to find the closest preceding node, and the closest-preceding-node procedure works by searching through the m entries of the node's finger table.

If the target is very far away, none of the nearer entries covers the key, that is, the key lies beyond all of them, so the last (farthest) finger entry is returned, because the key lies after this last entry. That node is returned, and the lookup continues from this new node. Let us take an example of scalable lookup: a query for K8 at node 28.

548
(Refer Slide Time: 36:29)

So at node 28 there is a lookup for key 8; let us see how it works in scalable lookup. The number of hops is small compared to the simple lookup, where many hops were needed to locate the key K8. Here i is 28 and its successor is 63, and the key being searched is 8, which does not lie between i and the successor.

Hence we look through the finger entries: up to the nearer entries the key still does not lie in range, so we take the last entry, which is 99; between 28 and 99 the key 8 also does not lie, it lies beyond, so node 99 is contacted next. Now the finger table of 99 is examined: between i = 99 and its first successor 104 the key does not lie, so the search comes to the entry N5.

Among the entries of 99, N5 is the closest node preceding key 8, and N5 knows where key 8 is stored. So the search goes to N5, and from N5 you can see that N18 is the node which stores K8. So in about 3 hops it reaches K8. This works because the lookup table is stored in a logarithmic manner: when the target is far away, the last finger entry carries the query to the far node N99, and that node in turn figures out N5, whose finger table knows where the key is.
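The scalable lookup can be sketched as follows; the finger tables are built directly from the definition finger[x] = successor(i + 2^(x-1)), and the fall-back to the immediate successor when no closer finger is found is a simplification for this sketch, as is the small example node set.

from bisect import bisect_left

M = 7
ring = sorted([5, 18, 23, 28, 63, 73, 99, 104])

def successor(key):
    i = bisect_left(ring, key % 2**M)
    return ring[i] if i < len(ring) else ring[0]

def between(key, a, b):
    """True if key lies in the half-open arc (a, b] on the mod-2^M ring."""
    key, a, b = key % 2**M, a % 2**M, b % 2**M
    return (a < key <= b) if a < b else (key > a or key <= b)

# finger[i][x-1] = successor(i + 2^(x-1)) for x = 1..M; finger[i][0] is i's immediate successor
finger = {i: [successor(i + 2**(x - 1)) for x in range(1, M + 1)] for i in ring}

def closest_preceding_node(i, key):
    """Scan the finger table from the farthest entry back, looking for a node in (i, key)."""
    for f in reversed(finger[i]):
        if between(f, i, key) and f != key:
            return f
    return i

def scalable_lookup(i, key, hops=0):
    if between(key, i, finger[i][0]):        # key stored at i's immediate successor
        return finger[i][0], hops
    j = closest_preceding_node(i, key)
    if j == i:                               # no closer node known: hand over to the successor
        j = finger[i][0]
    return scalable_lookup(j, key, hops + 1)

print(scalable_lookup(28, 8))                # follows N28 -> N99 -> N5, answer N18

On this data the query follows the same path as in the example above, reaching the answer in a logarithmic number of hops rather than walking the whole ring.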

549
(Refer Slide Time: 39:02)

Managing churn: node joins and node leaves together are called churn. The code for managing dynamic node joins, departures and failures is given in the next algorithm, and that is called managing churn in the Chord protocol. Node joins: to create a new ring, a node i executes create_new_ring, which creates a ring with a single node; to join a ring that contains some node j, node i invokes join_ring.

Joining a ring is essentially like inserting a new node into a linked list; that kind of scenario arises here. Node i, the new joiner, first locates its successor on the logical ring; let us say the successor is j, whose current predecessor is some node k. Then i informs its successor: the predecessor pointer of j, which earlier pointed to k, has to be changed to i, and the successor pointer of k, which earlier pointed to j, has to be changed to i; in that way i is inserted. But that is not all.

Once i gets into the ring, it has to build its own finger table, and the finger tables of the other nodes need to be updated; all of that is achieved by the procedures stabilize, fix_fingers and check_predecessor, which are periodically invoked by each node. Everything I have explained on the previous slide is shown in this figure: when node i executes join_ring, it tries to enter the ring at a position determined by

550
its id, and i identifies its successor, say j. The predecessor pointer of j, which was earlier k, is changed to i, as you can see.

(Refer Slide Time: 41:22)

Similarly, the successor pointer of k, which earlier pointed to j, has to be changed to i, and
i's own predecessor has to be set to k. So, this node insertion is quite simple. Once all the
successor variables and finger tables have been stabilized, a call by any node to locate
successor will reflect the new joiner i; until then a call to locate successor may result in
the locate successor call performing a conservative scan.
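A rough sketch of the join and periodic stabilization logic described above, reusing the in_interval and find_successor helpers from the previous sketch (simplified pointer updates only, no failure handling; the names are illustrative, not the exact protocol code):

def join_ring(i, j):
    """Node i joins the ring known to contain node j."""
    i.predecessor = None
    i.successor = find_successor(j, i.id)    # locate i's successor via any existing node j

def stabilize(i):
    """Run periodically: learn about nodes that have joined between i and i.successor."""
    x = i.successor.predecessor
    if x is not None and in_interval(x.id, i.id, i.successor.id) and x.id != i.successor.id:
        i.successor = x                      # a newly joined node now sits between us
    notify(i.successor, i)                   # tell the successor that i may be its predecessor

def notify(j, i):
    """Node j learns that node i thinks it is j's predecessor."""
    if j.predecessor is None or in_interval(i.id, j.predecessor.id, j.id):
        j.predecessor = i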

551
(Refer Slide Time: 42:10)

The loop in closest preceding node that scans the finger table will result in a search
traversal using smaller hops rather than truly logarithmic hops, resulting in some
inefficiency.

So, the node i will still be located, but via more hops. Until the finger table is built, the
lookup behaves like the simple lookup; hence it is not a logarithmic number of hops but
essentially a linear number of hops, that is, of the order n. After some time, once the node
stabilizes and populates its finger table, it gets the logarithmic benefit.

Now, managing churn for node failures and departures: when a node fails, the check
predecessor procedure, which runs periodically, identifies that the node is not working,
and the corresponding entries have to be modified. Node i gets a chance to update its
predecessor field when another node k causes i to execute notify(k); but that can happen
only if k's successor variable is i. This requires the predecessor of the failed node to
recognize that its successor has failed and to get a new functioning successor. So, when
a node fails, check predecessor and notify together ensure that another node takes its
place; that is, the successor and predecessor values are adjusted, and a new predecessor
and a new successor come into effect.

Note from Algorithm 3 that knowing that the successor is functional, and that the nodes
pointed to by the finger pointers are functional, is essential; this is all handled in the code.
Now we turn to the complexity: for a Chord network with n nodes,

552
each node is responsible for at most (1 + ε)K/n keys with high probability, where K is the
total number of keys. Using consistent hashing, ε can be shown to be bounded by O(log n).
The search for a successor (locate successor) in Chord with n nodes requires O(log n)
time with high probability.

(Refer Slide Time: 44:40)

The size of the finger table is O(log n), bounded by m, and the average lookup time is
(1/2) log n hops.
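As a back-of-the-envelope illustration of these bounds (the numbers below are my own illustrative choices, not from the lecture):

import math

n = 1000          # nodes in the Chord ring (illustrative)
K = 1_000_000     # total number of keys (illustrative)

keys_per_node = K / n                 # roughly K/n keys per node, within a (1 + eps) factor
expected_hops = 0.5 * math.log2(n)    # average lookup length ~ (1/2) log2 n, about 5 hops
worst_case    = math.log2(n)          # O(log n) hops with high probability, about 10 hops

print(keys_per_node, expected_hops, worst_case)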

(Refer Slide Time: 44:55)

553
So, a comparison of structured peer-to-peer networks: we have discussed the Chord
protocol; similarly CAN, GISP, Kademlia, Pastry, Tapestry and Viceroy are other protocols
which use different structures to design their overlays, and that is why different routing
path lengths are obtained.

Comparing structured peer-to-peer overlays, we see that routing is essentially a greedy
algorithm, and node joins and node leaves are all supported.

(Refer Slide Time: 45:41)

So, to conclude: peer-to-peer networks allow equal participation and resource sharing
among the users. This lecture first gave the basic fundamentals and underlying principles
of peer-to-peer networks, that is, structured and unstructured peer-to-peer networks. Then
we discussed the concept of the distributed hash table using consistent hashing, and a
classical DHT-based structured peer-to-peer network in the form of a logical ring, namely
the Chord protocol.

Thank you.

554
Distributed Systems
Prof. Rajiv Misra
Department of Computer Science and Engineering
Indian Institute of Technology, Patna

Lecture - 20
Google File System (GFS)

(Refer Slide Time: 00:15)

Google File System; GFS.

(Refer Slide Time: 00:17)

555
Introduction to a file system; a file system determines how the data is stored and
retrieved. Distributed file systems manage the storage across the network of machines;
added complexity is due to the network.

GFS and HDFS are distributed file systems. Before we start the topic of this lecture, let
me introduce the motivation and background for the Google File System. Google stores
its data over a large number of computers, each having its own storage; it exploits the
storage of this large number of machines, also called commodity hardware. These
computers, connected through a network, are used to store the information that Google
requires for its applications.

Having stored this information on a large number of computers, perhaps thousands, a
large number of clients need to access this data, stored in a distributed environment,
simultaneously and without any problem. For this Google has devised a file system called
the Google File System; after that Hadoop built HDFS, the same kind of file system with
different terminology.

So, let us understand the Google File System. Distributed file systems existed earlier also,
but modern-day applications involve large data sets, large amounts of information that
cannot be stored at one place and must therefore be stored in a distributed system. Once
stored, many clients need to access it simultaneously. For that we are going to study the
Google File System, a file system that not only stores files over a large number of
computers but also supports access to that stored information by a large number of clients
in a distributed system.

556
(Refer Slide Time: 03:47)

The introduction goes like this: Google processes its data in computing clusters. Google
builds its own servers using commodity-class CPUs running customized versions of
Linux. A cluster comprises several such servers networked with each other, each running
the Linux operating system.

Google makes use of this scenario to implement the Google File System. The design
seeks to maximize performance (how this is specifically measured is not public
information), and issues such as switching, power supply and so on are generic
infrastructure problems that are addressed by using commodity hardware instead of
machines like supercomputers; that is the philosophy of Google. So, GFS utilizes
commodity hardware and does not require a very big infrastructure to support it.

So, let us go for the introduction of a GFS.

557
(Refer Slide Time: 05:11)

GFS is a scalable distributed file system for large data-intensive applications. It shares
many of the same goals as previous file systems, such as performance, scalability,
reliability and availability; these were present in earlier file systems too, but in addition
GFS is designed to support large data-intensive applications.

The design of GFS is driven by four key observations: first, with commodity hardware,
failures are the norm rather than the exception; second, files are huge, and it is these huge
files that modern applications need supported; third, most file mutations are appends
rather than in-place overwrites; and fourth, there are benefits to co-designing the
applications and the file system API. These are the main points supported in the Google
File System.

558
(Refer Slide Time: 06:26)

Google File System assumptions: hardware failures are so common that they have become
the norm. Why? Because inexpensive commodity machines are prone to failures and do
fail. Similarly, as far as files are concerned, there are modest numbers of huge files, and
files of this bigger size are very common in this kind of environment; hence they are to
be supported optimally, whereas small files are not that common, so GFS does not focus
on optimizing for small files.

The workloads are mostly of two kinds: reads and writes. Writes mean that a big data set
is generated once; after that, most operations are read-only. Reads are of two kinds: large
streaming reads of 1 MB or more, and small random reads; small random reads are not
the focus, so it is mainly large streaming or sequential reads that are optimized here.

Sequential appends to a file by hundreds of data producers are an important issue as far
as handling workloads is concerned. High sustained throughput is more important than
latency, so the response time of an individual read or write operation is not very critical
in this scenario.

559
(Refer Slide Time: 08:16)

Google File System design overview. The following components participate in realizing
the file system. The first is the master: there is a single master, which handles the
centralized management. The second entity is the file, which is stored as chunks; chunks
are another important entity in the Google File System.

A chunk is simply a fixed-size unit of 64 MB, and a file is stored in the form of chunks.
If a file is bigger than 64 MB, it contains more than one chunk, and the entire data of the
file is stored across them. Reliability is achieved through replication: the chunks are stored
on commodity hardware which is prone to failures, and the only way to tolerate these
failures is replication. By default the replication factor is 3; that means a particular chunk
is replicated on three different chunk servers.

The next aspect is data caching: due to the large size of the data sets, data is not cached,
and is not required to be cached, at the client or at the chunk servers; however, the chunk
servers run a local Linux file system, so the Linux buffer cache is good enough, and no
extra cache management is provided in the Google File System. As far as the interface is
concerned, it is tailored to Google applications: it supports create, delete, open, close, read,

560
write, snapshot and record append. These are the different interface operations that we
will see in this part of the discussion.

Now, the Google File System architecture. Let us go through this architecture: there is a
single master, which is shown over here.

(Refer Slide Time: 10:52)

There is a single master and there are several clients, and the data is stored on several
servers called chunk servers. Note that the data does not flow between client and master;
rather, the data flows from the client directly to the chunk servers which store it. So, the
data does not flow through the GFS master.

561
(Refer Slide Time: 11:32)

Now, let us go and see what the master does in Google file system.

(Refer Slide Time: 11:55)

The role of the master: the master is basically a metadata server; that is, it maintains all
file system metadata. Information about the data, where it is stored, and information about
the file system in general is called metadata, and it is maintained at the master.

Basically, the master handles the file namespace, handles the file-to-chunk mapping, and
handles chunk location information; it keeps all the

562
metadata in the master's main memory. Let me explain through an example. The file
namespace: once the client gives a file name, the file name together with the directory
forms the file namespace, which is maintained by the GFS master. The second piece of
information maintained by the master is about the chunk servers.

So, two important pieces of information are maintained by the master: the first is the
namespace, and the second is about the chunk servers. This metadata is stored at the
master. The file-to-chunk mapping is handled by the master using these two pieces of
information, and chunk location information is provided as well: the file is stored in
chunks, and the information about where those chunks live is obtained from the chunk
servers, so chunk placement information is also available at the master. It keeps all this
metadata in the master's main memory.

Now, the locations of the chunk replicas: the master does not keep a persistent record of
them; an operation log is maintained in memory and on disk for persistence, and the chunk
locations are pulled from the chunk servers. That is, the master does not maintain
persistent information about the chunks or the locations of the chunks. The locations of
the chunks are maintained by the chunk servers themselves; however, the master
periodically polls the chunk servers and fetches this information whenever required,
rather than storing it persistently at the master level.

So, the master does not keep persistent information about the chunks and the replicas.
For persistence and recovery, an operation log is maintained: all the metadata stored in
the master is logged, so that whenever there is a failure, the master can use this log to
restore or recover its state.

563
(Refer Slide Time: 15:13)

So, the master, as the metadata server, also monitors. As I told you, the master does not
keep the chunk information itself; it uses heartbeat messages to detect the state of the
chunk servers and communicates with them periodically. Whenever the master requires
information about the chunks held by the chunk servers, this information is piggybacked
on the heartbeats.

That is why the master communicates with the chunk servers periodically: not only to
learn the health of the chunk servers, but also to fetch the information about the chunks
on them. The master is a centralized controller: it also monitors system-wide activities
and manages the chunk replicas in the whole system.

A single master simplifies the design of the system. It controls chunk placement using
global knowledge, because the master has complete knowledge of the chunk servers, so
chunk placement can be decided globally.

564
(Refer Slide Time: 16:43)

The potential bottleneck is resolved in the sense that clients do not read or write data
through the master; that is, clients need not route data operations through the master. So,
the master is kept free, and the clients access the data directly from the chunk servers.

The clients cache only the metadata. Clients typically ask for multiple chunks in the same
request, and the master can also include information for the chunks immediately following
those requested. These provisions free the master from being consulted for every read
and write by the client: the client is not required to contact the master for each single data
operation; data-related operations are handled directly between the client and the chunk
servers, and the master is not involved in them.

565
(Refer Slide Time: 18:22)

Caching metadata: metadata is cached on the client after being fetched from the master,
and it is kept only for a specific period of time to prevent the use of stale data. So, caching
is done only for metadata, not for the actual data. File caching: clients never cache the file
data, as I explained earlier, and chunk servers never cache the file data either, because the
Linux buffer cache already caches the local files; that is sufficient in the Google File
System.

File working sets are too large to cache, and big files that are streamed do not need to be
cached anyway. Obviously, with no data caching, all the problems and issues of
maintaining a cache, such as cache coherence, are avoided, which simplifies the system.

566
(Refer Slide Time: 19:06)

Now, GFS chunks.

(Refer Slide Time: 19:20)

Chunks are the most important part of the design. The chunk size is 64 MB, and a chunk
is stored as a plain Linux file on a chunk server. The advantage of keeping a large chunk
size is that it reduces client-master interaction, as I told you in the previous slide; one
request per chunk suits the target workload, and a client can cache all the chunk locations
for a multi-terabyte working set.

567
Only the metadata for the chunks is cached; obviously, a large chunk size also reduces the
size of the metadata that has to be kept on the master. The disadvantage of keeping a large
chunk size is that chunk servers can become hotspots for some popular files. This problem
could be handled, as a point of future research, by letting clients who have already fetched
the data serve it to other clients, so that the hotspots are removed; but this is future scope
for improvement.

This hotspot problem is very practical and requires some innovative solutions to
overcome.

(Refer Slide Time: 21:08)

Chunk locations: the master does not keep a persistent record of the chunk replica
locations. As mentioned in the previous slide, the master polls the chunkservers about
their chunks at startup and keeps itself up to date through periodic heartbeat messages.
Master and chunkservers are easily kept in sync when chunk servers leave, join, fail or
restart.

Since the master does not keep persistent information about the chunk replica locations,
the master and the chunk servers can be kept in sync at all times easily, with the help of
the heartbeat messages.

568
Since chunk locations are not pinned at the master, chunk servers can leave, join, fail and
restart at any time and are handled without involving the master's persistent state. The
chunk servers have the final word over what chunks they hold.

Operation log: a persistent record of critical metadata changes is stored in the log. It is
critical to the recovery of the data; changes to the metadata are made visible to clients only
after they have been written to the operation log, and the operation log is replicated on
multiple remote machines. Before responding to a client operation, the log must have
been flushed locally and remotely.

(Refer Slide Time: 22:42)

The master recovers its file system state from the checkpoint and the operation log. We
have already covered checkpointing and recovery, and that concept is used here so that
the master becomes fault tolerant. If the master becomes inactive or faulty, it is recovered
with the help of a checkpoint maintained in stable storage, or with the help of mirroring
of the master node.

569
(Refer Slide Time: 23:30)

Now, the consistency model: atomicity and correctness of the file namespace are ensured
by namespace locking. After a successful data mutation (write or record append), changes
are applied to the chunk in the same order on all the replicas. In case of a chunk server
failure at the time of mutation, the stale replica is garbage collected at the soonest
opportunity. Regular handshakes between master and chunk servers help in identifying
failed chunk servers, and data corruption is detected by checksumming. This simpler
scheme avoids a more complicated consistency model and is good enough to solve the
consistency issues. Next, system interactions: leases and mutation order.

(Refer Slide Time: 24:05)

570
The master grants a chunk lease to one of the replicas, and this replica is called the primary
replica. All the replicas follow the serial order picked by the primary. A lease times out
after 60 seconds, and leases are revocable.

(Refer Slide Time: 24:42)

Let us see the interactions. In the first step, the client asks the master which chunk server
holds the current lease for the chunk, and the locations of the other replicas; this is the
first interaction. It happens when a client wants to access a particular file, given the file
name and a byte offset within the file. Knowing the fixed chunk size, the client converts
this byte offset into a chunk index; having calculated the chunk index, the client sends the
file name and the chunk index to the master.
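As a rough illustration of this offset-to-index translation (not the actual GFS client code; the constant and function names are hypothetical):

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the GFS chunk size

def chunk_index(byte_offset):
    """Convert a byte offset within a file into the index of the chunk that holds it."""
    return byte_offset // CHUNK_SIZE

# Example: an application asks for data at offset 200 MB of "/logs/web.log".
# chunk_index(200 * 1024 * 1024) == 3, so the client sends ("/logs/web.log", 3)
# to the master, which replies with the chunk handle and replica locations.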

571
(Refer Slide Time: 25:20)

So, the client asks the master which chunk server holds the current lease for the chunk
and the locations of the other replicas. The master responds with the identity of the
primary and the locations of the secondary replicas. The client then uses this information
and pushes the data (the red line in the figure) directly, in a linear fashion, to the primary
and secondary replica servers.

This flow of data uses the network bandwidth efficiently and does not follow the control
topology; it follows a linear chain in which the data is pushed to all the replicas, because
the client now knows the primary as well as the secondary replicas, and the data flows
along this chain.

The fourth step says that once all the replicas have acknowledged receipt of the data, the
client sends a write request to the primary. So, the data has already been pushed to all the
replica servers, primary and secondary, and now the client issues the write command to
the primary replica.

The primary, which holds the lease, is responsible for ordering the mutations across all
the replicas. The primary assigns a consecutive serial number to all the mutations it
receives, thereby providing serialization, and applies the mutations in that serial order.
So, the mutations to the chunk are assigned serial numbers by the primary replica, and

572
the secondary replicas use those serial numbers to apply all the mutations in the same
order. The primary forwards the write request to all the secondary replicas, and they apply
the mutations in the same serial order, as I have told you.

Step number six says that all the secondary replicas reply to the primary that they have
completed the operation. Having seen all the operations complete, the primary replies to
the client with success, or, if not successful, with an error message.

(Refer Slide Time: 29:24)

That was the system interaction for writes. System interactions have further aspects. One
is data flow: data is pushed through the network in a linear fashion among all three
servers, the primary and the two secondaries. The data is pipelined, that is, pipelined over
TCP connections, and a chain of chunk servers forms the pipeline; each machine forwards
the data to the closest machine. That is called the data flow.

Another interaction is the atomic record append: GFS provides an atomic append
operation called record append. Snapshot means that GFS can make a copy of a file or a
directory tree almost instantaneously. Let us now see the read algorithm.

573
(Refer Slide Time: 30:02)

The application originates the read request to the GFS client, indicating the file name and
the byte range. Out of this byte range, and knowing the chunk size, the client computes
the chunk index while keeping the file name as it is. The pair (file name, chunk index) is
given to the master, and the master in turn provides the metadata for that chunk, called
the chunk handle.

It also provides the replicas where that chunk is stored. This information is passed from
the master back to the client.

574
(Refer Slide Time: 31:02)

The client then directly sends the chunk handle and the byte range to one of the chunk
servers; that chunk server fetches the data from the file and returns it to the client, and the
client passes the data on to the application. That is the read algorithm.

Let me read out all the steps.

(Refer Slide Time: 31:36)

The application originates the read request. The GFS client translates the request from
(file name, byte range) to (file name, chunk index) and sends it to the master, as I

575
have explained. The master responds with the chunk handle and the replica locations.
The client picks a location and sends the chunk handle and the byte range, along with the
read request, to that location. The chunkserver sends the requested data to the client, and
the client forwards the data back to the application.
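The read steps above can be summarized in a short sketch. This is not the real GFS client library; the master and chunkserver interfaces (lookup_chunk, read_chunk) are hypothetical stand-ins for the RPCs described in the text.

CHUNK_SIZE = 64 * 1024 * 1024

def gfs_read(master, filename, offset, length):
    """Sketch of the GFS read path: translate, ask the master, then read from a chunkserver."""
    index = offset // CHUNK_SIZE                              # step 2: byte range -> chunk index
    handle, replicas = master.lookup_chunk(filename, index)   # step 3: master returns handle + replicas
    chunkserver = replicas[0]                                  # step 4: pick one replica location
    start = offset % CHUNK_SIZE                                # offset of the data within the chunk
    data = chunkserver.read_chunk(handle, start, length)      # step 5: chunkserver returns the data
    return data                                                # step 6: hand the data back to the application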

(Refer Slide Time: 32:13)

That was the read algorithm; now the write algorithm. I explained it earlier during the
system interactions, but let us go through it again. The application originates the write
request. The GFS client translates the request from (file name, data) to (file name, chunk
index) and sends it to the master, as we have already seen in the read algorithm.

The master responds with the chunk handle and the replica locations, which we have also
seen in the read algorithm. The client pushes the write data to all the locations, as I
explained in the previous slide on system interactions; the data is stored in the
chunkservers' internal buffers. When the client sends the write command to the primary,
the buffered data is then written.

576
(Refer Slide Time: 33:09)

The primary determines the serial order for the data instances stored in its buffer and
writes the instances in that order to the chunk. The primary sends the serial order to the
secondaries and tells them to perform the write operation.

The secondaries perform the write operations as per the primary's request, and the primary
responds back to the client. If the write fails at one of the chunkservers, the client is
informed and the write operation is retried to complete the write.

(Refer Slide Time: 33:44)

577
So, let us see this particular interaction.

(Refer Slide Time: 33:46)

So, the data is pushed, pipelined to the primary and to both the secondary servers.

(Refer Slide Time: 34:12)

The primary then issues the sequence numbers, and this sequence-number order is
maintained for storing the data in the write operations. This serial order is applied to the
chunk and is communicated from the primary to the secondary replicas.

578
(Refer Slide Time: 34:54)

Having written the mutations into the chunks in that sequence order, the secondaries
respond to the primary about the completion of the operation, and thereupon the primary
responds to the client about the success.

(Refer Slide Time: 35:10)

Now, the record append operation. As I told you, a file is created once and then is not
updated randomly, but the append operation is supported. We have seen the read and the
write; now let us see the append algorithm. The application

579
originates the record append request. The GFS client translates the request and sends it to
the master. The master responds with the chunk handle and replica locations, and the client
pushes the data to all the replicas. The primary checks whether the record fits in the
specified chunk. If the record does not fit in the specified chunk, the primary pads the
chunk, tells the secondaries to do the same, and informs the client; the client then retries
the append on the next chunk. If the record fits,

(Refer Slide Time: 36:02)

then the primary appends the record, tells the secondaries to do the same, receives the
responses from the secondaries, and sends the final response to the client. So, it is almost
the same as the write operation, except that the record is written at the end of the file; that
is called the append operation.
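A minimal sketch of the primary's decision for record append, under the assumptions above (the chunk object and its fields are hypothetical; the real GFS logic also coordinates the secondaries and handles retries):

CHUNK_SIZE = 64 * 1024 * 1024

def primary_record_append(chunk, record):
    """Return the offset at which the record was appended, or None if the client must retry
    on the next chunk (after the primary pads the current one)."""
    if chunk.used + len(record) > CHUNK_SIZE:
        chunk.pad_to_end()          # fill the rest of the chunk; secondaries do the same
        return None                 # tell the client to retry the append on the next chunk
    offset = chunk.used
    chunk.data[offset:offset + len(record)] = record   # append at the offset chosen by the primary
    chunk.used += len(record)
    return offset                   # secondaries apply the append at this same offset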

580
(Refer Slide Time: 36:20)

Master operations: namespace management and locking. Locks are used over the
namespace to ensure proper serialization; read and write locks are used. GFS simply uses
directory-like file names and logically represents its namespace as a lookup table mapping
full path names to metadata.

If a master operation involves a path name, read locks are acquired on all the prefix path
names, and either a read or a write lock is acquired on the full path name.

Let me mention that the file names are stored not in the form of a directory tree but in the
form of a lookup table that is hashed; using this arrangement, namespace lookup is very
efficient in the Google File System.

581
(Refer Slide Time: 37:20)

Chunk replica placement: creation of initially empty chunks uses the under-utilized chunk
servers, which are spread across racks. Re-replication starts once the number of available
replicas falls below a set threshold, so that all the servers are properly utilized.
Rebalancing: the master rebalances the replicas periodically; it examines the distribution
and moves replicas for better disk space usage and load balancing.

(Refer Slide Time: 37:54)

582
Garbage collection: deletion is logged by the master, the file is renamed to a hidden file,
and a deletion timestamp is kept. All these provisions exist for garbage collection because,
if a particular chunk fails, another replica is already in place, and the failed chunk becomes
garbage. This garbage is periodically scanned and reclaimed, and a periodic scan of the
master's chunk namespace is also done.

(Refer Slide Time: 38:32)

Stale replica detection is also done using this scanning.

(Refer Slide Time: 38:40)

583
Fault tolerance: this is also a very important aspect of the Google File System. Fault
tolerance is achieved with fast recovery, chunk replication and master mechanisms. Fast
recovery: the master and chunk servers are designed to restart and restore their state in a
few seconds. Chunk replication is done across multiple machines and across multiple
racks. The master mechanisms keep a log of all changes made to the metadata, periodic
checkpoints of the log are maintained, and the log and checkpoints are replicated on
multiple machines. The master state is replicated on multiple machines, and a shadow
master can serve reads if the master is down.

(Refer Slide Time: 39:22)

For further reading on this topic you can refer to the paper by Sanjay Ghemawat et al.,
"The Google File System", which is available online from Google.

584
(Refer Slide Time: 39:38)

Conclusion: GFS is a distributed file system that supports large-scale data processing
workloads on commodity hardware. GFS occupies a different point in the design space:
component failures are treated as the norm, and the system is optimized for huge files and
large data set operations. GFS provides fault tolerance by replicating the data, with fast
and automatic recovery through chunk replication. GFS uses a very simple, centralized
master that does not become a bottleneck.

Thank you.

585
Distributed Systems
Prof. Rajiv Misra
Department of Computer Science and Engineering
Indian Institute of Technology, Patna

Lecture - 21
Map Reduce

(Refer Slide Time: 00:16)

MapReduce. Introduction: MapReduce is a programming model, and an associated
implementation, for processing and generating large datasets. Users specify a map
function that processes a key/value pair to generate a set of intermediate key/value
pairs, and a reduce function that merges all intermediate values associated with the
same intermediate key. Many real-world tasks are expressible in this model.

586
(Refer Slide Time: 00:49)

Programs written in this functional style are automatically parallelized and executed on a
large cluster of commodity machines. The runtime system takes care of the details of
partitioning the input data, scheduling the program's execution across the set of
machines, handling machine failures, and managing the required inter-machine
communication. This allows programmers without any experience with parallel and
distributed systems to easily utilize the resources of a large distributed system.

(Refer Slide Time: 01:26)

587
A typical MapReduce computation processes many terabytes of data on thousands of
machines; hundreds of MapReduce programs have been implemented, and upwards of one
thousand MapReduce jobs are executed on Google's clusters every day. In other words,
very large data set computations, which cannot be handled using the classical algorithms
on a single existing machine, have to be computed somehow, and how Google does it is
what we are going to see in this lecture.

Google has devised a programming paradigm that exploits the cluster machines: the large
data set is distributed in small chunks onto different nodes across the cluster, and the
computation is performed there. How does the programmer write such programs? Google
has given this paradigm, called the MapReduce programming paradigm, which allows
programmers to comfortably write the programs for their application without bothering
about the intricacies of the underlying distributed system, the distribution of data and
programs, and how the communication takes place. All these dependencies and intricacies
are hidden, and an abstraction is made available in the form of MapReduce.

(Refer Slide Time: 03:35)

In this lecture we are going to introduce this notion and give several examples based on
it. Let us see the model or the

588
architecture which is going to be used for MapReduce programming. This is a single-node
architecture, consisting of one CPU, memory and a disk.

(Refer Slide Time: 03:57).

A collection of such nodes forms a cluster; if normal machines like laptops and desktops
are used, it is called a commodity cluster.

Web data sets can be very large, tens to thousands of terabytes. The standard architecture
emerging to accommodate such large data sets is a cluster of commodity nodes with a
gigabit Ethernet interconnect. The question is how to organize the computation in this
architecture.

589
(Refer Slide Time: 04:40)

This is an example of a cluster architecture. Earlier we saw one such node; here several
such nodes are connected through a switch, and a further switch connects the complete
set of racks. A cluster of a thousand nodes is also available for such computation. These
switches are very fast: an InfiniBand or gigabit switch, that is, gigabit Ethernet, is used for
a very high-speed interconnect across the nodes.

The figure shows an architecture for a cluster consisting of 64 nodes. Given this
architecture, how can a program be written to exploit it, so that a large data set, which
cannot be handled on a single node, can be stored and computed? This is the current state
of the art, which Google and other big companies are now using, and we are going to see
this technology.

590
(Refer Slide Time: 06:21)

To utilize this cluster architecture, we require a distributed file system, which we have
already covered in the previous lectures. The distributed file system provides a single-
system abstraction to the programmer, in the form of the Google File System, Hadoop's
HDFS, or Kosmix's KFS, which give a global file namespace. The typical usage pattern
is that it accommodates huge files, hundreds of terabytes; data is rarely updated in place;
and reads and appends are the common operations when dealing with large data sets.

These commodity nodes also fail, so failure becomes the norm in such commodity
clusters. How can we store and compute this data persistently? We assume there is a
distributed file system, like the Google File System or HDFS, in place to ensure stable
storage.

591
(Refer Slide Time: 07:56)

That is, fault-tolerance mechanisms are already in place within that system, and failures
and faults are handled accordingly.

The distributed file system comprises chunk servers, a master node, and a client library
for file access. A chunk server is where the files, split into contiguous chunks, are stored.
Each chunk is of size 64 MB, and each chunk is replicated three times, with the replicas
kept on different racks of the cluster where possible. A rack in a cluster holds more than
one node, interconnected with a fast switch, and is housed in one unit.

592
(Refer Slide Time: 09:12)

The motivation for MapReduce is to support large-scale data processing, which is
otherwise not possible on a single node or even a supercomputer, because they have finite
storage and finite computing power. If we consider a big cluster of a thousand nodes,
which has storage and processing capabilities, how can this cluster be utilized to store and
compute the data, and what programming model lets the programmers avoid dealing with
the intricacies? We are going to see that programming environment, called MapReduce,
which is used for large-scale data processing. Here the programs are quite simple, but the
data is very large; the challenge is how such a large computation can be carried out.

The MapReduce architecture provides automatic parallelization and distribution of the
large data sets. It also ensures fault tolerance, because nodes can fail at any point of time;
I/O scheduling, monitoring and status updates are also handled within the architecture.

593
(Refer Slide Time: 11:04)

Programming model: the computation takes a set of input key/value pairs and produces a
set of output key/value pairs. The user of the MapReduce library expresses the
computation as two functions: one is called map and the other is called reduce.

(Refer Slide Time: 12:01)

The map function, written by the user, takes an input pair and produces intermediate
key/value pairs. The MapReduce library groups together all the intermediate values
associated with the same intermediate key and passes them to the next phase, that is, the

594
reduce phase. In other words, the input is in the form of key/value pairs, and the map
function is applied to this data set; this generates intermediate key/value pairs, which are
grouped together so that all the intermediate values associated with the same intermediate
key I are passed on as that key with a list of values. These are passed to the reducer, which
gives an output again in the form of key/value pairs.

(Refer Slide Time: 13:39)

Using key/value pairs, the MapReduce functions for word count are given in this manner.
There is a function called map, which takes a key/value pair; the map reads each word
given as part of the value, and for each word it emits the pair (word, 1). Similarly, the
reducer is another function which the programmer writes; the output of the mapper is
called the intermediate result, and these intermediate values are taken by the reducer to
produce the final output, which is again in the form of key/value pairs. We will explain
this example later in this lecture.

595
(Refer Slide Time: 14:34)

MapReduce functions: as I told you, the input is a set of key/value pairs. The user has to
supply two functions, map and reduce, based on the application they have in mind. The
map, applied to an input pair (k1, v1), produces a list of intermediate pairs list(k2, v2); the
library then groups all the values belonging to a particular key, and that is given to the
reducer. The reducer takes a key k2 and the list of all grouped values list(v2) as its input
and generates an output consisting of values for that key.

So, the output is again a key together with its values, that is, a key/value pair. This is a
simple explanation of how MapReduce provides the simple constructs of map and reduce;
the programmer has to fill in what the map and the reduce functions are, as per the
application.
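The contract can be summarized as map: (k1, v1) -> list(k2, v2) and reduce: (k2, list(v2)) -> list(v2). A tiny sequential driver that mimics this contract is sketched below; it only illustrates the data flow, it is not the distributed runtime, and run_mapreduce is a name of my own choosing:

from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """inputs: iterable of (k1, v1); map_fn: (k1, v1) -> list of (k2, v2);
    reduce_fn: (k2, [v2, ...]) -> reduced output for that key."""
    # Map phase: apply map_fn to every input pair and group the emitted values by key.
    intermediate = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            intermediate[k2].append(v2)
    # Reduce phase: apply reduce_fn to each intermediate key and its grouped values.
    return {k2: reduce_fn(k2, values) for k2, values in intermediate.items()}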

596
(Refer Slide Time: 15:58)

Let us see the different applications where this MapReduce paradigm, or model of
programming, is used. Here are a few simple examples of interesting programs that can
easily be expressed as MapReduce computations: distributed grep, count of URL access
frequency, reverse web-link graph, term vector per host, inverted index, and distributed
sort.

(Refer Slide Time: 16:20)

597
(Refer Slide Time: 16:24)

Let us see the implementation. Many different implementations of the MapReduce
interface are possible.

(Refer Slide Time: 16:31)

The right choice depends on the environment. For example, one implementation may be
suitable for a small shared-memory machine, another for a large NUMA multiprocessor,
and yet another for an even bigger collection of networked machines. Here we describe
an implementation targeted at the computing environment in wide use at Google.

598
(Refer Slide Time: 17:07)

At Google: large clusters of commodity PCs connected together through fast Ethernet
switches. This is the description of the cluster machines that we have already explained.

(Refer Slide Time: 17:15)

Let us consider the distributed execution which takes place on invocation of the map and
reduce functions. The map invocations are distributed across multiple machines by
automatically partitioning the input data into a set of M splits. The input splits can be
processed in parallel by different machines; this is all taken care of when map is invoked.
When the reduce function is invoked, the reduce

599
invocations are distributed by partitioning the intermediate key space into R pieces using
a partitioning function, which is simply a hash of the key modulo R.

The number of partitions R and the partitioning function are specified by the user. The
figure shows the overall flow of a MapReduce operation.

(Refer Slide Time: 18:26)

This is the distributed execution. The user program forks different processes: three kinds
are shown here. The first is a special process called the master, and all the other processes
are workers assigned either the map function or the reduce function. On a particular
worker, the map function takes its input in the form of splits: the input is a big file, split
into multiple chunks which can be stored on different nodes; these splits of the whole data
set, given as the input file, form the input.

Here three splits are shown, so three workers read them in parallel and apply the map
function to these splits. The next step is that the output of the map function is written
locally, that is, buffered locally. As far as the intermediate key/value pairs are concerned,
these values are read from the local buffer and grouped together on the same key; for a
particular key, all the values are grouped,
600
they are being given now to the server. So, serve to the worker which basically will be
running the reduce functions.

So, the reducer which is running on a worker, will remote read to four performing read, it
will do a sort also. So; that means, for a particular key all the list of values; that means,
according to the key, it will be sorted, and all the values for a particular key. These
particular values will be given in the form of a list to the reducer. So, reducer will
perform the reduced functions on that particular intermediate key, and its corresponding
list of values and the same program will happen on another worker. So, it depends upon
how many workers are there, this particular data will be now targeted on this particular
worker.

So, it depends upon the hash function on a particular key, and this hash will be divided
into R different workers, this is R 1 R 2. So, it will be something mod R. So, the hash;
that means, these intermediate values which are coming, they are to be hashed on a
worker of size R with using mod R. So, that is we have explained here in the reducer
invocation by partitioning the input intermediate key space into R pieces, using
partitioning function hash key mod R. So, R different partitions of the intermediate
values will now take place, and each partition is given to the worker. So, how many
workers are assigned with a reducer program is fixed by R. So, many; that means, the
intermediate key value pair, is partitioned further into Rs plates, and they are being
assigned to the reducer in this particular form.

So, the partitioning is done at two levels; one is at the input data, the other partitioning is
done for the intermediate values. So, after applying the reducer on each different
partition, it will produce an output file in the form of the key value pairs.

601
(Refer Slide Time: 23:59)

This sequence of actions, which I have explained through the figure, is now explained
again. When a user program calls the MapReduce function, the following sequence of
actions occurs. First, the MapReduce library in the user program splits the input file into
M pieces of typically 64 MB; if you recall, 64 MB is the size of a chunk. It then starts up
many copies of the program on the cluster machines. So, the first step is over once the
input is split.

One of the copies of the program is special and is called the master; the rest are workers
that are assigned work by the master. There are M map tasks and R reduce tasks to assign;
the master picks idle workers and assigns each one a map task or a reduce task. That is
what I explained: the user program creates the processes, one becomes the master, and all
the others are assigned the worker job, either a map task or a reduce task. A worker
assigned a map task reads the contents of the corresponding input split, parses key/value
pairs out of the input data, and passes each pair to the user-defined map function.

602
(Refer Slide Time: 25:49)

The map function generates the intermediate key/value pairs, which are buffered in
memory. Periodically the buffered pairs are written to the local disk, partitioned into R
regions by the partitioning function I explained, which is simply hashing with mod R. The
locations of these buffered pairs on the local disk are passed back to the master, who is
responsible for forwarding these locations to the reduce workers.

So, the information about the partitioning of intermediate values is given to the master.
When a reduce worker is notified by the master about these locations, it uses remote
procedure calls to read the buffered data from the local disks of the map workers. When
a reduce worker has read all the intermediate data, it sorts it by the intermediate key, so
that all occurrences of the same key are grouped together; as I have explained, this happens
before giving the intermediate data to the reducer.

Sorting is needed because typically many different keys map to the same reduce task. If
the amount of intermediate data is too large to fit in memory, an external sort is used.

603
(Refer Slide Time: 27:08)

The sixth step says that the reduce worker iterates over the sorted intermediate data, and
for each unique intermediate key encountered, it passes the key and the corresponding set
of intermediate values to the user's reduce function; the output of the reduce function is
appended to the output file for this reduce partition. When all the map and reduce tasks
have been completed, the master wakes up the user program; at this point the MapReduce
call in the user program returns back to the user code.

(Refer Slide Time: 27:42)

604
After successful completion, the output of the MapReduce execution is available in R
output files, one per reduce task, with file names specified by the user. Typically users do
not need to combine these R output files into one file; they often pass them as input to
another MapReduce call, or use them from another distributed application that is able to
deal with input partitioned into multiple files.

(Refer Slide Time: 28:11)

Let us go into more detail on the master data structures. The master keeps several data
structures: for each map task and reduce task, it stores the state (idle, in-progress or
completed) and the identity of the worker machine for non-idle tasks. The master thus has
a complete global view: the states of all the tasks are maintained, along with the identities
of the worker machines, because they are used in the computation.

The master is the conduit through which the locations of the intermediate file regions are
propagated from map tasks to reduce tasks. Therefore, for each completed map task the
master stores the locations and sizes of the R intermediate file regions produced by that
map task, as we have already explained. Updates to this location and size information are
received as map tasks are completed, and the information is pushed incrementally to
workers that have in-progress reduce tasks.

605
(Refer Slide Time: 29:22)

Another aspect is fault tolerance. Since the MapReduce library is designed to help process
very large amounts of data using hundreds or thousands of machines, the library must
tolerate machine failures gracefully. When a map worker fails, the map tasks completed
or in progress at that worker are reset to idle, and the reduce workers are notified when
the task is rescheduled on another worker. On a reduce worker failure, only the in-progress
tasks are reset to idle. On a master failure, the MapReduce job is aborted and the client is
notified.

(Refer Slide Time: 30:10)

606
So, if the master fails, the entire job is aborted and the client is notified. Locality: network
bandwidth is a relatively scarce resource in this computing environment; we can conserve
network bandwidth by taking advantage of the fact that the input data is stored on the
local disks of the machines that make up the cluster. GFS divides each file into 64 MB
chunks and stores several copies of each chunk, typically 3 copies, on different machines.

(Refer Slide Time: 30:36)

The MapReduce master takes the location information of the input files into account and
attempts to schedule a map task on a machine that contains a replica of the corresponding
input data. Failing that, it attempts to schedule the map task near a replica of the task's
input. When running large MapReduce operations on a significant fraction of the workers
in a cluster, most input is read locally and consumes no network bandwidth.

607
(Refer Slide Time: 31:04)

The next topic is task granularity. The map phase is subdivided into M pieces and the
reduce phase into R pieces; ideally M and R should be much larger than the number of
worker nodes. There are practical bounds on how large M and R can be: the master must
make O(M + R) scheduling decisions and keep O(M * R) state in memory. Further, R is
often constrained by the users, because the output of each reduce task ends up in a separate
output file.

(Refer Slide Time: 31:56)

608
(Refer Slide Time: 31:59)

The partitioning function I have already discussed; ordering guarantees are also taken care
of, and there is also a combiner function, which we have discussed.

(Refer Slide Time: 32:05)

Let us go ahead with a few examples. The first example is called word count.

609
(Refer Slide Time: 31:12)

Given a document, we want to count the frequency of each word; that is called word
count. Let us see how this is done using a MapReduce program, through an illustrative
example.

(Refer Slide Time: 32:28)

Let us say the file contains the words: see bob run, see spot throw, and it is given to the
map function. In the map function, each word is parsed and then emitted with the value 1.
So, 'see' will be emitted with the value 1,

610
'bob' will be emitted with the value 1, 'run' is emitted, and so on. All the words are the
keys, and each is emitted with the value 1; so the word is the key and 1 is the value.

So, this particular output will be generated as the intermediate results. Now, these
intermediate results will be sorted according to the keys. For example, see how many, 2
times see is there. So, it will go to the 1 worker, with the reduced function, this reduce
function on it. Similarly bob will have another worker, and run will also have another
worker, and spot is also another worker, and throw is also another worker. So, just see
that, as far as see is concerned, see will output to why, because it will just add both of
them. So, here you see that, it will sum all once in this list. So, in this list there are 2
times 1 is coming. So, see will be output as 2; whereas, bob is only one occurrence. So,
bob will be output as bob 1, run also will do the same thing, spot also and throw also. So,
just see that the output also is in the form of a key value pair.

So, input was the document, and the output is the key value pair. So, you just see that the
word count has happened. So, for every word, how many the frequency of that particular
word in the document; that is being counted using map reduce program.
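
To make the flow concrete, here is a minimal word-count sketch in Python; the function names
and the simple in-memory shuffle step are illustrative assumptions, not the exact code shown
in the lecture.

    from collections import defaultdict

    def map_func(document):
        # Emit (word, 1) for every word in the document.
        for word in document.split():
            yield (word, 1)

    def reduce_func(word, counts):
        # Sum all the 1s emitted for this word.
        return (word, sum(counts))

    def run_word_count(document):
        groups = defaultdict(list)            # the shuffle/sort step: group by key
        for word, one in map_func(document):
            groups[word].append(one)
        return [reduce_func(w, c) for w, c in groups.items()]

    print(run_word_count("see bob run see spot throw"))
    # [('see', 2), ('bob', 1), ('run', 1), ('spot', 1), ('throw', 1)]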

(Refer Slide Time: 34:52)

Now, another program we are going to see is for counting words of different lengths. For
example, "and" is of length 3, "or" is of length 2, and so on. We have to count how many
words of length 3 appear in the document, how many words of length 2 appear in the
document, and so on. For that we are going to write a map
611
reduce function. The map function takes a value and outputs a key : value pair.

So, the key is output first and its corresponding value follows it. For instance, we define
the map function so that it takes a string and outputs the length of the word as the key,
and the word itself as the value. If we give the word "Steve" to the map, it will return
the length of this word as the key, and the word itself as the value. In other words, for
each word in the document, this map function emits length : word, and this is the output.

This will allow us to run the map function against the values in parallel, and it provides
a huge advantage.

(Refer Slide Time: 36:57)

Let us see what happens. If a file is given with these words, the map function will output
pairs of this form: the length is the key, and the word is the value. This is the output of
the map function. Once this output is produced, the pairs are grouped according to their
keys. To group according to the keys, we have to sort according to the key, so this
particular list is sorted by key.

Then we group the pairs with the same key, and generate a list of words for each key. So,
for key 3, this is the list of words. For the

612
key 4, this is the list of words. For key 5, these two words form the list, and for key 8,
these are the two words in the list. These lists of words are intermediate data, which are
given to the reducer; for each key there is a reduce invocation, and all it does is count
how many elements are in that particular list. So, for 3 it outputs 3, for 4 it outputs 3,
for 5 it outputs 2, and for 8 it outputs 2; that means there are 3 words of length 3 in the
document, 3 words of length 4, and so on.

So, this was another example of a MapReduce program, explained using the map and reduce
functions.
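
A similar sketch, with the same assumed in-memory grouping as before, for counting how many
words of each length appear in a document:

    from collections import defaultdict

    def map_func(document):
        # Emit (length of word, word) for every word.
        for word in document.split():
            yield (len(word), word)

    def reduce_func(length, words):
        # Count how many words have this length.
        return (length, len(words))

    def run_length_count(document):
        groups = defaultdict(list)
        for length, word in map_func(document):
            groups[length].append(word)
        return sorted(reduce_func(l, ws) for l, ws in groups.items())

    print(run_length_count("the cat and the long longer elephant"))
    # [(3, 4), (4, 1), (6, 1), (8, 1)]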

(Refer Slide Time: 39:23)

You can solve bigger problems also. Let us take another problem as an example.

613
(Refer Slide Time: 39:29)

This problem finds common friends using the MapReduce programming paradigm. In the
finding-friends problem, Facebook has a list of friends for each person. Note that
friendship is bidirectional on Facebook: if I am your friend, you are mine. Facebook also
has a lot of disk space, and it serves hundreds of millions of requests every day, so it
has decided to pre-compute calculations when it can, to reduce the processing time of
requests. One common processing request is the "You and Joe have 230 friends in common"
feature.

So, when you visit someone's profile, you see the list of friends that you have in common.
This list does not change frequently, so it would be wasteful to recalculate it every time
you visit the profile. We are going to use MapReduce so that we can calculate everyone's
common friends once a day and store those results.

614
(Refer Slide Time: 40:45)

Later on, we can use the stored results with just a quick lookup; this saves a lot of work
and it is cheap as well. Let us see how MapReduce can be used to find the common friends.
Assume the friends are stored in this particular form: a person followed by the list of
that person's friends; this is given as the input to the program.

(Refer Slide Time: 41:37)

Now, let us see what happens next, and what the map and reduce functions do. The map
function takes a person and its friend list, and for each friend emits that person paired
with
615
one of these friends as the key, and the whole friend list as the value. For example, if
person A has the three friends B, C, D, then for each friend W in the friend list the map
function emits the key (A, W) with the value B C D. So, it emits (A B) -> B C D, then
(A C) -> B C D, then (A D) -> B C D. Similarly, for person B with the list A C D E, it
emits (A B) -> A C D E; note that the order is changed here so that A comes before B in
the key. Then it emits (B C) -> A C D E, (B D) -> A C D E, and so on.

(Refer Slide Time: 43:16)

So, for every person, the map emits such key-value pairs. Now let us see what happens next.
Before we send these key-value pairs to the reducer, we group them by their key and get
these values.

616
(Refer Slide Time: 43:24)

Here you see that this is one group, and this is another group; every key gets its own
group. What will the reducer do? It will take the intersection of the two lists for that
key. So, for the key (A B) it generates (A B) followed by the intersection, which is C D.
Similarly, for (A C), if you take the intersection, B and D will come out, and so on.

(Refer Slide Time: 44:24)

So, each grouped line is passed as an argument to the reducer, and the reduce function
simply takes the intersection of the lists of values, and hence it outputs the

617
common friends between two people. So, when A visits another person's profile, say B's
profile, he can find out in this way that C and D are the common friends of A and B.
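
The same pattern can be sketched for the common-friends example; the input data, the helper
names and the in-memory grouping below are assumptions made for illustration only.

    from collections import defaultdict

    friends = {                      # hypothetical input: person -> list of friends
        "A": ["B", "C", "D"],
        "B": ["A", "C", "D", "E"],
        "C": ["A", "B", "D", "E"],
        "D": ["A", "B", "C", "E"],
        "E": ["B", "C", "D"],
    }

    def map_func(person, friend_list):
        # Emit a key for every (person, friend) pair, always in sorted order,
        # with the whole friend list as the value.
        for f in friend_list:
            yield (tuple(sorted((person, f))), set(friend_list))

    def reduce_func(pair, lists):
        # Intersect the two friend lists to get the common friends of the pair.
        return (pair, sorted(set.intersection(*lists)))

    groups = defaultdict(list)
    for person, flist in friends.items():
        for key, value in map_func(person, flist):
            groups[key].append(value)

    for pair in sorted(groups):
        print(reduce_func(pair, groups[pair]))
    # e.g. (('A', 'B'), ['C', 'D']) -- C and D are the common friends of A and B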

(Refer Slide Time: 45:05)

So, with these three different examples, we have seen that we can solve large-scale data
processing using map and reduce functions, and Google also runs many such programs every
day.

For further reading on this particular topic, you can refer to the reference "MapReduce:
Simplified Data Processing on Large Clusters", which is available on the Google research
site.

618
(Refer Slide Time: 45:39)

Conclusion. The MapReduce programming model has been successfully used at Google for many
different purposes. The model is easy to use, even for programmers without experience with
parallel and distributed systems, since it hides the details of parallelization, fault
tolerance, locality optimization and load balancing. A variety of problems are easily
expressible as MapReduce computations. For example, MapReduce is used for the generation of
data for Google's production web search service, for sorting, for data mining, for machine
learning and in many other systems.

Thank you.

619
Distributed Systems
Prof. Rajiv Misra
Department of Computer Science and Engineering
Indian Institute of Technology, Patna

Lecture – 22
Case Studies HDFS

(Refer Slide Time: 00:16)

HDFS and Spark.

(Refer Slide Time: 00:22)

620
The Hadoop Distributed File System (HDFS): introduction. Hadoop provides a distributed
file system and a framework for the analysis and transformation of very large data sets
using the MapReduce paradigm. An important characteristic of Hadoop is the partitioning of
data and computation across many thousands of hosts, and the execution of application
computations in parallel close to their data.

A Hadoop cluster scales computation capacity, storage capacity and I/O bandwidth by simply
adding commodity servers. Hadoop clusters at Yahoo! span 25,000 servers and store 25
petabytes of application data, with the largest cluster being 3,500 servers; more than 100
other organizations worldwide report using Hadoop. So, let me give you a more intuitive
notion of the Hadoop file system and the Hadoop environment. Hadoop is an environment of
which HDFS is one part.

Hadoop also contains MapReduce and other applications which are used within this framework
called Hadoop. The Hadoop file system is the file system that uses the cluster setup for
large-scale computations; these computations are designed using programming environments
such as MapReduce or Spark.

When we say large-scale computation, we mean that the data sets are of large scale, and
this platform is good enough to support computation over such large data set sizes. The
cluster is basically nothing but a set of hosts connected through the network. It is
scalable in the sense that we can grow the number of nodes without running into network
bottlenecks; this kind of cluster setup is scalable.

In contrast, a single system made more powerful by adding more hardware hits a scalability
bottleneck; here, scalability is ensured by this kind of setup. Let us now discuss how
large-scale computation can be carried out on this setup and how the setup can be exploited
for it. We are going to discuss first the Hadoop environment, and in particular the Hadoop
Distributed File System, which runs over this cluster setup.

So, HDFS exploits this kind of cluster setup, and this will be the point of discussion in
this particular lecture.

621
(Refer Slide Time: 03:57)

Hadoop is an Apache project; all components are available via the Apache open-source
license. Yahoo! has developed and contributed about 80 percent of Hadoop, which comprises
HDFS and MapReduce. Other components also exist: HBase was originally developed at Powerset
(now a department at Microsoft), Hive was developed at Facebook, and Pig, ZooKeeper and
other variants also use the Hadoop file system and are, in essence, its applications.

(Refer Slide Time: 04:35)

622
So, HDFS, the Hadoop Distributed File System, is a part of the Hadoop project, and we are
going to discuss this particular component in this lecture.

(Refer Slide Time: 04:58)

So, HDFS is the file system component of Hadoop. While the interface to HDFS is patterned
after the UNIX file system, faithfulness to the standard was sacrificed in favour of
improved performance for the applications at hand.

HDFS stores file system metadata and application data separately; we will see how data and
metadata are two separate entities in HDFS. As in other distributed file systems like PVFS
and Lustre, HDFS stores metadata on a dedicated server called the name node. So, there is
one server called the name node in the HDFS file system, and the name node stores the
metadata.

Application data are stored on the other servers, called the data nodes; there are many
data nodes, 1, 2 and so on up to n, but only one name node. All the servers are fully
connected and communicate with each other using TCP-based protocols. An HDFS design
assumption is that a single machine tends to fail; it is prone to failure because its
different components, like the hard disk or the power supply, can themselves fail.

623
(Refer Slide Time: 06:35)

So, more machines means an increased failure probability, and the data also does not fit in
a single node. This is the motivation to have a cluster whose architecture is scalable and
also fault tolerant. The architecture of HDFS involves one name node and many data nodes.

(Refer Slide Time: 07:03)

As far as the name node is concerned, it stores the metadata: where the file blocks are
stored, that is the namespace image, and also the edit log, that is the operation log.
There is also a secondary name node; that is, the system keeps track of and

624
maintains this secondary name node, which is also called a shadow master.

That means, if the master fails, this shadow is available in an up-to-date state and it
takes over as the master; that is the role of the secondary name node. The other component
is the data nodes, also called chunk servers, which store and retrieve the file blocks (the
data blocks) on behalf of the client or the name node. The information about where these
blocks are stored can be obtained by the clients through the name node. The data nodes
report to the name node with the list of blocks that they are storing.

So, the data nodes, which are many in number, report to the name node with the help of
heartbeats; they report to the name node the list of blocks that they are storing. Whenever
a request comes from a client, or through the name node, to store data, the data nodes
store it, and this information is reported back to the name node so that the metadata can
be updated.

The HDFS namespace maintained by the name node is a hierarchy of files and directories.

(Refer Slide Time: 09:01)

Files and directories are represented on the name node by inodes. The file content is split
into large blocks (for example 128 MB), and each block of a file is independently replicated at

625
multiple data nodes. The name node maintains the namespace tree and the mapping of file
blocks to data nodes.

Let us see this concept of the name node. Say file 1 has the data blocks 2, 5 and 7, and
file 2 has the data blocks 4 and 6. This is the name node, and there are data nodes which
store the blocks of the files; each block is replicated, by default 3 times.

That means, when block number 2 is stored, and we have, say, 4 data nodes, any 3 of them
will be selected to store it; similarly block number 5 will be stored on any 3 of them. So,
if one node crashes, is down or fails, the data in question is already available on 2 of
the other nodes. That is why availability in spite of failure is assured; that is what we
have discussed here.

An HDFS client wanting to read a file first contacts the name node for the locations,
because the name node maintains the locations of the data blocks comprising the file, and
then reads the block contents from the data node closest to the client. So, the client
contacts the name node, learns the file's blocks and the addresses of the data nodes, and
then the client in turn contacts the data nodes and accesses the file data in the form of
data blocks.
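
The read path can be sketched as follows; the classes and method names here are invented for
illustration (they are not the real HDFS client API), but they show the division of labour
between the name node and the data nodes.

    class DataNode:
        def __init__(self, name, blocks):
            self.name, self.blocks = name, blocks
        def read_block(self, block_id):
            return self.blocks[block_id]            # raises KeyError if the block is missing

    class NameNode:
        def __init__(self, mapping):
            self.mapping = mapping                  # path -> [(block_id, [DataNode, ...])]
        def get_block_locations(self, path):
            return self.mapping[path]

    def hdfs_read(name_node, path):
        data = b""
        for block_id, replicas in name_node.get_block_locations(path):
            for dn in replicas:                     # try replicas until one succeeds
                try:
                    data += dn.read_block(block_id)
                    break
                except KeyError:
                    continue
        return data

    d1 = DataNode("dn1", {0: b"hello "})
    d2 = DataNode("dn2", {0: b"hello ", 1: b"world"})
    nn = NameNode({"/f.txt": [(0, [d1, d2]), (1, [d2])]})
    print(hdfs_read(nn, "/f.txt"))                  # b'hello world'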

When writing data, the client requests the name node to nominate a suite of 3 data nodes, as
I mentioned, to host the block replicas; the client then writes the data directly to the
data nodes in a pipeline fashion.

626
(Refer Slide Time: 11:55)

That means it pushes the data along the 3 data nodes given by the name node, and a single
write operation writes to all of them.

So, HDFS keeps the entire namespace in RAM.

(Refer Slide Time: 12:17)

The inode data and the list of blocks belonging to each file comprise the metadata of the
name system, called the image; the persistent record of the image, stored in the local
host's native file system, is called the checkpoint. The name node also stores a
modification log of the image, called the journal, in the local host's native file system. During

627
restarts, the name node restores the namespace by reading the checkpoint and replaying the
journal.

Now, another important component is the data nodes. Each block replica on a data node is
represented by two files in the local host's native file system.

(Refer Slide Time: 12:53)

The first file contains the data itself, and the second file contains the block's metadata,
including the checksum for the block data and the block's generation stamp. During startup,
each data node connects to the name node and performs a handshake.

(Refer Slide Time: 13:28)

628
The namespace ID is assigned to the file system instance when it is formatted. Consistency
of software versions is important, because an incompatible version may cause data
corruption or loss. A data node that is newly installed and without any namespace ID is
permitted to join the cluster and receives the cluster's namespace ID.

After the handshake, the data node registers with the name node; data nodes persistently
store their unique storage IDs. A data node identifies the block replicas in its possession
to the name node by sending a block report. Subsequent block reports are sent every hour
and provide the name node with an up-to-date view of where the block replicas are stored on
the cluster.

(Refer Slide Time: 14:24)

During normal operation, data nodes send heartbeats to the name node, as I have told you.
The default heartbeat interval is 3 seconds, and if the name node does not receive a
heartbeat from a data node within 10 minutes, the name node considers the data node to be
out of service and the block replicas hosted by that data node to be unavailable.

Heartbeats from a data node also carry information about the total storage capacity, the
fraction of storage in use, and the number of data transfers currently in progress. The
name node does not directly call data nodes; it uses the replies to heartbeats to send
instructions to the data nodes. These commands are important for maintaining the

629
overall integrity, and therefore it is critical to keep heartbeats frequent even on big
clusters.
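
The liveness rule just described (3-second heartbeats, a 10-minute timeout) can be sketched
as follows; the class and method names are assumptions for illustration only.

    import time

    HEARTBEAT_INTERVAL = 3            # seconds, sent by every data node
    DEAD_TIMEOUT = 10 * 60            # seconds without a heartbeat -> out of service

    class NameNodeMonitor:
        def __init__(self):
            self.last_heartbeat = {}              # data node id -> last timestamp

        def on_heartbeat(self, node_id, capacity, used, transfers):
            # A heartbeat also carries capacity, usage and in-progress transfer counts;
            # the reply to it is where the name node piggybacks its commands.
            self.last_heartbeat[node_id] = time.time()
            return []                             # e.g. replication commands would go here

        def dead_nodes(self):
            now = time.time()
            return [n for n, t in self.last_heartbeat.items()
                    if now - t > DEAD_TIMEOUT]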

(Refer Slide Time: 15:19)

HDFS client: the third component is the HDFS client. User applications access the file
system using the HDFS client. Similar to most conventional file systems, HDFS supports
operations to read, write and delete files, and operations to create and delete
directories. When an application reads a file, the HDFS client first asks the name node for
the list of data nodes, as I have explained, and this interaction can be seen over here.

(Refer Slide Time: 15:45)

630
So, the client has to first contact the name node, because the name node has the metadata,
and through that metadata it can directly reach the set of data nodes which contain the
data blocks. For writing, the client first organizes the chosen data nodes into a pipeline
and then issues the write commands along it.

(Refer Slide Time: 16:20)

So, all these data blocks are written, and then the name node is informed about the writing
of that data in the cluster.

Similarly, as far as a file read is concerned: the first step in HDFS is to open the file,
then contact the name node to get the block locations, and then perform the read operations
on these block locations. Since each block is stored in 3 different copies, the client will
try to read any one of these 3 copies; if that read is not successful, it may read another
copy, or it may even issue 2 of the 3 reads simultaneously. Once the read operation is
complete, it closes the file.

In a write, the client has to inform the name node when the file is closed, but in a read
it just closes without informing. This is the anatomy of a write: once the write is
complete, the client has to inform the name node that the entire operation is complete. So,
for file writing, the client contacts the name node, the name node tells it the data nodes,
and then the client writes packets to the data nodes in a pipeline fashion. It is shown as
a pipeline of the data

631
nodes, and once it is done the acknowledgements flow back, and then the HDFS client also
informs the name node that the write operation is complete.
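
The write pipeline can be sketched in the same hedged style; the class and method names are
again invented for illustration and do not correspond to the real HDFS client code.

    class PipelineDataNode:
        def __init__(self, name):
            self.name, self.stored = name, []
        def write_packet(self, packet, downstream):
            self.stored.append(packet)                 # store locally, then forward
            if downstream:
                downstream[0].write_packet(packet, downstream[1:])
            return "ack"                               # the ack flows back up the pipeline

    def hdfs_write_block(pipeline, block_data, packet_size=4):
        packets = [block_data[i:i + packet_size]
                   for i in range(0, len(block_data), packet_size)]
        for p in packets:
            pipeline[0].write_packet(p, pipeline[1:])
        # After the file is closed, the client informs the name node (omitted here).

    nodes = [PipelineDataNode(n) for n in ("dn1", "dn2", "dn3")]
    hdfs_write_block(nodes, b"some block data")
    print([dn.stored for dn in nodes])                 # all three replicas hold the same packets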

(Refer Slide Time: 18:01)

There are other issues as well, because HDFS has information about the racks. Most
decisions about block placement on data nodes are made with awareness of the position of
the nodes: which rack and which data center they are in. All this information is required
so that availability in spite of failure is ensured in a more comprehensive manner.

Thank you.

632
Distributed Systems
Prof. Rajiv Misra
Department of Computer Science and Engineering
Indian Institute of Technology, Patna

Lecture - 23
Spark

(Refer Slide Time: 00:18)

Motivation. MapReduce and its variants have been highly successful in implementing
large-scale data-intensive applications on commodity clusters. That is to say, MapReduce
has successfully shown that it is able to compute over large-scale data sets on cluster
machines.

However, if you look at the MapReduce execution, you will see that map and reduce work in a
lockstep manner, and the output is recorded in HDFS. So, if the program has more than one
iteration of MapReduce, then the input of the next iteration has to come out of HDFS.

So, applications which have more than one iteration have to touch, that is access, HDFS in
every iteration, and that involves intensive input/output and serialization.

633
(Refer Slide Time: 01:59)

These I/O operations can take up to 90 percent of the time.

(Refer Slide Time: 02:03)

That was basically the disadvantage. Most of these systems are built around an acyclic
data-flow model, and that is not suitable for many applications. In this lecture we will
focus on one such class of applications: those that reuse a working set of data across
multiple parallel operations.

This includes many iterative applications, such as machine learning algorithms that have
iterations, and also interactive applications such as data analysis tools. These

634
applications would require this kind of access to the HDFS file system between different
iterations or across different parallel operations.

Therefore, a new framework called Spark will support such applications. Not only will it
support such applications, it will also improve the execution time; that is, it can be ten
to a hundred times faster compared to Hadoop MapReduce.

(Refer Slide Time: 03:34)

So, the new framework called Spark supports such applications while retaining the
scalability and fault tolerance of MapReduce. To achieve these goals, Spark introduces an
abstraction called resilient distributed datasets (RDDs). A resilient distributed dataset
is a read-only collection of objects, partitioned across a set of machines, that can be
rebuilt if a partition is lost. So, an RDD is a read-only collection of objects, meaning
they are immutable objects.

Immutable objects means they are read only and cannot be modified. Since an RDD is read
only, if a failure happens at some partition on some node, that partition can be rebuilt,
and the execution can resume without much disruption.

Spark can outperform Hadoop by up to a hundred times in iterative machine learning jobs; we
will see how it achieves this.

635
(Refer Slide Time: 04:48)

If we compare Hadoop MapReduce and Spark, we see that Spark can be up to a hundred times
faster than MapReduce; it also supports real-time processing, it stores data in memory, and
Spark applications are written in the Scala language.

(Refer Slide Time: 05:14)

This is an example which shows that Hadoop MapReduce, during its iterations, has to access
the hard disk, that is the HDFS file system, which is on persistent storage. In Spark, in
contrast to MapReduce, you can see that there

636
will be only one read operation from HDFS, and after that thus particular data will be
cached, and across several iterations, this particular in memory cache will be utilized,
and the results will be accessed.

Obviously, this is going to eliminate much of the 90 percent of time spent in I/O
operations, because in-memory operations are utilized most effectively. Hence, newer
applications, like interactive and iterative applications, will execute efficiently here;
that is, up to a hundred times faster compared to MapReduce.

(Refer Slide Time: 06:25)

Introduction. A new model of cluster computing, leading up to Spark, has become widely
popular, in which data-parallel computations are executed on clusters of unreliable
machines by a system that automatically provides locality-aware scheduling, fault tolerance
and load balancing.

These systems achieve their scalability and fault tolerance by providing a programming
model where the user creates acyclic data flow graphs to pass the input data through a set
of operators.

637
(Refer Slide Time: 07:05)

While this dataflow programming model is useful for a large class of applications, there
are some applications that cannot be expressed efficiently as acyclic data flows. One such
class is iterative jobs, where many machine learning algorithms apply a function repeatedly
on the same data set to optimize a parameter.

For example, in a gradient descent machine learning algorithm, or in the PageRank
algorithm, each iteration can be expressed as a MapReduce job, but in MapReduce each job
must reload the data from disk. Spark will show how this repeated disk access can be
avoided.

638
(Refer Slide Time: 07:58)

Interactive analysis: Hadoop is often used to run ad hoc exploratory queries on large
datasets through SQL interfaces. Ideally, the user would be able to load the data set of
interest into memory across a number of machines and query it repeatedly.

(Refer Slide Time: 08:25)

So, the new cluster computing framework, Spark, supports applications with working sets,
while providing scalability, as we have seen. The main abstraction of Spark is the
resilient distributed dataset. Users can explicitly cache an RDD in memory across machines
and reuse it in multiple MapReduce-like parallel operations.

639
RDDs achieve fault tolerance through the notion of lineage; that is, if a partition of an
RDD is lost, the RDD has enough information about how it was derived from other RDDs to be
able to rebuild just that partition. Although RDDs are not a general shared-memory
abstraction, they represent a sweet spot between expressivity on one hand and scalability
and reliability on the other.

(Refer Slide Time: 09:14)

Spark is implemented in Scala, and Spark can be used interactively from a modified Scala
interpreter. Spark is the first system to allow an efficient, general-purpose programming
model to be used interactively to process large datasets on a cluster.

Programming model.

640
(Refer Slide Time: 09:35)

To use Spark, developers write a driver program that implements the high-level control flow
of their application and launches various operations in parallel. For example, this is a
driver program which drives the entire execution; the driver gives the instructions through
its high-level control flow, and there are workers which run in parallel under the Spark
driver program.

Spark provides two main abstractions for parallel programming: resilient distributed
datasets, and parallel operations on these datasets. In addition, Spark supports two
restricted types of shared variables that can be used in functions running on the cluster.

641
(Refer Slide Time: 10:47).

As I explained, the Spark runtime has a driver program, and this driver distributes tasks
to different workers. Three workers are shown here, but there can be many more, as many as
the number of nodes, maybe hundreds or thousands; such workers work for one driver, and the
driver runs one application.

(Refer Slide Time: 11:26)

So, this is the scenario of one application execution. An RDD, a resilient distributed
dataset, is a read-only collection of objects, partitioned across a set of machines, that
can be

642
rebuilt if a partition is lost. The elements of an RDD need not exist in physical storage;
instead, a handle to an RDD contains enough information to compute the RDD starting from
data in reliable storage. This means that RDDs can always be reconstructed if a node fails.

Here we can see that once an RDD is defined, we can perform different transformations on
it. We will see which transformations are supported by Spark; then, on RDDs, various
actions can be performed, which give the final values.

We will see that these actions are important: the transformations defined in between will
not be executed until an action is fired, and the actions give the values, as we are going
to see.

(Refer Slide Time: 12:37)

In Spark, each RDD is represented by a Scala object. There are four ways to construct an
RDD: from a file in a shared file system such as the Hadoop Distributed File System; by
parallelizing a Scala collection in the driver program, which means dividing it into a
number of slices that will be sent to multiple nodes; by transforming an existing RDD,
where a dataset with elements of type A can be transformed into a dataset with elements of
type B using the operation flatMap, which passes each element through a user-provided
function;

643
Other transformations can also be expressed using flatMap. So, flatMap is a transformation
which takes an RDD in one form, let us say of type A, and transforms it to give an RDD in
another form; this is what is called a transformation.

Whether it is flatMap or map, all of these are transformations, defined in Scala, that we
will see.

(Refer Slide Time: 13:52)

The fourth way is by changing the persistence of an existing RDD. By default, RDDs are lazy
and ephemeral; that is, partitions of a dataset are materialized on demand when they are
used in a parallel operation, and discarded from memory after the use is over.

The cache action leaves the dataset lazy, but hints that it should be kept in memory after
the first time it is computed, because it is going to be reused. The save action evaluates
the dataset and writes it to the distributed file system.

644
(Refer Slide Time: 14:26)

If there is not enough memory in the cluster to cache all the partitions of a dataset,
Spark will recompute them when they are used.

(Refer Slide Time: 14:33)

Now, parallel operations: there are several parallel operations that can be performed on
RDDs. One is called reduce, which combines the dataset elements using an associative
function to produce a result at the driver program. Collect sends all the elements of the
dataset to the driver program. Foreach passes each element through a user-provided
function; this is only done for the side effects of the function.

645
You can see that this is the driver program, and these are the RDD partitions on the
workers. As far as the parallel operations are concerned, when collect is invoked, all the
nodes holding a partition of the RDD send their resulting values back to the driver
program; foreach means that at each element of the RDD a user-defined function is invoked;
and reduce likewise combines the elements of the RDD and returns the result to the driver.

(Refer Slide Time: 16:03)

Now, the shared variables. Programmers invoke operations like map, filter and reduce by
passing closures (functions) to Spark. Spark lets the programmer create two restricted
types of shared variables to support two simple but common usage patterns: broadcast
variables and accumulators. Broadcast variables are used when a piece of data has to be
communicated to all the workers. For example, if this is the driver and these are the
workers, broadcast means that the driver communicates the value, through the broadcast
variable, to all the workers. This is the way these shared variables are utilized.

646
(Refer Slide Time: 17:09)

Broadcast variables: if a large read-only piece of data is used in multiple operations, it
is preferable to distribute it to the workers only once, instead of packaging it with every
closure. Spark lets the programmer create a broadcast variable object that wraps the value
and ensures that it is copied to each worker only once, as I explained on the previous
slide.

(Refer Slide Time: 17:30)

Accumulators: these are variables that the workers can only add to, using an associative
operation, and that only the driver can read. They can be used to

647
implement counters, as in MapReduce, and to provide a more imperative syntax for parallel
sums.

Accumulators can be defined for any type that has an add operation and a zero value. Due to
their add-only semantics, they are easy to make fault tolerant.
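
A minimal PySpark-style sketch of these two shared variables (it assumes a SparkContext
named sc already exists, e.g. sc = SparkContext("local", "demo"); the data is made up and
this is an illustration, not code from the lecture):

    lookup = sc.broadcast({"a": 1, "b": 2})     # read-only value, shipped to each worker once
    missing = sc.accumulator(0)                 # workers can only add; the driver reads the result

    def check(word):
        if word not in lookup.value:
            missing.add(1)                      # add-only update from a worker task

    words = sc.parallelize(["a", "b", "c", "a", "d"])
    words.foreach(check)                        # action: runs check on the workers
    print("words not in the lookup table:", missing.value)   # 2 ("c" and "d")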

(Refer Slide Time: 18:02)

RDD operations: as I told you, Spark provides operations on RDDs in the form of
transformations and in the form of actions.

The plan of transformations and actions is denoted by a DAG, that is a directed acyclic
graph, and this DAG is built in the execution engine. For example, starting from the input,
if this kind of DAG is produced, the edges here are called the transformations, and the
nodes are where the actions are eventually applied.

Transformations go from one RDD to another RDD; a transformation is applied on an RDD. For
example, map takes an RDD and does a transformation; filter, sample, union, groupByKey,
reduceByKey, join and cache are likewise performed on RDDs, and they transform an RDD from
one form to another, as we have seen with flatMap earlier. So, flatMap is also a
transformation.

The other kind of operation is called an action. Only when an action is encountered in the
DAG does the execution take place, and the values are returned back to the driver program.
So, it is very important that actions have to be

648
fired. Examples of actions are reduce, collect, count, save and lookupKey. Count, in the
sense of the word count program you have seen in MapReduce, is supported here directly by
Spark.

So, once the count action happens, the DAG which has already been built from the
transformations is automatically executed, and the value is returned back.
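
This lazy-evaluation behaviour can be illustrated with a small PySpark-style sketch (again
assuming a SparkContext sc exists; the numbers are made up for illustration):

    # Transformations only build up the DAG; execution happens when an action fires.
    nums    = sc.parallelize(range(1, 11))
    squares = nums.map(lambda x: x * x)          # transformation (not yet executed)
    big     = squares.filter(lambda x: x > 20)   # transformation (not yet executed)
    big.cache()                                  # hint: keep the result in memory

    print(big.count())     # action -> the DAG executes now; prints 6
    print(big.collect())   # action -> reuses the cached partitions; [25, 36, 49, 64, 81, 100]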

(Refer Slide Time: 21:10)

The transformations. Let us look at flatMap, which, as we have already seen, is similar to
map, but each input can be mapped to zero or more output items. Let me give you an example
of flatMap: if a line is given and flatMap is asked to split on blanks, then the input is
that one line, but flatMap will output all the words in it. These words are the output as
far as flatMap is concerned.

So, the input is one element, but the output is a list of words. flatMap is similar to map,
but each input can be mapped to zero or more output items, as I explained through this
example. Similarly, map, as you know, returns a new distributed dataset formed by passing
each element through a function; that means it has a function which transforms the RDD from
a particular input to a particular output.

649
(Refer Slide Time: 22:55)

Similarly, other transformations are groupByKey, reduceByKey, sortByKey, join, and so on.

(Refer Slide Time: 23:03).

Actions: they are different from transformations. As I told you, actions are very important
because they trigger the entire set of operations that has been built up in the form of the
DAG; Spark then returns the value to the driver, and the driver in turn gets the values and
completes the function.

650
The reduce action aggregates the elements of the dataset using a function. Collect returns
all the elements of the dataset as an array at the driver program. For example, these are
the worker nodes and this is the driver program; when the collect action is invoked, all
the values that have been computed are collected back. So, collect is an action. All the
actions operate on the application's datasets; that is why these actions are very
important, and they are listed here on this particular slide.

(Refer Slide Time: 24:17)

The Spark community is the most active open-source community in big data, because Spark is
used in big data scenarios: it is a platform able to compute over very large datasets, and
it is also up to a hundred times faster compared to Hadoop MapReduce.

You can see that the contributions to Spark are growing. Built-in libraries.

651
(Refer Slide Time: 24:42)

Standard libraries for big data are supported. Big data applications often lack libraries
for common algorithms; Spark's generality and support for multiple languages make it
suitable to offer these, and much of the future activity will be in the form of libraries.

The languages supported by Spark are Python, Scala, Java, R and SQL; machine learning and
graph processing are provided as standard libraries which run over the Spark core.

(Refer Slide Time: 25:16)

652
These are the standard libraries included with Spark: Spark SQL, Spark Streaming for
real-time processing, GraphX for graph applications, and MLlib for machine learning.

(Refer Slide Time: 25:31)

These are all written over here.

(Refer Slide Time: 25:36)

So, GraphX is for graph computations.

653
(Refer Slide Time: 25:39)

Spark Streaming is for large-scale streaming computations. Spark SQL is

(Refer Slide Time: 25:46)

for structured data. Now let us look at some examples in Spark.

654
(Refer Slide Time: 25:52)

Let us see the PageRank computation, done with Spark. As you know, PageRank is an iterative
application and is heavily used at Google to find the ranks of web pages or websites.

It gives each page a rank, that is a score, based on the links pointing to that page. Links
from many pages give a high rank, and a link from a high-rank page to a particular page
also gives a high rank to that page.

Here you can see that a node towards which many links point is going to be a high-rank page
or website. Similarly, the algorithm ranks all the web pages that are linked here.

655
(Refer Slide Time: 27:08)

Let us see the algorithm. This algorithm works in iterations. Initially, all the nodes are
given the rank 1, and then the ranks are recomputed iteratively. Consider this particular
node: it has two outgoing links, so its rank of 1 is divided into contributions of 0.5 and
0.5. Now the rank of a node has to be recomputed from the contributions of the links
pointing to it; as I told you, this changes its rank.

So, the PageRank of this node, and of every other node, now changes. Why? Because the
contributions of the pages have changed in this iteration, and the rank changes according
to the formula on the slide.
656
(Refer Slide Time: 28:12)

You can see that the ranks change in iteration number one. Similarly, a second iteration
happens, this goes on, and finally it stops here.

(Refer Slide Time: 28:26)

Here you can see that this particular node has the highest rank, followed by this one. Why
does it have the highest rank? Because it has links from all three other nodes; and since
this highest-rank node points to this other node, that node also receives a higher rank
compared to the remaining two nodes. Now let us look at the program.

657
(Refer Slide Time: 28:55)

Let us go and see the program. In the Spark program for the PageRank algorithm, the first
two steps simply build the Spark context, called sc. Next come the iterations: you see it
is a for loop, and this for loop performs transformations over RDDs. Here you can see the
links and ranks RDDs, built as (URL, neighbours) and (URL, rank) pairs.

Then the flatMap transformation is performed, which transforms these links and ranks into
contributions, and then another transformation, reduceByKey, is performed. ReduceByKey here
means that the contribution values, each rank divided by the number of outgoing links,
which are produced by the map transformation, are added up; for the same page, all the
contribution values are added, and these sums then give the new rank.

This program is written in Scala and has two parts: the transformations, most of which are
done on RDDs, and then an action. Once the action is performed, the whole chain of
transformations is executed and the values are returned. During these iterations, the
values generated across iterations are kept in memory.
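
A hedged PySpark-style sketch of the same idea follows; the link data, the variable names
and the damping constants 0.15/0.85 are assumptions for illustration, and this is not the
exact Scala program from the slide.

    # links: (url, [neighbour urls]); cached because it is reused in every iteration
    links = sc.parallelize([
        ("a", ["b", "c"]), ("b", ["c"]), ("c", ["a"]),
    ]).cache()
    ranks = links.mapValues(lambda _: 1.0)             # every page starts with rank 1

    for _ in range(10):                                # a fixed number of iterations
        contribs = links.join(ranks).flatMap(
            lambda kv: [(dest, kv[1][1] / len(kv[1][0])) for dest in kv[1][0]])
        ranks = contribs.reduceByKey(lambda a, b: a + b) \
                        .mapValues(lambda s: 0.15 + 0.85 * s)

    print(ranks.collect())                             # action: triggers the whole computation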

658
(Refer Slide Time: 31:22)

Obviously, this program becomes faster; you can see that it is up to a hundred times faster
compared to the Hadoop version of the same program.

(Refer Slide Time: 31:33)

Another program, from machine learning, is logistic regression. This is also carried out
over a number of iterations.

659
(Refer Slide Time: 31:42)

In the code you can see that the map is a transformation: it takes a file, performs a
transformation, and generates an RDD called data. Here also a Spark context is formed, then
further transformations up to a reduce are performed, and finally there is the action,
whose result ends up locally at the driver.

(Refer Slide Time: 32:17)

660
Why? Because the final result is small and is produced locally at the driver. Here also, if
you look at the running time, it is faster compared to the Hadoop version. Now, another
important program is word count.

(Refer Slide Time: 32:40)

Why is word count important? Because the most frequently occurring words in the text, for
example "the", "is" and "or", should not distort the PageRank; they do not contribute
significantly as far as the rank of a particular website is concerned.

(Refer Slide Time: 33:05)

661
Basically, that is why word count is an important application, and it is also supported
here in Spark.

As I told you, the program creates a Spark context, reads a file, and places its contents
in a variable f. The flatMap takes the lines read from the file and splits them on blanks,
so it generates the words. The map transformation then outputs, for every word, the tuple
(word, 1); that means one such tuple is generated for each word. Then reduceByKey, a
transformation on this RDD, is performed: for each word, all of its 1 values are added up.

For a particular word, say "the", however many times it appears, that many ones are added
up by reduceByKey. These are all transformations. Then, when the word count is saved (save
is an action), the entire program is executed, the output is saved, and it returns.
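
As a sketch, the same word count in PySpark style might look like this (the file paths are
placeholders; the lecture's version is written in Scala):

    f = sc.textFile("hdfs:///input/doc.txt")                   # read the file (placeholder path)
    words  = f.flatMap(lambda line: line.split())              # split each line on blanks
    pairs  = words.map(lambda w: (w, 1))                       # one (word, 1) tuple per word
    counts = pairs.reduceByKey(lambda a, b: a + b)             # add up the 1s per word
    counts.saveAsTextFile("hdfs:///output/wordcounts")         # action: triggers execution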

(Refer Slide Time: 34:46)

There are other Spark programs and applications as well, for example Twitter spam
classification and so on.

662
(Refer Slide Time: 34:56)

Various machine learning algorithms are supported. A lot of reading material is available
that can be referred to.

(Refer Slide Time: 35:06)

Conclusion: Spark provides three simple data abstractions for programming clusters:
resilient distributed datasets, and two restricted types of shared variables, namely
broadcast variables and accumulators. While these abstractions are limited, it is found
that they are powerful enough to express several applications, including iterative and
interactive computations. Furthermore, it is believed that the core idea behind RDDs, of a
dataset

663
handle that has enough information to reconstruct the dataset from data available in
reliable storage, is more broadly useful. In this lecture we have discussed the HDFS
components and architecture, and the framework of Spark and its applications.

Thank you.

664
Distributed Systems
Prof. Rajiv Misra
Department of Computer Science and Engineering
Indian Institute of Technology, Patna

Lecture - 24
Distributed Algorithms for Sensor Networks

Distributed Algorithms for Sensor Networks: Introduction. The idea of virtual backbone
routing for ad hoc wireless networks is to operate the routing protocol over a virtual
backbone.

(Refer Slide Time: 00:18)

The purpose of virtual backbone routing is to alleviate the serious broadcast storm problem
suffered by many existing on-demand routing protocols during route detection, and thus
constructing a virtual backbone is very important. In this lecture we study virtual
backbone construction by an approximation algorithm; the backbone is approximated by a
minimum connected dominating set (MCDS) in a unit-disk graph, and computing an MCDS is an
NP-hard problem.

The distributed algorithm for approximating this problem has a performance ratio of 8, as
will be discussed.

665
(Refer Slide Time: 01:15)

Before going ahead, let us define some basic terminology. A sensor network is an ad hoc
network; an ad hoc wireless sensor network has applications in emergency search-and-rescue
operations, decision making on the battlefield, data acquisition in inhospitable terrain,
and so on. Such a network can be formed spontaneously to carry out surveillance, data
acquisition and similar activities.

It is characterized by a dynamic topology, which is infrastructure-less to allow
spontaneous construction. Multihop communication, limited resources and limited security
are other characteristics of such networks. These characteristics pose special challenges
in routing protocol design. Inspired by the physical backbone in a wired network, many
researchers have proposed the concept of a virtual backbone for unicast, multicast and
broadcast in ad hoc networks.

666
(Refer Slide Time: 02:32)

The virtual backbone is mainly used to collect topology information for route detection. It
also works as a backup when a route is temporarily unavailable. An effective and very
popular approach is based on overlaying a virtual infrastructure, termed the core, on an ad
hoc network.

Routing protocols are operated over this core, which is formed by the virtual backbone.
Route-request packets are unicast over the core and a small subset of non-core nodes; no
broadcast is involved in the core path detection.

Before we go ahead, let us take this example of a virtual backbone. The virtual backbone
stands in contrast with the physical backbone of a wired network. In a physical backbone
system, you might see a backbone to which all other nodes are connected. This backbone is a
high-speed backbone in the physical network, maybe an optical fiber network, and the other,
non-backbone nodes communicate through this backbone for their data communication. Inspired
by this physical backbone, infrastructure-less networks like sensor networks and ad hoc
networks construct a similar kind of structure, which is called a virtual backbone.

If the virtual backbone is in place, the nodes which are not in the backbone can
communicate through it. For example, if x wants to communicate with y, then x will send the
message to the backbone, and the backbone node which is closest to y will

667
transmit it to y. So, this particular backbone is very important for facilitating the
routing activity; this backbone will also solve the broadcast storm problem that we are
going to see in the next slide.

(Refer Slide Time: 05:03)

The routing protocols in wireless networks like sensor networks and ad hoc networks can be
classified into two types. The first is called proactive, the other reactive. Proactive
routing protocols ask each node, also called a host, to maintain global topology
information, and thus a route can be provided immediately whenever it is requested. But a
large number of control messages are required to keep each host's routing table updated
with the newest topology changes. The routing table has to be updated whenever even a
slight change in the topology is encountered; this is the proactive routing protocol, where
the routing table has to be kept up to date at all times, whether it is used or not.

In contrast to proactive routing protocols, there is an alternative solution called
reactive protocols. Reactive routing protocols have the feature of on-demand construction
of the routing path: each node computes the route for a specific destination only when it
is necessary. Obviously, in contrast to the proactive ones, reactive routing protocols are
affordable and can be used in resource-constrained networks. Topology changes which do not
influence
668
the route do not trigger any route maintenance function, and thus the communication
overhead is lower.

So, compared to proactive routing protocols, reactive routing protocols are very popular
for ad hoc networks such as sensor networks.

(Refer Slide Time: 07:08)

On-demand routing protocols attract much attention due to their better scalability and
lower protocol overhead, as I have just discussed, but let us look at what is more
problematic in this kind of setting. Most of these on-demand routing protocols, which
construct the routing path on demand, use flooding to discover routes, and this flooding
suffers from the broadcast storm problem. Broadcast storm refers to the fact that excessive
flooding may result in excessive redundancy, excessive contention and excessive collisions.

This problem of excessive redundancy, contention and collisions is called the broadcast
storm problem; it causes high protocol overhead and interference to other ongoing
communication channels. Due to the broadcast storm problem, these flooding-based on-demand
routing protocols require more innovative solutions. On the other hand, the unreliability
of broadcast obstructs the detection of the shortest path, or the protocol simply cannot
detect any path at all even when one exists.

669
(Refer Slide Time: 08:29)

The problem, then, is efficiently constructing a virtual backbone for ad hoc wireless
networks such as sensor networks.

The alternative solution to the broadcast storm problem is to maintain a virtual backbone,
quite similar to the physical backbone, but in an infrastructure-less network like a sensor
network. The question is how to efficiently construct and maintain this virtual backbone,
which can then be used for routing purposes in the network. Constructing a virtual backbone
efficiently becomes an important task, which we are going to see in this part of the
lecture. The number of hosts forming the virtual backbone must be as small as possible to
decrease the protocol overhead; this is the most important issue in designing the virtual
backbone for such networks.
backbone for such networks.

The algorithm must also be time and message efficient due to the resource scarcity. We use
a connected dominating set to approximate the virtual backbone. We will see how this
connected dominating set approximates the virtual backbone; once the virtual backbone is in
place, routing becomes more efficient and does not suffer from problems like the broadcast
storm problem.

670
(Refer Slide Time: 10:02)

For that, let us see the assumptions. We assume a given ad hoc network instance which
contains n different hosts; each host is on the ground and is mounted with an
omni-directional antenna, which facilitates the communication network among these nodes.
The transmission range of a host can be modelled as a disk of a particular radius R;
because the antenna is omni-directional, the host can communicate within a circle of radius
R around it.

Thus, the footprint of an ad hoc wireless network is a unit disk graph. When the disks of
two nodes overlap, that is, when they are within communication range of each other, this
can be considered an edge of a graph, and the nodes are the vertices; they form a graph in
this manner, and this graph is called a unit disk graph. So, in such networks, called ad
hoc networks or sensor networks, the graph which is constructed is a unit disk graph. In
brief we can call it a graph, but before that let us see what the unit disk graph structure
is.

671
(Refer Slide Time: 11:41)

In graph-theoretic terminology, the network topology is a graph G = (V, E), where V
contains all the hosts and E is the set of links; a link exists between two nodes if their
unit disks overlap, that is, if the two nodes are within communication range of each other.

We also assume that these links are bidirectional, that is, both endpoints of an edge can
communicate with each other wirelessly; we assume such bidirectional links are established
using the omni-directional antennas in the wireless sensor network. From now on, we use the
terms host and node interchangeably to represent a wireless mobile node.

672
(Refer Slide Time: 12:34)

There are existing distributed algorithms for the MCDS (minimum connected dominating set)
problem. There are various metrics, like message complexity, time complexity, message
length, the information each node needs, and the cardinality of the resulting set; based on
these, different algorithms have given their best results, which we will discuss a bit
later.

(Refer Slide Time: 13:04)

Now, let us go ahead with the preliminaries of the graph which is formed out of the unit
disks and is called the unit disk graph. We are going to look at some preliminaries

673
from a graph-theoretic point of view, because the footprint of an ad hoc network or sensor
network is such a graph.

So, let us see what properties this graph has. Let us take the preliminaries: let G = (V, E)
be a graph. Two vertices are said to be independent if they are not neighbors, that is,
there is no edge between them.

An independent set S(G) ⊆ V is a subset of vertices such that for any pair of vertices u, v
in S, the edge (u, v) is not in the graph; in other words, no two vertices of S are adjacent.
So, again I am repeating: an independent set S(G) ⊆ V is a set of vertices of G such that
any pair of vertices u, v in S does not have an edge in the graph, or equivalently they are
independent; such a subset of pairwise independent vertices is called an independent set.
Now this independent set is maximal if every vertex which is not in S, that is, every vertex
in V − S, has a neighbor in S; such a set is called a maximal independent set (MIS).

A dominating set D of G is a subset of vertices such that any node which is not in D has
at least one neighbor in D. If the subgraph induced by D is connected, then D is a
connected dominating set; otherwise D is just a dominating set. Now, among all connected
dominating sets in the graph, the one with minimum cardinality is called the minimum
connected dominating set, or MCDS. So, this is the definition of a minimum connected
dominating set.
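To make these definitions concrete, here is a minimal Python sketch (not part of the lecture; the example graph and function names are illustrative) that checks whether a given vertex subset is an independent set, a dominating set, or a maximal independent set on a small adjacency-list graph.

```python
def is_independent_set(adj, S):
    """No two vertices of S may share an edge."""
    S = set(S)
    return all(v not in adj[u] for u in S for v in S if u != v)

def is_dominating_set(adj, D):
    """Every vertex outside D must have at least one neighbor in D."""
    D = set(D)
    return all(u in D or any(v in D for v in adj[u]) for u in adj)

def is_maximal_independent_set(adj, S):
    """Independent, and every vertex outside S has a neighbor in S,
    so no vertex can be added without breaking independence."""
    return is_independent_set(adj, S) and is_dominating_set(adj, S)

if __name__ == "__main__":
    # Hypothetical 5-node cycle: 0-1, 1-2, 2-3, 3-4, 4-0
    adj = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 0}}
    print(is_maximal_independent_set(adj, {0, 2}))   # True
    print(is_dominating_set(adj, {0, 2}))            # True: 0 and 2 dominate 1, 3, 4
    print(is_independent_set(adj, {0, 1}))           # False: 0 and 1 are adjacent
```

Note that, as the lecture points out shortly, a maximal independent set is automatically a dominating set, which is why the maximality check above reuses the domination test.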

674
(Refer Slide Time: 16:35)

Computing an MCDS in a unit disk graph is an NP-hard problem; that is, the construction
of an MCDS in a unit disk graph is NP-hard. Note that the problem of finding an MCDS in a
graph is equivalent to the problem of finding a spanning tree with the maximum number of
leaves: the non-leaf nodes of such a spanning tree form the MCDS. Hence the maximum-leaf
spanning tree problem is also NP-hard.

Note also that an MIS is a dominating set. Now, a graph G is a unit disk graph if for
every edge e we have length(e) < 1, that is, all edges have length less than 1. Take this
example: if u and v have an edge and we draw a disk of radius 1 around u and another disk
of radius 1 around v, then the edge between them has length less than 1. If every edge in
the graph is formed in this way, the graph is called a unit disk graph, where the lengths
of all edges are less than 1.
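As a small illustration (not from the lecture; the coordinates are made up), the following sketch builds the unit disk graph of a set of 2-D points: an edge is added whenever the Euclidean distance between two points is less than 1.

```python
import math

def unit_disk_graph(points):
    """Adjacency list: edge (i, j) exists iff dist(points[i], points[j]) < 1."""
    adj = {i: set() for i in range(len(points))}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            (x1, y1), (x2, y2) = points[i], points[j]
            if math.hypot(x1 - x2, y1 - y2) < 1.0:
                adj[i].add(j)
                adj[j].add(i)
    return adj

if __name__ == "__main__":
    # Hypothetical host positions on the ground
    hosts = [(0.0, 0.0), (0.6, 0.2), (1.4, 0.3), (2.5, 2.5)]
    print(unit_disk_graph(hosts))
    # {0: {1}, 1: {0, 2}, 2: {1}, 3: set()}  -- host 3 is out of everyone's range
```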

675
(Refer Slide Time: 18:14)

Now, we are going to discuss the algorithm to construct a minimum connected
dominating set.

(Refer Slide Time: 18:36)

Algorithm description: initially each node is colored white, a dominator is colored black
and a dominatee is colored gray. We assume that each vertex knows its distance-1 neighbors
and their effective degrees. This information can be collected by periodic or event-driven
hello messages; the effective degree of a node is the total number of its white neighbors.

676
(Refer Slide Time: 19:00)

Here one host is designated as the leader; this is a realistic assumption. For example, if
a leader is not already available, it is possible to elect one using a distributed leader
election algorithm, which adds an additional time complexity of O(n) and, with the best
possible algorithm, a message complexity of O(n log n). Hence assuming an existing leader
is a realistic assumption; let s be the leader given to this particular algorithm.

(Refer Slide Time: 19:46)

677
Let us see the algorithm. This algorithm works in 2 phases. Phase 1 goes like this: host s
colors itself black and broadcasts a message Dominator. It broadcasts within its disk of
radius 1, and this particular message, called Dominator, will be received by its neighbors.

(Refer Slide Time: 20:07)

Now, any host u which lies inside this radius, that is, which is a neighbor of node s, will
receive this Dominator message. These are white hosts; on receiving a Dominator message for
the first time from some node v, a white host colors itself gray.

Having colored itself gray, it broadcasts the message Dominatee. This message travels
outward, beyond the range of node s, and a white node receiving at least one Dominatee
message becomes active. So, the white nodes in that region, on receiving the Dominatee
message, become active. An active white host whose pair (d*, id) is the highest among all
of its active white neighbors colors itself black, where d* is the effective degree and the
id is used to break ties so that the pair forms a total order.

That means, among those white neighbors, the one with the highest (d*, id) colors itself
black and broadcasts a Dominator message. A white host decreases its effective degree by 1
and broadcasts a Degree message whenever it receives a Dominatee message.

678
So, at this point the effective degree of the node is being updated. The Degree message
contains the sender's current effective degree; it is broadcast so that all the white nodes
in turn update their information, and the correct, up-to-date effective degree value is
known in the neighborhood of every white node.

Now, each gray vertex broadcasts the message NumberOfBlackNeighbors when it detects that
none of its neighbors is white, and phase 1 terminates when no white vertex is left.
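The distributed rules above can be mimicked by a simple centralized simulation. The sketch below is only an illustration (it assumes a connected graph and processes one node at a time instead of exchanging messages; the function names are made up): it repeatedly colors black the active white node with the highest (effective degree, id) pair, grays its white neighbors, and stops when no white node remains. The black nodes then form a maximal independent set.

```python
def phase1_mis(adj, leader):
    """Centralized simulation of phase 1: returns (black, gray) node sets."""
    color = {v: "white" for v in adj}
    color[leader] = "black"
    for u in adj[leader]:
        color[u] = "gray"

    def eff_degree(v):                       # d* = number of white neighbors
        return sum(1 for u in adj[v] if color[u] == "white")

    def active(v):                           # white and has received a Dominatee
        return color[v] == "white" and any(color[u] == "gray" for u in adj[v])

    while any(color[v] == "white" for v in adj):
        candidates = [v for v in adj if active(v)]
        # The active white node with the highest (d*, id) colors itself black;
        # its white neighbors turn gray (they have received its Dominator).
        v = max(candidates, key=lambda v: (eff_degree(v), v))
        color[v] = "black"
        for u in adj[v]:
            if color[u] == "white":
                color[u] = "gray"

    black = {v for v, c in color.items() if c == "black"}
    return black, {v for v, c in color.items() if c == "gray"}

if __name__ == "__main__":
    # The cycle graph used earlier, with node 0 as the leader s
    adj = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 0}}
    print(phase1_mis(adj, 0))   # ({0, 3}, {1, 2, 4})
```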

(Refer Slide Time: 23:03)

Phase 2: when s receives the NumberOfBlackNeighbors message from all of its gray neighbors,
it starts phase 2 by broadcasting the message M. A host is ready to be explored if it has no
white neighbors. A Steiner tree is used to connect all the black hosts generated in phase 1;
the idea is to pick those gray vertices which connect to many black neighbors. Let us
understand these steps of phase 2.

Let us say that phase 1 starts with the leader s; in its neighborhood some gray nodes are
formed, and beyond those gray nodes lie the white nodes. Among these white nodes, the ones
having the highest effective degree become black nodes, and when such a node sends the
Dominator message its white neighbors become gray nodes as well. Phase 1 finishes, and
phase 2 begins, only when there is no white node left.

679

Now, when phase 2 begins, each of these gray nodes knows how many black nodes are in its
neighborhood; for example, one particular gray node may have 2 black neighbors, another
gray node some other number of black neighbors, and so on. These gray nodes inform s of how
many black neighbors they have; once s has this information, it sends the message M to
start phase 2. A host is to be explored only if it has no white neighbors, and as you can
see, when phase 1 ends there are no white nodes left.

Now, a Steiner tree will be constructed to connect all the black nodes using those gray
nodes which can connect 2 or more black nodes. So, in phase 2 some of these gray nodes will
become black nodes: the gray nodes which have the maximum number of black neighbors are
blackened, and they form the Steiner tree construction which connects all the black nodes.
Thus phase 1 forms a dominating set, and phase 2, out of the gray nodes, constructs the
connected dominating set.

That means the connection among the dominators of the dominating set is established using a
tree, called a Steiner tree, which connects all the black nodes. Now, if there are c
different black nodes, then at most (c - 1) additional connector nodes are required to join
them into a connected tree, just as a tree on c nodes needs (c - 1) edges.

680
(Refer Slide Time: 27:58)

So, the Steiner tree construction adds at most (c - 1) nodes, and it is carried out through
a modified distributed depth first search spanning tree construction; that is the phase 2
approach.

Let us now go quickly through phase 2. A black vertex without any dominator is active.
Recall that the black nodes are independent, so they are not connected to each other; when
a black node gets connected through another (gray) node, that node becomes its dominator.
At the end of phase 1 the black nodes do not have dominators, and as long as a black node
has no dominator it is active. So, all the black nodes without a dominator are active;
initially, after the end of phase 1, no black vertex has a dominator and all the hosts are
unexplored.

Phase 2 will connect those black nodes by blackening some gray nodes, which then serve as
dominators, so that the black nodes become dominated. The message M contains a field next
which specifies the next host to be explored in the DFS formation. A gray vertex with at
least 1 active black neighbor is effective.

681
(Refer Slide Time: 29:47)

Effective means they are potential candidates for being blackened and will act as
dominators for the other black nodes. If M is built by a black vertex, its next field
contains the id of the unexplored gray neighbor which connects to the maximum number of
active black hosts.

If M is built by a gray vertex, its next field contains the id of an unexplored black
neighbor. So, both the black and the gray vertices are able to find dominators for the
black nodes. Any black host u receiving the message M for the first time from a gray vertex
v sets its dominator to v and announces this by broadcasting the message Parent.

682
(Refer Slide Time: 30:33)

So, when a host u receives a message M from v that specifies u to be explored next, then if
none of u's neighbors is white, u colors itself black, sets its dominator to v and
broadcasts its own message M; otherwise, u defers this operation until none of its
neighbors is white. Any gray vertex receiving the message Parent from a black vertex
broadcasts the message NumberOfBlackNeighbors, which contains its number of active black
neighbors.

A black node becomes inactive after its dominator flag is set; a gray node becomes
ineffective when none of its black neighbors is active.

683
(Refer Slide Time: 31:13)

So, a gray node without an active black neighbor, or a black node without an effective gray
neighbor, will send the message Done to the host which activated its exploration, that is,
to its dominator. When s, the leader or initial node, receives Done and has no effective
gray neighbors, the algorithm terminates with the construction of a connected dominating
set.

(Refer Slide Time: 31:47)

Now the complexity. Phase 1 sets the dominators for all gray vertices; phase 2 may modify
the dominator of some gray vertices.

684
As I told you, phase 1 constructs an MIS and phase 2 adds the additional Steiner tree
nodes; together they form the total set of nodes of the connected dominating set, and we
are going to count how many nodes there are. The main job of phase 2 is to set the
dominator for each black vertex; all the blackened vertices, whether they come through the
Steiner tree or through the MIS, together constitute the CDS, and that is what is
represented here. In phase 1 each host broadcasts each of the messages Dominator and
Dominatee at most once, as we have seen.

The message complexity is dominated by the Degree messages, because each such message is
sent to all the neighbors. Since a host may broadcast the Degree message up to ∆ times,
where ∆ is the maximum degree of the graph, the message complexity of phase 1 is O(n·∆)
and the time complexity of phase 1 is O(n).

(Refer Slide Time: 33:30)

So, this computation of the time and message complexities is stated as a theorem.

685
(Refer Slide Time: 33:53)

Now, let us recall a lemma which we have seen earlier: the size of an MIS is at most
4·opt + 1, where opt is the size of the optimal (minimum) connected dominating set. This is
Lemma 1.

Using this result, Lemma 2 says that if there are c black nodes after phase 1, then at most
c - 1 gray hosts will be blackened in phase 2; that is, the Steiner tree adds at most c - 1
nodes. With a further refinement of the construction, if a blackened gray node can connect
3 different black nodes, then for c black hosts at most c - 2 gray nodes are required to
connect all the black nodes. That means the total size is at most c + (c - 2) = 2c - 2,
where c is the size of the MIS and (c - 2) is the size of the Steiner tree part.

Now, substituting the bound on the size of the MIS from Lemma 1, the total becomes
2(4·opt + 1) - 2 = 8·opt + 2 - 2 = 8·opt, where opt is the size of the minimum connected
dominating set. That means this algorithm is an 8-approximation algorithm, which is what
has been proved in this part of the discussion.
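Written out as a small derivation (with c denoting the number of black nodes, i.e., the MIS size), the bound above is:

```latex
\begin{align*}
|\mathrm{CDS}| &\le \underbrace{c}_{\text{MIS (phase 1)}}
               + \underbrace{(c-2)}_{\text{Steiner nodes (phase 2)}}
               = 2c - 2 \\
               &\le 2(4\,\mathrm{opt} + 1) - 2
               \qquad \text{(by Lemma 1, } c \le 4\,\mathrm{opt} + 1\text{)} \\
               &= 8\,\mathrm{opt}.
\end{align*}
```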

(Refer Slide Time: 36:45)

686
Further references: for more details of the minimum connected dominating set using other
heuristics, called collaborative heuristics for ad hoc sensor networks, you can refer to
the paper by Dr. Rajiv Misra in IEEE Transactions on Parallel and Distributed Systems,
published in 2010.

(Refer Slide Time: 37:10)

Conclusion: in this lecture we have discussed a distributed algorithm which computes a
connected dominating set of small size. We have discussed how to find a maximal independent
set, and then the use of a Steiner tree to connect all the vertices in the set. This
algorithm gives a performance ratio of 8. The future scope of this
687
algorithm is to study the problem of maintaining the connected dominating set in such
network environments.

Thank you.

688
Distributed Systems
Prof. Rajiv Misra
Department of Computer Science and Engineering
Indian Institute of Technology, Patna

Lecture - 25
Authentication In Distributed Systems

(Refer Slide Time: 00:13)

Authentication in distributed systems.

(Refer Slide Time: 00:16)

689
Introduction: a distributed system is susceptible to a variety of security threats. A
principal can impersonate another principal, so authentication becomes an important
requirement. Principal here means an entity which participates or communicates with others;
for example, a client and a server can both be principals, and so can hosts, machines,
users, and so on. So, principals can impersonate other principals, and authentication
becomes an important requirement in this susceptible scenario of a distributed system.

So, authentication is a process by which one principal verifies the identity of another
principal. For example, the client can verify the identity of the server, that is, whether
it really is the server with which it intends to communicate. Similarly, the server may
also want to verify whether it is communicating with the intended client. When both sides
do this, it is called mutual authentication.

So, authentication is a process by which one principal verifies the identity of the other
principal; if both principals verify each other's identities it is called mutual
authentication, otherwise it is one-way authentication. In one-way authentication only one
principal verifies the identity of the other; for example, if only the client wants to
verify the identity of that specific server, or only the server wants to verify the
identity of the client, it is one-way authentication. In mutual authentication both
communicating principals verify each other, as I explained.

Background and definitions: authentication is a process of verifying that a principal's
identity is as claimed, so there are two things here, one is the verification and the other
is the identity.

690
(Refer Slide Time: 03:04)

So, a principal presents an identity, and the identity as claimed has to be verified; when
both these things are satisfied it is called authentication.

Again, I am repeating it because both terms appear in this definition of authentication:
authentication is a process of verifying that the principal's identity is as claimed.
Authentication is based on possession of some secret information in the distributed system,
like a password, known only to the entities participating in the authentication; when one
entity wants to authenticate another entity, the former will verify whether the latter
possesses the knowledge of that secret information, like the password.

A simple classification of authentication protocols: classified based on the cryptographic
techniques which are used, the authentication protocols are as follows.

691
(Refer Slide Time: 04:16)

There are 2 kinds of cryptographic techniques: symmetric key cryptography, also popularly
known as private-key cryptography, and asymmetric key cryptography, also called public-key
cryptography. Based on which of these two types of cryptographic techniques is used, we
will see the classification of authentication protocols later on.

So, first let us distinguish the two techniques. The first type, symmetric cryptography,
uses a single private key to both encrypt and decrypt the data. Asymmetric cryptography,
also called public-key cryptography, uses a secret key that must be kept hidden from
unauthorized users, and another key, called the public key, which is made public and can be
used for encryption.

Data encrypted with the public key can be decrypted only with the corresponding private
key. So, a pair of keys is involved in asymmetric cryptography, the secret (private) key
and the public key; together these are the keys used in asymmetric cryptography. Data
encrypted with the public key can be decrypted only with the corresponding private key, and
data signed with the private key can only be verified with the corresponding public key.

692
(Refer Slide Time: 06:28)

So, in a symmetric cryptosystem an authentication protocol can be designed using the
following principle: if a principal can correctly encrypt a message using a key that the
verifier believes is known only to the principal with the claimed identity, this act
constitutes sufficient proof of identity.
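As a tiny illustration of this principle (a sketch only, not any standard protocol; the names and parameters are made up, and it uses a keyed MAC rather than encryption, which demonstrates the same proof-of-possession idea), the verifier sends a random challenge and checks that the claimant computes a MAC over it with the shared secret key:

```python
import hmac, hashlib, os

SHARED_KEY = b"known-only-to-Alice-and-the-verifier"   # assumed pre-shared secret

def claimant_response(key, challenge):
    # Proves knowledge of `key` without revealing it on the network.
    return hmac.new(key, challenge, hashlib.sha256).digest()

def verifier_check(key, challenge, response):
    expected = hmac.new(key, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)

challenge = os.urandom(16)                 # fresh random challenge prevents replay
response = claimant_response(SHARED_KEY, challenge)
print(verifier_check(SHARED_KEY, challenge, response))   # True -> identity accepted
```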

So, let us go ahead for the case studies.

(Refer Slide Time: 06:48)

So, in this lecture we are going to see two different case studies: one is called the
Kerberos protocol and the other is called the secure socket layer (SSL) protocol. First,
the Kerberos protocol.

693
(Refer Slide Time: 07:02)

So, Kerberos primarily addresses client-server authentication using symmetric cryptography.
Kerberos is an authentication system designed for MIT's Project Athena. The goal of Project
Athena was to create an educational computing environment based on high-performance
workstations, high-speed networks and servers of various types. Researchers envisioned a
large-scale (10,000 workstations to 1,000 servers) open network computing environment in
which individual workstations could be privately owned and operated; therefore, a
workstation cannot be trusted to identify its user correctly to the network services.
Kerberos is not a complete solution to the authentication required for secure distributed
computing in general; it only addresses the issue of client-server interactions.

Here we will discuss Kerberos authentication protocol.

694
(Refer Slide Time: 08:06)

Kerberos design is based on the use of a symmetric cryptosystem together with a trusted
third-party authentication server. Before we go ahead, let us see where this Kerberos
system is useful in a distributed system. Kerberos is primarily used in distributed
systems, where, as you know, there are clients and servers. A client may want to use
services offered by servers, and there may be different types of services available in a
distributed system, for example payment services, file services and so on.

If a client wants to access these services provided by different servers, the important
part is that the client has to produce its credentials, typically in the form of a
password. The password would then have to flow through the network, and the network is
unsafe if the password flows openly; even if it is encrypted, an eavesdropper can tap the
password and replay it later. So, there are various issues about how these services can be
offered without the passwords flowing on the network. For this, the Kerberos authentication
protocol provides authentication between client and server without exposing the passwords
on the network.

You will see in the discussion of this protocol that the client and server are
authenticated with the help of passwords, but the passwords do not flow on the network, and
yet the authentication is completed. The protocol was designed

695

for that specific purpose; it has since become a standard and is used in most systems which
operate over a network, that is, in distributed systems. This Kerberos authentication
protocol has 2 components: the first component is called the authentication server, also
called the Kerberos server; the other component which helps provide the authentication is
called the ticket granting server, or TGS.

So, we will see how, using the Kerberos authentication server and the ticket granting
server, the client and server can be authenticated without the password flowing on the
network. There is a process called initial registration, where each client registers with
the Kerberos server.

(Refer Slide Time: 11:56)

So, the client has to register with the Kerberos server by providing its user id and
password. The Kerberos server computes a key for that particular user or client, called Ku,
by applying a one-way hash function to the password, and it stores this key in a database.
Thus the Kerberos authentication server maintains a database of all registered clients, and
for each user the password is stored in the form of this one-way hashed key in the
Kerberos database.
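A minimal sketch of this registration step (illustrative only; the salt handling and PBKDF2 parameters are assumptions, not Kerberos' actual string-to-key algorithm): the server derives Ku from the password with a one-way function and stores only the derived key, never the password itself.

```python
import hashlib, os

KERBEROS_DB = {}   # user_id -> (salt, derived key Ku)

def register(user_id, password):
    salt = os.urandom(16)
    # One-way function: the password cannot be recovered from Ku.
    ku = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    KERBEROS_DB[user_id] = (salt, ku)

def derive_ku(user_id, password):
    # In this simulation the salt is looked up so the same Ku can be recomputed
    # from the typed password at login time.
    salt, _ = KERBEROS_DB[user_id]
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

register("alice", "correct horse battery staple")
print(derive_ku("alice", "correct horse battery staple") == KERBEROS_DB["alice"][1])  # True
```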

So, authentication in Kerberos proceeds in 3 different steps. The first step is initial
authentication at the login.

696
(Refer Slide Time: 13:04)

In the first step the Kerberos server authenticates the user login at the host and installs
a ticket for the ticket granting server at the login host; the second step is to obtain a
ticket for the server; and the third is requesting service from the server. The term
appearing in all 3 steps is the ticket. The Kerberos authentication protocol does not
expose the user password on the network; rather, it is a ticket-based system. The Kerberos
authentication server issues a ticket for a particular session to the client, the client
shows that ticket to the server, the server can thereby establish the authentication, and
the client and server can communicate thereafter using the secret key which is shared in
these stages of the protocol.

How the shared key between the client and server is exchanged with the help of the Kerberos
server, what the ticket is, and how the ticket solves the problem of not exposing the
passwords on the network, we will see next. All 3 steps involve tickets, and we are going
to see how the tickets are generated without exposing the password. The steps are shown
here. As for the components, I explained that Kerberos contains the Kerberos authentication
server.

697
(Refer Slide Time: 14:49)

The second component in this Kerberos authentication protocol is the ticket granting
server, and the third party involved is the actual server; the client is the one that wants
to communicate with the server.

This process performs the authentication, and notice that nowhere does the password flow on
the network. The user sends a request for a ticket granting ticket, supplying only his
username; the authentication server returns the ticket and a session key, and the client
then uses its password locally at the host to decrypt this ticket and session key, because
the client is already registered with Kerberos with its user id and password. So, the
password does not flow; the client extracts the ticket and session key and then makes a
request to the ticket granting server for a ticket to the particular server.

The ticket granting server, through the Kerberos-issued ticket and the authenticator,
verifies the identity of the client and then issues a ticket and a session key for the
particular server. The client uses this ticket and session key to request the service
directly from the server, and the server authenticates it. In this process a secret key is
exchanged between the client and the server, and thereafter all the messages the client
sends are encrypted with that secret key.

698
(Refer Slide Time: 17:16)

So, the information flowing on the network will be secured, giving secure communication.
Having explained the real application and use of the Kerberos authentication process, let
us see all three steps of the Kerberos authentication.

Initial authentication at the login uses the Kerberos server and is shown in the algorithm.
Let U be a user who is attempting to log on to a host. The user gives its identity not in
the form of a password; it only has to identify the user name to the host. Let us say it is
U who wants to use the Kerberos authentication service or wants to use the server, but has
to be authenticated with the help of the Kerberos authentication protocol. Note that the
user and the host are two different things: in a standalone system the user can simply
enter its login and start accessing the host.

Now, in a distributed system the host is connected over the network to several other
machines. The user gives its identity to the host, and the host sends this user identity,
along with a request for the ticket granting server service, to the Kerberos server.
Kerberos retrieves Ku and K_TGS from its database based on the username record stored
there, generates a new session key K, and creates a ticket granting ticket; the ticket
generated by Kerberos is encrypted with K_TGS,

699

the secret key of the ticket granting service, and the whole reply is given to the host.
The outer message is encrypted with Ku, the key derived from U's password, so after
receiving it the host prompts the user for the password and uses it to decrypt the message
sent by Kerberos, recovering the session key and the ticket for the ticket granting
service; if the decryption fails then the login also fails.

So, this is the initial authentication at the login step.
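The following sketch simulates this first step (a toy illustration only, not the real Kerberos message formats; it uses Fernet symmetric encryption from the third-party `cryptography` package, and names such as `kerberos_login_request` are made up). The Kerberos server returns {TGS, K, T, L, ticket_TGS} encrypted under Ku, and the host recovers the session key only if the user types the correct password.

```python
import base64, hashlib, json, time
from cryptography.fernet import Fernet

def password_to_key(password, salt=b"realm-salt"):
    """One-way derivation of Ku from the user's password (toy parameters)."""
    raw = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return base64.urlsafe_b64encode(raw)

# --- Kerberos server state (set up at registration time) ---
K_TGS = Fernet.generate_key()                       # shared by Kerberos and the TGS only
USER_DB = {"alice": password_to_key("alice-password")}

def kerberos_login_request(user, tgs_id="TGS"):
    """Step 1: Kerberos builds the TGT and wraps the whole reply under Ku."""
    ku = USER_DB[user]
    k_session = Fernet.generate_key()               # new session key K
    t, lifetime = time.time(), 3600
    tgt = Fernet(K_TGS).encrypt(json.dumps(
        {"user": user, "tgs": tgs_id, "K": k_session.decode(),
         "T": t, "L": lifetime}).encode())
    reply = {"tgs": tgs_id, "K": k_session.decode(), "T": t, "L": lifetime,
             "TGT": tgt.decode()}
    return Fernet(ku).encrypt(json.dumps(reply).encode())

def host_login(user, typed_password, reply_from_kerberos):
    """The host derives Ku from the typed password; a wrong password makes
    decryption (and therefore the login) fail."""
    ku = password_to_key(typed_password)
    data = json.loads(Fernet(ku).decrypt(reply_from_kerberos))
    return data["K"], data["TGT"]                   # session key and TGT for later use

reply = kerberos_login_request("alice")
print(host_login("alice", "alice-password", reply)[0][:10], "...")   # succeeds
```

Note that the password itself never appears in any message; only the username travels to Kerberos, and the reply is useless without the key derived locally from the password.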

(Refer Slide Time: 20:17)

It is explained again here in this slide. In step 1, user U initiates login by entering his
username; no password is given here. In step 2, the login host forwards the login request
and the id of the ticket granting service to Kerberos. No password is sent; only the login
request and the id of the ticket granting service are sent to Kerberos over the network.
So, the network message does not disclose the password, which is what you have to observe.

(Refer Slide Time: 21:06)

700
In step 3, the Kerberos server retrieves the secret key of U and also the key of the ticket
granting service from the database, generates a new session key K, and creates a ticket
granting ticket as I explained. Here U is the identity of the user who wishes to
communicate, TGS is the identity of the ticket granting server, K is the session key, T is
the timestamp, L is the ticket's lifetime, and K_TGS is the key shared between the TGS and
Kerberos, that is, the secret key between Kerberos and the ticket granting service; the
ticket granting ticket is encrypted with K_TGS. Then in step 4, the Kerberos server
encrypts the ticket granting ticket, the identity of the TGS, the session key and so on,
and gives all of this to the host.

In step 5, on receiving this message from the Kerberos server, host H prompts the user for
his password, uses that password to decrypt the message sent by Kerberos, and recovers the
session key K which it is going to use. The session key is retrieved out of this message,
and thus the user is authenticated if the host is able to decrypt the message received from
Kerberos. Upon successful authentication the host saves the new session key K and the
ticket granting ticket for further use and erases the password from memory.

701
(Refer Slide Time: 22:55)

So, the use of the password is over, and the session key and the ticket granting ticket are
both stored. The ticket granting ticket is used to request server tickets from the ticket
granting service; note that this ticket is encrypted using K_TGS, the key shared between
the TGS and the Kerberos server, so the user will not be able to decrypt this part, and it
is forwarded as it is, without decryption.

(Refer Slide Time: 23:37)

This is how the client now communicates with the ticket granting service. As I told you,
the encrypted ticket for the TGS goes as it is. The client executes

702

the first step shown in the algorithm to request the ticket for the server from the TGS:
the client sends the ticket_TGS to the TGS, requesting a ticket for the server S, and the
ticket granting service issues the ticket for that particular server session. So, in the
first step the client sends a request to the ticket granting service through this message;
part of the message is the ticket that was issued by the Kerberos authentication server, so
the ticket granting service can verify that this is a client which has been authorized by
the Kerberos server, and the name of the requested server is encrypted and sent along.

The ticket granting service, after receiving this message, recovers K, the session key,
from the ticket_TGS by decrypting it with K_TGS, the secret key shared between Kerberos and
the ticket granting service. Then it recovers T1, the timestamp, from the client's
authenticator, which it decrypts with that session key, and checks the timeliness of T1
with respect to its local clock; if the lifetime of the ticket is over, the client is not
allowed to establish a session. This is done to avoid replay attacks. The TGS then
generates a new session key and creates the server ticket ticket_S, encrypted with the
server's secret key; it contains the new session key, the client's credentials, the
server's credentials, the timestamp and the lifetime of the ticket. The TGS sends this
reply to C, encrypted with the session key.

Now, C recovers the new key and ticket_S, the ticket for the server sent by the ticket
granting service, by decrypting the message with its own session key K. Because a ticket is
susceptible to interception and replay, it does not by itself constitute sufficient proof
of identity.

703
(Refer Slide Time: 26:43)

For authentication, the principal presenting a ticket must also demonstrate knowledge of
the session key K named in the ticket. This is done with the authenticator, which contains
a timestamp (to avoid replay attacks) and the client's identity, encrypted with K; the
authenticator is also presented to the ticket granting service.

So, the authenticator, encrypted under the session key K, provides this demonstration.
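A toy sketch of such an authenticator (illustrative assumptions only: Fernet for the encryption, a 5-minute freshness window): the verifier decrypts it with the session key it recovered from the ticket and rejects stale timestamps.

```python
import json, time
from cryptography.fernet import Fernet

MAX_SKEW = 300          # accept authenticators at most 5 minutes old (assumption)

def make_authenticator(session_key, client_id):
    body = {"client": client_id, "T1": time.time()}
    return Fernet(session_key).encrypt(json.dumps(body).encode())

def verify_authenticator(session_key, authenticator, expected_client):
    body = json.loads(Fernet(session_key).decrypt(authenticator))
    fresh = abs(time.time() - body["T1"]) < MAX_SKEW    # replay protection
    return fresh and body["client"] == expected_client

k = Fernet.generate_key()                     # session key shared via the ticket
auth = make_authenticator(k, "alice")
print(verify_authenticator(k, auth, "alice"))   # True
```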

(Refer Slide Time: 27:35)

In step 1, to request a ticket for the server, the client C presents its ticket granting
ticket along with the authenticator that I explained. In step 2, the ticket

704

granting service decrypts the ticket granting ticket with its secret key K_TGS, the key
that Kerberos used to encrypt it and that is shared with the ticket granting service; this
recovers the session key K. It then verifies the authenticity of the authenticator by
decrypting it, that is, the identity of the client and the timestamp T1, which have to be
decrypted with the key K. If both decryptions in step 2 are successful and T1 is timely,
the ticket granting service is convinced of the authenticity of the ticket and creates a
ticket for the server. This ticket is returned to the client in step 3, and in step 4 the
client recovers the session key and ticket_S, the ticket for the server, by decrypting the
reply with K.

In the third step the client requests the service from the server, using the information it
has obtained by communicating with Kerberos and with the ticket granting service.

(Refer Slide Time: 29:11)

The ticket granting service has given the ticket to communicate with the server, and along
with it a session key, the secret key which will be used for encryption between the client
and the server. Let us see these steps. The client sends to the server the encrypted ticket
and also the client's authenticator with a timestamp, encrypted with the session key given
by the ticket granting server.

Now S, the server, recovers the session key K from ticket_S by decrypting it with its
secret key Ks; it recovers T2 by decrypting the authenticator with K and checks the
timeliness of

705

the timestamp T2 with respect to its local clock. The server then replies to the client
with the new timestamp T2 + 1, encrypted with the common key K which was given by the
ticket granting service.

(Refer Slide Time: 30:35)

All these things are explained there. One weakness is that there is no provision for host
security; that means the passwords are available at the host, and if the host is untrusted
then it becomes a problem for the entire network.

This problem is addressed in Kerberos version 5, which introduced pre-authentication, as
explained in the slide. Now, the next important protocol is the secure socket layer
protocol. The secure socket layer protocol was developed by Netscape and is a standard
internet protocol for secure communication. The Kerberos authentication showed how
authentication can be carried out, with the help of servers, without the password flowing
on the network.

706
(Refer Slide Time: 30:54)

Now consider another protocol, SSL: when a host wants to access a website or a server on
the internet, the problem it faces is authentication, that is, how to carry out the
authentication.

We are going to use this protocol for authentication on the internet,

(Refer Slide Time: 32:10)

and for the authentication between the host and the website: the host has to verify that it
is communicating with the proper website, and the proper website has to authenticate itself
to the host. This is very important; if, let us say, it is a

707

payment system or a banking payment system, then the authentication of the host as well as
of the website is very important, otherwise credit card information may be leaked and all
the money could be siphoned out. So, this SSL protocol is going to be very useful, and we
are going to see its use.

The secure hypertext transfer protocol, HTTPS, is basically a communication protocol
designed to transfer encrypted information between computers over the world wide web.
HTTPS is HTTP using the secure socket layer, SSL. SSL resides between TCP/IP and the
application, requiring no changes to the application layer, so different applications can
use SSL to secure their communication. SSL is typically used between a server and a client
to secure the connection; one advantage of SSL is that it is an application-independent
protocol, so higher-level protocols can be layered on top of SSL transparently.
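For instance, a client can layer TLS (the successor to SSL) over an ordinary TCP socket using Python's standard `ssl` module; the host name and port below are only an example, and a real application would add its own error handling.

```python
import socket, ssl

hostname = "example.org"                      # illustrative server
context = ssl.create_default_context()        # verifies the server certificate chain

with socket.create_connection((hostname, 443)) as raw_sock:
    # wrap_socket layers SSL/TLS on top of TCP, transparently to the application
    with context.wrap_socket(raw_sock, server_hostname=hostname) as tls_sock:
        print("negotiated protocol:", tls_sock.version())
        tls_sock.sendall(b"GET / HTTP/1.1\r\nHost: example.org\r\nConnection: close\r\n\r\n")
        print(tls_sock.recv(200))
```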

SSL protocol let us go and detail about the features of SSL protocol.

(Refer Slide Time: 33:55)

The SSL protocol allows client-server applications to communicate in a way such that
eavesdropping, tampering and message forgery are prevented, which is very important for
customers doing banking-related transactions where credit card information or bank details
are sent. The SSL protocol in general provides the following features. The first is
endpoint authentication: the server is the real party that the client wants to talk to, not
someone faking the identity; for example,

708

whether it is the proper bank website or some fake website has to be verified by the person
or client who wants to use it. The second feature of SSL is message integrity: if the data
exchanged with the server has been modified along the way, it can easily be detected, so
the messages exchanged by both principals have proper integrity (Refer Time: 35:08).

The third one is confidentiality: the data which flows is encrypted, so a hacker cannot
read the information by simply looking at the packets on the network.

(Refer Slide Time: 35:29)

The SSL record protocol takes the application message, fragments the data into manageable
blocks, encrypts them, adds headers, and the resulting unit is sent to TCP.

709
(Refer Slide Time: 35:55)

Now, another important part of the SSL protocol is the handshake protocol. If the client
is, let us say, a customer of a bank and the server is the online bank, then the SSL
protocol will first perform a handshake and establish the parameters under which the secure
communication will be provided; that is why it is called the handshake protocol. It allows
the server and the client to authenticate each other and to negotiate an encryption
algorithm and cryptographic keys before the application transmits or receives its first
byte of data, which is very important. That means both sides communicate and negotiate this
information: which encryption algorithm is going to be used, and once that is decided,
which keys have to be exchanged, all of it chosen from what is supported by the client's
browser and by the server's website. These technical details are negotiated, and both sides
follow them in further communication for the secure way of communicating.

710
(Refer Slide Time: 37:32)

Let us see the handshake protocol of SSL. First, the SSL client sends a client hello
message; in this hello message it gives information about the client's order of preference
among the encryption algorithms it supports at its end, and some random bytes are also sent
in this hello message. The SSL server responds to the hello message after choosing, from
the options given by the client, the encryption methods, encryption algorithms and
cryptographic parameters it will use. The SSL server also sends its digital certificate; a
digital certificate is a certificate by which the client can verify that this is the
correct website.

A digital certificate is provided by a certification agency which is well known to
everyone. For example, if a certificate is issued by some company X which is not well
known, then that certificate is not an authorized certificate, or is not known to be a
proper certificate; whereas if a well-known company gives a certificate, then that
certificate can be properly verified. So, digital certificates are issued by well-known
certification authorities. If the server requires a digital certificate from the client as
well, for client authentication, then the server also asks the client to send a client
certificate; this request includes the list of certificate types supported and the
distinguished names of acceptable certification authorities, so the server specifies that
the client's certificate has to come from one of these listed certification authorities.

711
(Refer Slide Time: 39:44)

Now, the SSL client verifies the digital signature on the SSL server's digital certificate
and checks that the cipher suite chosen by the server is acceptable. The SSL client, using
all the data generated in the handshake so far, creates a pre-master secret for the
session. If the SSL server sent a client certificate request, the SSL client sends another
signed piece of data which is unique to this handshake, together with its own digital
certificate, as part of the handshake.

(Refer Slide Time: 40:25)

712
The SSL server verifies the signature on the client certificate. The SSL client then sends
the SSL server a finished message, and the SSL server sends the SSL client a finished
message on its side.

(Refer Slide Time: 40:54)

For the duration of the SSL session, the SSL server and the SSL client can now exchange
messages that are encrypted with the shared symmetric key exchanged in this handshake.

All the handshake steps which I have described are listed here in the handshake protocol.
For both client and server authentication there is a step that requires data to be
encrypted with one of the keys of an asymmetric key pair and decrypted with the other key
of the pair.

713
(Refer Slide Time: 41:25)

For server authentication, the client uses the server's public key to encrypt the data that
is used to compute the secret key; the server can generate the secret key only if it can
decrypt that data with the correct private key. This is what authenticates the server: only
the genuine, authenticated server is able to perform this step. For client authentication,
the server uses the public key in the client certificate to decrypt the signed data the
client sends during step 5 of the handshake; the exchange of finished messages, encrypted
with the secret key, confirms that authentication is complete.

(Refer Slide Time: 42:24)

714
If the SSL server requires client authentication, the server verifies the client's identity
by verifying the client's digital certificate, as I have already explained. Conclusion:
authentication is a process by which one principal verifies the identity of the other
principal.

(Refer Slide Time: 42:51)

It involves an identity being produced, for example in the form of digital certificates,
and verified with the help of secret keys, whether in a symmetric cryptosystem or in a
public-key (asymmetric) cryptosystem.

For example, in a client-server system the client and server may need to verify each
other's identity to ensure that each is talking to the right entity, as we have seen in
both of these protocols, Kerberos as well as SSL. Generally, authentication is based on
possession of secret information, like a password, that is known only to the entities
participating in the authentication; for a successful authentication the entity must
demonstrate knowledge of the right secret information.

Most importantly, we have seen 2 different types of protocol: Kerberos, wherein the
password is kept off the network, and SSL, wherein secure communication between the two
partners, that is client and server, takes place on the internet. Both of these protocols
are heavily used, and we have covered their details in this part of the lecture.

715
Thank you.

716
Distributed Systems
Prof. Rajiv Misra
Department of Computer Science and Engineering
Indian Institute of Technology, Patna

Lecture – 26
Bitcoin: A Peer-to-Peer Electronic Cash System

(Refer Slide Time: 00:16)

(Refer Slide Time: 00:19)

Bitcoin: a peer-to-peer electronic cash system. Introduction: a purely peer-to-peer version
of electronic cash would allow online payments to be sent directly from one party to

717

another without going through a financial institution. Digital signatures provide part of
the solution, but the main benefits are lost if a trusted third party is still required to
prevent double spending. In this lecture we will discuss the solution to the double
spending problem using a peer-to-peer network.

The network timestamps transactions by hashing them into an ongoing chain of hash-based
proof of work, forming a record that cannot be changed without redoing the proof of work.
The longest chain not only serves as proof of the sequence of events witnessed, but as
proof that it came from the largest pool of CPU power. As long as the majority of CPU power
is controlled by nodes that are not cooperating to attack the network, they will generate
the longest chain and outpace the attackers.

(Refer Slide Time: 1:18)

The network itself requires minimal structure: messages are broadcast on a best-effort
basis, and nodes can leave and rejoin the network at will, accepting the longest
proof-of-work chain as proof of what happened while they were gone. In a nutshell, before
we go into more detail, this part of the discussion is a case study of applying the
peer-to-peer system concept of decentralization.

Let me give you an example. Suppose A wants to give some money to B. To transfer cash, the
bank has to be involved; only then can the funds be transferred from A to B, with the help
of a bank. Or, if it is an online transaction, then

718

a payment gateway or similar intermediary is involved. This way of transferring money is in
the form of a client-server arrangement: the bank or gateway acts as the server, and A and
B are the clients who want to transfer money with the help of the server. Now, there are
different issues about how this centralized system can be turned into a distributed form,
or how it can be decentralized, so that two parties A and B can directly transfer their
money without involving the bank or any payment gateway; this is done with the help of the
decentralization technology called peer-to-peer systems.

We will see in this discussion how Bitcoin, using a peer-to-peer system, is able to achieve
this kind of decentralization without involving a centralized system like a bank or a
payment gateway, yet ensures safety and security and avoids the double spending issue. So,
what is a transaction? An electronic coin can be defined as a chain of digital signatures.

(Refer Slide Time: 04:15)

Each owner transfers a coin to the next by digitally signing a hash of the previous
transaction and the public key of the next owner and adding these to the end of the chain;
a payee can verify the signatures to verify the chain of ownership. Meaning to say, a
Bitcoin owner has a private key, while the other part, called the public key, is stored in
the structure called the blockchain.

719
If a request to spend money is sent by a user, it has to be digitally signed using the
private key; by producing this digital signature, which requires nothing but the private
key, the user shows that he is the owner of that bitcoin, or digital currency. So, an
electronic coin can be defined as a chain of digital signatures: each owner transfers the
coin to the next by digitally signing the hash of the previous transaction and the public
key of the next owner and adding these to the end of the coin. These transactions are thus
chained together, as shown in the picture: every transaction is chained to the previous
transaction, which in turn is chained to the transaction before it, and so on.

Whenever a new transaction comes, it joins at the tail of this chain of transactions, that
is, at the end of the coin, and the payee can verify the signatures to verify the claim of
the owner. We are going to see how the transactions are implemented so that double spending
of a coin can be avoided. The problem, of course, is that the payee cannot verify that one
of the owners did not double-spend the coin.
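A toy sketch of such a chain of digital signatures (an illustration only, not Bitcoin's actual transaction format; it uses Ed25519 keys from the third-party `cryptography` package, and the transaction layout here is made up):

```python
import hashlib
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def pub_bytes(private_key):
    return private_key.public_key().public_bytes(
        serialization.Encoding.Raw, serialization.PublicFormat.Raw)

def transfer(prev_txn, owner_private_key, next_owner_pub):
    """Owner signs hash(previous transaction) + next owner's public key."""
    payload = hashlib.sha256(prev_txn).digest() + next_owner_pub
    signature = owner_private_key.sign(payload)
    return payload + signature            # the new transaction, appended to the chain

def verify(prev_txn, txn, owner_public_key):
    payload, signature = txn[:-64], txn[-64:]
    if payload[:32] != hashlib.sha256(prev_txn).digest():
        return False                      # not chained to the previous transaction
    try:
        owner_public_key.verify(signature, payload)   # proves ownership
        return True
    except Exception:
        return False

# Alice (current owner) pays Bob; the payee checks Alice's signature and the chain.
alice, bob = Ed25519PrivateKey.generate(), Ed25519PrivateKey.generate()
genesis = b"coin created for alice"
txn = transfer(genesis, alice, pub_bytes(bob))
print(verify(genesis, txn, alice.public_key()))   # True
```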

(Refer Slide Time: 06:40)

A common solution is to introduce a trusted central authority, or mint, but that takes us
back again to a client-server scenario

720

involving a payment gateway; the question is how double spending can be avoided without
involving the payment gateway and a bank.

With a mint, after each transaction the coin must be returned to the mint to issue a new
coin, and only coins issued directly from the mint are trusted not to be double spent. The
problem with this solution is that the fate of the entire money system depends on the
company running the mint, that is, the bank. We need a way for the payee to know that the
previous owners did not sign any earlier transactions. The only way to confirm the absence
of a transaction is to be aware of all transactions. In the mint-based model the mint was
aware of all transactions, but if we are going for decentralization, how is this awareness
of all transactions to be implemented? To accomplish this without a trusted third party,
transactions must be publicly announced, and we need a system for participants to agree on
a single history of the order in which they were received.

(Refer Slide Time: 08:08)

The payee needs proof that at the time of each transaction, the majority of nodes agreed it
was the first received.

Now, timestamp servers. The solution begins with a timestamp server. A timestamp server
works by taking the hash of a block of items to be timestamped and widely publishing the
hash, such as in a newspaper or in a Usenet post. The timestamp proves that the data must
have existed at that time, since obviously it had to exist in order to get into the hash.

721
(Refer Slide Time: 08:45)

So, each timestamp includes the previous timestamp in its hash forming a chain with
each additional timestamp reinforcing the one before it.

(Refer Slide Time: 09:01)

Here we can see that these timestamps are nothing but hashes chained with each other; this
chaining indicates that a particular transaction happened at that instant, since it appears
in the chain, and it will also be tied to a proof of work.

722
To implement the distributed timestamp server on a peer-to-peer basis, we need to use a
proof-of-work system instead of publishing in a newspaper. Proof of work involves scanning
for a value that, when hashed, such as with SHA-256, produces a hash beginning with a
number of zero bits.

(Refer Slide Time: 09:30)

The average work required is exponential in the number of zero bits required, yet it can be
verified by executing a single hash. For our timestamp network, the proof of work is
implemented by incrementing a nonce in the block until a value is found that gives the
block's hash the required zero bits. Now, who will be doing this? The peers who do this
kind of work for the proof of work are called miners. All the miners try to find the
solution of this puzzle, and the miner who finds the solution first is rewarded with
bitcoin.

Computing this proof of work requires real CPU effort.
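A minimal sketch of this nonce search (illustrative only; real Bitcoin uses double SHA-256 over a specific block header format and a difficulty target, not this simple leading-zero-bit count):

```python
import hashlib

def proof_of_work(block_data: bytes, zero_bits: int):
    """Increment a nonce until SHA-256(block_data + nonce) starts with `zero_bits` zeros."""
    target = 1 << (256 - zero_bits)          # hash, read as an integer, must fall below this
    nonce = 0
    while True:
        digest = hashlib.sha256(block_data + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce, digest
        nonce += 1

def verify(block_data: bytes, nonce: int, zero_bits: int):
    digest = hashlib.sha256(block_data + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - zero_bits))   # a single hash

nonce, digest = proof_of_work(b"prev-hash || transactions", 16)   # ~65,000 tries on average
print(nonce, digest.hex()[:8], verify(b"prev-hash || transactions", nonce, 16))
```

Finding the nonce takes on average 2^zero_bits hash evaluations (the exponential work), while checking it takes exactly one hash, which is why anyone can cheaply verify a block.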

723
(Refer Slide Time: 10:54)

The block cannot be changed without redoing the work, and since later blocks are chained
after it, the work to change a block would include redoing all the blocks after it. So, if
somebody wants to change a particular block, the change has to be propagated to all the
subsequent blocks; only then is the change possible, and recomputing all that work is
practically impossible because it is computationally expensive, while many peers together
keep maintaining the chain and solving the puzzle.

Proof of work also solves the problem of determining representation in majority decision
making. If the majority were based on one-IP-address-one-vote, it could be subverted by
anyone able to allocate many IPs.

724
(Refer Slide Time: 11:52)

To compensate for increasing hardware speed and varying interest in running nodes over
time, the proof-of-work difficulty is determined by a moving average targeting an average
number of blocks per hour; if blocks are generated too fast, the difficulty increases.
Hence it is very difficult for an attacker to double-spend: the timestamping uses the
peer-to-peer concept of hash chains, and recomputing these hash chains is computationally
very difficult for an attacker, so the proof of work stands.

Now, the network. The steps to run the network are as follows. New transactions are
broadcast to all the nodes, that is, to all the peers involved; whenever a new transaction
comes it has to be broadcast to all the nodes. Each node collects new transactions into a
block, and each node works on finding a difficult proof of work for its block.

725
(Refer Slide Time: 13:04)

When a node finds a proof of work for its block, it broadcasts the block to all the nodes.
Nodes accept the block only if all the transactions in it are valid and not already spent,
and they express their acceptance of the block by working on creating the next block in the
chain, using the hash of the accepted block as the previous hash.

Nodes always consider the longest chain to be the correct one and keep working on extending
it. If 2 nodes broadcast different versions of the next block simultaneously, some nodes
may receive one and some the other first; in that case they work on the first one they
received, the branch that grows longer wins, and the successful miner is rewarded with
bitcoin.

726
(Refer Slide Time: 13:25)

Hence these self-interested miners, or peers, are the ones responsible for carrying out
this proof of work, and they are rewarded in the process.

New transaction broadcasts do not necessarily need to reach all the nodes; reaching the
majority is sufficient. Incentive: by convention, the first transaction in a block is a
special transaction that starts a new coin owned by the creator of the block. This adds an
incentive for nodes to support the network, and provides a way to initially distribute
coins into circulation, since there is no central authority to issue them, as I told you
already. The steady addition of a constant amount of new coins is analogous to gold miners
expending resources to add gold to circulation.

727
(Refer Slide Time: 14:57)

So, in our case this mining expends CPU time and electricity, and that is why the miner who starts a new coin gets the incentive for it. The incentive can also be funded with transaction fees: if the output value of a transaction is less than its input value, the difference is the transaction fee, which is added to the incentive value of the block containing the transaction.

The incentive may help encourage nodes to stay honest. If a greedy attacker is able to assemble more CPU power than all the honest nodes, he would have to choose between using it to defraud people by stealing back his payments or using it to generate new coins. He ought to find it more profitable to play by the rules, since the rules favour him with more new coins than everyone else combined.

Reclaiming disk space: once the latest transaction in a coin is buried under enough blocks, the spent transactions before it can be discarded, and to facilitate this a Merkle tree is used.

(Refer Slide Time: 16:01)

So, a Merkle tree works as follows: all the transactions pass through a one-way hash function, each producing a hash; these hashes are combined pairwise, and finally a single root hash emerges. This root captures the whole set: if there is any change at any point, the same root cannot be reproduced, and that is how it is detected that something has been changed, so no transaction can be altered afterwards. Hence the Merkle tree is essentially a compact summary of the transactions. To facilitate discarding old transactions without breaking the block's hash, transactions are hashed in a Merkle tree like this, with only the root included in the block's hash; the root hash alone carries the complete commitment to all the transactions.
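A minimal sketch of this pairwise hashing in Python; it uses a single SHA-256 where Bitcoin actually uses double SHA-256, and duplicating the last hash for an odd count is a common simplification, so treat it as an illustration of the idea rather than Bitcoin's exact construction.

import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(transactions: list) -> bytes:
    # Hash every transaction, then repeatedly hash adjacent pairs until one root remains.
    level = [h(tx) for tx in transactions]            # assumes at least one transaction
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])                   # duplicate the last hash if the count is odd
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

root = merkle_root([b"tx0", b"tx1", b"tx2", b"tx3"])  # changing any transaction changes this root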

Simplified payment verification: it is possible to verify payments without running a full network node; the user only needs to keep a copy of the block headers of the longest proof-of-work chain, which he can get by querying the network nodes.

(Refer Slide Time: 17:25)

He queries until he is convinced that he has the longest chain, and obtains the Merkle branch linking the transaction to the block it is timestamped in. He cannot check the transaction for himself, but by linking it to a place in the chain he can see that a network node has accepted it, and blocks added after it further confirm that the network has accepted it. This is the example of the longest proof-of-work chain; here you see that each block header contains only the Merkle root.
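Checking such a Merkle branch only needs the sibling hash at each level of the tree, not the full block. The sketch below stays consistent with the pairing convention of the earlier Merkle-root sketch; the branch format, with each sibling tagged by the side it sits on, is an assumption made for the example.

import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify_branch(tx: bytes, branch: list, root: bytes) -> bool:
    # branch is a list of (sibling_hash, side) pairs from the leaf up to the root.
    node = h(tx)
    for sibling, side in branch:
        node = h(node + sibling) if side == "right" else h(sibling + node)
    return node == root   # True only if the transaction really sits under this root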

(Refer Slide Time: 18:03)

And here the Merkle root keeps the link to the transaction: the transaction is tied to a Merkle node, and this branch, together with the chain of block headers, is the indication that the transaction sits in the longest proof-of-work chain.

So, as shown in the figure, the user only needs to keep a copy of the block headers of the longest proof-of-work chain, which he can get by querying the network nodes until he is convinced that he has the longest chain, and then obtain the Merkle branch linking the transaction to the block it is timestamped in. Combining and splitting value: although it would be possible to handle coins individually, it would be unwieldy to make a separate transaction for every cent in a transfer. To allow value to be split and combined, transactions contain multiple inputs and outputs, as shown over here.
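A toy illustration of a transaction with several inputs and outputs; the field names are assumptions, and real Bitcoin transactions reference earlier outputs rather than carrying bare amounts. It also shows how the transaction fee mentioned earlier is simply the input value not claimed by the outputs.

from dataclasses import dataclass, field

@dataclass
class Transaction:
    inputs: list = field(default_factory=list)    # values of the coins being spent
    outputs: list = field(default_factory=list)   # values paid out (recipient and change)

    def fee(self) -> int:
        # Whatever input value is not claimed by the outputs goes to the miner.
        return sum(self.inputs) - sum(self.outputs)

tx = Transaction(inputs=[50, 30], outputs=[60, 15])   # 80 in, 75 out
print(tx.fee())                                        # 5 units left as the transaction fee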

(Refer Slide Time: 18:57)

(Refer Slide Time: 19:02)

So, privacy: the traditional banking model achieves its level of privacy by limiting access to information to the parties involved and a trusted third party. The necessity of announcing all transactions publicly precludes this method, but privacy can still be maintained by breaking the flow of information in another place: by keeping the public keys anonymous, the public can see that someone is sending an amount to someone else, but without any information linking the transaction to anyone.

This is similar to the level of information released by stock exchanges, where the time and size of individual trades, the tape, is made public, but without telling who the parties were. So, in the traditional privacy model there is a trusted third party; transactions carry a linked identity, and when the parties approach the trusted third party the transaction can be verified. In the new privacy model there is no need for identities: the transactions are made public, and privacy is achieved by keeping the keys anonymous.

The race between the honest chain and an attacker chain can be characterized as a binomial random walk: the success event is the honest chain being extended by one block, increasing its lead by +1, and the failure event is the attacker's chain being extended by one block, reducing the gap by 1.

(Refer Slide Time: 20:40)

So obviously, if the honest nodes hold more power than the dishonest ones, this gap keeps on increasing, the attacker is never able to overtake, and the honest chain keeps on growing in this manner.

The attacker's potential progress can be modelled with a Poisson distribution, and the probability that the attacker could still catch up is obtained by multiplying the Poisson density for each amount of progress he could have made by the probability of catching up from that point, and summing, rearranged to avoid summing the infinite tail of the distribution. This probability drops off exponentially as the attacker falls further behind, so it becomes computationally infeasible for the attacker to construct a longer chain. Reading materials: "Bitcoin: A Peer-to-Peer Electronic Cash System" and "Bitcoin and Cryptocurrency Technologies".
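The catch-up probability just described can be computed directly; the sketch below follows the calculation given in the Bitcoin paper, with q the attacker's fraction of the hash power and z the number of blocks the attacker is behind.

import math

def attacker_success_probability(q: float, z: int) -> float:
    # Probability that an attacker with fraction q of the hash power ever catches up
    # from z blocks behind.
    p = 1.0 - q
    lam = z * (q / p)                      # expected attacker progress (Poisson mean)
    prob = 1.0
    for k in range(z + 1):
        poisson = math.exp(-lam) * lam ** k / math.factorial(k)
        prob -= poisson * (1 - (q / p) ** (z - k))
    return prob

for z in (0, 5, 10, 25):
    print(z, attacker_success_probability(0.10, z))   # drops off exponentially in z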

(Refer Slide Time: 21:31)

Conclusion: in this lecture we have discussed a system for electronic transactions without relying on trust. We started with the usual framework of coins made from digital signatures, which provides strong control of ownership but is incomplete without a way to prevent double spending. To solve this, we discussed a peer-to-peer network using proof of work to record a public history of transactions that quickly becomes computationally impractical for an attacker to change if honest nodes control a majority of the CPU power. The network is robust in its unstructured simplicity: nodes can leave and rejoin the network at will, and they vote with their CPU power, expressing their acceptance of valid blocks.

(Refer Slide Time: 21:50)

So, any needed rules and incentives can be enforced with this consensus mechanism.

Thank you.

Distributed Systems
Prof. Rajiv Misra
Department of Computer Science and Engineering
Indian Institute of Technology, Patna

Lecture – 27
Block Chain

(Refer Slide Time: 00:17)

Blockchain technology, introduction: a blockchain is essentially a distributed database of records, or a public ledger of all transactions or digital events that have been executed and shared among the participating parties. We can also say that a blockchain is a kind of distributed consensus: it is categorized under distributed consensus and is essentially a distributed database of records of the operations, which are called ledgers, so it is a distributed ledger implementation. A bank also maintains a ledger of different people a, b, c having different amounts of money in their accounts, and the bank maintains that ledger centrally; the question is how such a ledger can be built in a decentralized manner. This is achieved using distributed consensus, and blockchain is the technology that maintains this kind of distributed ledger, or distributed record.

So, a blockchain is essentially a distributed database of records, or a public ledger of all the transactions or digital events that have been executed and shared among the participating parties. It is not maintained at one centrally located place but in a distributed manner; how this is all done through a blockchain is what we are going to discuss in this new technology, which is also called a disruptive technology. This part of the discussion involves the concept of distributed consensus: we will see that there are various impossibility results, and how, in spite of them, this technology called blockchain turns the impossibility into a possibility.

So, each transaction in the public ledger is verified by a consensus of a majority of the participants in the system; that is why it is called distributed consensus. The distributed ledger is implemented with the help of the distributed consensus of the participating parties, and a blockchain contains a certain and verifiable record of every single transaction ever made. To use a basic analogy, it is easier to steal a cookie from a cookie jar kept in a secluded place than to steal one from a cookie jar kept in a marketplace, being observed by thousands of people. In other words, there are two models of security. The existing model employs specific security provisions to protect some resource from unauthorized access; these security mechanisms have to be implemented very strongly, require a lot of overhead, and have to be maintained centrally.

The other model is a decentralized security model: even untrusted people are given access to the resources, and the resources are monitored collectively by those same untrusted people. If a cookie jar is placed somewhere and secured, the security mechanism ensures nobody can take from it; similarly, if the cookies are put out in public, the public is watching, everyone witnesses what happens, and nobody can steal in that case either. This is the other model, and we are going to see how it is useful for implementing distributed consensus and how the blockchain implements it.

(Refer Slide Time: 04:57)

So, Bitcoin is the most popular example of blockchain technology. It is a disruptive technology in the sense that, where banks operate with the help of a single centralized secure system, it provides an alternative in which identities are anonymous and the technique called distributed consensus is used. We have already seen that Bitcoin, with the help of a blockchain, has successfully implemented all the security provisions that were otherwise possible only through centralized banks; these are now realizable using blockchain technology.

Bitcoin is also the most controversial example, since it helps enable a multibillion-dollar global market of anonymous transactions without any governmental control, and hence it has to deal with a number of regulatory issues involving national governments and financial institutions.

(Refer Slide Time: 06:21)

Blockchain technology itself is non-controversial: it has worked flawlessly over the years and is being successfully applied to both financial and non-financial applications. The blockchain distributed consensus model is the most important invention here.

(Refer Slide Time: 06:41)

So, that is why we are discussing this technology called blockchain, which is going to be used not only in financial institutions but also in non-financial institutions, wherever there is a centralized system.

So, the current digital economy is based on reliance on a trusted authority. Wherever such scenarios exist, blockchain technology can solve the problem: we live our lives precariously in the digital world by relying on a third entity for the security and privacy of our digital assets, and this is where blockchain technology comes in handy.

(Refer Slide Time: 07:22)

So, it has the potential to revolutionize the digital world by enabling a distributed consensus in which each and every online transaction, past and present, involving digital assets can be verified at any time in the future. Distributed consensus and anonymity are the two important characteristics of blockchain technology that we are going to see in this discussion.

(Refer Slide Time: 07:50)

Now, another application is in the form of digital smart contracts. The advantage of blockchain technology outweighs the regulatory issues and technical challenges, and one key emerging use case of blockchain is smart contracts.

(Refer Slide Time: 08:07)

Smart property is another related concept: controlling the ownership of a property or asset via the blockchain using smart contracts. This is another non-financial kind of application.

(Refer Slide Time: 08:20)

So, regarding financial and non-financial applications: financial institutions and banks no longer see blockchain technology as a threat to their traditional business models, and the non-financial application opportunities are also endless.

(Refer Slide Time: 08:34)

So, let us see this example. In a traditional system the bank has to verify the flow of the transaction from A to B, confirming that A has indeed sent the money. That is how traditional transactions work: there is a trusted third party, a bank or a payment gateway, which both partners, and everyone else, trust, and the bank's involvement is required in every transaction that happens. The question is how this can be broken up: how two people can communicate or transact without using a third party, with the help of blockchain technology.

(Refer Slide Time: 09:08)

So, in this lecture we will also focus on a few key applications of blockchain technology, such as in the areas of notary services, insurance, private securities, and other interesting non-financial applications.

(Refer Slide Time: 09:39)

So, this is blockchain technology; the concept of the blockchain is best understood by explaining how Bitcoin works, since it is intrinsically linked with Bitcoin.

(Refer Slide Time: 09:52)

So, Bitcoin uses cryptographic proof instead of trust in a third party for two willing parties to execute an online transaction over the Internet. Each transaction is sent to the public key of the receiver and digitally signed using the private key of the sender. In order to spend money, the owner of the cryptocurrency needs to prove ownership of the private key; the entity receiving the digital currency verifies the digital signature, and thereby the ownership of the corresponding private key, on the transaction using the public key of the sender. Whenever the owner wants to spend, he digitally signs the transaction with his private key and broadcasts it; on the receiving side the transaction is verified using the public key, and once it is verified that the sender is indeed the owner who wants to make the transaction, the process of chaining the transactions, which we are going to see, takes over.

So, each transaction is broadcast to every node in the Bitcoin network and is recorded in a public ledger after verification.
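A minimal sketch of this sign-and-verify step, assuming the third-party Python ecdsa package is available; the message content and variable names are illustrative, and real Bitcoin transactions have a more elaborate structure.

import hashlib
from ecdsa import SigningKey, SECP256k1, BadSignatureError

sender_sk = SigningKey.generate(curve=SECP256k1)   # sender's private key
sender_pk = sender_sk.get_verifying_key()          # sender's public key, known to everyone

tx = b"pay 1 coin from A to B"
tx_hash = hashlib.sha256(tx).digest()
signature = sender_sk.sign(tx_hash)                # the owner signs with the private key

try:                                               # any receiving node verifies with the public key
    sender_pk.verify(signature, tx_hash)
    print("signature valid: the sender owns the corresponding private key")
except BadSignatureError:
    print("invalid signature: reject the transaction")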

(Refer Slide Time: 11:22)

So, every transaction needs to be verified for validity before it is recorded in the public ledger. A verifying node needs to ensure two things before recording any transaction: first, that the spender owns the cryptocurrency, which he proves by digitally signing the transaction, and second, that the spender has sufficient cryptocurrency in his account, which means checking every transaction against the spender's account to make sure that he has sufficient funds.
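A toy sketch of these two checks using a simple account-balance model; real Bitcoin tracks unspent transaction outputs rather than account balances, and the field names and the verify_signature helper here are assumptions for the example.

def is_valid(tx: dict, balances: dict, verify_signature) -> bool:
    # tx is assumed to look like {"sender": ..., "amount": ..., "signature": ...}.
    if not verify_signature(tx):                          # 1. spender owns the private key
        return False
    return balances.get(tx["sender"], 0) >= tx["amount"]  # 2. spender has enough funds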

(Refer Slide Time: 12:01)

Now there is the question of maintaining the order of these transactions, because at every node in the peer-to-peer network the transactions do not arrive in the order in which they were generated, owing to delays in the network. Hence there is a need for a system to make sure that double spending of a cryptocurrency does not occur: there are various peers, and when a message is broadcast it may happen that one transaction reaches a peer earlier than other transactions.

So, the ordering is not guaranteed at all the peers; instead, the ordering agreed by a majority of the peers is taken, and double spending of the cryptocurrency, which could otherwise arise because of the propagation delays in the peer-to-peer network, is avoided using distributed consensus. I have already explained what double spending means.

(Refer Slide Time: 13:08)

So, two transactions spending the same coin may be started close together, and it may so happen that they are received at different instants of time at different peers, but they will be ordered using distributed consensus.

(Refer Slide Time: 13:34)

Hence only one of the transactions will take effect, and double spending is avoided using this concept of distributed consensus.
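Once the peers have agreed on an order, rejecting the later conflicting spend is straightforward; a toy sketch, where the coin references and function names are assumptions for the example:

spent = set()   # references of coins (transaction outputs) already consumed

def accept(input_refs: list) -> bool:
    if any(ref in spent for ref in input_refs):   # some input was already spent
        return False
    spent.update(input_refs)                      # mark these coins as consumed
    return True

print(accept(["coinX"]))   # True: the first spend of coinX is recorded
print(accept(["coinX"]))   # False: the later, conflicting spend is rejected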

So, Bitcoin solves this problem by a mechanism that is now popularly known as blockchain technology. The transactions in one block are considered to have happened at the same point of time, and these blocks are linked to each other like a chain, in a proper linear and chronological order, with every block containing the hash of the previous block.
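A minimal sketch of such chaining and of checking that the links are intact; the block fields and function names below are assumptions, and real blocks carry much more (Merkle root, nonce, difficulty, and so on).

import hashlib, json, time

def block_hash(block: dict) -> str:
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def new_block(transactions: list, prev_hash: str) -> dict:
    return {"timestamp": time.time(), "transactions": transactions, "prev_hash": prev_hash}

def chain_is_consistent(chain: list) -> bool:
    # Every block must carry the hash of the block immediately before it,
    # so altering an old block breaks every later link.
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))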

(Refer Slide Time: 14:07)

So, this timestamp ordering is maintained in the form of a blockchain, as shown over here; Bitcoin solves this problem by introducing a mathematical puzzle.

(Refer Slide Time: 14:17)

So, each block is accepted into the blockchain provided it contains an answer to a very special mathematical problem, which is also called the proof of work: a node generating a block needs to prove that it has put in enough computing resources to solve a mathematical puzzle.

For instance, a node can be required to find a nonce which, when hashed together with the transactions and the hash of the previous block, produces a hash with a certain number of leading zeros. As I mentioned earlier, finding this nonce is computationally very expensive; it requires a lot of CPU and energy resources, and the node, called a miner, that computes it first gets the incentive for this.

(Refer Slide Time: 15:19)

So, the transaction order is protected by this race: the mathematical race protects the transactions.

There is a small probability that more than one block will be generated in the system at a given time; the first node to solve the problem broadcasts its block to the rest of the network.

(Refer Slide Time: 15:27)

And that block is accepted. Occasionally, however, more than one block will be solved at the same time, leading to several possible branches; since the puzzle is very complicated to solve, though, the blockchain quickly stabilizes, meaning that every node is in agreement about the ordering of the blocks a few back from the end of the chain.

(Refer Slide Time: 15:56)

So, the nodes donating their computing resources to find the nonce are called miner nodes; they are financially rewarded for obtaining the nonce and continue forming the longest proof-of-work chain. The network only accepts the longest blockchain as the valid one; hence it is next to impossible for an attacker to outrun it.
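The rule a node follows can be sketched very simply; here chain_is_valid stands for whatever validity check the node applies (hash links and proofs of work, as in the sketches above), and the function name is an assumption for the example.

def choose_chain(candidate_chains: list, chain_is_valid) -> list:
    # Keep only chains whose links and proofs of work check out, then adopt the longest.
    valid = [c for c in candidate_chains if chain_is_valid(c)]
    return max(valid, key=len) if valid else []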

(Refer Slide Time: 16:29)

This example shows that the attacker must outpace, or out-work, the honest network's effort,

(Refer Slide Time: 16:42)

which becomes practically impossible as the chain length increases. In the existing market, blockchain technology is finding applications in both financial and non-financial areas.

(Refer Slide Time: 16:55)

As we have already discussed, there are companies like Ethereum and Codius enabling smart contracts; they are already described in the literature.

(Refer Slide Time: 17:09)

So, there are alternative blockchains: systems using the blockchain algorithm to achieve distributed consensus on a particular digital asset. They may share miners with an existing network such as Bitcoin's, which is called merged mining. Colored Coins is an open-source protocol that describes a class of methods for developers to create digital assets on top of the Bitcoin blockchain by using its functionality beyond digital currency.

(Refer Slide Time: 17:38)

(Refer Slide Time: 17:41)

We have already touched upon various applications in the beginning; another application is decentralized storage. Cloud file storage solutions

(Refer Slide Time: 17:43)

such as Dropbox, Google Drive, and OneDrive are growing in popularity for storing documents, photos, videos, and music files. Despite their popularity, cloud file storage solutions typically face challenges in areas such as security, privacy, and data control; the major issue is that one has to trust a third party with one's confidential files.

(Refer Slide Time: 18:08)

Hence there is another application, decentralized IoT. IoT is an increasingly popular technology in both the consumer and enterprise space. The vast majority of IoT platforms are built on a centralized model in which a broker or hub controls the interaction between the devices; however, this approach becomes impractical in many situations where devices need to exchange data between themselves autonomously.

(Refer Slide Time: 18:44)

So, blockchain technology facilitates this decentralization in IoT platforms. There are many such applications,

(Refer Slide Time: 18:47)

which we have summarized in these slides. There are reading materials available on blockchain technology beyond Bitcoin.

(Refer Slide Time: 18:56)

(Refer Slide Time: 19:02)

You can refer to those. Conclusion: blockchain is the technology backbone of Bitcoin. The distributed ledger functionality, coupled with the security of the blockchain, makes it a very attractive technology for solving current financial as well as non-financial business problems. There is enormous interest in blockchain-based business applications, and hence numerous startups are working on them; financial institutions are investing in exploring applications of blockchain to their current business models. In fact, some of them are searching for new business models in the world of blockchain. It is envisioned that blockchain will go through slow adoption due to the risks associated with it; most of the startups will fail, with a few winners.

Thank you.

THIS BOOK
IS NOT FOR
SALE
NOR COMMERCIAL USE

(044) 2257 5905/08


nptel.ac.in
swayam.gov.in
