Unit 3
Election Algorithms
Many tasks in distributed systems require one of the processes to act as the coordinator.
Election algorithms are techniques by which a distributed system of N processes elects a
coordinator (leader) from among themselves. There are two typical algorithms: the Bully
Algorithm and the Ring-Based Algorithm.
Bully Algorithm
− The bully algorithm is a simple algorithm, in which we enumerate all the processes
running in the system and pick the one with the highest ID as the coordinator. In this
algorithm, each process has a unique ID and every process knows the corresponding ID
and IP address of every other process.
− A process initiates an election if it just recovered from failure or if the coordinator failed.
Any process in the system can initiate this algorithm for leader election. Thus, we can
have concurrent ongoing elections.
− There are three types of messages for this algorithm: election, OK and I won. The
algorithm is as follows:
1. A process with ID i initiates the election.
2. It sends election messages to all processes with ID > i.
3. Any process upon receiving the election message returns an OK to its predecessor
and starts an election of its own by sending election to higher ID processes.
4. If it receives no OK messages, it knows it is the highest ID process in the system.
It thus sends I won messages to all other processes.
5. If it received OK messages, it knows it is no longer in contention and simply drops
out and waits for an I won message from some other process.
6. Any process that receives I won message treats the sender of that message as
coordinator.
Figure 1 The bully election algorithm. (a) Process 4 holds an election. (b) Processes 5 and 6 respond,
telling 4 to stop. (c) Now 5 and 6 each hold an election. (d) Process 6 tells 5 to stop. (e) Process 6 wins
and tells everyone.
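The six steps above can be sketched in Python. This simplified model follows a single chain of the (possibly concurrent) elections and assumes reliable messaging; function and parameter names are illustrative:

```python
def bully_election(initiator, alive):
    """Return the winner of a bully election started by `initiator`.

    `alive` is the set of IDs of processes that are currently up.
    The initiator messages every live process with a higher ID;
    each of those replies OK and holds its own election, so the
    chain terminates at the highest live ID, which sends "I won".
    """
    higher = sorted(pid for pid in alive if pid > initiator)
    if not higher:
        return initiator        # no OK replies: initiator wins
    # An OK arrived: initiator drops out; a higher process takes over.
    return bully_election(higher[0], alive)
```

For example, with processes 0-6 alive and process 7 crashed, an election started by process 4 ends with process 6 as coordinator, matching Figure 1.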
Ring-Based Algorithm
− In the ring-based algorithm, processes are arranged in a logical ring and election
messages are passed from each process to its neighbor. If the neighbor of a process is
down, it sequentially polls each successor (neighbor of neighbor) until it finds a live
node. For example, in Figure 2, when process 7 is down, process 6 passes the election
message to process 0.
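Assuming the usual ring-based election in which the message travels clockwise collecting the highest live ID and crashed nodes are simply skipped (a sketch of a typical variant, not a full protocol), the traversal can be simulated as:

```python
def ring_election(initiator, alive, ring_size):
    """Circulate an election message around a ring of `ring_size` slots.

    Dead processes (not in `alive`) are skipped, as when process 6
    polls past a crashed process 7 and reaches process 0. The message
    returns to the initiator carrying the highest live ID, the winner.
    """
    winner = initiator
    pos = (initiator + 1) % ring_size
    while pos != initiator:
        if pos in alive:                    # a dead neighbor is skipped
            winner = max(winner, pos)
        pos = (pos + 1) % ring_size
    return winner
```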
Centralized Mutual Exclusion Algorithm
− In the centralized algorithm, one process acts as the coordinator: a process that wants
the lock sends the coordinator a request; the coordinator grants the lock if it is free and
queues the request otherwise; the holder sends a release when it is done.
− There are two major issues with this algorithm, related to failures.
o When the coordinator goes down while one of the processes is waiting on
a response to a lock request, it leads to inconsistency. The newly elected
(or rebooted) coordinator might not know that the earlier process is still waiting
for a response.
o The harder problem occurs when a client process crashes while it is holding the
lock. In that case, the coordinator simply waits for a lock release that will never
arrive.
Advantages:
• Fair: requests are granted the lock in the order they were received (other ordering
policies can also be implemented).
• Simple: three messages per use of a critical section (request, grant, release).
Disadvantages:
• The coordinator is a single point of failure.
• Hard to detect a dead coordinator.
• Performance bottleneck in large distributed systems.
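A minimal sketch of the request/grant/release protocol in Python (the class and method names are hypothetical, and real deployments would use network messages rather than method calls):

```python
from collections import deque

class Coordinator:
    """Grants a single lock; queues requests while the lock is held."""

    def __init__(self):
        self.holder = None       # process currently holding the lock
        self.queue = deque()     # waiting requesters, FIFO (fair order)

    def request(self, pid):
        if self.holder is None:
            self.holder = pid
            return "grant"       # lock is free: grant immediately
        self.queue.append(pid)
        return None              # no reply: the requester blocks

    def release(self, pid):
        assert pid == self.holder
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder       # next queued process is granted, if any
```

The failure modes above show up directly: if the `Coordinator` object is lost, the queue of waiting processes is lost with it, and if a holder never calls `release`, every queued process waits forever.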
Distributed Algorithm (Ricart and Agrawala)
− This algorithm, developed by Ricart and Agrawala, needs 2(n − 1) messages per
critical-section entry and is based on Lamport's clocks and the total ordering of events
to decide on granting locks.
− After the clocks are logically synchronized, the process that asked for the lock first gets
it. The initiator sends request messages to all n − 1 processes stamped with its ID and
the timestamp of its request. It then waits for replies from all other processes.
− Any other process, upon receiving such a request, does one of three things. If it does
not want the lock for itself, it sends a reply. If it is already in its critical section, it
sends no reply and the initiator has to wait. If it also wants to acquire the same lock, it
compares its own request timestamp with that of the incoming request; the one with
the lower timestamp gets the lock first.
This algorithm is based on event ordering and timestamps. It grants the lock to the
process which requests first.
Process k enters critical section as follows:
• Generates a new time stamp TSk = TSk + 1
• Sends request (k, TSk) to all other n-1 processes
• Waits for reply(j) from all other processes
• Enters critical section
When a process j receives process k's request (k, TSk), it:
• Sends reply(j) if it is not interested.
• If already in its critical section, does not reply and queues request (k, TSk).
• If it also wants to enter: compares TSk and TSj, sends reply(j) if TSk < TSj, and
otherwise queues request (k, TSk).
This approach is fully decentralized, but there are now n points of failure, which is worse than
the single point of failure in the centralized algorithm.
Figure 4 . (a) Two processes want to access a shared resource at the same moment. (b) Process 0 has
the lowest timestamp, so it wins. (c) When process 0 is done, it sends an OK also, so 2 can now go
ahead.
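The receiver's decision rule can be sketched as follows. Note one assumption made explicit here: when two timestamps are equal, Lamport's total ordering breaks the tie by process ID, so requests are compared as (timestamp, id) pairs:

```python
def on_request(state, own_ts, own_id, req_ts, req_id):
    """Ricart-Agrawala receiver rule: reply now, or queue (defer) the request.

    `state` is the receiver's status with respect to the lock:
    "RELEASED" (not interested), "HELD" (in the critical section),
    or "WANTED" (has an outstanding request of its own).
    """
    if state == "RELEASED":
        return "reply"                  # not interested: reply at once
    if state == "HELD":
        return "defer"                  # reply only after leaving the CS
    # state == "WANTED": the lower (timestamp, id) pair wins
    return "reply" if (req_ts, req_id) < (own_ts, own_id) else "defer"
```

This mirrors Figure 4: both processes want the resource, process 0 holds the lower timestamp, so process 2 replies to 0 and queues its own grant until 0 is done.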
Token Ring Algorithm
Figure 5 (a) An unordered group of processes on a network. (b) A logical ring constructed in software.
− When the ring is initialized, process 0 is given a token. The token circulates around the
ring. It is passed from process k to process k + 1 (modulo the ring size) in point-to-point
messages.
− When a process acquires the token from its neighbor, it checks to see if it needs to
access the shared resource. If so, the process goes ahead, does all the work it needs to,
and releases the resources. After it has finished, it passes the token along the ring.
− A process is not permitted to immediately access the resource again using the same token.
− If a process is handed the token by its neighbor and is not interested in the resource, it
just passes the token.
− As a consequence, when no processes need the resource, the token just circulates at
high speed around the ring.
− The correctness of this algorithm is easy to see. Only one process has the token at any
instant, so only one process can actually get to the resource.
− Since the token circulates among the processes in a well-defined order, starvation
cannot occur.
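A small simulation of the token-passing behavior (the function name and parameters are illustrative):

```python
def circulate(ring_size, wants, start=0):
    """Pass the token once around the ring, starting at `start`.

    `wants` is the set of processes that want the resource. Each
    holder that is interested enters its critical section, releases
    the resource, and passes the token on; uninterested holders just
    forward it. Returns the order in which processes entered.
    """
    wants = set(wants)
    order = []
    holder = start
    for _ in range(ring_size):
        if holder in wants:          # holder enters, works, releases
            order.append(holder)
            wants.discard(holder)    # same token cannot be reused
        holder = (holder + 1) % ring_size   # pass token to neighbor
    return order
```

Because entries follow ring order, every interested process is served within one full circulation, which is why starvation cannot occur.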
Comparison of the Algorithms
− The centralized algorithm is simplest and also most efficient. It requires only three
messages to enter and leave a critical region: a request, a grant to enter, and a release to
exit.
− In the decentralized case, messages need to be exchanged with each of the m
coordinators, and it is possible that several attempts (k) need to be made.
− The distributed algorithm requires n − 1 request messages, one to each of the other
processes, and an additional n − 1 grant messages, for a total of 2(n − 1) [assuming only
point-to-point communication channels between processes].
− It takes only two message times to enter a critical region in the centralized case, but
3mk times for the decentralized case, where k is the number of attempts that need to be
made. Assuming that messages are sent one after the other, 2(n − 1) message times are
needed in the distributed case. For the token ring, the time varies from 0 (token just
arrived) to n – 1 (token just departed).
− With the token ring algorithm, if every process constantly wants to enter a critical
region, then each token pass will result in one entry and exit, for an average of one
message per critical region entered. At the other extreme, the token may sometimes
circulate for hours without anyone being interested in it. In this case, the number of
messages per entry into a critical region is unbounded.
− All algorithms except the decentralized one suffer badly in the event of crashes. Special
measures and additional complexity must be introduced to avoid having a crash bring
down the entire system. The decentralized algorithm is less sensitive to crashes, but
processes may suffer from starvation and special measures are needed to guarantee
efficiency.
Clock Synchronization
− To implement Lamport’s logical clocks, each process Pi maintains a local counter Ci.
These counters are updated according to the following steps.
o Before executing an event (i.e., sending a message over the network, delivering
a message to an application, or some other internal event), Pi executes Ci ← Ci
+ 1. i.e., each processor has a logical clock which gets incremented whenever
an event occurs.
o When process Pi sends a message m to Pj, it sets m’s timestamp ts(m) equal to
Ci after having executed the previous step.
o Upon the receipt of a message m, process Pj adjusts its own local counter as Cj
← max {Cj, ts(m)}, after which it then executes the first step and delivers the
message to the application.
− This makes sure that the timestamp assigned to the receiver event is higher than the
sending event. This technique was invented by Leslie Lamport.
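The three update rules can be sketched as a small Python class (names are illustrative):

```python
class LamportClock:
    """Lamport logical clock for a single process."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Rule 1: increment the counter before executing any event."""
        self.time += 1
        return self.time

    def send(self):
        """Rule 2: tick, then use the new value as the message timestamp."""
        return self.tick()

    def receive(self, ts):
        """Rule 3: take the max of the local clock and the message
        timestamp, then tick before delivering the message."""
        self.time = max(self.time, ts)
        return self.tick()
```

In a quick run, a message stamped 1 by the sender is delivered at the receiver with clock value 2, so the receive event is always stamped higher than the send event.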
Vector Clocks
− Lamport’s logical clocks lead to a situation where all events in a distributed system are
totally ordered with the property that if event a happened before event b, then a will
also be positioned in that ordering before b, that is, C (a) < C (b).
− However, with Lamport clocks, nothing can be said about the relationship between two
events a and b by merely comparing their time values C (a) and C (b), respectively.
− In other words, if C (a) < C (b), then this does not necessarily imply that event a indeed
happened before event b.
− The problem is that Lamport clocks do not capture causality. Causality can be captured
by means of vector clocks.
− A vector clock VC (a) assigned to an event a has the property that if VC (a) < VC (b)
for some event b, then event a is known to causally precede event b.
− Vector clocks are constructed by letting each process Pi maintain a vector VCi with the
following two properties:
1. VCi [i] is the number of events that have occurred so far at Pi. In other words,
VCi [i] is the local logical clock at process Pi.
2. If VCi [j] = k then Pi knows that k events have occurred at Pj. It is thus Pi’s
knowledge of the local time at Pj.
− The first property is maintained by incrementing VCi [i] at the occurrence of each new
event that happens at process Pi. The second property is maintained by piggybacking
vectors along with messages that are sent. In particular, the following steps are
performed:
1. Before executing an event (i.e., sending a message over the network, delivering
a message to an application, or some other internal event), Pi executes VCi [i] ←
VCi [i] + 1.
2. When process Pi sends a message m to Pj, it sets m’s (vector) timestamp ts(m)
equal to VCi after having executed the previous step.
3. Upon the receipt of a message m, process Pj adjusts its own vector by setting
VCj [k] ← max {VCj [k], ts(m) [k]} for each k, after which it executes the first
step and delivers the message to the application.
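The same three steps, now over vectors, can be sketched as follows; the comparison function implements the VC(a) < VC(b) test for causal precedence (all names are illustrative):

```python
class VectorClock:
    """Vector clock for process `pid` in a system of `n` processes."""

    def __init__(self, pid, n):
        self.pid = pid
        self.vc = [0] * n            # vc[pid] is the local logical clock

    def tick(self):
        """Step 1: increment the local component before any event."""
        self.vc[self.pid] += 1

    def send(self):
        """Step 2: tick, then piggyback a copy of the whole vector."""
        self.tick()
        return list(self.vc)

    def receive(self, ts):
        """Step 3: element-wise max with the message's vector, then tick."""
        self.vc = [max(a, b) for a, b in zip(self.vc, ts)]
        self.tick()

def causally_precedes(a, b):
    """VC(a) < VC(b): every component <=, and the vectors differ."""
    return all(x <= y for x, y in zip(a, b)) and a != b
```

Two events whose vectors are incomparable (neither precedes the other) are concurrent, which is exactly the causality information Lamport clocks cannot provide.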