
Communication


Syllabus

Layered Protocols, Interprocess Communication (IPC): MPI, Remote Procedure Call (RPC), Remote Object Invocation, Remote Method Invocation (RMI), Message-Oriented Communication, Stream-Oriented Communication, Group Communication.

In a distributed system, processes communicate with each other by sending messages, so it is necessary to study the ways processes exchange data and information.

2.1 Layered Protocols


The Open Systems Interconnection Reference Model (ISO OSI model) is considered in order to understand the various levels and issues related to communication. An open system communicates with any other open system by using standard rules about the format and contents of messages, and so on. These rules are called protocols. Protocols can be connection-oriented or connectionless.

[Figure: the seven OSI layers on two machines (7 Application, 6 Presentation, 5 Session, 4 Transport, 3 Network, 2 Data Link, 1 Physical), with peer protocols (application protocol, presentation protocol, session protocol, transport protocol, network protocol, data link protocol, physical protocol) running between corresponding layers across the network]

Fig. 2.1.1 : Interfaces, Layers and Protocols in OSI model

In this model seven layers are present. Each layer provides services to its upper layer. An interface is present between each pair of adjacent layers; the interface defines the set of operations through which services are provided to users. The application running on the sender machine is considered to be at the application layer. It builds the message and hands it over to the application layer on its machine. The application layer adds its header at the front of this message and passes it to the presentation layer, which adds its own header at the front of the message. The message is further passed down to the session layer, which in turn passes the message to the transport layer after adding its header, and so on.
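The encapsulation just described can be sketched in a few lines of code. The following is a minimal illustration (the layer names and header format are abbreviations invented for the example, not from the text): each layer prepends its own header to what it receives from the layer above.

```java
// Minimal sketch (assumed names): OSI-style encapsulation, where each layer
// adds its header at the front of the message it receives from above.
public class Encapsulation {
    static String addHeader(String layer, String payload) {
        return "[" + layer + "-hdr]" + payload;   // header goes at the front
    }

    public static void main(String[] args) {
        String msg = "hello";                     // built by the application
        // Pass the message down the stack; each layer prepends its header.
        for (String layer : new String[] {
                "APP", "PRES", "SESS", "TRANS", "NET", "DLL" }) {
            msg = addHeader(layer, msg);
        }
        System.out.println(msg);
        // [DLL-hdr][NET-hdr][TRANS-hdr][SESS-hdr][PRES-hdr][APP-hdr]hello
    }
}
```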

2.1.1 Lower Layer Protocols

Following is a discussion of the functions of each layer in the OSI model.


- Physical Layer : This layer deals with issues such as the rate of data transfer per second, how many volts to use for 0 and 1 in binary form, and whether transmission between sender and receiver should be simplex, half duplex or full duplex. Issues like cables, connectors, the number of pins and their meaning are also taken care of by this layer. The physical layer protocols also deal with standardizing the electrical, mechanical and signaling interface so that the receiver receives exactly the information sent by the sender.

- Data Link Layer (DLL) : This layer deals with issues such as:
  o Framing : To detect the start and end of a received frame.
  o Flow Control : This mechanism controls the data transfer rate of the sender to avoid overloading the receiver. DLL implements protocols to give feedback from receiver to sender about slowing down the data transfer rate.
  o Error Control : DLL deals with detection and correction of errors in received data by using mechanisms such as checksum calculation, CRC, etc.

- Network Layer : This layer mainly deals with routing. Messages travel through many routers to reach the destination machine. These routers calculate the shortest path for forwarding a received message to its destination. The Internet Protocol (IP) is most widely used today and is connectionless: packets are routed through different paths independently of each other. Connection-oriented communication is also used, for example virtual channels in ATM networks.

- Transport Layer : This layer is very important for constructing network applications. It offers the services that are not implemented at the interface of the network layer. This layer breaks messages coming from the upper layer into pieces suitable for transmission and assigns them sequence numbers; it also performs congestion control, fragmentation of packets, etc. This layer deals with end-to-end process delivery. Reliable transport services can be offered or implemented on top of connection-oriented or connectionless network services.

- At the transport layer, the Transmission Control Protocol (TCP) works together with IP in network communication today. TCP is connection-oriented, while the User Datagram Protocol (UDP) is a connectionless protocol in this layer.

- Session and Presentation Layers : The session layer deals with dialog control; it keeps track of which parties are currently talking, and provides checkpointing to recover from crashes, so that recovery resumes from the last checkpoint instead of starting from the beginning. The presentation layer deals with the syntax and meaning of transmitted information. It helps machines with different data representations to communicate with each other.

- Application Layer : Initially in the OSI model, electronic mail, file transfer and terminal emulation were considered here. Today, all running applications are considered to be in the application layer. Application-specific protocols such as HTTP and general-purpose protocols such as FTP are now considered part of this layer.

2.2 Interprocess Communication (IPC)

2.2.1 Types of Communication

- Synchronous Communication : In this type of communication, the client application waits after sending a request until the reply to the sent request comes from the server application. Examples are Remote Procedure Call (RPC) and Remote Method Invocation (RMI).

- Asynchronous Communication : In this type of communication, the client application continues other work after sending a request, without waiting until the reply to the sent request comes from the server application. RPC can also be implemented with this communication type. Other examples are transferring an amount from one account to another, updating database entries, and so on.

- Transient Communication : In this type of communication, the sender application and the receiver application both should be running in order to deliver the messages sent between them. Examples are Remote Procedure Call (RPC) and Remote Method Invocation (RMI).

- Persistent Communication : In this type of communication, the sender application and the receiver application need not both be running in order to deliver the messages sent between them. An example is email.
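The difference between the first two types can be sketched in code. The following is a minimal Java illustration (the server method and request names are invented for the example): the synchronous call blocks for the reply, while the asynchronous one continues other work and collects the reply later.

```java
// Minimal sketch (assumed names): the same request issued synchronously
// (caller blocks for the reply) and asynchronously (caller continues).
import java.util.concurrent.*;

public class CommTypes {
    static String server(String request) {          // stands in for the server
        return "reply-to:" + request;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // Synchronous: the client waits until the reply arrives.
        String reply = server("req-1");
        System.out.println(reply);

        // Asynchronous: the client continues other work after sending.
        Future<String> pending = pool.submit(() -> server("req-2"));
        System.out.println("doing other useful work...");
        System.out.println(pending.get());           // collect the reply later
        pool.shutdown();
    }
}
```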
2.2.2 Message Passing Interface

- Sockets are considered insufficient for communication among high-performance multicomputers. Highly efficient applications cannot be implemented with sockets while incurring minimal communication cost, because sockets are implemented with a general-purpose protocol stack such as TCP/IP, which does not suit high-performance multicomputers. Sockets support only simple primitives like send and receive.

- Sockets are also not suitable for the proprietary protocols developed for high-performance interconnection networks, for example those used in clusters of workstations (COWs) and massively parallel processors (MPPs).

- Therefore, such interconnection networks use proprietary communication libraries which provide high-level and efficient communication primitives. The Message Passing Interface (MPI) is a specification for users and programmers of message-passing libraries. MPI is basically developed for supporting parallel applications and is adapted to transient communication, where sender and receiver should be active during communication.

- MPI assumes that process crashes and network failures are fatal and do not require recovery action. MPI also assumes that a known group of processes establishes communication among its members. To recognize a process group a groupID is used, and for a process a processID is used. These are identifiers, and the pair (groupID, processID) uniquely identifies the sender and receiver of a message. This pair of identifiers avoids the use of transport-level addresses.

2.3 Remote Procedure Call (RPC)


A program executing on machine A calling a procedure or function located on machine B is called a Remote Procedure Call (RPC). Although this definition seems simple, it involves some technical issues. The primitives (send and receive) used in sending and receiving messages do not hide communication. In RPC, the calling program calls a remote procedure by sending parameters in the call, and the callee returns the result of the procedure.

2.3.1 RPC Operation


In the Remote Procedure Call (RPC) model, a client-side process calls a procedure implemented on a remote machine. In this call, parameters are transparently sent to the remote machine (the server), where the procedure is actually executed. The server then sends the result of the execution back to the caller. Although the called procedure is executed remotely, it appears as if the call was executed locally: the communication with the remote process remains hidden from the calling process.

- While calling a remote procedure, the client program binds with a small library procedure called the client stub. The client stub stands for the server procedure in the client's address space. In the same way, the server program binds with a procedure called the server stub. Fig. 2.3.1 shows the steps in making an RPC.

Step 1 : The client calls the client stub. As the client stub is on the client machine, this call is a local procedure call.
Step 2 : The client stub packs the parameters into a message and issues a system call to send the message. This process is called marshalling.
Step 3 : The kernel sends the message from the client machine to the server machine.
Step 4 : The kernel on the server machine passes the incoming packet to the server stub.
Step 5 : The server stub calls the server procedure.
[Figure: a client CPU running the client and client stub above its operating system, and a server CPU running the server and server stub above its operating system; the call travels from client to client stub to the client OS, across the network to the server OS, server stub and server, and the result returns along the same path]

Fig. 2.3.1 : Steps in RPC

The server then hands over the result to the server stub, which packs it into a message. The kernel then sends the message to the client machine, where the kernel hands it over to the client stub. The client stub unpacks the message and hands over the result to the client.
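The marshalling in Step 2 and the unmarshalling in Step 4 can be sketched as follows. This is a minimal illustration, not a full RPC system: the procedure name add, the parameters and the message layout are assumptions made for the example.

```java
// Minimal sketch (assumed layout): how a client stub might marshal a call
// into a flat byte message, and how the stub at the other side unmarshals
// and dispatches it. A real RPC system also handles binding and transport.
import java.io.*;

public class Marshalling {
    // Pack procedure name and arguments into a byte message.
    static byte[] marshal(String proc, int a, int b) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF(proc);   // which remote procedure to call
        out.writeInt(a);      // parameters, in an agreed order
        out.writeInt(b);
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] msg = marshal("add", 2, 3);           // client stub side

        // Server stub side: unmarshal and dispatch.
        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(msg));
        String proc = in.readUTF();
        int result = in.readInt() + in.readInt();    // executes "add"
        System.out.println(proc + " -> " + result);  // add -> 5
    }
}
```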

2.3.2 Implementation Issues

- Machines in a large distributed system are not all of the same architecture, and hence the data representation on these machines differs. For example, an IBM mainframe uses the EBCDIC character code, whereas an IBM PC uses ASCII. Hence, passing character codes between such a pair of machines is not directly possible. In the same way, Intel Pentium machines number their bytes from right to left (little endian), whereas SPARC machines number their bytes from left to right (big endian).
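The following minimal Java sketch (not from the text) shows the same 32-bit value laid out in both byte orders, which is why stubs must agree on a common wire representation when little-endian and big-endian machines exchange data.

```java
// Minimal sketch: one integer, two byte orders.
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class Endianness {
    public static void main(String[] args) {
        int value = 0x0A0B0C0D;

        byte[] big = ByteBuffer.allocate(4)
                .order(ByteOrder.BIG_ENDIAN).putInt(value).array();
        byte[] little = ByteBuffer.allocate(4)
                .order(ByteOrder.LITTLE_ENDIAN).putInt(value).array();

        System.out.println(Arrays.toString(big));    // [10, 11, 12, 13]
        System.out.println(Arrays.toString(little)); // [13, 12, 11, 10]
    }
}
```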

- As the client and server applications are in different address spaces on the client and server machines, passing a pointer as a parameter is not possible. Instead of a pointer to data, it is possible to send the data itself as a parameter. In the C language, an array is passed to a function as a reference parameter. Hence it is necessary to send the complete array as a parameter to the server, as the array address will be different on the server machine.

- The server stub at the server machine then creates a pointer to this data (the array passed from the client side) and hands it to the server application. The server then sends the modified data back to the server stub, which in turn sends it to the client stub via the kernel. This works for simple data types, but not for complex data structures like graphs. In this scheme, the calling sequence of call-by-reference is replaced by copy/restore.

- In weakly typed languages like C, it is entirely legal to write a procedure that computes the inner product of two arrays without specifying how large either is. Each could be terminated by a special value known only to the calling and called procedures. In this case, it is practically impossible for the client stub to marshal (pack) the parameters, as it has no way to know how large they are.

- Sometimes it is impossible to deduce the types of the parameters, even from a formal specification or the code itself. Marshalling the parameters of a procedure like printf is nearly impossible, as it takes any number of parameters of mixed types. Further, since caller and callee run in different address spaces, the calling and called procedures cannot share global variables.

- Despite the above limitations, and by taking care of the above issues, RPC is still used widely.

2.3.3 Asynchronous RPC
- Sometimes it is not necessary for the client application to block when there is no reply to return from the server application, for example when transferring an amount from one account to another or updating database entries. In such cases the client can continue to carry out useful work after sending the request to the server, without waiting for the reply.
- Asynchronous RPC allows the client to continue its work after calling a procedure at the server. Here, the client immediately continues to perform other useful work after sending the RPC request. The server immediately sends an acknowledgement to the client the moment the request is received, and then calls the requested procedure. After receiving the acknowledgement, the client continues its work without further blocking. Thus the client waits only until the acknowledgement of receipt of the request arrives from the server.

- In one-way RPC, the client immediately continues its work without even waiting for an acknowledgement from the server for receipt of the sent request.
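A minimal sketch of this acknowledge-then-continue behaviour follows; the single-thread executor standing in for the server and the latch used as the acknowledgement are assumptions made for the example.

```java
// Minimal sketch (assumed names): asynchronous RPC as described above --
// the server acknowledges receipt at once, the client resumes after the
// acknowledgement, and the procedure runs afterwards.
import java.util.concurrent.*;

public class AsyncRpc {
    public static void main(String[] args) throws Exception {
        ExecutorService server = Executors.newSingleThreadExecutor();
        CountDownLatch ack = new CountDownLatch(1);

        server.submit(() -> {
            ack.countDown();                 // acknowledge receipt immediately
            System.out.println("server: executing requested procedure");
        });

        ack.await();                         // client blocks only for the ack
        System.out.println("client: ack received, continuing other work");
        server.shutdown();
    }
}
```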

2.3.4 DCE RPC


- Distributed Computing Environment (DCE) RPC is one example of an RPC system, developed by the Open Software Foundation (OSF). Its specifications have been adopted in Microsoft's base system for distributed computing. DCE is true middleware, initially designed for UNIX; it has now been ported to all major operating systems.

- DCE uses the client-server model as its programming model, where user processes are clients that access services remotely from server processes on the server machine. Some services are built into DCE and others are implemented by application programmers. Client and server communicate through RPC.
Services provided by DCE are :
o Distributed File Service : It offers a transparent way to access any file in the system in the same manner.
o Directory Service : It keeps track of all resources distributed worldwide, including machines, printers, data, files, servers and many more.
o Security Service : It offers access to the resources only to authorized persons.
o Distributed Time Service : To synchronize the clocks of different machines.
Goals and Working of DCE RPC

- The RPC system allows a client to call a local procedure in order to access a remote service from a server. The RPC system can automatically locate the server and establish the connection between the client and server applications (binding).


- It can fragment and reassemble messages, handle transport in both directions, and convert data types between client and server if the machines on which they run have different architectures and hence different data representations.

- Clients and servers can be developed independently of one another. Implementation of clients and servers in different languages is supported, and the OS and hardware platform can differ between the client and server applications. The RPC system hides all these dissimilarities.

- The DCE RPC system contains several components such as libraries, languages, daemons, utility programs, etc. Interface definitions are specified in the Interface Definition Language (IDL). IDL files contain type definitions, constant declarations, function prototypes, and information to pack parameters and unpack results. Interface definitions contain the syntax of calls, not their semantics.

- Each IDL file contains a globally unique identifier for the specified interface. The client sends this identifier to the server in the first RPC message. The server then checks its correctness for binding purposes, otherwise it reports an error.

Steps in Writing a Client Server Application in DCE RPC

- First, call the uuidgen program to generate a prototype IDL file containing a unique interface identifier that this program never generates anywhere else. This unique interface identifier is a 128-bit binary number represented as an ASCII string in hexadecimal. It encodes the location and time of creation to guarantee uniqueness.

- The next step is to edit this IDL file and write the remote procedure names and parameters in it. The IDL file is then compiled using the IDL compiler. After compilation, three files are generated: a header file, a client stub and a server stub.

- The header file contains type definitions, constant declarations, function prototypes and the unique identifier. This file is included (#include) in the client and server code. The client stub holds the procedures which the client will call on the server.

- These procedures collect and marshal the parameters and convert them into an outgoing message. The stub then calls the runtime system to send this message. The client stub is also responsible for unmarshaling the reply from the server and delivering it to the client application. The server stub at the server machine contains procedures which are called by the runtime system there when a message from the client side arrives; these in turn call the actual server procedure.

- Finally, write the server and client code. The client code and client stub are compiled into object files which are linked with the runtime library to produce an executable binary for the client. Similarly, at the server machine, the server code and server stub are compiled into object files which are linked with the runtime library to produce an executable binary for the server.

Binding Client to Server in DCE RPC

The client should be able to call the server, and the server should accept the client's call. For this purpose, registration of the server is necessary. The client locates first the server machine, and then the server process on that machine.

Port numbers are used by the OS on the server machine to differentiate incoming messages for different processes. The DCE daemon maintains a table of (server, port number) pairs. The server first asks the OS for a port number and registers this endpoint with the DCE daemon. The server also registers with the directory service, providing it the network address of the server machine and the name of the server. Binding a client to a server is carried out as shown in Fig. 2.3.2.


[Figure: DCE binding. (1) The server registers its port number in the DCE daemon's endpoint table on the server machine; (2) the server registers its service with the directory server on the directory machine; (3) the client looks up the server at the directory machine; (4) the client asks the DCE daemon on the server machine for the server's port number; (5) the client does the RPC]

Fig. 2.3.2 : Binding a client to a server in DCE RPC
The client passes the name of the server to the directory server, which returns the network address of the server machine. The client then contacts the DCE daemon on that machine to get the port number of the server running there. Once the client knows both the network address and the port number, the RPC takes place.
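The two lookups can be sketched as table queries. This is only an illustration: the server name mathServer, the address and the port are invented for the example, and the maps stand in for the directory service and the DCE daemon's endpoint table.

```java
// Minimal sketch (assumed names): the two lookups a DCE client performs
// before an RPC -- the directory service for the server's host, then that
// host's DCE daemon for the server's port number.
import java.util.Map;

public class Binding {
    static Map<String, String> directory = Map.of("mathServer", "10.0.0.7");
    static Map<String, Integer> dceDaemon = Map.of("mathServer", 6112);

    public static void main(String[] args) {
        String host = directory.get("mathServer");   // step 3: look up server
        int port = dceDaemon.get("mathServer");      // step 4: ask the daemon
        System.out.println("binding to " + host + ":" + port + ", then do RPC");
    }
}
```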

2.4 Remote Object Invocation


An object hides its internal details from the external world by means of a well-defined interface. CORBA and DCOM are examples of object-based distributed systems.

2.4.1 Distributed Objects (RMI - Remote Method Invocation)
- Objects encapsulate data, called the state, and operations on those data, called methods. Methods are invoked through an interface. A process can access or manipulate the state only through the object's interface. A single object may implement several interfaces; similarly, for a given interface definition, several objects may provide an implementation for it.
[Figure: a client machine holding a proxy that offers the same interface as the object, and a server machine holding the object's state and methods behind an interface and a skeleton; the client invokes a method on the proxy, the marshalled invocation is passed across the network, and the skeleton invokes the same method at the object]

Fig. 2.4.1 : Organization of a remote object with a client-side proxy


- We can keep an object's interface on one machine and the object itself on another machine. This organization is referred to as a distributed object. The implementation of the object's interface is called a proxy. It is similar to the client stub in RPC and resides in the client's address space.

- The proxy marshals the client's method invocation into a message and unmarshals reply messages, which contain the result of the method invocation by the client. The proxy then returns this result to the client. Similar to the server stub in RPC, a skeleton is present on the server machine.

The skeleton unmarshals incoming method invocation requests into proper method invocations at the object's interface at the server application. It also marshals reply messages and forwards them to the client-side proxy. The object remains on a single machine, and only its interface is made available on other machines. This is also called a remote object.
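A proxy of this kind can be sketched with Java's dynamic proxies. This is only an illustration: the Counter interface is invented for the example, and the "network" is a direct call instead of a marshalled message sent to a skeleton.

```java
// Minimal sketch: a client-side proxy implementing the same interface as
// the remote object. A real proxy would marshal the invocation and send it
// across the network to the skeleton.
import java.lang.reflect.Proxy;

interface Counter { int increment(); }

public class ProxyDemo {
    public static void main(String[] args) {
        int[] state = {0};                     // stands in for the server object

        Counter proxy = (Counter) Proxy.newProxyInstance(
            Counter.class.getClassLoader(),
            new Class<?>[] { Counter.class },
            (p, method, params) -> {
                // Here the invocation would be marshalled and shipped off;
                // the skeleton would unmarshal it and invoke the object.
                System.out.println("proxy: forwarding " + method.getName());
                return ++state[0];
            });

        System.out.println(proxy.increment()); // looks like a local call
        System.out.println(proxy.increment());
    }
}
```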

Compile-time Versus Runtime objects

- Language-level objects are called compile-time objects. These objects are supported by object-oriented languages such as Java and C++. An object is an instance of a class. Compile-time objects make it easy to build distributed applications in a distributed system.

- In Java, objects are defined by means of a class and the interfaces that the class implements. After compiling the interfaces, we get a server stub and a client stub. The code generated by compiling the class definition permits instantiating Java objects; hence, Java objects can be invoked from a remote machine. The disadvantage of compile-time objects is their dependency on a particular programming language.
- Runtime objects are independent of the programming language and are explicitly constructed at runtime. Many distributed object-based systems use this approach, which allows construction of applications from objects written in multiple languages. The implementation of runtime objects is basically left open.

- An object adapter is used as a wrapper around an implementation so that it appears to be an object whose methods can be invoked from a remote machine. Here, objects are defined in terms of the interfaces they implement. The implementation of an interface is then registered with an object adapter, which makes the interface available for remote invocation.
Persistent and Transient Objects

- A persistent object continues to exist even after the server exits. The server currently managing a persistent object stores the object's state on secondary storage before it exits. If a newly started server needs this object, it reads the object's state from secondary storage.

- In contrast to a persistent object, a transient object exists only as long as the server managing it exists. Once the server exits, the transient object also ceases to exist.

Object References

- A client invokes the methods of a remote object. The client binds with an object using an object reference, which must contain sufficient information to permit this binding. The object reference includes the network address of the machine on which the object is placed, plus the port number (endpoint) of the server as assigned by its local operating system.

- The limitation of this approach is that when the server crashes, it will be assigned a different port number by the OS after recovery. In this case, all the object references become invalid.


- The solution to this problem is to have a local daemon process on each machine which listens to a well-known port number and records the server-to-port-number assignments in an endpoint table. In this case, the server ID encoded in the object reference is used as an index into the endpoint table. While binding a client to an object, the daemon process is asked for the server's current port number. The server should always register with the local daemon.
- The problem with encoding the network address in an object reference is that if the server moves to another machine, then again all the object references become invalid. This problem can be solved by keeping a location server which keeps track of the machine where the server is currently running. In this case, the object reference contains the address of the location server and a systemwide identifier for the server.
- An object reference may also include more information, such as the identification of the protocol that is used to bind with the object and the protocols supported by the object server, for example TCP or UDP. An object reference may also contain an implementation handle, which refers to a complete implementation of the proxy that the client can dynamically load when binding to the object.

Parameter Passing
- Objects can be accessed from remote machines. Object references can be used as parameters to method invocations, and these references are passed by value. As a result, object references can be copied from one machine to another. If a process holds an object reference, it can bind to the object whenever required.

- It is not efficient to use only distributed (remote) objects. Some objects, such as integers and Booleans, are small. Therefore it is important to distinguish references to local objects from references to remote objects. In a remote method invocation, if an object is a remote object, its object reference is passed as a parameter; this reference is copied and passed as a value parameter. Whereas, if the object is in the same address space as the client, the entire object is passed along with the method invocation; that is, the object is passed by value.

- As an example, suppose client code is running on Machine 1 holding a reference to a local object O1, and server code is running on Machine 2. Let a remote object O2 reside on Machine 3. Suppose the client on Machine 1 calls the server program with object O1 as a parameter to the call. The client on Machine 1 also holds a reference to the remote object O2 on Machine 3 and uses O2 as another parameter to the call. In this invocation, a copy of object O1 and a copy of the reference to O2 are passed as parameters while calling the server on Machine 2. Hence, local objects are passed by value and remote objects are passed by reference.

Example : Java RMI


In Java, distributed objects have been integrated into the language itself, and the goal was to keep the semantics as close as possible to those of nondistributed objects. The main focus is given to a high degree of distribution transparency.

The Java Distributed Object Model


- Java supports a distributed object as a remote object which resides on a remote machine. The interface of a remote object is implemented by means of a proxy, which is a local object in the client's address space. The proxy offers precisely the same interface as the remote object.
- Cloning an object creates an exact copy of the object and its state. Cloning a remote object would also require cloning the proxies that are currently bound to it. Therefore, cloning is carried out only by the server, which creates an exact copy of the object in the server's address space. In this case there is no need to clone the proxies of the actual object; a client only has to bind to the cloned object in order to access it.


- In Java, a method can be declared as synchronized. If two processes try to call a synchronized method at the same time, only one process is permitted to proceed and the other is blocked. In this manner, access to the object's internal data is entirely serialized.
- This blocking of a process on a synchronized method can happen at the client side or at the server side. At the client side, the client is blocked in the client-side stub which implements the object's interface. As clients on different machines would each be blocked at their own client side, synchronization among them is required, which is complex to implement.
is executing its invocation then
Server side blocking of client process is possible. But, if client crashes while server
to proxies.
protocol is needed to handle this situation. In Java RMI blocking on remote object is restricted
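A minimal sketch of a synchronized method serializing access to an object's internal state follows; the counter class is invented for the example.

```java
// Minimal sketch: a synchronized method serializes access to the object's
// internal data. Two threads increment a shared counter; synchronization
// keeps the final value exact.
public class SyncDemo {
    private int value = 0;                  // the object's internal state

    public synchronized void increment() {  // one caller at a time
        value++;
    }

    public static void main(String[] args) throws InterruptedException {
        SyncDemo counter = new SyncDemo();
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) counter.increment();
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(counter.value);  // always 200000
    }
}
```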

Java Remote Object Invocation

- In Java, any primitive or object type can be passed as a parameter to an RMI if it can be marshaled; such objects are said to be serializable. Most object types are serializable, except platform-dependent objects. In RMI, local objects are passed by value and remote objects are passed by reference.

- In Java RMI, an object reference includes a network address, the port number of the server and an object identifier. A server class and a client class are the two classes used to build a remote object. The server class contains the implementation of the remote object that runs on the server: a description of the object's state and the implementation of the methods that operate on that state. The skeleton is generated from the interface specification of the object.

- The client class contains the implementation of the client-side code, namely the proxy, and is also generated from the interface specification of the object. The proxy packs a method invocation into a message, which it then sends to the server. The reply from the server is converted by the proxy into the result of the method invocation. The proxy stores the network address of the server machine, the port number and the object identifier.

- In Java, proxies are serializable. A proxy can be passed to a remote process in a message, and that process can then use the proxy for method invocation on the remote object. A proxy can thus be used as a reference to a remote object. A proxy is treated as a local object, and hence can be passed as a parameter in an RMI.

- As the size of a marshaled proxy can be large, an implementation handle is generated and used to download the classes needed to construct the proxy. The implementation handle thus replaces the marshaled code as part of the remote object reference. Since a Java virtual machine runs on every machine, a marshaled proxy can be executed after unmarshaling it on a remote machine.
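A minimal sketch of a remote object using the standard java.rmi API follows; the Echo interface, the registry name "echo" and port 1099 are choices made for the example, not from the text.

```java
// Minimal sketch: a remote interface and its server-side implementation.
// A client on another machine would look up "echo" in the registry and
// invoke the method through the returned proxy (stub).
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

interface Echo extends Remote {
    String echo(String msg) throws RemoteException;   // remote method
}

public class EchoServer implements Echo {
    public String echo(String msg) { return "echo: " + msg; }

    public static void main(String[] args) throws Exception {
        // Export the object so it can receive remote invocations,
        // then register the resulting proxy (stub) under a name.
        Echo stub = (Echo) UnicastRemoteObject.exportObject(new EchoServer(), 0);
        Registry registry = LocateRegistry.createRegistry(1099);
        registry.rebind("echo", stub);
        System.out.println("EchoServer ready");
        // Client side (possibly another JVM):
        // Echo e = (Echo) LocateRegistry.getRegistry("host", 1099).lookup("echo");
        // System.out.println(e.echo("hi"));
    }
}
```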

2.5 Message-Oriented Communication

In Remote Procedure Call (RPC) and Remote Method Invocation (RMI), communication is inherently synchronous: the client blocks until the reply from the server arrives. For communication between client and server, both the sending and receiving sides should be executing; the messaging system assumes both applications are running while communication is going on. A message-queuing system, by contrast, permits processes to communicate even though the receiving side is not running at the time communication is initiated.

2.5.1 Persistence and Synchronicity in Communication


- In message-oriented communication, we assume that applications execute on hosts. These hosts offer an interface to the communication system through which messages are submitted for transmission.

- All the hosts are connected through a network of communication servers. These communication servers are responsible for forwarding messages between hosts.

- A buffer is available at each host and at each communication server. Consider an email system with the above configuration of hosts and communication servers. Here, a user agent program runs on the host, using which users can read, compose, send or receive messages.
- The user agent program at a host submits a message to its local mail server for transmission; the interface permits sending a message to any destination. The mail server stores the received message in an output buffer and looks up the transport-level address of the destination mail server. It then establishes a connection with the destination mail server and forwards the message, removing it from the output buffer.
- At the destination mail server, the message is put into an input buffer in order to deliver it to the designated receiver. The user agent at the receiving host can check for incoming messages on a regular basis by using the services of the interface available there. In this case, messages are buffered at the communication servers (the mail servers).
- A message sent by the source host is buffered by a communication server as long as it is not yet successfully delivered to the next communication server. Hence, after submitting a message for transmission, the sending application need not continue execution, as the message is stored by the communication server. Also, it is not necessary for the receiving application to be executing when the message was submitted. This form of communication is called persistent communication. Email is an example of persistent communication.
- In transient communication, the communication system (the communication server in the discussion above) stores a message only if both the sender and receiver applications are executing; otherwise the message is discarded. In the example discussed above, if a communication server is not able to deliver a message to the next communication server or to the receiver, the message is discarded.

- All transport-level communication services offer transient communication, and in this communication the router plays the role of the communication server: a router simply drops a message if it is not able to forward it to the next router or to the destination host.
destination host. ion. This sent


ting message for transmiss
r con tinues execution after submit
m munication, sende n server. In synchronous com
munication,
In asynchronous co din g host or at first communicatio
buffer of sen t or deliveredto receiving app
lication.
remains | n al buffer of destination hos
message either ed in lo c
stor
message is
sender is bl ocked until
munication TYPES
2.5.2 Combinati
on of com
of commun ication
ons O f different types
nati n, message is persistently stored in buffer of
4
Following are the © ombi th is type of communicatio
n: In
unicatio stent asynchronous communication.
er. Email is the example of persi
Se

Persistent Asynchr munic at io n se rv


irst com which process A conti
nuesits execution after sending the message.
local host or in buffe r off cati on in
e of commu ni
this 8YP n it will start running.
Fig. 2.5.1 shows e whe ae
the messaB
thi B rece ives
ae ea rl
eee eas

After r this,
ee

Ty
: ALL
en

| Q OF FeetRowuleag
444

ee
4

Scanned with CamScanner


aa Uy 4
q Ves igs= * Mh, \ :- / - je

Distributed Computing(MU-Sem 8-Comp.) Communication 4


[Figure: A sends a message and continues running, then stops; the message is buffered while B is not running; B later starts and receives the message]

Fig. 2.5.1 : Persistent asynchronous communication

- Persistent Synchronous Communication : In this type of communication, the sender is blocked until its message is stored in the receiver's buffer. The receiver-side application may not be running at the time the message is stored. In Fig. 2.5.2, process A sends a message and is blocked until the message is stored in the buffer at B's location. Note that process B starts running after some time and receives the message.
[Figure: A sends a message and waits until it is accepted; the message is stored at B's location for later delivery; B is not running, starts later, and receives the message]

Fig. 2.5.2 : Persistent synchronous communication

- Transient Asynchronous Communication : In this type of communication, the sender immediately continues its execution after sending the message. The message may remain in a local buffer at the sender's host; the communication system then routes the message to the destination host. If the receiver process is not running at the time the message arrives, the transmission fails. Examples are transport-level datagram services such as UDP, and one-way RPC. In Fig. 2.5.3, process A sends a message and continues its execution; at the receiving host, B receives the message only if B is executing at the time the message arrives.

[Figure: A sends a message and continues; the message can be delivered only if B is running at the time it arrives, in which case B receives it]

Fig. 2.5.3 : Transient asynchronous communication
- Receipt-Based Transient Synchronous Communication : In this type of communication, the sender is blocked until the message is stored in a local buffer at the receiving host. The sender continues its execution after it receives an acknowledgement of receipt of the message. It is shown in Fig. 2.5.4.


[Figure: A sends a request and waits until it is received; B is running but doing something else, and processes the request later; A resumes as soon as receipt is acknowledged]

Fig. 2.5.4 : Receipt-based transient synchronous communication
- Delivery-Based Transient Synchronous Communication : In this type of communication, the sender is blocked until the message is delivered to the target process. It is shown in Fig. 2.5.5. Asynchronous RPC is an example of this type of communication.
[Figure: A sends a request and waits until it is accepted; B is running but doing something else, and processes the request after accepting it; A resumes on acceptance]

Fig. 2.5.5 : Delivery-based transient synchronous communication
- Response-Based Transient Synchronous Communication : In this type of communication, the sender is blocked until a response is received from the receiver process. It is shown in Fig. 2.5.6. RPC and RMI stick to this type of communication.
[Figure: A sends a request and waits for the reply; B is running but doing something else, then processes the request and sends the reply, which unblocks A]

Fig. 2.5.6 : Response-based transient synchronous communication
- In a distributed system, there is a need for all these communication types, as per the requirements of the applications to be developed.

2.5.3 Message-Oriented Transient Communication

The transport layer offers a message-oriented model using which several applications and distributed systems are built. The transport-level socket is an example of messaging.

Berkeley Sockets
- The transport layer allows programmers to use all the messaging protocols through a simple set of primitives. This is possible due to the focus given to standardizing the interfaces. Standard interfaces also allow porting an application to different computers. The socket interface was introduced in Berkeley UNIX.

- A socket is a communication endpoint to which an application writes data to be sent to applications running on other machines in the network. An application also reads incoming data from this communication endpoint. The following are the socket primitives for TCP.
a. , , r
- Socket : This primitive is executed by the server to create a new communication endpoint for a particular transport protocol. The local operating system internally reserves resources for incoming and outgoing messages for this specified protocol. This primitive is also executed by the client to create a socket at the client side, but there binding a local address to the socket is not required, as the operating system dynamically allocates a port when the connection is set up.

- Bind : This primitive is executed by the server to bind the IP address of the machine and a port number, possibly a well-known port, to the created socket. This binding informs the OS that the server will receive messages only on the specified IP address and port number.

- Listen : The server calls this primitive only in connection-oriented communication. This call permits the operating system to reserve sufficient buffers for the maximum number of connections that the caller is willing to accept. It is a nonblocking call.

- Accept : This primitive is executed by the server, and a call to it blocks the caller until a connection request arrives. The local operating system then creates a new socket with the same properties as the original one and returns it to the caller. This permits the server to fork off a process to handle the actual communication through the new connection; the server then goes back and waits for a new connection request on the original socket.

- Connect : This primitive is executed by the client, with the transport-level address of the server, in order to send a connection request. The client is blocked until the connection is set up successfully.

- Send : This primitive is executed by both client and server to send data over the established connection.

- Receive : This primitive is executed by both client and server to receive data over the established connection.

- Both client and server exchange information through the send and receive primitives, which accomplish the sending and receiving of data.

- Close : Client and server both call the close primitive to close the connection.
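These primitives map naturally onto Java's java.net sockets, as the following minimal sketch shows; port 5000 and the echoed text are choices made for the example.

```java
// Minimal sketch: the TCP primitives above expressed with java.net sockets.
// new ServerSocket() plays socket/bind/listen, accept() blocks for a
// connection, and the client's Socket constructor plays connect.
import java.io.*;
import java.net.ServerSocket;
import java.net.Socket;

public class EchoSockets {
    public static void main(String[] args) throws IOException {
        ServerSocket server = new ServerSocket(5000);   // socket+bind+listen

        new Thread(() -> {                              // client side
            try (Socket s = new Socket("localhost", 5000);  // connect
                 PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                out.println("hello");                   // send
            } catch (IOException e) { e.printStackTrace(); }
        }).start();

        try (Socket conn = server.accept();             // accept blocks
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(conn.getInputStream()))) {
            System.out.println("received: " + in.readLine());  // receive
        }
        server.close();                                 // close
    }
}
```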

Message Passing Interface

- The Message Passing Interface (MPI) was discussed in section 2.2.2. It uses messaging primitives to support transient communication. The following are some of the messaging primitives used.

- MPI_bsend : This primitive is executed by the sender of a message to append the outgoing message to a local send buffer of the MPI runtime system. The sender continues after copying the message into the buffer. The transient asynchronous communication of Fig. 2.5.3 is supported through this primitive.

- MPI_send : This primitive is executed by the sender to send a message and then wait, either until the message is copied into a local buffer of the MPI runtime or until the receiver has initiated a receive operation. The first case is given by Fig. 2.5.4 and the second case by Fig. 2.5.5.

- MPI_ssend : This primitive is executed by the sender to send a message and then wait until receipt of the message has started at the receiver. This case is given in Fig. 2.5.5.

- MPI_sendrecv : This primitive supports the synchronous communication given in Fig. 2.5.6: the sender is blocked until the reply arrives from the receiver.

- MPI_isend : This primitive avoids copying the message from the user's buffer to the MPI runtime buffer. The sender passes a pointer to the message, and after this the MPI runtime system handles the communication.

- MPI_issend : The sender passes a pointer to the MPI runtime system. Once message processing is done by the MPI runtime system, the sender is guaranteed that the receiver has accepted the message for processing.

- MPI_recv : It is called to receive a message. It blocks the caller until the message arrives.

- MPI_irecv : Same as above, but here the receiver only indicates that it is ready to accept a message; the receiver later checks whether the message has arrived or not. It supports asynchronous communication.
2.5.4 Message-Oriented Persistent Communication
Message-queuing systems, or Message-Oriented Middleware (MOM), support persistent asynchronous communication. In this type of communication, the sender or receiver of a message need not be active during message transmission: the message is stored in intermediate storage.

Message-Queuing Model

- In a message-queuing system, messages are forwarded through many communication servers. Each application has its own private queue, to which other applications send messages. In this system, a guarantee is given to the sender that its sent message will be delivered to the recipient's queue; the message can be delivered at any time.

- In this system, the receiver need not be executing when a message arrives in its queue from the sender. Also, the sender need not be executing after its message is delivered to the receiver. In the exchange of messages between sender and receiver, the following execution modes are possible:

o Sender and receiver both are running.
o Sender running but receiver passive.
o Sender passive but receiver running.
o Both sender and receiver are passive.

- A systemwide unique name is used as the address of the destination queue, transparent to the communicating applications. If the message size is large, the underlying system fragments and assembles the messages transparently. This leads to a simple interface being offered to applications. Following are the primitives offered:

o Put : The sender calls this primitive to pass a message to the system in order to put it in the designated queue. This call is nonblocking.
o Get : It is a blocking call by which a process having the proper rights removes a message from the queue. If the queue is empty, the process blocks.
o Poll : It is a nonblocking call; the process executing it polls for an expected message. If the queue is empty or the expected message is not found, the process continues.
o Notify : With this primitive, an installed handler is called so that the receiver can check the queue for a message.
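A minimal sketch of these primitives using java.util.concurrent.BlockingQueue follows; mapping Notify onto a handler thread is a simplification made for the example, as real MOM systems register handlers with the queue manager.

```java
// Minimal sketch: offer() for the nonblocking Put, take() for the blocking
// Get, poll() for Poll, and a handler thread emulating Notify.
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueuePrimitives {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        System.out.println(queue.poll());       // Poll: empty -> null, no block
        queue.offer("m1");                      // Put: nonblocking append

        // Notify (emulated): a handler that runs when a message is available.
        Thread handler = new Thread(() -> {
            try {
                String msg = queue.take();      // Get: blocks until a message
                System.out.println("handler got: " + msg);
            } catch (InterruptedException e) { /* shutting down */ }
        });
        handler.start();
        handler.join();
    }
}
```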

General Architecture of a Message-Queuing System

- In a message-queuing system, the source queue is present either on the local machine of the sender or on the same LAN. Messages can be read from a local queue. A message put in a queue for transmission contains the specification of a destination queue. The message-queuing system provides queues to the senders and receivers of messages.

- The message-queuing system keeps a mapping of queue names to network locations. A queue manager manages queues and interacts with the applications which send and receive messages. Special queue managers work as routers or relays, which forward messages to other queue managers. Relays are convenient because in several message-queuing systems dynamic mapping of queues to locations is not available.

- Without relays, each queue manager would need a copy of the queue-to-location mapping; for a larger message-queuing system, this approach leads to a network management problem. With relays, only the routers need to be updated after queues are added or removed.
- Relays thus help in the construction of scalable message-queuing systems. Fig. 2.5.7 shows the relationship between queue-level addressing and network-level addressing.
[Figure: sender and receiver exchange messages through a queuing layer running on top of the local OS on each side; the queuing layer looks up the transport-level address of a queue from its queue-level address using an address look-up database, and the message travels at the transport-level address across the network]

Fig. 2.5.7 : Relationship between queue-level and network-level addressing
Message Brokers

- Each application that is added to a message-queuing system is required to understand the format of the messages received from other applications. Meeting this requirement helps fulfil the goal of integrating new and old applications into a single coherent distributed system.

- For diverse applications, messages are treated simply as sequences of bytes, so conversion between application formats is needed to achieve the above goal. In message-queuing systems, message brokers, nodes which act as application-level gateways, are dedicated to the conversion of messages. A broker converts incoming messages into the formats understandable by the destination applications. Message brokers are external to the message-queuing system and not an integral part of it.

- In more advanced settings of conversion, some information loss is expected, for example in conversion between X.400 and Internet email messages. A message broker maintains a database of rules to convert a message from one format into another. These rules are manually inserted into this database.

Example : IBM's WebSphere Message-Queuing System

- Fig. 2.5.8 shows the general organization of IBM's message-queuing system. MQSeries is the IBM product which is now known as WebSphere MQ. It puts messages into send queues and extracts messages from the appropriate incoming queues; it can also forward messages to other queue managers. The receiving client picks up the incoming messages from its receive queue.

[Figure: client programs linked through the MQ Interface and stubs, communicating with a queue manager by RPC (synchronous) over the local network; the queue manager holds the send queue, routing table and the client's receive queue, and its message channel agents (MCAs) pass messages asynchronously over the enterprise network to other remote queue managers]

Fig. 2.5.8 : General organization of IBM's message-queuing system

- The maximum default size of a message is 4 MB, but it can be increased up to 100 MB. The normal queue size is 2 GB, but it can be more depending on the underlying operating system. The connection between two queue managers is through message channels, which is an idea taken from transport-level connections. A message channel is a unidirectional, reliable connection between queue managers, which act as sender and receiver of messages. Through this connection, queued messages are transported.

- A message channel agent (MCA) manages each of the two ends of a message channel. The MCA at the sender side checks the send queue for a message; if one is found, it wraps the message in a transport-level packet and sends it to the receiving MCA along the connection. On the other hand, the receiving MCA waits for an incoming packet, unwraps it, and stores it in the appropriate queue.

- A queue manager and an application can be linked into the same process; the application then communicates with the queue manager through a standard interface. In other organizations, the queue manager and the application can be kept on separate machines.

- Each message channel has exactly one send queue. The sending MCA fetches messages from this queue to send, and they are received at the receiver end. If both sender and receiver MCAs are up and running, message transfer along the channel is possible. Both MCAs can be started manually.

- The application can also activate the sender MCA by setting off a trigger when a message is put in the send queue. The trigger is associated with a handler to start the sending MCA. In other cases, a control message can be sent to start the MCA at the other end when one end is already started.

- There are attributes associated with message channel agents. The attribute values should be compatible between the sending and receiving MCAs, and their negotiation takes place prior to setting up the channel. Following are some attributes associated with message channels:

o Transport type : Determines the transport protocol to be used.
o FIFO delivery : Delivery of messages will be in FIFO order.
o Message length : Maximum length of a message.
o Setup retry count : Sets the maximum number of retries to start up the receiving MCA.
o Delivery retries : Maximum number of times the MCA tries to put a received message into a queue.

Message Transfer
- A message should carry a destination address, used by the sending queue manager to send the message to the receiving queue manager. The first part of the address consists of the name of the receiving queue manager. The second part is the name of the destination queue at that queue manager, to which the message is to be appended.

- The route is specified by having the name of a send queue in the message; it determines to which queue manager the message is to be forwarded. Each queue manager (QM) maintains a routing table with entries of the form (destQM, sendQ). In such an entry, destQM is the name of a destination queue manager and sendQ is the name of the local send queue into which a message is put in order to forward it to that destination queue manager.

- A message travels through many queue managers before reaching its destination. An intermediate queue manager extracts the name of the destination QM from the header of a received message and searches its routing table for the name of the send queue to append the message to.

- Each queue manager has a systemwide unique identifier, which acts as a unique name for that queue manager. If we replace a QM, applications sending messages may face problems. This problem is solved by using local aliases for queue manager names.
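The routing step can be sketched as a table lookup. The queue manager names QMB and QMC and the queue names are invented for the example.

```java
// Minimal sketch (assumed names): a queue manager's routing step -- read
// the destination queue manager from the message header and look up the
// local send queue to append the message to.
import java.util.Map;

public class QueueRouting {
    // Routing table: destination queue manager -> local send queue.
    static Map<String, String> routingTable = Map.of(
            "QMB", "sendQ-to-B",
            "QMC", "sendQ-to-B");   // C is reached via B

    public static void main(String[] args) {
        String destQM = "QMC";                       // from the message header
        String sendQ = routingTable.get(destQM);     // where to append it
        System.out.println("append message to " + sendQ);
    }
}
```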

2.6 Stream-Oriented Communication

In much communication, applications expect incoming information to be received within precisely defined time limits, for example audio and video streams in multimedia communication. Many protocols are designed to take care of this issue. A distributed system offers exchange of such time-dependent information between senders and receivers.

2.6.1 Continuous Media Support

The representation of information in storage or presentation media (such as a monitor) differs. For example, text information is stored in ASCII or in Unicode format, and images can be in GIF or JPEG format. The same is true for audio information. Media are categorized as continuous and discrete representation media. In continuous media, the temporal relationship between different data items matters for inferring their actual meaning. In discrete media, the timing relationship between different data items does not matter.

Data Stream

- A sequence of data units is called a data stream. The notion applies to both continuous and discrete media. Transmission modes differ in terms of timing aspects for continuous and discrete media.

- In synchronous transmission mode, a maximum end-to-end delay is defined for every unit in a data stream. A data unit can be received at any time within the defined maximum end-to-end delay; this causes no harm at the receiver side, and it is not important here that data units in the stream may be delivered faster than the maximum tolerated delay.

- In isochronous transmission mode, data units must be transferred on time: both a maximum and a minimum end-to-end delay are defined (bounded jitter). This mode is important in distributed multimedia systems to represent audio and video.

- A simple stream contains only a single sequence of data. A complex stream consists of many such simple streams (substreams). These substreams need synchronization between them: a time-dependent relationship is present between the substreams in a complex stream. An example of a complex stream is transferring a movie. This stream comprises one video stream and two streams to transfer the movie sound in stereo; another substream may carry subtitles to display for the deaf.

- A stream is a virtual connection between a source (sender) and a sink (receiver). The source and sink could be a process or a device. For example, a sender process on one machine may read data byte by byte from a hard disk and send it across the network to another process on another machine; in turn, this receiver process may deliver the received bytes to a local device.

- In multiparty communication, a data stream is multicast to many sinks. To adjust to the quality-of-service requirements of the different receivers, a stream can be configured with filters.

2.6.2 Streams and Quality of Service (QoS)

- It is necessary to preserve the temporal relationships in a stream. For a continuous data stream, timeliness, volume and reliability decide the quality of service. There are many ways to state QoS requirements.

- A flow specification contains precise requirements for QoS, such as bandwidth, transmission rate, delay, etc. The token bucket algorithm generates tokens at a constant rate. A bucket (buffer) holds these tokens, which represent the number of bytes the application can send across the network. If the bucket is full, newly generated tokens are dropped. The application removes tokens from the buffer to pass data units to the network. The application itself may remain unaware of its precise requirements.
[Figure: an application produces an irregular stream of data units; one token is added to the bucket every interval ΔT; passing data units to the network consumes tokens, producing a regular stream]

Fig. 2.6.1 : Token Bucket Algorithm
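A minimal sketch of the token bucket just described follows; the capacity and rate values are invented for the example.

```java
// Minimal sketch: tokens accumulate at a constant rate up to the bucket
// size; passing data units to the network consumes tokens, so bursts are
// bounded by the bucket capacity.
public class TokenBucket {
    private final long capacity;      // bucket size, in tokens
    private final long ratePerSec;    // tokens added per second
    private double tokens;
    private long last = System.nanoTime();

    TokenBucket(long capacity, long ratePerSec) {
        this.capacity = capacity;
        this.ratePerSec = ratePerSec;
        this.tokens = capacity;
    }

    // Try to send 'units' data units; returns false if tokens are lacking.
    synchronized boolean trySend(long units) {
        long now = System.nanoTime();
        tokens = Math.min(capacity,
                tokens + (now - last) * ratePerSec / 1e9);  // refill
        last = now;
        if (tokens < units) return false;
        tokens -= units;
        return true;
    }

    public static void main(String[] args) {
        TokenBucket bucket = new TokenBucket(5, 2);  // burst 5, 2 tokens/sec
        for (int i = 0; i < 8; i++)
            System.out.println("unit " + i + " sent: " + bucket.trySend(1));
    }
}
```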
Following are the service requirements and the characteristics of the input specified in a flow specification.
. . _ Hl
- Characteristics of the Input :
o Maximum data unit size in bytes : Defines the maximum size of data units in bytes.
o Token bucket rate in bytes/sec : The rate at which tokens are generated and added to the bucket.
o Token bucket size in bytes : Permits burstiness, as the application is allowed to pass a complete bucket to the network in a single operation.
o Maximum transmission rate in bytes/sec : Limits the rate of transmission to a specified maximum, in order to avoid extreme bursts.

- Service Requirements :
o Loss sensitivity (bytes) and loss interval (microseconds) : Together these define the maximum acceptable loss rate.
o Burst loss sensitivity (data units) : The number of data units that may be lost in a burst.
o Minimum delay noticed (microseconds) : Defines how long the network can delay delivery of data before the receiver notices it.
o Maximum delay variation (microseconds) : The maximum tolerated jitter.
o Quality of guarantee : A number indicating how firm the guarantee must be. If the number is low there is no problem; but for a high number, if the network cannot give the guarantee, the system should not establish the stream.

Setting up a Stream
: iIn
n n network. These resources are
There is no single best model to specify QoS parameters and to describe resources
. oS param
bandwidth,. processing capacity, buffers. Also, there is no single best model to translate these Q05 p eters tg
; : rs.
resource usage. QoS requirements are dependent on services that networkoffer
A
f «4
resources at router for conti nuous
- Resource reSerVation protocol (RSVP)is transport-level protocol which reserves
streams. The sender handover the data stream requirements in terms of bandwidth, delay, jitter etc to RSVP process
on the same machine which then stores it locally. |
RSVP is receiverinitiated protocol for QoS requirements. Here, receiver sends reservation request along the path to
the sender. Sender in RSVP set up path with receiver and gives flow specification that contains QoS requirements such
as bandwidth,delay,jitter etc to each intermediate node,

aes aie aes a ee Se


Receiver when is ready to accept incoming data, it handover flow specification (reservation request) along the
upstream path to the receiver: may be reflectinglower QoS parameter requirement.

At sender side, this reservation request is given by RSVP process to admission control module to check sufficient .
resources available or not. Same request is the passed to policy control module to check whether receiver has
permission to makereservation. Resources are then reservedif these to tests are passed. Fig. 2.6.1 shows organization
of RSVP for resource reservation in distributed system. | |
Senderprocess | &
mere RSVP-enabled host 7
% Poli
Application |} | Mid iS RSVP process
Application |- Ries |
data stream ~~~ a) ES | nh S
fe - RSVP
program

| Local OS ee
Data link layer _| Admission : Reservation requests
y | control = | from other RSVP hosts »
Data link layer 44 7 eo. , .
data stream man nm AR 8
ee Su
oH 4(... Internetwork
Setup information to Local network Nes ee ey
other RSVPhosts
Fig. 2.6.2 : RSVP Protocol

from non significant ones,

Scanned with CamScanner


|
d Computi (MU-Sem 8-Comp )
Distributet Communication
y 4OmMp. 2-24 4=
eS

Distributed system also


Offers buffers ¢ O reduce in buffer at receiver side for
jitter. The delayed packets are stored
-

maximum time which will


be deli vered to icati
application at regular rate. Error detection and correction techniques also
used for which allow Ss retransmissj me W hile
4
delivering to°applications ssion of erroneous packets. The lost packets can be distributed over time

9.6.3 Stream Synchronization

l
In case of compplex te on : stream. Consider
stream,it is necessary to maint
=xdviple of syrichroniation bet ain temporal relations between different data
maintai
er stores slide show
resentation which contei | _ discrete data stream and a continuous data stream. Web serv
e W .
*

ns sli es and audio stream. Slides come from server to client in terms of discrete data
8
stream. The client should play the sameaudio with i here gets synchronized with
wi respectpect to current slide. Audio stream
:
slide presentation.

audio stream. The data units in different streams


oe : ynche 4 eamnese to be synchronized with
at which data stream is
| eaning of data unit depends on level of abstraction
.
of 25 Hz or more. If we consider broadly used NTSC
considered. It is necessary to display video frames at a rate
units that last as long as a video frame is displayed (33
standard of 29.97 Hz, we could group audio samples intological
msec).

Synchronization Mechanisms
consider following twoissues.
For actual synchronization we have to
two streams.
o Mechanisms to synchronize
ms in network environment.
o Distribution of these mechanis
ams. In this approach, application should
l, syn chr oni zat ion is car ried ou t on data items of simple stre
At lowest leve evel facilities available. Other better alter
native
oni zat ion whi ch is not possi ble as onlyit has low-l
implement synchr devices in simple way.
plicatio n an interf ace whi ch permits it to control stream and
could be to offe r an ap zation, a
zed. To carry out synchroni
different su bstreams needs to be synchroni
complex stream,
es is ek ee

At receiver side of ation implicitly by


be available loc ally. Co mmon approachis to offer this inform
ication should
synchronizati on specif data units, including those for synchronization.
MPEG
st re am s in to a single stream having all
multiplexing the different
manner.
nized in this
streams are synchro oul d handle by sender or receiver. In cas
e of sender side9
th er synchronization sh
e ta unit.
am with a different type of da
sue is wh
Another imp ortant is e streams into a single stre
it may be possible
to merg
synchronization,
ion a t o r
icatc
7 o
C m
8Gro m
up u
Co uni
mmn
iple receivers. This type of
y pport for sending data to mult
It is necessar to have su
| vem, ation, many network-level and:
sup rt multicast communic
. '

ation. To po inat e
setup the path to dissem
,

In distributed SYte C8 | led multicast communic


8 * S e
4

s
t eio n pl em en ted. The issues in these solution were to
communicat !5
solution
s have been im
transport-level
the information. to peer solutions are usually
¢ appl icat ion level multicasting techniques as peer
j
solutions set u p communication paths.
Alternative to thes e became eas
ier to
;
n layer. It ne w
deployed at applicatio
UF fekinontedes
7

Scanned with CamScanner


Communication 4
_®_
& Distributed Computing (MU-Sem 8-Comp.) 2-22 4444

. 8ctics,
;
Multicast messages offer goodsolution to build distributed system with following characteristl
o Fault tolerance based on replicated services : Replicated service Is available on group of a nen deere:
operation on this request. some
are multicast to all these servers in group. Each server performs the same
memberof group fails, still client will be served by other active members(servers). |
and clients can use TC eeais a find
© Finding discovery servers in spontaneous networking : Servers
interfaces of other services in the distributed
discovery services to register their interfaces or to search the |
system.
performance. If in oe cases, replica of data
© Better performance through replicated data : Replication improves
is placed in client machine then changesin data item is multicast to all the processes managing the replicas.
Oo Event Notification : Multicast to group to notify if some event occurs.

2.7.1 Application-Level Multicasting

4 In application-level multicasting, nodes are organized into an overlay network. It is then used to disseminate
information to its members. Group membership is not assigned to network routers. Hence, networklevel routing is the
better solution with compare to routing messageswithin overlay.

In overlay network nodes are organized into tree, hence there exists a unique path between every pair of nodes.
Nodesalso can be organized in mesh network and hence, there are multiple paths between every pair of nodes which
offers robustnessin case offailure of any node.

Scribe : an Application Level Multicasting Scheme

It is built on top of Pastry which is also a DHT (distributed hash table) based peer-to-peer system. In order to start a
multicast session, node generates multicast identifier (mid). It is randomly chosen 160bit key. It then finds SUCC(mid), |

which is node accountable for that key and promotesit to be the root of the multicast tree that will be used to send
]
|
data to interested nodes.
|

|
If node X wants to join tree, it executes operation LOOKUP(mid). This lookup message now with request to join MID
|
(multicast group) gets routed to SUCC(mid) node. While traveling towards the root, this join message passes many ;
}
8

nodes. Suppose this join message reaches to node Y.If Y had seen this join request for mid first time, it will become a
forwarder for that group. Now-node X will become child of Y whereas the latter will carry on to forward
the join
request to the root.

If the next node on the root, say Z is also not yet a forwarder,it will become one and record Y as its child and
persist to
send the join request. Alternatively, if Y (or Z) is already a forward
er for mid, it will also note the earlier sender asits
child (i.e., X or Y, respectively), but it will not be requirementto send the join
request to the root anymo re, as Y(or
Y(o Z)
will already be a memberof the multicast tree.

In this way, multicast tree across the overlay network with two types
. of nodes gets created. ; These nodes
nod are : pure
forwarderthat works as helper and nodesthat
| are forwardersas well, but have Clearly requested to join thet
in the tree.
Overlay Construction
ance is merely based on th . | :
4 tis not easier to build efficient tree of nodes in overlay. Actual perform
e routing
messages through the overlay. There are three metric to measure quality of an application-level mmullticaset tk
st tree.
are link stress, stretch, and tree cost. Link
stress is per link and counts how frequently
a Packet crosses the same link
link.
ST

Ww TechKnowledge
Publication9

Scanned with CamScanner


<7 Distributed Computing (MU-Sem 8-c¢ = omp. Communication
2-23
|
if at logical level althou gh packetis f orw se connections may in
alo ng two dif ferent con nections, the part of tho
same physical fj arded The stretch or Relative
Delay
fact correspond to the s cas e, link stress is greater than 1.
tio in th ink. In thi
Penalty (RDP) measures
the ra y the delay that those tw° nodes
e: delay between two nodesin the overla and _4e
ng the
would experience in th ¬ underlyin § netw mizi
ic, generally related to mini
. ork. Tree cost is a global metr
aggregated link costs.

9.7.2 Gossip-Based Data Dissemination


ff
:
8etri uted system, epidem
Ags there are large numbe r of n odesin large distrib
ic protocols can be used effectively to
_

: ates for a specific data item are initiated at a


| se algorithms, consider all upd
\norde r to und ers tan d t he inci
principle of the
_
e can be avoided
single node. Becauseof the same, write-writ

Information Dissemination Models


by epidemic algorithmsto
ing of infe ctio us diseases . This basic principle is used
s the spr ead
4 Theory of epidemics studie to other node is called as
node sin net wor k. The node having data to spread
on among
disseminate the informati tible node.If the node is
not see n this dat a then it is considered as suscep
now any node has
infected node. If up till nodesis called as remove
d.
it 8s not wil ling to spread its data to other
already updated and e Y to exchange the
e X randomly selects nod
is tha t of ant i-entropy- In this mode, nod
The famous propagati on model ent approaches.
4
es pla ce wit h on e of the following three differ
updates exchanges tak
updates with it. These Y.
s to node
es its own update
8o Node X only push
s from node Y.
ls inn ew update dates to each oth |
er.
9 Node X only pul de s X and Y sends up
h no
o Push-pull approach
in which bot infected node will
ase d app roa ch of spreading updates by
ed then using push-b le node by each
ck spr ead ing of upd ate s Is requir pro bab ili ty of selecting the susceptib
- If qui many in number s the n
it is
ed nodes ar
e ger period as
on. If infect susceptible for a lon
be bad opti ult , a par tic ula r node may remains
less. AS a res
b e relatively
infected node will | to
fected node- e nodes contact
s case, susceptibl
not selected by an in approa.ch Is bett
er op ti on . In thi
the pull-based node swi ll also become infected one.
.

es, ble
.

y infect ed nod tl
4

In case of man
n

dat es. Th e sus cep ,


pull the UP
8fab using either push or pull
orde r t o s wil | spr ead rapidly across all nodes
infected nod e in g of update ed in which every
fected, spreadin app r ach . Consider round as time span involv
~ if only a single nod e 15 als
o be the best ;
h a randomly chosen oth
er node. Then, to
7
. -pull canf | x c h an ge updates wit
n O e em.
r of nodesin the syst
pus h p t
approach. In this cas e
nitiative 9
b e
taken
ca se, N is n u m
once have ired. In thisot
node will at least ) rounds are requ
re
8 oa | nodes 0(!08 r rumo r sp re ad in g. In |
this approach, suppose node
X has just updated

propagate single update


t me, if node Y is already
is gossiping 0 nd try to push the update toit. In the meanti
approach
The variation of abov
e approach
no de ¥ ra ndomly ? d node. Gossiping approach rapidly spreads news. In this
~ data item x then |
t contact any es remove
en it be com
her node th
updated by any ot
still there will be some nodes
Wy TechKnowledge
2S¬ Publleatrons

ans

Scanned with CamScanner


Communication.
a Distributed Computing (MU-Sem 8-Comp.) 2-24
nt of update
no de s th at remain ignora
f
4 In large. number of nodes which takes part in epidemic, suppose fraction S 0
(susceptible) satisfies following equation.
gs = gtktD (es)

For k = 3, sis less than 0.02. Scalability is main advantage of epidemic algorithms.

Removing Data
, | destroys information
8on
to sprea d the deletion of data Item. It is because the deletio
It is hard for epidemic algor ithms receive old copies
ted to deleted data item. In this case if data item from node is8 suppose removed, then node will
rela
of data item and it willbe considered as new one.

: , p record of it. In this way,


item as another update and kee Y
The solution to above problem is to considerdeletion of data
new. The
updated by delete operation and not something
old copies will be treated as versions that have been
recording of deletion is executed by spreading the death certificates.
deleted data items. The solution to this problem is
Each node mayslowly build huge database of death certificates of

a nl
ed after updates propagate to
to create the death certificate with timestamp. The death certificates then can be remov
all the nodes within known finite time which is maximum elapsed propagation time.
is for those nodes to which
A few numbersof nodes (For example, say node X)still maintains this death certificate. This
|
previous death certificate was not reached. If node X has death certificate for data item x and update come to nodeX
for the same data item x.In this case, node X will again spread the death certificate of data item x.

i i a aa el
Applications of Epidemic Protocols

Providing positioning information about nodescan help in constructing specific topologies.


Gossiping can be usedto find out nodes that have a small number of outgoing wide-area links.

Collecting or actually aggregating information.

ReviewQuestions _

Explain ISO OSI model of communication?


9999999 9 9

Write short note on <Message Passing Interface=,

Whatis remote procedure call (RPC)? Write steps in RPC operation.

Whatarethe issuesin implementation of remote procedurecall (RPC)? Explain.


=4

Write short note on <Asynchronous RPC=.


$e atin Ok or

Write short note on :DCE RPC9.

Whatare the goals of DCE RPC?

Explain steps in writing a client server application in DCE RPC.

Explain binding of client to server in DCE RPC.


ape Ltellaasee

~ Whatis remote method invocation (RMI)? Explain distributed object model.


0
°

el

<or TechKnowledge
pean sn 8ds howe Pelecatagel
,
onan?

" Publications 9 ©
%
wats
ae Sans 8 Sa
eee

Scanned with CamScanner


|
<+ puting (MU-_ Sem 8-Comp.)
Distributed Computi 2.05 Communicatior

0.11 Whatis the role of Proxy


and skeleton if RMI?

Q. 12 Whatis the role of client stu


b and server stub in RPC?

Q.13 Write note on <Compile-time Versus


Runtime objects=
Q.14 Write note on <Compile-time Versus Runtime
objects=.

Q.15 Write note on <Persistent and Transient Obje


cts9

Q. 16 Whatis object reference? Explain.

Q.17 Write short note on <Java RMI=

Q. 18 Explain parameter passing in RMI.

a. 19 Explain JAVAdistributed object model.

Q, 20 Define synchronous, asynchronous, transient and persistent communication.

Q.21 Explain with neat diagram different types of communication.

Q. 22 Explain Berkley Sockets.

Q. 23 Explain MPI primitives.

Q. 24 Explain Message-Queuing Model.


age-queuing system.
Q. 25 Explain general architecture of a mess
? Explain.
er in message-queuing system
Q. 26 Whatis role of message brok
System.
Sphere Message-Queuing
Q. 27 Explain in detail IBM's Web
transmission modes?
Q. 28 Whatare the different
Whatis data stream?

Q.29 d continuous media.


What is discrete an
Service (QoS)=
eams and Quality of
Q. 30 Write note on <Str
the stream
rk ing of RS VP protocol to set up
Q.31 Explain wo
in detail.
Q. 32 plain st ream synchronization
Ex
ting-
n-Level Multicas
Q. 33 Explain Applicatio
ication=.
Q.34 e short note on <group commun
Writ n.
ast communicatio
d Data Di ssem ination in multic
se
Q.35 Explain Gossip-Ba

e 000°
ee

Scanned with CamScanner


Synchronization

4 Syllabus

seepiemmaei deen Uiocks, erecta Algorithms, Mutual Exclusion, and Distributed Mutual
Exclusion-
xclusion Algorithm, Requirements of Mutual Exclusion Algorithms, and Performance
Toke, w, en token ae pigontnmns : Lamport Algorithm, Ricart-Agrawala's Algorithm,
Maekawa's Algorithm,
ased Algorithms : Suzuki-Kasami's Broardcast Algorithms, Singhal's Heurastic Algorithm,
Raymond's Tree
based Algorithm, Comparative Performance
Analysis.

Introduction

4 Apar
part from the communicatio
icati n between processes in
in distrib
distri uted system, cooperation and synchronization between
processes Is also important. Cooperation is supported throug
h naming which permits the processes to share resources.
As an example of synchronization, multiple processes in distrib
uted system should agree on some ordering of events
occurred.

4 Processes should also cooperate with each other to access the shared resource
s so that simultaneous access of these
resources will be avoided. As processes run on different machines in distribu
ted system, synchronization is no easy to
implement with compare to uniprocessor or multiprocessor system.

3.1 Clock Synchronization

4 Incentralized system, if process P asks for the time to kernel, it will get it. After sometime if process Q asks f ti
sks for time,
naturally it will get time having value greater than or equal to the value of time of Process P received. In distributed
. istribute
system, agreement on time is important and necessary to achieveit.
4 Practically, different machinesclock .in distributed system differs in their time value. Therefore, clock synchronization is
| , Onization|
required in distributed system. Consider example of UNIX make program. Editor is running on machine A
an d Compiler
chi e B. Large UNIX program is collection of many sourcefiles.
is running on machin ,
} | If few files suppose modified then no
need to recompile all the files in program. Only modified files require recompilation.

After changing source files, programmer starts make program which checks timestamp of creation of so urce and
objectfile. For example xyz.c has timestamp 3126 and xyz.o has timestamp 3125. In this case, since object file xyz.0
is renulFed Idec fit
has greater timestamp, make program confirms that xyz.c has been changed and recompilation
;
n is required.
timestamp 3128 and abc.o has 3129 then no compilatio
aea Adana!ne Binge palin Ss he RAL ANT

Suppose there is no global agreement on time.In this case suppose abc.o has timestamp 3129 and abc.c is modified _4
So eerease

and assigned time 3128.


hs

i :
ee tend
!
ats

7= i) Fe
Tech t

Pudiicatiens - |

Scanned with CamScanner


Ceereeet nectingSERGEia ee
Se
ert eee BTS eet ae ee Ee Pe
< <4
Y Distributed Co Mputing (MU-Sem 8-Comp.) 3.9 Synchronization
a
- This is due to slightly lagging of ma
notcall compiler
and object fi le abc.o w; chine clock on which abc.c is modified. Here make program will
.
will be old version although its source file abc.c is changed. The resulting executable binary of
large size UNIX Pro
pro gram will contain
mix of object files from old and new
sources.

ee ee = ee ee a
3.1.1 Physical Clocks

Ce i ad EET ee A St
a.
~- Every machine has timer which
ich jis machined
; quartzcrystal. This quartz crystal oscillates when kept undertension. The

SSS eo
ate
counte I and holding
} register
1 is
7 associated
*
with crystal. Counter value gets decremented by

rereetes
= j f
one arter completion Oo
one oscillati
ation |
of the crystal. When countervalue gets to O then interruptis generated. Again counter gets reloaded

ee Sere ee
from holding register.
| )
Each interruptis called clock tick and it is possible to program a timer to generate 60 clock ticks per second. Battery

4
+
backed-up CMOS RAM keepstime and date. After booting, with each clock tick interrupt service procedure
adds 1 to

4
current time. Practically it is not true that crystal in. all the machines runs with same frequency. The difference
between twoclocks time value is called clock skew.

=
- Inreal time system, actual clock time is important. So consideration of multiple physical clocks is necessary. How to
synchronize these clocks with real world clocks and how to synchronize clocks with each other are two important
issues that need to be considered.

4 Mechanical clock was invented in 17" century and since then time has been measured astronomically. When sun
_reaches at its highest apparent point in skyis called as transit of sun. Solar day is interval between two consecutive
transits of sun. Each day have 24 hour. 1 hour contain 60 minutes and in each minute contain 60 seconds. Hence, each
day have (24x60x60) total 84600 seconds. Solar second is 1/84600" solar day.

~ {n 1940 it was invented that earth is slowing down and hence period of earth9s rotation is also varying. As a results,
astronomers considered large number of days and taken the average of it before dividing by 84600.This is (1/84600) is
called mean solar second.

In 1948, physicist started measuring time with cesium 133 clock. One mean solar second is equal to time that is

Ret Fe ee eee ee ee eee ee


4

ee
needed to cesium 133 atom to make 9,192631,770 transitions. Now days, 50 laboratories worldwide kept cesium clock

and reports periodically time to BIH in Paris which takes averageofall the time called as International atomic time

7 a
(TAI). Now 84600 TAI seconds is about 3 msecless than mean solar day. BIH solved this problem by using leap second

SL ee we eS a Ac =
44
when difference between TAI and solar time grows 800 msec. After correction of time, it is called universally

=>
coordinated time (UTC).
Peery aco

- National institute of standard time (NIST) run shortwave broadcastradio station that broadcast short pulse at start of
each UTC secon d with accuracy of + 10 msec. In practice, this accuracy is no better than +10msec. Several earth
satellites also provide UTC service.
a ee 4

3.1.2 Global Positioning System (GPS)

~ GPS was launched in 1978. It is a satellite-based distributed system which has been used mainly for military
applications. Now days it is used in many civilian applications such as traffic navigation and in GPS phones.
tage tractot OS ee
aeee
SSS

==

WF TechKnowledge
Pudlications
<a
erica
[Se

Scanned with CamScanner


4 | . Synchronizatic
se
_
me Distributed Computing (MU-Sem 8-Comp.) . 3-3 :
4
y ation
0,000 km.km Every satetellite has 4 atomic
GPS makes use of 29 satellites which travels in an orbit at a height of about 20,000 tomic
clocks which are calibrated regularly from special stations on earth. Each satellite constantly b roadcastits current
rent
position. This message is with timestamp asper its local time.
,
The receiver of this message on earth can accurately calculate ;its actual position
i by using three
sa tellites. Three
Satellites are required to resolve the longitude, latitude, and altitude of a receiver on Earth. Practically it is not true
thatall clocks are absolutely synchronized. Satellite message also takes some time to reach to receiver.
It9 may also happen that, receiver clock is lagging or leading with respect to clock time of satellite. GPS does not |
consider the leap second. Hence, measurement is not perfectly accurate.

3.1.3 Clock Synchronization Algorithms

If one machine is with UTC time then all other machines clocks time should be synchronized with
respect to this UTC
time. Otherwise, all the machines clocks together should be with same time. Consider timer
of each machine.
interrupts X times a second. Assume that value ofthis clock is Y. When UTC time is
t then value of clock at machine p is -
Y,(t). If all the clocks value is UTC time t then Y,(t) = t for all p, which is required. It
means, dY/dt should be 1.
In real world, timer do not interrupt exactly X times a second. If X=0
then timer should generate 216,000 ticks per
hour. Practically, machine gets this value in the range 215,998
to 216,002ticks per hour. For any constant a, if
1-a s dY/dtsl+a then timer is working within its specification.
The constant a is specified by manufacturer of timer and
called as maximum drift rate.

If dY/dt is greater than 1 then clock is faster. If dY/dt


is equal to 1 then clockis perfect clock. If dY/dt is less
than 1 the
clock is slower with respect to UTC time.

- Cristian9s Algorithm

In this algorithm, it is assumed that one machine (tim


e server) has WWV receiver. Hence, this machine clock
value is
UTC time. All other machines(client) synchronize their
clock with this UTC time. Let Cure is the time replied
by time 4
server. |
Ts ; | T
Client 4| : l

Request . Cute

Time Server 4

ty | _ Time
Fig. 3.1.1 |
Whenclient receives response from time server at time
Tr, it sets its Clockat time C This al orithm - wid9 |
- problems, one is major and other is mino . UTC: por 7
r problem. Major problem occur whenclien
ts clock is fast. The received Cyr
will be less than client9s current value C,

wW TechKnewledgé s a
Pubrications ..-

\ ae
4 h Va

Scanned with CamScanner


y Distributed Computin 0
_a_
444
(MU-Sem 8-Comp.)
~ In UNIX NIX
program example explained
Previously if on client machine object file is compiled jus t after clock change will
have timestamp less
than the source which
The soluti was modified just before clock change.
io e solution to this ;
Problem is to introduce the change slowly.If each interrupt of timer adds 5 msecto the time then
add 4 msecto time il ti rein
4 Until time gets corrected. Time can be advanced by adding 6 msec for each interrupt Instead of
forwardingall at once.

- As aminor
P problem m, Ht; takes
. ;
some time to reach the reply from time server to client. This propagation time Is
r 4 Ts)/2. Add this ti
iu M this time to the Cure time which is replied by server to set the client9s clock time. Let time elapsed
between t, and t) is interrupt processing time (I) at serve r. Then message propagation time is (Tr Ts 4
|). One way
propagation time will be the h
alf of this time. In this algorithm,time server is passive as it replies as per query of client.
The Berkeley Algorithm

- In Berkeley UNIX, time daemon(server) is active as it polls every machine asking current time there, It then calculates
average time of received answers fromall machines. It then tells other machines either advance their clocks to new
time or slow their clocks until specified timeis set.

- This algorithrn is used in system where WWYVreceiver is not there for UTC time. The daemon time is periodically set
44-

manually by operator.
t 400 4:05

Time Server }44 Time Seorvor


4:00 4:00 ~15
/ 10 420 \ +15
| a | | Notwork i |

Chent 1 Ciont 2 Ciiont I Client 2

450 420 4.05 4:05

(a) (b)
Fig. 3.1.2

As shownin Fig. 3.1.2(a), initially time daemon has time value of 4:00, Client 1 has clock time value 3:50 and client 2
has clock time value of 4:20. Time server asks to other machines about their current time including it. Client 1 replies
that, itis lagging by 10 with respect ¢ o server time which is 4:00. Similarly, client 2 replies that, it is leading by 20
ch is 4:00. Server will reply as 0 toitself.
with respect to server time whi
- Meanwhile server time increases to 4:05. Therefore server sends client 1 and 2 messages about how to adjust their

clocks. It tells client 1 to 4 dvance clock time by 15. Similarly,oeit tells to client 2 to decrement the time value by 15.
ts own Clock time. It is shownin Fig. 3.1.2(b).
5 time
Server itself tells to add

Averaging Algorithms
~ Cristian algorithm and Berkeley algorithm both are centralized algorithms. Centralized algorithms have certain

disadvantage s. Averaging algorith m is decentralized algorithm. One ofthese algorithms works by dividing timein fixed
eed upon momentin past. The k" interval start at time TO +kS and runs until
TO is any agr
length intervals. Let
param eter.
TO + (k+1)S. Here Sis a system
Bo
Tech e
Publications

Scanned with CamScanner


Distributed Computing (MU-Sem 8-Comp.) 3-5 : chronization

4 At the start of each interval, every machine broadcastits clock time. This broadcast will happen at different time at
different clocks as machines clock speed can be different. After machine finish the broadcast, it starts local timer in
order to collect broadcast from other machines in some time interval |. Whenall the broadcasts are collected at every
machine,following algorithms are used. = _
O Averagethe time values collected from all machines.

Oo Otherversion is to discard n highest and n lowest time values and take the average of rest.

o Another variation is to add propagation time of message from source machine in received time. Propagation
time can be calculated using known topology of the network.

Clock Synchronization in Wireless Networks

4 In wireless network, algorithm needs to be optimized considering energy consumption of nodes. It is not easy to
deploy time server just like traditional distributed system. So, design of clock synchronization algorithms requires
different insights.

4 Reference broadcast synchronization (RBS) is a clock synchronization protocol. It focuses on internal synchronization
of clocks just like the Berkeley algorithm. It does not consider synchronization with respect to UTC time. In this
protocol, only receiver synchronizes while keeping sender out of loop.

4 In sensor network, when message leaves the network interface of sender, it takes constant time to reach destination. 4
Here multi-hop routing is not assumed. Propagation time of the message is measured from the point messageleaves
the network interface of the sender. RBS considers only delivery time at receivers. Sender broadcasts reference
message say m. Whenreceiverx receives this message, it records time of receiving of m which is say Tym. This timeis
recorded with its local clock.

4 Two nodesx and y exchanges their receiving time of m to calculate the offset. M is total number of sent reference
messages.
M
> (1x, k-Ty,k)
k=1
Offset[x, y] = M

4 As clocks drift apart from each other, calculating average only will not work.
Standard linear regression is used to
calculate offset function.

Network Time Protocol (NTP)

- Inccristian algorithm when time server send reply message with UTC time, the problem
is to find actual propagation
time. The propagation time ofreply will definitely affect the reporting of actual
UTC time
4 Whenreply will reach at client, message delay will have outdated reported time. The good estimation
for this delay
can be calculated as shownin followingFig.3.1.3.

a ae

tect caries) |

Sona
e + eae
. 8 *
eee
i vonnae
Fa

Scanned with CamScanner


pistributed Computing (MU-Sem -
Synchronization
3-6
ol

7 =P)

vy _
h
Fig. 3.1.3
, .
~ Asends message to B oe T, byits local clock. B then send reply
containing timestamp T,. B also recordsits receiving time
raining tl 6 .
tap:
containing B times tamp T3 along with its recorded timestamp Tp. Finally, A-records time T, at which responsearrives. Let
|
T,- T1 = Ta4 Ts. Estimation ofits offset by A is given as below.
_ (T, -T,) + (T, - T) (T, = T,) + (T, -T,)
5 _ -
0 = T;

- The value of the 0 will be less than zeroif A9s clock is fast. In this case A has to set its clock to backward time.
than the
Practically it will produce problem like objectfile is compiled just after clock change will have timestamp less
introduce the change slowly.If
source which was modified just before clock change. The solution to this problem is to
time until time gets corrected. Time can be
8each interrupt of timer adds 5 msec to the time then add 4 msecto
advanced by adding 6 msec for each interrupt instead of forwarding all at once.
current
n servers. In other word, B will also query to A for its
- Incase of NTP, this protocol is set up pair-wise betwee
tion of p for delay is carried out as
time. Along with calculation of offset as above, estima
(T, -T,) + (Ty, - Ty)
hoo= 4

for p is the best estimation for the delay between the two
. = Eight pairs of (0, u) values are buffered. Smallest value
t. Some clocks are more
servers. S ubsequently the associated value
of @ is the most reliable estimation of offse
are divided in strata.
accurate, Sa y B9s clock. Servers
than that of B. After
its time provided its own stratum level is higher
l onl y correct
~ If A contacts B, it wil Because of the symmetry of NTP, if A's
str atum | evel will become one higher than that of B.
synchronization, A'S server is with WWYV receiver or with
um-|
cesium
that of B, B will adjust itself to A. Strat
stratum level was lower than f errors and security attacks.
Many features of NT P are related to 8dentification and maskingo
clock,

Logical Clocks =
3.2
e with UTC time orreal
although this time does not agre
| machines should agree on same time ncy of
~ In many applications, d on internal consiste
I C onsistency of clocks is essential. Many algorithm otsworks base
time. In this case, inte rnal i g al clocks.
id to be logic
clocks are said
I time. For these algorithms,
clocks and not with rea synchronization will not cause any
cting then lack of t heir clocks
cesses are not intera y what
Lamport sho wed that, if two pro should agree on the orderin which events occ
ur and not on exactl
th at, proce sses
so pointedout
Problem. He al
- timeitis. 4
UE TechKnowledge
S i .
Publleations

: .

Scanned with CamScanner


'® Distributed Computing (MU-Sem 8-Comp.) 3-7 Synchronization
= 44444444 444SS4amw4vew_nnwei=ewF
444

3.2.1 Lamport9s Logical Clocks


.
For synchronization of logical clocks, Lamport defined happen-before relation. The expression x > y is read a|
<x happens before y=. This means all processes agree that first event x occur and after x, event y occur. Following twa |
Situations indicates the happen before relation. |
)
© Ifx andy are the events in same process and eventx occur before y then x > y is true.
|
oO If x is event of sending the message by one process and y is the event of receiving same message by other |
Process then x 4 is true. Practically, message takes nonzero time to deliver to other process.

4 Ifx >yand y >2z then x > z. It means happen before relation is transitive. If any two processes does not exchange |
messagesdirectly or indirectly, then x > y and y > x where x and y are the events that occur in these processes.In this
case, events are said to be concurrent.
' |
4 Let T(x) be the time value of event x. If x 3 y then T(x) < T(y). If x and y are the events in same process and eventx |
occur before y then x > y then T(x) < T(y). If x is event of sending the message by one process and is the event of |
receiving same message by other process then x > y then all the processes should agree on the values T(x) and
T(y) )
with T(x) < T(y). Clock time T assumed to go forward and should not be decreasing. Time value can be corrected
by |
adding positive value.
|
4 In Fig. 3.2.1 (a), three processes A, B and C are shown. These processes are running on different
machines. Each clock
runs with its own speed. Clock has ticked 5 times in process A, 7 times in process B and 9 times in processC.
This rate ls _
different due to different crystal in timer.
4
0 0 0 0 0 0
|
10 14 18 10 14 18
|
15 21 q 27 15 21 27 |
~, q |
20 28 36 20 28 36 |
25. 35 45 25 35 AB?
39. 42 ya 54 30 42. PE

35 49 FT 63 35 cae ea

45. 63 | g- 81: 63 69. Bie |


(900 ge seal (o8e fe 90
(a) (b) °
Fig. 3.2.1 3

ee _ 2

wledgt
We ee : ,
hy
were rt, 20

ae
Bete
aii

Scanned with CamScanner


<Lomp.) Synchronization
3-8

at time 5, process A sends mess a, that, messaBbe


MESSAaBe p to process B and it is received at time 14. Process B will conclude
time. Same Is true about message
-

has taken 9 ticks to travel fr Om process A


to process B, if message contains sending
q from processB to process C
ssage at
process B sends message S at t tIme
i it iis received by process A at time 45. Similarly Process C sends me
56 and it
hows sending
time 54 and itis received b Y proc ess A at time 49. This should not happen and it should be prevented.It s
timeis larger than receiving time

proces 5 Bat time 55 OF


ti
Message © from process C le avesat at time 54, As per happen before relation,it should reach at
clock
, eiver fast forwards the
e is tru e for me ss a pro ces s A at time 63 or later. Rec
later. Sam § es w hich sho uld reach to .
sen din g ti . sen der alw ays sends sen din gtim e in mes sag e. This is shown In
by one more than th e g time. That is why the ive message
.1 (b). |in sequence, then clock must tick at least once. If process sends an drece
ig.
FI 3.2 (b). If two even ts oc cur
and receiv e events.
nd
1 between these two se
in quick succession, then clock should be advancedatleas t by
point. If such
then it should be separa
ted by decimal
ses,
if two events occur at the same time in different proces
and later a t 20.2.
mer should be considered at 20.1
events occur at time 20 then for
< T(y).
O fx > y in the same process then T(x)
the message the n T(x) < T(y)-
event of sending the message y is the event of receiving
o -Ifxis
x and y, T(x) # T(y).
o -Forall distinguishing events
with above algorithm.
ng ofall the even ts can be carried out
Total orderi
ng
- Total Ord er Multicasti
Lam port Timestamp
32.2 Application of different sites. For
im po rta nt to ac hieve consistency at
ng processes is ve
ry
counts balance
een commu nicati example, customer ac
Ordering of events betw ba nk da ta base. For
A and B maintains s 1% interest on
servers in different ci ties me time branch managergive
example, t wo ty A and
Rs 100 in ci
at th e sa

10 00. If customer depo sited


initially is Rs
these
city B. ent servers. Lets us say
account balance in se th at is pr esent on two differ di ng 1%
of databa ad
l at ed to same copy d ev en t B (updating account balance by
re an
These are two
events a dding Rs 100) uld
balance by in account balances wo
(u pd at in g ac co u nt
e da ta ba se 5 then inconsistency
events as event A erent order at
th es
en ts ex ec utes r n diff
interest ). If these ev will be
case account balance in city A
occur. In this
event B is executed. first and then
first and then suppose interest = Rs 1111. In city B, if event B is executed
t Ai S executed 11 and then
In city A,if even
1100 an d then adding Rs in city B wil l be up dated as: 1000+10 = 1010
= nce
updated as : 10
00+1 00
s case, ac
cou nt bala
. I n t hi
t A is execu ted
suppose even ication delay in
lained above due to co mmun
RS 1110. d B in ord er exp
adding Rs 100 = both sites A an ring of messages can be
done as per
same time
then orde
rk show
These Event A hines in 9 etwo
it is not possible.
all
network. If clocks of
was 5 en
t. practically, sites in same order
for execution.
ssages at different
timestamp wh en mes
saBb& arrange the me er in same order.
mess ages ar e delivered to receiv
in whichall
opos
Lamport pr is
multicast
der
Totally or <= TechKnowledge
Publlcations

<7,

~*~ -
<se

Scanned with CamScanner


$7 _Distributed Computing (MU-Sem 8-Comp.) _ 3-9 | Synchronization
4 Suppose group of processes multicast messages to each other. Sender of the message also receives its own sent

message. Each messagecarriesits sending time. Assumeall the processes are receiving messages in the order they
were sent and three is no loss of messages.

4 Each process sends acknowledgement (ACK) for the.received message. Hence as per Lamport algorithm,timestampof
the ACK will be greater than the timestamp of received message. Here Lamport9s clock ensures that, no any tWo

messages will have same timestamp. Timestampsalso gives global ordering of events.

3.2.3 Vector Clocks

4 In Lamport9s clock,if x > y then T(x) < T(y). But, it does nottell about relationship between event x and y. Lamport
clocks do not capture causality.
. ; vet atts
4 Suppose user and hence processes join discussion group where processes post articles and post reaction to these
posted articles. Posting of articles and reactions are multicast to all the members of the group.

4 Considerthe total order multicasting scheme and reactions should be delivered after their corresponding postings. The
receipt of article should be causally preceded by posting of reaction. Within group, receipt of reaction to an article
must alwaysfollow the receipt of that article. For independent articles or reactions, order of delivery does not matter.

4 The causal relationship between messages is captured through vector clocks. Event x is said to be causally precede
event y, if VT(x) < VT(y). Here, VT(x) is a vector clock assigned to event x and VT(y) is vector clock assignedto eventy.

4 It is assumed that, each process P; maintains a vector V; having following properties.

O VT,[i] maintains number of events that have taken place at process P.,.
O If VT,[j] = k then P; is aware about k numberof events occurred at processP;.

4 Each time event occurs at processPj, VTj\[i] gets incremented by 1 and it ensures tosatisfy first property. Whenever
process P; sends the message m to other process, it sends its current vector with timestamp v, along with it
(piggybacking), which ensures to satisfy second property said above.

4 Because ofthis, receiver knows the number of events occurred at other processes before receipt of message m from
process P; and on which message m may causally depends.

4 Whenprocess P; receives message m, it modifies each entry in its table to maximum of the VTj[k] and vt[k]. The vector
at P; now showsthe numberof messagesit should receive to have at least seen the same messagesthat precededthe
sending of message m. After this every Vij[j] is incremented by1.

3.3. Election Algorithms

4 Tofulfill the need of processing in distributed system, manydistributed algorithms are designed in which one process .
acts as coordinator to play some special role. This can be any process. If this process fails, then some approach is.Is
required to assignthis role to other process.In this case, election is required to elect the coordinator.

- In group ofprocesses,if all processes are same then the only waytoassign this responsibility is on the basis of some
criterion. This criterion can be the identifier which is some numberassigned to process. For example, it could be.:
network address of machine on which processis running. This assumption considerss that only one process is running.

on the machine. aes

Se Tech
vaanavstelsta
renrieatTee <

-a
He. 4
44

Scanned with CamScanner


é ee

4. 2 ay. |
-
i 8wy Distributed Computing (MU-Sem
P:) 3-10 4 Synchronization
| Oe 4
Election algorithm locat ¬s process with hi is aware
:
proce are : ighest number in order to elect it as a coordinator. Every process
-gbo ut the
fe . Process number of other processes B But, process es do not know about which processes are currently up or
-
which ones are current ly crash ed. Elect . agree on newly elected
|
coordinator. - ction algorithm ensures that, all processes will

43.1 Bully Algorithm


|
} |
In this algorithm,if any process f .
onding to its request, then it initiate
s the election.
inds that the coordinator is not resp
Suppose, processP initi
, nitlates the election. Election is carried out as follows : |
o Process P sends i n it.
, Election messagetoall the process having higher numbertha
.
: to Election
O no one replies message of P then it wins the election.
If anyone hi done.
O yone higher number process responds thenit takes over. Now P9s workis
.
4 | Any yP process may y recei lecti , ed colleague. Receiver replies with OK message to
receive Election message from its lower-number .
the further job.If
der of Electiion message.This reply
sender of Elect
|
conveys the messagethat,receiver is alive and will take over
receiver is currently not holding an election, it starts election process.
process. This process wins the election and send
Eventually all processes will quit except the highest number
is now new coor dinator.
Coordinator message to all processes to convey the message that he
ator 17 has just
is coordinator. Process 14 notices that, coordin
In Fig. 3.3.1, initially process 17 (higher-numbered)
sending Election
ss 14. Process 14 nowinitiates e lection by
crashed as it has not given respons e to request of proce
h are higher-numberedprocesses.
message to process 15, 16 and 17 whic

Election


Fig. 3.3.1
hed and hence does not
16 res pon ds with OK message. Process 17 is already cras
g. 3-3-2, Proce s5 15 and
te As shownin Fi
send repl y wi th OK me ssage. Now,
job of process 14 is over.
<"

+.

Ww TechKnowledge
Publications

Scanned with CamScanner


¥ - Distributed Computing (MU-Sem 8-Comp.) 3-11
E
| Synchronization
s ie

4 Now process 15 and 16 each hold an election by sending Election message to its higher-numbered processes.It is.
shownin following Fig. 3.3.3 (a). As process 17is crashed, only process 16 replies to 15 with OK message. This is shown
in Fig. 3.3.3 (b).

Election ah

© ©
(a) (b)
Fig. 3.3.3
Finally, process 16 wins the election and send coordinator message to all the processes to inform that,
it is now new
coordinator as shownin Fig. 3.3.4. In this algorithm, if two processes detect simultaneously that the coordinatoris
crashed then both will initiate the election. Every higher-number processwill receive two Election messages.
Process
will ignore the second Election message and election will Carry Onas usual. .
\ Coordinator

Coordinator >(10)

Fig. 3.3.4

3.3.2 Ring Algorithm

4 This ring algorithm does not use token and Processes are physically or logicall
y ordered in ring. Each processhasits
successor. Every process knows about who is its successor. When any processnotices
that coordinator is crashed,it
builds the Election message and sendsto its successor. This message contains the process numberof sending
process.
If successor is down then process sends message to the next process along the ring. Process
locates next running
process alongthe ringif it finds many crashed processesin sequence. In this manner, the receiver of Election message
also forwards the same messagetoits successor by appending its own number in message,
A
nN
mh
:

In this way, eventually message returns back to the process that had started the election initially. This
ty
>

incoming
otst
.

message contains its own process number along with process numbersof all the Processes that had received this -
bled hacenalgMae eeILSaga
Diya : + ;

message.
atk os
pirat ; Z
ead
t' lito :

ee ita
se

Ww TechKnowledgé a
Publications 99.
Be a
pie Pe

Scanned with CamScanner


|
Distributed Computing (1 U-Sem ac |
3-12 Synchronization
mp.)
- At this | point, this messa Be gets co
8verted to coordinator message. Once again this coordinator messageis circulated
along the ring to inform the
Processes about new coordinator and members of the ring. Off course, the highest
process numberis chosen from m the th |i
¬ list for new coordinator by process which hasstarted the election.

[5,6,0,1]

[5,6,0,1,2]

(5,6,0,1,2,3,4] [5,6,0,1,2,3]
Fig. 3.3.5
- In Fig. 3.3.5, process 5 notices crash of coordinator which was initially process 7. It then sends Election message to
to
process 6. This message contains numberofthe process 5. As process 7 is crashed, process 6 forward this message
received byall the
its new successor process 0 and append its numberin the same list. In this way message Is
processesin ring.
Highest number in thislist is 6. So message is
Eventually, message arrives at process 5 which had initiated the election.
sthat, process 6 is now new coordinator.
again circulated in previous manner to inform all the processe

3.3.3 Elections in Wireless Networks


of coordinator
through wireless networks.In this system, election
- Distributed system mayalso have nodes connected
build network phase followed byelection. Let the wireless ad hoc
network includes nine
is carried out with first step as
Xand ¥ each having capacity in the range Oto 8.
nodes P, Q, R, S, T, U, V, W, m two
move. Let p is the source node whichinitiates the election. There are minimu of
{tis assumed that node cannot l
es within the circle show capacity of respective node. Tota
-
eac h nod e as shown in Fig. 3.3.6. Valu
nodes connected t o
minimum networktraffic. Let P be the source node whichinitiates ~
ecti ons ar e kept less to have
numbers of interconn
election.

Source

¥ iechKnemledge .
q a

Scanned with CamScanner


\
\

Ww Distributed Computing (MU-Sem 8-Comp.) 3-13 Synchronization


a 44444 4 eee

Let any node (consider, source as node A) initiates election and send Election message to its immediate neighbors
which are in its range. When any node receives this Election messagefirst time, it chooses sender of message asits
parent. The receivers then send out this received Election message to its immediate neighbors which arein its range
except to its chosen parent. When node receives Election message from other node except parent, it sends its
acknowledgement.

The build tree phase is shownin Fig. 3.3.7. Node P broadcast Election message to Q and Y. Node Q broadcastElection
Q
message to R and V.In this scenario, Y is slow in sending message to V. As a result, V receives message fast from
compared to from Y. Node V broadcast message to T and W and node R broadcast to $ and T. After this, node T
receives the broadcast message from V before R. Node W forward the message to T and X and further, T forward
message to node U which gets the message faster from T and S. This process continues till traversing of all the nodes is
finished.

Fig. 3.3.7

In this method of forwarding the message, each process records the sender node of message and the node to which
message is forwarded. Receiving node of the message prepares message in order to reply in the form (node, capacity]
and send it to the node from which message was received. In following Fig. 3.3.8, reporting of best node to sourceis
shown. -
The first reply node is generated from the last node and same return
path is traversed. The receiver of replied
message selects the message of highest capacityif it receives multiple messages. This message it then
forwards to next
node along the return path. |

As shownin Fig. 3.3.8, The reply message [U,4] would be initiated from U to T and [X,5] from X to W Node T forward
the messageas it is [U,4] to V. During the same time, W receive
s the message [X,5] and as its Capacity greater than
that of X, it replaces the message to [W,8] and forwardit to V.

As stated previously, node Vselects the higher capacity message [W,8] and forward it to node Q. In the meantimé.$
initiate the reply message [S,2] to nodeR which comparesits own capacity and forward the new message [R,3] tonode
[W,8] from V. The best node value [W,8] ix'sent t2 Be |
Q. The node Q now receives two messages. One is [R,3] from R and
source node P. AT the same time, P also receives message[Y,4] from Y. It then selects the higher value out of these aS _ i
[W,8] and reports W as a best node.Finally W is elected as new coordinator.

Tech

i aS hee lll ee

Scanned with CamScanner


og Distributed Computin
2 MU-Sem 8-Comp)
5 Wit
Vas a
87 2
8 3-14 Synchronization

n
4|

Fig. 3.3.8

3.4 Mutual Exclusion


| .
In environment where multiple processes are running, processes shares resources. When process has to read or
mutual exclusion.
update shared data structures, it entersin critical section (CS) in order to achieve
to use shared data structure when a process is already
- This ensures that, no other process entersthecritical in order
using it.

3.4.1 Distributed Mutual Exclusion


ibuted
to achieve mutual exclusion. In distr
e, monitor and other tools are used
- In uniprocessor system, semaphor nes, different
are distributed on many machi
different machines and also resources
system, since processes runs of Hence,
ed memory may does not exist.
ired to achieve mutu al exclu sion. In distributed system shar
solutions are requ m.
mutual exclusion in distributed syste
ch s bas ed on mes sag e passing are used to achieve
algorithm shared memory is not
omes much more complex as
ty] tem , the pro blem © f mutual exclusion bec
- In distributed sys message delays can be ale: unpre
dictable. .
m of lac k of global physical clock. The
labl d th be proble
available and there can ledge of the overall state of the syst
em.
h t the c omplete know
em cannot have
Therefore, node in the sys
n Algorithms
Clas sification of Mutual Exclusio
3.4. 2 orithms differ as per
distributed system. These alg
d achieve mutual exclusion in e algorithms are
lo pe maintains about other nodes. Thes
to
e deve
gorithms ar n which each node
~. Many al oun t of informatio
urd and am
topology of net work TL ages between
consecutive exchanges of mess
- ithms require two oF more
an
a

classified as follows * hese algor enter in critical section on any


n, as algorithm permits process to
~ Non-token based Al on dec lar ati on,
variable becomes true, process
mits it. As declaration of local
nodes. These algorithm O of algorithms due to
of loca | variable tual exclusion is enforcedin this class
»de declarat ion
node as node. A
tan
on that
enters in critical sect jon
t to
\f process
. lared. nodesin distributed system.
- true value oflocal variable
dec 5 unique tokens shared among
s use
se algorith# in critical section assongiaas it holds the token.
a : Token based Algorith we TechKnowledge
er
on any node receives tok
SP publications

Scanned with CamScanner


©
Distributed Computing (MU-Sem 8-Comp.) Synchronization
e
e -3-15 e

~ Whenprocessfinish executionin critical section. These classes of algorithmsdiffer on th e basis of the way in which
nodes/processes searches the tokens.

3.4.3, Requirements of Mutual Exclusion Algorithms

| process should be oin


Algorithm should guarantee that only one n critical
criti section to; read or upd ate shared data. Mutual
exclusion algorithms should have following features.

Algorithms should not involve in deadlocks : More than two nodesin


8 distributed system should no t wait endlessly for
otter
the messages whichwill never be received.

,
Avoidance of Starvation :
: If some nodes are repetitively ii section and some any other node is
executing critical
. % ys . ;
indefinitely waiting for the execution of critical section, such situation should be avoided.
ae 8 In finite
inite time, eve ry
requesting node should get opportunity to executein critical section (CS).

Fairness : As there is absence of physical global clock, either requests at any node should be executed in the order of
arrival or in the order these wereissued. This is due to consideration of logical clocks.

4 Fault Tolerance : In case of anyfailure in the course of carrying the work, if any failure occurs,
then mutual exclusion
algorithms should reorganize itself to carry out the further design
ated job.
3.4.4 Performance Measure of Mutual Exclusion Algorithms

Following four metrics measure the performance of mutual


exclusion algorithms in distributed system.
1. The number of required messages for invocation ofcriti
cal section in first attempt.
2. The delay incurred between the moments when one node leaves the CS and before next node enter
s the CS.This
time is also called as synchronization delay. In this case,
more messages exchan ge may require duri
ng this delay
duration.
_ 3. First node9s CS request arrives. Then its request message Is
sent out to enter in CS. Response time is
between the request message sentout to enter in CS and the time interval
when node exit the CS. It j s shown in Fig.
3.4.1.
4. The rate at which system executes requestthatarrivesfor cri
tical section (CS) whichis called as sys
Reques arr
tem throughput.
t ives Request sent out Node enters in CS Node exit the CS

ty
(*4444_ Execution timei
n CS444__»
Time

+ Response Time ea

Fig. 3.4.1
3.4.5  Performance in Low and High Load Conditions

In the low load condition of the system, there is rarely more than one request for the critical section at the same time in the system. In the high load condition, there is often a pending request for mutual exclusion at a node (site), so a node rarely remains idle in high load. In the best case, the performance metrics achieve their best possible values.

3.5  Non-Token Based Algorithms

-  In non-token based mutual exclusion algorithms, a particular node communicates with a set of nodes to decide who should enter the critical section next. Requests to enter the CS are ordered on the basis of timestamps.

-  Simultaneous requests are also handled by using the timestamp of the request. Lamport's logical clock is used, and a request for the CS having a smaller timestamp value gets priority over requests having larger timestamp values.

3.5.1  Lamport's Algorithm

Request for CS

-  In this algorithm, for all i : 1 <= i <= N, the request set is Ri = {S1, S2, ..., SN}; that is, there are N sites. A single process runs on each site; site Si represents process Pi. A request_queue_i is maintained by every site Si to store mutual exclusion requests in the order of their timestamps. It is necessary to deliver messages in FIFO order between each pair of sites.

-  Whenever site Si wants to enter the critical section (CS), it sends a message REQUEST(tsi, i) to all the sites in its request set Ri and puts the message in its request_queue_i. The timestamp of the request is (tsi, i).

-  Site Sj sends a timestamped REPLY message to Si after receiving the REQUEST(tsi, i) message from Si. Site Sj also puts this request of Si in its request_queue_j.

Executing the CS

Site Si can enter the CS if the following two conditions are satisfied.

-  C1 : Site Si has received a message with timestamp larger than (tsi, i) from all the other sites.
-  C2 : Si's request is at the top of request_queue_i.

Exiting the CS

-  When Si exits the CS, it removes its request from the top of request_queue_i. Si then sends a timestamped RELEASE message to all the sites in its request set.

-  After receiving the RELEASE message from Si, site Sj removes Si's request from its request_queue_j.

Correctness

-  Lamport's algorithm ensures that only one process will be executing in the CS at any time. Assume both Si and Sj are executing in the CS concurrently; then C1 and C2 must hold at both sites, i.e., both have their own requests at the top of their request queues.

-  Without loss of generality, assume T(Si) < T(Sj). By C1 and the FIFO delivery of channels, Sj must have received Si's request before its own request reached the top of its queue. If Sj is executing, then its own request, not Si's request, is at the top of request_queue_j. That is impossible, as T(Si) < T(Sj) and requests are ordered by timestamp. Hence the algorithm achieves mutual exclusion.
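To make the three message handlers concrete, the following is a minimal single-file sketch of one site's logic. The class name Site, the message tuples and the assumption that a driver delivers the returned (destination, message) pairs over reliable FIFO channels are illustrative choices, not notation from this text.

```python
import heapq

class Site:
    """One site in Lamport's algorithm. Methods return messages to send
    as (destination, message) pairs over assumed FIFO channels."""
    def __init__(self, sid, all_sids):
        self.sid, self.clock, self.queue = sid, 0, []
        self.others = [s for s in all_sids if s != sid]
        self.latest = {s: (0, s) for s in self.others}  # last ts seen per site

    def request_cs(self):
        self.clock += 1
        self.my_ts = (self.clock, self.sid)
        heapq.heappush(self.queue, self.my_ts)          # enqueue own request
        return [(s, ('REQUEST', self.my_ts)) for s in self.others]

    def on_message(self, sender, msg):
        kind, ts = msg
        self.clock = max(self.clock, ts[0]) + 1         # Lamport clock update
        if kind == 'REQUEST':
            heapq.heappush(self.queue, ts)              # record S_j's request
            self.latest[sender] = ts
            return [(sender, ('REPLY', (self.clock, self.sid)))]
        if kind == 'REPLY':
            self.latest[sender] = ts
        if kind == 'RELEASE':
            self.queue.remove(ts); heapq.heapify(self.queue)
        return []

    def can_enter(self):
        # C1: a message with larger timestamp received from every other site.
        # C2: own request at the head of the local request queue.
        return (all(t > self.my_ts for t in self.latest.values())
                and bool(self.queue) and self.queue[0] == self.my_ts)

    def release_cs(self):
        self.queue.remove(self.my_ts); heapq.heapify(self.queue)
        self.clock += 1
        return [(s, ('RELEASE', self.my_ts)) for s in self.others]
```

A driver would deliver each returned pair in FIFO order and call can_enter() after every delivery.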

3.5.2  Ricart-Agrawala's Algorithm

-  Ricart-Agrawala's algorithm is an optimization of Lamport's algorithm. The working of the algorithm is as follows.

-  When a process wants to enter the CS, it sends a REQUEST message to all the other processes. This REQUEST message contains the name of the CS in which it (the sender process) wants to enter, its process number and the current time. This message is also conceptually delivered to the sender. Each message is acknowledged by the receiver.

-  The receiver of the message takes action as per its current state with respect to the CS named in the received message. The following three cases occur.

1.  If the receiver is currently not in the CS and also does not want to enter it, it replies with an OK message to the sender.
2.  If the receiver is already executing in the CS, it does not reply. It puts the message in its queue.
3.  If the receiver wants to enter the CS but has not yet done so, it compares the timestamp in the incoming request message with the timestamp in the request message which it has already sent to the other processes. The lowest one wins. The receiver sends an OK message to the sender if the incoming request message has the lower timestamp value. If its own request message has the lower timestamp value, it puts the incoming request message in its queue and does not send any reply.

-  When the process requesting the CS receives OK replies from all the other processes, it enters the CS. If a process is already in the CS, it replies with OK messages to the other requesting processes after coming out of the CS.

Fig. 3.5.1 : (a) Processes A and B request the CS simultaneously; (b) A wins the timestamp comparison; (c) A exits the CS and sends OK to B
-  As shown in Fig. 3.5.1 (a), processes A and B both want to enter the CS simultaneously. Processes A and B both send REQUEST messages to everyone, with timestamps 7 and 9 respectively.

-  As shown in Fig. 3.5.1 (b), as processes C and D are not interested in entering the CS, both send OK messages as replies to the sender processes A and B. Processes A and B both see the conflict and compare timestamps. In this case process A wins, as its message timestamp 7 is lower than the timestamp of the message sent by process B, which is 9.

-  Process A enters the CS and keeps the request of process B in its queue. When process A exits the critical section, it sends an OK message to process B so that process B can enter the CS later. This is shown in Fig. 3.5.1 (c). Hence, the algorithm ensures mutual exclusion in a distributed system, as the process with the lower timestamp is allowed to enter the CS first in a conflict situation.


-  If we consider n processes in the system, then a process sends (n-1) REQUEST messages to enter the CS. It receives (n-1) OK messages from the other processes to get permission to enter the CS. Hence, 2(n-1) messages are required per CS entry. There is no single point of failure in this algorithm; however, if a reply or request message is lost, the sender gets blocked due to timeouts. The sender of a request message may then conclude that the receiver is dead, or it may keep resending the request. The sender has to wait for the OK message from the destination.

-  For a large group of processes, each process must maintain the list of group members. Group members may leave the group, new members may join the group, or a group member may crash. Hence, the algorithm produces its best performance and success for small groups.

-  This algorithm is complex, slower, more expensive and less robust than a centralized one.
3.5.3  Centralized Algorithm

-  In this algorithm, a central coordinator process is responsible for giving permission to the other processes to enter the critical section (CS). The coordinator process keeps track of which CS is busy or currently in execution. A process which wants to enter a CS sends a REQUEST message to the coordinator stating the name of the CS.

-  If the CS is free, the coordinator sends an OK message to the requesting process to grant it permission to enter the CS stated in the message. If some other process is already executing in the stated CS, the coordinator puts the request message in its queue.

-  When the process currently executing in the CS exits, it sends a RELEASE message to the coordinator process. As a result, the coordinator knows that the CS is now free, and it takes a request from the queue to grant permission to enter the CS. If the requesting process is still blocked, it unblocks and enters the CS.

Fig. 3.5.2 : Centralized mutual exclusion. (a) Process Q requests and is granted the CS; (b) process R's request is queued; (c) Q releases the CS and R is granted permission

-  As shown in Fig. 3.5.2 (a), process C is the coordinator and currently no process is executing in the CS. Process Q sends a REQUEST message to the coordinator, and the coordinator permits process Q to enter the CS by replying with an OK message.

-  As shown in Fig. 3.5.2 (b), process R also sends a REQUEST message to the coordinator to get permission to enter the same CS. Since process Q is already in the CS, the coordinator puts R's request in its queue and sends no reply.

-  As shown in Fig. 3.5.2 (c), when process Q exits the CS, it sends a RELEASE message to the coordinator. Now the coordinator knows that the CS is free; it takes R's request from the queue and grants permission to process R by replying with an OK message. Process R then enters the CS.


-  This algorithm guarantees mutual exclusion, as at a time only one process is allowed to enter the CS. It requires 3 messages per entry and exit of the CS (REQUEST, OK and RELEASE). The limitation of the algorithm is that the coordinator is a single point of failure: if it crashes, the whole scheme fails.
3.5.4  Maekawa's Algorithm

-  In Ricart-Agrawala's algorithm, a process must get permission from all other processes to enter the CS. In Maekawa's algorithm, to enter the CS a process has to obtain permission only from a subset of the processes, as long as the subsets used by any two processes overlap. A process need not obtain permission from all the processes in order to enter the CS.

-  Processes vote for one another in order to enter the CS. If a process collects enough votes, it can enter the CS. A conflict is resolved by the processes common to both sets of processes. Hence, the processes in the intersection of two sets of voters guarantee the safety property that only one process enters the CS, by voting for a single candidate.

-  For each process Pi (1 <= i <= N) there is an associated voting set Vi. The set Vi is a subset of all the processes P1, P2, P3, ..., PN. For all processes i and j between 1 and N, these sets satisfy the following four properties.

1.  Process Pi is a member of Vi.
2.  Voting sets Vi and Vj contain at least one common member.
3.  The size of every voting set is the same; let it be K.
4.  Each process Pi belongs to X of the voting sets Vj.

-  An optimal solution minimizes the value of the voting set size K specified in property 3. This solution guarantees mutual exclusion for K of about sqrt(N) and X equal to K.

-  In order to get access to the CS, process Pi sends REQUEST messages to the K-1 other members of Vi. Process Pi can enter the CS if it receives replies from those K-1 processes. Let process Pi belong to set Vj.

-  Process Pj sends a reply message immediately to process Pi after receiving the REQUEST message from it, unless either its state is HELD or it has already replied (VOTED) since it last received a RELEASE message.

-  When a process receives a RELEASE message, it takes the outstanding REQUEST message from the front of its non-empty queue and sends a reply as its vote. To exit the CS, Pi sends RELEASE messages to all K-1 other members of Vi.

-  This algorithm guarantees mutual exclusion: the processes common to Vi and Vj would have to cast votes for both processes Pi and Pj if both simultaneously wanted to enter the CS. But the algorithm permits a process to cast at most one vote between consecutive receipts of RELEASE messages. Casting two votes is therefore impossible, and hence mutual exclusion is ensured.

-  This algorithm can lead to a deadlock situation. Let V1 = {P1, P2}, V2 = {P2, P3} and V3 = {P3, P1}. If all three processes P1, P2 and P3 simultaneously want to enter the CS, then P1 may vote for P3 while P2 votes for P1 and P3 votes for P2. In this situation each process receives only one of the two replies it requires, and none can proceed further.

-  In this algorithm, if a process crashes and it is not a member of a relevant voting set, then its failure does not affect the other processes. The steps of the algorithm are summarized as shown below.

1.  Initially the state is RELEASED and VOTED is set to false.

2.  If Pi wants to enter the CS, its state becomes WANTED and it sends REQUEST messages to all the processes in its voting set (Vi - {Pi}). It waits for the arrival of K-1 replies; then state = HELD.

3.  If Pj receives a REQUEST from Pi (i != j) and state = HELD or VOTED = true, then it puts the REQUEST of Pi in its queue and does not reply. Otherwise it replies to Pi and sets VOTED = true.

4.  For Pi to exit the CS, state = RELEASED and it sends RELEASE messages to all processes in Vi - {Pi}.

5.  When process Pj receives RELEASE from Pi (i != j) : if its queue of requests is non-empty, it removes the message from the front (head) of the queue; suppose it is from Pk; it then sends a reply to Pk and keeps VOTED = true. Otherwise it sets VOTED = false.
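The voting sets themselves must be constructed so that any two of them intersect. A common, easy-to-check construction (not the only one, and slightly larger than the sqrt(N) optimum mentioned above) arranges the N processes in a square grid and takes each process's row and column as its voting set. The function below is an illustrative sketch assuming N is a perfect square; nothing in it comes from the text.

```python
import math

def grid_voting_sets(n):
    """Voting sets for Maekawa's algorithm via a grid arrangement.
    V_i = row(i) | col(i), so any two sets intersect and P_i is in V_i;
    |V_i| = 2*sqrt(n) - 1, close to the sqrt(N) lower bound."""
    side = math.isqrt(n)
    assert side * side == n, "sketch assumes a perfect-square number of processes"
    sets = []
    for i in range(n):
        r, c = divmod(i, side)
        sets.append({p for p in range(n) if p // side == r or p % side == c})
    return sets

# With 9 processes, P_i is in V_i and every pair of voting sets overlaps.
V = grid_voting_sets(9)
assert all(i in V[i] for i in range(9))
assert all(V[i] & V[j] for i in range(9) for j in range(9))
```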
3.6  Token Based Algorithms

-  A token based algorithm makes use of a token, and this unique token is shared among all the processes. The process possessing the token can enter the CS. There are different token based algorithms, based on the way processes carry out the search for the token.

-  These algorithms use sequence numbers instead of timestamps. Each request for the token carries a sequence number, and the sequence numbers of the processes advance independently. Whenever a process requests the token, its sequence number gets incremented. This helps in differentiating old and current requests for the token. As only the process acquiring the token enters the CS, these algorithms ensure mutual exclusion.

3.6.1  Suzuki-Kasami's Broadcast Algorithm

-  In this algorithm, suppose a process wants to enter the CS and does not hold the token; then it broadcasts a REQUEST message to all the processes. After receiving the REQUEST message, the process holding the token sends it to the requesting process. If the process is already in the CS when it receives the REQUEST message, it sends the token only after it has exited the CS.

-  A process holding the token can repeatedly enter the CS until it gives up the token. The following two issues are important to consider in this algorithm.

o  Differentiating outdated REQUEST messages from current REQUEST messages.
o  Determining the processes having outstanding requests for the CS.

-  Let REQUEST(Pi, n) be the REQUEST message of process Pi. Here n (n = 1, 2, 3, ...) is a sequence number that shows Pi is requesting its nth CS execution. Process Pj maintains an array of integers RNj[1..N], where RNj[i] indicates the largest sequence number received from Pi. If process Pj receives REQUEST(Pi, n), this message is outdated if RNj[i] > n. When process Pj receives REQUEST(Pi, n), it sets RNj[i] = max(RNj[i], n).

-  The token has a queue of requesting processes and an array of integers SN[1..N], where SN[i] indicates the sequence number of the request that process Pi has executed most recently. Process Pi sets SN[i] = RNi[i] whenever it finishes its execution of the CS. This records that its request corresponding to sequence number RNi[i] has been executed.

-  At process Pj, if RNj[i] = SN[i] + 1, then process Pi is currently requesting the token. The process which has executed the CS checks this condition for all i's to learn about all the processes that are requesting the token, and puts their IDs in the requesting queue if not already there. The process then sends the token to the process at the head of the requesting queue.

-  This algorithm is simple and efficient. It requires 0 or N messages per CS entry. The synchronization delay is 0 or T. If the process has the idle token while requesting, then no messages and zero delay are required.

Algorithm

Request for CS

(i)  If process Pi wants to enter the CS and does not have the token, it increments its sequence number RNi[i]. It then sends a REQUEST(Pi, sn) message to all the processes; in this message sn is the updated value of RNi[i].

(ii)  After receiving this request message, process Pj sets RNj[i] to max(RNj[i], sn). If Pj has the idle token, it sends it to Pi if RNj[i] = SN[i] + 1.

Executing the CS

After receiving the token, process Pi executes in the CS.

Exiting the CS

Process Pi carries out the following updates after it finishes execution of the CS.

(i)  It sets SN[i] = RNi[i].
(ii)  For every process Pj whose ID does not belong to the token queue, it appends Pj's ID to the queue if RNi[j] = SN[j] + 1.
(iii)  If the token queue is non-empty after this update, it deletes the ID at the head of the queue and sends the token to the process with that ID.
3.6.2  Singhal's Heuristic Algorithm

-  In this algorithm, each process keeps information about the states of the other processes in the system. This information is used by the process to determine the set of processes that are likely to have the token. The process then requests the token only from these processes, which reduces the number of messages needed to enter the CS.

-  The request for the token should reach the process which is holding the token or is going to obtain it in the near future; otherwise there can be chances of deadlock or starvation.

-  Process Pi keeps two arrays PVi[1..N] and PNi[1..N] to store the state and the highest known sequence number of each process respectively. The token also maintains two arrays TPV[1..N] and TPN[1..N]. Outdated requests are determined by using the sequence numbers. A process can be in one of the following states.

o  REQ : Requesting the CS.
o  EXE : Executing the CS.
o  HOLD : Holding the idle token.
o  NONE : None of the above.

Initially the arrays are set as follows. For every process Pi, for i = 1 to N do :

o  Set PVi[j] = NONE for j = N down to i.
o  Set PVi[j] = REQ for j = i-1 down to 1.
o  Set PNi[j] = 0 for j = 1 to N.

-  Initially process P1 is in the HOLD state. Hence, PV1[1] = HOLD.

-  For the token, TPV[j] = NONE and TPN[j] = 0 for j = 1 to N.

-  The above initialization ensures that for any two processes Pi and Pj, either PVi[j] = REQ or PVj[i] = REQ. Hence, any two processes sending requests at the same time are not inaccessible from each other: at least one will send a token request to the other. This guarantees that a REQUEST message from a process will always be delivered to the process holding the token or to the process which will hold it in the near future.
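The initialization pattern is easiest to see when written out. The sketch below is illustrative (0-indexed, so P1 of the text is index 0): process i marks every lower-numbered process as REQ and the rest as NONE, which yields exactly the pairwise property just described.

```python
def initial_state(n):
    """'Staircase' initialization of Singhal's algorithm: for any pair (i, j)
    either PV[i][j] == 'REQ' or PV[j][i] == 'REQ'."""
    PV = [['REQ'] * i + ['NONE'] * (n - i) for i in range(n)]
    PN = [[0] * n for _ in range(n)]
    PV[0][0] = 'HOLD'                       # P_1 starts holding the idle token
    return PV, PN

PV, PN = initial_state(4)
# Check the pairwise reachability property of the staircase pattern.
assert all(PV[i][j] == 'REQ' or PV[j][i] == 'REQ'
           for i in range(4) for j in range(4) if i != j)
```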

Algorithm

Requesting the CS

1.  If process Pi wants to enter the CS and does not have the token, it takes the following actions.

o  Sets PVi[i] = REQ.
o  Sets PNi[i] = PNi[i] + 1.
o  Pi then sends a REQUEST(Pi, sn) message to all the processes Pj for which PVi[j] = REQ. In this message sn is the updated value of PNi[i].

2.  After receiving the message REQUEST(Pi, sn), process Pj discards the message if PNj[i] >= sn, as the message is outdated. Otherwise, it sets PNj[i] = sn and carries out the following action as per its own current state.

o  PVj[j] = NONE : Set PVj[i] = REQ.
o  PVj[j] = REQ : If PVj[i] != REQ, then set PVj[i] = REQ and send a REQUEST(Pj, PNj[j]) message to Pi (else do nothing).
o  PVj[j] = EXE : Set PVj[i] = REQ.
o  PVj[j] = HOLD : Set PVj[i] = REQ, TPV[i] = REQ, TPN[i] = sn, PVj[j] = NONE, and send the token to process Pi.

Executing the CS

3.  Process Pi executes the CS after receiving the token. Prior to entering the CS, it sets PVi[i] = EXE.

Exiting the CS

4.  After finishing the execution of the CS, process Pi sets PVi[i] = NONE and TPV[i] = NONE. After this, it reconciles its arrays with the token as mentioned below.

For j = 1 to N do :
    If PNi[j] > TPN[j] then
        // Use local information to update the token information.
        TPV[j] = PVi[j]; TPN[j] = PNi[j];
    Else
        // Use token information to update the local information.
        PVi[j] = TPV[j]; PNi[j] = TPN[j];

5.  If for all j, PVi[j] = NONE, then set PVi[i] = HOLD; otherwise send the token to a process Pj such that PVi[j] = REQ.
-  Fairness of the algorithm relies on the selection of process Pj, as no queue is available in the token; negotiation rules are used in order to ensure fairness. The algorithm requires N/2 messages on average in low to moderate load conditions. During high load, when all processes request the CS, N messages are required. The synchronization delay required is T.

Example

Assume there are 3 processes in the system. Initially :

-  Process 1 : PV1[1] = HOLD, PV1[2] = NONE, PV1[3] = NONE. PN1[1], PN1[2], PN1[3] are 0.
-  Process 2 : PV2[1] = REQ, PV2[2] = NONE, PV2[3] = NONE. PNs are 0.
-  Process 3 : PV3[1] = REQ, PV3[2] = REQ, PV3[3] = NONE. PNs are 0.
-  Token : TPVs are NONE. TPNs are 0.

Assume P2 is requesting the token.

-  P2 sets PV2[2] = REQ, PN2[2] = 1.
-  P2 sends REQUEST(P2, 1) to P1 (as only P1 is set to REQ in PV2).
-  P1 receives the REQUEST. It accepts the REQUEST, since PN1[2] is smaller than the message sequence number.
-  As PV1[1] is HOLD : PV1[2] = REQ, TPV[2] = REQ, TPN[2] = 1, PV1[1] = NONE.
-  P1 sends the token to process P2.
-  P2 receives the token. PV2[2] = EXE. After exiting the CS, PV2[2] = TPV[2] = NONE.
-  P2 updates PN, PV, TPN, TPV. Since nobody is requesting, PV2[2] = HOLD.
-  Suppose P3 issues a REQUEST now. It will be sent to both process P1 and P2. Only P2 responds, since only PV2[2] is HOLD (PV1[1] is NONE at present).

3.6.3  Raymond's Tree-Based Algorithm

-  In this algorithm, processes are organized in a directed tree. The edges of the tree point towards the root, i.e. the process which currently holds the token. Each process in the tree maintains a local variable HOLDER (H) that points to an immediate neighbor on the path towards the root process; at the root, HOLDER points to the root itself. The process which currently holds the token has the privilege of being the root.

-  By chasing the HOLDER (H) variables, every process has a path towards the process holding the token. Every process maintains a FIFO request queue which stores the requests sent by neighboring processes to which it has not yet sent the token. The steps of the algorithm are given below.

-  Consider two processes Pi and Pj where Pj is a neighbor of Pi. If the HOLDER variable Hi = j, it means Pi has the information that the root can be reached through Pj. In this case, the undirected edge between Pi and Pj becomes a directed edge from Pi to Pj. Hence, the H variables help to trace the path towards the process holding the token.


-  Consider an undirected tree of processes P1 to P7 with holder variables H, in which process P1 currently holds the token. P2 and P3 are the neighbors of the root P1; P4 and P5 are the neighbors of P2, and P6 and P7 are the neighbors of P3. Every process keeps information only about its immediate neighbors and communicates only with them; the holder variables lead towards the root process P1.

-  Suppose P4 wants to enter the CS. Process P4 sends a REQUEST message to process P2, as H4 = 2. Process P2 in turn forwards this message to process P1 (H2 = 1). The REQUEST message thus gets forwarded by the processes along the minimal spanning path, on behalf of the process which initially sent the REQUEST, towards the root holding the token. A forwarded REQUEST is noted by using the variable "Asked" in order to prevent resending of the same message.

Fig. 3.6.1 : Process P4 sends a REQUEST message to P2, which forwards it to process P1 (at the root, H1 = "self" and the variable t indicates the token)

Fig. 3.6.2 : Process P2 receives the token after process P1 grants the privilege (token t now at P2, H1 = 2, RQ2 = 4)

-  Initially the token was with process P1. When P1 receives the REQUEST message (H1 = "self") and is not executing the CS, it sends a PRIVILEGE message to its neighbor P2, the sender of the REQUEST message, and resets its holder variable to H1 = 2. Process P2 had requested the token on behalf of process P4, so it in turn forwards the PRIVILEGE message to process P4 and updates its holder variable to H2 = 4.

Data Structures

-  RQi is the request queue at process Pi (i = 1, 2, ...). It holds the identities of the immediate neighbor processes that have requested the token but have not yet been sent the token. The maximum size of RQi is the number of immediate neighbors.
Scanned with CamScanner


-  Each process has a variable "Asked" which is initially set to false (F). Its value becomes true (T) only if a REQUEST has been sent, either for itself or for a neighbor, and the PRIVILEGE has not yet been received. After receiving the PRIVILEGE, its value is set to false again.

Algorithm

1.  Request for CS

(If the token is already held by the process, there is no need to send a REQUEST message. The process may need the token for itself or for its neighbor. If the variable Asked is false, it means a request message has not already been sent.)

If process Pi does not hold the token, RQi is not empty and Asked == F :
    Send a REQUEST message to the process held in holder variable Hi; Asked = T;

2.  Pi receives a REQUEST message from Pj :

If Pi holds the token, it is not executing the CS, RQi is not empty and the ID at the head of RQi is not "self" :
    Send a PRIVILEGE message to the requesting process at the head of RQi; update Hi;
Else, if Pi is an unprivileged process :
    Put the request in RQi and forward a REQUEST message to the parent process (Hi) if Asked == F.

3.  Receipt of PRIVILEGE message :

If Pi's own ID is at the head of RQi : dequeue RQi; enter the CS; Asked = F;
If Pi had forwarded the request on behalf of its neighbor :
    Send the PRIVILEGE message to the requesting process which is at the head of RQi; update Hi so that the edge points to the new holder;
    If RQi is still not empty :
        Send a REQUEST message to the new holder Hi; Asked = T;
    Else Asked = F.

4.  Execution of CS

Process Pi can enter the CS if it has the token and its own ID is at the head of RQi.
5.  Process Pi exiting the CS

If RQi is not empty :
    Dequeue RQi; if this element is the ID of Pj, send the token to Pj; Hi = j; Asked = F;
If RQi is still not empty :
    Send a REQUEST message to the parent process (Hi); Asked = T.

Performance : The algorithm exchanges 4 messages per CS entry if the system load is heavy. It exchanges only O(log N) messages under light load.
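Steps 1 to 5 can be combined into one event-driven node, since requesting, forwarding and granting all reduce to operations on RQ, the holder variable and the Asked flag. The sketch below is illustrative and not the book's code: send is an assumed transport primitive and "entering the CS" is only printed.

```python
class RaymondNode:
    """One node of Raymond's tree. holder is the parent pointer H_i."""
    def __init__(self, nid, holder):
        self.nid = nid
        self.holder = holder            # 'self' at the token holder (root)
        self.rq = []                    # FIFO request queue RQ_i
        self.asked = False              # the 'Asked' flag

    def _forward(self, send):
        # Ask the parent for the privilege at most once per outstanding batch.
        if self.holder != 'self' and self.rq and not self.asked:
            send(self.holder, ('REQUEST', self.nid))
            self.asked = True

    def request_cs(self, send):
        self.rq.append(self.nid)        # enqueue own request
        self._assign(send)

    def on_request(self, sender, send):
        self.rq.append(sender)          # enqueue neighbour's request
        self._assign(send)

    def on_privilege(self, send):
        self.holder = 'self'            # token has arrived
        self.asked = False
        self._assign(send)

    def _assign(self, send, in_cs=False):
        if self.holder == 'self' and self.rq and not in_cs:
            head = self.rq.pop(0)
            if head == self.nid:
                print(f"node {self.nid} enters the CS")
            else:
                self.holder = head      # edge now points towards the new root
                send(head, ('PRIVILEGE',))
                self._forward(send)     # re-request if more entries are queued
        else:
            self._forward(send)
```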
3.6.4  Token Ring Algorithm

-  In this algorithm, processes are logically arranged in a ring. In a bus based network there is no fixed criterion to arrange the processes in the ring; the IP addresses of the machines or some other criterion can be used. Each process should know the next process in line after itself.

-  Initially process P0 holds the token. The token circulates around the ring: it is passed from process k to process k+1 in point-to-point messages. After acquiring the token from its neighbor, a process checks whether it wants to enter the CS. If yes, it enters the CS. After it finishes its execution of the CS and exits, it passes the token to its successor along the ring. The same token cannot be used to enter the CS a second consecutive time.

-  If no process is interested in entering the CS, the token simply circulates in the ring at high speed. As only a single process has the token at any instant, only one process will be in the CS: the algorithm ensures mutual exclusion. There is no starvation. The problem in this algorithm is loss of the token: after a loss, the token needs to be regenerated. If the process to which the token is passed is dead, it is removed from the ring.

-  The messages required per CS entry/exit are 1 to infinity (the token circulates even without requests). The delay before entry is 0 to (n-1) messages.
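Since the behaviour is purely sequential around the ring, it can be illustrated with a tiny simulation. Everything below (the event tuples, the wants_cs predicate) is an illustrative stand-in for real message passing.

```python
import itertools

def token_ring(n, wants_cs):
    """Simulation of the token ring: the token hops k -> (k+1) % n, and a
    process holding the token enters the CS at most once per receipt."""
    for holder in itertools.cycle(range(n)):
        if wants_cs(holder):
            yield ('enter', holder)     # execute the CS while holding the token
            yield ('exit', holder)
        yield ('pass', holder, (holder + 1) % n)

# Example: processes 1 and 3 each want the CS once.
pending = {1, 3}
sim = token_ring(4, lambda p: p in pending)
for event in itertools.islice(sim, 12):
    if event[0] == 'enter':
        pending.discard(event[1])       # request served, do not re-enter
    print(event)
```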

3.7  Comparative Performance Analysis of Algorithms

Three parameters compare the performance of the algorithms. These are the response time, the number of messages required and the synchronization delay.
3.7.1  Response Time

-  In the low load condition of the system, the response time for many algorithms is the round trip time to acquire the token (2T) plus the time to execute the CS (which is E).

-  In Raymond's tree-based algorithm, the average distance between the requesting process and the process holding the token is (log N)/2; the average round trip delay therefore is T(log N).
-  Different algorithms vary in their response time in the high load condition of the system; the response time grows as the load in the system increases.

3.7.2  Synchronization Delay

-  This is the delay due to the exchanges of messages needed after a process exits the CS and before the next site enters the CS. In most of the algorithms, when a process exits the CS it directly sends a REPLY or token message to the next process to enter the CS. As a result, the synchronization delay required is T.

-  In Maekawa's algorithm, the process exiting the CS unlocks the arbiter processes by sending RELEASE messages. After this, an arbiter process sends a GRANT message to the next process to enter the CS. Hence, the synchronization delay required is 2T.

-  In Raymond's algorithm, the two processes that consecutively execute the CS can be at any positions in the tree. The token passes serially along the edges of the tree to the process which enters the CS next. If the tree has N nodes, the average distance between two nodes is (log N)/2. The synchronization delay required is T(log N)/2.

3.7.3  Message Traffic

-  Lamport's, Ricart-Agrawala's and Suzuki-Kasami's algorithms respectively need 3(N-1), 2(N-1) and N messages per CS execution in any load condition. In the other algorithms, the messages required depend on whether light or heavy load is present in the system.

-  Light load : Raymond's algorithm requires log N messages per CS execution. The request message travels upward towards the root node from the requester, and the token travels back from the root to the requester node. If the tree has N nodes, the average distance between the root node and the requester node is (log N)/2. Singhal's heuristic algorithm requires N/2 messages per CS execution.

-  Heavy load : Maekawa's algorithm requires 5√N messages in the high load condition. Raymond's algorithm requires 4 messages, whereas Singhal's heuristic algorithm requires N messages.

-  The following table gives the comparative performance of the different algorithms.

Algorithm                     | Response Time (Light Load) | Synchronization Delay | Messages (Light Load) | Messages (High Load)
Lamport (Non-Token)           | 2T+E                       | T                     | 3(N-1)                | 3(N-1)
Ricart-Agrawala (Non-Token)   | 2T+E                       | T                     | 2(N-1)                | 2(N-1)
Maekawa (Non-Token)           | 2T+E                       | 2T                    | 3√N                   | 5√N
Suzuki and Kasami (Token)     | 2T+E                       | T                     | N                     | N
Singhal's Heuristics (Token)  | 2T+E                       | T                     | N/2                   | N
Raymond (Token)               | T(log N)+E                 | T(log N)/2            | log N                 | 4


Review Questions

Q.1  Why is clock synchronization required in distributed systems? Explain.
Q.2  Explain physical clocks in detail.
Q.3  Write a short note on the Global Positioning System (GPS).
Q.4  Explain Cristian's algorithm for clock synchronization.
Q.5  Explain the Berkeley algorithm for clock synchronization.
Q.6  Explain averaging algorithms for clock synchronization.
Q.7  Explain how clock synchronization is carried out in wireless networks.
Q.8  Explain the working of the Network Time Protocol (NTP).
Q.9  What are logical clocks? Explain Lamport's logical clock with an example.
Q.10  What is total order multicasting? Explain with an example.
Q.11  What are vector clocks? Explain.
Q.12  Explain the Bully election algorithm with an example.
Q.13  Explain the Ring election algorithm with an example.
Q.14  How is election of a coordinator carried out in wireless networks? Explain.
Q.15  What is mutual exclusion? How is it different in a distributed system?
Q.16  Explain in short the working of non-token based and token-based algorithms for mutual exclusion.
Q.17  What are the requirements of mutual exclusion algorithms?
Q.18  Explain the metrics used to measure the performance of mutual exclusion algorithms.
Q.19  Explain Lamport's algorithm for mutual exclusion in a distributed system.
Q.20  Explain Ricart-Agrawala's algorithm for mutual exclusion in a distributed system.
Q.21  Explain the centralized algorithm for mutual exclusion in a distributed system.
Q.22  Explain Maekawa's algorithm for mutual exclusion in a distributed system.
Q.23  Explain Suzuki-Kasami's broadcast algorithm for mutual exclusion in a distributed system.
Q.24  Explain Singhal's heuristic algorithm for mutual exclusion in a distributed system.
Q.25  Explain Raymond's tree-based algorithm for mutual exclusion in a distributed system.
Q.26  Explain the comparative performance of different mutual exclusion algorithms.
|



Module 4

Resource and Process Management

Syllabus

Desirable Features of global Scheduling algorithm, Task assignment approach, Load balancing approach, Load sharing approach, Introduction to process management, Process migration, Threads, Virtualization, Clients, Servers, Code Migration.
4.1  Introduction

-  In a distributed system, resources are distributed across many machines. If the local node of a process does not have the needed resources, the process can be migrated to another node where these resources are available.

-  In this chapter, the resource considered is the processor of the system. Each processor forms a node of the distributed system.

-  The following techniques are used for scheduling the processes of the distributed system.

o  Task assignment approach : In this method, the tasks of each process are scheduled to suitable nodes to improve performance.
o  Load-balancing approach : In this approach, the main objective is to distribute the processes to different nodes so as to make the workload equal among the nodes in the system.
o  Load-sharing approach : In this approach, the main objective is to guarantee that no node will be idle while processes in the system are waiting for processors.

4.2  Desirable Features of a Global Scheduling Algorithm

Following are the desirable features of a good global scheduling algorithm.

-  No a priori knowledge about the processes : A process scheduling algorithm should not require a priori knowledge about the characteristics and resource needs of the processes. Such knowledge imposes an extra burden on users, as they would have to specify this information while submitting a job for execution.

-  Dynamic in nature : A process scheduling algorithm should always consider the current information about the load of the system and should not obey a fixed static policy when assigning a process to a particular node. It should consider the dynamically changing load on the nodes of the system. Hence, the algorithm should have the flexibility to migrate a process more than once: after some time the system load may change, and it may be necessary to transfer the process from its previous node to a new node again to adapt to the changed load. This is possible only if the system supports preemptive process migration.

-  Quick decision making capability : A process scheduling algorithm should always make quick decisions regarding the assignment of processes to nodes.

-  Balanced system performance and scheduling overhead : A process scheduling algorithm should not collect excessive global state information in order to use it for process assignment decisions.
-  In a distributed system, collecting global state information requires more effort and communication overhead. Further, the usefulness of the information being collected can reduce due to the cost of collecting and processing it and due to the scheduling frequency. Hence, an algorithm offering near-optimal performance using minimum global state information is preferred.

-  Stability : There may be a situation where all the nodes in the system spend their maximum time in migrating processes without carrying out any useful work. The migration of processes is carried out to achieve better performance; such unproductive migration is called processor thrashing. Processor thrashing occurs if nodes schedule their own processes locally and independently, without coordinating with the other nodes. In that case, the scheduling information at a node may become old due to transmission delays between the nodes. Suppose node 1 and node 2 both observe that node 3 is idle; both send part of their load to node 3 without coordinating with each other. In this situation, if node 3 becomes overloaded, it also starts transferring its processes to other nodes. This leads to an unstable state of the system due to repetition of such a cycle again and again.

-  Scalability : A process scheduling algorithm should work for small as well as large networks. If an algorithm enquires for the load at all the nodes and sends processes after receiving the replies, it can work well for smaller networks; but in a large network, collecting many replies consumes bandwidth as well as processing power, so it cannot be scalable. A better approach is to probe only k out of the n nodes.

-  Fault tolerance : The algorithm should continue working even if one or more nodes in the system crash. It should consider only the available nodes in decision making.
4.3  Task Assignment Approach

-  This approach finds an optimal assignment policy for the tasks of an individual process. The following are the assumptions in this approach.

o  Tasks of the process are less interdependent, so that data transfer among them is minimized.
o  How much computation is required for each task and the speed of each processor are already known.
o  The cost of processing each task on every node is known.
o  The interprocess communication (IPC) cost between each pair of tasks is known, and it is zero if the communicating tasks are running on the same node.
o  The precedence relationship among tasks, the resource needs of each task and the resources available on each node are already known.

-  Considering the above assumptions, the main goal of assigning the tasks of the process to nodes is to achieve the following.

o  Reducing the IPC cost.
o  Quick turnaround time for the complete process.
o  A high degree of parallelism.
o  Efficient utilization of system resources.

-  It is not possible to achieve these goals simultaneously, as they conflict with each other. You can minimize IPC cost by keeping all tasks on the same node, but keeping all tasks on the same node also limits their parallel execution; for efficient utilization of resources, tasks must be evenly distributed among the nodes.

-  If x is the number of tasks and y the number of nodes, then y^x assignments of tasks to nodes are possible. In practice, certain tasks cannot be assigned to certain nodes due to some restrictions; hence the possible assignments can be fewer than y^x.



-  The total cost which is optimal will not necessarily be obtained by assigning an equal number of tasks to each node. For two processors, an optimal assignment can be found in polynomial time, but for an arbitrary number of processors the problem is NP-hard. Heuristic algorithms are found to be efficient from a computation point of view, but they produce suboptimal assignments.

4.4  Load-Balancing Approach

-  Load balancing algorithms use this approach. It assumes that equal load should be distributed among all nodes to ensure better utilization of resources. These algorithms transfer load from heavily loaded nodes to lightly loaded nodes in a transparent way to balance the load in the system. This is carried out to achieve good performance relative to the parameters used to measure system performance.

-  From the user's point of view, the performance metric often considered is the response time. From the resource point of view, the performance metric often considered is the total system throughput. The throughput metric involves treating all users in the system fairly while making progress as well. As a result, load balancing algorithms focus on maximizing the system throughput.
4.4.1  Classification of Load Balancing Algorithms

-  Basically, load balancing algorithms are classified into two types : static and dynamic algorithms.
-  Static algorithms are classified as deterministic and probabilistic.
-  Dynamic algorithms are classified as centralized and distributed algorithms.
-  Distributed algorithms are further classified as cooperative and non-cooperative.

-  Static algorithms do not consider the current state of the system, whereas dynamic algorithms consider it. Hence, dynamic algorithms can avoid those system states which degrade system performance. However, dynamic algorithms need to collect information about the current state of the system and respond to it; as a result, dynamic algorithms are more complex than static algorithms.

-  Deterministic algorithms consider the properties of the nodes and the characteristics of the processes to be assigned to them to ensure an optimized assignment, for example the task assignment approach. A probabilistic algorithm uses information such as the network topology, the number of nodes and the processing capability of each node; such information helps to improve system performance.

-  In a centralized dynamic scheduling algorithm, the scheduling responsibility is carried out by a single node. In a distributed dynamic scheduling algorithm, various nodes are involved in making the scheduling decision, i.e. the assignment of processes to processors. In the centralized approach, system state information is collected at a single node, which is the centralized server node. This node takes the process assignment decisions based on the collected system state information; all other nodes periodically send this information to it. This is an efficient approach for taking scheduling decisions, as the single node knows the load on each node and the number of processes needing service. A single point of failure is the limitation of this approach; message traffic also increases towards the single node, so it may become a bottleneck.

-  A distributed dynamic scheduling algorithm involves various nodes in making the scheduling decision and hence avoids the bottleneck of collecting state information at a single node. A local controller on each node runs concurrently with the others; they take decisions in coordination with each other based on a system-wide objective function instead of a local one.

-  In cooperative distributed scheduling algorithms, the distributed entities make scheduling decisions by cooperating with each other, whereas in non-cooperative distributed scheduling algorithms the entities do not cooperate with each other for making scheduling decisions and each one acts as an autonomous entity.



4.4.2  Issues in Designing Load-Balancing Algorithms

Designing a load-balancing algorithm requires deciding the following policies.

1.  Load estimation policy, to determine the workload of a node.
2.  Process transfer policy, to decide whether a node is lightly or heavily loaded.
3.  Location policy, to select the node to which a selected process should be transferred.
4.  State information exchange policy, to decide how nodes exchange system load information.
5.  Priority assignment policy, to assign the priority of execution of local and remote processes on a particular node.
6.  Migration limiting policy, to determine the number of times a process can migrate from one node to another.

-  If a process is processed at its originating node, it is a local process. If a process is processed on a node which is not its originating node, it is a remote process. If a process arrived from another node is admitted for processing, it becomes a local process on the current node; otherwise it is sent to other nodes across the network and is then considered a remote process at the destination node.

Load Estimation Policy

-  A load-balancing algorithm should first estimate the workload of a particular node based on some measurable parameters. These parameters include factors which are time dependent and node dependent, such as : the total number of processes present on the node at the time of load estimation, the resource demands of these processes, the instruction mixes of these processes, and the architecture and speed of the processors.

-  As the current state of the load is important, the estimation of the load should be carried out in an efficient manner. The total number of processes present on a node is considered to be an inappropriate measure for estimation of load, because the actual load could vary depending on the remaining service time of those processes; and measuring the remaining service time of all processes is itself a problem.

-  Moreover, both measures, the total number of processes and the total remaining service time of all processes, are not suitable to measure load in a modern distributed system, in which an idle node may have several processes such as daemons and window managers that wake up periodically and sleep again.

-  Therefore, CPU utilization is considered the best measure to estimate the load of a node. CPU utilization is the number of CPU cycles actually executed per unit of real time. A timer is set up to observe the CPU state from time to time.

Process Transfer Policies

-  In order to balance the load in the system, a process is transferred from heavily loaded nodes to lightly loaded nodes. There should be some policy to decide whether a node is heavily or lightly loaded. Most of the load-balancing algorithms use a threshold policy.

-  At each node a threshold value is used. A node accepts a new process to run locally if its workload is below the threshold value; otherwise the process is transferred to another, lightly loaded node. The following policies are used to decide the threshold value.

o  Static policy : Each node has a predefined threshold value as per its processing capability. It remains fixed and does not vary with the changing workload at the local or a remote node. No exchange of state information is needed in this policy.

o  Dynamic policy : In this policy, the threshold value for each node is the product of the average workload of all the nodes and a predefined constant for that node. The predefined constant value Ck for node nk depends on the processing capability of node nk with respect to the processing capability of all the other nodes. Hence, state information exchange between nodes is needed in this policy.
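The dynamic policy above is a one-line formula: threshold of node k = (average workload of all nodes) x Ck. As an illustrative sketch (the load values and constants below are made up):

```python
def dynamic_thresholds(loads, constants):
    """Dynamic threshold policy: each node's threshold is the average
    workload of all nodes times its per-node constant C_k."""
    avg = sum(loads) / len(loads)
    return [c * avg for c in constants]

loads = [4, 8, 2, 6]                # current workload reported by each node
constants = [1.0, 1.5, 0.5, 1.0]    # C_k, reflecting relative processing power
print(dynamic_thresholds(loads, constants))   # -> [5.0, 7.5, 2.5, 5.0]
```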


-  Most of the load-balancing algorithms use a single-threshold policy, in which node states are classified as overloaded and underloaded. In this policy, if the load is below the threshold, the node accepts new local or remote processes for execution; otherwise it executes only local processes and rejects remote processes.

Fig. : Single-threshold policy (a node is overloaded above the threshold value and underloaded below it)

-  In the single-threshold policy, suppose the load is just below the threshold value. In this case, the load becomes larger than the threshold value as soon as the node accepts a remote process, and the node will then start transferring its own processes. Therefore it is suggested that a node should transfer its processes only when this will not degrade the performance of its local processes, and should accept remote processes only when those processes do not much affect the service to its local processes.
?_4 ot ea aenpnMetee a
istributed Computi Management
pistributec Lomputing (MU-Sem 8-Comp.) 4-6 Resource and Process
each node plays two roles; manager and contractor.
gidding : In this policy, bidding Process is carried out in which
processes -
and contractor is a node which Is ready to accept
Man2et" node has processes to send for execution
execution. Same node can play both the roles, Manager broadcasts request-for-bid message t0 all the and nodes and
other-
ge containing memory size, proces sing capabil ity, resource availability
contractor return bid with messa It may also happe n that,
bid from conitys@tar and sends process there.
Manager determines best bid upon receiving
problem is that,
ers and becomes overloaded. Solution to this
contractor receives many bids from Many manag ctor replies with message whether
manager sends message to contractor before transferring the process. Contra
contractor have freedom to send and
s will be accepted or rejected. In this policy, both managers and
proces
difficult to decide best pricing
ckis more communication overhead and
accept/reject processes respectively. Drawba :
policy.
er.Process transfer
Pairing : In this policy, two nodes that greatly differs in their load are tempora rily paired each oth
s esta bli shed in the system
thenis carried out from heavily loaded nodetolightly loaded node. There may be manypair
and simultaneous transfer ofprocesses is carried out amongthe pairs. |

State Information Exchange Policies

It is necessary to choose the best policy to support dynamic load balancing algorithms and to reduce the number of messages transmitted. The following policies are used to exchange state information.

-  Periodic broadcast : In this policy, each node broadcasts its state information after the elapse of every t units of time. It generates heavy network traffic, and unnecessary messages get transferred even though the state has not changed. It has poor scalability.

-  Broadcast when state changes : In this method, a node broadcasts its state information only if its state changes. A node's state changes if it transfers its processes to other nodes or if it receives processes from other nodes. A further improvement to this method is to not report small changes in state to the other nodes, as nodes take part in the load balancing process only if they are underloaded or overloaded. The refined method is that a node broadcasts its state information only when it moves from the normal load region to the underloaded or overloaded region.

-  On-demand exchange : In this method, a node broadcasts a state information request when its state moves from the normal load region to either the underloaded or the overloaded region. The receivers of this request reply with their current state to the requesting node. This method works with the double-threshold policy. In order to reduce the number of messages, a further improvement is to avoid replies regarding the current state from all the nodes : if the requesting node is underloaded, only overloaded nodes can cooperate with it, and if the requesting node is overloaded, only underloaded nodes can cooperate with it in the load-balancing process. As a result, in this improved policy the requesting node's identity is included in the state information request message, so that only the nodes which can cooperate with the requesting node send a reply; the other nodes do not send any reply to the requesting node.

-  Exchange by polling : This method does not use broadcast, which limits scaling. Instead, the node which needs cooperation from other nodes for balancing the load searches for a suitable partner by polling the other nodes one by one. The exchange of state information takes place between the polling node and the polled node. The polling process stops either if an appropriate node is located or if a predefined poll limit is reached.
poll limit is attained.
appropriate nodeis located or predefined

Priority Assignment Policies

Main issue in scheduling the local or remote processes at particular node in load balancing process is decidinging th the
Priority assignment rules. Following rulesare usedtoassignpriority to the processesat particular node
© Selfish - In this rule, local processes gets higher priority compared to remoteprocesses.
~_° Altruistic (Unselfish) : in this rule, remote processesare given higher priority comparedto local processes

8 TechKnewledge
Pablication:s

AT.
on

Scanned with CamScanner


7_Distributed Computing
SS
MU-Sem 8-Comp.)
a
Process Manageme
o Intermediate In this rule, priority is determined on the basis of numberoflocal processes and numberof remot,
processesatthat particular node.This rule assigns higher priority to local processes if their total numberis greate,
or equal to numberof remoteprocesses.If not, remote processesare assigned with higherpriority.

Migration Limiting Policies

The total number of times a process should migrate from one node to another is also important in balancing the load of the system. The following two policies may be used.

o  Uncontrolled : A process can be migrated any number of times from one node to another. This policy leads to instability.

o  Controlled : As process migration is a costly operation, this policy limits the migration count to 1. This means a process may be migrated to only one node. For large processes this count can be increased, as such a process may not be able to finish execution in the specified time on the same node.

4.5  Load-Sharing Approach

-  In the load-balancing approach, the exchange of state information among nodes is required, which involves considerable overhead. Since resources are distributed, load cannot be exactly equally distributed among all nodes, yet there should still be proper utilization of these resources. In a large distributed system, the number of processes on a node is always fluctuating; hence, temporal unbalancing of load among nodes always exists.

-  Therefore, it is sufficient to keep all the nodes busy. It should be ensured that no node will remain idle while other nodes have several processes to execute. This is called dynamic load sharing.

4.5.1  Issues in Designing Load-Sharing Algorithms

Load sharing algorithms aim to ensure that no node is idle while other nodes are overloaded. The issues in load sharing algorithms are similar to the issues in load-balancing algorithms, but it is simpler to decide about the policies in load-sharing algorithms compared to load-balancing algorithms.

Load Estimation Policies

-  Load-sharing algorithms only ensure that no node is idle. Hence, the simplest policy to estimate load is the total number of processes on a node.

-  As in a modern distributed system several processes permanently reside even on an idle node, CPU utilization is considered as the measure to estimate the load on a node.

Process Transfer Policies

-  Load-sharing algorithms are mainly concerned with two states of a node : busy or idle. They use a single-threshold policy with a threshold value of 1. A node accepts a process if it has no process, and transfers a process as soon as it has more than one process. If a node becomes idle, it may be unable to obtain a new process immediately; this policy wastes the processing power of the idle node.

-  The solution to the above problem is to transfer a process to a node which is expected to become idle soon. Therefore, some load-sharing algorithms use a single-threshold policy with a threshold value of 2. If CPU utilization is considered as the measure of load, then a double-threshold policy should be used.

Location Policies

-  In load-sharing algorithms, the sender or the receiver node is involved in the load sharing process.
~ Inload-sharing algorithms, sender or receiver node Is Involved In load sharing process.



-  Sender-initiated policy : In this location policy, the sender takes the decision about the transfer of a process to a particular node. The heavily loaded node searches for lightly loaded nodes to transfer a process to. When the load of the node becomes greater than the threshold value, it either broadcasts or randomly probes the other nodes to find a lightly loaded node which can accept one or more of its processes.

-  A good candidate node to accept the processes is a node whose load would not exceed the threshold value after accepting the processes for execution. In the case of broadcast, the sender node knows about the availability of a suitable node as soon as it receives the replies from all the receivers. In random probing, probing continues till an appropriate receiver is found or the number of probes reaches a static probe limit. If no suitable node is found, the process is executed on its originating node. The probing method is more scalable than broadcast.
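The random-probing variant of the sender-initiated policy is easy to sketch. In the illustrative code below, load_of stands for the remote load query, threshold for the transfer policy's value and probe_limit for the static probe limit; none of these names come from the text.

```python
import random

def find_receiver(load_of, threshold, nodes, probe_limit=3):
    """Sender-initiated location policy with random probing: poll up to
    probe_limit nodes and accept the first whose load would stay at or
    below the threshold after taking one more process."""
    for node in random.sample(nodes, min(probe_limit, len(nodes))):
        if load_of(node) + 1 <= threshold:
            return node      # suitable lightly loaded node found
    return None              # none found: execute on the originating node

loads = {'n1': 5, 'n2': 1, 'n3': 4}
print(find_receiver(lambda n: loads[n], threshold=3, nodes=list(loads)))
```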
- ReceiverInitiated Policy : In this location policy, receiver takes decision about from whereto get the process.In this
policy, lightly loaded node search for heavily loaded nodes to get load from there. When load of the node goes below
the threshold value,it either broadcast or randomly probes the other nodesto find heavily loaded node which can
transfer one or moreits processes.- |
- The good candidate nodeto sendits processes will be that node whose load would not reduce below threshold value
after sending the processes for execution. In case of broadcast, sender node immediately knows aboutavailability of
suitable node to send processes to it whenit receives reply from all the receivers. Whereas, in random probing,
probing continuoustill appropriate node is found from which processes can be obtained or numberof probes reaches
to static probelimit.
- In later case, node waits for fixed timeout period before trying again toinitiate transfer. Probing method is more
scalable than broadcast. |
- Preemptive process migration is costlier than non-preemptive, as the execution state needs to be transferred along with the process. In non-preemptive process migration, the process is transferred before it starts execution on its current node. Receiver-initiated process transfer is mostly preemptive, whereas sender-initiated process transfer is either preemptive or non-preemptive.
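Both location policies reduce to the same probing loop with the roles reversed. The Python sketch below is illustrative only (probe(), the node list, and PROBE_LIMIT are assumptions, with probe() standing in for a real network round trip); it shows sender-initiated random probing with a static probe limit.

    import random

    PROBE_LIMIT = 3   # static probe limit (assumed value)
    THRESHOLD = 2     # the receiver must stay below this after the transfer

    def probe(node):
        # Placeholder for a network round trip asking `node` for its load.
        return node["load"]

    def find_receiver(nodes):
        # Sender-initiated: randomly probe up to PROBE_LIMIT nodes; return the
        # first whose load stays within THRESHOLD after accepting one process.
        candidates = random.sample(nodes, min(PROBE_LIMIT, len(nodes)))
        for node in candidates:
            if probe(node) + 1 <= THRESHOLD:
                return node
        return None   # no suitable node: run the process on its origin node

    if __name__ == "__main__":
        nodes = [{"name": "n1", "load": 4}, {"name": "n2", "load": 0},
                 {"name": "n3", "load": 3}]
        print(find_receiver(nodes))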

State Information Exchange Policies

In the load-sharing approach, state information exchange is carried out only when a node changes its state, because a node needs to know the state of other nodes only when it becomes overloaded or underloaded. In this case the following policies are used.

- Broadcast When State Changes : In this policy, a node broadcasts a state information request when its state changes. In sender-initiated policy, this broadcast is carried out when the node becomes overloaded, and in receiver-initiated policy, it is carried out when the node becomes underloaded.
- Poll When State Changes : Polling is a more appropriate method than broadcast in large networks. In this policy, upon a change in state, a node polls other nodes one by one and exchanges its changed state information with them. Exchanging the state information continues till an appropriate node for sharing load is found or the number of probes reaches the probe limit. In sender-initiated policy, probing is carried out by the overloaded node, and in receiver-initiated policy, probing is carried out by the underloaded node.

4.6 Introduction to Process Management

Process management includes different policies and mechanisms to share processors among all the processes in the system.

- Process management should ensure the best utilization of processors by allocating them to different processes.
- Process migration deals with transfer of a process from its current location to the node to which it is assigned.

4.7 Process Migration


- Process migration is the transfer of a process from its current node (the source node) to a destination node. There are two types of process migration. These are given below.
  o Non-preemptive process migration : In this approach, the process is migrated from its current node to the destination node before it starts executing on the current node.
  o Preemptive process migration : In this approach, the process is migrated from its current node to the destination node during its course of execution on the current node. It is costlier than non-preemptive process migration, as the process environment also needs to be transferred along with the process to the destination node.
- Migration policy includes selection of the process to be migrated and selection of the destination node to which the process should be migrated. This is already described in resource management. Process migration describes the actual transfer of the process to the destination node and the mechanism to transfer it.
- Freezing time is the time period for which execution of the process is stopped for transferring its information to the destination node.

4.7.1 Desirable Features of Good Process Migration Mechanism

Following are desirable features of a good process migration mechanism.
- Transparency : The following levels of transparency are found.
  o Object Access Level : Access to objects such as files and devices should be possible in a location-independent manner.
- Robustness : Failure of any node other than the one on which the process is running should not affect the execution of that process.
- Communication between coprocesses : Coprocesses should be able to directly interact with each other irrespective of their location.

4.7.2 Process Migration Mechanisms

Following activities are involved in process migration.
1. Freezing the process at the source node and restarting its execution on its destination node.
2. Transferring the process's address space from its source node to its destination node.
3. Forwarding all the messages for the process to its destination node.
4. Handling communication between cooperating processes placed on different nodes due to process migration.

Mechanism for Freezing and Restarting the Process

- In preemptive process migration, usually a snapshot of the process state is taken at its source node and then this snapshot is reinstated at its destination node. First, process execution is suspended at its source node. Its state information is then transferred to the destination node. After this, its execution is restarted at the destination node.
- Freezing the process at the source node means its execution is suspended and all external interaction with the process is postponed.
- Delayed Blocking of the Process : In preemptive process migration, execution of the process must be blocked at its source node. This blocking can be carried out immediately, or after the process reaches a state in which it can be blocked. Hence, when to block the process depends on its current state. If the process is currently not executing a system call, then it can be immediately blocked. If the process is currently executing a system call but is sleeping at an interruptible priority waiting for occurrence of a kernel event, then it can be immediately blocked. If the process is currently executing a system call but is sleeping at a non-interruptible priority waiting for occurrence of a kernel event, then it cannot be blocked immediately. The mechanism of blocking can vary as per implementation.
- Fast and Slow I/O Operations : Allow completing all fast I/O operations, such as disk I/O, before freezing the process at the source node. It is not feasible to wait for slow I/O operations such as those on pipes and terminals.
- Information about Open Files : State information of a process also contains information about files currently open by the process. It includes name or identifier of the files, access modes, current position of their file pointers etc. In a distributed system where a network-transparent execution environment is supported, this state information can be easily collected, since the same protocol is used to access a local or a remote file. A UNIX based system recognizes a file by its full pathname. The operating system returns a file descriptor to the process once the file is opened by it, and the process then uses the file descriptor to complete I/O operations on the file. In such a system, the pointer to the file must be preserved so that the migrated process can keep on accessing the file.
- Reinstating the Process on its Destination Node : A new empty process is created on the destination node. The identifier of this newly allocated process can be the same as or different from the one allocated during process creation, depending on implementation. If it is different, then it is changed to the original identifier in a subsequent step before the process starts executing on the destination node. Operations on both the processes are suspended; hence, the rest of the system cannot notice the existence of two copies. When the entire state of the migrating process is transferred and copied into the empty process, the new process is unfrozen and the old one is deleted. Thus the process restarts execution from the state it had before being migrated. The migration mechanism should allow performing at the destination node any system call which was not completed due to freezing of the process.


Address Space Transfer Mechanism
- Following information is transferred when a process is migrated from the source node to the destination node.
  o Process State : It includes execution status, i.e. content of registers, scheduling information, memory used by the process, I/O states, its rights to access objects (its capability list), the process's identifier, user and group identifiers of the process, and information about files opened by the process.
  o Process Address Space : It includes code, data and stack of the program.
- Size of the process state information is a few kilobytes, whereas the process address space is of large size. The process address space can be transferred without stopping its execution on the source node. However, while transferring state information, it is necessary to stop execution of the process.
- To start execution at the destination node, state information is needed at the destination node. The address space can be transferred before or after the process starts execution at the destination node.
- Address space can be transferred in the following manners.

Total Freezing

- In this method, the process is not allowed to continue its execution once the decision of its transfer to the destination is taken. The execution of the process is stopped while its address space is being transferred. This method is simple and easy to implement.
- The limitation of this method is that if the process being transferred is suspended for a long period of time during migration, then timeouts may occur. Also, the user may observe delay if the process is interactive.

Pretransferring

- In this method, the process is allowed to run on its source node until its address space has been transferred to the destination node. Pages modified at the source node during the transfer of the address space are then retransferred to the destination, followed by further retransfers. This retransfer of modified pages is carried out repeatedly until the number of modified pages is relatively small. These remaining modified pages are transferred after the process is frozen for transferring its state information.
- In the first pretransfer operation, as the first transfer of the address space takes the longest time, the longest window is also available for pages to be modified. Obviously, the second transfer will take a shorter time, as only the pages modified during the first transfer are transferred, and these modifications will be fewer compared to the first transfer. Subsequently, fewer pages will be transferred in each next transfer to the destination node, till the number of pages converges to zero or very few.
- These remaining pages are then transferred from the source node to the destination node when the process is frozen. At the source node, the pretransfer operation gets higher priority with respect to other programs. In this method, freezing time is reduced, and hence migration interferes minimally with the process's interactions with other processes and users. As the same pages can be transferred repeatedly due to modifications, the total time elapsed in migration can increase.
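The iterative retransfer can be written as a short loop. The sketch below is illustrative only (the dirty-page tracking, send_pages(), the FakeProcess stand-in, and the convergence constants are assumptions, not the book's mechanism); it mirrors the pretransferring rounds described above and ends with a final copy during the short frozen phase.

    SMALL_ENOUGH = 2   # stop iterating once this few dirty pages remain (assumed)
    MAX_ROUNDS = 10    # bound the rounds so repeated dirtying terminates

    def migrate_address_space(process, send_pages):
        # Pre-copy loop: `process` must offer all_pages(), dirty_pages(),
        # and freeze(); these stand in for real page tracking.
        pages = process.all_pages()          # round 1: the whole address space
        for _ in range(MAX_ROUNDS):
            send_pages(pages)                # process keeps running meanwhile
            pages = process.dirty_pages()    # pages modified during the send
            if len(pages) <= SMALL_ENOUGH:
                break
        process.freeze()                     # short freeze for the residue
        send_pages(process.dirty_pages())    # state information goes in the
                                             # same frozen window

    class FakeProcess:
        # Toy stand-in: dirties ever fewer pages each round, as in the text.
        def __init__(self): self.round = 0
        def all_pages(self): return list(range(64))
        def dirty_pages(self):
            self.round += 1
            return list(range(max(0, 16 // self.round)))
        def freeze(self): print("frozen")

    migrate_address_space(FakeProcess(), lambda p: print("sent", len(p), "pages"))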
Transfer on Reference

- In this method, the entire address space of the process is kept on its source node, and only the part of the address space needed for execution is requested from the source node during execution (just as if the relocated process executes at the destination node). A page is transferred from the source node on its reference at the destination node. This is just like the demand-driven copy-on-reference approach.
- This method is more dependent on the source node and imposes a burden on it. Moreover, if the source node fails or reboots, the process execution will fail.



Message-Forwarding Mechanism

- It should be ensured that the migrated process receives all its pending messages, messages in transmission (en-route messages), and future messages at the destination node. The following types of messages need to be forwarded to the destination node of migration.
  o Type 1 : Messages which are received at the source node after the process's execution has been stopped and its execution has not until now been commenced at the destination node.
  o Type 2 : Messages received at the source node after the process's execution has started at the destination node.
  o Type 3 : Messages expected by the already migrated process after it starts its execution on the destination node.
Mechanism of Resending the Message

- This mechanism is used in the V-System and Amoeba distributed OS to handle all the above types of messages. Messages of types 1 and 2 are simply returned to the sender as they are not deliverable, or simply dropped with the assumption that the sender maintains a copy of them and is able to retransmit. Messages of type 3 are directly sent to the destination node of the process.
- In this method, there is no need to maintain the process state at the source node. But this message-forwarding mechanism makes the process migration operation non-transparent to the other processes communicating with the migrant process.

Original Site Mechanism

- In this method, the process identifier has its home node embedded in it, and every node is responsible for maintaining information about the current location of all the processes created on it. Hence, the current location of the process can be obtained from its home node. Any process now first forwards the message to the home node, which then forwards the message to the process's current node.
- The drawback of this method is that it imposes a continuous load on the process's home node even though the process has migrated to another node. Also, failure of the home node disturbs the message-forwarding mechanism.
Link Traversal Mechanism

- In this method, a message queue for the migrant process is created on its source node. All the type 1 messages are placed in this queue. After knowing that the process is established at the destination node, all the messages in the message queue are forwarded to the destination node.
- For type 2 and 3 messages, a link pointing to the destination node of the migrant process is maintained at the source node. This link is a forwarding address for the process. It consists of two components: one is a system-wide unique process identifier, which contains the identifier of the originating node of the process, and the other is a unique local identifier which records a last known location of the process. The first component remains the same during the lifetime of the process, whereas the second component may change. A series of links, starting from the node where the process was created, is traversed in order to forward messages. This traversal stops at the process's current node, and the second component of the link is then updated when the process is accessed from a node. This improves efficiency the next time the process is located from the same node.
- The drawback of this method is that if any node in the chain fails, then the process cannot be located. It also has poor efficiency, as many links need to be traversed to locate the process.
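A toy version of the forwarding-address chain makes the traversal and the update step concrete. The sketch below is illustrative only (the node table and field names are assumptions); it follows links until the process is found and then short-cuts the stale links along the path, which is the efficiency improvement mentioned above.

    # Toy forwarding-address chain for the link traversal mechanism.
    # `nodes` maps a node name to the link it stores for a given process:
    # either ("link", next_node) or ("here", process_object).

    def locate(nodes, start):
        hops, current = [], start
        while True:
            kind, value = nodes[current]
            if kind == "here":              # traversal stops at current node
                for h in hops:              # update stale links along the path
                    nodes[h] = ("link", current)
                return current
            hops.append(current)
            current = value                 # follow the forwarding link

    if __name__ == "__main__":
        nodes = {"A": ("link", "B"), "B": ("link", "C"), "C": ("here", "P1")}
        print(locate(nodes, "A"))   # -> "C"; A's and B's links now point to C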


Mechanism for Handling Coprocesses

There should be efficient communication between parent and child processes if they are migrated and placed at different nodes.



Not Allowing Separation of Processes
- Processes that wait for one or more of their children to complete are not allowed to migrate. This is used by UNIX based networked systems. The drawback of this approach is that it does not permit use of parallelism within a job, due to restricting the various tasks of a job from being distributed on different nodes.
- If a parent process migrates, then its children processes will be migrated along with it. This is used by the V-System. The drawback of this approach is that it involves more overhead in migrating the process if the logical host comprising the V-System address space and its associated processes is large (a logical host comprises several host processes).

Home Node in Original Site Concept

- In this method, a process and its children are allowed to migrate independently to different nodes. All the communication between the parent process and its children takes place via the home node.
- The drawback is that message traffic and communication cost increase.

4.7.3 Process Migration in Heterogeneous System


- In a heterogeneous environment, it is necessary to translate the data format when a process is migrated to the destination node. The CPU format at the source node and at the destination node can be different. If n CPUs are present in the system, then n(n-1) pieces of translation software are required.
- The solution to reduce this software complexity is to use an external data representation, in which each processor has to convert data to and from the standard form. Hence, if 4 processors are present then only 8 conversions are required.

Two conversions are needed for each processor: one to convert data from CPU format to the external format, and the other in the reverse order (external format to CPU format).

- In this method, translation of floating point numbers, which consist of exponent, mantissa and sign, needs to be handled carefully. As an example of handling the exponent, suppose processor A uses 8 bits, B uses 16 bits, and the external data representation offers 12 bits. A process can be migrated from A to B, which involves conversion of 8 bits to 12 bits, and of 12 bits to 16 bits for processor B. A process requiring exponent data of more than 12 bits cannot be migrated from processor B to A. Therefore, the external data representation should be designed with a number of bits at least as long as the longest exponent of any processor in the system.
- Migration from processor B to processor A is also not possible if the exponent is less than or equal to 12 bits but greater than 8 bits. In this case, although the external data representation has a sufficient number of bits, processor A does not.
- For the mantissa, suppose processor A uses 32 bits, B uses 64 bits and the external data representation uses 48 bits. Handling the mantissa is the same as handling the exponent; however, transferring a process from B to A will result in computation with half precision. This may not be acceptable when accuracy of the result is important. A second problem may occur due to loss of precision over multiple migrations between a set of processors. The solution is to design the external data representation with the longest mantissa of any processor. The external data representation should also take care of signed-infinity and signed-zero.
signed-zero.

4.7.4 Advantages of Process Migration

Following are the advantages of process migration.


- Reducing average response time of processes : The response time of processes increases as the load on a node increases. In process migration, as processes are moved from a heavily loaded node to an idle or lightly loaded node, the average response time of processes reduces.



- Speeding up individual jobs : The tasks of a job can be migrated to different nodes. Also, a job can be migrated to a node having a faster CPU. In both cases, execution of the job speeds up.
- Gaining higher throughput : As, globally, all the CPUs are better utilized with a proper load balancing policy, higher throughput can be achieved. By properly mixing CPU and I/O bound jobs, throughput can be increased.
- Effective utilization of resources : In a distributed system, hardware and software resources are distributed on many nodes. These resources can be effectively used by migrating processes to the nodes as per their requirements. Moreover, if processes require special-purpose hardware, then they can be migrated to a node having such hardware facility.
- Reducing network traffic : Migrating processes closer to the resources, or to the node having their required resources, reduces network traffic as resources are accessed locally. Instead of accessing a huge database remotely, it is better to move the process to the node where such a database is present.
- Improving system reliability : A critical process can be migrated to a node with higher reliability in order to improve reliability.
- Improving system security : This can be achieved by migrating a sensitive process to a more secure node.

4.8 Threads |
facility, basic unit of CPU
In operating system supporting thread
of the application. as lightweight
_4 Threads improve performance Threads are also called
is a single sequence stream within a process.
utilization is threads. A thread one process.
h thread belongs to exactly
e of the properties of processes. Eac
processesas it pos ses s som threads run in parallel
hreading, pro ces s can con sist of many threads. These
In operating system that sup
port multit
CPU state an d stack, but
they share the address
h thr ead has its own
performance. Each suc
improving the application
the environment.
space of the process and the processes, threads
the y do not nee d to use interprocess communication. Like
mmon data so highest
_4 Threads can share co blo cked etc. priority can be assigned
to the threads just like process and
dy, execut ing ,
also have stateslike rea
led first.
' priority threadis schedu thread and register
process context switch occurs for the
its ow n Thr ead Contro | Block {TCB). Like hronization is also required
fach thread has resources, sync
4
B). As thr ead s shar e the same address space and
con tents are saved in
(TC
activities of the t hre
ad.
for the various

4.8.1 Comparison between Process and Thread

Threads are more advantageous compared to processes. The following table shows the comparison between thread and process.
Sr. No. | Process | Thread
1. | A program in execution is called a process. It is a heavy weight process. | A thread is a part of a process. It is also called a light weight process.
2. | Process context switch takes more time as compared to thread context switch, because it needs the interface of the operating system. | Thread context switch takes less time as compared to process context switch, because it needs only an interrupt to the kernel.
3. | New process creation takes more time as compared to new thread creation. | New thread creation takes less time as compared to new process creation.
4. | New process termination takes more time as compared to new thread termination. | New thread termination takes less time as compared to new process termination.
5. | Each process executes the same code but has its own memory and file resources. | All threads can share the same set of open files and child processes.
6. | In a process based implementation, if one process is blocked, no other server process can execute until the first process is unblocked. | In a multithreaded server implementation, if one thread is blocked and waiting, a second thread in the same process could execute.
7. | Multiple redundant processes use more resources than a multithreaded process. | Multithreaded processes use fewer resources than multiple redundant processes.
8. | A context switch flushes the MMU (TLB) registers, as the address space of the process changes. | No need to flush the TLB, as the address space remains the same after a context switch because the threads belong to the same process.

4.8.2 Server Process

Following models can be used to construct a server process. Consider the example of a file server.
- As a Single-Threaded Process : In this model, client requests are served sequentially; there is no parallelism. The model uses blocking system calls. The file server gets a client's file access request from the request queue. If access is permitted, then the server checks whether disk access is needed. If not, the request is served immediately and a reply is sent to the client process. If disk access is needed, then the file server sends a disk access request to the disk server and waits till the reply arrives. After getting the disk server's reply, it services the client request and sends a reply to the client process. The server then goes back to handle the next request. Hence, performance of this server is unacceptable.
- As a Finite-State Machine : It supports parallelism with non-blocking calls. An event queue is maintained to store the disk server's reply messages and clients' requests. When the server is idle, it takes the next message from the event queue. If it is a client's request and access is permitted, then the server checks whether disk access is needed. If disk access is needed, the server sends a disk access request to the disk server. At this point, instead of blocking, it saves the current state of the client's request in a table and goes to the next message in the event queue. This message may be a new client's request or a reply to a previous request from the disk server. If it is a new client's request, then it is handled in the above manner; if it is a reply from the disk server, then the server retrieves the current state of the client's request from the table and further processes this client request with the retrieved state.



4.8.3 Models for Organizing Threads

Following are the three different ways to organize the threads.
- Dispatcher-Worker Model : In this model, the process consists of one dispatcher thread and several worker threads. The dispatcher thread accepts the client's request and chooses one of the free worker threads to hand over this request for further processing. Hence, multiple clients' requests can be processed by multiple worker threads in parallel (see the sketch after this list).
- Team Model : In this model, all threads are of equal level. Each thread accepts a client's request and processes it on its own. Threads of a specific type can be created to handle specific types of requests. Hence, parallel execution of different clients' requests can be carried out.
- Pipeline Model : This model is useful for applications which are based on the producer-consumer model. In this model, output of the first thread is given as input to the second thread for processing, output of the second thread is given as input to the third thread, and so on. Threads are arranged in a pipeline. In this way, output of the last thread is the final output of the process to which the threads belong.
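Below is a minimal Python sketch of the dispatcher-worker model (illustrative only; the queue-based hand-off and the worker count are implementation assumptions, not prescribed by the text). The dispatcher accepts requests and places them on a queue from which free worker threads pick them up.

    import queue
    import threading

    requests = queue.Queue()

    def worker(worker_id):
        # Each worker thread serves requests handed over by the dispatcher.
        while True:
            req = requests.get()
            if req is None:          # sentinel: shut the worker down
                break
            print(f"worker {worker_id} processing {req}")
            requests.task_done()

    def dispatcher(incoming):
        # The dispatcher accepts client requests and hands them to workers.
        for req in incoming:
            requests.put(req)

    workers = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
    for w in workers:
        w.start()

    dispatcher([f"request-{n}" for n in range(6)])
    requests.join()                  # wait until all requests are processed
    for _ in workers:
        requests.put(None)           # stop the workers
    for w in workers:
        w.join()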

4.8.4 Issues in Designing a Thread Package


Following are the issues related to designing a thread package. The system should support primitives for thread related operations.

Thread Creation
- Threads can be created statically, in which case the number of threads remains fixed for the lifetime of the process. In dynamic creation of threads, the process is initially started with a single thread, and new threads are created as needed during the execution of the process. A thread may destroy itself after finishing its job by using an exit call.
- In the static approach, the number of threads is decided while writing or compiling the program, and a fixed stack is allocated to each thread. In the dynamic approach, the stack size is specified through a parameter to the thread creation call.
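For illustration, Python's threading package supports exactly this dynamic style: threads are created as needed, the stack size for new threads can be specified as a parameter, and a thread terminates when its job finishes. A minimal sketch (the worker function and the sizes are assumptions):

    import threading

    # Dynamic thread creation: the stack size for threads created after this
    # call is given as a parameter (must be 0 or at least 32 KiB in CPython).
    threading.stack_size(256 * 1024)

    def job(n):
        print(f"thread for job {n} running")
        # returning from the function terminates the thread (like exit)

    threads = [threading.Thread(target=job, args=(i,)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()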

Thread Termination
- A thread may destroy itself after finishing its job by using an exit call. A kill command is used to kill a thread from outside by specifying the thread identifier as a parameter. In many cases, threads are never killed until the process terminates.

Thread Synchronization
- The portion of the code where a thread may be accessing some shared variable is referred to as the critical region (CR). It is necessary to prevent multiple threads from accessing the same data simultaneously. For this purpose, a mutex variable is used. A thread that wants to execute in a CR performs a lock operation on the corresponding mutex variable. In a single atomic operation, the state of the mutex changes from unlocked to locked.
- If the mutex variable is already in the locked state, then either the thread is blocked and waits in the queue of waiting threads on the mutex variable, or the thread carries out another job and repeatedly retries to lock the mutex variable. In a multiprocessor system, where threads run in parallel, two threads may carry out the lock operation on the same mutex variable; in this case, one thread waits and the other wins. To exit the CR, a thread carries out an unlock operation on the mutex variable.
- Condition variables are used for more general synchronization. Operations wait() and signal() are provided for a condition variable. When a thread carries out a wait operation on a condition variable, the associated mutex variable is unlocked and the thread is blocked till a signal operation is carried out by another thread on the condition variable. When a thread carries out a signal operation on a condition variable, the mutex variable is locked and a thread blocked on the condition variable resumes execution.
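The same lock/wait/signal pattern can be shown with Python's threading primitives. This is a sketch under the assumption that Lock and Condition stand in for the mutex and condition variable described above, with notify() playing the role of signal().

    import threading

    mutex = threading.Lock()
    items = []
    nonempty = threading.Condition(mutex)   # condition tied to the mutex

    def consumer():
        with mutex:                  # lock the mutex before entering the CR
            while not items:
                nonempty.wait()      # unlocks the mutex, blocks until signalled
            print("consumed", items.pop())
            # leaving the with-block unlocks the mutex

    def producer():
        with mutex:
            items.append("data")
            nonempty.notify()        # signal(): wake one waiting thread

    c = threading.Thread(target=consumer)
    c.start()
    threading.Thread(target=producer).start()
    c.join()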


4.8.5 Thread Scheduling

Thread packages support the following scheduling features.


- Priority Assignment Policy : Threads are scheduled on a simple FIFO or round-robin (RR) basis. Application programmers can assign priority to threads so that important threads get the highest priority. This priority based scheduling can be non-preemptive or preemptive.
- Flexibility to vary quantum size dynamically : In a multiprocessor system, if there are fewer runnable threads than available processors, then it is wasteful to interrupt a running thread. Hence, instead of using a fixed-length quantum, a variable quantum is used. In this case, the size of the time quantum varies inversely with the number of threads in the system. It gives good response time.
- Handoff Scheduling : The currently running thread may name its successor to run. It may give up the processor and request the successor to run next, bypassing the queue of runnable threads. It enhances performance if cleverly used.
- Affinity Scheduling : In this scheduling in a multiprocessor system, a thread is scheduled on the processor on which it last ran, expecting that part of its address space is still in that processor's cache.

4.8.6 Implementing a Thread Package


A thread package can be implemented either in user space or in the kernel.

Implementing Threads in User Space

User Level Threads

- In user level implementation, the kernel is unaware of threads. In this case, the thread package is put entirely in user space. The Java language supports such a threading package.

- A user can implement a multithreaded application in the Java language; the kernel treats this application as a single threaded application. In a user level implementation, all of the work of thread management is done by the thread package.
- Thread management includes creation and termination of threads, message and data passing between the threads, scheduling threads for execution, thread synchronization, and saving and restoring thread context after a context switch.
- Creating and destroying a thread requires little time and is a cheap operation: it is only the cost of allocating memory to set up a thread stack and of deallocating that memory while destroying the thread.
- Since the threads belong to the same process, there is no need to flush the TLB, as the address space remains the same after a context switch. User-level threads incur extremely low overhead, and can achieve high computational performance. Fig. 4.8.1 shows the user level threads.
[Figure: user level threads and the thread library reside in the user area; the kernel area below is unaware of them]
Fig. 4.8.1 : User level threads

Advantages of user level threads

- Thread switching does not require flushing of the TLB or doing CPU accounting. Only the values of the CPU registers need to be stored and reloaded again.
- User level threads are platform independent. They can execute on any operating system.
- Scheduling can be as per the need of the application.
- Thread management is done at user level by the thread library, so the kernel's burden is taken by the threading package and the kernel's time is saved for other activities.
Disadvantages of user level threads

- If one thread is blocked on I/O, the entire process gets blocked.
- For applications where, after blocking of one thread, others are required to run in parallel, user level threads are of no use.

Implementing Threads in Kernel


Kernel Level Threads
- In this case, threads are implemented in the operating system's kernel, and thread management is carried out by the kernel.
- All these thread management activities are carried out in kernel space, so thread context switching and process context switching become the same.
- An application can be written as multithreaded, and the threads of the application are supported as threads in a single process.
- Kernel threads generally require more time to create and manage than user threads.
Advantages of kernel level threads

- The kernel can schedule another thread of the same process even though one thread in the process is blocked. Blocking of one thread does not block the entire process.
- The kernel can simultaneously schedule multiple threads from the same process on multiple processors.
Disadvantages of kernel level threads

- Context switching requires kernel intervention.
- Kernel threads generally require more time to create and manage than user threads.

Hybrid Implementation

- Considering the advantages of user level and kernel level threads, a hybrid threading model using both types of threads can be implemented. The Solaris operating system supports this hybrid model.
- In this implementation, all the thread management functions are carried out by a user level thread package in user space, so operations on threads do not require kernel intervention.
- The advantage of the hybrid model of threads is that, if the applications are multithreaded, then they can take advantage of multiple CPUs if they are available.
- Other threads can continue to make progress even if the kernel blocks one thread in a system function.
4.9 Virtualization

- From the availability and security point of view, using a separate computer for each service is preferable: if one fails, the others will not be affected.

- It is also beneficial in case organizations need to use different types of operating systems. However, keeping separate machines for each service is costly. Virtual machine technology can solve this problem. Although it seems to be modern, the idea behind it is old.
- The VMM (Virtual Machine Monitor) creates the illusion of multiple (virtual) machines on the same physical hardware. A VMM is also called a hypervisor. Using a virtual machine, it is possible to run legacy applications on operating systems no longer supported or which do not work on current hardware.
- Virtual machines permit running, at the same time, applications that use different operating systems. Several operating systems can run on a single machine without installing them in separate partitions. Virtualization technology plays a major role in cloud computing.

4.9.1 The Role of Virtualization in Distributed Systems

- Practically every machine offers a programming interface to higher level software, as shown in Fig. 4.9.1. There are many different types of interfaces, for example the instruction set as offered by a CPU and, as another example, the huge collection of application programming interfaces that are shipped with many present middleware systems.
[Figure: a program uses an interface offered by the underlying system]
Fig. 4.9.1 : Program, Interface and System
- Virtualization makes it possible to extend or replace an existing interface so as to imitate the activities of another system, as shown in Fig. 4.9.2.

[Figure: programs written for interface X run on an implementation imitating X, built on top of system Y]
Fig. 4.9.2 : Virtualizing of System X on top of Y
- With advancement in technology, hardware and low-level system software change reasonably fast compared to the applications and middleware which are at a higher level. Hence, it is not possible for legacy software to maintain pace with the platforms it relies on. With virtualization, legacy interfaces can be carried to new platforms, and the latter can be opened for large classes of existing programs.
- In large distributed system environments, which are heterogeneous collections of nodes running different applications, virtualization plays a major role in managing resources easily and efficiently.
- Each application can run on its own virtual machine, perhaps with its related libraries, which in turn run on a common platform. In this way the diversity of platforms and machines can be minimized, and a high degree of portability and flexibility can be achieved.
4.9.2 Architectures of Virtual Machines

- Four types of interfaces are offered by any computer system. These are : an interface between the hardware and software consisting of machine instructions that can be invoked by any program; an interface between the hardware and software consisting of machine instructions that can be invoked by the OS only; an interface consisting of system calls as offered by the operating system; and an interface consisting of library calls, generally forming an application programming interface (API).

- In the first technique of virtualization, we can build a runtime system that basically offers an abstract instruction set that is to be used for executing applications. Instructions can be interpreted, but could also be emulated, as is done for running Windows applications on UNIX platforms.
- In the other technique of virtualization, a system is provided that is basically implemented as a layer completely shielding the original hardware, and that offers the complete instruction set of that same or other hardware as an interface. This interface can be offered at the same time to different programs. Hence, it is now possible to have multiple, different operating systems run independently and in parallel on the same platform.

4.10 Clients

4.10.1 Networked User Interface

- A client contacts the server to access a remote service. For this purpose, the client machine may have a separate counterpart that can contact the service over the network. Alternatively, a convenient user interface can be offered to access the service: the client then needs no local storage and has only a terminal. In the case of networked user interfaces, the entire processing and storage is carried out at the server.

"Example : The X Window System .

- Itis used to control bit-mapped terminals, which include a monitor, keyboard, and a pointing device such as a mouse.
It is just like componentof an operating system that controls the terminal. The X kernel comprisesall the terminal-
specific device drivers. It is usually highly hardware dependent. The X kernel offers a relatively low-level interface for
controlling the screen, but also for capturing events from the keyboard and mouse.
- This interface is made available to applicationsas a library called Xlib. The X kernel and the X applications may reside
on the same or different machine. Particularly, X offers the X protocol, which is an application-level communication
~ protocol by which an instanceofXlib can exchange data and events withthe X kernel.
4 Asanexample, Xlib can request to X kernel to create or kill a window,for setting colors, and defining the type of cursor
to display, etc. The X kernel will react to local events such as keyboard and mouseinput bysending event packets back
to Xlib. Several applications can communicate simultaneously with the X kernel. Window manager application has
to the user.
special rights. This application can dictate the "look and feel" of the display as it appears
ency
4.10.2 Client-Side Software for Distribution Transpar

- In client server computing, some part of the processing and the data level is executed at the client side as well. A special class is formed by embedded client software, such as for automatic teller machines, cash registers, barcode readers, TV set-top boxes, etc. In these cases, the user interface is a relatively small part of the client software, in contrast to the local processing and communication facilities.
- Client software also includes components for achieving distribution transparency. A client should not be aware that it is communicating with remote processes. On the contrary, distribution is often less transparent to servers, for reasons of performance and correctness.
- As the client side offers the same interfaces as are available at the server, access transparency is achieved by the client stub. The stub offers the same interface as available at the server, and hides the possible differences in machine architectures, as well as the actual communication.
- Location, migration, and relocation transparency are handled in a similar manner. For replication, a client request is forwarded to each server replica; client-side software can transparently collect all responses and pass a single response to the client application. Failure transparency is also handled by the client: the middleware carries out masking of communication failures and can repetitively attempt to connect to the same server or another server.
4.11 Servers
4.11.1 General Design Issues

- A server takes a request sent by a client, processes it, and sends a response to the client. An iterative server itself handles the request and sends the response back to the client. A concurrent server passes an incoming request to a thread and immediately waits for another request; a multithreaded server is an example of a concurrent server.

- A server listens for client requests at an endpoint. Clients send requests to this endpoint, called a port. These endpoints are globally assigned for well-known services. For example, servers that handle Internet FTP requests always listen to TCP port 21. Some services use a dynamically assigned endpoint; a time-of-day server may use an endpoint that is dynamically assigned to it by the local OS.
- In this case, a client has to search for the endpoint. For this purpose, a special daemon process is maintained on each server machine. The daemon keeps track of the current endpoint of each service implemented by a co-located server. The daemon itself listens to a well-known endpoint. A client will first contact the daemon, request the endpoint, and then contact the specific server.
- Another issue is how to interrupt the server while an operation is going on, for example when a user suddenly exits the client application, immediately restarts it, and pretends nothing happened. The server will eventually tear down the old connection, thinking the client has most likely crashed. Another technique to handle communication interrupts is to design the client and server such that it is possible to send out-of-band data, which is data that is to be processed by the server before any other data from that client.
- One solution is to let the server listen to a separate control endpoint to which the client sends out-of-band data, while at the same time listening (with a lower priority) to the endpoint through which the normal data passes. Another solution is to send out-of-band data across the same connection through which the client is sending the original request.
- Another important issue is whether the server should be stateless or stateful. A stateless server does not keep information on the state of its clients, and can change its own state without having to inform any client. Consider the example of a file server, where clients access remote files from the server. If the server keeps track of each file being accessed by each client, then the service is called stateful. If the server simply provides requested blocks of data to the client and does not keep track of how the client makes use of them, then the service is called stateless.

Stateful File Service


- First, a client must carry out an open( ) operation on a file prior to accessing it. The server fetches the access information about the file from disk and stores it in main memory.
- The server then gives a unique connection identifier to the client and opens the file for the client. This identifier is used by the client to access the file throughout the session.
- On closing of the file, or by a garbage collection mechanism, the server should free the main memory space used by a client when it is no longer active. The information maintained by the server regarding clients is used in fault tolerance. AFS uses the stateful approach.

Stateless File Service


- In this case, the server does not keep any information in main memory about opened files. Here, each request identifies the file. There is no need to open and close the file with the operations open( ) and close( ).
- Each file operation in this case is not part of a session but stands on its own. Closing the file at the end is also a separate remote operation. NFS is an example of the stateless file service approach.


- Stateful service gives better performance. As the connection identifier is used to access information maintained in main memory, disk accesses are saved. Moreover, as the server knows that a file is open for sequential access, it can read ahead the next block, which improves the performance. In the stateful case, if the server crashes, then recovery of the volatile state information is complex: a dialog with the client is used by the recovery protocol. The server also needs to know about a crash of the client in order to free the memory space allocated to it. All the operations underway during the crash should be aborted.
- In case of stateless service, the effect of a server crash and recovery is invisible. The client keeps retransmitting the requests if the server crashes and there is no response from it.
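The practical difference is visible in the shape of a read request. The sketch below uses illustrative data structures only (not the NFS or AFS wire formats); it contrasts a stateless request, which names the file and offset itself, with a stateful one that relies on a connection identifier the server must remember.

    # Illustrative request shapes for stateless vs. stateful file service.

    # Stateless (NFS-like): every request is self-identifying, so the server
    # keeps nothing in main memory between requests.
    stateless_read = {"file": "/home/user/data.txt", "offset": 4096, "count": 512}

    # Stateful (AFS-like): the client first opens the file and then uses the
    # connection identifier; the server must map it to stored file state.
    open_reply = {"conn_id": 17}                     # returned by open()
    stateful_read = {"conn_id": 17, "count": 512}    # offset tracked by server

    def stateless_server(request, storage):
        # Can serve any request in isolation; crash recovery is trivial.
        data = storage[request["file"]]
        return data[request["offset"]:request["offset"] + request["count"]]

    if __name__ == "__main__":
        storage = {"/home/user/data.txt": b"x" * 8192}
        print(len(stateless_server(stateless_read, storage)))   # 512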

4.11.2 Server Clusters

- A server cluster is a collection of server machines that are connected through a network. Consider that these machines are connected through LANs, which often offer high bandwidth. The server cluster is logically organized in three tiers, as shown in Fig. 4.11.1.
  o Switch : Client requests are routed through it.
  o Application/Compute Servers : These are high performance servers or application servers.
  o Data Processing Servers : These are file or database servers.
- A server cluster offering multiple services can have different machines running different application servers. Therefore, the switch should distinguish services; otherwise it will not be possible to forward requests to the proper machines.
[Figure: client requests enter through a logical switch (possibly multiple) in the first tier and are dispatched to application/compute servers in the second tier, backed by a distributed file/database system in the third tier]
Fig. 4.11.1 : Logical organization of server cluster

- In a server cluster, some of the servers may remain idle, as each may be running a single service. Hence, it is advantageous to temporarily migrate services to idle machines. Client applications should always remain unaware of the internal organization of the cluster.
- To access the cluster, a TCP connection is set up, over which application-level requests are sent as part of a session. The session ends by terminating the connection. If transport-layer switches are used, then the switch accepts incoming connection requests and hands off such connections to one of the servers. An incoming request comes to the single address of the switch, which then forwards it to the appropriate server.
- The server's ACK message to the client contains the IP address of the switch as the source address. The switch can follow a round-robin policy to distribute load among the servers; more advanced server selection criteria can be deployed as well.
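A transport-level switch's round-robin selection is easy to sketch. The code below is illustrative only (an in-memory stand-in with assumed server addresses, not a real TCP handoff): it cycles through the second-tier servers for each incoming connection.

    import itertools

    SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]   # second-tier servers (assumed)
    _rr = itertools.cycle(SERVERS)

    def pick_server():
        # Round-robin policy: each new connection goes to the next server.
        return next(_rr)

    if __name__ == "__main__":
        for conn in range(5):
            print("connection", conn, "->", pick_server())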


4.11.3 Distributed Servers

- Using a single switch (access point) to forward client requests to multiple servers makes the first tier a single point of failure: the whole cluster may become unavailable. As a result, several access points can be used, whose addresses are made publicly available.
- A distributed server is a dynamically varying collection of machines, with possibly varying access points, which appears to the outside world as a single powerful machine. With such a distributed server, the clients benefit from a robust, high-performing, stable server. Available networking services are used, particularly the mobility support in IPv6 (MIPv6). In MIPv6, a mobile node normally resides in a home network with its home address (HoA), which is stable.
- When a mobile node changes network, it receives a care-of address in the foreign network. This care-of address is reported to the node's home agent (a router attached to the home network), which forwards traffic to the mobile node. All applications communicate with the mobile node using the home address; they never see the care-of address.
- This concept can be used for a stable address of a distributed server. In this case, a single unique contact address is initially assigned to the server cluster. The contact address will be the server's life-time address, to be used in all communication with the outside world. At any time, one node in the distributed server will work as an access point using that contact address, but this role can easily be taken over by another node.
- What happens is that the access point records its own address as the care-of address at the home agent associated with the distributed server. At that point, all traffic will be directed to the access point, which will then take care of distributing requests among the currently participating nodes. If the access point fails, a simple fail-over mechanism comes into place by which another access point reports a new care-of address.
- The home agent and the access point may become a bottleneck, as the whole traffic would flow through these two machines. This situation can be avoided by using the MIPv6 feature known as route optimization.

4.12 Code Migration

4.12.1 Approaches to Code Migration

Let us first discuss why to migrate the code.

Reasons for Migrating Code

- Performance is improved by moving code from heavily loaded machines to lightly loaded machines, although migration itself is a costly operation. If a server manages a huge database and a client application performs operations on this database that involve large quantities of data, it is better to transfer part of the client application to the server machine. In the other direction, instead of sending requests for data validation to the server, it is better to carry out client-side validation; hence, part of the server application is transferred to the client machine.
:
- Code migration also helps to improve performance by exploiting parallelism, without the need to consider the details of parallel programming. Consider the example of searching for information in the Web: a search query that moves from site to site can achieve a linear speedup compared to using just a single program instance. Code migration also helps to configure distributed systems dynamically.
- In another case, there is no need to keep all software at the client. Whenever the client binds to the server, it can dynamically download the required software from a web server. Once the work is done, all this software can be discarded. There is no need to preinstall client side software.

Models for Code Migration

- In the first model, a program comprises a code segment, a resource segment and an execution segment. The code segment contains all the instructions of the program. The resource segment includes references to external resources needed by the process, such as files, printers, devices, other processes, and so on. The execution segment stores the current execution state of the process, containing private data, the stack, and the program counter.
- In weak mobility, only the code segment with some initialization data is transferred. For example, Java applets always start execution from the initial state. On the other hand, in strong mobility, the execution segment is needed to resume execution on the target node, as execution of the process is stopped at the source node and resumed at the target node.
- Migration of code can be sender-initiated or receiver-initiated. In sender-initiated migration, migration is initiated by the node (the source node) where the code is currently running. In receiver-initiated migration, the target machine takes the initiative for code migration. Receiver-initiated migration is simpler than sender-initiated migration. Sometimes the client initiates migration, and in the same way the server may initiate migration.
- In weak mobility, migrated code is either executed by the target process, or a separate process needs to be started. For example, downloaded Java applets execute in the browser's address space. This avoids starting a separate process and hence avoids communication at the target machine. The main drawback is that the target process needs to be protected against malicious or unintentional code executions. A simple solution is to let the operating system take care of that by creating a separate process to execute the migrated code.
- Migration by cloning, in which a process creates child processes, is a simple way to improve distribution transparency.
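As a concrete, deliberately simplified illustration of receiver-initiated weak mobility with execution in a separate process, the sketch below fetches a code segment and runs it in a fresh process. The URL and file name are hypothetical, and a real system would add authentication and sandboxing against malicious code.

    import subprocess
    import sys
    import urllib.request

    CODE_URL = "http://example.org/task.py"   # hypothetical code server

    def fetch_and_run(url):
        # Receiver-initiated weak mobility: pull only the code segment (no
        # execution state) and start it from its initial state.
        code = urllib.request.urlopen(url).read()
        with open("migrated_task.py", "wb") as f:
            f.write(code)
        # A separate process shields the target process from the migrated
        # code (real systems would also sandbox it).
        return subprocess.run([sys.executable, "migrated_task.py"], check=False)

    if __name__ == "__main__":
        fetch_and_run(CODE_URL)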
Following are the alternatives for code migration.

[Fig. 4.12.1 : Alternatives for code migration. Both weak mobility and strong mobility can be sender-initiated or receiver-initiated, and in each case the migrated code may be executed in the target process or in a separate process]

4.12.2 Migration and Local Resources

- Regarding migration of the resource segment, suppose a process holds a reference to a specific TCP port to communicate with remote processes. This reference is held in its resource segment. After migration to another node, the process will have to give up the port and request a new one at the destination. In other cases, such as a reference to a file by a URL, the reference will remain valid.
- Three types of process-to-resource bindings exist. Binding by identifier requires precisely the referenced resource, such as a URL used to refer to a specific Web site, or an FTP server referred to by means of that server's Internet address. References to local communication end points are also in this category.
local communication end points are also in this category.
of the process doesnotaffect if
rce is need ed. In this binding, execution
In binding byvalue, only the value of a resou
ng by valueis libraries that are locally available,
same value is avail able on target node. A typical examp le of bindi
.In binding by type, a process indicates
file system maydiffer between nodes
although their exact location in the local as monitors, printers, and
it needs only a resource of a speci fic type. Exampleis: references to loc al devices, such
so on.
moved from
-machine bindings. Unattached resources can be easily
It is also important to consider the resource-to ces but it
On the other hand,it is possible to move fastened resour
one machine to other. The example is datafile.
tes arethe example of fastened resour
ces. Fixed resources like
incurs high cost. Local databases and complete Websi
moved.
local devices and communication endpoints cannot be
and resource-to-machine bindings, following combinationis
Considering above types of process-to-resource bindings
required to consider while migrating the code.

Process-to-resource binding | Unattached resource | Fastened resource | Fixed resource
By identifier | Move (or Global Ref) | Global Ref (or Move) | Global Ref
By value | Copy (or Move, Global Ref) | Global Ref (or Copy) | Global Ref
By type | Rebind (or Move, Copy) | Rebind (or Global Ref, Copy) | Rebind (or Global Ref)

  o Move : Move the resource.
  o Global Ref : Establish a global system-wide reference.
  o Copy : Copy the value of the resource.
  o Rebind : Rebind the process to a locally available resource.
- It is best to move an unattached resource along with the migrating code. If the resource is shared by other processes, then the best solution is to establish a global reference, such as a URL, to access the resource remotely. For a fastened or fixed resource, the solution is to create a global reference.
- Establishing a global reference is sometimes expensive, for example when a huge amount of multimedia data has to be accessed remotely.
- Suppose a migrating process is using a local communication end point, which is a fixed resource to which the process is bound by identifier. In this case, the process sets up a connection to the source machine after it has migrated, and a separate process is installed at the source machine to forward all incoming messages. The drawback is that failure of the source machine results in failure of communication with the migrated process. Alternatively, all processes that communicated with the migrating process could change their global reference and send messages to the new communication end point at the target machine.
- In binding by value, a fixed resource such as memory can be shared by establishing a global reference. In this case, distributed shared memory support is needed.

- Runtime libraries are the example of binding by value with fastened resources. They can be available on the target machine, or should otherwise be copied before code migration takes place. A global reference is preferred if a huge amount of data would need to be transferred.

- Unattached resources can be copied to the new destination if not shared by many processes. If a resource is shared, then establishing a global reference is the only option.

- In case of bindings by type, the obvious solution is to rebind the process to a locally available resource of the same type.

4.12.3 Migration in Heterogeneous Systems

- In a large distributed system, machine architectures and platforms used are different. Hence, the environment is heterogeneous. With the use of scripting languages and highly portable languages such as Java, the solution to code migration in heterogeneous systems relies on a virtual machine. In case of scripting languages, these virtual machines directly interpret source code. In case of Java, the virtual machine interprets intermediate code generated by a compiler.

- Recently, instead of migrating only processes, migration of the entire computing environment is preferred. It is possible to decouple a part from the underlying system and actually migrate it to another machine, provided proper compartmentalization is carried out. In this way, such migration supports strong mobility. In this type of migration, many complexities that arise in binding with different types of resources may be solved.

Review Questions

Q. 1 Explain in short scheduling techniques used in distributed system.
Q. 2 Explain desirable features of good global scheduling algorithm.
Q. 3 Explain in detail task assignment approach.
Q. 4 Explain and classify different load balancing algorithms.
Q. 5 Explain different issues in designing load-balancing algorithms.
Q. 6 Explain policy to estimate the load of a node in load balancing approach.
Q. 7 Explain different process transfer policies in load balancing approach.
Q. 8 Explain different approaches to locate the node in load balancing approach.
Q. 9 Explain different state information exchange policies in load balancing approach.
Q. 10 What are the different priority assignment techniques in load balancing approach? Explain.
Q. 11 Explain issues in designing load-sharing algorithms.
Q. 12 Explain different approaches to locate the node in load sharing approach.
Q. 13 Explain different state information exchange policies in load sharing approach.
Q. 14 Explain the types of process migration.
Q. 15 Explain desirable features of good process migration mechanism.
Q. 16 Explain the mechanism for freezing and restarting the process in process migration.
Q. 17 Explain address space transfer mechanisms in process migration.
Q. 19 Explain mechanism to handle coprocesses in process migration.
Q. 20 Explain process migration in heterogeneous system.
Q. 21 What are the advantages of process migration? Explain.
Q. 22 Explain the different advantages of threads compared to processes.
Q. 23 Explain different models to construct server process.
Q. 24 Explain different models for organizing threads.
Q. 25 Explain issues in designing a thread package.
Q. 26 Explain different approaches to implement thread package.
Q. 27 Explain different techniques of thread scheduling.
Q. 28 Explain the role of virtualization in distributed systems.
Q. 29 Explain architecture of virtual machines.
Q. 30 Explain general design issues of server.
Q. 31 Explain the client-side software for distribution transparency.
Q. 32 What is server cluster? Explain in brief logical organization of server cluster.
Q. 33 Explain different approaches to code migration.
Q. 34 Explain how binding to local resources is handled in process migration.
Q. 35 Write short note on "code migration in heterogeneous systems".



Consistency, Replication and Fault Tolerance

Syllabus

Introduction to replication and consistency, Data-Centric and Client-Centric Consistency Models, Replica Management. Fault Tolerance : Introduction, Process resilience, Reliable client-server and group communication, Recovery.

5.1 Introduction to Replication and Consistency


- The replication of data is carried out in distributed system in order to enhance the reliability and to improve performance. Although reliability and performance can be achieved through replication, it should be ensured that all the copies of data are consistent. If one copy updates then these updates should be propagated to the other copies of data also. Otherwise replicas will not be the same.

- Replication also relates to scalability. If processes access shared data simultaneously, the consistency should be ensured. There are many data-centric consistency models. In large scale distributed system, these models are not easy to implement. For this reason, client-centric consistency models are useful, which focus on consistency from the perception of a single client, which possibly may be a mobile client also.

5.1.1 Reasons for Replication

- Reliability and performance are two main reasons to replicate data. If one replica of the data crashes then another replica can be available for the required operation. If a single write operation fails in one replica of a file out of three replicas then we can consider the value returned by the other replicas. Hence, we can protect our data.

- Replication also improves performance. Scaling of the distributed system can be carried out in numbers or in geographic area. If there is an increasing number of processes accessing the data from the same server, then it is better to replicate the server and divide the work load. In this way, performance can be improved.

- If we want to scale the system with respect to the size of a geographical area, then replication is also needed. If a copy of the data is placed in the proximity of the process which accesses it, then access time decreases. As a result, the performance as perceived by that process improves.

- Although reliability and performance can be achieved through replication, the multiple copies of the data need to be consistent. If one copy changes then these updates have to be carried out on all copies of data. This is the price we have to pay against the benefits of the replication.

5.1.2 Replication as Scaling Technique

- Replication and caching are mainly used to improve performance. These are also widely applied as scaling techniques. Scalability issues usually come into view in the form of performance problems. If copies of data are placed close to the processes accessing them, then definitely it improves access time and performance as well.

- To keep all replicas up to date, update propagation consumes more network bandwidth. Suppose process P accesses a local copy of the data n times per second, and the same copy gets updated m times per second. If n << m, the access-to-update ratio is very low, and many updated versions of the data copy will never be accessed by process P. In such a case, network communication to update these versions is useless. It requires another approach to update the local copy, or not keeping this data copy locally at all.

- Keeping multiple copies consistent is subject to serious scalability problems. If one process reads the data copy then it should always get an updated version of the data. Hence changes done in one copy should be immediately propagated to all copies (replicas) before performing any subsequent operation. Such tight consistency is offered by what is also called synchronous replication.

- Performing an update as a single atomic operation on all replicas is difficult as these replicas are widely distributed across a large scale network. All replicas should agree on when precisely an update is to be carried out locally. Replicas may need to agree on a global ordering of operations using Lamport timestamps, or this order may be assigned by a coordinator. Global synchronization requires a lot of communication time, particularly if replicas are spread across a wide-area network, and is therefore costly.

- The strict assumption about consistency discussed above can be relaxed based on the access and update patterns of the replicated data. It can be agreed that replicas may not be the same everywhere. This assumption also depends on the purpose for which data are used.

5.2 Data-Centric Consistency Models

- Processes on different machines perform operations on shared data. This shared data is available through use of distributed shared memory, a distributed shared database, or a distributed file system.

- Consider a data store which is physically distributed on many machines. Each process has its local copy of the entire data store. The write operation by a process changes the data and a read operation does not carry out any change in a data item. The write operation should be propagated to all other copies of the data store on different machines. A read operation by a process on a data item should always return the result of the last write operation on that data item.

- A consistency model is basically a contract between processes and the data store. If processes agree to follow certain rules, the data store assures to work correctly. As it is not possible to have a global clock, it is difficult to decide which is the most recent write operation. Hence, different models obey different rules.


Fig. 5.2.1 : Physically Distributed Data Store (processes P1 to P4, each operating on a local copy of the distributed data store)

5.2.1 Continuous Consistency


- It is true that there is no best solution to replicate the data on different machines in a network. Efficient solutions can be possible if we consider some relaxation in consistency. The tolerance while considering the relaxation also depends on the application. Applications can specify their tolerance of inconsistencies in three ways. These are :

o Variation in numerical values between replicas.
o Variation in staleness between replicas.
o Variation with respect to the ordering of update operations.

- If data have numerical semantics for the applications then they use numerical deviation for variation. For example, if replicated records contain stock market price data then the application may specify that two copies should not deviate more than Rs 10. This would be absolute numerical deviation. On the other hand, a relative numerical deviation of not more than some specific value between two copies could be specified. In both cases, if a stock goes up without violating the specified numerical deviations, replicas would still be considered to be mutually consistent.

- A staleness deviation takes into consideration the last updated time of a replica. Some applications can tolerate the data supplied by a replica being old data, provided it is not too aged. For example, weather reports stay acceptable for a few hours, as such parameter values do not change suddenly. In such cases, a main server may take the decision to propagate updates to the replicas only periodically.

- At last, there are ordering deviations. Some applications permit the ordering of updates to be different at the different replicas, provided the differences stay bounded. These updates can be viewed as being applied tentatively to a local copy. As a result, some updates may need to be rolled back and applied in a different order before becoming permanent, until global agreement from all replicas arrives. Naturally, ordering deviations are not as easy to grasp as the other two consistency metrics.



5.2.2 Consistent Ordering of Operations

In parallel and distributed computing, multiple processes share the resources. These processes access these resources
simultaneously. The semantics of concurrent access to these resources when resources are replicated led to use of
consistency model.

Sequential Consistency

The time axis is drawn horizontally and the interpretation of read and write operations by a process on a data item is given below. Initially each data item is considered as NIL.
o Wi(x)a : Process Pi performs a write operation on data item x with value a.
o Ri(x)b : Process Pi performs a read operation on data item x which returns value b.

- As shown in following Fig. 5.2.2 (a), process P1 performs a write operation on data item x and modifies its value to a. This write operation is performed by process P1 on the data store which is local to it. We have assumed that the data store is replicated on multiple machines. This operation is then propagated to the other copies of the data store. Process P2 later reads x from its local copy of the data store and sees value a. This is perfectly correct for a strictly consistent data store.

(a)  P1: W(x)a                     (b)  P1: W(x)a
     P2:        R(x)a                   P2:        R(x)NIL   R(x)a

Fig. 5.2.2

- As shown in Fig. 5.2.2 (b), process P2 later reads x from its local copy of the data store and sees value NIL. After some time, process P2 sees the value as a. This means it takes some time to propagate the update from process P1 to P2. It is acceptable if we are not considering the data store as strictly consistent.


- Sequential consistency model was first defined by Lamport. It is an important data-centric consistency model. The following condition should the data store satisfy to stay sequentially consistent :

The result of any execution is the same as if the read and write operations by all processes on the data store were executed in some sequential order, and the operations of each individual process appear in this sequence in the order specified by its program.

- Processes execute concurrently on different machines. Any valid interleaving of read and write operations is tolerable behavior, but all these processes must see the same interleaving of read and write operations. Time of operation is not considered here.

- Following Fig. 5.2.3 (a) and (b) shows sequentially consistent data stores. In Fig. 5.2.3 (a), process P1 first performs a write on x in its local copy of the data store. This sets the value of x as a. Later process P2 writes b to data item x in its local copy of the data store. Both processes P3 and P4 first read b and then a. The write operation of process P2 appears to have occurred before that of process P1. In Fig. 5.2.3 (b), both processes P3 and P4 first read a and then b. Both the data stores shown are sequentially consistent.

(a)  P1: W(x)a                      (b)  P1: W(x)a
     P2:      W(x)b                      P2:      W(x)b
     P3:            R(x)b  R(x)a         P3:            R(x)a  R(x)b
     P4:            R(x)b  R(x)a         P4:            R(x)a  R(x)b

Fig. 5.2.3
- Following Fig. 5.2.4 violates sequential consistency. Process P3 sees that the data item has first been changed to b and later to a. On the other hand, process P4 will conclude that the final value is b.

P1: W(x)a
P2:      W(x)b
P3:            R(x)b  R(x)a
P4:            R(x)a  R(x)b

Fig. 5.2.4

- As an example, consider three processes P1, P2 and P3 which are executing concurrently. Three integer type data items a, b, and c are stored in a shared sequentially consistent data store. Assignment is considered as a write operation and a read-statement reads two parameters simultaneously. All statements execute indivisibly. The processes are shown below.

Process P1        Process P2        Process P3
a = 1;            b = 1;            c = 1;
read(b, c);       read(a, c);       read(a, b);

Fig. 5.2.5

- The three processes total has 6 statements and hence, 720 (6!) possible execution sequences. If a sequence begins with statement a = 1 then there are in total 120 (5!) sequences. The sequences in which statement read(a, c) appears before statement b = 1 and the sequences in which statement read(a, b) appears before statement c = 1 are invalid. Hence, only 30 execution sequences which start with a = 1 are valid. Similarly 30 sequences starting with b = 1 and another 30 starting with c = 1 are valid. In total, 90 execution sequences are valid out of 720. These can produce different program results which are all acceptable under sequential consistency.
which are acceptableun
~

Causal Consistency 3 |
not
istinction between the events that are causally related and those that are
cy mo del makes d : .
d see y and then
isten
d or occurred by previous event y then every processfirst shoul
4,

Causal consist
d. If an y ev en t bi s cause n
causally relate
io :
e follows th e following condit
x. Causally consistent data stor
are not causally related
lly rela ted writes in the same order. Those write which
a
All processes. ne
see the sescaus
may see in different order on different machines.
| roces
writes) P
(concurrent

: ae | r TechKaemletgé
: Peblicattens
fas ; 8 ,

Scanned with CamScanner



- In Fig. 5.2.6, four processes P1, P2, P3 and P4 are shown. Process P1 writes on data item x and this value a is later read by process P2. After this, P2 writes b on data item x. This value b may be the result of a computation involving the value read by P2, which was a. Therefore W(x)a by P1 and W(x)b by P2 are causally related.

- On the other hand, W(x)c by P1 and W(x)b by P2 are concurrent writes, as they are computationally not related to each other. Processes P3 and P4 may see these writes in different order. It is shown in Fig. 5.2.6. This event sequence is permitted by a causally consistent data store but not permitted by a sequentially consistent data store.

P1: W(x)a              W(x)c
P2:      R(x)a  W(x)b
P3:             R(x)a  R(x)c  R(x)b
P4:             R(x)a  R(x)b  R(x)c

Fig. 5.2.6
- In Fig. 5.2.7 (a), W(x)a by process P1 and W(x)b by P2 are causally related writes. But processes P3 and P4 see these writes in different order. Hence, it is a violation of the causally-consistent data store. In Fig. 5.2.7 (b), W(x)a by process P1 and W(x)b by P2 are concurrent writes. This is due to the removal of the read operation. Hence, the figure shows a correct sequence of events in a causally consistent data store.

(a)  P1: W(x)a                      (b)  P1: W(x)a
     P2:      R(x)a  W(x)b               P2:      W(x)b
     P3:            R(x)b  R(x)a         P3:            R(x)b  R(x)a
     P4:            R(x)a  R(x)b         P4:            R(x)a  R(x)b

Fig. 5.2.7
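Causal relationships between writes are commonly tracked with vector clocks. The sketch below is an assumption of mine for illustration (the text does not prescribe an implementation); it shows how a replica can decide whether two writes must be applied in order or may be reordered:

# Each write carries a vector clock with one counter per process.
def happens_before(vc_a, vc_b):
    # True if write A causally precedes write B.
    return all(a <= b for a, b in zip(vc_a, vc_b)) and vc_a != vc_b

def concurrent(vc_a, vc_b):
    return not happens_before(vc_a, vc_b) and not happens_before(vc_b, vc_a)

w_a = [1, 0]   # W(x)a by P1
w_b = [1, 1]   # W(x)b by P2, issued after P2 read a
w_c = [2, 0]   # W(x)c by P1, issued independently

print(happens_before(w_a, w_b))   # True : every replica applies a before b
print(concurrent(w_b, w_c))       # True : replicas may order b and c freely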

FIFO Consistency

- FIFO consistency is the next step in relaxing the consistency, by dropping the requirement of seeing the causally related writes in the same order at all the processes. It is also called the PRAM consistency. It is easy to implement. It says that :

All the processes must see the writes done by a single process in the order in which they were issued. The writes from different processes, processes may see in a different order.

- In Fig. 5.2.8, a valid sequence of events for FIFO consistency is shown. Here, W(x)b and W(x)c are the writes from process P2. Processes P3 and P4 see these writes in the same order, i.e. first b and then c.

P1: W(x)a
P2:      R(x)a  W(x)b  W(x)c
P3:                   R(x)b  R(x)a  R(x)c
P4:                   R(x)a  R(x)b  R(x)c

Fig. 5.2.8
Py: R(x)a R(x)b R(x)c
Fig. 5.2.8

Wesenety
Scanned with CamScanner
Distributed Comput
Ng (MU-Sem 8-Com
Grouping Operations
- Consistency
defineg at th
e level
applications. The by Hy
qifferent a
Progra 0 ms Write operations does not match its granularity that is offered
ms r
mechanismsand
| tran acti unning Concurrently use shared data. For this purpose, synchro
nization 1
. ons are u
sed t )
and write Oper re program level read
ations are Brou » control the concurrency between programs. Therefo
ped togeth er and bracketed with Pair of operations ENTER_CS and EXIT_CS. | ||
~ Consider the dist
ributed data store
Ww hich we have assumed. If process has successfully executed ENTER_CS, it
guaran tees thatits loca] co
py of data
p ations ins
oper insi
Store is up to date. Now process can safely execute series of read and write \
i de CS on that stor
e and exit the CS byca
lling EXIT_CS
A series of read and wr}
write o i vans , . i!
mult Perations within program are performed on data. This.data are protected against Mt,
simultaneous accesses that would le
ad to seeing something different than the result of executing the series as a |
whole. It is needed t
© have precise semantics about the operations ENTER_CS and EXIT_CS. The synchronization
variables are usedfor this Purpose.

Release Consistency

4 In release consistency model, acquire operation is used to tell the data store that CS is about to enter and release wi
operation is used totell that, CS has just been exited. These operations are used to protect the shared data items. The
shared data that kept consistent are said to be protected. Release consistency guarantees that, when process Carry
out acquire, the store will make sure thatall the local copies of protected data are brought up to date to be consistent
with remote copies if must be.
|
of the store. Carrying out acquire
After release is done, the modified protected data is propagated to otherlocal copies |
Hi
does not g uarantee that locally done| updates will be propagated immediately to other copies. Carrying out release
5.2.9 showsvalid events for release consistency.
|
does not ens ure to fetch updates from other copies. Following Fig.
|

P,: Acg(L) Wix)a W (x)b_ Rel(L)


Rel(L) | ait
Acq(L) R(x)b
Py:
P,: i
tit}
ii
44444444 a

Ps: AM
Fig. 5.2.9

p. upd ande then carry outrelease.


ates shared data Items twice an
acq uir e,
After carrying out
1
e
ds data item x whichis b. This. value was updated by process P, at th
esses P, performs acq uire and rea
After this, Proc ' ed that Processes P, gets this updated value. On the other hand, process P, does
e. Hence, it is guarante
time of releas not get current value of x. There is no compulsion to get current value of x and
d hence It does
not carry out acqu! re an |
. . itted. 3 elAs
ning a is permitte d data store follows following rules !
ee
nsistent distribute
se co . i r ed data, all previous acquires carried out by process must... ae
j

: da
o Before carrying out fully
have completed; eestted to be carri ed out, all previous reads and writes carried out by process must have been
. . ; bh
e is permitte
: <| u
Oo Before ne
He
completec- , --bles are FIFO consistent.
|
| varia
ization
o Access es to synchron!
. m 3 Hee
| | TechKuowledge Hike |
Publicattons hie y
ita

Scanned with CamScanner


icationancFault Tolerance
TE

Entry Consistency

- The shared synchronization variables are used in this model. When a process enters the CS, it should acquire the related synchronization variables, and when it exits the CS, it should release these variables. The current owner of a synchronization variable is the process which has last acquired it.

- The owner may enter and exit CSs repeatedly without sending any messages on the network. A process not currently having a synchronization variable but needing to acquire it has to send a message to the current owner asking for ownership and the current values of the data coupled with that synchronization variable. The following rules are needed to follow :

1. An acquire access of a synchronization variable is not permitted to carry out with respect to a process until all updates to the guarded shared data have been carried out with respect to that process.
2. Before an exclusive mode access to a synchronization variable by a process is permitted to carry out with respect to that process, no other process may keep the synchronization variable, not even in nonexclusive mode.
3. After an exclusive mode access to a synchronization variable has been carried out, any other process's next nonexclusive mode access to that synchronization variable may not be carried out until it has carried out with respect to that variable's owner.

- The first condition says that at an acquire, all remote updates to the guarded data must be able to be seen. The second condition says that before modification to a shared data item, a process must enter a CS in exclusive mode to confirm that no other process is attempting to update the shared data simultaneously. The third condition says that if a process wants to enter a CS in nonexclusive mode, it must first check with the owner of the synchronization variable guarding the CS to obtain the latest copies of the guarded shared data.

- A valid event sequence for entry consistency is shown in Fig. 5.2.10. A lock is associated with each data item. Process P1 does acquire on x and updates it. Later it does acquire on y. Process P2 does acquire for data item x but not for y. Hence, process P2 will read value a for data item x but it may get NIL for data item y. Process P3 first does acquire on y, hence it reads value b after y is released by process P1. Correctly associating the data with synchronization variables is one of the programming problems with this model.

P1: Acq(Lx)  W(x)a  Acq(Ly)  W(y)b  Rel(Lx)  Rel(Ly)
P2:                          Acq(Lx)  R(x)a  R(y)NIL
P3:                                   Acq(Ly)  R(y)b

Fig. 5.2.10
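The ownership rule can be sketched as a lock object that ships its guarded data to whoever acquires it. The class and the names below are my own illustration under that assumption, not the book's protocol:

class SyncVariable:
    # A synchronization variable that carries its guarded data item.
    def __init__(self, value=None):
        self.owner = None
        self.value = value           # most recent copy of the guarded item

    def acquire(self, process, local_cache):
        if self.owner != process:
            # Fetch ownership plus the current value from the old owner.
            local_cache[id(self)] = self.value
            self.owner = process
        return local_cache[id(self)]

    def release(self, process, local_cache):
        self.value = local_cache[id(self)]   # publish updates with the lock

lx = SyncVariable()
cache_p1, cache_p2 = {}, {}
lx.acquire("P1", cache_p1)
cache_p1[id(lx)] = "a"               # P1 writes x inside its CS
lx.release("P1", cache_p1)
print(lx.acquire("P2", cache_p2))    # "a" : P2 sees it because it acquired Lx

A process that reads a data item without acquiring its variable, like P2 reading y in Fig. 5.2.10, has no such guarantee and may still see NIL.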

Consistency versus Coherence

- Consistency is concerned with the read and write operations by the processes that are performed on a set of data items. A consistency model explains what can be expected with respect to that set when numerous processes concurrently work on that data.

- Whereas, coherence models expect results for a single data item. In this case, the assumption is that a data item is replicated at several places; it is said to be coherent when the different copies stick to the policies as defined by its allied coherence model.



5.3 Client-Centric Consistency Models

- In section 5.2, we have considered simultaneous updates on distributed data stores. If operations are only reading the data and the data do not get updated simultaneously, then this situation can be easily resolved. Such data stores obey the eventual consistency model, a weaker form of consistency. Compared to data-centric consistency models, many inconsistencies can be hidden in a comparatively cheap way.

5.3.1 Eventual Consistency

- Practically, in many situations, concurrency exists in some limited form. If processes only read data and hardly some process performs updates on it, then it is not necessary to propagate updates to all processes immediately.

- In World Wide Web (WWW), web pages can be updated by the authority or owner of that page. Browsers and Web proxies are configured to hold an accessed page in a local cache and to return that page upon the next request, to get better efficiency. In some cases an old page is returned although the page has been updated, that is, updates do not take place at the requesting client immediately. Since only the owner updates a page, resolving write-write conflicts is easy. Eventual consistency states that updates will be propagated gradually, and the web caches still work in a correct manner, provided clients always access the same replica.

- But, problems occur when diverse replicas are accessed over a short period of time. Suppose a client is a mobile client performing operations on a copy of the database. The client disconnects from the current copy of the database and later connects to another replicated copy of the database. If updates are not propagated from the previous copy to the recently connected copy of the database then he/she may notice inconsistent behavior. Client-centric consistency models resolve such problems. This is the typical problem with eventual consistency.

- Particularly, client-centric consistency offers assurance for a single client about the consistency of accesses to a data store by that client only. No assurances are provided about simultaneous accesses by different clients.

- Consider a data store which is physically distributed on many machines. Each process has its local copy of the entire data store. It is assumed that network connectivity is unreliable. When a process accesses the data store, it carries out the operation on the local or nearest copy available. Updates are then eventually propagated to the other copies. The following four models are suggested for client-centric consistency :
o Monotonic Reads
o Monotonic Writes
o Read Your Writes
o Writes Follow Reads

- Following notations are used to describe the above models :
o xi[t] : the version of data item x at local copy Li at time t. This version is the result of a series of write operations at Li that occurred since initialization.
o WS(xi[t]) : the series of write operations on data item x carried out at local copy Li up to time t.
o WS(xi[t1]; xj[t2]) : it denotes that the operations in WS(xi[t1]) have also been carried out later at local copy Lj by time t2.


5.3.2 Monotonic Reads

- Monotonic-read consistency is guaranteed by the data store if the following condition holds :

If a process reads the value of a data item x, any successive read operation on x by that process will always return that same value or a more recent value.

- This ensures that in a later read operation, the process will always get (read) the latest or same value (which was returned in the last read operation) and not an older one. Consider the example of a distributed email database. In this case, mailboxes of users are distributed and replicated on multiple machines. The received mail at any location is propagated in a lazy manner (as per demand) to the other copies of the mailbox. Only that data gets forwarded which is needed to maintain consistency. Suppose a user reads the mailbox in Mumbai and later flies to Delhi. The monotonic-read consistency guarantees that the messages in the mailbox that were read at Mumbai will also be in the mailbox at Delhi.

- Fig. 5.3.1 (a) shows a monotonic-read consistent data store. L1 and L2 are local copies at different machines. Process P carries out a read (R(x1)) operation on x at L1. The value returned to the process P is the result of the write operations WS(x1) carried out at L1. Later, process P carries out a read operation (R(x2)) on x at L2. Before P performs the read operation on L2, the writes at L1 are propagated to L2, shown in the diagram by WS(x1; x2). Hence, WS(x1) is part of WS(x2). This shows that monotonic-read consistency is guaranteed.

(a)  L1: WS(x1)       R(x1)          (b)  L1: WS(x1)  R(x1)
     L2: WS(x1; x2)         R(x2)         L2: WS(x2)         R(x2)

Fig. 5.3.1

- Fig. 5.3.1 (b) shows a data store which does not offer monotonic-read consistency. After P carries out the read (R(x1)) operation on x at L1, process P carries out the read operation (R(x2)) on x at L2. But only the WS(x2) write operations are performed at L2. Here, the result of the writes from L1 is not propagated to copy L2. This shows that monotonic-read consistency is not guaranteed in this case.
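The guarantee can be phrased as a containment check on write sets. A toy sketch (assumptions and names are mine) that mirrors the two cases of Fig. 5.3.1:

# Each local copy keeps the set of write operations it has applied.
def monotonic_read_ok(previous_ws, current_ws):
    # Every write the client already observed at the previous copy must
    # also have been applied at the copy serving the next read.
    return set(previous_ws).issubset(current_ws)

L1 = {"w1", "w2"}               # WS(x1) : writes applied at L1
L2_good = {"w1", "w2", "w3"}    # WS(x1; x2) : L1's writes were propagated
L2_bad = {"w3"}                 # WS(x2) : only local writes

print(monotonic_read_ok(L1, L2_good))   # True  : like Fig. 5.3.1 (a)
print(monotonic_read_ok(L1, L2_bad))    # False : like Fig. 5.3.1 (b)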

5.3.3 Monotonic Writes

- In this model, it is important to propagate write operations in the correct order to all copies. Monotonic-write consistency is guaranteed by the data store if the following condition holds :

A write operation by a process on a data item x is completed before any successive write operation on x by the same process.

(a)                          (b)

Fig. 5.3.2

5.3.4 Read Your Writes

- The Read Your Writes model is closely related to the monotonic reads model. Read-your-writes consistency is guaranteed by the data store if the following condition holds :

The result of a write operation by a process on data item x will always be seen by a successive read operation on x by the same process.

- This ensures that a write operation is always completed before a successive read operation by the same process, regardless of where that read operation occurs.

- Consider the example of changing the password of a digital library account on the web. It takes some time for the change to come into effect. As a result, the library may be inaccessible to the user. A reason for the delay is the use of a separate server to manage passwords, and it may take some time to subsequently propagate encrypted passwords to the various servers that form the library.

- In Fig. 5.3.3 (a), process P carries out a write (W(x1)) operation on x at L1. Later, process P carries out a read operation (R(x2)) on x at L2. Read-your-writes consistency guarantees that the succeeding read operation R(x2) at local copy L2 observes the effects of the write operation W(x1) carried out at L1. This is indicated by WS(x1; x2), which states that W(x1) is part of WS(x2).

- Fig. 5.3.3 (b) shows that the effect of the write operation (W(x1)) on x at L1 has not been propagated to local copy L2. This is shown by the write set WS(x2).

(a)  L1: W(x1)                       (b)  L1: W(x1)
     L2: WS(x1; x2)  R(x2)                L2: WS(x2)  R(x2)

Fig. 5.3.3

5.3.5 Writes Follow Reads

- In this model, updates of the last read operation get propagated. Writes-follow-reads consistency is guaranteed by the data store if the following condition holds :

A write operation by a process on a data item x following a previous read operation on x by the same process is guaranteed to take place on the same or a more recent value of x that was read.

- It states that a process will always perform a successive write operation on an up to date copy of the data item, with the value most recently read by that process. Consider the example of posting articles and their reactions in a network news group. A user first reads the article X and then posts its reaction Y. As per writes-follow-reads consistency, reaction Y will be written to any copy of the newsgroup only after article X has been written as well.

- In Fig. 5.3.4 (a), process P carries out a read (R(x1)) operation on x at L1. The value of x just now read at L1 is the result of the writes (WS(x1)) on x at L1. These writes also appear in the write set at L2, where the same process later carries out its write operation.

- In Fig. 5.3.4 (b), it is not guaranteed that the operation carried out at L2 is carried out on a copy that is consistent with the copy just read at L1.

(a)  L1: WS(x1)  R(x1)               (b)  L1: WS(x1)  R(x1)
     L2: WS(x1; x2)  W(x2)                L2: WS(x2)  W(x2)

Fig. 5.3.4
5.4 Replica Management

- Replica management is an important issue in distributed system. The key issues are to decide where to place replicas, by whom, and the time at which placement should be carried out.

- Another issue is the selection of approaches to keep the replicas consistent. The placement problem also includes placing the servers and placing content. Placing servers involves finding the best location, and placing content involves finding which is the best server for placement of content.

5.4.1 Replica-Server Placement

For optimized placement, the k best locations out of n (k < n) need to be selected. Following are some of the approaches to find the best locations.

- Consider the distance between clients and locations, measured in terms of latency or bandwidth. In this solution, one server at a time is selected such that the average distance between that server and its clients is minimal, given that k servers have already been placed (n-k locations are left).

- In the second approach, instead of considering the position of clients, take the topology of the internet formed by autonomous systems (ASs) in which all the nodes run the same routing protocols. Initially take the largest AS and place a server on the router having the largest number of network interfaces. Repeat the algorithm with the second largest AS and so on. Both these algorithms are expensive in terms of computation; they take O(N²) time where N is the number of locations to be checked.

- In the third approach, a region to place the replicas is identified quickly, in less time. This identified region comprises a collection of nodes accessing the same content, but for which the internode latency is low. The identified region contains more nodes compared to other regions, and one of the nodes is selected to play the role of replica server.
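The first approach is essentially a greedy selection. A rough Python sketch of that idea (the function name, the distance callback, and the data layout are my own assumptions):

# Greedily pick k of n candidate sites; at each step take the site that
# minimizes the total client-to-nearest-server distance (e.g. latency).
def greedy_placement(candidates, clients, distance, k):
    chosen = []
    for _ in range(k):
        best_site, best_cost = None, float("inf")
        for site in candidates:
            if site in chosen:
                continue
            trial = chosen + [site]
            # Each client is served by its closest already-chosen server.
            cost = sum(min(distance(c, s) for s in trial) for c in clients)
            if cost < best_cost:
                best_site, best_cost = site, cost
        chosen.append(best_site)
    return chosen

Each round scans all remaining candidates against all clients, which is what makes the method computationally expensive for large N, as noted above.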

5.4.2 Content Replication and Placement

- Replicas are classified in the following three types :
1. Permanent replicas
2. Server-initiated replicas
3. Client-initiated replicas

1. Permanent Replicas

- Initially, the distributed data store contains these replicas, that is, the initial set of replicas that represents the distributed data store. These replicas are few in number in many cases. For example, in the distribution of a web site, the files that form the site are distributed across a few servers in a single location. An incoming request for a file gets forwarded to one of the servers, possibly in round robin manner.

- In the other type of distribution, mirroring is used, in which case a web site is copied to a few servers called mirror sites. These servers are geographically spread across the internet. The client's request gets forwarded to one of the mirror sites; the set of replicas is more or less statically configured. As another example, a database can be distributed and replicated to a number of servers that collectively form a cluster of servers.

2. Server-Initiated Replicas

- These replicas are created by the owner of the data store in order to enhance the performance. Suppose a sudden burst of requests arrives at the server from an unexpected location which is far from the server, and the incoming requests continue over a couple of days. In this case, it may be useful to place temporary replicas in the regions from where the requests are coming.

- Web hosting services offer a collection of servers placed across the internet. These servers store and offer access to web files which belong to third parties. These hosting services can dynamically replicate files to the server where they are needed in order to improve performance. If servers are already placed, then content placement is easier than placement of servers.

- The algorithm supports web pages, and it is considered that updates are comparatively less frequent than read requests. Two issues are considered by the algorithm. First, it reduces the load on a server. Second, the needed files can be migrated from the current server to servers in the proximity of clients that issue many requests for those files.

- In this algorithm, each server keeps track of a counter of the number of accesses to a file and the locations of the incoming requests. For a particular client, the web hosting service can determine the closest server to that client. If clients C1 and C2 are closest to the same server P, then all access counts for file F from C1 and C2 registered at server S are maintained together as a single count of accesses from P.

- A deletion threshold del(S, F) is maintained. It says that if the count of incoming requests for file F at server S drops below this threshold, then the file should be removed from server S. Removing the file again reduces the number of replicas; this can again increase the load on the other servers. In order to have at least one copy of the same file survive, some special actions are carried out.
If file F is 2"
a
maintained. It says th at , if
from C; is al sO from server Ss.
at server 52 (d el l5, F)) ld beremoved
fo r file F af ° e r v e r s
d the n fi le sh ou
se the load on
t h r e s h o l d t h is t hreshol o f re p licas. This can again increa
ps belo w
agai n redu
ces number
rver s dro wer in thi s manner of the same file.
file F at se in order to su rv ive at leas t one copy
from >
e d out Ser Tech
he file e carrie
t jons ar
Publications

Removing t c i a l a c
e spe
0, som
other se rve! 5. 5

Scanned with CamScanner



- In the same way, file F is replicated if the incoming requests for it exceed the replication threshold rep(S, F). If the count of requests for file F is in between these two thresholds, then the file is permitted to migrate to a server in the proximity of the clients requesting it.

- Every server again assesses the placement of the files which it stores. In case the count of access requests drops below the threshold del(S, F) at server S, it deletes F, provided it is not the last copy. Certainly, replication is carried out only if the total number of access requests for F at S exceeds the replication threshold rep(S, F).
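The two thresholds define a simple per-file decision rule. A hypothetical sketch (the names del_threshold and rep_threshold stand for del(S, F) and rep(S, F); this is my illustration, not code from the text):

def placement_action(access_count, del_threshold, rep_threshold, is_last_copy):
    assert del_threshold < rep_threshold
    if access_count < del_threshold and not is_last_copy:
        return "delete"      # too few requests : drop this replica
    if access_count > rep_threshold:
        return "replicate"   # hot file : copy it closer to the clients
    if access_count >= del_threshold:
        return "migrate"     # in between : may move toward the clients
    return "keep"            # a last copy is never deleted

print(placement_action(5, 10, 100, is_last_copy=False))    # delete
print(placement_action(500, 10, 100, is_last_copy=False))  # replicate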

3. Client-Initiated Replicas

- These replicas are initiated by the client and are also called client caches. Caches are always said to be managed by clients. In many cases, the cached data could be inconsistent with the data store. Client caches improve the access time to data as accesses are carried out from the local cache. This local cache could be maintained on the client's machine or on a separate machine in the same LAN as the client. A cache hit occurs if data is accessed from the cache.

- A cache holds the data for a short period. This is either to prevent extremely stale data from being used, or to make room for other requested data from the data store. Caches can be shared between many clients to improve cache hits. The assumption is that data requested by one client may be useful to another nearby client. This assumption may be correct only for some types of data stores.

- A cache is usually placed either on the same machine as its client or on a machine shared by clients on the same LAN. In some other cases, system administrators may place a shared cache between a number of departments, organizations, or for a whole country. Another approach is to place servers as caches at some specific locations in a wide area network. The client locates the nearest server and requests it to keep copies of the data the client was previously accessing from somewhere else.

5.4.3 Content Distribution

Management of the replicas also involves propagation of updated contents to the appropriate servers.

State versus Operations

Following are the different possibilities for propagation of updates :
o Only a notification of an update is propagated.
o Data is transferred from one copy to another copy.
o The update operation is propagated to other copies.

Propagation of notification of update

- In an invalidation protocol, if updates occur in one copy then the other copies are informed about these updates. This informs the other copies that the data they hold are no longer valid. It may specify which part of the data store has been updated, so that only the changed part of a copy is actually invalidated.

- In this case, only a notification is propagated. Before carrying out an operation on an invalidated copy, it needs to be updated first, depending on the particular consistency model that is to be supported. The network bandwidth needed to propagate notifications is small. The invalidation protocol works best in situations where many update operations are carried out compared to read operations (the read-to-write ratio is small).
are Carried out compared to read operations (read-to-write ratio is small). x

Ey TechKnewicte® ce
Publications ~ia

Scanned with CamScanner


- As an example, consider the scenario where updates are propagated by transferring the modified data to all other copies. Suppose frequent updates are carried out compared to read operations. Consider two consecutive updates: since no read operation has been performed between the two updates, the second update will simply overwrite the first update in all copies. In this case sending only a notification will be more beneficial.
Propagation of updates by transferring the data

- This approach is more suited when the read-to-write ratio is relatively high. In this case, the chances are higher that the modified data will be read before the next update takes place. Hence, in such a situation transferring the modified data to all other copies is necessary.

- Overhead can also be saved by sending a single message for multiple modifications. In this case, these modifications are packed into a single message.
packed in to single message.

Propagation of update operations


.
- This appro
ppro ach iinforms to each copy about which update operationit should carry out. It sends only parameter values
ach
i
whiich will be required to carry out update operation at target copies.If size of parameters to be sentis relatively small
then propagating update operations requires less bandwidth.
if operations to be carried out
~ {tis also called as active replication. At each replica more processing poweris required
are more complex.

Pull versus Push Protocols


replicas
updates are propagated by servers themselves to the
- Push Protocols are server-based protocols in which
s
approaches usually used between permanentreplica
although they do not request for these updates. Push-based
caches. Push Protocols are generally used
but can also be used to send updates to client
server-initiate d replicas,
ncy isn ecessary.
where maintainin g high degree of consiste
between manycl ients.
manent replica and server-oriented replicas are often shared
~ The largely shared replicas, per tion on these replicas. In this case, read-to-update ratio at each replica is
out read opera
Clients frequently
carry client asks for recently
ration by cli ent s exp ect ed to get updated copies.If
e, each read ope
comparatively high. Henc
ately makesit available.
-based protocol immedi
updated copy then push . : .
s Ww ks to get upd ate sas per request|
per request issued. In this case, eit her server or client
pu ll -b ased protocol orks Bet UP
~ On the other hand, serv
n
er to obtain na any updates done recently there. These
protocols are called as client based
other
sends request to an
| in case of web cac hes whenever requests arrives for the data,. they of
ten check with original
protocol. For example, nce they were cached
cached data Is up to date or not si
m whether
web server to confir nsferred to cache and then returned to the
web serv er then it gets tra
original
modifie d at
~ If cached data are ocol is effici ent when update operations are relati
vely high compared to read
l-b ase d pr ot
A pul
q sting & client.
reque gh). This | s
when there are non shared client caches. In this case there is single
update ra ti on hi
operations (read-to- dvantage of a pull-
many th is pr ot oc ol Is ef ficient. The main disa
are share d among clients th en also
client. If caches time in case of cache miss.
ush based approach Is IncreaseIn response
compared toap
based approach
TechKnowledge
BT Publications

Scanned with CamScanner


_ Pe

Distributed Computing (MU-Sem 8-Comp.) 5-16 Consistency, Replication and Fault Tolerancg _

- Consider the situation with a single (non-distributed) server and many clients :
o In this case, a push-based protocol has to keep track of many client caches. This puts load on the server. The server has to be a stateful server, although that makes it less fault tolerant. In the pull-based approach, the server does not require to keep track of the many client caches.
o In the push-based approach, the server sends the messages to clients. These are update, or fetch-update-later, messages. In the pull-based approach, the client sends the messages to the server. The client has to poll the server to check if any updates occurred, and if so, fetch the updates.
o When a server sends modified data to the client caches, the response time at the client side is zero. When the server pushes invalidations, the response time is the same as in the pull-based approach, and is decided by the time involved in accessing the modified data from the server.

- As a balance between push and pull-based approaches, a hybrid form of update propagation based on leases is introduced. A lease is a time interval in which the server pushes updates to clients. In other words, it is a promise by the server that it will push updates to the client for a specific time. After expiry of the lease, the client is forced to poll the server for updates and pull in the modified data if required. In another approach, a client requests a new lease to push updates when the earlier lease expires.
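A toy illustration of the lease idea (my own sketch, not a real protocol): while the lease is live the server pushes, afterwards the client must poll and pull:

import time

class Lease:
    # A promise by the server to push updates until the expiry time.
    def __init__(self, duration_s):
        self.expiry = time.time() + duration_s

    def live(self):
        return time.time() < self.expiry

def propagate_update(lease, push, poll_and_pull):
    if lease.live():
        push()            # server-driven : push the update to the client
    else:
        poll_and_pull()   # lease expired : the client must ask for updates

lease = Lease(30.0)
propagate_update(lease, lambda: print("pushed"), lambda: print("pulled"))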

Unicasting versus Multicasting

- The use of unicasting or multicasting depends on whether updates need to be pushed or pulled. If a server is a part of the data store and pushes updates to n other servers, then with unicasting it sends n separate messages, while with multicasting the underlying network takes care of delivering to the n servers.

- Suppose all replicas are placed in a LAN; then hardware broadcasting is available. In that case broadcasting or multicasting is cheaper, and unicasting the update is expensive and less efficient. Multicasting can be combined with the push-based approach, where a single server sends a message to the multicast group of the other n servers.

5.5 Fault Tolerance

In distributed system, partial failure may occur. In a partial failure, failure of one component of the system may affect the operation of another component, but yet some other components may remain unaffected. The distributed system should be designed to recover from such partial failures automatically, without affecting the performance.

5.5.1 Basic Concepts

- A distributed system should tolerate faults. A fault tolerant system is strongly related to a dependable system. Dependability covers the following requirements for distributed system, along with other important requirements.
o Availability : It refers to the probability that the system is performing its operations correctly. If the system is working at any given instant of time, then it is a highly available system.
o Reliability : It refers to the working of the system continuously without failure or interruption during a relatively long period of time. Although a system may be continuously available, we cannot say that it is reliable.
o Safety : Safety refers to the situation that when a system temporarily fails to function correctly, nothing catastrophic takes place.
o Maintainability : It refers to how easily a failed system can be repaired. A system should be easily maintainable, which will then show a high degree of availability.

maintainable which will then show high degree ofavailability.


ee 8 i
-
Sores
Bs ee Pg tye
wearer or
panacea aeeat

ey Pusiicatien®
lechhawioe
=

i Sette
a

fee Gita
y

Scanned with CamScanner


co i
-Sem 8-Comp.) 5-17 Fault Tolerance
x= nsistency, Replication and
pependable system should also offer hi all the
¢ a a e of secu rity . The syst em is said to fail ifit is not offering
services for whichit was designed. An e that may cause a failure. The cause of an
error
uld prov; sor 'S a part of a system's stat
is called a fault. System sho Provide its services although faults exist in the system
There are three types of faults. These are transient, intermittent,
. t faults occur once and then
-
eat ed and permanent. Transien
vanish. If the operation is rep , the fault goes away. An intermittent fault occurs, the
n disappears of its own
accord, then
| again appears , and So on. A permanentfault
e continues tolive until the faulty part is replaced.

5.5.2 Failure Models


- In distributed system, communication channels, processes, and machines can fail during the course of execution. The failure model suggests the ways in which failures of different components of the system may occur, and provides an understanding of these failures.

- Suppose in a distributed system, servers communicate with each other and with their clients. This system not offering services satisfactorily indicates that servers, communication channels, or possibly both, are not carrying out the work they are supposed to do.

- In this example, a crash failure occurs when a server halts, but it was working correctly until it halted. If a server fails to give a reply to the incoming requests from other servers or clients, then it is considered to be an omission failure. If the server fails to send a message, then it is a send omission failure, and if it fails to receive an incoming message, then it is a receive omission failure. If the server's response is not within the specified time interval, then it is a timing failure.

- If the server gives an incorrect response, then it is a response failure. A value failure occurs when the value of the response is wrong. If the server deviates from the correct flow of control, then it is a state transition failure. Suppose a server produces arbitrary responses at arbitrary times; then it is referred to as arbitrary failure.

Omission Failures

- The omission failure refers to the faults that occur due to failure of a process or communication channel. Because of this failure, the process or communication channel fails to carry out the action or task that it is supposed to do.

Process Omission Failures

- The main omission failure of a process is when it crashes and never executes its further action or program step. In this case, the process completely stops. When a process does not respond to a requesting process repeatedly, the requesting process detects it with the use of timeouts. In an asynchronous system, a timeout can indicate only that the process is not responding.

- The detection of a crash in the above manner relies on timeouts: the process may have crashed, may be executing slowly, or the messages may have not yet been delivered to the system. If another process surely detects the crash of a process, then this process crash is called fail-stop. This fail-stop behaviour can be obtained in a synchronous system when processes use timeouts to know when another process fails to respond and delivery of messages is guaranteed.

Communication Omission Failures

- A sending process P executes the send primitive and puts a message in its outgoing buffer. The communication channel transports this message to Q's incoming buffer. Process Q then executes the receive primitive to take this message from its incoming buffer and delivers the message.
vin vers the message.
this message tO os Ae
.
er. -
from its incoming bur# . .
ver TechKnewledge
Pubticatiors
'
ee |
ae

Scanned with CamScanner


w Distributed Computing (MU-Sem B-Comp.) | 5-18 Consistency, Replication and Fault Tolerance

- If the communication channel is not able to transport a message from process P's outgoing buffer to process Q's incoming buffer, then it produces an omission failure. The message may be dropped due to non-availability of buffer space at the receiving side, no buffer space at an intermediate machine, or a network error detected by checksum calculation over the data in the message.

- Send-omission failures refer to loss of messages between the sending process and its outgoing buffer. Receive-omission failures refer to loss of messages between the incoming buffer and the receiving process.
buffer. Receive- Omission
failures refer to loss of messages between incoming buffer and the receiving process.

Arbitrary Failures

- In this type of failure, a process or communication channel behaves arbitrarily. In this failure, the responding process may return wrong values, or it may set a wrong value in a data item. The process may arbitrarily omit an intended processing step or carry out an unintended processing step.

- Arbitrary failures also occur with respect to communication channels. The examples of these failures are : message content may change, repeated delivery of the same messages, or non-existent messages may be delivered. These failures occur rarely and can be recognized by communication software.

Timing Failures

- In a synchronous distributed system, limits are set on process execution time, message delivery time and clock drift rate. Hence, timing failures are applicable to this system.

- In a timing failure, a clock failure affects a process as its local clock may drift from perfect time or may exceed the bound on its drift rate.

- A performance failure affects a process if it exceeds the defined bounds on the interval between two steps.

- A performance failure also affects communication channels if transmission of a message takes longer than the defined bound.

Masking Failures

4 Distributed system is collection of many components and components are constructed from collection of other

components. Reliable services can be constructed from the components which exhibit failures.
4 For example, suppose data is replicated on several servers. In this case, if one serverfails then other servers
would
provide the service. Service maskfailure either by hiding it or by transformingit in more acceptable type offailure.

5.5.3 Failure Masking by Redundancy

The use of redundancy is main technique for masking the faults. Redundancy is categorized as information
redundancy, time redundancy, and physical redundancy. With information redundancy extra bits are added for
recovery. For example, hamming code addedat senderside in transmitted data for recovery from noise. ©

4 Time redundancy is particularly useful in case of transient or intermittent faults. In this, action is performed again and
again if needed with no harm. For example, aborted transaction can be redone with no harm.

4 -
With physical redundancy, to tolerate the loss or malfunctioning .
of some components either extra hardware © r
software components are addedin system. . a

Publicatie® te tne
=.

Scanned with CamScanner


vette

a
NT be
g (MU-Sem 8-Comp) leranc e
pistributed Computin ion on and Fault To
8catati

te aon
a
Consistency, Replic

=
-.*<
3 3 a9
ence
Process Resiili

ag a ee
_
5.6

-
7
?
In distributed system, , failure of proce sses may happen. In this case, fault tolerance 's achieved
| by replicating
eto proups.

ies
==
56.1 Design Issues

Ei
s.
all the group member
gets delivered to
n group, then messagesent to this group
this proc ess9s job.
if identi cal proce sses are organ izedi

Every process in group receives this message.If one of the membersfails th e other can take over
oups- During
gro age d dyna : lly. It allows creating new groups and destr oying old gr
mica
These ese p proc ess grou ps may be man
ltiple groups
be a mem ber of mu
or leave it. A process can
process can join a group
the system operation, a nt and group membership.
msare required for group manageme
simultaneously. As a result, mechanis age to
cess can send a mess
sses to deal with sets of proces ses as a single abstraction. Hence a pro
Groups permit proce from
r locations, which may va ry

8
ers or number o f servers an d thei

Sm
with out knowin g to whic h serv
a group of servers
one call to the next. . The
decisions are made collectively
In flat grou p, all the processesare at equal level and all
~ Twotypes of groupsexist. ma king is complex as all
are
ion is t hat the re is no sing le point offailure. But decision
advantage of this organizat
ad.
ing incu r delay and overhe
involved in this process. Vot group, a request
s aS COO rdinator andall
the others are workers. In this
e pro ces s act
Insimple hierarchical gro
up, on ropriate
-
over to coordinator an d
then it decides which workeris app
nt is firs t get s han ded
from worker or e xternal clie or crashes the
lure is the problem. If coo rdinat
it the re. In this group sing le point of fai
forwards
to carry it out, and
whole group will affect.
| |
Group Membership join or leave the group.
also allows the process to
and destroy the groups. It
responsible to cre
ate of
Agroup server is and their exact membership. In this, single point
- da ta ba se of all the gr oups
intains a comp
le te
The group se ver ma gementwill affect.
ou p se rv er crashes then group mana
m. If gr
failure is agal n the proble hip in distributed manner. if reliable multicasting
is existing, an external
members just
Another approach is t
o manage group. To leave a group, a member
- mbers declaring its wish to join the
a mess@ ge to all gro
uP me
process can send ry other member
.
odbye me ss ag e to eve up. After leaving gro
up, process
sends a go e all the mes sages sent to that gro
receiv m it
ins the er
oup,it should
ers should not receive messages fro
~ As soon as process jo any other group memb
e from that group and i and group can no longer
should not re ceive any messaB e situation where many
i supposefails
machine
| to deal with th
protoco
There should be some
|
function atall.
n
licatlo
5.6.2 Failure Masking and Rep formed by
us to mas k one or more faulty proc essesin that group. A group can be ed by this
rmit s le es s ca n be replac
| rocesses 4Peorga in a group. Hence, single vulnerab proc
~ Group of identi cal p nizing then :
an be carried out ither with primar
y-based protocols, or through replicated-
icating the p rocesses
e
repl can
Rep lication
e fault.
group to tolerate th
write protocols. TechKnowledge
Publtcatians

Scanned with CamScanner


Distributed Computing (MU-Sem 8-Comp.) Consistency, Replication and Fault Tolerance

| 4 Primary-basedreplication forfault tolerance usually emerges in the form of a primary-backup protocol.


In this case, a
process groupis organized in a hierarchical manner in which a primary coordinates all write operations. Primary is
fixed but its role can be taken over by backups wheneverneeded.In a situation of crash of primary, the backups run
some election algorithm to elect a new primary.

4 .Replicated-write protocols are used in the form of active replication, in addition to using quorum-based protocols.
These solutions considerthe flat group. The main advantageis that such groups have no single pointoffailure, at the
cost of distributed coordination. The advantage of this organization is that there is no single point of failure. But
distributed coordination incurs delay and overhead. .

4 Consider the replicated write systems. Amountof replication required is an important issue to consider. A k fault
tolerant system survivesfaults in k components and still meets its specifications. If processes fail slowly, then having k
+ 1 of themis sufficient to give k fault tolerance.If k of them just stops, then the reply from the other one can beused.
In case of Byzantine failures, processes continuing to run whensick and sending out erroneous or random replies, a
minimum of 2k +1 processors are required to attain k fault tolerance.

9.6.3 Agreementin Faulty Systems

4 Fault tolerance can be achieved byreplicating processes in groups. In many cases,it is required that, process groups
reaches some agreement. For example; electing a coordinator, deciding whether or not to commit a transaction,

dividing up tasks among workers etc. Reaching such agreement is uncomplicated, provided communication and
processesareall perfect. Otherwise problem mayarise.

4 The generalaim ofdistributed agreementalgorithmsis to haveall the correctly working processes reach an agreement
on someissue, and to build that consensus within a finite numberof steps. Following cases should be considered.

Q Synchronousversus asynchronous systems: A system is synchronousif and only if the processes are known to4
operate in a lock-step mode.

o Communication delay is bounded or not: Bounded delay means every messageis delivered with a globally and
predetermined maximum time.

Messagedelivery is ordered or not: Messages from same senderis delivered in the same orderin which they were
sent.

o Message transmission through Unicasting or Multicasting.

Byzantine Agreement Problem

4 In this case, we assume that processes are synchronous, messages are unicast while preserving ordering, and-

communication delay is bounded. Consider n number of processes. In this example, let n = 4 and k = 1 where kis
numberoffaulty processes. The goalis that, each process should build the vector of length n.

Step 1 : Each non-faulty process i send v; to other process. Reliable multicasting is used to send the v,. Consider
Fig. 5.6.1. In this case, process 1 send 1, process 2 send 2 and process 3 lie to everyone, sending x, y, and z
respectively. Process 4 send 4. Process3 is faulty process and process 1,2 and 4 are non faulty processes.

Scanned with CamScanner


a
|
3 _ pistributed Computing (MU-Sem _
Tolerance Mt
Consistency, Replication and Fault
|

il
nt
il
K,
f
Faulty process HI
Fig. 5.6.1 Hdl
torsis shown below. iy
form ofvec
- Stepp2: Theres ults collected by each process from otherprocessesin the | i
-o Process1:(1, 2, x, 4) iy
14)
Oo ; ess2: (1, 2,
Proc y, 4) in|
o Process3: (1, 2, 3, 4)
Oo Process 4:(1, 2, z, 4) (
other .
that each process receives from
ts vector to ever other process. The vectors
Step 3 : Now, every proces s pass esi
valuesa to |. th ese are new 12 i
process, it lies and sends new
ow. Since process i s faulty
each process are shown bel
iy
|;
9 values.
Process 4
Process 1 Process 2 hi
| !

(1, 2, xX, 4) ||
(1, 2, Y, 4) |
(1, 2, X, 4)

(a, b, c, d) (e, f, B, h) (1, 2, y, 4) iiy


(4224) (bed 11)
2,24) a majority, . in
vectors. If any value has
ch of the newly received
ement of ea } it
es s in sp ec ts t he i= el rrespon ding element of
the result vector is marked
each proc , th e co ik
- Step 4: Now, has a ma jo ri ty
If no va lue
he values for V1, v,, and v,,
whichis the correct result.
Result ve ctor.
that value is put in me nt ont
ree
4 all come to ag
ses 1, 2, and
UNKNOWN.Proces e agreement is
OWN,4) vant. The goal of Byzantin
Result = (1,2, UNKN is als o irr ele
nnot be decided, but
aboutV 3 ca
t : hat | _faulty processes only.
~- These proc esses conclude fort h e non
nthe value 5.6.2.
process. it is shownin Fig.
° ensusIS reached o
that cons esse s a nd one faulty
o non-faulty proc
en only
tw
ae, ifn = 3 and k=1 th

TechKnow
Ey Publ tedge
ications

Scanned with CamScanner


a
ee ee ee weiner ees

Distributed Computin (MU-Sem 8-Comp.) 5-22 Consistency, Replication and Fault Tolerance
re

4 The results collected by each process from other processes in the


form of vectors are shown below.
© Process 1: (1,2, x)
O Process 2: (1, 2, y)
© Process 3: (1, 2, 3)

4 Following vectors are received by processes. There


is no majority for element 1, element 2 and 3
marked as UNKNOWNand hence
also.All will be
, algorithm is failed.

Process 1 Process 2

(1, 2, y) (1, 2, x)
(a, b, c) (d, e,f)
4 In system,if k faulty processes are there then agreement can be achieved if 2k+1 correctly functioning (non-faulty)
processes present for a total of 3k+1.

9.6.4 Failure Detection

4 For proper masking of failures, it is necessary and requirement to detect them. It is needed to
detect faulty memberby
non-faulty processes. The failure of member detection is main focus here. Two approaches are
available for detection
of process failure. Either process asks by sending messages to other processes to know
whether others are active or
not or inactively wait until messages arrive from different processes.

4 The latter approach useful only when it can be certain that there is sufficient communication between
processes.In
fact, actively pinging processes is generally followed. The suggested timeout approach to know whether
a process has
failed or not, suffers with two major problems. As a first problem, declaring process9s failure as it does
not return
response to ping messages is wrong as this may due to unreliable networks. In other argument,
there is no practical
implementation to correctly determine failure of process due to timeouts.

4 Gossiping between processes can also be used to detect faulty process. In gossiping, processtells other process about
its active state. In other solution, neighbors probe each otherfor their active state instead of relying on single node
to
take decision.It is also important to distinguish the failure due to process and underlying network.

4 This problem can be solved by taking feedback from other nodesor processes.In fact, when observing a timeout on a
ping message, a node requests other neighbors to know whetherthey can reach the presumed failing node.if a node
is still active, that information can be forwardedto otherinterested nodes.
ee

9./ Reliable Client-Server Communication oe


Snee ee

4 While considering the faulty processes, fault tolerance in distributed system should also consider the failures of the
communication channels. Communication failures are also equally important in fault tolerance. Communication
channels may also suffer through crash, omission, timing, and arbitrary failures.

4 Reliable communication channels are required for exchanging the messages. Practically, while designing the reliable
communication channels, main focus is given on masking crash and omission failures. Arbitrary failure may occur in the

form of duplicate messages due to retransmission byoriginal sender.

444

Scanned with CamScanner


pistributed Computin
g (MU-Sem g-C
Co
omp.) .
5-2
Consistency, Replicatio n and Fault
Tolerance
io :
57.1 Point-to-Point Communi Cat n

TcP is reliable transport


| Protocol whi ch .
7 TCP
point-to-point communication in distributed system.
su Oo
js by using a cknowiea,, reliable
handles lost of messages ;
e geme
e ntand retransmissi ar on. Hence, this omissionfailure !s masked by
. such failures
TCP. Client cannotnotice

_ if TCP connection is sudd enlybroke a


no more messages can be transm
itted
ion channe l, tr,E n it Is crash failure. In this failure,
through the communicat
This fai i new connection by distributed
ilure is only masked through establishing
.
system.

5.7.2 RPC Semanticsin the Presenceof Failures

teatinn n is
In Remote procedurecalllls (RPC), communicatio eeshed betweenthe client and server. In RPC, a client side
i establi
-
is to hide
p rocess Calls a procedure implemented on a remote machine (server). The primary goal of RPC

a
dure calls.
i g remote procedure so that remote procedurecalls looks like just local proce
whi callin
communication while
not easy. In RPC system,
s between remote andlocal procedure calls is
lf any errors occur, then masking difference
following failure may occur.
The client cannot locate the server.
9

nt to the serveris lost.


The request message from the clie
98

receiving a request.
The server crashesafter
O80

lost.
the server to theclientis
The reply message from
oOo

er sending a request.
The client crashes aft

cate the Server


The Client Cannot Lo stub is old. Meanwhile server
In other case, version of client
n that all servers 4 re down.
Inthis case, it may ha
p pe present. Hence, binder will
~ server side new version of stub is
and compi led. At
ta lled
e rfaces are ins
evolves and new int l | be reported.
th se rv er and failure wi ures
be unable to bind
wi uage supports for writing proced
n ca n be ra ise d upon errors. Many lang
exceptio ackis that, many languages
is problem, an use d for this purpose. Again drawb
~ Asasolution to th l handler ca n also be
.
errors. Sif na arency by using exceptions or signal handlers
to ac hieve transp

SS
that are invok ed upon It is no t po ss ib le
r signals.
sup port 5 e x c eptions o
do not
t is carried
Lost Request Messages retransmission of reques .

starts tim er while sending request. The


os orclient stub es. Server will not be able to differenti
ate between
r

problem, | r expir
before time
=

- To deal with this re pl y do es not arrive arise.


Oo r this case, no problems
a g e was truly lost. In }
out if acknowledgement © m e s s
. Then
i . n

client will falsely conclude that server is down


ost. In this case,
.

and retransmissiO
original ver handle it.
7 7

ssage Was notlost then let the ser


If request me
2s ey ae

rve 9=.

.
t S¬nd s request to server
e when clien
a etot) Set atm

ient .
quest a n d reply is sentto cl
processed the re
servel-
1. Arequest
arrives at
: WF lechtaomioda
ubllca
tionrs

ues .

Scanned with CamScanner


5-24 Consistency, Replication and Fault Tolerance
Distributed Computing (MU-Sem 8-Comp ) oe
t. Server crashes before it can send the request..
2. A request arrives at server. Server has processed the reques
essi ng the request.
3. Arequest arrives at server. Server is crashed before proc
In (3)
4
In case (2) and (3) different solutions are required. In (2), system has to report failure to the client.
retransmission of request is required. Client ret ransmits when its timer expires.
At least one semantics technique can be used. It
In (3), Client waittill server reboots. Client may rebind to new server.
RPC has been carried outat lea st one time, but possibly more.
means, try until reply arrives from server. It means,
RPC has
In at-most-once semantics technique,failure is im mediately gets reported to the client. It guarantees that the
been carried out at most one time, but possibly no ne at all. In other technique, client gets no help and no guarantee.
RPC has beencarried out from 0 to any number of times. It is easy to implement.

Lost Reply Messages

In this case client retransmits request message when reply doesnotarrive before timer expires.It concludes that
request is lost and then retransmits it. But, in this case client actually remains unaware about what is happened
actually. Idempotent operations can be repeated. These operations produce same result although repeated any
numberof times. Requesting first 512 bytes offile is idempotent.It will only overwrite at client side.

Some request messages are non-idempotent. Suppose, request to serveris to transfer amount from one account to
other account. In this case, each retransmission of the request will be processed by server. Transfer of amount from
one account to other account will be carried out many times. The solution to this problem is, try to configureall
requests in idempotentway.This is not possible as many requests are inherently non-idempotent.

In another solution, client assigns sequence number to each request. Server can then keep track on most recent
request and it can differentiate between original (first) and retransmitted requests. Now servercan reject to process
any request a second time. Client may put additional bit in first message header so that server will take it for
processing. This is required as retransmitted message requires morecare to handle.

Client Crashes

In this case, client sends request to server and before reply arrives it crashes. The computation of this request is now
going on at server. Such computation for which no one is waiting is treated as an orphan. This computation at server
side (orphan) wastes CPU cycles,locks files, hold resources.

Supposeclient reboots and sends same request. If reply from orphan arrives immediately then there will be confusion |
state. This orphan needssolution in RPC operation. Four possible solutions are suggested as below.
1. Orphan Extermination : Client maintains log of message on hard disk mentioning whatit is about to do. Aftera
reboot,log is verified by client and the orphan is explicitly killed off. This solution needs to write every RPC record
on disk. In addition, these orphans may also do RPC, hence, creating grandorphans or further descendents that
are difficult or practically not possible to locate.
2. . Reincarnation: In this solution, no needto write log on disk. Timeis divided in sequentially numbered epochs.
Client broadcasts a messagetoall machines telling the start of a new epoch after its reboot. Receivers of this
broadcast messagethenkills all remote computations going on the behalf of this client. This solution also suffers
through network delay and some orphan maystay alive. Fortunately, however, when they report back, their
_ Teplies will include an out of date epoch numberso thattheycan easily noticeit. ,

Scanned with CamScanner


.
=:
- 8ag

Nea

<4. Se
pistributed Computing (MU-eM
2 8-Com Consistency, Replication and Fault Tole
rance
BA 225 .
fie= 3, Gentle Reincarnation - Wh
7 en an epoch broadcast arrives, each machine checks about any remote computations
: .
runninglocally,9 if runni ng, tries owners cannot be
its best to locate their owners. Computations are killed in case
located.

4, Expiration In this solution, RPCis < fini shes in this


Pposed to finish the job in specified time quantum T. If not .
"
time, another tim ¬ quantumis r 4
a nuisa nce. In contrast, if after a crash the client wal tsatimeT
ted which is
prior to rebooting, all orphans a ques in
re certa to be gone

5,8 Reliable Group Communication


TT

5.8.1 Basic Reliable-Multicasting Schemes


,
- R Reliable multi cast services promise that messages are delivered to all members in a process group. Most of the
en two processes. They seldom offer reliable
transport layers only offer point-to-point connection betwe
reliable point-to-point connection with
communication to a collection of processes. So each process has to set up
lt.If
ngreliability through point-to-point connectionis difficu
other process. If number of processesis large then achievi
ereliability through point-to-point connection.
number of processesis small thenit is simple to achiev
age sent to particular
ing message for that group. In other word, mess
- All the processes in group should receive incom
some situations that need to be
all m embers of that group. There are
process group should be delivered to the
her this newly joined process should
ed. Thes e are: if proc ess joins the group during communica tion, whet
consider
communication.
a sending process crashes during
receive the message or not and if
and reliable
te between reliable communication in the presence of faulty processes
- It is necessary to d 8fferentia communication in the presence of faulty
pro ces ses functi ons correctly. In case of reliable
communication when faulty processes in group receives message.
d to be reliable if it is guaranteedthat, all non-
!s sai
processes, multicasting message can be delivered,
arriv e eem ent on what the grou p actually looks like before a
y
It is necessar to at an agr

ng constraints.
besides various orderi o is a member of the grou
p and who is not.
y ag re em en texis ts on wh
e case when alread
- It is simple to handle th the group during
fail, and processes do not join or leave
pro cesses do not
.
Especially, if it is as su me d that,
each current group
s that ev ery message should be delivered to
hen reliable multicasting simply mean
communication t
8 group
member. In9 simple case,
me ssages by all group members in same order.
of re ceiving the
is requirem
ent
an dit is easy to implementi
f numberofreceivers are small.
Sometimes, it ssag es i any ord er
!1
n receive me from
members ca system offers only unreliable multicasting. In this case, some messages
8 ication
Consider underlying commun In this case, a sequence numberis
~ livered to all the receivers (group members).
. .
be lost or may not be de ages are rece ived in the order they are sent. Sender
sender may it multicasts- Supp ose mess
for a
messages till acknowledgment (ACK) for that messageis received.It is easy
Oo each messaB&
assigned by sender t
dy sen
ore alrea For missed message by receiver, sender receives negative
ACK (NACK)
maintains buffer to st ga me ssage-
it is missin
tion, sender can retransmits message whenit has notreceivedall
receiver to detect e. In other solu
mits the messaB sages.
Sender then retrans g gy backin g can be used to minimize number of mes
e. iP
thin a certain tim
acknowledgments wi

Publications
a

Scanned with CamScanner


Consistency, Replicatio n and Fault Tolerance
Distributed Computing (MU-Sem 8-Comp.) 5-26 4_4_
44__

9.8.2 Scalability in Reliable Multicasting

s. Senderhas to receive n ACKs from n


Abovereliable multi h ot support larg e numbers of receiver
Mao SERENE S200 EF be swampe d with such feedback messages
receivers. If large numbers of receivers are there t hen the sender may
network.
called as feedback implosion. Also, the receivers are s pread across a wide-area
| ieved by
444444E

' ling |can be ach


send er. Hence, better sca
Instead of sending ACKs by receivers, only NACKs can be sent to
n. In this case also
reducing numberof feedback messages. !n this case, there will be less chances of feedback implosio
whichis not delivered. Sender has to remove
sender has to keep messagesin history buffer to retransmit the message
messages from history buffer after some time has elapsed in order to prevent It from overflowing. This may cause
problem if NACK arrives and same messageis already removed from the buffer.

Nonhierarchical Feedback Control

It is necessary to reduce numberof feedback messages while offering the scalable solution to reliable multicasting. A
more accepted model used in case of several wide-area applications is feedback suppression. This approach underlies
the Scalable Reliable Multicasting (SRM) protocol. In SRM, receiver only sends NACK for missed message. For
successfully delivered messages, receiver does not report acknowledgements (ACKs) to sender. Only NACKs are
returned as feedback. Whenevera receiver notices that it missed a message, it multicasts its feedback to the rest of
the group.

In this way, all the members of the group know that message m is missed by this receiver. If k number of receivers has
already missed this message m, then each of k members has to send NACKto sender so that m can be retransmitted.
On the other hand, suppose retransmissions are always multicast to the entire group, only a single request for
retransmission can be sent to sender.

For this reason, a receiver R within group who has missed the message m sends the request for retransmission after
some random timehas elapsed. If, in the meantime, another request for retransmission for m reaches R, R will hold
back its own feedback (NACK), knowing that m will be retransmitted soon. In this manner, only a single feedback
message (NACK) will reach sender S, which in turn next retransmits m.

_ Feedback suppressionhas been usedas the fundamental approach for a numberofcollaborative Internet applications.
Although, this mechanism has shownto scale reasonably well, it also introduces a numberof serious problems.It
requires accurate scheduling of feedback messagesat each receiver so that single request for retransmission will be
returned to the sender. It may happen that, manyreceivers will still return their feedback simultaneously. Setting
timers for that reason in a group of processesthatis spread across a wide-area networkis difficult.

Other problem in this mechanismis that, a retransmitted messagealso gets delivered to the group members who have
already successfully received this message. Unnecessarily, these processes have to again process this message. A
solution to this problem is to let these processes which have missed the message m, join another multicast groupto
receive message m. This solution requires efficient group managementwhich is practically not possible in wide area
network, A better approachis therefore to let receivers that tend to miss the same messages team up andsharethe -
same multicast channel for feedback messages and retransmissions.

aE

Tech
Publicatiars

Scanned with CamScanner


WyZ Ze distributed Computing (My -Sem 8-Comp.)
e
Consistency, Replication and Fault Toleranc

Hierarchical Feedback Control

For large groups Ps of rece


recei
ivers, hiera
i rchical feedb
dback
ack contr
con ol is bette r approach.
h

Root

S - Sender
C - Coordinator
(b)
(a)
Fig. 5.8.1

of receivers. The group of receivers is ©


- Consider only a single sender multicast messages to a very large group
then organized into a tree. The root of the tree is formed by
partitioned into a number of subgroups. These subgroups
riate for small groups can be used
the subgroup in which sende r is present. Any reliable multicasting scheme approp
oup which has
5.8.1 (a). Each node in the tree represents subgr
within any subgroup. The tree is shown in Fig.
(b).
organization as shown in Fig. 5.8.1
eceivers) of the
handles retransmission requests of the members(r
- Each subgroup appoints local coordinator C which
. If the coordinator itself has missed a
. The lo ca l coordi nat or also main tains its own history buffer
same subgroup ransmit message m. In a scheme base
up to ret
d on ACKs,
coordinat or of the parent subgro
message m,it requests the message. If a coordinator has received ACKs for
an ACK t o its parent if it has received the
a local coordinator sends as wel l as from its children, it can rem
ove m from its history buffer.
s in its su b group,
er
message m from all memb var to build tree dynamically.
diff iculty in this approach. In many cases, it is necessary
e tr is ma in
The construction of th
ee approachis then
~ tree in th e
underlying network,if there ts one. In principle, the
lt icast
use th e mu
One technique is to
in the manner just
r so that it can proceedas local coordinator
er i n th e networklaye
to improve each multicast rout to carry out. As a result,
a tio ns f° existing computer networks are difficult
adapta
described. Practically, such ular.
ons are pop
g soluti
tioon-
icaati
applic eveel mu tricastin oct
n-llev able to a large numberofreceivers spread across a wide-area
h em es th at are scal
sc oblems.
2 : Mm ultica es
on is available, and each solution introduc new pr
~ Finally, designing reliable gle best soluti
sin
problem. No
network is a complex
ic Multicast
4 : «66.8.3 Atomic in the presence of process failures. In particular, distributed system should
ing ses ions.
[ticast : e r all proces or to none at all. It is often needed in manysituat
n - =

8
weconsi 8d
derer reli able MU
-« delivered to eith
Now ed in the same orderto all processes called as
are deliver
vent that that all mess98
es
guarantee that 4 .
Also,it is also necessary Fé . |
~~. the atomic multicast P roblem
Publicatiears

Scanned with CamScanner


lerance ol
on an d Fault To
-~
8 ee . ~ 4 - 7 7 -_
Consistency, Replicati eet!
e distributed
- .
to p of
.
a dis tri buted system. Suppos ssages can
obcated etal as an application on to which me
= Consicer i re
sed databa constructed
se
s construct ion of process groups
~ facilities. Hence, it per mit re plica of the
system oners reliable multicasting Th ere is on e process for each
c ess group.
ated database also forms pro
be refiably sent. The replic ns
processes belong to one gro
up. te operatio
database. All these t loc all y. As eries of upda
ried ou s lost for
ic a st to all repl ica and operations are car n t he n this updatei
ions are mu lt
one U pdate oper
atio
The upd ate operat the replica crashes during
ba se . If on e o f
out on data r replicas.
is to be c arried carried out at the othe has missed a
Wh er ea s, sa me U pdat e is cor rec tly
en it re co vers. Yet, it
this re plica. the crash wh
t up to d ate wi
th the
e sa me st ate it had before is br ou gh
r to th it
plica can recove it is necessary
that
ions 2 re to be
The crashed re e ca r ri ed ou t aft er its crash. Now, ic h or de r t hese operat
dates which ar missed operation
s, and in wh
numb er of up i ng precisely the
ires know
other re pl
icas. This requ nt to all
da te op er at ion that was se
the up update is
carried out e d system, then none at all. The
or te d by underlying distribut y re pl ic as or by
lticasting is
su pp
er car ried out ata
ll nonfault of the group. Th
e
if atomic mu th em crashed is eith ic a no lon ger is member
st before on e of
the crashe d re pl
replic as ju nfaulty replicas
have agreed that ash.
e al l no recover s from cr
carried out if th n the gr ou p once More after it again. In order
pl ic a is no w forced to joi
is re gi st er ed as being a member |
crashed re onlyif it omic
recovered replica s. As a result, at
tion s wi ll be forwarded tot his th th e res t of the group member
r a wi en
forces settlement wh
e to date
U p d a t e op ica9s sta te needs up
the gr ou p, recovered repl ve aC on si st en t vi ew of the database, and
to join eser
nfaulty processes pr
assures that no
multicasting oup.
and rejoins the gr
a replica recovers

chrony are sent and receive


d. On
Virtual Syn er within which messages
com mun icatio n lay
distributed system has the application which runs
in
We assume that each ica tio n layer and then delivered to
ffer ed in com mun
are first bu der had when message m
was
each node, messages cess es in the group which sen
ew is vie w on the set of pro
higher laye r. Group vi m sent by sender. All sh ould
agree
up sho uld hav e sam e view a bout delivery of message
multicast. Every proces s in gro
up view a nd to no other member.
vered to each member in gro
_ that, message should be deli
process joins the
.G and while multicast is going on, new
mess a ge m is multiicast by sender with group view
i i
4 Supp ose
group view changes due to change in grou
p membership. This change in
group or leaves the group. Because of this
g group ) is multicast to all the members of the
group. Now we have
group membership message C (joining or leavin to all
nee ded about either m is delivered
Iticast messages
two multi i transiit: m and c. In this case, guarantee is
in
all.
m is not delivered atall
group view G before each one of them delivered messagec, or
processes in group
ivery of m is permitted to fail. In-
Aeee
.
4 When gro up member
er alll
shi
the e e ee
ig:
ee s of crashing of the senderof m, then del
Ji case, eith e
none.This guaranteesthat, a snesog
tiscr to ero uni sie w Se delieered tor e S eu know abort of new memberor
suli
n e
ty proc ess in G. If sender cra shes during multicast the
ain in < nom aur ast with this property .a to
may be delivered to all rem e
g processes or ignored by each of them.Reliabl multic
be virtually synchronous.
4 Consider processes P,, P,, Pz, and P in Afte have been |
G. Gro up vie w G = (P,, P2, P3, P,). siti casting mmes sag es
_ grou p view eeedan -
multicast, P, crashes.
ne G=( P,, P,, Pad. Before crash, P, succ 3
o
but notto P,. Virtual oh deli vere d at all. This indi eens
will not be icates that, mess age
sen t before cras h nxt that, its messages eeds bet wee n rema ini r remo vi ng P; |
had never been i ° *s Com mun ica tio n then
ided its
proc
stat e has bee n brought a pe es afte
ey
vers, , it can join the group prov
__ from group. After P, reco p to date. |

ee

Tech
Publications

Scanned with CamScanner


5-28 Consistency, Replication and Fault Tolerance
¥ Distributed Computing (MU-Sem 8-Comp.)
f a distributed system. Suppose distributed
Consider a replicated database constructed as an application on top ° 5 to which messagescan
system offers reliable multicasting facilities. Hence, it permits construction ol process p of th e
process group. There is i one process fo r each replica
be reliably sent. The replicated database also forms =)
.
database.All these processes belong to one group.
outlocally. A series of update operations
i
The update operations are multicast to all replica
p and operati ons are carrieddate operation thenthis updateis : lost for
during one up
is to be carried out on database.If one of the replica crashes
licas. - |
this replica. Whereas, same updateis correctly carried out at the otherrep
before the crash when it recovers. Yet, it has missed a
The crashed replica can recover to the samestate it had
is necessary thatit is brought up to date with the
number of updates which are carried out after its crash. Now, it
and in which order these operations are to be
other replicas. This requires knowing precisely the missed operations,
carried out |
then the update operation that wassent to all
If atomic multicasting is supported by underlying distributed system,
or by none at all. The updateis
replicas just before one of them crashedis either carried out at all nonfaulty replicas
memberof the group. The
carried outif the all nonfaulty replicas have agreed that the crashed replica no longer is
crashed replica is now forced to join the group once Moreafterit recovers from crash.
ain. In order
Updateoperationswill beforwarded to this recovered replica onlyif it is registered as being a memberag
atomic
to join the group, recovered replica9s state needs up to date with the rest of the group members. As a result,
multicasting assures that nonfaulty processes preserve a consistent view of the database, and forces settlement when
a replica recovers and rejoins the group.

Virtual Synchrony

Weassumethateach distributed system has communication layer within which messages are sent and received. On
each node, messagesare first buffered in communication layer and then delivered to the application which runsin
higher layer. Group view is view on the set of processes in the group which sender had when message m was
multicast. Every process in group should have same view about delivery of message m sent by sender. All should agree
- that, message should be delivered to each memberin group view and to no other member.

Suppose message m is multicast by sender with group view G and while multicast is going on, new processjoins the
group or leaves the group. Because of this group view changes due to change in group membership. This change in
group membership message c (joining or leaving group) is multicast to all the members of the group. Now wehave
two multicast messages in transit: m and c. In this case, guarantee is needed abouteither m is delivered to all
processes in group view G beforeeach one of them delivered messagec, or m is not deliveredat all.

When group membership changeis the result of crashing of the sender of m, then delivery of m is permitted tofail. In
this case, either all the members of the G should know abort of new memberor none. This guaranteesthat, a message
multicast to group view is delivered to each nonfaulty process in G. If sender crashes during multicast then message
maybedelivered to all remaining processes or ignored by each of them. Reliable multicast with this property is said to
be virtually synchronous. |
Consider processes P,, P,, P;, and P, in group view G. Group view G = (P,, P,, P3, P,). After some messages have been
multicast, P, crashes. So group view G = (P,, P,, P,). Before crash, P, succeeded in multicasting messages to P, and P,
but notto P,. Virtual synchrony guaranteesthat, its messages will not be delivered at all. This indicates that, message
had neverbeensent before crash of P,. Communication then proceeds between remaining members after removingP3
from group.After P; recovers, it can join the group providedits state has been broughtup to date.

Scanned with CamScanner


Fault Toleranc®
.
ee
es

mp.)
4

ing (MU-Sem 8-Co


_4

pistributed Comput
4

: yessage Ordering
h following four orderings
The multicasts are classified wit
asts
1. Unordered multic

2. FIFO-ordered multicasts
3. Causally-ordered multicasts
are not
d multicasts 9 rante es
4. Totally-ordere In thi s multicast, 8U
chronous mul ticast . ple in
le is a virtually syn er the exam
un orde re d mu lticast which is reliab
ffer en t pr oc es se$. Consid
An by di the
messages are delivered tion blocks
red re ga rd in g, the order in which received mit ive . Th e rec cive ope ra
assu receive pri
rary with send and 3
le mu lt ic as ui ng is offered by a lib
which reliab
to ft. at each
il a message 6 delivered of events
calling proc ess unt P,. Following are the ord eri ng
processes P,, Py. and
gro up wit h thr ee communicating
Consider
process. Process P,
Process P, Process P,
4444
receives M,
recenves M#,
sends Mm,
receives M, receives M,
sends m, ring multicast. In
me grou p vi e w does not change du
ers. Assu yer at P,
tog roup memb mmunication la
M e s s res (71, and m, d th en m, - In contrast, co
icasts first m, an
- Process P, mult yer at P, recenves delivered in the orde
r they
ppose communication la aint, Mes sages May be
this situation su #,- As there 6 no or
dering constr
and then
ives first M,
suppose rece vered by the
Me ss ap es fr om same process are deli
were received. incoming
ltica sts, the n following four
ab le ri ro -ordered mu Co ns id e r communication betwee
case of re li nt .
ave been se me Way, My will
- Inthe second
sa me order as they h li ve re d befo re m,. In the sa
n layer in th e always de
communicatio Messabe m, will be grou p. If communic
ation layer at particular
g. 35 shown below, es se s in th e
O rderin all proc
processes. In FIFO is rule
is followed by
ived and d eliv
ered m,. On the other hand,
fore IT) s- Th r m, till it h as rece
vered be t delive , ed.
be always deli should no 4_ have receiv
d th en Mx " layer in the order they
st an n
process receives M2 fir
icatio
livered by commun
processes are de
ed from different
messages receiv
receives M, sends m,
cends M receives #;
sendsm, 4
receives M, receives M3
sends #,
receives Ms receives M,

receives M, receives M,
ation layer b
ial ly cau sal ly re la te d mess ages are delivered by communic
7otent receiver side,
usally precedes message m, then at
es. If message M1 ca
In reliable causally or
dered mull from sameprocess or from different
~ ese m essages can be
betwee m nt and then M- Th
considering causality
iver #
communication layer Wil del
Processes-
WFlechKacutetgs

Scanned with CamScanner


Distributed Computing (MU-Sem 8-Comp.)
444 5-30 44 Consistency, Replication and Fault Tolerance
4 SSS

Total-order multicast imposes additional constraint on order of delivery of messages. It says that, message delivery
may be unordered, FIFO or causally ordered but should be delivered to all the processes in group in same order,
Virtually synchronous reliable multicasting which offers totally ordered delivery messages called as atomic
multicasting.

9.9 Recovery a . =
5.9.1. Introduction

It is necessary to recover from thefailure of processes. Process should always recover to the correct state. In error
recovery,it is necessary to replace erroneous state to error-free state. In backward recovery, system is brought from
current erroneous state to previous state. For this purpose, a system state is recorded and restored after some
interval. Whencurrentstate of the system is recorded, a checkpointis said to be
made.
In forward recovery, an attempt is made to bring system from current erroneous state to new current state
from
whichit can carry onits execution. Forward recovery is possible if type current error occulted
is known.
In distributed system, backward recovery techniques are widely applied as mechanism to recover from
failures.It is
generally applicable to all systems and processes and can be integrated in middleware as general-pu
rposeservice. The
disadvantageofthis technique is that,it degrades performanceas it is costly to restore
previous state.
As backward recovery is general technique applicable to all systems and applications, guarante
e cannot be given about
occurrenceoffailure after recovery from the same one. Hence, applications
support is needed for recovery which
cannotgive full-fledged failure transparency. Moreover, rolling back to state
such as moneyis already transferred to
other accountis practically impossible.

As.checkpointing is costlier due to restoring to previous state, many


distributed system implements it with message
logging. In this case, after checkpoint has been taken, process
maintains logs of messages prior to sending them off.
This is called sender-based logging. In other scheme, receive
r-based logging can be carried out where receiving
processfirst log an incoming message and then deliverit to the
applicationit is currently executing.
Checkpointing restores processes to previousstate. From
this state, it may behave differently than before failure
had
occurred. In this case, messages can be delivered in differe
nt order leading to unexpected reactions by receivers
Message logging now helpsto deliver messagesin the expected
order as actual replaying of the events since last taken
checkpoint.

5.9.2 Stable Storage

4
The information needed to recover to the previous state neededto
be stored Safely so thatit Survives process crashes
site failures and also storage mediafailures..In distributed system, stable
storageis importantfor recovery |
The storage is of three types: RAM whichis volatile, disk which survives
CPU failure but can belostin case offailure of
disk heads. Finally, stable storage which is designed to survivefrom all types
of failures except major calamities.
- Stable storage is implemented with pair of ordinary disks. Each block on second
drive is an exact copy of the
corresponding block onfirst drive. Update ofblock first takes placein first
drive and once updated,it is verified. Then
same block on second drive is done.

RE Publications
TechKaswiedgé
-

Scanned with CamScanner


olerance
and Fault T
Replication
: pistrib Consistency,
updated.
a
= nd drive is
a
wees
covery i . two co
this situation, re k for block. When e. Now
In
by co mp ar in g both drives bloc t drive to another driv
° carrie d ou t co pi ed to firs
Dice then block
on oot4 it is then
as correct one and
rive is considered
poth the drive will be identical. c overed by
th is ca se, block is re
pue dust particles and general
wear andtear,a valid block may give checksum error.In a
ions that need
(second) disk. Stable storage is most benefici al for applicat
he corresponding block on other
using t
high degree of fault tolerance.

5.9.3 Checkpointing rd
needed to reco
stora ge. It also
required sav e sy st em state regularly on stable e receipt
to s has recorded th
in backward erro r reco very , it is
ted snap shot ,if one pr oces
distributed snapshot.In distribu
-_

consistent global state called as the messaBe-


re sho uld be oth er proc ess who has recorded the sending of
of message then the e. A
to its local stable storag
eac h proc ess save s its state on regular basis
r d error recovery sch
emes, failure. A recovery
line
In backwa om process or system
fro m the se local statesis build to recover fr overy line.
consistent global sta te buted sna pshotis rec
The most recent distri

pe eed
collection of che ckpoints.
signifies th e most recent consistent
| 8J 4

| K [
Pry Fl
8, \ Failure

{
4

Y
| p 7S
2 ; Inconsistent
;
Recovery line collection of
Initial state Checkpoint checkpoints

very Line
Fig. 5.9.1:A Reco

Independent Checkpointing d way then it is not easy to find


recovery
fr om time to time in an uncoordinate
st at e e local
lf each process reco
rds its local ocess is rolle its mo st re ce nt ly recordedstate. If thes
pr d back to
=
ch
very line,
t. This process of
ea
find out this reco then fu therrolling back is necessary to carry ou
line. In order to snap sh ot
distrib uted
geth er do not form
states to ino effect. .
bac k m a y lead to dom |
cascaded roll - |
|
Py m2
\: \ Failure

! | Yo
7\
P2
initial state Checkpoint

effect
Fig. 5-9-2 : The domino
recorded checkpoint. As a result
iis required to restore its state to the most recently
s P2,
After crash of proces
be rolled back.
s P, wi ll al so needs to
Proces Hh

pene:

Scanned with CamScanner


omp
Comput
Distrbued = Distributed
8-C ting .)_
C ing (MU-Sem _5-
(MU-Sem 5-32 ______
8-Comp.) 32_ CON
Consistency E SIS
, Replication S IEN
and Fault Tolerance

st!sti [| not bok ot. Again, iit


4=
is Talat
i of m, but se nder of the same messageIs
The restored state by P, showsreceipt banpe
record snapshted
ed in stets iatina

ich P P,
is require P, to roll back to previous state. However, the next state to which . hed back
of message m,although; P, has recorded receiptof it. As a result, P, again needsto
= 8 7 e ro =

oo with other processis


This process of rolling back without coordi ting ,
nating called in dependent checkpointin
inting g The other
-
inti j i
Solution is to globally coordinate checkpointing which requires global synchron ization. This may
scalesate rto uce introd be
performance problems. Another drawback of independent checkpointing is that, each loca
cleaned up timeto time.

Coordinated Checkpointing

In coordinated checkpointing, all processes synchronize to write their state to local stable storage in cooperative
manner. As a result of which, the saved state is automatically globally consistent. In this way, domino effect is avoided
here.
|
A two-phase blocking protocol is used to coordinate checkpointing. A coordinator
first multicast
CHECKPOINT_REQUEST to all the processes. Receiving processof this messa
ge then takes local checkpoint, queues any
successive message handedtoit by the application it is runnin
g, and acknowledges to the coordinator that it is has
taken a checkpoint.

When coordinator receives acknowledgement from


all the processes, it multicast CHECKPOINT_DONE
message to
permit blocked processes to carry on their work.
As incoming messages are not registered as part
of checkpoint, this
approach guarantees giobally consistent state.
All the outgoing messages are queued
locally till receipt of
CHECKPOINT_DONE message.

This approach can be improved if checkpoi


nt request is multicast to those Processe
s only that depend on the recovery
of the coordinator. A processis dependenton the coordinatorif it
has received a messagethat is dire
causally related to a message that ctly or indirectly
the coordinator had sent since the
last checkpoint. This leads to the
incremental snapshot. conceptof an

The existing message-logg


ing approaches can be
easily distingy;
Processes. An orphan proces
sis a pro cess that survives the cra
.
with the crash ed process after its recovery. sh of anoth er Proces
ut whosestateis inconsistent
44 ihe

<or TechKuewledge
Peblications

Scanned with CamScanner


es _Pistributed Computing (MU-Sem 8-C
A omp
?
5-33 Consistency, Replication and FaultTolerance
p Q crashes and recovers ,
1

m Mp is never replayed
| so neither will m,
4
mii m3
4

ee + Unlogged message Time :


"Logged message |
g. 5.9.3
Fig. 5.9.3 : Incorrect replay of messages after
recovery, leading to an orphan process
Consider three processes P yv P, and P; as shownin Fig. 5.9.3. Process P, receives messages m, and
m, from P, and P,;
respectively. Process P, then sends messages m; to P;. Message m, was logged but m, was not logged. After crash
process P, recovers again. In this case, for recovery of P, only logged
message m, will be replayed. As m, was no
logged, its transmission will not be replayed and hence, transmission of m, may nottake place.

However, the state after the recovery of P, is inconsistent with that before its recovery. Especially, P,; keeps a message
m, which was sent before the crash, but whosereceipt and delivery do not happen when replaying what had taken

place before the crash. Suchinconsistencies should clearly be avoided

99.5 Recovery-Oriented Computing

The other way of carrying out recovery is basically to start over again. It may be much cheaper to optimize for
recovery, then it is aiming for systemsthat are free from failures for a long time. This approachis called as recovery-
oriented computing. One approach is simply reboot the system.
For clever reboot only a part of the system, it is essential to localize the fault correctly. Just then, rebooting simply
involves deleting all instances of the identified components together with the threads operating on them, and (often)

to just restart the associated requests. Practically, rebooting as a recovery technique requires few or no dependency
between system components.
In other approach of recovery-oriented computing, checkpointing and recovery techniquesare applied, but execution
are allocated more buffer space, changing
is carried out in a changed environment. The basic idea is that, if programs
be avoided..
Order of message delivery etc the n many failures can

Review Questions

of roplication,
Write the advantages
p What is replication?
as scaling technique.
= 82 Explain replication
y? Explain.
2.3 What is continuous consistenc
y model with oxamplo,
=a" Explain sequential consistanc
le.
y model wilh oxamp
E Qs Explain causal consistanc
.no
CRNcareer
l wiihiat
rrexdocmaae
. a6 4 Explain FIFO consistoncy
4

Scanned with CamScanner


Distributed Computing (MU-Sem 8-Comp.) 5-34 Consistency, Replication and Fault Tolerance
ees

Q.7 Explain release consistency model with example.

Q.8 Explain entry consistency model with sigiiigto.

Q.9 Whatis basic difference between consistency and coherence?

Q.10 What do you meanbyeventual consistency? f

Q.11 What do you mean by eventual consistency?

Q.12 Explain ant twoclient-centric consistency models with example.

Q.13 Explain ant two data-centric consistency models with example.

Q.14 Explain read your writes and write follows readsclient centric consistency.

Q.15 Explain permanentreplicas, server-initiated replicas andclient-initiated replicas.

Q. 16 Explain different techniques of propagation of updates.

Q.17 Explain different failure models.

Q.18 Explain Byzantine Agreement Problem with example.

Q.19 Explain different failures that can occur in RPC andtheir solutions.

Q.20 Explain different approachesfor scalability in reliable multicasting.

Q.21 Whatis mean by atomic multicast? Explain.

Q.22 What is mean by atomic multicast? Explain.

Q.23 Whatisvirtually synchronous reliable multicast? Explain.

Q.24 Explain different types of ordering of the messagesin group communication.

Q.25 Whatis stable storage? Howit plays role in recovery ofdistributed system?

Q.26 What is checkpointing?

Q.27 Explain independent and coordinated checkpointing approaches.

Q.28 Whatis message logging? What are the advantages of messagelogging?

Q. 29 Explain recovery-oriented computing. |

oOo

Scanned with CamScanner


ee
Distributed File Systems
and Name Service
_4444

ut ee
pay a
i

ation, Case
ie

File Accessing models, File-Caching Schemes, File Re plic


,File models,
Introduction and features of DFS File System (AFS), Introduction
to Name
s (DFS), Network File System (NFS), Andrew
Study: Distributed File System y
Global Name Service, The X.500 Director
irectory Services, Case Study: The
services and Domain NameSystem,D y.
ted Systems: Google Case Stud
Service , Designing Distribu

SC
6.1 Introduction
Files are used for
acc ess ed, use d, pro tec ted and implemented.
ed,
iles are structured, nam
File system describes howf vides sharing of informati
on.
inf orm ati on on sec ondary storage.It also pro
permanent use of tem Car share these
era l C omp ute rs. Com puters in distributed sys
on sev
, files are available ent and it is a software
In distributed system tem . Ser vic e Oo ffe rs particular function to cli
ed file sys
les by using distribut cess invokes the service
physically dispersed fi vic e SO ftw are on single machine. Client pro
rver runs ser
e machines. Se
entity running On 5 om as client interface.
d operations called
through some se t of de fi ne
a file, read from a
e se t of pri mit ive file ope rat ion sare create a file, delete
vices to client
s. Th rols local secondary-
File system offers file ser e th set of th es e operations. File server cont
rm d wi per request of
it e to fi le . Client interface is fo ce ssed fr om these devices as
file, and wr files are stored. Th
esefile s ar e ac
e de vi ce s su ch as disks on which
stor ag
In
icated machines.
4 client.
system (DF S). Server may run on ded
for distributed file
ent implementation buted operating
There can be differ server Mm DFS can be part of distri
n both client
and
ay run on same machine. stem and network operating system. DFS
en
other implemca ta ti c ware layer anaging communicat
m ion between file sy
n bea soft and storage devices which are
or it tional centralized file system. The servers
system client as convenow, ; g e
quest of client by arrangin th
: : its
¢
should come into vieW to itstwork should be 8avisible to clients. DFS should fulfil the re
machinesin ne
on different
or data- to service
required files ured with amountof time required ces
eration of DES,its performance is meas ll sto rag e spa
As data transfe ci
s involved in OP ed by 4 DES includes dif
ferent and remotely loc
ated sma ;
rag e SP ace Mm sn ag
the clienttr request . sto
ien
following - irrespective ofits location
DFS supports the . Any node in the system can transparently accessfile
mation sharin
g ut physically relocating the
o Remote infor user to work on different nodesat different times witho
allows t he
ty - DFS
o User mobili evices.- -
, , fect the
d
secondary storaBekeep s mult iple copi es of file on differen t nodes. Failure of any node or copy does notaf
> Availability : DFS
Sw Sere= 444

: ru | operation. Fadlicattcas
wrens met
Pe sa jet

Scanned with CamScanner


Ww Distributed Computing (MU-Sem 8-Comp.) 6-2 Distributed File Systems and NameService
EEE

© Diskless workstations : DFS provides transparentfile accessing capability; hence, economical diskless workstations
can be used.

DFS provides following th ree typesofservices.


o Storage service : DFS deals with storage and management of space on secondary storage device that is used to store files in file system. Files can be stored in blocks. It offers logical view of storage system by allowing operations for storing and accessing the data in them.
o True file service : True file service supports operations on individual file. These are creating and deleting files, modifying and accessing data from files. Typical design issues to support these operations include file accessing techniques, file sharing semantics, file caching and replication mechanism, data consistency, multiple copy update protocol and access control mechanism.
o Name service : Name service offers mapping between file names and file IDs. It is difficult for humans to remember file IDs. Most file systems use directories to carry out mapping, called as directory service. This service supports creation and deletion of directories, adding new file to directory, deleting the existing file from directory, changing name of the file etc.

6.2 Desirable Features of a Good Distributed File System

Following are the features of good DFS:

1. Transparency

Structure transparency : DFS uses multiple file servers. Each file server is user or kernel process to control secondary storage devices of that node on which it runs. Client should be unaware about location and number of file servers and storage devices. DFS should treat the client just like single conventional file system offered by centralized time sharing OS.

Access transparency : Accessing the local and remote file should be carried out in similar manner. DFS should automatically locate accessed file and support transporting the data to client.
Naming transparency : Name of file should remain same when it is moved from one node to other. Its name should be location independent.

Replication transparency : Existence of multiple replicated copies and their locations should remain hidden from clients.

2. User mobility

DFS should permit the user to work on different nodes at different times. Performance should not be affected if user works on node other than his own node. User's home directory should be automatically made available when user logs in on new node.

3. Performance

DFS must give performance same as centralized file system. User should not feel the need to place files explicitly to improve performance.

4. Simplicity and ease of use


User interface to file system should be simple and small number of commands should be supported. It should support all types of applications. The semantics of DFS should be same as single conventional file system for centralized time sharing system.



5. Scalability

The DFS should also support growth of nodes and users in network. Such growth should not lead to service loss or performance degradation of the system.

6. High Availability

DFS should continue to function if partial failure occurs in one or more components. This failure can be communication link failure, node failure, and secondary storage failure. Degradation in performance due to failure must be in proportion to that failure.

7. High Reliability

DFS should support reliability. It should automatically create backup copies of the critical files that can be lost in the failure. Users should be assured that there will be no or minimum loss of their data.

8. Data Integrity

Simultaneous accesses to shared file by many users should guarantee the integrity of data stored in it. These access requests to file from many users should be properly synchronized by using proper concurrency control mechanism.

9. Security

DFS should be secured so that it can offer privacy to user's data. It should implement the security mechanism to protect file data from unauthorized access. Secured access right allocation policy should also be supported.

10. Heterogeneity

In large distributed system, machines' architectures and platforms used are different. DFS should support heterogeneity so that users can use different computer platforms for different applications. DFS should also allow integrating a new type of node or storage area in simple manner.

6.3 File Models

Following are two file models based on structure and modifiability :

6.3.1 Unstructured and Structured Files

- Unstructured files : File is unstructured sequence of bytes. Structure of the file appears to file server as un-interpreted sequence of data. The interpretation of structure and meaning of data in file is up to the program, and OS is not aware of it. This file model is used by MS-DOS and UNIX. Most of the operating systems use this model as sharing of files by different applications is easier. Different applications can view content of the file in different ways.
- Structured files : In this model, file is presented to file server as ordered sequence of records. Record is smallest unit of data that can be accessed. Different files of the same file system may have different sizes of records. To access the data in structured files with non-indexed records, its position within file needs to be specified, for example, seventh record from beginning of file. In structured files with indexed records, a record is accessed by using key value.
- File is identified by its name which is string of the characters. Every file has a name and its data. All operating systems associate other information with each file, for example, the date and time of file creation and the size of the file. Such extra information associated by operating system is called as attributes.

"3 :

Scanned with CamScanner


6-4 Distributed File Systems and Name Service
% Distributed Computing (MU-Sem 8-Comp.)

- The lists of attributes are not same for all the systems and vary from one operating system to another. No existing system supports all of these attributes, but each one is present in some system.
6.3.2 Mutable and Immutable Files

- Mutable files : Most of the OS uses this model. In this model, each update operation on file overwrites its old content and new content is produced. Hence, file is represented as single stored sequence that is changed by each update operation.
- Immutable files : In this file model, each update operation creates new version of the file. Changes are made in new version and old version is retained. Hence, more storage space is required.

6.4 File-Accessing Models

6.4.1 Accessing Remote Files

DFS may use one of the following models to service request of the client to access the file.

- Remote service model : In this model, client forwards the request to server to access remote file. Naming scheme locates the server and actual data transfer between client and server is achieved through remote-service mechanism. In this mechanism, request for accesses is forwarded to server, which then performs accesses and returns the result to user or client. This is similar to disk accesses in conventional file system. Hence, data packing and communication overhead is significant in this model.
- Data-caching model : Caching can be used to improve performance of remote-service mechanism. In conventional file system, caching is used to reduce disk I/O. The main goal behind caching in remote-service mechanism is to reduce network traffic and disk I/O. If data is not available locally then it is copied from server machine to client machine and cached there. This cached data is then used by client at client side to process its request. This model offers better performance and scalability.

- All the future repeated accesses to this recently cached data can be carried out locally. This will reduce additional network traffic. Least Recently Used (LRU) algorithm can be used for replacing the cached data. Master copy of file is available on server and its parts are scattered on many client machines. If copy of the file at client side is modified then its master copy at server should be updated accordingly; this is called as the cache-consistency problem.
- DFS caching is network virtual memory which works similar to demand-paged virtual memory having remote server as backing store. In DFS, data cached at client side can be disk blocks or it can be entire file. Actually more data are cached by client than actually needed so that most of the operations are carried out locally.
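The LRU replacement mentioned above can be illustrated with a short sketch. Following is a minimal example in Python; the block identifiers and the fetch_from_server callable are hypothetical stand-ins for the real client-to-server transfer, so this illustrates the policy rather than the code of any particular DFS.

from collections import OrderedDict

class LRUBlockCache:
    # Client-side cache of file blocks with Least Recently Used replacement.
    def __init__(self, capacity, fetch_from_server):
        self.capacity = capacity              # maximum number of cached blocks
        self.fetch = fetch_from_server        # callable: block_id -> data
        self.blocks = OrderedDict()           # block_id -> data, in LRU order

    def read(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id) # cache hit: mark most recently used
            return self.blocks[block_id]
        data = self.fetch(block_id)           # cache miss: one network access
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)   # evict the least recently used block
        return data

# Repeated accesses to recently cached blocks are served locally:
cache = LRUBlockCache(2, fetch_from_server=lambda b: "data-" + b)
cache.read("b1"); cache.read("b2"); cache.read("b1")   # only two network accesses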

6.4.2 Unit of Data Transfer

The unit of data transfer is the fraction of data that is transferred between client and server due to single read or write operation. In data-caching model of accessing remote files, following four models are used to transfer the data.



File-level Transfer Model

- In this model, whole file is transferred in either direction across the network. Hence, the network protocol overhead is required only once. Its scalability is better as it reduces server load and network traffic.
- Disk access routines on server are always better optimized as it is known that accesses are for whole file instead of random blocks.
- One time transfer is required to transfer the file to the form compatible to client file system. The drawback of this form is that it requires sufficient space at client as entire file is cached. Amoeba, CFS and Andrew File System use this transfer model.

Block-level Transfer Model

- Transfer of file data between client and server takes place in units of file blocks. Usually, file is stored in continuous blocks of fixed size on secondary storage. In this model, there is no need to transfer entire file to client machine and hence, it is better alternative approach for diskless workstations.
- But, if client needs access to entire file then all blocks need to be transferred and hence, it generates more network traffic and has poor performance. NFS, LOCUS and Sprite use this transfer model.

Byte-level Transfer Model

- In this model, transfer of file data between client and server takes place in units of bytes. It offers flexibility as any range of data bytes within file can be requested, but it is difficult to manage cached data due to its variable length. Cambridge file server uses this model.

Record-level Transfer Model

- This model is applicable to structured files only. Hence, records are transferred between client and server. Research Storage System (RSS) uses this model.

6.5 File-Caching Schemes

- These schemes are basically for retaining the cached data of recently accessed files in main memory in order to reduce repeated disk accesses for the same data. It gives good performance due to reductions in disk transfers.
- Hence, it is good approach to improve the performance of the file system. It also helps to achieve scalability and reliability of DFS. Therefore, file caching is used by most of the DFS.
- As remote data is possible to cache on client node, file caching scheme for DFS should address the decisions like cache location, propagation of modifications and validation of cache.
6.5.1 Cache Location

Suppose the original location of the file is server's disk. Following three possible locations can be there to cache the data.


- Server's main memory : In this caching scheme, before client can access the file, the file is first transferred from server's disk to the server's main memory and then to the client's main memory across the network. In this, one disk access and one network access is required. It results in good performance as it eliminates repeated disk accesses. It is easy to implement. As file resides on the same server machine, it is easy to maintain consistency between original file and cached data. The drawback is that client has to carry out network access for each file access. Hence network cost is involved. It is not scalable and reliable approach.

- Client disk : It involves disk access cost at client machine on cache hit. It eliminates network access. In case of crash, data remains in client disk and hence, there is no need to access it again from server for recovery. There is no loss of data as disk is permanent storage. Hence, it offers reliability. Disk also has large storage capacity compared to main memory, resulting in higher hit ratio. Most of the DFS use file level transfer, for which caching in disk is better solution as disk has large storage space for file. The drawback is that this policy does not work for diskless workstations.

- Client's main memory : It works for diskless workstations and avoids network access cost and disk access cost. It contributes scalability and reliability as access request is served locally on cache hit. It is not preferable compared to client disk cache if increased reliability and large cache size is required.

- Cache location can either be main memory or disk. If cache is kept in main memory then modifications done on cached data will be lost due to crash. If caches are kept in disk then they are reliable. There is no need to fetch the data during recovery as data resides on disk. Following are the advantages of main memory caches :
o It permits diskless workstations.
o Data access takes less time from main memory compared to access from disk.
o Performance speedup is achieved with larger and inexpensive memory which technology demands today.
o To speed up I/O, server caches are kept in main memory. If both server caches and user caches are kept in main memory then single caching mechanism can be used for both.

and remoteservice. In NFS,


- Many remote-access implementations take the hybrid approach considering both caching and remote service. In NFS, for example, the implementation is based on remote service but is improved with client- and server-side memory caching for performance.

6.5.2 Modification Propagation


- If caching of the data is at client side then file's data may simultaneously be cached on multiple client nodes. If all these cached data copies are exactly same then caches are consistent. If one client changes file data and corresponding data on other client nodes are not changed then caches become inconsistent.
- To maintain consistency between all the copies of file data, several approaches are available. These approaches depend on the schemes used for the following cache design issues for DFSs :
o When to propagate modifications made to cached data to file server.
o How to verify validity of cached data.
- System's performance and reliability depend on the policy used to write back updated data to the master copy of file which is available on server.
- Following policies are used :



Write-through policy

- This policy sends modified cache entry immediately to the server for updating the master copy of the file. This policy is more reliable as little information is lost when client system crashes. The write performance is poor as each write access has to wait until the information is sent to the server.
- The advantages of data caching are only for read accesses, as remote service method is used for all write accesses. It is suitable in cases where read to write access ratio is quite large.

Delayed-write policy or write-back caching

- In this policy there is delay in writing the modifications to the master copy on server. Modifications are written first in cache and then at a later time are done on master copy. First advantage of this policy is that write accesses complete in less time as writes are made to cache.
- Second, the data may be overwritten prior to writing back on master copy, so only the last update needs to be written.
- The limitation of this policy is that if client crashes then unwritten data are lost, and hence it is less reliable.
- There are variations of delayed write policy. One choice is to flush a block as soon as it is set to be ejected from the client's cache. This alternative can lead to good performance, but some blocks can exist in the client's cache a long time before they are written back to the server.
- A compromise between this choice and the write-through policy is to scan the cache at regular intervals and to flush blocks that have been modified since the most recent scan. One more variation on delayed write is to write data back to the server when the file is closed. This write-on-close policy is used in Andrew File System (AFS).
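The difference between the two propagation policies can be sketched as follows, assuming a hypothetical server stub with a write(file_id, data) operation. Real clients track dirty state per block rather than per file; this is only a schematic comparison.

class WriteThroughCache:
    # Every write updates the local cache and the master copy immediately.
    def __init__(self, server):
        self.server, self.cache = server, {}

    def write(self, file_id, data):
        self.cache[file_id] = data
        self.server.write(file_id, data)     # reliable, but every write waits

class DelayedWriteCache:
    # Writes stay in the cache; dirty entries are flushed later (periodically,
    # on eviction, or on close). Faster, but unwritten data is lost on a crash.
    def __init__(self, server):
        self.server, self.cache, self.dirty = server, {}, set()

    def write(self, file_id, data):
        self.cache[file_id] = data
        self.dirty.add(file_id)              # repeated overwrites cost nothing

    def flush(self):                         # e.g. called at regular intervals
        for file_id in self.dirty:
            self.server.write(file_id, self.cache[file_id])
        self.dirty.clear()

A write-on-close variant, as used by AFS, simply calls flush() from the file's close operation.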

6.5.3 Cache Validation Schemes

- Client machine should always use cached data for accesses which is consistent with master copy at server. If client determines that its cached copy is out of date, then it should cache the up-to-date copy of data.
- Following two approaches are used :

Client Initiated Approach


- The client starts a validity check in which it contacts the server. In this check, client checks consistency of its cached data with the data in master copy. The resulting consistency semantics is decided by the frequency of validity checking.
- Validity checking can be carried out prior to every access, or on first access to file, or at regular intervals. The validity check is carried out by comparing the time of last modification of the cached version of data with the server's master copy version. If the two are same then cached data is considered as up to date. Otherwise recent version is accessed from the server.
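This check can be sketched as below. The server stub with get_mtime(name) and fetch(name) operations, and the 30-second interval, are assumptions made for illustration; the interval is exactly the frequency of validity checking mentioned above.

import time

class CachedEntry:
    def __init__(self, data, mtime):
        self.data = data            # cached file data
        self.mtime = mtime          # last-modification time of the master copy
        self.checked = time.time()  # when validity was last verified

def read_validated(name, cache, server, check_interval=30.0):
    # Client-initiated validation: reuse the cached copy, but compare its
    # modification time against the master copy once the last check is stale.
    entry = cache.get(name)
    now = time.time()
    if entry is not None and now - entry.checked < check_interval:
        return entry.data                       # checked recently: trust it
    master_mtime = server.get_mtime(name)       # one small request to server
    if entry is not None and entry.mtime == master_mtime:
        entry.checked = now                     # cached copy is up to date
        return entry.data
    data = server.fetch(name)                   # stale or absent: refetch
    cache[name] = CachedEntry(data, master_mtime)
    return data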

Server Initiated Approach

- The server maintains the records of the files or portions of files that each client caches. Server must take immediate action whenever it detects a potential inconsistency. If two different clients cache the file in conflicting modes then there is possibility of inconsistency.



- A notification should be sent to server when a file is opened, and the intended read or write mode must be specified for every open. Whenever server detects that file has been opened simultaneously in conflicting modes, it can take action by disabling caching for that particular file. When caching is disabled then a remote-service mode of operation becomes active.

- This policy has some problems also. It violates traditional client server model. Hence, it makes code for client and server programs irregular and complex. It also requires stateful file server which has limitations compared to stateless file servers in case of failures. A validation is carried out with check-on-open policy which is client initiated approach; it should be used along with server initiated approach.
Comparison of Caching and Remote Service

Sr. No. | Caching | Remote Service
1. | It allows serving remote accesses locally so that they can be as fast as local accesses. | Remote access is handled across the network, so it is slower compared to caching.
2. | It reduces the network traffic and server load and improves scalability, as server is contacted occasionally. | It increases network traffic and server load and degrades performance.
3. | For transmission of big chunks of data, overall network overhead required is at lower side compared to remote service. | Transmission of series of responses to specific requests involves higher network overhead as compared to caching.
4. | Caching is better in case of infrequent writes, but if writes are frequent then the mechanism used to overcome the consistency problem incurs large overhead in terms of performance, network traffic, and server load. | In remote service, there is always communication between client and server to keep the master copy consistent with client's cached copy.
5. | Caching is better option for machines with disk or large main memory. | If machines are diskless and with small main memory then remote service should be carried out.
6. | In case of caching, the lower-level inter-machine interface is different from the upper-level user interface. | The remote service concept is just an extension of the local file-system interface across the network.

6.6 File Replication


- Replicating the file on many machines improves availability. If nearest replica is used then service time also reduces. The replicas of the same file should be kept on failure independent machines so that availability of one replica is not affected by availability of remaining other replicas. Replication of files should be hidden from users.
- It is the responsibility of the naming scheme to map a replicated file name to a particular replica. Higher levels should remain invisible from existence of replicas.



- At lower level, different names are used for different replicas. Replication control, such as determining the degree of replication and placement of replicas, should be provided to higher levels. As one replica updates, then from users' point of view, the changes should be reflected in other copies as well.

6.6.1 Replication and Caching

- In replication, replica is created at server, whereas cached copy is associated with clients. A replica is more persistent, widely known, secure, available, complete and accurate compared to cached copy.
- Cached copy is dependent upon a replica. It needs to be revalidated periodically with respect to the replica; only then it remains consistent and useful. Hence, availability and performance of a cached copy depend on the replica.

6.6.2 Advantages of Replication

Following are the advantages of replication :

1. Increased Availability : If primary copy fails then alternate copies of the replicated file can be used. The critical data can be replicated on several servers so that its availability does not depend on any one server.
2. Increased Reliability : Due to existence of redundant copies of files, recovery from even catastrophic failures (such as disk crash) becomes possible. Permanent loss of data is avoided.
3. Improved Response Time : Replication offers better response time. The file can be accessed either locally or from a server whose access time is lower than that of the primary copy's server.
4. Reduced Network Traffic : If replica of the file is available with the file server residing on client's machine then client's access request can be serviced locally. It results in reduced network traffic.
5. Improved System Throughput : Several clients' requests to file are served by different servers in parallel. It results in improved system throughput.
6. Better Scalability : With replication, all the access requests for a file do not arrive at one file server; different clients' requests are serviced by different servers. The load gets distributed, and even enormous load created by clients for a limited time period can be sustained by temporarily replicating the file on more file servers.
6.6.3 Replication Transparency

- Users should remain unaware about replication. Although file is replicated, it should appear as a single logical file to its clients. DFS should offer the same client interface for replicated and non-replicated file as well. Following are the two important issues of replication transparency : naming of replicas and replication control.


Naming of Replicas

- As immutable objects are easily supported by kernel, single identifier can be assigned to all the replicas of immutable object. As all copies are identical and immutable, kernel can use any copy. In case of mutable objects, all copies may not be consistent at a particular instance of time.
- If single identifier is assigned to all replicas of mutable object then kernel cannot decide which replica is most up-to-date. Therefore consistency control and management for mutable objects should be carried out outside the kernel.
- Naming system should map a user supplied identifier to the appropriate replica of mutable object. In case all the replicas are consistent then mapping must provide location of the replicas and their distance from client node.

Replication Control

- Replication control is transparent from user and handled automatically. Replication can be carried out by system automatically or can be carried out manually.
- In explicit replication, users control the process of replication. The process that creates the file specifies the server on which the file should be placed. If needed then additional copies are created as per request from users. Users also have flexibility to delete one or more replicas. In implicit replication, entire process of replication is controlled by system automatically. Users remain unaware of this process. Server to place the file is selected by system automatically. System also creates and deletes replicas as per replication policy.
deletes replicas as per replication policy.

6.6.4 Multicopy Update Problem

As replicas of file exist on multiple machines, it is necessary to keep all the copies consistent. If update takes place on one copy, it must be propagated to all other copies. Following approaches are used :

Read-Only Replication
In this approach, only immutable files are replicated as they are used only in read only mode. These files get updated after longer period of time.

Read-Any-Write-All Protocol

This protocol allows replication of mutable files. In this approach, read operation is performed on any copy of the file but write operation is performed by writing to all copies of file. Before performing update to any copy, all copies are locked, then they are updated, and finally locks are released to complete the write.
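A schematic of this protocol is given below; the replica objects with read, write, lock and unlock operations are hypothetical. Note how a write touches every copy while a read touches only one.

import random

def read_any(replicas, name):
    # All copies are kept identical by writes, so any single copy may be read.
    return random.choice(replicas).read(name)

def write_all(replicas, name, data):
    # Lock every copy first so no reader observes a partially propagated update.
    for rep in replicas:
        rep.lock(name)
    try:
        for rep in replicas:
            rep.write(name, data)
    finally:
        for rep in replicas:          # release the locks to complete the write
            rep.unlock(name)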

Available-Copies Protocol

- In read-any-write-all protocol, if server with replicated copy is down at the time of write operation then write cannot be performed. Available-copies protocol allows this operation. In this approach, read operation is performed by reading any available copy of the file but write operation is performed by writing to all available copies of file.
- The assumption in this protocol is that when a down server recovers, it brings its state up-to-date from other servers' copies. This protocol provides high availability but does not prevent inconsistencies in case of failure of communication links.



Primary-Copy Protocol

- In this protocol, for each replicated file, one copy is kept as primary and all other copies are considered as secondary copies. The write operation is carried out only on the primary copy but read operation can be carried out on any copy including primary.
- Server with secondary copy either receives notification of updates from server with primary copy, or updates are requested by the server with secondary copy itself. Server with primary copy immediately informs updates to secondary copies if any occur to it.

Quorum-Based Protocol

- This protocol is applicable in network partition problem where replicas of file are partitioned into two or more active groups. All the above discussed protocols have some restrictions. Following are the definitions of read and write quorum :
(i) Read quorum : To read a file, if minimum r copies of replicated file F have to be consulted out of n replicated copies of F, then the set of r copies is called as read quorum.
(ii) Write quorum : To carry out write operation on the file, if minimum w copies of replicated file F have to be written out of n replicated copies of F, then the set of w copies is called as write quorum.
- Restriction in this protocol is that the sum of r and w should be greater than n (r + w > n). It ensures that, between any pair of read quorum and write quorum, there is at least one copy common which is up-to-date. This protocol needs to identify the current updated copy in order to update other copies of the file. This problem is resolved by assigning a version number to the copy when it gets updated. Highest version number copy in quorum is the current updated copy. New version number to be assigned is one more than the current version number.
- In read operation, only the highest version number copy is selected to read from read quorum. For write operation, only the highest version number copy is selected from write quorum to carry out write operation. Before performing write, the version number is incremented by one. Once update is carried out, the new update and new version number are written to all the replicas of the write quorum.
- Consider n = 8, r = 4, and w = 5. In this example, the r + w > n condition is satisfied. If write operation is carried out on write quorum {3, 4, 5, 6, 8} then these copies are updated versions with new version number. If read operation is performed with read quorum {1, 2, 7, 3} then copy 3 is the common copy, as read quorum must contain at least one copy of the previous write quorum. This exactly ensures copy 3 as having the largest version number in the read quorum. Hence, read is carried out with copy 3.
- Following are few special alternatives suggested for this protocol :
o Read-Any-Write-All Protocol : Suitable to use when read to write ratio is high, in case of r = 1 and w = n.
o Read-All-Write-Any Protocol : Suitable to use when write to read ratio is high, in case of r = n and w = 1.
o Majority-Consensus Protocol : Suitable to use when ratio of read to write operations is near 1. In this protocol, read quorum and write quorum are of same size or of nearly equal size. If n = 12 then r = 6 and w = 7.
o Consensus with Weighted Voting : In this protocol, different replicas are assigned different votes considering reliability and performance.



o If replica X is accessed more frequently then more votes are assigned to it. Size of quorum depends on replicas selected for quorum. In this protocol, to ensure non-null intersection of read and write quorums, the condition is r + w > v, where v is the total number of votes.
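The whole quorum protocol, including the n = 8, r = 4, w = 5 example above, can be condensed into the following sketch. The Replica class and random quorum selection are illustrative assumptions; a real system would also lock the selected copies during the update.

import random

class Replica:
    # One copy of the file; data plus a version number (0 = never written).
    def __init__(self):
        self.data, self.version = None, 0

def quorum_read(replicas, r):
    # Consult any r replicas; because r + w > n, this set overlaps every
    # write quorum, so the highest version seen is the current one.
    return max(random.sample(replicas, r), key=lambda rep: rep.version)

def quorum_write(replicas, r, w, data):
    n = len(replicas)
    assert r + w > n, "read and write quorums must intersect"
    new_version = quorum_read(replicas, r).version + 1   # current version + 1
    for rep in random.sample(replicas, w):               # update any w copies
        rep.data, rep.version = data, new_version

# The n = 8, r = 4, w = 5 example from the text:
replicas = [Replica() for _ in range(8)]
quorum_write(replicas, 4, 5, "new contents")
print(quorum_read(replicas, 4).version)                  # always prints 1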

6.7 Case Study : Distributed File Systems (DFS)

- DFS forms basis for many distributed applications, and sharing of data is fundamental to it. DFS supports sharing of data by multiple processes over long period of time while offering security and reliability.
- Sun Microsystems' Network File System (NFS) and Andrew File System (AFS) are the examples of distributed file system.

6.7.1. Network File System (NFS)

- NFS is developed by Sun Microsystems and it is used on Linux to join file systems of different computers into one logical whole. Version 3 of NFS was introduced in 1994. NFSv4 was introduced in 2000 and offers a number of improvements over the previous NFS architecture.
- NFS allows a number of clients and servers to share a common file system. These clients and servers can be on the same LAN or in wide area network if server is located geographically at long distance. NFS permits every machine to be both a client and a server simultaneously.

NFS Architecture

- NFS uses remote file service model. Remote server manages all the accesses of clients. NFS offers interface to clients similar to conventional local file system, to access the files transparently. All the file operations are implemented by server but their interfaces are offered to the clients. It is remote service model. NFS also supports upload/download model in which client downloads the file for operations and then uploads it to the server so that updated file can be used by other clients.

used by other clients.


- UNIX based version of NFS is the predominant version. Client uses system calls of its local operating system (UNIX). The local UNIX file system interface is replaced by an interface to the virtual file system (VFS). The operations carried out on the VFS interface are either passed to the local file system or to the NFS client component. NFS client then performs remote procedure call (RPC) to access files at remote server. This means NFS client implements file operations as RPC to remote server. VFS hides the differences between different file systems.
- NFS server on server side handles incoming requests. RPC server stub unmarshals these requests and NFS server converts them to regular VFS file operations that are afterward passed to the VFS layer. VFS implements local file system where actual files reside. In this way, NFS is independent of local file systems, provided local file system is compliant with the file system model of NFS.



Fig. 6.7.1 : Basic NFS Architecture for UNIX Systems

[The figure shows a client and a server machine. On each side, a system call layer sits above a virtual file system (VFS) layer. On the client, the VFS layer passes operations either to the local file system interface or to the NFS client, whose RPC client stub communicates over the network with the RPC server stub; on the server, the NFS server hands requests to the local file system interface.]

File System Model, Communication and Naming
- In NFS, files are considered as uninterpreted sequence of bytes. Files are organized in a naming graph in which nodes represent files and directories. NFS supports hard links and symbolic links just like UNIX file systems.
- Files are accessed by means of UNIX like file handle. At first, client searches the file name by using naming services to obtain the file handle. Each file has several attributes whose values can be found and changed. These attributes include length and type of file, its file system identifier, and last time file was modified.
- For reading the data from file, read operation is used. Client specifies offset and number of bytes to read. Writing data to file is carried out with write operation. The position and number of bytes to write is specified by client.

- For communication, NFS protocol is placed on top of RPC layer. This is due to independence of NFS on OS, network architecture and transport protocols. Open Network Computing Remote Procedure Call (ONC RPC) is used for the communication. NFS supports grouping several RPC requests in one request in order to reduce number of messages to be exchanged.
- This is called as compound procedure, which does not have any transactional semantics. All the operations in compound procedure are executed in order of their requests. Conflicts cannot be avoided in case the same operations are invoked by other clients simultaneously.
- In NFS, stateless server was implemented initially, which cannot support locking a file for operations. Therefore, a separate lock manager is used to handle this situation. In later version, stateful approach is implemented to support working across wide area network so that clients can use caches effectively.
- NFS naming model offers transparent access for remote file system at server to its requesting clients. NFS allows client to mount part of file system exported by server. The exported directory by server can be maintained in client's local name space. The drawback of this approach is that it does not allow sharing of files.
- Remote clients access the exported directories. Every NFS server exports one or more of its directories. NFS clients access particular directories along with their subdirectories, so in fact entire directory trees are usually exported as a unit.
*ectory trees are usually exP® Publlcutions

Scanned with CamScanner


- The list of directories that a server exports is kept in a file, often in /etc/exports in case of Linux. As a result of this, these directories can be exported automatically whenever the server is booted. Clients can access exported directories by mounting them. When a client mounts a remote directory at NFS server, it becomes part of its directory hierarchy and client can access this mounted directory.

File Handle and Automounting


- It is a reference to a file within file system and is created by server hosting the file system. It is unique across all file systems exported by server. It is created at the time of creation of file.
- In version 4 of NFS, it can be variable up to 128 bytes. File handle is stored and used by client for most of the operations, and hence, it avoids look up for file which improves performance.
- After deleting a file, server cannot reuse the same file handle as it can be locally stored by client. This may lead to wrong file access by using the same file handle by client.
- Iterative look up with not permitting look up operation to cross a mount point leads to the problem in getting the initial file handle. To access the file in remote file system, client must provide file handle of directory where look up would take place. This also requires providing name of the file or directory that is to be resolved.
- NFS version 4 solves this problem by offering a separate operation putrootfh that informs server to resolve all file names relative to the root file handle of the file system it manages. In NFS, on demand mounting of remote file system is handled by automounter which runs as a separate process on client machine.

File Attributes

In NFS version 4, attributes are classified as a set of mandatory attributes that every implementation must support, a set of recommended attributes which should be preferably supported, and an additional set of named attributes. The named attributes are encoded as an array with entry as attribute-value pair. These are not part of NFS protocol. Following are the examples of some mandatory attributes :
o TYPE : Type of file (regular, directory, symbolic link).
o SIZE : Length of file in bytes.
o CHANGE : Indicator for client to check if and/or when the file has changed.
o FSID : Server-unique identifier of file's file system.
Following are some of the general recommended file attributes :
o ACL : Access control list associated with file.
o FILEHANDLE : Server provided file handle of this file.
o FILEID : File system unique identifier for this file.
o FS_LOCATIONS : Locations in network where this file system can be found.
o OWNER : Owner of the file.
o TIME_ACCESS : Last time of accessing the data of file.
o TIME_MODIFY : Time when file data were last modified.
o TIME_CREATE : Time when file was created.
o TIME_CREATE: Time when file was
Created.


Synchronization

- Distributed file system allows sharing its files among multiple clients. These shared files between multiple users should remain consistent. For this purpose synchronization is needed. In NFS, file locking is carried out by a separate lock manager. In version 4 of NFS, file locking is integrated in NFS file access protocol. It may happen that client or server may fail while locks are still held. Recovery mechanism should handle such problem to ensure consistency.
- Four locking operations are defined in NFS version 4. The same file can be shared by multiple clients if they only read data. It is necessary to obtain write lock to modify part of the file data.
- Following operations are provided for locking :
o Lock : Create lock for range of bytes.
o Lockt : Test whether conflicting lock has been granted.
o Locku : Remove a lock from range of bytes.
o Renew : Renew a lease on specified lock.
- Lock is a nonblocking operation and is used to request a read or write lock on consecutive range of bytes in file. In case of conflicting lock, the lock cannot be granted and client has to poll the server at a later time. Once conflicting lock has been removed, server grants the next lock to the client at the top of the requesting FIFO list maintained at server side. Lockt is used to check whether any conflicting lock exists. Removing a lock is done by using Locku. If client does not renew the lease on an acquired lock, server will automatically remove it. A sketch of such a lock manager follows.
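The sketch below shows how a server-side manager might keep byte-range locks with leases. The data layout and the fixed 60-second lease are assumptions for illustration, not the NFSv4 wire protocol; two read locks on overlapping ranges are compatible, while any overlap involving a write lock conflicts.

import time

LEASE = 60.0   # seconds a lock lives unless the client renews it

class ByteRangeLockManager:
    def __init__(self):
        self.locks = []   # list of (owner, start, end, is_write, expiry)

    def _conflicts(self, owner, start, end, is_write):
        now = time.time()
        self.locks = [l for l in self.locks if l[4] > now]  # drop expired leases
        for o, s, e, w, _ in self.locks:
            if o != owner and s <= end and start <= e and (w or is_write):
                return True    # overlapping range with at least one writer
        return False

    def lock(self, owner, start, end, is_write):
        # Nonblocking: on conflict the client must poll again later.
        if self._conflicts(owner, start, end, is_write):
            return False
        self.locks.append((owner, start, end, is_write, time.time() + LEASE))
        return True

    def lockt(self, owner, start, end, is_write):
        return not self._conflicts(owner, start, end, is_write)  # test only

    def locku(self, owner, start, end):
        self.locks = [l for l in self.locks
                      if not (l[0] == owner and l[1] == start and l[2] == end)]

    def renew(self, owner):
        now = time.time()
        self.locks = [(o, s, e, w, now + LEASE if o == owner else x)
                      for (o, s, e, w, x) in self.locks]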

Replication and Consistency

- In NFS version 4, cache consistency is handled in implementation-dependent manner. Client has memory cache to hold data read from server. In extension to memory cache, disk cache also exists in client machine. Client caches file data, attributes, directories and file handles. Client caches data obtained from server while performing several read operations. Several clients on the same machine may share the cache. If modifications are done on data then cached data is flushed back to server.
- If part of file data is cached, then whenever client opens a previously closed file, it should revalidate it. Server may hand over some rights to client so that client can locally handle open, close operations. Server is in charge of checking whether opening the file by client should succeed or not.
- Clients can also cache attribute values which can be different at different clients. Modifications to attribute values should be immediately forwarded to server. Same approach is used for file handles and directories.
- NFS version 4 offers minimum support for file replication. Only replication of whole file system is possible.

Fault Tolerance and Security

- RPC mechanism in NFS does not ensure guarantee regarding reliability. It lacks in detecting the duplicate messages. In case of loss of server reply, server will process the retransmitted request by client. This problem is solved by means of duplicate-request cache implemented by server. Each client request carries a transaction identifier (XID) that is cached by server when the request arrives at server. After processing the request, server also caches the reply.



If the timer at client expires before the reply comes back then client retransmits the same request with the same XID. Three cases occur. If server has not yet completed the original request, it ignores the retransmitted request. In other case, server may get the retransmitted request after the reply is sent to client. If arrival time of the retransmitted request and the time at which the reply was sent are nearly equal then server ignores the retransmitted request. If the reply really did get lost then the cached reply is sent to client as reply to the retransmitted request. A sketch of this duplicate-request cache follows.
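A compact sketch of this duplicate-request cache is given below. The handler callable stands in for the actual request execution and is an assumption; real servers also age entries out of the cache and apply the timing heuristic above, both omitted here.

class DuplicateRequestCache:
    IN_PROGRESS = object()   # marker: original request still being executed

    def __init__(self, handler):
        self.handler = handler   # hypothetical function executing a request
        self.seen = {}           # XID -> IN_PROGRESS or the cached reply

    def handle(self, xid, request):
        if xid in self.seen:
            reply = self.seen[xid]
            if reply is self.IN_PROGRESS:
                return None            # original not finished: ignore duplicate
            return reply               # reply was lost: resend the cached copy
        self.seen[xid] = self.IN_PROGRESS
        reply = self.handler(request)  # execute the request exactly once
        self.seen[xid] = reply         # cache reply for later retransmissions
        return reply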

- A client that is granted a lock may crash while holding it. To handle this case, server issues a lease on every lock. If the lease is not renewed by client then server removes the lock, freeing the resources held by it. In case of server failure, a grace period is provided to server after it recovers. In this grace period, clients can reclaim the same locks that were previously granted to them.
- As a security measure, older NFS used Diffie-Hellman key exchange to establish a session key. NFS version 4 uses the authentication protocol Kerberos. The RPCSEC_GSS security framework is also supported for setting up secured channels. Authorization in NFS is analogous to secure RPC. The ACL file attribute is used to support access control.

6.7.2 Andrew File System (AFS)

- UNIX programs can transparently access the remote shared files. Like NFS, this transparency is offered by Andrew File System (AFS) to UNIX programs. Normal UNIX primitives are used by UNIX programs to access AFS files without carrying out any modifications or recompilation. AFS is compatible with NFS. File system in server is NFS based. Hence, file handles are used for reference of file. Remote access to file is provided via NFS.
- Scalability is a major design goal of AFS. It supports large number of active users. The whole file is cached at client node. Following are two design characteristics :
o Whole-file serving : The entire content of directory and file are transferred to client machine by AFS servers.
o Whole-file caching : The file transferred by server is then stored on the local disk which is permanent storage. Cache contains several recently used files. Open requests are carried out locally.

- Following is the operation of AFS :
o Suppose file is not available on client machine and user process issues open system call for the file in shared file space. Now server is located and request is sent for a copy of the file. The copy is then stored in local UNIX file system in client machine. The copy is then opened and its file descriptor is sent to client.
o Processes in clients perform operations on local copy of the file. When process in client issues close system call, if the local copy is updated then its content is sent to server. Server updates the file content and timestamp on the file. The local copy is stored on client's disk for future use.

Implementation
- AFS implementation contains two software components which are UNIX processes called as Vice and Venus. Vice process runs on server machine as user-level UNIX process. Venus process runs on client machine as user-level UNIX process. At client side, local and shared files are available. Local files are handled as normal UNIX files. Shared files are stored on server and cached by clients in disk cache.




- UNIX kernel on each client and server is a modified version of BSD UNIX. These modifications are done to interpret open, close and other system calls when they refer to files in the shared file space.
- At client side, one of the file partitions on local disk is used as cache, storing files cached from shared space. The management of cache is carried out by the Venus process. It removes LRU (least recently used) files so that new files accessed from server get space. The size of disk cache at client is large. Vice servers implement flat file service. Venus processes at client side implement the hierarchic directory structure that is needed by UNIX. Each file in shared file space is identified by a 96-bit file identifier. These files are accessed by this identifier. It is the job of Venus process to translate path name to file identifier.

Cache Consistency in AFS

- Vice process at server side provides call back promise whenever it transfers a file to Venus process running at client side. In fact, call back promise is a token that guarantees that, whenever any client updates the file copy, it will be notified to Venus process.
- These tokens are stored in two states along with cached file on disk cache at client machine. These are : valid and cancelled state. When server carries out a request to modify a file, it notifies each client (Venus processes) to which the token was issued. This notification (call back) is RPC from server to Venus process. After receiving call back, Venus process sets the call back promise as cancelled.
- For open operation on file, although file is available in cache, Venus checks its state. If it is in cancelled state then a recent copy of the file is fetched from server. For valid state, cached copy is then opened for further operations. A sketch of this check follows.
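Following is a sketch of this Venus-side check; the vice stub with a fetch(fid) operation and the file-identifier keys are assumptions made for illustration.

VALID, CANCELLED = "valid", "cancelled"

class VenusCache:
    def __init__(self, vice):
        self.vice = vice     # hypothetical stub for the Vice file server
        self.files = {}      # fid -> (cached data, callback promise state)

    def open(self, fid):
        entry = self.files.get(fid)
        if entry is not None and entry[1] == VALID:
            return entry[0]              # promise still holds: open local copy
        data = self.vice.fetch(fid)      # fetch fresh copy with a new promise
        self.files[fid] = (data, VALID)
        return data

    def callback(self, fid):
        # Invoked by the server via RPC when some other client updates fid.
        if fid in self.files:
            self.files[fid] = (self.files[fid][0], CANCELLED)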
- Vice is a user level process and the server is dedicated to offer AFS service. UNIX kernel is modified in AFS host so that it can handle file operations by using file handles instead of conventional UNIX file descriptors.
in Name System
Services and Doma
68 Introduction to Name their
s or to obtain addresses of resources by submitting
es ses to locate object
used 5 y client
proc ines, network
~ Name service is r in fo rm at io n regarding users, mach
use ses and othe
services often d, x9, hole addres
names. Name
s etc. ters, files, web pages 9
remote object d system. These resources comprise compu
domains and in di st ri bu te
so urces mesof entities are
t o refer to re ne example of N2 s» web page. These na
- N ames are
re use d
h is used to ac ce
more. URL
is t me whic
services and many
name space: i
organized in esses
if
fi ers and Addr
Names, identi ntities
6.8.1 bits or characters. There are manye
of
'
use s string network connections etc. It also includes
. NameImailboxes,
entitypages,
*

to refer tO web
ames are
B . ;

syste# " uch as d messages,


:
In distributed $5
|
wo . ,
ibuted sy stem
printers: ;
disks, COMPU ters etc.
exist in distr les,
name
h as files, need to be accessed. To access the entity we need access point. The
resources Suc
atities, theY It indicates
omputer changes location, a different IP address is assigned toit.
hen mobile ©
> In order to operat a course of time.
res-
of access point IS 4 dd = .
a7 TechKnowledge
that entity MaY chan UOltcations

TR

Scanned with CamScanner


- Address is also a name. Name should be location independent. Name of the entity should be independent from its address.
- Identifier is also a name which is used to identify the entity. Identifier should refer to one entity and each entity should be referred to by at most one identifier. Identifier should always refer to the same entity.
- In many computer systems, addresses and identifiers are represented in the form of bit strings. Human friendly name such as URL is represented as string of characters. Many of the names are specific to some service.

6.8.2 Name Services and the Domain Name System

- Name service maintains a collection of one or many naming contexts. Name service supports name resolution. Name resolution is to look up attributes from a given name. As distributed system is open, name management is separated from other services.
- The same naming scheme should be used for the resources managed by different services. Sometimes resources created in different administrative domains can be shared. Therefore a common name service is required.
- Originally name services were quite simple as they were designed for a single administrative domain. Considering the large distributed system with interconnection of networks, a larger name-mapping is required. Global name service is the example of such a naming service.
- Global name service and handle service are the two examples of name services that concentrated on scalability to large number of objects. Internet Domain Name Service (DNS) is widely used.

Name Spaces

Names in distributed system are organized in name space. Name space can be represented as a labeled digraph having two types of nodes. Leaf node represents named entity and it has no outgoing edges. This leaf node stores information about entity such as address. It also stores state of the entity; for example, in file system it contains the complete file it is representing. Directory node in name space has a number of outgoing edges. Directory node maintains a directory table in which each outgoing edge is represented as a pair (edge label, node identifier).

Naming graph contains a root node. A path in the naming graph contains a sequence of labels. For example, path N:<label 1, label 2, ......, label n> contains N as the first node in the graph. Such sequence is called as path name. If the first name in a path name is the root of the naming graph then it is called as an absolute path. Otherwise it is called as a relative path. A short sketch of resolving such path names follows.
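The directory tables of such a naming graph can be modelled directly, as in the sketch below; the labels and the stored address are invented for illustration.

# A directory node is a table from edge labels to child nodes; a leaf node
# holds the entity's attributes (here just an address).
root = {"ac": {"in": {"college": {"www": {"address": "10.0.0.7"}}}}}

def resolve(node, path):
    # Follow one directory-table entry per label of the path name.
    for label in path:
        if not isinstance(node, dict) or label not in node:
            raise KeyError("unbound name: " + label)
        node = node[label]
    return node

# An absolute path name starts at the root of the naming graph:
print(resolve(root, ["ac", "in", "college", "www"]))   # {'address': '10.0.0.7'}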

DNS names are called as domain names. They are strings just like absolute UNIX file names. DNS name space has hierarchic structure. A domain name comprises one or more strings called as labels. They are separated by the delimiter '.' (dot). There is no delimiter at the beginning and end of a domain name. A prefix of a name is an initial section of the name. For example, dcs and dcs.qmw are both prefixes of dcs.qmw.ac.uk. DNS servers do not recognize relative names. All the names are referred to the global root.

In general, an alias is similar to a UNIX like symbolic link which allows substituting a convenient name in place of a complicated one. DNS permits aliases in which one domain name is defined to stand for another. Aliases provide transparency. Aliases are generally used to specify machine names on which an FTP server or web server runs. The alias is updated in the DNS database in case the web server is moved to another machine.



- A naming domain is a name space for which there is a single administrative authority for assigning names within it. Domains in DNS are collections of domain names. Only some domain names identify a domain. A machine can have the same name as a domain. Naming data belonging to different naming domains are stored by distinct name servers managed by corresponding authorities.

Combining the Name Spaces

- It is possible to embed part of one name space in another name space. Mounting a file in UNIX and NFS is the example of the same. The entire UNIX file systems of two different machines can be merged by replacing each file system's root by a super root. Then mount each machine's file system in this super root. In this way name spaces can be merged by creating a higher level root. It raises the problem of backward compatibility.
- DCE (Distributed Computing Environment) name space permits heterogeneous name spaces to be embedded in it. DCE name space contains junctions similar to mount points in UNIX and NFS. These junctions allow the mounting of heterogeneous name spaces.
- File system mounting allows users to import files from remote server and share them. Spring naming service supports creating name spaces dynamically. It also supports sharing individual naming contexts selectively.
hi

6.8.3 Name Resolution

- Name resolution is an iterative process. An example of the iterative nature of resolution is the use of aliases. If a DNS server is requested to resolve www.dcs.qmw.ac.uk, it first resolves the alias to another domain name copper.dcs.qmw.ac.uk, which must be further resolved to generate an IP address. Hence, use of aliases can present cycles in the name space, in which case resolution never terminates. As a solution, a threshold should be maintained and if the resolution process crosses it then the process is terminated. In other case, the administrator may carry out certain action if aliases create cycles.
creates cycles. gle se rver. Otherwise server would
ulation. It is not stored on sin
an d used by large pop .
As DNS databa
se is hu ge
re high availability
y used sh ou ld us e replication to ensu .
t are hea vil al name server A
become bottlen eck. Name services tha
domain manage 5 data belong
in| g to domain which is stored on loc
aut hor ity of the eds to ta ke help of other
Administrative re tha n on e do main. Local name ser
verals o ne
re data belonging to mo
name server may sto be answered byit.
na me s as all the enquiries cannot
o resolve the igation. The
nameservers t rs as to resolve the name is called nav
g dat a fro m am on g $ everal nameserve so
cating namin e igation as shownin Fig. 6.8.1.
Process of lo carries out navigation. DNS carries outiterativ nav
luti on so ft wa re
reso
client name

Fig. 6.8.1 : Client iteratively contacts name servers 1 to 4 in order to resolve a name




- Initially client presents the name to the local name server. If it has the name, it returns it immediately. Otherwise it will suggest another server which can help. Now resolution proceeds at the new server. If needed, further navigation is carried out until the name is located or discovered to be unbound. In multicast navigation, client multicasts the name to be resolved and the needed object type to a group of servers. Only the server storing this named attribute replies to the request.
- In recursive name resolution, a name server coordinates the resolution of the name and returns the result back to the user agent. It is further classified as non-recursive and recursive server controlled navigation. In non-recursive server controlled navigation, client can choose any name server.

- This server then either multicasts the request to its peers or it communicates iteratively with peers. In recursive server controlled navigation, client once more contacts a single name server.

- If this server does not hold the name then it contacts its peer that holds a larger prefix of the name, which in turn attempts to resolve it. It is repeated until the name is resolved. Fig. 6.8.2 shows non-recursive and recursive server controlled navigation.

[The figure shows a client and name servers 1 to 3; numbered arrows trace the requests and responses in each navigation style.]

(a) Non-recursive server controlled navigation (b) Recursive server controlled navigation

Fig. 6.8.2
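The two navigation styles of Fig. 6.8.1 and Fig. 6.8.2 can be contrasted in a few lines, assuming a hypothetical server object whose lookup(name) returns either ('value', attributes) or ('referral', next_server).

def resolve_iterative(local_server, name):
    # Iterative navigation (Fig. 6.8.1): every referral comes back to the
    # client, which contacts the next name server itself.
    server = local_server
    while True:
        kind, result = server.lookup(name)
        if kind == "value":
            return result          # name resolved to its attributes
        server = result            # follow the referral ourselves

def resolve_recursive(server, name):
    # Recursive server-controlled navigation (Fig. 6.8.2(b)): the contacted
    # server chases referrals on the client's behalf and returns the result.
    kind, result = server.lookup(name)
    if kind == "value":
        return result
    return resolve_recursive(result, name)   # server-to-server in reality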

- The access to a name server running in one administrative domain by a client running in another administrative domain should be prohibited. In DNS and other name services, client name resolution software and server maintain a cache of results of previous name resolutions. On a client's request to resolve a name, first the client name resolution software enquires the cache. If a recent result of a previous name resolution is found then it is returned to the client. Otherwise the request is forwarded to the server. That server in turn may return a result cached from other servers.

Caching improves performance and availability. It saves communication cost and improves response time. Many enquiries to high-level name servers such as root servers are eliminated due to caching.
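A minimal sketch of such a client-side cache with a time-to-live check follows; the TTL value and the upstream lookup function are placeholders, not behaviour mandated by any particular name service:

    import time

    CACHE = {}     # name -> (result, expiry time)
    TTL = 300      # seconds a cached entry stays valid (illustrative value)

    def lookup_upstream(name):
        # Placeholder for a real query to an authoritative or peer server.
        return "192.0.2.17"

    def resolve(name):
        entry = CACHE.get(name)
        if entry and entry[1] > time.time():
            return entry[0]                        # fresh cached answer, no network traffic
        result = lookup_upstream(name)             # cache miss or expired: ask a server
        CACHE[name] = (result, time.time() + TTL)
        return result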

6.8.4 Domain Name System

The DNS naming database is used across the Internet. The objects named by DNS are computers, and for these computers IP addresses are stored as attributes. In DNS, domain names are simply called domains. The Internet DNS has bound millions of names, and lookups against these names are carried out from all around the world.

Mainly, DNS is used for naming across the Internet. The DNS name space is partitioned both organizationally and geographically. In a name, the highest-level domain is mentioned at the right. Following are the generic domains:
o com : Commercial organizations
o edu : Educational institutions and universities
o gov : US Governmental agencies



o mil : US military organizations
o net : Major network support centers
o org : Organizations not listed in the above domains
o int : International organizations

- Every country has its own domain. Some of the examples are:
o us : United States
o fr : France
o in : India
- Countries other than the USA use their own domains. For example, India has the domain ac.in, which corresponds to the edu domain.

DNS Queries
Internet DNS is used for simple host name resolution and for looking up electronic mail hosts:
- Host name resolution : In this, applications use DNS to resolve host names to IP addresses. A URL contains a domain name; when it is given to a browser, the browser makes a DNS enquiry and obtains the IP address. Browsers use HTTP to communicate with web servers at an IP address, with a reserved port number.
- Locating the mail host : Electronic mail software uses DNS to resolve domain names into IP addresses of mail hosts.
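As a concrete illustration of the first kind of enquiry, Python's standard socket module can ask the configured resolver for an address (the host name is just an example and needs network access to resolve):

    import socket

    # Forward lookup: host name -> IP address (the A-record style query
    # a browser performs before opening an HTTP connection).
    print(socket.gethostbyname("www.example.com"))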
Name Servers, Navigation and Query Processing
- Scaling is achieved with a combination of partitioning the DNS naming database and replicating and caching parts of it. The DNS database is distributed across a logical network of servers. Each server holds the part of this database close to its point of requirement, mainly the naming data of the local domain, so that local requests can be fulfilled locally within the domain.
- The DNS naming data is divided into zones, where each zone contains the following data:
o Attribute data for names in the domain, and less of the data about sub-domains. For example, a zone could contain data for an organization (college.ac.in) and less of the data about a department (department.college.ac.in).
o Names and addresses of at least two servers in the zone which contain trustworthy and reliable data for the zone.
o Names of the name servers that hold data for delegated sub-domains, and data which gives the IP addresses of these servers.
o Parameters related to zone management that govern replication and caching. Each entry in a zone contains a time-to-live value, so that data cached by a client remains useful in future name resolution.
- Any server can cache data from other servers so that it can answer quickly; such cached data is un-authoritative. Each server also records the domain names and addresses of other servers. Caching minimizes network traffic and offers flexibility. If a client sends a query after the time-to-live period has expired, the un-authoritative server contacts the authoritative server again for fresh data.
- DNS can be used to store arbitrary attributes useful to system administrators. The type of a query specifies what is required: it can be an IP address, a name server or other information.

- A DNS client is a resolver, which is implemented in library software. It uses a simple request-reply protocol. Both iterative and recursive resolution are supported by DNS, and the client-side software specifies the type of resolution to be carried out.

- Name servers store the zone data in files in the form of resource records. Following are some examples of resource records for the Internet database.

Type of Record   Associated Entity   Contents                                    Meaning
A                Host                IP address                                  IP address of this computer
MX               Domain              List of <preference, host> pairs            Refers to mail server(s) to handle mail addressed to this node
PTR              Host                Domain name                                 Holds canonical name of host
TXT              Any kind            Arbitrary text                              Text string
SOA              Zone                Parameters governing zone                   Holds information on represented zone
NS               Zone                Domain name of authoritative name server    Refers to a name server for this zone
HINFO            Host                Machine architecture and operating system   Holds information on the host this node represents
CNAME            Node                Domain name for alias                       Symbolic link
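For instance, individual record types can be queried programmatically; the sketch below assumes the third-party dnspython package (pip install dnspython) and a domain that actually publishes MX records:

    import dns.resolver   # third-party package, an assumption of this sketch

    # Ask the configured name servers for the MX (mail exchanger) records.
    for record in dns.resolver.resolve("example.com", "MX"):
        print(record.preference, record.exchange)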

Berkeley Internet Name Domain (BIND) is an implementation of DNS for machines running the UNIX OS. Client programs link in library software as the resolver. DNS name servers run the named daemon. BIND permits three types of servers, which are primary, secondary and caching-only servers. The named program implements just one of these, as per the configuration file contents.

Typically, an organization has one primary and one or more secondary servers that offer name serving on different LANs at the site. In addition to this, caching-only servers are run by individual machines to minimize network traffic and speed up response time.

6.9 Directory Services

Directory services are attribute-based naming systems. A service that stores bindings between names and attributes, and that looks up entries matching attribute-based specifications, is called a directory service. Some of the examples are LDAP, X.500 and Microsoft's Active Directory Services. A directory service returns the attributes of matching objects. Attributes are more powerful than names: for example, programs can usually be written to select objects by their attributes rather than by their names.
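A toy sketch of such attribute-based selection, together with the register/deregister operations the discovery service below needs; all service names and attributes are invented:

    # In-memory directory: register/deregister plus attribute-based lookup.
    REGISTRY = {}

    def register(name, **attributes):
        REGISTRY[name] = attributes

    def deregister(name):
        REGISTRY.pop(name, None)

    def lookup(**wanted):
        # Return names of entries whose attributes match all requested values.
        return [name for name, attrs in REGISTRY.items()
                if all(attrs.get(k) == v for k, v in wanted.items())]

    register("printer-lobby", kind="printer", tech="laser", colour=True)
    register("printer-cafe", kind="printer", tech="inkjet", colour=False)
    print(lookup(kind="printer", colour=True))   # -> ['printer-lobby']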
- A discovery service is a directory service that registers the services provided in spontaneous networking. In spontaneous networking, devices get connected at any moment of time without any warning, and no administrative preparation is carried out for these devices when they connect to the network. It is required that the set of clients and services be registered transparently, without any human intervention. To support this, a discovery service offers an interface for registering and de-registering these services automatically. It also offers an interface for clients to look up these services. For example, a customer in a hotel should be able to connect his laptop to a printer automatically, without configuring it manually.


- A printing service may use a discovery service that finds available network printers fulfilling a requirement. The printing service may specify whether the printer is "laser" or "inkjet", whether it offers colour printing or not, the location of the printer, etc.
- The Jini system is developed for spontaneous networking. It assumes a JVM runs on all the machines and that machines communicate with all other machines using remote method invocation. It offers facilities to discover a service, for transactions, and for shared data places. The related components in the Jini system are lookup services, Jini clients, and Jini services. A Jini client uses the lookup service. To locate the lookup service, the client multicasts to a well-known IP multicast address. Lookup services listen on a socket bound to the same address in order to receive such requests.
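This multicast rendezvous can be sketched with standard sockets; the group address and port below are arbitrary illustrative values, not the ones Jini actually uses:

    import socket
    import struct

    GROUP, PORT = "224.0.23.99", 5000   # illustrative multicast group and port

    def lookup_service_listener():
        # A lookup service binds a socket to the well-known group address.
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", PORT))
        membership = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership)
        return sock                      # client requests arrive on this socket

    def client_probe():
        # A client multicasts its discovery request to the same group.
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.sendto(b"WHO-IS lookup-service", (GROUP, PORT))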

6.10 The Global Name Service (GNS)



- The Global Name Service (GNS) was designed at DEC Systems Research Center. It offers facilities for resource location, authentication and mail addressing. Following are the design goals of GNS:
o To handle an arbitrary number of names and to serve an arbitrary number of organizations.
o A long life time.
o High availability.
o Fault isolation : a local failure does not affect the entire system.
o Tolerance of mistrust.
- These goals indicate that any number of computers and users can be added to the system; hence GNS considered support for scalability. If the organizational structure changes, then the structure of the name space may also change. The service should also accommodate changes in the names of individuals, organizations etc.
- Considering the large size of the distributed system and naming database, caching is used in GNS. It assumes that changes in the database will occur infrequently, and hence slow propagation of updates is adopted. Clients can detect and recover from the use of out-of-date naming data.
- The GNS naming database is composed of a tree of directories which hold names and values. Path names are used to refer to a directory, similar to a UNIX file system. Each directory is assigned a unique directory identifier (DI); a directory holds a list of names and references. The leaves of the directory tree also hold values, which are further organized in value trees.
- In GNS, names consist of two parts: <directory name, value name>. The first part identifies a directory and the second refers to a value tree. The attributes of user Rajesh in directory /A/B/C can be stored in the value tree named <DI of Root directory/A/B/C, Rajesh>. The password of Rajesh can then be referred to as <DI of Root directory/A/B/C, Rajesh/password>.
- The directory tree is partitioned and stored on many servers, and each partition is again replicated on many servers. If two or more concurrent updates take place, then consistency of the tree must be maintained. Two directory trees can be merged by inserting a new root above the roots of the two trees to be merged; the problem of identifying directories unambiguously after merging is solved by the unique directory identifier. Whenever a client uses a name such as /A/B/C, Rajesh, the local user agent adds the prefix of the working root to the directory, say <#567/A/B/C, Rajesh>. In the same way, the user agent deals with relative names referring to working directories; the user agent then sends this derived name to a GNS server.

- GNS maintains a table of well-known directories in which it lists all the directories used as working roots. This table is held in the current real root directory of the naming database. Whenever the real root of the naming database changes due to the addition of a new root, all GNS servers are informed about the location of the new root. In GNS, restructuring of the database can be done if any organizational changes occur.
[Figure: directory tree with root DI #567; the path /A/B/C leads to a directory holding the name Rajesh, whose value tree contains the password]
Fig. 6.10.1 : GNS directory tree
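A minimal sketch of the naming scheme, assuming a toy database and the working root #567 from the example above; a user agent prefixes the working root and then walks the directory tree followed by the value tree:

    # Toy GNS-style database: directory identifier -> tree of directories.
    DATABASE = {
        "#567": {"A": {"B": {"C": {"Rajesh": {"password": "secret"}}}}},
    }

    WORKING_ROOT = "#567"   # prefix the user agent adds to relative names

    def resolve(directory_name, value_name):
        # Full name = <working root + directory path, value name>.
        node = DATABASE[WORKING_ROOT]
        for part in directory_name.strip("/").split("/"):
            node = node[part]             # walk the directory tree
        for part in value_name.split("/"):
            node = node[part]             # then walk the value tree
        return node

    print(resolve("/A/B/C", "Rajesh/password"))   # -> secret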

6.11 The X.500 Directory Service

The X.500 Directory Service is used to satisfy descriptive queries, to look up names and attributes of other users or system resources. The use of such a service is quite diverse: for example, the enquiries can be white-pages accesses to obtain a user's email address, or yellow-pages queries for obtaining the names and telephone numbers of garages. Individuals or organizations can use the directory service to provide information about themselves and the resources they wish to offer for use in the network. It is possible for users to search the directory for information with only partial knowledge of its name, structure and contents. The ITU and ISO standards organizations have defined the X.500 Directory Service as a network service.

It is used for access to hardware and software services and devices. The X.500 servers maintain the data in a tree structure with named nodes, just like other name servers. Each node of the tree in X.500 stores a wide range of attributes, and entries can be searched by any combination of attributes. The X.500 name tree is called the Directory Information Tree (DIT). The entire directory structure, along with the data associated with the nodes, is called the Directory Information Base (DIB).


- Clients establish a connection with X.500 servers and access the directory by issuing requests. A client can contact any server. If the needed data is not in the part of the DIB held by the contacted server, it will either invoke other servers or redirect the client to another server. In X.500, servers are called Directory Service Agents (DSAs) and clients are called Directory User Agents (DUAs). Fig. 6.11.1 shows the service architecture of X.500; it is one of the navigation models, in which each DUA client interacts with a single DSA server, which further accesses the other DSA servers to satisfy the client's requests.

[Figure: each DUA connects to one DSA; the DSAs communicate among themselves to satisfy requests]
Fig. 6.11.1 : X.500 Service Architecture


Each entry in the DIB consists of a name and attributes. The name of an entry corresponds to the path from the root to the entry in the DIT. Absolute or relative path names can be used. A DUA can also establish a context that includes a base node, and then use shorter relative names that give the path from the base node to the entry.

- The directory can be accessed by using two access requests : read and search.
o read : An absolute or relative name of an entry is given, along with a list of attributes to be read. The DSA then navigates the DIT to locate the entry. If the part of the tree that includes the entry is not available on this DSA server, it passes the request to other DSA servers. The retrieved attributes are then returned to the client.
o search : This access request is attribute-based; a base name and a filter expression are supplied as arguments. The base name indicates the node in the DIT from which the search is to begin, and the filter expression is evaluated for every node below the base node. The search criterion is specified by the filter. The search command then returns the names of all entries for which the filter evaluates to TRUE.
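The two requests can be mimicked over a toy DIT; the tree, the attribute names and the filter below are invented for illustration:

    # Toy DIT: each node has attributes plus named children.
    DIT = {
        "attrs": {},
        "children": {
            "org=QMW": {
                "attrs": {"type": "organization"},
                "children": {
                    "cn=Rajesh": {"attrs": {"mail": "rajesh@dcs.qmw.ac.uk",
                                            "type": "person"},
                                  "children": {}},
                },
            },
        },
    }

    def read(path):
        # Navigate the DIT along an absolute name, return the attributes.
        node = DIT
        for part in path:
            node = node["children"][part]
        return node["attrs"]

    def search(base, fltr):
        # Evaluate the filter on the base node and every node below it.
        node = DIT
        for part in base:
            node = node["children"][part]
        hits, stack = [], [(list(base), node)]
        while stack:
            path, cur = stack.pop()
            if fltr(cur["attrs"]):
                hits.append(path)
            stack.extend((path + [name], child)
                         for name, child in cur["children"].items())
        return hits

    print(read(["org=QMW", "cn=Rajesh"]))
    print(search([], lambda a: a.get("type") == "person"))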
- The DSA interface offers operations to add, delete and modify entries. Access control is also provided for both query and update operations. As a result, access to some part of the DIT can be restricted to a user or group of users.

6.12 Designing Distributed Systems - Google Case Study

6.12.1 Google Search Engine


- From a distributed systems viewpoint, Google presents an attractive case with extremely challenging requirements, mainly in terms of scalability, reliability, performance and openness. Like any web search engine, the Google search engine returns an ordered list of the most relevant results matching the given query by searching the content of the Web. The search engine contains a set of services for crawling the Web and for indexing and ranking the searched pages.


- The crawler locates and retrieves the contents of the Web and passes the contents onto the indexing subsystem. This is carried out by a software service called Googlebot, which recursively reads a given web page, harvesting all the links from that web page and then scheduling further crawling operations for the harvested links.
- The indexing produces an index for the contents of the Web, which is a task on a much larger scale. More specifically, indexing produces an inverted index mapping words appearing in web pages and other textual web resources onto the positions where they occur in documents, including the accurate position in the document and other related information, for example the font size and capitalization. The index is also sorted to support efficient queries for words against locations.
- Indexing does not provide information about the relative importance of the web pages that contain a particular set of keywords. In ranking, a higher rank denotes the importance of a page, and it is used to make sure that important pages are returned nearer to the top of the list of results than lower-ranked pages. Ranking in Google also considers factors like the nearness of keywords on a page and their font size (large or small).
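A tiny sketch of the inverted-index idea, using a toy corpus, whitespace tokenization and word offsets as positions:

    from collections import defaultdict

    # Hypothetical corpus: document id -> text.
    DOCS = {1: "distributed systems are large systems",
            2: "name services in distributed systems"}

    # Inverted index: word -> list of (document id, position) pairs.
    index = defaultdict(list)
    for doc_id, text in DOCS.items():
        for pos, word in enumerate(text.split()):
            index[word].append((doc_id, pos))

    print(index["systems"])   # -> [(1, 1), (1, 4), (2, 4)]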

6.12.2 Google Applications and Services


web-based applications that also includes Google
4 Apart from the se arch engine, Google now offers a wide range of
ely supported by Google. It provides software
as a service (SAS), which
Apps. Now days, cloud computing also extensiv
ions. A Google Apps is the main example of it.
offers application-level software over the Internet, a5 web applicat
Google Docs, Google Sites, Google Talk and Google
Google Apps includes set of web based applications such as Gmail,
Calendar.
now offers its distributed systemsinfrastructure as a
Apart from SAS, with new addition of Google App Engine, Google
services, including its web search engine. The
cloud service. This cloud infrastructure supports all its applications and
cture, permitting other organizations to run
Google App Engine now offers external access to a part of this infrastru
s of Google applications:
their own web applications on the Google platform. Following are the example
messages.
Gmail : It is a mail system of Google that hosts
g of documents held on Google serversin
Google Docs : It is Web-based office suite which supports for editin
shared mode.
Google Sites : These are Wiki-like web sites having shared editing facilities.
6

Google Talk: It offers instant text messaging and Voice overIP.


000

on Google servers.
Googie Calendar : It is Web-based calendar having all data hosted
wikis and social networks.
Google Wave : It isa collaboration tool integrating email, instant messaging,
Google News It is automated news aggregatorsite.
0

ing high-resolution imagery and unlimited


Google Maps : This App offers scalable web-based world map includ
ighee a Selig, deeee

user generated overlays.


e with unlimited user-generated overlays.
oO Google Earth : This App offers scalable near-3D view ofthe glob
ucture of Google to other parties outside of Google.
oO Google App Engine : It offers distributed infrastr
considers scalability
Google infrastructure provides scalability and it uses approaches to scale in large scale. Google
expecting better results.
problem related to dealing with more data, more queries, and



- Google provides high reliability for the search functionality and the Apps it offers to customers. Its service level agreement contains 99.9% guarantees to offer services that include all the Google applications.
- Google allows low latency of user interactions to gain high performance. Google keeps the target of completing web searches in 0.2 seconds while achieving the throughput to react to all incoming requests and handling very large data sets. Google also highly supports openness, in order to accommodate the wide range of web applications to be developed in future.
- To satisfy the above-stated requirements, Google has developed the overall system architecture shown in Fig. 6.12.1. The underlying computing platform is at the bottom, and the well-known Google services and applications are at the top. The distributed infrastructure offering middleware support for search and cloud computing is provided by the middle layer. This infrastructure offers the common distributed system services for developers of Google services and Apps, and encapsulates key techniques that handle scalability, reliability and performance.
- This infrastructure bootstraps the development of new applications and services through reuse of the underlying system services and, more subtly, offers largely coherence to the growing Google code base by implementing common strategies and design principles.

[Figure: Google applications and services at the top, distributed infrastructure in the middle, computing platform at the bottom]
Fig. 6.12.1 : The Overall System Architecture of Google

6.12.3 Google Infrastructure

- Fig. 6.12.2 shows the Google infrastructure. The system is constructed as a set of distributed services offering core functionality to developers. The protocol buffers component offers a common serialization format for Google, including the serialization of data supplied in remote invocation. The Google publish-subscribe service supports the efficient dissemination of events to potentially large numbers of subscribers.

[Figure: layers of the Google infrastructure - Distributed Computation; Data and Coordination (GFS, Chubby, BigTable); Communication paradigms (protocol buffers, publish-subscribe)]
Fig. 6.12.2 : Google Infrastructure

- Data and coordination services offer unstructured and semi-structured abstractions for the storage of data, coupled with services to support coordinated access to the data. GFS offers a distributed file system optimized for the particular requirements of Google applications and services. Chubby supports coordination services and the ability to store small volumes of data. BigTable offers a distributed database giving access to semi-structured data.
- Distributed computation services provide the means for carrying out parallel and distributed computation over the physical infrastructure. MapReduce supports distributed computation over potentially very large datasets.

- Sawzall provides a higher-level language for the execution of such distributed computations. Communication is supported through RMI, RPC and the publish-subscribe service.
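To make the computation style concrete, here is a minimal MapReduce-flavoured word count; the three phases are simulated in a single process, whereas the real service distributes them across many machines:

    from collections import defaultdict

    documents = ["big table big file", "big file system"]   # toy input split

    # Map phase: emit (word, 1) pairs from each input record.
    pairs = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)

    # Reduce phase: aggregate the values for each key.
    result = {word: sum(counts) for word, counts in groups.items()}
    print(result)   # -> {'big': 3, 'table': 1, 'file': 2, 'system': 1}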


6.12.4 The Google File System (GFS)

- The GFS mainly aims at the demanding and rapidly growing needs of Google's search engine and the Google web applications. Following are the requirements for GFS :
o GFS must run reliably on the physical architecture.
o GFS is optimized for the patterns of usage within Google, both in terms of the types of files stored and the patterns of access to those files.
o GFS must fulfill all the requirements for the Google infrastructure as a whole.

- GFS provides a conventional file system interface, offering a hierarchical namespace with individual files identified by pathnames. The following file operations are supported; the main GFS operations are very similar to those for a flat file service:
o create - create a new instance of a file;
o delete - delete an instance of a file;
o open - open a named file and return a handle;
o close - close a given file specified by a handle;
o read - read data from a specified file;
o write - write data to a specified file.

4 The parameterfor GFS read and writes operations specify a starting offset within the file. The API offers, snapshot and
record appendoperations. The snapshot operation offers an efficient mechanism to make a copy ofa particularfile or
directory tree structure. The record append operation supports the commonaccess pattern whereby multiple clients
carry out concurrent appends to a givenfile.

GFS Architecture

- Fig. 6.12.3 shows the overall GFS architecture. In GFS, files are stored in fixed-size chunks, where each chunk is 64 megabytes in size. This chosen size is very large compared to other file systems. It offers highly efficient sequential reads and appends of large amounts of data.
[Figure: clients use a client library; control flow goes to the master, while data flows directly between clients and chunkservers]

Fig. 6.12.3 : Overall GFS Architecture
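With fixed 64-megabyte chunks, the chunk holding a given byte offset follows from simple arithmetic, which is one reason client-master interactions stay cheap; a sketch:

    CHUNK_SIZE = 64 * 1024 * 1024        # 64 MB fixed-size chunks

    def locate(offset):
        # Which chunk holds this byte, and where inside the chunk it sits.
        return offset // CHUNK_SIZE, offset % CHUNK_SIZE

    print(locate(200 * 1024 * 1024))     # byte 200 MB -> chunk 3, 8 MB into it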



- Clients access GFS through a client library that supports the standard operations on files, mapping them down to operations on individual chunks. The figure shows an instance of a GFS file system as it maps onto a given physical cluster: the master and the chunkservers jointly offer a file service to large numbers of clients concurrently accessing the data.
- The master manages metadata about the file system. It defines the namespace for files, access control information and the mapping of each particular file to a set of chunks. All chunks are replicated, and the location of the replicas is maintained in the master; replication provides reliability in case any software or hardware failure occurs.
- The master is centralized, which can be a single point of failure. The key metadata is therefore serialized in an operation log, which helps in recovery in the event of crashes; the operations log is replicated on several remote machines, so the master can be restored after a crash. GFS avoids heavy use of caching, and clients do not cache file data. However, location information of chunks is cached at clients when first accessed, in order to reduce interactions with the master.
machines, so the master can be on ature. GFS avoids heavy useof caching and clients do no cache file data.
sng location information ofichurte4
s is cached atclients whenfirst accessed, in order to reduce interactions with the

distributedfile system.
Q. 1 Explain in short services providedby
distributed file system.
Q.2 Explain desirable features of good

Q.3 Explain different file models.


essing models..
Q.4 Explain different file acc
een client and server.
er en t techniqu es to transfer data betw
Q.5 Explain diff
ate modifications.
n different po licies to propag
Q.6 Explai
on schemes.
t cache validati
Q.7. Explain differen
ing and r emote service.
erentiate b etween cach
Q.8 Diff ion.
ing a nd replicat |
n cach
Q.9 Differentiate betwee
p lication? Explain.
vantages of re
Q.10 What are the ad
ion transparency:

Expiainin detail replicat spdato multiple copies offile.


Q.14
hes to
coPioS
for updating of multiple
:i s i
xplai i
syste# (NFS)network file
Q.14 Explain in detail

Q.15 Explain architecture a ve System (AFS)-


addresses? Explai
Q. 16 Explain in detal ea
ide
Q.17. Whatare names;
cam YechKnowletge
Publications
saiesireiei
a

Scanned with CamScanner


Distributed Computing (MU-Sem 8-Comp.) 6-30 Distributed File Systems and Name Service

Q. 20 Write note on domain name system.
Q. 21 Write note on directory services.
Q. 22 Write note on X.500 Directory Service.
Q. 23 Write note on global name service.
Q. 24 Explain Google applications and services.
Q. 25 Explain the overall system architecture of Google.
Q. 26 Explain Google file system.

