AOS Material


What is a Distributed System?

- From various textbooks:
  - “A distributed system is a collection of independent computers that appear to the users of the system as a single computer.”
  - “A distributed system consists of a collection of autonomous computers linked to a computer network and equipped with distributed system software.”
  - “A distributed system is a collection of processors that do not share memory or a clock.”
  - “Distributed systems is a term used to define a wide range of computer systems, from weakly-coupled systems such as wide area networks to very strongly coupled systems such as multiprocessor systems.”

What is a Distributed System? (cont.)

- A distributed system is a set of physically separate processors connected by one or more communication links.
  [Figure: processors P1–P5 connected by a network]
- Is every system with more than two computers a distributed system?
  - Email, ftp, telnet, world-wide-web
  - Network printer access, network file access, network file backup
  - We don’t usually consider these to be distributed systems...

Two Taxonomies for Classifying Computer Systems

- Michael Flynn (1966)
  - SISD — single instruction, single data
  - SIMD — single instruction, multiple data
  - MISD — multiple instruction, single data
  - MIMD — multiple instruction, multiple data
- More recent (Stallings, 1993)
- Tightly coupled ≈ parallel processing
  - Processors share clock and memory, run one OS, communicate frequently
- Loosely coupled ≈ distributed computing
  - Each processor has its own memory, runs its own OS (?), communicates infrequently

Classification of parallel and distributed computers (Tanenbaum, date?)

- MIMD architectures divide into:
  - tightly coupled multiprocessors (shared memory)
    - bus-based (e.g., Sequent) or switched (e.g., Ultracomputer)
  - loosely coupled multicomputers (distributed / private memory)
    - bus-based (e.g., workstations on a LAN) or switched (e.g., hypercube)

Classification of Operating Systems

- Multiprocessor Operating System
  - Tightly-coupled software (single OS) running on tightly-coupled hardware
  - A process can run on any processor
    - Single ready queue!
  - All memory is shared
  - File system similar to that on non-distributed systems
- Network Operating System
  - Loosely-coupled hardware
  - Loosely-coupled software
    - Each computer runs its own OS
    - User knows which machine he/she is on
  - Goal: provide access to other machines on the network, share resources
  - Typical utility programs: rlogin, rcp, telnet, ftp

Classification of Operating Systems (cont.)

- “True” Distributed Operating System
  - Loosely-coupled hardware
    - No shared memory, but provides the “feel” of a single memory
  - Tightly-coupled software
    - One single OS, or at least the feel of one
  - Machines are somewhat, but not completely, autonomous
  [Figure: machines M1–M5 with processors P1–P5, Disk1, Printer4, and Disk5, connected by a network]
Network Protocols

- ISO OSI 7-layer model
- TCP/IP suite
  - TCP/UDP
  - IP
  - Ethernet/Token Ring
  - ICMP

Protocol Layers

- Network communication is divided up into seven layers
  - Each layer deals with one particular aspect of the communication
  - Each layer uses a set of routines provided by the layer below it
  - Each layer ignores lower-level (and higher-level) details and problems
- Each layer takes a message passed down to it by a higher layer, adds some header information, and passes the message on to a lower layer
  - Each layer has the illusion of peer-to-peer communication
  - Eventually the message reaches the bottom layer and gets physically sent across the network

ISO OSI 7-layer protocol scheme

[Figure: the seven OSI layers stacked from the application layer down to the physical layer]

ISO OSI protocol summary

- Application layer — provides network access to application programs
  - Examples: telnet, ftp, email (SMTP)
- Presentation layer — provides freedom from machine-dependent representations; maintains structured information (arrays, records, etc.); translates between machine representations if necessary; encryption/decryption, compression/decompression
- Session layer — provides communication/synchronization between processes; not required in connectionless communication
  - Example: Remote Procedure Call (RPC)
- Transport layer — accepts messages of arbitrary length between hosts; error control for out-of-sequence and missing packets
  - Examples: TCP (connection-oriented), UDP (connectionless)

ISO OSI protocol summary (cont.)

- Network layer — provides switching and routing needed to (1) establish, maintain, and terminate switched connections, and (2) transfer data (packets) between end systems
  - Examples: IP (connectionless), X.25 (connection-oriented)
- Data link layer — reliably transfers packets (broken up into frames) over a communication link; error correction within frames; flow control
  - Examples: Ethernet
- Physical layer — converts 1s and 0s into electrical or optical signals, and transmits frames of bits across a wire / cable
  - Examples: RS-232-C (serial communication lines), X.21

TCP/IP Protocol suite

- Upper layers
  - ftp — file transfer protocol
    - Sends files from one system to another under user command
    - Handles both text and binary files
    - Supports userids and passwords
  - telnet — remote terminal protocol
    - Lets a user at one terminal log onto a remote host
  - smtp — simple mail transfer protocol
    - Transfers mail messages between hosts
    - Handles mailing lists, forwarding, etc.
    - Does not specify how mail messages are created
  - dns — domain name service
    - Maps names into IP addresses
    - A domain may be split into subdomains
    - Name servers are usually replicated to improve reliability

TCP

- TCP — Transmission Control Protocol
  - Connection-oriented (3-way handshake)
  - On the transmit side, breaks a message into packets, assigns sequence numbers, and sends each packet in turn
    - Flow control — doesn’t send more packets than the receiver is prepared to receive
  - On the receive side, receives packets and reassembles them into messages
    - Computes a checksum for each packet and compares it to the checksum sent; discards the packet if the checksums don’t agree
    - Reorders out-of-order packets
  - Reliable
    - Packets must be acknowledged
    - If the sender doesn’t receive an acknowledgment after a short period, it retransmits that packet
  - Congestion control — don’t overwhelm the network

IP

- IP — Internet Protocol
  - Connectionless
  - Unreliable
    - Packets may be lost, duplicated, or delivered out of order
    - Sends to a particular IP address and port
  - Forwards a packet from the sender through some number of gateways (routers) until it reaches the final destination
    - A gateway accepts a packet from one network and forwards it to a host or gateway on another network
  - The destination has a specific Internet address, which is composed of two parts:
    - network part — the network the host is on
    - address part — the specific host on that network
  - Routing is dynamic — each gateway chooses the next gateway to send the packet to
    - Gateways send each other information about network congestion and gateways that are down

Ethernet

- The network is a bus — broadcast to anyone who cares to listen
- Every Ethernet device (everywhere in the world!) has a unique address
  - The Institute of Electrical and Electronics Engineers (IEEE) allocates addresses to manufacturers, who build a unique address into each Ethernet device
- Transmission — Carrier Sense Multiple Access with Collision Detection (CSMA/CD)
  - Carrier sense: listen before broadcasting, defer until the channel is clear, then broadcast
  - Collision detection: listen while broadcasting
    - If two hosts transmit at the same time — a collision — the data gets garbled
    - Each jams the network (a short jam signal is issued), then waits a random (but increasing) amount of time, and tries again

Token Ring

- Devices are joined in a ring
- Transmission
  - A unique message (the token) is circulated in the ring
  - The token is free when no device is transmitting
  - To transmit, a host waits for a free token, attaches its message to it, sets the token status to busy, and sends it on
  - The destination removes the message, sets the token status to free, and sends it on
- Advantage: not sensitive to load
- Disadvantage: complexity — token maintenance is complex

ICMP

- A part of IP that is less widely known is the Internet Control Message Protocol (ICMP)
  - Allows gateways and hosts to exchange bootstrapping information, report errors, and test the liveness of the network
- Some useful programs using ICMP:
  - traceroute (/usr/contrib/bin/traceroute)
    - Displays the route taken to reach the destination, and the time for each hop
    - Sends multiple (?) 20-byte packets
  - ping (/usr/sbin/ping)
    - Tests that the destination is up and reachable
    - Sends an ICMP echo request to the destination
    - The destination sends an ICMP echo reply
    - Sends 64-byte packets repeatedly

Communication Primitives

- Message passing model
  - communication primitives
    - blocking vs. non-blocking
    - synchronous vs. asynchronous
    - direct vs. indirect
- Remote procedure calls
  - motivation
  - overview
  - binding
  - parameter and result passing
  - stub generation
  - execution semantics

Message passing

- Communication primitives:
  - send(destination-process, message)
  - receive(source-process, message)
- Blocking vs. non-blocking primitives (see the sketch after this list)
  - A message is copied at least three times: sender (user buffer -1-> kernel buffer) -2-> receiver (kernel buffer -3-> user buffer)
  - Non-blocking — faster, harder to code, riskier, requires additional OS support
    - send — returns as soon as the message is copied to a kernel buffer
    - receive — provides a (user) buffer for the message to be copied into, and returns
  - Blocking — safer, easier to think about, slower
    - unreliable — block until the message is sent/received
    - reliable — receipt is confirmed by acknowledgement; block until the ack is sent/received
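
The blocking/non-blocking distinction above can be illustrated with ordinary sockets. A minimal sketch (my own illustration, not from the slides) using Python's socket module; the buffer size and message are arbitrary:

import socket

# A connected socket pair stands in for the sender and receiver sides.
sender, receiver = socket.socketpair()

# Blocking receive: the call does not return until a message has been
# copied into the user-supplied buffer (or the peer closes the connection).
sender.sendall(b"hello")
print("blocking receive got:", receiver.recv(1024))

# Non-blocking receive: the call returns immediately; if no message has
# reached the kernel buffer yet, the OS reports that instead of waiting.
receiver.setblocking(False)
try:
    receiver.recv(1024)
except BlockingIOError:
    print("non-blocking receive: no message available yet")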

Message passing (cont.)

- Synchronous vs. asynchronous primitives
  - synchronous (also called rendezvous) — send blocks until receive executes on the receiving computer; safer, easier, less concurrent
  - asynchronous — multiple sends can be executed before the receives (messages are buffered); more dangerous (what to do with messages to a crashed receiver?), complex, concurrent

Direct vs. indirect communication

- Direct communication — explicitly name the process you’re communicating with
  - send(destination-process, message)
  - receive(source-process, message)
  - A link is associated with exactly two processes
    - Between any two processes, there exists at most one link
    - The link may be unidirectional, but is usually bidirectional
- Indirect communication — communicate using mailboxes (ports), (usually) owned by the receiver
  - send(mailbox, message)
  - receive(mailbox, message)
  - A link is associated with two or more processes that share a mailbox
    - Between any two processes, there may be a number of links
    - The link may be either unidirectional or bidirectional

Why is message-passing not ideal?

- Disadvantages of client-server communication via message passing:
  - Message passing is I/O oriented, rather than request/result oriented
  - The programmer has to explicitly code all synchronization
  - The programmer may have to code format conversion, flow control, and error control
- Goal — heterogeneity — support different machines, different OSs
  - Portability — applications should be trivially portable to machines of other vendors
  - Interoperability — clients will always get the same service, regardless of how the vendor has implemented that service
  - The OS should handle data conversion between different types of machines

Remote Procedure Call (RPC)

- RPC mechanism:
  - Hides message-passing I/O from the programmer
  - Looks (almost) like a procedure call — but the client invokes a procedure on a server
- RPC invocation (high-level view):
  - The calling process (client) is suspended
  - The parameters of the procedure are passed across the network to the called process (server)
  - The server executes the procedure
  - The return parameters are sent back across the network
  - The calling process resumes
- Invented by Birrell & Nelson at Xerox PARC, described in the February 1984 ACM Transactions on Computer Systems

Remote Procedure Call (RPC) (cont.)

- Each RPC invocation by a client process calls a client stub, which builds a message and sends it to a server stub
  [Figure: client -> client stub (pack parameters) -> kernel -> network -> kernel -> server stub (unpack parameters) -> server, and the reverse path for the results]
- The server stub uses the message to generate a local procedure call to the server
- If the local procedure call returns a value, the server stub builds a message and sends it to the client stub, which receives it and returns the result(s) to the client

I/O protection

- To prevent illegal I/O, or simultaneous I/O requests from multiple processes, the OS typically performs all I/O via privileged instructions
  - User programs must make a system call to the OS to perform I/O
- When a user process makes a system call:
  - A trap (software-generated interrupt) occurs, which causes:
    - The appropriate trap handler to be invoked using the trap vector
    - Kernel mode to be set
  - The trap handler:
    - Saves process state
    - Performs the requested I/O (if appropriate)
    - Restores state, sets user mode, and returns to the calling program

RPC Invocation (more detailed)

1. Client application procedure calls the client stub
2. Client stub packs parameters into a message and traps to the kernel
3. Kernel sends the message(s) to the remote kernel
4. Remote kernel passes the message(s) to the server stub
5. Server stub unpacks the parameters and calls the server application procedure
6. Server application executes the procedure and returns the results to the server stub
7. Server stub packs the result(s) in message(s) and traps to the kernel
8. Remote kernel sends the message(s) to the local kernel
9. Local kernel passes the message(s) to the client stub
10. Client stub unpacks the result(s) and returns them to the client application

Binding

- Binding = determining the server and remote procedure to call
- Static binding — addresses of servers are hardwired (e.g., Ethernet number)
  - Inflexible if a server changes location
  - Poor if there are multiple copies of a server
- Dynamic binding — dynamically assign server names
  - Broadcast a “where is the server?” message, wait for a response from the server
  - Use a binding server (binder)
    - Servers register / deregister their services with the binding server
    - When a client calls a remote procedure for the first time, it queries the binding server for a registered server to call
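
The stub machinery described above is what RPC libraries automate. As a rough illustration only (not the course's RPC system), Python's standard xmlrpc module plays both roles: the server registers a procedure, and the client-side proxy acts as the stub that marshals arguments, sends the message, and unmarshals the result. The host, port, and the query procedure below are made-up for the example.

import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

def query(key, number):
    # Server-side procedure; a real server would consult a database here.
    return [key * i for i in range(number)]

# Server side: register the remote procedure and serve in the background.
server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
server.register_function(query, "query")
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: the proxy (stub) packs the parameters into a request message,
# sends it, suspends the caller until the reply arrives, and unpacks the result.
client = ServerProxy("http://localhost:8000")
print(client.query(7, 3))   # looks like a local call, but runs on the server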

Parameter and result passing

[Figure: the client calls nr_hits = query(key, 10, result); the client stub marshals the parameters into a message and sends it across the network; the server stub unmarshals them and calls the server procedure int query(int key, int number, tuple values) { ... return (num_hits); }; the result is marshalled, sent back, and unmarshalled by the client stub]

Parameter passing (cont.)

- Handle different internal representations (a sketch follows below)
  - ASCII vs. EBCDIC vs. ...
  - 1’s complement vs. 2’s complement vs. floating-point
  - Little endian vs. big endian
  - Establish a canonical (standard) form?
- What types of passing are supported?
  - A remote procedure can’t access global variables — must pass all necessary data
  - Call-by-value (procedure gets a copy of the data) — pass parameters in the message
  - Call-by-reference (procedure gets a pointer to the data)
    - Can’t do call-by-reference
    - Do call-by-copy / restore instead
      - Instead of the pointer, pass the item pointed to
      - The procedure modifies it, then passes it back
    - Inconsistency if the client doesn’t block
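
One way to deal with the representation issues listed above is to marshal everything into an agreed canonical form. Below is a small sketch using Python's struct module; big-endian ("network order") is taken as the canonical form, and the three-field record layout is invented for illustration.

import struct

HEADER = ">idI"   # canonical form: big-endian int32, float64, uint32 length

def marshal(key, weight, annotation):
    # Pack fixed-size fields in network byte order, then a length-prefixed string.
    data = annotation.encode("utf-8")
    return struct.pack(HEADER, key, weight, len(data)) + data

def unmarshal(buf):
    key, weight, length = struct.unpack_from(HEADER, buf, 0)
    offset = struct.calcsize(HEADER)
    return key, weight, buf[offset:offset + length].decode("utf-8")

msg = marshal(42, 3.14, "little- and big-endian hosts decode this the same way")
print(unmarshal(msg))
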
Generating stubs

- C and C++ may not be descriptive enough to allow stubs to be generated automatically:

    typedef struct {
        double item1;
        int item2;
        char *annotation;
    } tuple;

    char add(int key, tuple value);
    char remove(int key, tuple value);
    int query(int key, int number, tuple values[ ]);

  - Which are in, in-out, and out parameters?
  - Exactly what size are parameters (e.g., integers, arrays)?
  - What does it mean to pass a pointer?
- Instead, use an interface definition language — e.g., OSF’s DCE Interface Definition Language (IDL) — to specify the db interface’s procedure signatures for stub generation:

    typedef struct {
        double item1;
        long item2;
        [string, ptr] ISO_LATIN_1 *annotation;
    } tuple;

    boolean add (
        [in] long key,
        [in] tuple value
    );
    boolean remove (
        [in] long key,
        [in] tuple value
    );
    long query (
        [in] long key,
        [in] long number,
        [out, size_is(number)] tuple values[ ]
    );

Error handling, semantics

- An RPC call may fail due to computer or communications failure
- What to do if an RPC call fails?
- Three execution semantics:
  - “at least once”
    - if the call succeeds — at least one execution of the remote procedure happened
    - if it fails — none, partial, or multiple executions
  - “exactly once”
    - if it succeeds — exactly once
    - if it fails — none, partial, or one
  - “at most once”
    - if it succeeds — exactly once
    - if it fails — none

Stateful vs. stateless server

- Stateful server — the server maintains state information for each client, for each file
  - Connection-oriented (open file, read / write file, close file)
  - Enables server optimizations like read-ahead (prefetching) and file locking
  - Difficult to recover state after a crash
- Stateless server — the server does not maintain state information for each client
  - Each request is self-contained (file, position, access)
    - Connectionless (open and close are implied)
  - If the server crashes, the client can simply keep retransmitting requests until it recovers
  - No server optimizations like the above
  - File operations must be idempotent

Introduction to distributed algorithms

- definition
- relation to practice
- models
- main concepts
  - states and actions
  - computations and properties
- causality relationships
- wave algorithms
  - ring algorithm
  - tree algorithm
  - echo algorithm
  - polling algorithm

What is a distributed system

- A distributed system is a collection of independent processes that communicate with each other and cooperate to achieve a common goal
- Each process has its own independent set of instructions and can proceed at its own speed
- The only way for one process to coordinate with others is via communication
- Thus the system consists of a set of processes connected by a network of communication links

Why distributed systems

- A distributed system is a convenient abstraction to model various practical computer architectures:
  - multiprocess OS
  - multiprocessor computer architecture
  - local area network
  - Internet
  - VLSI chip
- The abstraction of the model allows designing algorithms that are correct irrespective of implementation details

Model classification

- By time
  - (fully) synchronous model — processes take steps simultaneously (execution proceeds in synchronous rounds)
  - partially synchronous model — processes proceed independently but some timing information (such as the maximum difference between the slowest and fastest process) is known
  - fully asynchronous model — processes take steps in arbitrary order with arbitrary speeds
- By communication primitives
  - message-passing — processes do not share memory, are connected by channels, and communicate by sending and receiving messages
  - shared-memory — processes share registers to which they can apply read/write or more complex operations (like test&set)
    - note that, unlike in the PRAM model, the order of operations on shared-memory objects is non-deterministic

States and Actions

- An assignment of values to all variables in the distributed system is a (global) state of the distributed system. Let Z be the set of the system’s states.
  - If the system is message-passing, the state includes an assignment of messages to the communication channels
- An atomic operation is an operation that cannot interleave with other operations
- An atomic operation takes the system from one state to another
- A distributed algorithm defines a system: a set of initial states and a set of possible atomic operations
- An operation is enabled at a certain state if it can be executed at that state
- The execution of an action is an event
- A fixpoint (quiescent state) is a state where none of the actions are enabled

Computations

- A computation is a sequence of states such that the first state is an initial state and each subsequent state is obtained by executing an enabled action at the preceding state
  - a computation is either infinite or ends in a fixpoint
- A computation is weakly fair if no action is enabled in infinitely many consecutive states
- prefix — a finite segment of a computation that starts in an initial state and ends in a state
- suffix — a computation segment that starts in a state and is either infinite or ends in a fixpoint
  - a computation is obtained by joining a prefix and a suffix
  - can two separate computations share a prefix? a suffix? both?

Program Properties

- property — a set of correct computations
  - safety — there is a set of prefixes none of which a correct computation may contain (something “bad” never happens)
  - liveness — for every prefix there is a suffix whose addition creates a correct computation (something “good” eventually happens)
- every property is an intersection of safety and liveness properties
- example:
  - mutual exclusion problem (MX) — a set of processes alternate executing a critical section (CS) of code and other code
  - a solution to the mutual exclusion problem has to satisfy three properties:
    - exclusivity — only one process at a time is allowed to enter the CS
    - progress — some process eventually enters the CS
    - lack of starvation — each process requesting the CS is eventually allowed to enter
  - which properties are liveness? safety? both? which are redundant?

Causality relationship

- An event is usually influenced by part of the state.
- Two consecutive events influencing disjoint parts of the state are independent and can occur in reverse order
- This intuition is captured in the notion of the causality relation ≺
- For message-passing systems:
  - if two events e and f are different events of the same process and e occurs before f, then e ≺ f
  - if s is a send event and r is the corresponding receive event, then s ≺ r
- For shared-memory systems:
  - a read and a write on the same data item are causally related
- ≺ is transitive
- ≺ is a partial order
- if neither a ≺ b nor b ≺ a, then a and b are concurrent: a || b

Wave Algorithms

- A wave algorithm satisfies the following three properties:
  - termination: each computation is finite
  - decision: each computation contains at least one decide event
  - dependence: in each computation each decide event is causally preceded by an event in each process
- initiator (starter) — a process that starts executing its actions spontaneously
- non-initiator (follower) — starts execution only when it receives a message
- wave algorithms differ in many respects; some features:
  - centralized (single-source) — one initiator; decentralized (multi-source) — multiple initiators
  - topology — ring, tree, clique, etc.
  - initial knowledge:
    - each process knows its own unique name
    - each process knows the names of its neighbors
  - number of decisions to occur in each process
- usually wave algorithms exchange messages with no content — tokens

Ring algorithm

- Processes are arranged in a unidirectional ring (each process has a sense of direction, or knowledge of one dedicated neighbor)
- The initiator sends a message 〈tok〉 along the cycle
- Each process passes it on
- When it returns to the initiator, the initiator decides
- Theorem: the ring algorithm is a wave algorithm
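
The ring algorithm above can be simulated in a few lines. A sketch of my own (not the course's pseudocode): the initiator injects a 〈tok〉 message, every process forwards it to its single successor, and the initiator decides when the token comes back.

def ring_wave(n, initiator=0):
    """Simulate the ring wave algorithm on n processes arranged in a cycle."""
    hops = 0
    current = initiator               # the initiator sends <tok> to its neighbor
    while True:
        current = (current + 1) % n   # unidirectional ring: one dedicated neighbor
        hops += 1
        if current == initiator:      # the token returned: the initiator decides
            return ["decide at P%d" % initiator], hops

print(ring_wave(5))   # (['decide at P0'], 5): one decide event, n messages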

Tree algorithm

- Operates on a tree network (can work on a spanning tree of an arbitrary network) — no root, edges are undirected (bi-directional)
- The leaves of the tree initiate the algorithm
- If a process has received a message from all neighbors but one (initially true for leaves), the process sends a message to the remaining neighbor
- If a process gets messages from all neighbors, it decides
- Excluding the forall statement, how many processes can decide? What are these processes?
- Why do we need the forall statement?
- How many messages are sent in the algorithm?
- Is this a wave algorithm?

Polling algorithm

- Works on cliques (complete networks)
- One initiator (centralized)
- How many processes decide?
- How many messages are exchanged in the algorithm?
- What other topology can this algorithm be used in?
- Is this a wave algorithm?

Chang’s Echo algorithm

- Works on networks of arbitrary topology
- One initiator (centralized)
- The initiator sends messages to all neighbors
- When a non-initiator receives the first message, it forwards it to all other neighbors; when it gets tokens from all other neighbors, it replies back
- How many processes decide?
- How many messages are exchanged in the algorithm?
- Is this a wave algorithm?

Traversal algorithms

- sequential polling algorithm
- traversing connected networks (tree construction)
  - complexity measures
  - tree terminology
  - Tarry’s algorithm
  - classical algorithm
  - Awerbuch’s algorithm

What is a traversal algorithm

- A traversal algorithm is a wave algorithm with the following properties:
  - each computation contains one initiator, which starts the computation by sending one message
  - when a process receives a message, it either sends out one message or decides
- A traversal algorithm can be viewed as follows: there is a single token that “visits” all processes
- We have already looked at one traversal algorithm — what is it?

Sequential polling algorithm

- The same as polling, only the visiting of nodes is done in sequence rather than in parallel
- Is it a traversal algorithm?
- How many messages are sent in the algorithm?

Measuring efficiency of algorithms

- Time complexity. Time is idealized:
  - a process executes internal events in zero time
  - one time unit passes from the moment a message is sent until it is available for receipt
- Message complexity — the number of messages it takes the algorithm to carry out a specified task
- For traversal algorithms, time complexity is equal to message complexity. Why?
- Is this true for all wave algorithms?

Tree terminology

- spanning tree — a tree that contains all nodes of the network
- leaf — a node that has just one incident edge in the spanning tree
- rooted tree — a tree that has one distinguished node called the root
- node’s ancestor — a node that lies on the path from this node to the root
- father — closest ancestor
- node’s descendant — a node that lies on the path from the node to a leaf
- child — closest descendant
- frond edge — an edge that is not in the spanning tree
- depth-first spanning tree — a tree where a frond edge connects only an ancestor and a descendant

Tarry’s algorithm

- Works for arbitrary networks
- The initiator forwards the token to one of its neighbors; each neighbor forwards the token to all other nodes and, when done, returns the token
- A node is free to choose any node to forward the token to
- Is Tarry’s a traversal algorithm?
- Does Tarry’s algorithm produce a spanning tree?
- Is it a depth-first spanning tree?
- What is the complexity of Tarry’s algorithm? 2E

Classical depth-first search

- A restriction of Tarry’s algorithm
- If a node gets a message over a frond edge, it immediately forwards it back
- Is the tree constructed a depth-first tree?

Awerbuch’s depth-first search

- Does spanning-tree construction in time proportional to the number of nodes (linear time)
- Prevents token forwarding over frond edges — each process knows which neighbors were visited before it forwards the token, so forwarding the token over frond edges is not needed
- A node notifies its neighbors that it is visited by sending 〈vis〉 messages to them
- time complexity — 4N-2 (the token traverses the N-1 tree edges twice and is delayed at each node for two time units)
- message complexity — 4E (〈vis〉 and 〈ack〉 are sent along each frond edge twice; 〈vis〉 from father to son, 〈ack〉 back; 〈tok〉 twice along each tree edge)

Why Do We Care About “Time” in a Distributed System?

- May need to know the time of day some event happened on a specific computer
  - need to synchronize that computer’s clock with some external authoritative source of time (external clock synchronization)
    - How hard is this to do?
- May need to know the time interval, or relative order, between two events that happened on different computers
  - If their clocks are synchronized to each other to some known degree of accuracy (called internal clock synchronization), we can measure time relative to a local clock
    - Is this always consistent?
- Will ignore relativistic effects
  - Cannot ignore the network’s unpredictability

Physical Clocks

- need for time in distributed systems
- physical clocks and their problems
- synchronizing physical clocks
  - coordinated universal time (UTC)
  - Cristian’s algorithm
  - Berkeley algorithm
  - network time protocol (NTP)

Physical Clocks

- Every computer contains a physical clock
- A clock (also called a timer) is an electronic device that counts oscillations in a crystal at a particular frequency
- The count is typically divided and stored in a counter register
- The clock can be programmed to generate interrupts at regular intervals (e.g., at the time interval required by a CPU scheduler)
- The counter can be scaled to get the time of day
- This value can be used to timestamp an event on that computer
- Two events will have different timestamps only if the clock resolution is sufficiently small
- Many applications are interested only in the order of the events, not the exact time of day at which they occurred, so this scaling is often not necessary

Physical Clocks in a Distributed System

- Does this work?
  - Synchronize all the clocks to some known high degree of accuracy, and then
  - measure time relative to each local clock to determine the order between two events
- Well, there are some problems...
  - It’s difficult to synchronize the clocks
  - Crystal-based clocks tend to drift over time — they count time at different rates, and diverge from each other
    - Physical variations in the crystals, temperature variations, etc.
    - Drift is small, but adds up over time
    - For quartz crystal clocks, the typical drift rate is about one second every 10^6 seconds (11.6 days)
    - The best atomic clocks have a drift rate of one second in 10^13 seconds (300,000 years)

Coordinated universal time

- The output of the atomic clocks is called International Atomic Time
  - Coordinated Universal Time (UTC) is an international standard based on atomic time, with an occasional leap second added or deleted
- UTC signals are synchronized and broadcast regularly by various radio stations (e.g., WWV in the US) and satellites (e.g., GEOS, GPS)
  - Have propagation delay due to the speed of light, distance from the broadcast source, atmospheric conditions, etc.
  - The received value is only accurate to 0.1–10 milliseconds
- Unfortunately, most workstations and PCs don’t have UTC receivers

Synchronizing physical clocks

- Use a time server with a UTC receiver
- Centralized algorithms
  - The client sets its time to Tserver + Dtrans
    - Tserver = server’s time
    - Dtrans = transmission delay
      - Unpredictable due to network traffic

Cristian’s algorithm

- Send a request to the time server, measure the time Dtrans taken to receive the reply Tserver
- Set the local time to Tserver + (Dtrans / 2)
  - Improvement: make several requests, take the average Tserver value
- Assumptions:
  - Network delay is fairly consistent
  - Request & reply take equal amounts of time
- To offset variations in time delay, the client may average over several requests
- Problems:
  - Doesn’t work if the time server fails
  - Not secure against a malfunctioning time server, or a malicious impostor time server

Berkeley algorithm

- Choose a coordinator computer to act as the master
- The master periodically polls the slaves — the other computers whose clocks should be synchronized to the master
  - The slaves send their clock values to the master
- The master observes the transmission delays, and estimates the slaves’ local clock times
- The master averages everyone’s clock times (including its own)
  - The master takes a fault-tolerant average — it ignores readings from clocks that have drifted badly, or that have failed and are producing readings far outside the range of the other clocks
- The master sends to each slave the amount (positive or negative) by which it should adjust its clock
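
Cristian's adjustment translates into a few lines of code. A sketch for illustration only; ask_time_server is a made-up placeholder for whatever request/reply transport is actually used, and here it simply pretends the server is 2.5 s ahead.

import time

def ask_time_server():
    # Placeholder: pretend the server's clock is 2.5 s ahead of ours.
    return time.time() + 2.5

def cristian_offset(samples=5):
    """Estimate the offset to add to the local clock: Tserver + Dtrans/2 - now."""
    offsets = []
    for _ in range(samples):                  # average over several requests
        t0 = time.time()
        t_server = ask_time_server()          # request + reply
        d_trans = time.time() - t0            # measured round-trip delay
        offsets.append(t_server + d_trans / 2 - time.time())
    return sum(offsets) / len(offsets)

print("estimated clock offset (s):", round(cristian_offset(), 3))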

Network Time Protocol (NTP)

- Provides a time service on the Internet
- Hierarchical network of servers:
  - Primary servers (100s) — connected directly to a time source
  - Secondary servers (1000s) — connected to primary servers in hierarchical fashion
    - ns.mcs.kent.edu runs a time server
  - Servers at higher levels are presumed to be more accurate than those at lower levels
- Several synchronization modes:
  - Multicast — for LANs, low accuracy
  - Procedure call — similar to Cristian’s algorithm, higher accuracy (file servers)
  - Symmetric mode — exchange detailed messages, maintain history
- All built on top of UDP (connectionless)

Compensating for clock drift in NTP

- Compare the time Ts provided by the time server to the time Tc at computer C
- If Ts > Tc (e.g., 9:07am vs 9:05am)
  - Could advance C’s time to Ts
  - May miss some clock ticks; probably OK (maybe not)
- If Ts < Tc (e.g., 9:07am vs 9:10am)
  - Can’t roll back C’s time to Ts
    - Many applications (e.g., make) assume that time always advances!
  - Can cause C’s clock to run slowly until it resynchronizes with the time server
    - Can’t change the clock oscillator rate, so have to change the software interpreting the clock’s counter register
    - Tsoftware = a * Thardware + b
    - Can determine the constants a and b

Is It Enough to Synchronize Physical Clocks?

- In a distributed system, there is no common clock, so we have to:
  - Use atomic clocks to minimize clock drift
  - Synchronize with time servers that have UTC receivers, trying to compensate for unpredictable network delay
- Is this sufficient?
  - The value received from a UTC receiver is only accurate to within 0.1–10 milliseconds
    - At best, we can synchronize clocks to within 10–30 milliseconds of each other
    - We have to synchronize frequently, to avoid local clock drift
  - In 10 ms, a 100 MIPS machine can execute 1 million instructions
    - Accurate enough as time-of-day
    - Not sufficiently accurate to determine the relative order of events on different computers in a distributed system

Logical Clocks

- why physical clocks are not adequate
- event ordering, happened-before relation
- Lamport’s logical clocks
  - condition
  - implementation
  - limitation
- Vector clocks
  - condition
  - implementation
  - application
    - Birman-Schiper-Stephenson causal broadcast
    - Schiper-Eggli-Sandoz causal ordering of messages

From physical clocks to logical clocks

- Physical clocks (last time)
  - With a receiver, a clock can be synchronized to within 0.1–10 ms of UTC
  - On a network, computer clocks can be synchronized to within 30 ms of each other (using NTP)
  - Quartz crystal clocks drift 1 µs per second (1 ms per 16.6 minutes)
  - In 30 ms, a 100 MIPS machine can execute 3 million instructions
  - We will refer to these clocks as physical clocks, and say they measure global time
- Idea — abandon the idea of physical time
  - For many purposes, it is sufficient to know the order in which events occurred
  - Lamport (1978) — introduce logical (virtual) time, synchronize logical clocks

Events and event ordering

- For many purposes, it is sufficient to know the order in which two events occurred
  - An event may be an instruction execution, a function execution, etc.
  - Events include message send / receive
- Within a single process, or between two processes on the same computer, the order in which two events occur can be determined using the physical clock
- Between two different computers in a distributed system, the order in which two events occur cannot be determined using local physical clocks, since those clocks cannot be synchronized perfectly

The “happened before” relation

- Lamport defined the happened-before relation (denoted “→”), which describes a causal ordering of events:
  (1) if a and b are events in the same process, and a occurred before b, then a→b
  (2) if a is the event of sending a message m in one process, and b is the event of receiving that message m in another process, then a→b
  (3) if a→b and b→c, then a→c (i.e., the relation “→” is transitive)
- Causality:
  - Past events influence future events
  - This influence among causally related events (those that can be ordered by “→”) is referred to as “causally affects”
  - If a→b, event a causally affects event b


The “happened before” relation (cont.)

[Figure: three processes P, Q, R with events p0–p4, q0–q4, r0–r1 and messages exchanged between them]

- Concurrent events:
  - Two distinct events a and b are said to be concurrent (denoted “a || b”) if neither a→b nor b→a
  - In other words, concurrent events do not causally affect each other
- For any two events a and b in a system, either a→b, or b→a, or a || b

Lamport’s logical clocks

- To implement “→” in a distributed system, Lamport (1978) introduced the concept of logical clocks, which captures “→” numerically
- Each process Pi has a logical clock Ci
- Clock Ci can assign a value Ci(a) to any event a in process Pi
  - The value Ci(a) is called the timestamp of event a in process Pi
  - The value C(a) is called the timestamp of event a in whatever process it occurred
- The timestamps have no relation to physical time, which leads to the term logical clock
  - The logical clocks assign monotonically increasing timestamps, and can be implemented by simple counters

Conditions satisfied by the logical clocks

- Clock condition: if a→b, then C(a) < C(b)
  - If event a happens before event b, then the clock value (timestamp) of a should be less than the clock value of b
  - Note that we cannot say: if C(a) < C(b), then a→b
- Correctness conditions (must be satisfied by the logical clocks to meet the clock condition above):
  - [C1] For any two events a and b in the same process Pi, if a happens before b, then Ci(a) < Ci(b)
  - [C2] If event a is the event of sending a message m in process Pi, and event b is the event of receiving that same message m in a different process Pk, then Ci(a) < Ck(b)

Implementation of logical clocks

- Implementation rules (guarantee that the logical clocks satisfy the correctness conditions):
  - [IR1] Clock Ci must be incremented between any two successive events in process Pi:
    Ci := Ci + d (d > 0, usually d = 1)
  - [IR2] If event a is the event of sending a message m in process Pi, then message m is assigned a timestamp tm = Ci(a). When that same message m is received by a different process Pk, Ck is set to a value greater than or equal to its present value, and greater than tm:
    Ck := max(Ck, tm + d) (d > 0, usually d = 1)
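
Rules IR1 and IR2 map directly onto a small counter class. A sketch (with d = 1) written for this summary, not taken from the course:

class LamportClock:
    """Lamport's logical clock with d = 1."""
    def __init__(self):
        self.c = 0

    def tick(self):                  # IR1: increment between successive events
        self.c += 1
        return self.c

    def send(self):                  # a send is an event; its value stamps the message
        return self.tick()

    def receive(self, tm):           # IR2: Ck := max(Ck, tm + 1)
        self.c = max(self.c, tm + 1)
        return self.c

p, q = LamportClock(), LamportClock()
p.tick()                             # local event at p: Cp = 1
tm = p.send()                        # send event: the message carries tm = 2
print(q.receive(tm))                 # 3: the receive is stamped after the send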

Example of logical clocks

[Figure: updating logical clocks using Lamport’s method; P1 has events e11–e17 with clock values (1)–(7), and P2 has events e21–e25 with clock values (1), (2), (3), (4), (7)]

- Notes:
  - Clocks are initially 0, d = 1
  - Most clocks are incremented due to IR1
  - Sends e12, e22, e16, and e24 use IR1
  - Receives e23, e15, and e17 are set to Ck
  - Receive e25 is set to tm + d = 6 + 1 = 7

Example of logical clocks (cont.)

[Figure: P1 has events e11 (1) and e12 (2); P2 has events e21 (1) and e22 (3)]

- The happened-before relationship “→” defines an irreflexive partial order among events
- A total order of events (“⇒”) can be obtained as follows:
  - If a is any event in process Pi, and b is any event in process Pk, then a ⇒ b if and only if either:
    Ci(a) < Ck(b), or
    Ci(a) = Ck(b) and Pi << Pk,
    where “<<” denotes a relation that totally orders the processes to break ties


Limitation of logical clocks

[Figure: P1 has events e11 (1), e12 (2); P2 has events e21 (1), e22 (3); P3 has events e31 (1), e32 (2), e33 (3)]

- With Lamport’s logical clocks, if a→b, then C(a) < C(b)
  - The converse is not necessarily true when events a and b occur in different processes: C(a) < C(b) does not imply a→b
  - C(e11) < C(e22), and e11→e22 is true
  - C(e11) < C(e32), but e11→e32 is false
- Cannot determine whether two events are causally related from the timestamps

Vector clocks

- Independently proposed by Fidge and by Mattern in 1988
- Vector clocks:
  - Assume the system contains n processes
  - Each process Pi has a clock Ci, which is an integer vector of length n:
    Ci = (Ci[1], Ci[2], ..., Ci[n])
  - Ci(a) is the timestamp (clock value) of event a at process Pi
  - Ci[i](a), entry i of Ci, is Pi’s logical time
  - Ci[k](a), entry k of Ci (where k ≠ i), is Pi’s best guess of the logical time at Pk
    - More specifically, the time of the occurrence of the last event in Pk which “happened before” the current event in Pi (based on messages received)

Implementation of vector clocks

- Implementation rules:
  - [IR1] Clock Ci must be incremented between any two successive events in process Pi:
    Ci[i] := Ci[i] + d (d > 0, usually d = 1)
  - [IR2] If event a is the event of sending a message m in process Pi, then message m is assigned a vector timestamp tm = Ci(a). When that same message m is received by a different process Pk, Ck is updated as follows:
    ∀p: Ck[p] := max(Ck[p], tm[p] + d) (usually d = 0 unless needed to model network delay)
- It can be shown that ∀i, ∀k: Ci[i] ≥ Ck[i]
- Rules for comparing timestamps can also be established so that if ta < tb, then a→b
  - This solves the problem with Lamport’s clocks

Implementation of vector clocks (example)

[Figure: “enn” labels an event and “(n,n,n)” its clock value; P1: e11 (1,0,0), e12 (2,0,0), e13 (3,4,1); P2: e21 (0,1,0), e22 (2,2,0), e23 (2,3,1), e24 (2,4,1); P3: e31 (0,0,1), e32 (0,0,2)]

- Notes:
  - Events e11, e21, and e12 are updated by IR1
  - Receive e22 is updated by IR1 and IR2
  - Receive e13 tells P1 about P2 and P3 (P3’s clock is old, but better than nothing!)
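
The two vector-clock rules, and the comparison test that recovers “→”, can be sketched as follows (d = 1 for local events and d = 0 on receive, matching the slides; an illustration, not the course's code):

class VectorClock:
    def __init__(self, i, n):
        self.i, self.v = i, [0] * n

    def tick(self):                      # IR1: advance own component
        self.v[self.i] += 1
        return list(self.v)

    def send(self):                      # the message carries a copy of the vector
        return self.tick()

    def receive(self, tm):               # IR2: component-wise max, then IR1
        self.v = [max(a, b) for a, b in zip(self.v, tm)]
        return self.tick()

def happened_before(ta, tb):
    # ta -> tb iff ta <= tb component-wise and ta != tb
    return all(a <= b for a, b in zip(ta, tb)) and ta != tb

p1, p2 = VectorClock(0, 3), VectorClock(1, 3)
t_send = p1.send()                       # [1, 0, 0]
t_recv = p2.receive(t_send)              # [1, 1, 0]
t_other = [0, 0, 1]                      # an event at P3 nobody has heard about
print(happened_before(t_send, t_recv))   # True
print(happened_before(t_other, t_recv))  # False: the events are concurrent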

Application of VC

- Causal ordering of messages — maintaining the same causal order for message receive events as for the message send events
  - that is: if Send(M1) → Send(M2) and Receive(M1) and Receive(M2) are on the same process, then Receive(M1) → Receive(M2)
  - the example above shows a violation
  - do not confuse this with causal ordering of events
- Causal ordering is useful, for example, for replicated databases
- Two algorithms using VC:
  - Birman-Schiper-Stephenson (BSS) causal ordering of broadcasts
  - Schiper-Eggli-Sandoz (SES) causal ordering of regular messages
- Basic idea — use VC to delay the delivery of out-of-order messages

Birman-Schiper-Stephenson (BSS) causal ordering of broadcasts

- Each process Pi maintains a vector time VT_Pi to track the order of broadcasts
- Before broadcasting message m, Pi increments VT_Pi[i] and appends VT_Pi to m (denoted VT_m)
  - notice that (VT_Pi[i] - 1) is the number of messages from Pi preceding m
- When Pj receives m from Pi ≠ Pj, it delivers it only when
  - VT_Pj[i] = VT_m[i] - 1 (all previous messages from Pi have been received by Pj)
  - VT_Pj[k] ≥ VT_m[k], ∀k ∈ {1,2,...,n} except i (Pj has received all messages received by Pi before it sent m)
  - undelivered messages are stored for later delivery
- After the delivery of m, VT_Pj is updated according to VC rule IR2 on the basis of VT_m, and the delayed messages are reevaluated
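
The BSS delivery test can be written down directly from the two conditions above. A sketch with 0-based indices (illustration only):

def bss_deliverable(vt_pj, vt_m, i):
    """Can Pj (vector time vt_pj) deliver broadcast m sent by Pi (vector vt_m)?"""
    if vt_pj[i] != vt_m[i] - 1:              # all earlier broadcasts from Pi delivered?
        return False
    return all(vt_pj[k] >= vt_m[k]           # Pj has seen everything Pi had seen
               for k in range(len(vt_m)) if k != i)

# P0's first broadcast can be delivered by a process that has seen nothing yet,
# but P0's second broadcast must wait until the first one has been delivered.
print(bss_deliverable([0, 0, 0], [1, 0, 0], 0))   # True
print(bss_deliverable([0, 0, 0], [2, 0, 0], 0))   # False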

Schiper-Eggli-Sandoz (SES) causal ordering of single messages

- Non-FIFO channels are okay
- Each process maintains V_P, a set of entries (P’, t), where P’ is a destination process and t is a VC timestamp
- Sending a message m from P1 to P2:
  - send the message with the current timestamp t_P1 and V_P1 from P1 to P2
  - add (P2, t_P1) to V_P1 — for future messages to carry
- Receiving this message:
  - the message can be delivered if
    - V_m does not contain an entry for P2, or
    - V_m contains an entry (P2, t) but t ≤ t_P2 (where t_P2 is the current timestamp at P2)
  - after delivery
    - insert the entries from V_m into V_P2 for every process P3 ≠ P2 if they are not there
    - update the timestamp of the corresponding entry in V_P2 otherwise
    - update P2’s logical clock
    - deliver buffered messages if possible

Global state recording

- local state
- global state
- Chandy-Lamport’s global state (snapshot) recording algorithm
- consistent cuts

Local state

- The local state LSi of a site (process) Si is an assignment of values to the variables of Si
- Sending send(mij) and receiving rec(mij) of a message mij from Si to Sj may influence LSi
- We denote:
  - time(send(mij)) or time(rec(mij)) — the time (physical, or a point in the computation) at which the send or receive occurs
  - time(LSi) — the time the local state of Si was recorded
- To aid the reasoning, we consider the messages sent/received by a site as belonging to its local state; we define
  - transit(LSi, LSj) = { mij | send(mij) ∈ LSi and rec(mij) ∉ LSj }
  - inconsistent(LSi, LSj) = { mij | send(mij) ∉ LSi and rec(mij) ∈ LSj }
  that is:
  - a message is in transit if it was sent but not received
  - a message is inconsistent if it was received but never sent

Global state

- A global state is a collection of the local states of all sites and the set of messages in the channels
  - notice that Singhal does not use messages in his definition — ours is more precise
- A global state is consistent if it does not have any inconsistent messages, that is, inconsistent(LSi, LSj) = ∅ for all i, j
- A global state is transitless if there are no messages in transit, that is, transit(LSi, LSj) = ∅ for all i, j
- Note that a consistent state is not necessarily transitless, and vice versa
- What are the global states in the picture above? [figure elided]

Chandy-Lamport’s global state recording algorithm (snapshot algorithm)

- Works on an arbitrary-topology system with FIFO channels and an arbitrary algorithm whose snapshot is taken (the basic algorithm)
- Does not interfere with the operation of the basic algorithm (does not delay, reorder or drop basic messages)
- Records a state that might have happened between the beginning and the end of the snapshot
- One process initiates recording by sending control messages (markers); multiple processes can also initiate
- Questions:
  - can C-L record a state where some processes have messages in every channel? if yes, which one?
  - can several independent snapshots run in parallel?
Cuts

- A cut C is a set of special (cut) events, one for each process: C = {c1, c2, ..., cn}
- A non-cut event at process P is pre-cut if it precedes the cut event at this process, and post-cut otherwise
- A cut is not consistent if a pre-cut event causally follows a post-cut event
- Theorem: a cut is consistent iff all cut events are concurrent
- Assume vector clocks are run, and let VT_ci be the timestamp of ci
- We define VT_C, the timestamp of the cut C, as
  VT_C = sup(VT_c1, VT_c2, ..., VT_cn),
  where sup is the component-wise maximum
- Theorem: the cut is consistent iff VT_C = (VT_c1[1], VT_c2[2], ..., VT_cn[n])

Termination detection in asynchronous message-passing systems

- definition
- motivation for termination detection
- main concepts in termination
- flooding termination announcement algorithm
- Dijkstra-Scholten termination detection algorithm for diffusing computations
- Shavit-Francez generalization to decentralized algorithms

Definitions

- Terminal state — no further steps of the algorithm are possible
- Terminal state of a process — a special local state of a process in which no event of the process is applicable
- Deadlock is an example of a terminal state
- Termination:
  - implicit (message termination) — a state that allows the receipt of messages, but there are no messages in the channels; no action can be executed, but the processes are not aware of the termination
  - explicit (process termination) — a terminal state with every process in a terminal state and no messages in the channels
- Implicit termination is easier to design for, since the processes do not need to know that the algorithm terminated

Why termination detection

- Objective — convert message-terminating algorithms into process-terminating ones.
- Achieved by adding two additional algorithms. The original algorithm is called the basic algorithm. The added algorithms perform two tasks:
  - termination detection — recognize that the basic algorithm is in a message-terminating state
  - termination announcement — make all processes go into a terminating state
- The added algorithms may exchange messages; these are called control messages

More definitions

- An event can be internal or external (receipt of a message)
- Each basic process is assumed to be in one of two states:
  - active — an action of the process is enabled (able to execute)
  - passive — no process actions are enabled
- The variable statep represents whether a process is active or passive
- The following assumptions are made about basic algorithms:
  - an active process becomes passive only on an internal event
  - a process always becomes active when a message is received
  - internal events where a process becomes passive are the only internal events (we ignore active -> active events)
- diffusing computation (centralized algorithm) — an algorithm where only one process is active in every initial state
  - this process is called the initiator

Termination Announcement

- To announce termination, a 〈stop〉 control message is flooded to all processes
- Each process sends a 〈stop〉 message to every neighbor
- Messages are sent at most once:
  - on a local call to Announce
  - on receiving the first 〈stop〉
- The algorithm terminates when a process has received 〈stop〉 from every neighbor
- The algorithm works on directed and undirected networks, and requires no identities, leader, or topological knowledge
- What happens when multiple nodes call Announce simultaneously?

Dijkstra-Scholten Termination Detection Algorithm

- The algorithm works on diffusing computations only; p0 is the initiator
- The algorithm dynamically maintains a computation tree T = (VT, ET)
  - either T is empty, or T is a directed tree with root p0
  - VT includes all active processes and all basic messages in transit
- The initiator calls Announce when p0 ∉ VT (which implies that T is empty)
- When process p sends a basic message 〈mes〉, it becomes the father of this message
- When 〈mes〉 is received by q:
  - if q is not yet involved in the computation (does not have a father), it sets p to be its father
  - if q already has a father, it sends the message 〈sig〉 back to p
- Each process p maintains a variable scp that counts the number of its sons:
  - every time p sends 〈mes〉, scp is incremented
  - every time p receives 〈sig〉, scp is decremented
  - when scp is 0 and p is passive, it sends 〈sig〉 to its father
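
The son counter and signal handling above can be sketched in event-handler style. An illustration written for this summary (transport abstracted into callbacks), not the original formulation:

class DSProcess:
    """Dijkstra-Scholten bookkeeping for one process (sketch)."""
    def __init__(self, name, is_initiator=False):
        self.name = name
        self.father = name if is_initiator else None   # the initiator is its own root
        self.sc = 0                                     # number of sons (unsignalled <mes>)
        self.passive = True

    def send_basic(self, send):
        self.sc += 1                      # p becomes the father of this message
        send(("mes", self.name))

    def on_basic(self, sender, send_sig):
        self.passive = False
        if self.father is None:
            self.father = sender          # join the computation tree under the sender
        else:
            send_sig(sender)              # already in the tree: signal back at once

    def on_signal(self, announce, send_sig):
        self.sc -= 1
        self.try_leave(announce, send_sig)

    def become_passive(self, announce, send_sig):
        self.passive = True
        self.try_leave(announce, send_sig)

    def try_leave(self, announce, send_sig):
        if self.sc == 0 and self.passive and self.father is not None:
            if self.father == self.name:
                announce()                # initiator with an empty tree: terminated
            else:
                send_sig(self.father)     # no sons left: signal the father and leave
            self.father = None

# A two-process run: the initiator p0 sends one basic message to p1.
events = []
p0, p1 = DSProcess("p0", is_initiator=True), DSProcess("p1")
announce = lambda: events.append("terminated")
sig_to_p0 = lambda dst: p0.on_signal(announce, send_sig=None)

p0.passive = False
p0.send_basic(events.append)                     # p0 now has one son
p1.on_basic("p0", send_sig=sig_to_p0)            # p1 joins the tree under p0
p1.become_passive(announce, send_sig=sig_to_p0)  # p1 leaves: <sig> to p0
p0.become_passive(announce, send_sig=None)       # tree empty: Announce
print(events)                                    # [('mes', 'p0'), 'terminated']
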
Termination detection (cont.)

[Figure: Dijkstra-Scholten actions, where Sp is a basic send, Rp a basic receive, Ip a change from active to passive, and Ap the arrival of a signal]

Shavit-Francez generalization to decentralized algorithms

- The Dijkstra-Scholten algorithm works only on diffusing computations (one initiator)
- Shavit-Francez suggested a generalization to decentralized algorithms (multiple initiators)
- In their algorithm each initiator maintains a computation tree similar to Dijkstra-Scholten’s
- Problem — when the tree of one initiator collapses, that initiator does not know whether the computation terminated — there may still be other trees
- Solution — all processes participate in a wave
  - a non-initiator process continues the wave
  - an initiator process continues the wave only if its tree has collapsed
- By the definition of a wave, a process decides only when all system processes have made at least one move. Thus when one process decides, the basic computation has terminated.
- When a process decides, it calls Announce

Distributed mutual exclusion

- DMX definitions
- DMX vs. single-computer MX
- DMX taxonomy
- measuring performance of a DMX algorithm
- trivial DMX algorithm — central coordinator
- non-token-based DMX algorithms
  - Lamport’s
  - Ricart-Agrawala’s
  - Maekawa’s
- Raymond’s extension to K-exclusion

Distributed mutual exclusion (DMX)

- N processes share a single resource, and require mutually-exclusive access
- Conditions to satisfy:
  - safety — only one process can access the shared resource at a time
  - liveness — if a process requests access to the shared resource, it should eventually be given a chance to do so
- The process accessing the shared resource is said to be in the critical section (CS); a process wishing to access the resource is said to be requesting the CS
  - each process may or may not request the CS during a computation
  - the CS execution is always finite
- Assumptions made:
  - Messages between two processes are received in the order they are sent (channels are FIFO)
  - Every message is eventually received
  - Each process can send a message to any other process (fully connected network)

MX on a single computer vs. DMX

- Single-computer MX (also called shared-memory MX)
  - similar to the DMX problem — multiple processes compete for CS execution
  - processes have access to shared variables / a shared clock (presumed to be executing on the same machine)
  - solutions are usually based on shared-memory synchronization primitives (locks / condition variables, semaphores)
- In DMX we do not have access to a shared memory / clock
  - synchronization has to be done without shared memory
  - delays in the propagation of information are unpredictable

DMX algorithms taxonomy

- Lock-based (aka permission-based, non-token-based) — to enter the CS a process needs to obtain permission from other processes in the system.
  - Lamport
  - Ricart-Agrawala
  - Maekawa
- Token-based — a unique token (privilege) is circulated in the system. A process possessing the token can enter the CS.
  - Suzuki-Kasami
  - LeLann
  - Raymond

Measuring performance

- Metrics to measure the performance of DMX algorithms:
  - message complexity — the number of messages per CS entry
  - synchronization delay — the amount of time required after one process leaves the CS and before another process enters the CS, measured in the number of causally related messages
- The measures are considered under:
  - low and high load — the number of processes in the system simultaneously requesting the CS
  - worst and average case

Central Coordinator

- One processor is the coordinator — it maintains a queue of requests
- To enter the critical section, a processor sends a request message to the central coordinator
- When the coordinator receives a request:
  - If no other processor is in the critical section, it sends back a reply message
  - If another processor is in the critical section, the coordinator adds the request to the tail of its queue, and does not respond
- When the requesting processor receives the reply message from the coordinator, it enters the critical section
- When it leaves the critical section, it sends a release message to the coordinator
- When the coordinator receives a release message, it removes the request from the head of the queue, and sends a reply message to that processor
- Evaluation:
  - message complexity — 3
  - synchronization delay — 2T

Lamport’s algorithm (1978)

- Each processor maintains a request queue, ordered by timestamp value
- Requesting the critical section (CS):
  - When a processor wants to enter the CS, it:
    - Adds the request to its own request queue (requests are ordered by timestamps)
    - Sends a timestamped request to all processors
  - When a processor receives a request message, it:
    - Adds the request to its own request queue
    - Returns a reply message
- Executing the CS:
  - A process enters the CS when both:
    - Its own request is at the top of its own request queue (its request is earliest), and
    - variant 1 — it has received a reply with a greater timestamp from every process in the system
    - variant 2 — it has received any message with a greater timestamp from every process in the system

Lamport’s algorithm (cont.)

- Releasing the CS:
  - When a processor leaves the CS, it:
    - Removes its own (satisfied) request from the top of its own request queue
    - Sends a timestamped release message to all processors in the system
  - When a processor receives a release message, it:
    - Removes the (satisfied) request from its own request queue
    - (Perhaps raising its own request to the top of the queue, enabling it to finally enter the CS)
- Evaluation:
  - message complexity — 3(N–1): (N–1) request, (N–1) reply, (N–1) release
  - synchronization delay — T


Ricart and Agrawala’s algorithm (1981)

- An optimization of Lamport’s — no release messages (they are merged with the replies)
- Requesting the critical section (CS):
  - When a processor wants to enter the CS, it:
    - Sends a timestamped request to all OTHER processors
  - When a processor receives a request:
    - If it is neither requesting nor executing the CS, it returns a reply (not timestamped)
    - If it is requesting the CS, but the timestamp on the incoming request is smaller than the timestamp on its own request, it returns a reply
      - Means the other processor requested first
    - Otherwise, it defers answering the request
- Executing the CS:
  - A processor enters the CS when:
    - It has received a reply from all other processors in the system

Ricart and Agrawala’s algorithm (cont.)

- Releasing the CS:
  - When a process leaves the CS, it:
    - Sends a reply message to all the deferred requests
    - (The process with the next earliest request will now receive its last reply message and enter the CS)
- Evaluation:
  - message complexity — 2(N–1): (N–1) request, (N–1) reply
  - synchronization delay — T
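
The reply-or-defer decision is the heart of Ricart-Agrawala. A sketch of that decision for one process, written for this summary (message transport is omitted; timestamps are (Lamport time, process id) pairs so that ties break consistently):

class RAProcess:
    """Ricart-Agrawala request handling at one process (sketch)."""
    def __init__(self, pid):
        self.pid = pid
        self.requesting = False
        self.in_cs = False
        self.my_ts = None              # timestamp of my outstanding request
        self.deferred = []             # requests answered only after I leave the CS

    def on_request(self, their_ts, sender, send_reply):
        if self.in_cs or (self.requesting and self.my_ts < their_ts):
            self.deferred.append(sender)      # I am in the CS or I asked first: defer
        else:
            send_reply(sender)                # not competing, or they asked first

    def on_exit_cs(self, send_reply):
        self.in_cs = False
        for sender in self.deferred:          # the merged "release": answer deferred requests
            send_reply(sender)
        self.deferred.clear()

replies = []
p = RAProcess(pid=1)
p.requesting, p.my_ts = True, (5, 1)          # p requested the CS at Lamport time 5
p.on_request((7, 2), sender=2, send_reply=replies.append)   # later request: deferred
p.on_request((3, 0), sender=0, send_reply=replies.append)   # earlier request: reply now
print(replies, p.deferred)                    # [0] [2]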

Maekawa’s algorithm

- Lamport’s and Ricart-Agrawala’s algorithms have message complexity proportional to the number of processes in the system
- Observation — a process does not have to send a message to all other processes in order to lock them
- Every process Pi is assigned a request set Ri (quorum) of processes
  - Pi is in Ri
  - for any two processes Pi and Pj, Ri ∩ Rj ≠ ∅
- Maekawa showed that the minimum quorum size is √N
- Example quorums:
  - for 3 processes: R0={P0,P1}, R1={P1,P2}, R2={P0,P2}
  - for 7 processes: R0={P0,P1,P2}, R1={P1,P3,P5}, R2={P2,P4,P5}, R3={P0,P3,P4}, R4={P1,P4,P6}, R5={P0,P5,P6}, R6={P2,P3,P6}

Maekawa’s algorithm, basic operation

- Requesting the CS
  - a process requests the CS by sending a timestamped request message to the processes in its quorum
  - a process has just one permission to give; if a process receives a request, it sends back a reply unless it has granted its permission to another process, in which case the request is queued
- Entering the CS
  - a process may enter the CS when it receives replies from all processes in its quorum
- Releasing the CS
  - after exiting the CS, a process sends a release to every process in its quorum
  - when a process gets a release, it sends a reply to the lowest-timestamped request in its queue

Maekawa’s algorithm, deadlock possibility

- Since processes do not communicate with all other processes in the system, CS requests may be granted out of timestamp order
- Example:
  - suppose there are processes Pi, Pj, and Pk such that Pj ∈ Ri and Pj ∈ Rk, but Pk ∉ Ri and Pi ∉ Rk
  - Pi and Pk request the CS such that tsk < tsi
  - if the request from Pi reaches Pj first, then Pj sends a reply to Pi, and Pk has to wait for Pi out of timestamp order
  - a wait-for cycle (hence a deadlock) may be formed

Maekawa’s algorithm, deadlock avoidance

- To avoid deadlock, a process recalls a permission if it was granted out of timestamp order
  - if Pj receives a request from Pi with a higher timestamp than the request it granted permission to, Pj sends failed to Pi
  - if Pj receives a request from Pi with a lower timestamp than the request it granted permission to (a deadlock possibility), Pj sends inquire to Pi
  - when Pi receives inquire, it replies with yield if it has not succeeded in getting permissions from the other processes
    - that is, it either got a failed, or it sent a yield and did not get a reply

Raymond’s extension for sharing K


identical resources (1987)
n K identical resources, which must be shared among N processes
n Raymond’s extension to Ricart-Agrawala’s algorithm:
u A process can enter the CS as soon as it has received N–K reply
messages
u Algorithm is generally the same as R&A, with one difference:

F R&A — reply messages arrive only when process is waiting to


enter CS
F Raymond —

• N–K reply messages arrive when process is waiting to enter


CS
• Remaining K–1 reply messages can arrive when process is in
the CS, after it leaves the CS, or when it’s waiting to enter the
CS again
• Must keep a count of number of outstanding reply messages,
and not count those toward next set of replies
n how would you modify Maekawa’s to share K resources? 18
Distributed and hierarchical deadlock detection, deadlock resolution
n detection
u distributed algorithms
F Obermarck's path-pushing
F Chandy, Misra, and Haas's edge-chasing
u hierarchical algorithms
F Menasce and Muntz's algorithm
F Ho and Ramamoorthy's algorithm
n resolution

Distributed deadlock detection
n Path-pushing
u WFG is disseminated as paths — sequences of edges
u Deadlock if a process detects a local cycle
n Edge-chasing
u Probe messages circulate
u Blocked processes forward the probe to processes holding requested resources
u Deadlock if the initiator receives its own probe
Obermarck’s Path-Pushing Chandy, Misra, and Haas’s Edge-Chasing


n Individual sites maintain local WFGs n When a process has to wait for a resource (blocks), it sends a
u Nodes for local processes probe message to process holding the resource
u Node “Pex” represents external processes n Process can request (and can wait for) multiple resources at once
F Pex1 -> P1 -> P2 ->P3 -> Pex2 n Probe message contains 3 values:
n Deadlock detection: u ID of process that blocked
u site Si finds a cycle that does not involve Pex – deadlock
u ID of process sending message
u site Si finds a cycle that does involve Pex – possibility of a deadlock
u ID of process message was sent to
F sends a message containing its detected cycle to all other sites

• to decrease network traffic the message is sent only when F (unclear why the latter two identifiers are necessary)

Pex1 > Pex2 n When a blocked process receives a probe, it propagates the probe
• assumption: the identifier of a process spanning the sites is to the process(es) holding resources that it has requested
the same! u ID of blocked process stays the same, other two values updated
F If site Sj receives such a message, it updates its local WFG as appropriate
graph, and reevaluates the graph (possibly pushing a path
u If the blocked process receives its own probe, there is a
again)
deadlock
n Can report a false deadlock
n size of a message is O(1)

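A compact sketch of edge-chasing over a wait-for relation held in a single table; wait_for and detect_deadlock are illustrative names, and a real implementation would exchange probe messages between sites rather than walk a local dictionary.

from collections import deque

# process -> set of processes it is waiting for (i.e., that hold resources it wants)
wait_for = {1: {2}, 2: {3}, 3: {1}, 4: set()}

def detect_deadlock(initiator):
    probes = deque((initiator, initiator, dep) for dep in wait_for[initiator])
    forwarded = set()
    while probes:
        init, sender, receiver = probes.popleft()
        if receiver == init:
            return True                        # initiator received its own probe
        if (sender, receiver) in forwarded:
            continue                           # do not re-forward the same edge
        forwarded.add((sender, receiver))
        for dep in wait_for.get(receiver, ()): # receiver is blocked too: propagate
            probes.append((init, receiver, dep))
    return False

print(detect_deadlock(1))   # True: 1 -> 2 -> 3 -> 1
print(detect_deadlock(4))   # False: process 4 is not blocked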
Performance evaluation of Obermarck's and Chandy-Misra-Haas algorithms
n Obermarck's
u on average(?) only half the sites involved in the deadlock send messages
u every such site sends messages to all other sites, thus
F n(n–1)/2 messages to detect a deadlock, for n sites
u size of a message is O(n)
n Chandy, Misra, and Haas's
u given n processes, a process may be blocked on up to (n–1) processes, thus
F m(n–1)/2 messages to detect a deadlock
• m processes, n sites
u size of a message is 3 integers

Menasce and Muntz' hierarchical deadlock detection
n Sites (called controllers) are organized in a tree
n Leaf controllers manage resources
u Each maintains a local WFG concerned only with its own resources
n Interior controllers are responsible for deadlock detection
u Each maintains a global WFG that is the union of the WFGs of its children
u Detects deadlock among its children
n changes are propagated upward either continuously or periodically
Ho and Ramamoorthy’s Estimating performance of deadlock
hierarchical deadlock detection detection algorithms
n Sites are grouped into disjoint clusters n Usually measured as the number of messages exchanged to
n Periodically, a site is chosen as a central control site detect deadlock
u Central control site chooses a control site for each cluster u Deceptive since message are also exchanged when there is

n Control site collects status tables from its cluster, and uses the no deadlock
Ho and Ramamoorthy one-phase centralized deadlock detection u Doesn’t account for size of the message

algorithm to detect deadlock in that cluster n Should also measure:


n All control sites then forward their status information and WFGs u Deadlock persistence time (measure of how long resources
to the central control site, which combines that information into a are wasted)
global WFG and searches it for cycles F Tradeoff with communication overhead
n Control sites detect deadlock in clusters u Storage overhead (graphs, tables, etc.)
n Central control site detects deadlock between clusters u Processing overhead to search for cycles

u Time to optimally recover from deadlock

Deadlock resolution
n resolution – aborting at least one process (the victim) in the cycle and granting its resources to others
n efficiency issues of deadlock resolution
u fast – after deadlock is detected the victim should be quickly selected
u minimal – abort the minimum number of processes, ideally abort less "expensive" processes (with respect to completed computation, consumed resources, etc.)
u complete – after the victim is aborted, info about it is quickly removed from the system (no phantom deadlocks)
u no starvation – avoid repeated aborting of the same process
n problems
u the detecting process may not know enough info about the victim (propagating enough info makes detection expensive)
u multiple sites may simultaneously detect deadlock
u since the WFG is distributed, removing info about the victim takes time
Distributed File Systems
n definition, main concepts, design goals
n semantics of file sharing
u unix
u session
n file access and data caching
u cache location
u cache modification
u cache validation

Distributed file systems
n A distributed file system is a part of a distributed system that provides a user with a unified view of the files on the network.
[Figure: clients with local caches and disks connected over a network to file servers, each with its own cache and disk]
DFS - main notions
n File service — specification of the file system interface as seen by the clients
n File server — a process running on some machine which helps implement the file service by supplying files
n In principle, files in a distributed file system can be stored at any machine
u However, a typical distributed environment has a few dedicated machines called file servers that store all the files
n A machine that holds the shared files is called a server; a machine that accesses the files is called a client.

Goals of DFS design
n Goals of a distributed file system
u transparency
F structure - clients should not be aware of the multiple servers, replicas and caches in use
F access - remote and local files should be accessed the same way
F name - the name of a file should not differ on different parts of the DFS
u user mobility/file mobility - users should be able to access the DFS in a uniform manner from different locations, and should be able to move files around in the DFS
u simplicity/ease of use - should be similar in use to a centralized file system
u availability/robustness — file service should be maintained even in the presence of partial system failures
u performance/scalability — should overcome the bottlenecks of a centralized file system, should scale well
Mounting mechanism for transparency
n a transparent name space can be built using the Unix mounting mechanism
u file systems from servers are attached (mounted) as directories in the local file system
u the points of attachment are called mount points
[Figure: directory trees of server 1, the client, and server 2; directories exported by the servers are mounted into the client's local name space at mount points]

File sharing semantics
n Unix semantics
u description:
F enforces an absolute time ordering on all operations
F every read operation on a file sees the effects of all previous write operations on that file
u can be implemented on a single-server DFS
u easiest to use
n session semantics
u session - a series of file accesses made between open and close operations
u changes made to the file are visible only to the client process (possibly to processes on the same client)
u the changes are visible to the sessions that open after the session closes
File access models
n Accessing remote files:
u remote service model - the client submits requests to the server, all processing is done on the server, the file never moves from the server
F server is a bottleneck
F excessive network communication possible
u data-caching model
F File service provides:
• open — transfer the entire file to the client
• close — transfer the entire file to the server
F Client works on the file locally (in memory or on disk)
• Simple, efficient if working on the entire file
• Must move the entire file
• Needs local disk space

Remote file access and sharing
n Once the user specifies a remote file, the OS can do the access either:
u Remotely on the server machine, and then return the results (RPC model), or
u By transferring the file (or part of the file) to the requesting host, and performing local accesses, or
u Instead of doing the transfer for each user request, the OS can cache files, and use that cache to reduce the latency of data access (and thus increase performance)
n Issues
u Where and when is data cached?
u Cache consistency:
F What happens when the user modifies the file? Does each cached copy change? Does the original file change?
F Is the cached copy out of date?
Cache location
n No caching — all files on the server's disk
u Simple, no local storage needed
u Expensive transfers
n Cache files in the server's memory
u Easy, transparent to clients
u Still involves a network access
n Cache files on the client's local disk
u Plenty of space, reliable
u Faster than the network, slower than memory
n Cache files in the client's memory
u The usual solution (either in each process's address space, or in the kernel)
u Fast, permits diskless workstations
u Data may be lost in a crash

Cache modification policy
n The cache modification (writing) policy decides when a modified (dirty) cache block should be flushed to the server
n Write-through — immediately flush the new value to the server (& keep it in the cache)
u No problems with consistency
u Maximum reliability during crashes
u Doesn't take advantage of caching during writes (only during reads)
n Write-back (delayed-write) — flush the new value to the server after some delay
u Fast — a write need only hit the cache before the process continues
u Can reduce disk writes since the process may repeatedly write the same location
u Unreliable — if the machine crashes, unwritten data is lost
Cache modification policy (cont.)
n Variations on write-back (when are the new values flushed to the server?)
u Write-on-close — flush the new value to the server only when the file is closed
F Can reduce disk writes, particularly when the file is open for a long time
F Unreliable — if the machine crashes, unwritten data is lost
F May make the process wait on the file close
u Write-periodically — flush the new value to the server at periodic intervals (maybe 30 seconds)
F Can only lose the writes of the last period

Cache validation
n A client must decide whether or not a locally cached copy of data is consistent with the master copy
n Client-initiated validation:
u Client initiates validity checks
u Client contacts the server and asks if its copy is consistent with the server's copy
F At every access, or
F After a given interval, or
F Only on file open
u Server could enforce single-writer, multiple-reader semantics, but to do so
F It would have to store client state (expensive)
F Clients would have to specify the access type (read / write) on open
u A high frequency of validity checks may mitigate the benefits of caching
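A toy client-side cache illustrating two of the write policies above (write-through and write-on-close). ToyServer, CachedFile and whole-file caching are assumptions made for the example, not part of any particular DFS.

class CachedFile:
    def __init__(self, name, server, policy="write-through"):
        self.name, self.server, self.policy = name, server, policy
        self.data = bytearray(server.read(name))   # whole-file caching
        self.dirty = False

    def write(self, offset, payload):
        self.data[offset:offset + len(payload)] = payload
        self.dirty = True
        if self.policy == "write-through":         # flush immediately, keep in cache
            self.server.write(self.name, bytes(self.data))
            self.dirty = False

    def close(self):
        if self.dirty:                              # write-on-close flush
            self.server.write(self.name, bytes(self.data))
            self.dirty = False

class ToyServer:                                    # stand-in for a file server
    def __init__(self): self.files = {"notes.txt": b"hello world"}
    def read(self, name): return self.files[name]
    def write(self, name, data): self.files[name] = data

srv = ToyServer()
f = CachedFile("notes.txt", srv, policy="write-on-close")
f.write(0, b"HELLO")        # only the cached copy changes
f.close()                   # the dirty data reaches the server here
print(srv.files["notes.txt"])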
Cache validation (cont.)
n Server-initiated validation:
u Server records the parts of each file that each client caches
u Server detects potential conflicts if two or more clients cache the same file
u Handling conflicts:
F Session semantics — writes are only visible to sessions starting later (not to processes which have the file open now)
• When a client closes a file that it has modified, the server notifies the other clients that their cached copy is invalid, and they should discard it
• If another client has the file open, it discards the copy when its session is over
F UNIX semantics — writes are immediately visible to others
• Clients specify the type of access they want when they open a file, so if two clients want to open the same file for writing, that file is not cached
u Significant overhead at the server

Stateful vs. stateless
n Stateful server — the server maintains state information for each client for each file
u Connection-oriented (open file, read / write file, close file)
u Enables server optimizations like read-ahead (prefetching) and file locking
u Difficult to recover state after a crash
n Stateless server — the server does not maintain state information for each client
u Each request is self-contained (file, position, access)
F Connectionless (open and close are implied)
u If the server crashes, the client can simply keep retransmitting requests until it recovers
u No server optimizations like the above
u File operations must be idempotent
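The stateless-server idea can be seen in one function: every request carries all the context it needs, so retransmitting it is harmless. The file table and the function below are illustrative, loosely in the spirit of NFS-style reads.

FILES = {"log.txt": b"0123456789"}

def stateless_read(file_name, offset, length):
    # Self-contained and idempotent: retransmitting the same request after a
    # server crash/restart returns the same bytes; no per-client state is needed.
    return FILES[file_name][offset:offset + length]

print(stateless_read("log.txt", 2, 4))   # b'2345'
print(stateless_read("log.txt", 2, 4))   # the same answer on a retry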
Distributed shared memory
u motivation and the main idea
u consistency models
F strict and sequential
F causal
F PRAM and processor
F weak and release
u implementation of sequential consistency
u implementation issues
F granularity
F thrashing
F page replacement

DSM idea
n all computers share a single paged, virtual address space
n pages can be physically located on any computer
n when a process accesses data in the shared address space, a mapping manager maps the request to the physical page
n mapping manager – kernel or runtime library
n if the page is remote – block the process and fetch it
Advantages of DSM
n Simpler abstraction - the programmer does not have to worry about data movement; may be easier to implement than RPC since the address space is the same
n easier portability - sequential programs can in principle be run directly on DSM systems
n possibly better performance
u locality of data - data is moved in large blocks, which helps programs with good locality of reference
u on-demand data movement
u larger memory space - no need to do paging on disk
n flexible communication - no need for the sender and receiver to exist; processes can join and leave the DSM system without affecting the others
n process migration simplified - a process can easily be moved to a different machine since they all share the address space

Maintaining memory coherency
n DSM systems allow concurrent access to shared data
n concurrency may lead to unexpected results - what if a read does not return the value stored by the most recent write (the write did not propagate)?
n Memory is coherent if the value returned by a read operation is always the value the programmer expected
n To maintain coherency of shared data, a mechanism that controls (and synchronizes) memory accesses is used.
n This mechanism only allows a restricted set of memory access orderings
n memory consistency model - the set of allowable memory access orderings
Strict and sequential consistency
n strict consistency (strongest model)
u the value returned by a read operation is always the same as the value written by the most recent write operation
u hard to implement
n sequential consistency (Lamport 1979)
u the result of any execution of the operations of all processors is the same as if they were executed in some sequential order, and each process' operations appear in that sequence in the order of its program
F Interleaving of operations doesn't matter, as long as all processes see the same ordering
u a read operation may not return the result of the most recent write operation!
F running a program twice may give different results
u little concurrency

Causal consistency
n proposed by Hutto and Ahamad (1990)
n there is no single (even logical) ordering of operations – two processes may see the same operations ordered differently
n operations are sequenced in the same order if they are potentially causally related
n read/write (or two write) operations on the same item are causally related
n all operations of the same process are causally related
n causality is transitive - if a process carries out an operation B that causally depends on a preceding operation A, all subsequent operations by this process are causally related to A (even if they are on different items)
PRAM and processor consistency
n PRAM (Lipton & Sandberg 1988)
u All processes see the memory writes done by a single process in the same (correct) order
u PRAM = pipelined RAM
F Writes done by a single process can be pipelined; it doesn't have to wait for one to finish before starting another
F writes by different processes may be seen in different orders by a third process
u Easy to implement — order the writes on each processor independently of all others
n Processor consistency (Goodman 1989)
u PRAM +
u coherency on the same data item - all processes agree on the order of write operations to the same data item

Weak and release consistency
n Weak consistency (Dubois 1988)
u Consistency need only apply to a group of memory accesses rather than to individual memory accesses
u Use synchronization variables to make all memory changes visible to all other processes (e.g., exiting a critical section)
F all accesses to synchronization variables must be sequentially consistent
F write operations are completed before an access to a synchronization variable
F access to non-synchronization variables is allowed only after the synchronization-variable access has completed
n Release consistency (Gharachorloo 1990)
u two synchronization operations
F acquire - all changes to synchronized variables are propagated to the process
F release - all changes to synchronized variables are propagated to the other processes
F the programmer has to write the accesses to these variables
Comparison of consistency models
n Models differ by difficulty of implementation, ease of use, and performance
n Strict consistency — most restrictive, but hard to implement
n Sequential consistency — widely used, intuitive semantics, not much extra burden on the programmer
u But does not allow much concurrency
n Causal & PRAM consistency — allow more concurrency, but have non-intuitive semantics, and put more of a burden on the programmer to avoid doing things that require more consistency
n Weak and Release consistency — intuitive semantics, but put extra burden on the programmer

Implementation issues
n how to keep track of the location of remote data
n how to overcome the communication delays and high overhead associated with the execution of communication protocols
n how to make shared data concurrently accessible at several nodes to improve system performance
Implementing sequential consistency on page-based DSM
n Can a page move? …be replicated?
n Nonreplicated, nonmigrating pages
u All requests for the page have to be sent to the owner of the page
u Easy to enforce sequential consistency — the owner orders all access requests
u No concurrency
n Nonreplicated, migrating pages (most common approach)
u All requests for the page have to be sent to the owner of the page
u Each time a remote page is accessed, it migrates to the processor that accessed it
u Easy to enforce sequential consistency — only processes on that processor can access the page
u No concurrency

Implementing sequential consistency on page-based DSM (cont.)
n Replicated, migrating pages
u All requests for the page have to be sent to the owner of the page
u Each time a remote page is accessed, it's copied to the processor that accessed it
u Multiple read operations can be done concurrently
u Hard to enforce sequential consistency — must invalidate or update the other copies of the page during a write operation
n Replicated, nonmigrating pages
u Replicated at fixed locations
u All requests to the page have to be sent to one of the owners of the page
u Hard to enforce sequential consistency — must update the other copies of the page during a write operation
Granularity
n Granularity - the size of the shared memory unit
n Page-based DSM
u Single page — simple to implement
u Multiple pages — take advantage of locality of reference, amortize network overhead over multiple pages
F Disadvantage — false sharing
n Shared-variable DSM
u Share only those variables that are needed by multiple processes
u Updating is easier, can avoid false sharing, but puts more burden on the programmer
n Object-based DSM
u Retrieve not only the data, but the entire object — data, methods, etc.
u Have to heavily modify old programs

Thrashing
n Occurs when the system spends a large amount of time transferring shared data blocks from one node to another (compared to the time spent on useful computation)
u interleaved data access by two nodes causes a data block to move back and forth
u read-only blocks are invalidated as soon as they are replicated
n handling thrashing
u the application specifies when to prevent other nodes from moving a block - has to modify the application
u "nailing" a block after transfer for a minimum amount of time t - hard to select t; a wrong selection makes inefficient use of DSM
F adaptive nailing?
u tailoring coherence semantics to object use (as in Munin) – object-based sharing
Page replacement
n What to do when local memory is full?
u swap to disk?
u swap over the network?
u what if the page is replicated?
u what if it's read-only?
u what if it's read/write but clean (dirty)?
u are shared pages given priority over private (non-shared) ones?
Process migration
n main concepts
n PM design objectives
n design issues
n freezing and restarting a process
n address space transfer
n handling messages for moved processes
n handling co-processes

Advantages of process migration
n why migrate processes
n balancing the load:
u reduces the average response time of processes
u speeds up individual jobs
u gains higher throughput
n moving the process closer to the resources it is using:
u utilizes resources effectively
u reduces network traffic
n being able to move a copy of a process (replicate) to another node improves system reliability
n a process dealing with sensitive data may be moved to a secure machine (or just to a machine holding the data) to improve security
Process migration
n The load balancing (load sharing) policy determines:
u if a process needs to be moved (migrated) from one node of the distributed system to another
u which process needs to be migrated
u what is the node to which the process is to be moved
n the process migration mechanism deals with the actual transfer of the process
[Figure: migration timeline between the source node and the destination node — process in execution, execution suspended (freezing time), transfer of control, execution resumed on the destination]

Desirable features of a good process migration mechanism
n Transparency
u object access level - access to objects (such as files and devices) by the process can be done in a location-independent manner
u system call and interprocess communication level - communicating processes should not notice if one of the parties is moved to another node; system calls should be equivalent
n Minimal interference (with process execution) - minimize freezing time
n Minimal residual dependencies - the migrated process should not depend on the node it migrated from, otherwise:
u the previous node is still loaded
u what if the previous node fails?
n Efficiency:
u minimize the time required to migrate a process
u minimize the cost of relocating the process
u minimize the cost of supporting the migrated process after migration
Parts of the process migration mechanism
n freezing the process on its source node and restarting it at the destination node
n moving the process' address space
n forwarding messages meant for the migrant process
n handling communication between cooperating processes that are separated (handling co-processes)

Freezing and restarting of a process
n block the execution of the migrant process, postponing all external communication
u immediate blocking - when not executing a system call
u postponed blocking - when executing certain system calls
n wait for I/O operations:
u wait for fast I/O - disk I/O
u arrange to gracefully resume slow I/O operations at the destination - terminal I/O, network communication
n take a "snapshot" of the process state
u relocatable information - register contents, program counter, etc.
u open files information - names, identifiers, access modes, current positions of file pointers, etc.
n transfer the process state to the destination
n restart the process on the destination node
Address space transfer
n Process state (a few kilobytes):
u contents of registers, program counter, I/O buffers, interrupt signals, etc.
n Address space (several megabytes) - dominates:
u program's code, data and stack
n Several approaches to address space transfer
u total freezing - no execution is done while the address space is transferred - simplest, slowest
u pretransferring - the address space is transferred while the process is still running on the source node; after the transfer, the modified pages are picked up
u transfer on reference - the process is restarted before the address space is migrated - pages are fetched from the source node as the process needs them

Message forwarding
n Three types of messages:
1. received when the process execution is stopped on the source node and has not restarted on the destination node
2. received on the source node after the execution started on the destination node
3. sent to the migrant process after it started execution on the destination node
n approaches:
u re-sending - messages of type 1 and 2 are either dropped or negatively ack-ed; the sender is notified and it needs to locate the migrant process - nontransparent
u origin site - the origin node keeps the info on the current location of the process created there; all messages are sent to the origin, which forwards them to the migrant process - expensive, not fault tolerant
Message forwarding (cont.)
n Approaches to message forwarding:
u link traversal
F messages of type 1 are queued and sent to the destination node as part of the migration procedure
F a forwarding address (link) is left on the source node to redirect messages of type 2 and 3; the link contains the system-wide unique id of a process and its last known location - may not be efficient or fault tolerant
u link update - during the transfer the source node sends a notification (link update) of the transfer to all the nodes with which the process communicates:
F messages of type 1 and 2 are forwarded by the source node
F messages of type 3 are sent directly to the destination node

Co-processes handling
n Need to provide efficient communication between a parent process and its subprocesses
n no separation of co-processes:
u disallow migration of a process if it has children
u migrate the children together with the process
F logical host concept - co-processes are always executed on one logical host, and the logical host is migrated atomically
n home node (origin site):
u all communication between co-processes is handled through the home node - expensive
Introduction to cryptography
n Main concepts
n Design principles
n Cryptosystems
u conventional
F Caesar's cipher
F Simple substitution
u Modern
F Symmetric - DES
F Asymmetric - RSA
n Authentication
u one way
u two way
n two-way authentication and secure channel setup
u symmetric cryptosystems
u asymmetric cryptosystems

Cryptography, main concepts
n P - clear (plain) text, message - readable (intelligible) information
n C - ciphertext - encrypted information
n E - encryption (enciphering) - transforming clear text into ciphertext
n D - decryption (deciphering) - transforming ciphertext back into the original clear text
n encryption algorithm – a mathematical function having the following form:
C = E(P, Ke), where Ke is the encryption key
n decryption algorithm:
P = D(C, Kd), where Kd is the decryption key
n attacker (cryptanalyst, intruder) - a person that tries to discover P (compromise the encryption)
n two entities (users, programs) A and B need to communicate
u if A has Ke and B has a matching Kd - A and B have a one-way private secure communication channel
u if also B has Ke and A has a matching Kd - A and B have a two-way secure communication channel
Cryptosystem and others
n Cryptosystem – a system for encryption and decryption of information
u conventional - the encryption algorithm is designed to be rather complex and hard to guess
u modern - the encryption algorithm is made public but the keys are kept secret - the strength of the algorithm depends on the difficulty of determining Kd
F symmetric – the key is the same for encryption and decryption
F asymmetric – the keys are different
n Cryptology – the science of designing and breaking cryptosystems
n Cryptography – the practice of using cryptosystems to maintain confidentiality of information
n Cryptanalysis – the study of breaking cryptosystems

Cryptosystem design principles
n Shannon's principles
u diffusion - spreading dependencies such that the amount of plaintext needed to break the system is maximized
u confusion – changing a piece of information so that the output has no obvious relation to the input
n Exhaustive search principle – determining the key should require an exhaustive search of the key space
Conventional cryptosystems
n the Caesar cipher: every letter is transformed into the third (or some other) letter after it in the alphabetical sequence (with wrap-around):
E = M -> (M+3) mod 26
u "advanced operating systems" -> "dgydqfhg rshudwlqj vbvwhpv"
n the transformation is linear - the number of keys (shifts) is only 25 - easy to guess
n simple substitution cipher: the alphabet can be mapped to any permutation of its letters
u each permutation is a key - there are 26! (> 10^26) keys. Exhaustive search is very expensive.
n substitution preserves the frequency distribution of the letters of the alphabet - statistical analysis is possible.

Modern cryptosystems
n Symmetric - Ke and Kd are similar (possibly can be easily derived from one another) – not as computationally intensive as asymmetric. Useful if both encryption and decryption are performed by private parties
u need a secure channel to exchange keys between the communicating parties. Can use an insecure channel for encrypted message transmission
u example - Data Encryption Standard (DES)
n Asymmetric - Ke and Kd are dissimilar. It is (computationally) hard to derive Kd from Ke. Ke does not need to be kept secret.
u computationally expensive and cannot be used for bulk data encryption
u can use an insecure channel for both key and message transmission
u can encrypt with the public key and sign (digital signature) with the private key
u Example - the method of Rivest-Shamir-Adleman (RSA)
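A few lines of Python reproduce the Caesar-cipher example above (shift of 3 with wrap-around); the function name and the handling of non-letters are my own choices.

def caesar(text, shift=3):
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('a') if ch.islower() else ord('A')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)                       # spaces etc. pass through
    return ''.join(out)

print(caesar("advanced operating systems"))             # dgydqfhg rshudwlqj vbvwhpv
print(caesar(caesar("advanced operating systems"), -3))  # shifting back decrypts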
DES
n Data Encryption Standard
u developed by IBM, widely used
u symmetric
n Used to encrypt 64-bit data blocks with a 56-bit key; the key is expanded to 64 bits for error correction
n Encryption algorithm
u Initial permutation
u 16 identical iterations with a 48-bit key Ki derived from the encryption key:
F the 32-bit right side is expanded to 48 bits by duplicating some bits
F these bits are x-or-ed with Ki
F the output is shrunk to 32 bits
u Inverse of the initial permutation
n Decryption is encryption in reverse
n Finding the key requires an exhaustive search over 2^56 values

RSA
n Invented by Rivest, Shamir and Adleman
n Asymmetric
n Encryption (public) key is the pair (e, n)
n Decryption (private) key – (d, n)
n Computing keys
u n = p×q where p and q are large primes
u Pick d such that GCD(d, (p−1)×(q−1)) = 1, i.e., d and (p−1)×(q−1) are relatively prime
u Pick e such that e×d (modulo (p−1)×(q−1)) = 1
n Even though e and n are public, to determine d an intruder needs to factor n into primes; if n is large (say 200 digits) the factorization can be done by exhaustive search only
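A worked sketch of the key-computation steps just listed, using deliberately tiny primes — p=23 and q=7, the product 23 × 7 = 161 that survives on the RSA-example slide below; the choice of d and of the message are mine.

from math import gcd

p, q = 23, 7
n = p * q                      # 161, the public modulus
phi = (p - 1) * (q - 1)        # 132
d = 5                          # private exponent, must satisfy gcd(d, phi) == 1
assert gcd(d, phi) == 1
e = pow(d, -1, phi)            # public exponent: e*d mod phi == 1 (here e == 53)

encrypt = lambda m: pow(m, e, n)
decrypt = lambda c: pow(c, d, n)

m = 2
c = encrypt(m)
print("e =", e, " ciphertext =", c, " decrypted =", decrypt(c))
assert decrypt(c) == m         # decrypt(encrypt(m)) recovers the message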
RSA example
[Figure: worked RSA example; only the product 23 × 7 = 161 survives from the original slide]

Authentication
n authentication – verifying the identity of communicating entities to each other
n environment: message passing system (no shared memory), insecure channels, no past knowledge
n authentication services
u one-way authentication of communicating entities – verifies the identity of one of the two communicating entities to the other entity
u two-way authentication of communicating entities – mutual authentication (both communicating entities verify each other's identity)
Threats
n the intruder can gain access to any part of the network and can alter or copy any part of a message
u has knowledge of the authentication protocol
F knows the message types, purpose and order
F knows the time of protocol initiation
u can replay earlier recorded messages
u can impersonate one of the communicating parties
u cannot understand the contents of encrypted messages

Two-way authentication and secure channel setup
n If two entities A and B want to communicate they must share a Ke/Kd pair. How can the two entities exchange keys if there is no secure channel?
n Key distribution center (KDC) – holds keys for each communicating entity
Symmetric systems, two-way auth. and channel setup
[Figure: m1 from A to the KDC, m2 from the KDC back to A]
n two phases for authentication
u obtaining the shared (conversation) key
u communicating the conversation key
n Obtaining the conversation key
u m1 = (Ra, IDa, IDb) - where Ra – id of the request (different every time), IDa - id of process A, IDb - id of process B
u m2 = E((Ra, IDb, Kab, C1), Ka) - where Kab – conversation key, C1 = E((Kab, IDa), Kb), Ka - private key of A
u what if IDb is not used in m2?
F the intruder can substitute X for B in m1 and have A communicate with X instead of B
u what if Ra is not used in m2?
F the intruder can replay a previously recorded message from the KDC to A, forcing A to reuse a previous conversation key

Symmetric systems, two-way auth. and channel setup (cont.)
[Figure: m1, m2 between A and the KDC; m3, m4, m5 between A and B]
n communicating the conversation key
u m3 = C1 where C1 = E((Kab, IDa), Kb)
F the key is communicated
u problem: the intruder can play back A's message to B, forcing it to reuse the conversation key
u solution:
F m4 = C2 = E(Nr, Kab) where Nr is a nonce (never repeating number)
F m5 = C3 = E(Nr, Kab) (presumably a transformed nonce, e.g. Nr−1, so that A's answer differs from m4 and proves knowledge of Kab)
Asymmetric systems, two-way auth. and channel setup
[Figure: m1, m2 between A and the KDC; m3, m4, m5 between A and B]
n two phases
u obtaining the public key
u handshake
n obtaining the public key (by A; B's procedure is similar)
u m1 = (A, B)
u m2 = Esdc(PB, B)
n handshake
u m3 = Epb(Na, A)
u m4 = Epa(Na, Nb)
u m5 = Epb(Nb)
Example security systems
n Kerberos
n Secure shell

Kerberos
n Kerberos is a network authentication system
n developed at MIT in the late eighties
n features:
u authenticates users on an untrusted network
u the clear password is never sent over the network
u single sign-on facility (the user enters the password only once)
n uses DES (a symmetric cryptosystem) to encrypt messages
n Kerberos server - key distribution server
u has to be protected and physically secured; preferably no other apps run on it
u contains:
F authentication database - contains user IDs and passwords (secret keys) for all users (and machines) in the system; the server/user keys are distributed physically as part of installation
F authentication server - verifies the user's identity at the time of login
F ticket-granting server - supplies tickets to clients for permitting access to other servers in the system
Kerberos
n client - runs on public workstations, obtains permission from the Kerberos server for the user to access resources; untrusted - the user using the client must be authenticated before being allowed to access resources
n application server - provides a certain service to the client after the authenticity of the client has been verified by the Kerberos server

Kerberos authentication protocol
[Figure: message exchange among the client, the authentication server, the ticket-granting server, and the application server]
Kerberos authentication protocol (cont.)
n A - authentication server, G - ticket-granting server, C - client, S - application server
[Figure: protocol message exchange; C1 - ticket-granting ticket, C4 - service-granting ticket]

Problems with Kerberos
n Not effective against password-guessing attacks
n an application has to be Kerberized (modified to work with Kerberos)
n after authentication the information is transmitted without encryption
n users' passwords are stored in unencrypted form on the Kerberos server
n loose clock synchronization is necessary
n an appropriate time to live for tickets needs to be configured
Secure shell (SSH)
n SSH - authentication and encryption system
n originally developed at Helsinki University of Technology, now maintained by SSH Communications Security, Ltd
n features:
u authentication of users on an untrusted network
u clear passwords are never sent over the network
u communication between machines is encrypted - multiple encryption algorithms are available (the algorithms may be automatically selected)
u communication may be (automatically) compressed
u tunneling and encryption of arbitrary connections
n client-server architecture:
u ssh server (daemon) - handles authentication, encryption and compression
u ssh client - handles communication on the client side

SSH communication
n SSH uses both asymmetric (for authentication) and symmetric (for data encryption) cryptosystems.
n Stage 1. connection negotiation:
u Each machine running SSH has an (asymmetric) key-pair called the host key. A server key-pair is generated when a client contacts the server
u when a client contacts a server, the server sends the server and host public keys
u the client stores the host's public key between sessions. The client compares the host's public key it receives with the one it stores to make sure the host is correct
u the client generates a random number to serve as the session key. It encrypts this session key using both the host and server keys and sends it to the server
u further communication is encrypted with the session key using one of the symmetric data encryption methods; the communication can be optionally compressed
SSH communication (cont.)
n Stage 2. user authentication:
u multiple authentication methods can be used:
F the server has a list of user x machine pairs from which it accepts connections without further authentication
F the server possesses a list of user's public key x machine pairs; the client has the private key and has to authenticate itself by providing a nonce encrypted with this private key
F password authentication - the server stores the login x password of the client (or can verify it with a password server)
n Stage 3. command execution:
u after authentication the client requests a command to be executed on the server - may be a shell command
Clusters
§ Distributed system def. review
§ Cluster definition
§ Clusters vs. distributed systems
§ Cluster example 1 – reliable file service
§ Cluster example 2 – fast web service
§ Classification of clusters

What is a distributed system (again)
§ "True" Distributed Operating System
u Loosely-coupled hardware
• No shared memory, but provides the "feel" of a single memory
u Tightly-coupled software
• One single OS, or at least the feel of one
u Machines are somewhat, but not completely, autonomous
[Figure: processors M1–M5 with private memories connected by a network, sharing Disk1, Printer4 and Disk5]
Clusters
§ A subclass of distributed systems
§ a small-scale, (mostly) homogeneous (the same hardware and OS) array of computers (located usually at one site) dedicated to a small number of well-defined tasks, in solving which the cluster acts as one single whole
§ typical tasks for "classic" distributed systems:
u file services from/to distributed machines over a (college) campus
u distributing the workload of all machines on campus
§ typical tasks for a cluster:
u high-availability web service/file service, other high-availability applications
u computing "farms"

Clusters (C) vs. Distributed systems (D)
§ structure
u [C] - homogeneous - purchased to perform a certain task
u [D] - heterogeneous - put together from the available hardware
§ scale
u [C] - small scale - don't have to make sure that the setup scales
u [D] - medium/large - have to span a (potentially) large number of machines
§ task
u [C] - specialized - made to perform a small set of well-defined tasks
u [D] - general - usually have to be general-user computing environments
§ price
u [C] - (relatively) cheap
u [D] - free(?)/expensive
§ reliability
u [C] - as good as it needs to be
u [D] - high/low?
§ security
u [C] - nodes trust each other
u [D] - nodes do not trust each other
Cluster examples
(pictures taken from "In Search of Clusters", G.F. Pfister, 1998)
§ branches get access to shared information even if one of the links or computers fails

Cluster examples (cont.)
§ active machine - serves files to the network of computers
§ standby machine - listens to the network and updates its own copy of the files
§ in case of machine failure - the standby machine takes over the file service transparently to users
Cluster examples (cont.)
§ Dispatcher (sprayer) machine - sends the web requests to the server machines and makes sure that the servers are evenly loaded
§ web service continues even if a server fails

Classification of clusters
§ By architecture:
u with hardware additions - OpenVMS, Tandem Himalaya, Parallel Sysplex
u pure software - Beowulf, …
§ By task. There is no dividing line between clusters and true distributed systems - as we add features the clusters start to resemble D.S.
u availability
u batch processing
u database
u generic (scientific) computation
u full clusters (distributed systems) - single system image
High-availability and disaster recovery
§ Dependability concepts:
u fault-tolerance, high-availability
§ High-availability classification
§ Types of outages
§ Failover
u Replication
u Switchover
§ Watchdogs
§ Disaster recovery

Clusters
§ a subclass of distributed systems
§ a small-scale, (mostly) homogeneous (the same hardware and OS) array of computers (located usually at one site) dedicated to a small number of well-defined tasks, in solving which the cluster acts as one single whole
§ typical tasks for "classic" distributed systems:
u file services from/to distributed machines over a (college) campus
u distributing the workload of all machines on campus
§ typical tasks for a cluster:
u high-availability web service/file service, other high-availability applications
u computing "farms"
Dependability concepts
§ two aspects of dependability
u reliability – probability that the system survives till a certain time t
• mean time to failure MTTF (expected life)
u availability - probability that the system operates correctly at a given point in time
• mean time to repair MTTR – speed of repairing
• availability = MTTF/(MTTF + MTTR)
§ can a system be available but not reliable?
§ does higher reliability improve the system's availability?
§ what kind of systems need to be available? reliable?

High-availability scale
§ a system's availability is usually expressed as a percentage of uptime or as a class

  availability    total accumulated outage per year    class (# of 9s)
  90%             more than a month                    0/1
  99%             under 4 days                         1/2
  99.9%           under 9 hours                        2/3
  99.99%          about 1 hour                         3/4
  99.999%         over 5 minutes                       4/5
  99.9999%        about half a minute                  5/6
  99.99999%       about 3 seconds                      6

§ a system is classified by the amount of downtime it allows
u 1 - campus networks
u 2 - usual non-clustered commodity stand-alone machines
u 3 - usual cluster (4 possible)
u 5 - telephone switches
u 6 - in-flight aircraft computers
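The availability formula and the "nines" table translate directly into a few lines of arithmetic; the numbers below are illustrative.

MINUTES_PER_YEAR = 365 * 24 * 60

def availability(mttf_hours, mttr_hours):
    return mttf_hours / (mttf_hours + mttr_hours)

def downtime_per_year(avail):
    return (1 - avail) * MINUTES_PER_YEAR            # minutes of outage per year

print(round(availability(mttf_hours=1000, mttr_hours=1), 5))   # ~0.999 (three nines)
for nines in (0.99, 0.999, 0.9999, 0.99999):
    print(nines, round(downtime_per_year(nines)), "min/year")  # matches the table rows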
Types of outages, failover
§ two types of outages
u unplanned - caused by faults
u planned - needed for maintenance of the system (backups, OS upgrades, hardware upgrades, etc.)
§ certain systems should work reliably only part of the time - stock-exchange computers, in-flight computers
§ if the system should be available round the clock the objective is to minimize both types of outages
§ simplest high-availability cluster: backup server with failover
u failover - the process of transferring control from the failed server to the backup server
u failback - the process of transferring control from the backup server back to the primary server
§ a cluster with failover helps avoid planned as well as unplanned outages

Replication and Switchover
§ Two types of cluster failover organization:
u replication (shared-nothing cluster) – the backup server keeps its own copy of the data
u switchover (shared-data cluster) – the backup has access to the storage devices used by the primary

  Replication                                      | Switchover
  + easier to add to an existing single machine    | - harder to add - must modify existing cabling
  + easier to configure                            | - harder to configure
  + can use any old I/O adapters and controllers   | - requires specialized I/O devices
  + can use simple storage units                   | - must use hardened storage like RAID
  - 1-to-many backup is hard                       | + 1-to-many backup possible as long as the interconnect allows
  - requires another copy of storage               | + only one copy of storage used
  - CPU overhead in normal operation -             | + no overhead in normal operation
    synchronization needed                         |
  - failback requires additional copying           | + no copying on failback
Watchdogs
§ a watchdog is a mechanism of notification (and possible correction) of a failure
§ simplest (software) watchdog - a process monitoring application processes. If the monitored process fails the watchdog may take recovery action.
u the watchdog can run on the same machine as the application program - may not be very useful if the machine crashes
u or on a different machine - how is communication carried out?
§ the application process may be programmed to cooperate with the watchdog; three ways of cooperation (see the heartbeat sketch after the next slide):
u heartbeat - periodic notification sent to the watchdog by the application process to confirm its correct execution. Alternate heartbeat paths - network, RS-232, SCSI
• application initiated
• watchdog initiated
u idle notification - the application informs the watchdog that it is idle
u error notification - the application notifies the watchdog that it encountered an error it cannot correct

Disaster recovery
§ Disaster - a failure that affects large portions of a site or the whole site - fire, flood, storm damage
§ usual recovery technique - resume operations on a system outside the scope of the disaster

  tier   description
  0      no disaster recovery
  1      backups are periodically taken and stored off premises
  2      backups are taken to a "hot site" where they can be loaded on a secondary system if necessary
  3      electronic vaulting - a network connects the primary and secondary sites, backups are transferred over the network
  4      active secondary - data is sent over the wire, the data is kept loaded and ready to run on the secondary
  5      the secondary is kept completely up-to-date
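A thread-based sketch of the heartbeat cooperation described in the Watchdogs slide above; Watchdog, restart_app and the timing constants are illustrative, and a real watchdog would live in a separate process or on another machine.

import threading, time

class Watchdog:
    def __init__(self, timeout, recover):
        self.timeout, self.recover = timeout, recover
        self.last_beat = time.monotonic()

    def heartbeat(self):                        # called by the monitored application
        self.last_beat = time.monotonic()

    def run(self, checks=5, interval=1.0):
        for _ in range(checks):
            time.sleep(interval)
            if time.monotonic() - self.last_beat > self.timeout:
                self.recover()                  # take recovery action

def restart_app():
    print("missed heartbeat: restarting application")

wd = Watchdog(timeout=2.0, recover=restart_app)
app = threading.Thread(target=lambda: [time.sleep(0.5) or wd.heartbeat()
                                       for _ in range(3)], daemon=True)
app.start()             # the app stops sending heartbeats after ~1.5 s
wd.run()                # the watchdog notices the silence and "restarts" it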
Scalability Terminology: Farms, Clones, Partitions, and Packs: RACS and RAPS
Bill Devlin, Jim Gray, Bill Laing, George Spix
Microsoft Research, Dec. 1999
(Based on a presentation by Hongwei Zhang, Ohio State U.)

Outline
§ Why and how to scale
§ Ways to organize massive computation
u Farm
u Geoplex
§ Ways to scale a farm
u Clone (RACS)
u Partition (RAPS)
Why and how to scale?
§ why
u Server systems must be able to start small
• small-size company (garage-scale) vs. international company (kingdom-scale)
u The system should be able to grow as demand grows
• eCommerce made system growth more rapid & dynamic
• ASPs also need dynamic growth
§ how
u Scale up - expand a system by incrementally adding more devices to an existing node – CPUs, disks, NICs, etc.
• inherently limited
u Scale out – expand the system by adding more nodes – convenient (computing capacity can be purchased incrementally), no theoretical scalability limit

Farm/Geoplex
§ Farm - the collection of servers, applications and data at a particular site
u features:
• functionally specialized services (email, WWW, directory, database, etc.)
• administered as a unit (common staff, management policies, facilities, networking)
§ Geoplex – a replicated (duplicated?) farm at two or more sites
u disaster protection
u may be
• active-active: all farms carry some of the load
• active-passive: one or more farms are hot standbys (waiting for fail-over of the corresponding active farms)
Clone
§ A replica of a server or a service
§ Allows load balancing
§ External to the clones
u an IP sprayer (like Cisco LocalDirector™) dispatches (sprays) requests to different nodes in the clone to achieve load balancing
§ Internal to the clones
u an IP sieve, like Network Load Balancing in Windows 2000
u every request arrives at every node in the clone; each node intelligently accepts a part of these requests
u distributed coordination among nodes

RACS
§ RACS (Reliable Array of Cloned Services) – a collection of clones for a particular service
§ two types
u Shared-nothing RACS – each node duplicates all the storage locally
u Shared-disk RACS – all the nodes (clones) share a common storage manager. Stateless servers at different nodes access a common backend storage server
RACS advantages
§ scalable – a good way to add processing power, network bandwidth, and storage bandwidth to a farm
§ available
u nodes can act as backups for one another: if one node fails, the other nodes continue to offer service (probably with degraded performance)
u failures can be masked if node- and application-failure detection mechanisms are integrated with the load-balancing system or with client applications
§ easy to manage – administrative operations on one service instance at one node can be replicated to all others

Problems with RACS
§ Shared-nothing RACS
u not a good way to grow storage capacity: updates at one node must be applied to all other nodes' storage
u problematic for write-intensive services: all clones must perform all writes (no throughput improvement) and need subtle coordination
§ Shared-disk RACS
u the storage server should be fault-tolerant for availability (only one copy of the data)
u still requires subtle algorithms to manage updates (such as cache validation, lock managers, transaction logs, etc.)
Partitions and Packs
§ Partition – the service is grown by dividing the data among nodes
u Only one copy of the data in each partition – availability is not improved
§ Pack – each partition is implemented by a set of servers
u shared disk
u shared nothing
• active/active – all members service the partition
• active/passive – one member services the partition, the others stand by

How to partition, RAPS
§ typically, the application middleware partitions the data and workload by object (see the sketch after this slide):
u mail servers partition by mailboxes
u sales systems partition by customer accounts or product lines
§ challenges
u when a partition (node) is added, the data should be automatically repartitioned among the nodes to balance the storage and computational load
u the partitioning should automatically adapt as new data is added and as the load changes
§ RAPS (Reliable Array of Partitioned Services) – a collection of nodes that support a packed-partitioned service
u provides both scalability and availability
u better performance than RACS for write-intensive services
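The sketch referenced above: partitioning mailboxes across nodes by hashing, and the repartitioning cost that appears when a node is added. Hash-mod placement is only the simplest possible scheme; real systems use smarter schemes to limit how much data moves.

import hashlib

def owner(mailbox, nodes):
    h = int(hashlib.md5(mailbox.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]            # place the mailbox on one node

mailboxes = ["alice", "bob", "carol", "dave", "erin"]
old_nodes = ["node0", "node1", "node2"]
new_nodes = old_nodes + ["node3"]           # grow the RAPS by one partition

placement_before = {m: owner(m, old_nodes) for m in mailboxes}
placement_after = {m: owner(m, new_nodes) for m in mailboxes}
moved = [m for m in mailboxes if placement_before[m] != placement_after[m]]
print(placement_after)
print("mailboxes that must be repartitioned:", moved)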
RACS vs. RAPS
§ Clones and RACS
u for read-mostly applications with low consistency and modest storage requirements (<= 100 GB)
• Web/file/security/directory servers
§ Partitions and RAPS
u for update-intensive and large database applications (routing requests to specific partitions)
• Email/instant messaging/enterprise resource planning (ERP)/record keeping

Multi-tier application example
§ functional separation
u front tier: web and firewall services (read mostly)
u middle tier: file servers (read mostly)
u data tier: SQL (database) servers (update intensive)
Summary
§ scalability technique
u replicate a service at many nodes
§ simpler form of replication
u duplicate both programs and data: RACS
§ for large databases or update-intensive services
u data is partitioned: RAPS
u packs make partitions highly available
§ against disaster
u the entire farm is replicated to form a geoplex
Fault tolerance in distributed systems
n Motivation
n robust and stabilizing algorithms
n failure models
n robust algorithms
u decision problems
u impossibility of consensus in asynchronous networks with crash failures
u consensus and agreement with initially-dead processes - knot calculation algorithm
n stabilization
u Dijkstra's K-state algorithm

Why fault tolerance
n Distributed systems encompass more and more individual devices
n the chance of failure in a distributed system can grow arbitrarily large as the number of its components increases
n distributed systems can hardly be restarted after a failure
n distributed systems are subject to the partial failure property: when one of the components fails the system may still be able to function at decreased capacity
n as the system grows in size
u it becomes more likely that some component fails
u it becomes less likely that the failure occurs in all components
n thus systems able to deal with failures are attractive
Robust and stabilizing algorithms
n An algorithm is robust (masking) if the correct operation of the algorithm is ensured even in the presence of specified failures
n an algorithm is stabilizing if it is able to eventually start working correctly regardless of the initial state
u a stabilizing algorithm does not guarantee correct behavior during recovery
u a stabilizing algorithm is able to recover from faults regardless of their nature (as soon as the influence of the failure stops)
n an algorithm can mask certain kinds of failures and stabilize from others
u for example: an algorithm may mask message loss and stabilize from topology changes

Failure Models
n Faults form a hierarchy on the basis of their severity
n benign
u initially dead - a process is initially dead if it does not execute a single step of its algorithm
u crash model - a process executes steps correctly up to some moment (the crash) and stops executing thereafter
n malign - Byzantine - a process executes arbitrary steps (not necessarily in accordance with its local algorithm). In particular a Byzantine process sends messages with arbitrary content
n an initially dead process is a special case of a crashed process, which is a special case of a Byzantine process
u if an algorithm is Byzantine-robust it can also tolerate crashes and initially dead processes
u if a problem cannot be solved for initially dead processes, it cannot be solved in the presence of crashes or Byzantine failures
n other fault models can be defined in between
Decision problems
n The study of robust algorithms centers around decision problems
n a decision problem requires that each (correct) process eventually and irreversibly arrives at a "decision" value
n decision problem requirements:
u termination - all correct processes decide (they cannot indefinitely wait for dead processes)
u consistency - the decisions of correct processes should be related;
F consensus problem - the decisions are equal
F election problem - only one process arrives at "1" (the leader), the others at "0" (non-leaders)
u non-triviality - different outputs are possible in different executions of the algorithm

Impossibility of consensus
n A state is reachable if there is a computation that contains it
n Each process has a read-only input variable xp and a write-once output variable yp initially holding b
n A consensus algorithm is 1-crash robust if it satisfies the following properties:
u termination - in every 1-crash fair execution all correct processes decide
u agreement - if, in any reachable state, yp ≠ b and yq ≠ b for correct processes p and q, then yp = yq
u non-triviality - there exist reachable states such that for some p, yp = 1 in one state and yp = 0 in another
n Theorem: there are no asynchronous, deterministic 1-crash robust consensus algorithms
n intuitively this result is explained by the fact that in an asynchronous system it is impossible to distinguish a crashed process from an infinitely slow one
What is possible
n Consensus in the initially-dead-process fault model is possible
n weaker coordination problems than consensus (such as renaming) are solvable
u given: a set of processes p1,...,pN, each process with a distinct identity taken from an arbitrarily large domain. Each process has to decide on a unique new name from a smaller domain 1,…,K
n randomized algorithms are possible even for Byzantine failures
n weak termination - termination is required only when a given process (the general) is correct; the objective is for all processes to learn the general's decision; solvable even in the presence of Byzantine faults
n synchronous systems are significantly more fault tolerant

Consensus with initially dead processes
n If processes are only initially dead, consensus is possible.
n Based on the following knot-computation algorithm
n a knot is a strongly connected sub-graph with no outgoing edges
n the objective is for all correct processes to agree on a subset of the correct processes
n L stands for (N+1)/2
n we assume that there are at least L alive processes
n first phase: each process p:
u sends messages to all processes in the system
u collects at least L messages in a set Succp
n a process is a successor of p if p got a message from it - this defines a graph G in the system
n thus each correct process has L successors
n an initially dead process does not send any messages. Thus there is a knot in G containing correct processes
Knot calculation algorithm
n Since each correct process has an outdegree of L, the knot has at least L processes
n since L > N/2, G contains just one knot. Let's call it K
n since p has L successors, one of them is in K; thus all nodes in K are descendants of p

Knot calculation algorithm (cont.)
n second phase:
u each process collects a list of its descendants. Since processes do not fail, no deadlock occurs at this stage
n in the end a process has a set of processes and their descendants, which allows it to compute the knot in G
n it is possible to do election and consensus on the basis of the knot calculation algorithm
u election - since processes agree on the knot, they can all agree on the leader by electing the process with the highest id in K
u consensus - all processes collect the input values of the processes of K and output the value that occurs most often
Guarded Command Language (GCL)
l program form:
*[
  guard_1 → command_1
  [] guard_2 → command_2
  ...
]
l *[ … ] - execution repeats forever
l guard_i - a binary predicate on local variables, received messages, etc.
l command_i - a list of assignment statements; a command is executed when its corresponding guard is true; guards are selected nondeterministically
l Advantages:
l GCL allows one to easily reason about algorithms and their executions: the program counter position is irrelevant or less important
l we don't have to consider execution starting in the middle of a guard or command (serializability property)

Dijkstra's K-State Token Circulation Algorithm
• Objective: circulate a single token among processors
• the system consists of a ring of K processors (ids 0 through K−1)
• each processor maintains a state variable s; a processor can see the state of its left (smaller-id) neighbor
• a guard evaluating to true means the processor has the privilege (token)
• Processor p0:  *[ s_0 = s_{K−1} → s_0 := (s_0 + 1) mod K ]
• Processor p_i (0 < i < K):  *[ s_i ≠ s_{i−1} → s_i := s_{i−1} ]
• all processors evaluate their guards; only one at a time changes state (C-Daemon)
• after the state change all processors re-evaluate their guards
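A small simulation of the two guarded rules above under a central daemon; starting from an arbitrary state, the ring converges so that exactly one processor is privileged at a time. The Python names and the choice of K are mine.

import random

K = 5                                          # ring of K processors, states 0..K-1
s = [random.randrange(K) for _ in range(K)]    # arbitrary (possibly corrupted) start

def privileged(i):
    # p0 holds the privilege when it equals its left neighbor (the last processor);
    # every other pi holds it when it differs from its left neighbor
    return s[0] == s[K - 1] if i == 0 else s[i] != s[i - 1]

for _ in range(100):                           # central daemon: one enabled move at a time
    movers = [i for i in range(K) if privileged(i)]
    i = random.choice(movers)                  # at least one processor is always enabled
    if i == 0:
        s[0] = (s[0] + 1) % K                  # *[ s0 = s(K-1) -> s0 := (s0+1) mod K ]
    else:
        s[i] = s[i - 1]                        # *[ si != s(i-1) -> si := s(i-1) ]

print(s, "privileged processors:", [i for i in range(K) if privileged(i)])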