What is a distributed system?
n From various textbooks:
l “A distributed system is a set of physically separate computers that appear to the users of the system as a single computer.”
l “A distributed system is a collection of independent processors connected by one or more communication links.”
l “A distributed system consists of a collection of autonomous computers linked to a computer network and equipped with distributed system software.”
l “A distributed system is a collection of processors that do not share memory or a clock.”
l “Distributed systems is a term used to define a wide range of computer systems, from a weakly-coupled system such as wide area networks, to very strongly coupled systems such as multiprocessor systems.”
[figure: processors P1–P5 connected by a network]
n Is every system with >2 computers a distributed system?
l Email, ftp, telnet, world-wide-web
l Network printer access, network file access, network file backup
l We don’t usually consider these to be distributed systems…
Two Taxonomies for Classifying Computer Systems
n Michael Flynn (1966)
l SISD — single instruction, single data
l SIMD — single instruction, multiple data
l MISD — multiple instruction, single data
l MIMD — multiple instruction, multiple data
n Tanenbaum (date?) — classification of MIMD parallel and distributed computers
l tightly coupled — multiprocessors (shared memory)
l loosely coupled — multicomputers (distributed / private memory)
Network layer:
(2) transfer data (packets) between end systems
l Examples: IP (connectionless), X.25 (connection-oriented)

File transfer (FTP):
n Sends files from one system to another under user command
n Handles both text and binary files
n Supports userids and passwords
n Ethernet transmission – Carrier Sense Multiple Access with Collision Detection (CSMA/CD)
l Carrier sense: listen before broadcasting, defer until the channel is clear, then broadcast
l Collision detection: listen while broadcasting
n If two hosts transmit at the same time — a collision — the data gets garbled
n Each jams the network (a short jam signal is issued), then waits a random (but increasing) amount of time, and tries again

n Token ring transmission
l To transmit, a host waits for a free token, attaches its message to it, sets the token status to busy, and sends it on
l Destination removes the message, sets the token status to free, and sends it on
n Advantage: not sensitive to load
n Disadvantage: complexity – token maintenance is complex
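The “random (but increasing)” wait is classically binary exponential backoff. A minimal sketch in Python (the naming is mine; the 51.2 µs slot time is the classic 10 Mbps Ethernet value, and capping the exponent at 10 doublings follows standard Ethernet practice):

```python
import random

def backoff_delay(attempt, slot_time=51.2e-6):
    """Binary exponential backoff: after the n-th collision, wait a random
    number of slot times drawn from [0, 2^min(n,10) - 1]."""
    k = min(attempt, 10)                  # cap the exponent as Ethernet does
    return random.randint(0, 2**k - 1) * slot_time

# A host that has collided 3 times in a row waits up to 7 slot times:
print(backoff_delay(3))
```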
ICMP
n A part of IP that is less widely known is the Internet Control Message
Protocol (ICMP)
l Allows gateways and hosts to exchange bootstrapping information,
report errors, and test the liveliness of the network
l ping /usr/sbin/ping
n Tests that destination is up and reachable
• synchronous — send blocks until receive executes on the receiving computer - safer, easier, less concurrent
• asynchronous — multiple sends can be executed before receives (messages buffered) - more dangerous (what to do with messages to a crashed receiver?), complex, concurrent

• Link is associated with exactly two processes
o Between any two processes, there exists at most one link
o The link may be unidirectional, but is usually bidirectional

§ Indirect communication — communicate using mailboxes (ports), (usually) owned by the receiver
o send(mailbox, message)
o receive(mailbox, message)
9. Local kernel passes message(s) to client stub
10. Client stub unpacks result(s) and returns them to client app.

• Use a binding server (binder)
o Servers register / deregister their services with the binding server
o When a client calls a remote procedure for the first time, it queries the binding server for a registered server to call
u tree algorithm
u echo algorithm
u polling algorithm
States and Actions
n An assignment of values to all variables in the distributed system is a (global) state of the distributed system. Let Z be the set of the system’s states
u if the system is message passing, the state includes an assignment of messages to communication channels
n an atomic operation is an operation that cannot interleave with other operations
n an atomic operation takes the system from one state to another
n a distributed algorithm defines a system, a set of initial states, and a set of possible atomic operations
n an operation is enabled at a certain state if it can be executed at that state
n the execution of an action is an event
n fixpoint (quiescent state) – a state where none of the actions are enabled

Computations
n a computation is a sequence of states such that
u the first state is an initial state and each subsequent state is obtained by executing an enabled action at the preceding state
u a computation is either infinite or ends in a fixpoint
n a computation is weakly fair if no action is enabled in infinitely many consecutive states
n prefix – a finite segment of a computation that starts in an initial state and ends in a state
n suffix – a computation segment that starts in a state and is either infinite or ends in a fixpoint
u a computation is obtained by joining a prefix and a suffix
u can two separate computations share a prefix? a suffix? both?
Program Properties
n property – a set of correct computations
u safety – there is a set of prefixes none of which a correct computation may contain (something “bad” never happens)
u liveness – for every prefix there is a suffix whose addition creates a correct computation (something “good” eventually happens)
n every property is an intersection of a safety and a liveness property
n example:
u mutual exclusion problem (MX) – a set of processes alternate executing a critical section (CS) of code and other code
u a solution to the mutual exclusion problem has to satisfy three properties:
F exclusivity – only one process at a time is allowed to enter the CS

Causality relationship
n An event is usually influenced by part of the state.
n two consecutive events influencing disjoint parts of the state are independent and can occur in reverse order
n this intuition is captured in the notion of the causality relation ≺
n for message-passing systems:
u if two events e and f are different events of the same process and e occurs before f, then e ≺ f
u if s is a send event and r is the corresponding receive event, then s ≺ r
n for shared-memory systems:
u a read and a write on the same data item are causally related
Tree algorithm
n Operates on tree networks (can work on a spanning tree of an arbitrary network) - no root, edges are undirected (bi-directional)
n leaves of the tree initiate the algorithm
n if a process has received a message from all neighbors but one (initially true for leaves), the process sends a message to the remaining neighbor
n if a process gets messages from all neighbors - it decides
n how many processes decide?
n how many messages are exchanged in the algorithm?
n is this a wave algorithm? (a simulation sketch follows the polling algorithm below)

Polling algorithm
n Works on cliques (complete networks)
n one initiator (centralized)
n how many processes decide?
n excluding the forall statement, how many processes can decide? What are these processes?
n why do we need the forall statement?
n what other topologies can this algorithm be used in?
n how many messages are sent in the algorithm?
n is this a wave algorithm?
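A single-threaded simulation sketch of the tree algorithm (the names tree_wave, neighbors, and received are mine, not from the slides). It also answers the first question: exactly two neighboring processes decide.

```python
from collections import deque

def tree_wave(neighbors):
    """Simulate the tree algorithm. `neighbors` maps each process to the
    set of its neighbors in the tree."""
    received = {p: set() for p in neighbors}
    sent = {p: False for p in neighbors}
    deciders, msgs = set(), deque()

    def maybe_send(p):
        missing = neighbors[p] - received[p]
        if len(missing) == 1 and not sent[p]:
            sent[p] = True
            msgs.append((p, next(iter(missing))))   # send to remaining neighbor

    for p in neighbors:
        maybe_send(p)                               # leaves start the wave
    while msgs:
        p, q = msgs.popleft()
        received[q].add(p)
        if received[q] == neighbors[q]:
            deciders.add(q)                         # heard from all neighbors
        maybe_send(q)
    return deciders

# On the path a-b-c, exactly two neighboring processes decide:
print(tree_wave({'a': {'b'}, 'b': {'a', 'c'}, 'c': {'b'}}))
```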
Chang’s Echo algorithm
n Works on networks of arbitrary topology
n one initiator (centralized)
n the initiator sends messages to all neighbors
n when a non-initiator receives the first message, it forwards it to all other neighbors; when it gets tokens from all other neighbors, it replies back
n how many processes decide?
n how many messages are exchanged in the algorithm?
n is this a wave algorithm?
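A matching simulation sketch of the echo algorithm (again with my own naming). Running it on a triangle also answers the message-count question: two messages per channel, 2·|E| in total, and only the initiator decides.

```python
from collections import deque

def echo(neighbors, initiator):
    """Simulate Chang's echo algorithm; returns the total message count."""
    parent = {initiator: initiator}
    pending = {p: set(neighbors[p]) for p in neighbors}  # messages still awaited
    msgs = deque((initiator, q, 'token') for q in neighbors[initiator])
    count = 0
    while msgs:
        p, q, kind = msgs.popleft()
        count += 1
        if kind == 'token' and q not in parent:
            parent[q] = p                          # first token: adopt a parent
            for r in neighbors[q] - {p}:
                msgs.append((q, r, 'token'))       # forward to other neighbors
        pending[q].discard(p)                      # tokens and echoes both count
        if not pending[q]:
            if q == initiator:
                print(q, 'decides')                # only the initiator decides
            else:
                msgs.append((q, parent[q], 'echo'))  # reply to the parent
    return count

# Triangle: 3 channels, so 6 messages, and only 'a' decides.
print(echo({'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b'}}, 'a'))
```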
Traversal algorithms
F complexity measures
F tree terminology
F tarry’s algorithm
F classical algorithm
F awerbuch’s algorithm

What are traversal algorithms
n a traversal algorithm can be viewed as follows: there is a single token that “visits” all processes
n upon receiving the token, a process either sends out one message or decides
n we’ve already looked at one traversal algorithm — what is it?
Physical clocks
• A clock (also called a timer) is an electronic device that counts oscillations in a crystal at a particular frequency
• Count is typically divided and stored in a counter register
• Clock can be programmed to generate interrupts at regular intervals (e.g., at the time interval required by a CPU scheduler)
• Counter can be scaled to get the time of day
• This value can be used to timestamp an event on that computer
• Two events will have different timestamps only if the clock resolution is fine enough
• Often we need only the order of events, not the exact time of day at which they occurred, so this scaling is often not necessary

Using physical clocks
• Synchronize the local clocks to within some known accuracy, and then
u measure time relative to each local clock to determine the order between two events
• Well, there are some problems…
u It’s difficult to synchronize the clocks
u Crystal-based clocks tend to drift over time — count time at different rates, and diverge from each other
F Physical variations in the crystals, temperature
F For quartz crystal clocks, typical drift rate is about one second every 10^6 seconds ≈ 11.6 days
F Best atomic clocks have a drift rate of one second in 10^13 seconds ≈ 300,000 years
Cristian’s algorithm
• Send request to time server, measure time Dtrans taken to receive reply Tserver
• Set local time to Tserver + (Dtrans / 2)
u Improvement: make several requests, take the average Tserver value
• Assumptions:
u Network delay is fairly consistent
u Request & reply take equal amounts of time
• To offset variations in time delay, the client may average over several requests
• Problems:
u Doesn’t work if the time server fails
u Not secure against a malfunctioning time server, or a malicious impostor time server

Berkeley algorithm
• Choose a coordinator computer to act as the master
• Master periodically polls the slaves — the other computers whose clocks should be synchronized to the master
u Slaves send their clock values to the master
• Master observes transmission delays, and estimates their local clock times
u Master averages everyone’s clock times (including its own)
F Master takes a fault-tolerant average — it ignores readings from clocks that have drifted badly, or that have failed and are producing readings far outside the range of the other clocks
u Master sends to each slave the amount (positive or negative) by which it should adjust its clock
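A minimal sketch of one round of Cristian’s algorithm (function names are mine; query_server stands in for the actual request to the time server):

```python
import time

def cristian_offset(query_server):
    """One round of Cristian's algorithm: estimate the offset between the
    local clock and the server clock, assuming symmetric network delay."""
    t0 = time.time()
    t_server = query_server()          # server's time at some point in flight
    t1 = time.time()
    d_trans = t1 - t0                  # round-trip time
    estimate = t_server + d_trans / 2  # server time "now", per the algorithm
    return estimate - t1               # amount to adjust the local clock by

# Toy server that runs 2.5 seconds ahead of the local clock:
offset = cristian_offset(lambda: time.time() + 2.5)
print(f"adjust local clock by about {offset:.2f} s")
```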
Network Time Protocol (NTP)
• Provides time service on the Internet
• Hierarchical network of servers:
u Primary servers (100s) — connected directly to a time source
u Secondary servers (1000s) — connected to primary servers in hierarchical fashion
F ns.mcs.kent.edu runs a time server
u Servers at higher levels are presumed to be more accurate than those at lower levels
• Several synchronization modes:
u Multicast — for LANs, low accuracy
u Procedure call — similar to Cristian’s algorithm, higher accuracy (file servers)
u Symmetric mode — exchange detailed messages, maintain history
• All built on top of UDP (connectionless)

Compensating for clock drift in NTP
• Compare time Ts provided by the time server to time Tc at computer C
• If Ts > Tc (e.g., 9:07am vs 9:05am)
u Could advance C’s time to Ts
u May miss some clock ticks; probably OK (maybe not)
• If Ts < Tc (e.g., 9:07am vs 9:10am)
u Can’t roll back C’s time to Ts
F Many applications (e.g., make) assume that time always advances!
u Can cause C’s clock to run slowly until it resynchronizes with the time server
F Can’t change the clock oscillator rate, so have to change the software interpreting the clock’s counter register
F Tsoftware = a × Thardware + b
F Can determine constants a and b
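To see how the constants a and b can be chosen, here is a small sketch (all names are mine, and the “agree after N ticks” goal is one simple policy, not necessarily NTP’s): anchor the software clock at its current reading so it never jumps backward, and pick a rate that meets the server’s time after N more hardware ticks.

```python
def slew_constants(T_hw, T_c, T_s, N):
    """Pick a, b so that T_software = a*T_hardware + b reads T_c at the
    current hardware count T_hw (no backward jump) and agrees with the
    server's time after N more hardware ticks."""
    a = (T_s + N - T_c) / N       # runs slow (a < 1) when T_s < T_c
    b = T_c - a * T_hw            # anchor: software time is T_c right now
    return a, b

# Clock is 3 ticks fast; amortize the correction over 100 ticks:
a, b = slew_constants(T_hw=1000, T_c=510, T_s=507, N=100)
print(a, b)                        # a slightly below 1: time never jumps back
```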
Is It Enough to Synchronize Physical Clocks?
• In a distributed system, there is no common clock, so we have to:
u Use atomic clocks to minimize clock drift
u Synchronize with time servers that have UTC receivers, trying to compensate for unpredictable network delay
• Is this sufficient?
u Value received from a UTC receiver is only accurate to within 0.1–10 milliseconds
F At best, we can synchronize clocks to within 10–30 milliseconds of each other
F We have to synchronize frequently, to avoid local clock drift
u In 10 ms, a 100 MIPS machine can execute 1 million instructions
F Accurate enough as time-of-day, but is it accurate enough to order events?
u We will refer to these clocks as physical clocks, and say they measure global time

Logical time
n Idea — abandon the idea of physical time
u For many purposes, it is sufficient to know the order in which events occurred
u Lamport (1978) — introduce logical (virtual) time, synchronize logical clocks
u conditions
u implementation
u application
F birman-schiper-stephenson causal broadcast
F schiper-eggli-sandoz causal ordering of messages
Vector clocks
n It can be shown that ∀i, ∀k : Ci[i] ≥ Ck[i]
n Rules for comparing timestamps can also be established so that if ta < tb, then a → b
F Solves the problem with Lamport’s clocks
n example:
u Receive of e22 is updated by IR1 and IR2
u Receive of e13 tells P1 about P2 and P3 (P3’s clock is old, but better than nothing!)

Causal ordering of messages
n do not confuse with causal ordering of events
n causal ordering is useful, for example, for replicated databases
n two algorithms using VC:
u Birman-Schiper-Stephenson (BSS) causal ordering of broadcasts
F a message m from Pj is delivered at Pi when VTPi[k] ≥ VTm[k], ∀k ∈ {1,2,…,n} but j — that is, Pi has already received every message that Pj had received before sending m
F undelivered messages are stored for later delivery
F after delivery of m, VTPi is updated according to VC rule IR2 on the basis of VTm, and delayed messages are reevaluated
u Schiper-Eggli-Sandoz (SES) causal ordering of regular messages
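A minimal vector clock sketch (the rule names IR1/IR2 follow the slides; class and function names are mine):

```python
class VectorClock:
    def __init__(self, pid, n):
        self.pid, self.vt = pid, [0] * n

    def tick(self):                      # IR1: local event (including a send)
        self.vt[self.pid] += 1

    def merge(self, vt_m):               # IR2: on receive, componentwise max
        self.vt = [max(a, b) for a, b in zip(self.vt, vt_m)]
        self.tick()

def before(ta, tb):                      # ta < tb iff componentwise <= and !=
    return all(a <= b for a, b in zip(ta, tb)) and ta != tb

p0, p1 = VectorClock(0, 2), VectorClock(1, 2)
p0.tick()                                # p0 sends a message carrying its clock
p1.merge(p0.vt)                          # p1 receives it
print(before([1, 0], p1.vt))             # True: the send causally precedes
```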
u after delivery:
F insert entries from Vm into VP2 for every process P3 ≠ P2 if they are not there
F update the timestamp of the corresponding entry in VP2 otherwise
• the message is in transit if it was sent but not received
• the message is inconsistent if it was received but never sent
Cuts
n a cut C is a set of special (cut) events, one for each process: C = {c1, c2, …, cn}
n a non-cut event at process P is pre-cut if it precedes the cut event at this process, post-cut otherwise
n a cut is not consistent if a pre-cut event causally follows a post-cut event
n Theorem: a cut is consistent iff all cut events are concurrent
n assume virtual clocks are run; VTci is the timestamp of ci
n we define VTC, the timestamp of the cut C, componentwise: VTC[k] = max(VTc1[k], …, VTcn[k])
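A small check of the theorem with vector timestamps (names are mine): a cut is consistent iff no cut event’s timestamp happens-before another’s, and VTC is the componentwise maximum.

```python
def before(ta, tb):                       # vector-timestamp causality test
    return all(a <= b for a, b in zip(ta, tb)) and ta != tb

def cut_timestamp(cut_vts):               # VT_C: componentwise maximum
    return [max(col) for col in zip(*cut_vts)]

def consistent(cut_vts):                  # consistent iff all events concurrent
    return not any(before(a, b) or before(b, a)
                   for i, a in enumerate(cut_vts)
                   for b in cut_vts[i + 1:])

cut = [[2, 1, 0], [1, 2, 0], [1, 1, 3]]   # pairwise-concurrent cut events
print(cut_timestamp(cut), consistent(cut))  # [2, 2, 3] True
```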
Termination detection
n Dijkstra-Scholten termination detection algorithm for diffusing computations
n Shavit-Francez generalization to decentralized algorithms

n two kinds of algorithm termination:
u implicit (message termination) - no action can be executed and there are no messages in the channels, but the processes are not aware of termination
u explicit (process termination) - a terminal state with every process in a terminal state and no messages in the channels
n implicit termination is easier to design for, since the processes do not need to know that the algorithm terminated

n Objective - convert message-terminating algorithms into process-terminating ones.
n Achieved by adding two additional algorithms. The original algorithm is called the basic algorithm. The added algorithms perform two tasks:
u termination detection - recognize that the basic algorithm is in a message-terminating state
u termination announcement - make all processes go into a terminating state
n the added algorithms may exchange messages; these are called control messages
n diffusing computation (centralized algorithm) - an algorithm where only one process is active in every initial state
u this process is called the initiator

n an event can be internal or external (receipt of a message)
n each basic process is assumed to be in one of two states:
u active - an action of the process is enabled (able to execute)
u passive - no process actions are enabled
n the variable statep represents whether the process is active or passive
n the following assumptions are made about basic algorithms:
u an active process becomes passive only on an internal event
u a process always becomes active when a message is received
u internal events where a process becomes passive are the only internal events (we ignore active -> active events)
Termination announcement
n each process sends a 〈stop〉 message to every neighbor
u on a local call to Announce
n messages are sent at most once
n the algorithm works on directed and undirected networks, requires no identities, leader, or topological knowledge
n what happens when multiple nodes call Announce simultaneously?

Dijkstra-Scholten termination detection
n when process p sends a basic message 〈mes〉 it becomes the father of this message
n when 〈mes〉 is received by q:
u if q is not yet involved in the computation (does not have a father), it sets p to be its father; otherwise it acknowledges the message with a signal 〈sig〉
n each process p maintains a variable scp that counts the number of its sons:
u every time p sends 〈mes〉, scp is incremented
u every time p receives 〈sig〉, it is decremented
u when scp is 0 and p is passive, it sends 〈sig〉 to its father
n the initiator calls Announce when p0 ∉ VT (which implies that T is empty)
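A compact simulation sketch of the Dijkstra-Scholten bookkeeping (class and method names are mine; process 0 is the initiator and acts as its own father):

```python
class DS:
    """Dijkstra-Scholten son-counting, simulated synchronously (a sketch)."""
    def __init__(self, n):
        self.father = [None] * n
        self.father[0] = 0                     # initiator is its own father
        self.sc = [0] * n                      # number of sons per process
        self.active = [False] * n
        self.active[0] = True

    def basic_message(self, p, q):             # p sends a basic message to q
        self.sc[p] += 1
        self.active[q] = True
        if self.father[q] is None:
            self.father[q] = p                 # q joins the tree under p
        else:
            self.signal(q, p)                  # already in the tree: ack now

    def signal(self, q, p):                    # q sends <sig> to p
        self.sc[p] -= 1
        self.try_detach(p)

    def go_passive(self, p):
        self.active[p] = False
        self.try_detach(p)

    def try_detach(self, p):                   # leave the tree when sc=0, passive
        if self.sc[p] == 0 and not self.active[p] and self.father[p] is not None:
            f, self.father[p] = self.father[p], None
            if f == p:
                print("initiator detects termination")  # tree T is empty
            else:
                self.signal(p, f)

ds = DS(3)
ds.basic_message(0, 1); ds.basic_message(1, 2)
ds.go_passive(2); ds.go_passive(1); ds.go_passive(0)
```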
Shavit-Francez generalization to decentralized algorithms
[figure: termination detection events — Sp: basic send; Rp: basic receive; Ip: change from active to passive; Ap: arrival of a signal]
n the Dijkstra-Scholten algorithm works only on diffusing computations (one initiator)
n Shavit-Francez suggested a generalization to decentralized algorithms (multiple initiators)
n in their algorithm, each initiator maintains a computation tree similar to Dijkstra-Scholten
n problem - when the tree of one initiator collapses, the initiator does not know if the computation terminated - there may still be other trees
n solution - all processes participate in a wave
u a non-initiator process continues the wave
u an initiator process continues the wave only if its tree has collapsed
Distributed mutual exclusion (DMX)
n in shared-memory systems, mutual exclusion is built from synchronization primitives (locks/condition variables, semaphores); in DMX we do not have access to shared memory or a clock
n measuring the performance of a DMX algorithm
n correctness properties:
u liveness - if a process requests access to the shared resource, it eventually gets it
n Token-based algorithms - a unique token (privilege) is circulated in the system. A process possessing the token can enter the CS
u Suzuki-Kasami
Ricart and Agrawala’s algorithm (1981)
n an optimization of Lamport’s – no releases (merged with replies)
n Requesting the critical section (CS):
u When a process wants to enter the CS, it:
F Sends a timestamped request to all OTHER processes
u When a process receives a request:
F If it is neither requesting nor executing the CS, it returns a reply (not timestamped)
F If it is requesting the CS, but the timestamp on the incoming request is smaller than the timestamp on its own request, it returns a reply; otherwise, it defers the reply
n Releasing the CS:
u When a process leaves the CS, it:
F Sends a reply message to all the deferred requests
F (the process with the next earliest request will now receive its last reply message and enter the CS)
n Evaluation:
u message complexity - 2(N–1)
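A compact single-machine simulation of Ricart-Agrawala (class and method names are mine; message delivery is immediate, and Lamport-clock ties are broken by process id):

```python
class RicartAgrawala:
    """One node's logic, simulated in-process (a sketch)."""
    def __init__(self, pid):
        self.pid, self.peers = pid, {}           # peers: pid -> other nodes
        self.clock, self.my_ts = 0, None
        self.replies, self.deferred = set(), []

    def request_cs(self):
        self.clock += 1
        self.my_ts = (self.clock, self.pid)      # timestamp of my request
        for q in self.peers.values():
            q.on_request(self.pid, self.my_ts)   # send to all OTHER processes

    def on_request(self, frm, ts):
        self.clock = max(self.clock, ts[0]) + 1
        if self.my_ts is not None and self.my_ts < ts:
            self.deferred.append(frm)            # my request is earlier: defer
        else:
            self.peers[frm].on_reply(self.pid)   # not competing, or they win

    def on_reply(self, frm):
        self.replies.add(frm)
        if len(self.replies) == len(self.peers):
            print(self.pid, "enters CS")         # all N-1 replies collected

    def release_cs(self):
        self.my_ts, self.replies = None, set()
        deferred, self.deferred = self.deferred, []
        for frm in deferred:
            self.peers[frm].on_reply(self.pid)   # answer the deferred requests

nodes = {i: RicartAgrawala(i) for i in range(3)}
for node in nodes.values():
    node.peers = {j: n for j, n in nodes.items() if j != node.pid}
nodes[0].request_cs()                            # 0 enters the CS immediately
nodes[1].request_cs()                            # deferred by 0
nodes[0].release_cs()                            # now 1 enters the CS
```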
Maekawa’s algorithm
n observation - a process does not have to send a message to all other processes to lock them
n every process Pi is assigned a request set Ri (quorum) of processes
u Pi is in Ri
u for any two processes Pi and Pj, Ri ∩ Rj ≠ ∅
n Entering the CS: a process sends a request message to the processes in its quorum
u a process has just one permission to give; if a process receives a request, it sends back a reply unless it has granted permission to another process, in which case the request is queued
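One standard way to build such pairwise-intersecting request sets, not spelled out on the slides, is a √N × √N grid: Ri is Pi’s row plus its column, so any two quorums share at least one process. A sketch:

```python
import math

def grid_quorum(i, n):
    """Request set for process i among n processes arranged in a square
    grid: its whole row plus its whole column."""
    side = math.isqrt(n)
    assert side * side == n, "sketch assumes n is a perfect square"
    row, col = divmod(i, side)
    row_members = {row * side + c for c in range(side)}
    col_members = {r * side + col for r in range(side)}
    return row_members | col_members

# With 9 processes, quorums have 2*sqrt(9)-1 = 5 members and pairwise intersect:
print(grid_quorum(0, 9) & grid_quorum(8, 9))   # non-empty intersection
```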
Maekawa’s algorithm, deadlock possibility
n Since processes do not communicate with all other processes in the system, CS requests may be granted out of timestamp order
n example:
u suppose there are processes Pi, Pj, and Pk such that: Pj ∈ Ri and Pj ∈ Rk, but Pk ∉ Ri and Pi ∉ Rk
u Pi and Pk request the CS such that tsk < tsi
u if the request from Pi reaches Pj first, then Pj sends a reply to Pi, and Pk has to wait for Pi out of timestamp order
u a wait-for cycle (hence a deadlock) may be formed

Maekawa’s algorithm, deadlock avoidance
n To avoid deadlock, a process recalls permission if it was granted out of timestamp order
u if Pj receives a request from Pi with a higher timestamp than the request it granted permission to, Pj sends failed to Pi
u if Pj receives a request from Pi with a lower timestamp than the request it granted permission to (deadlock possibility), Pj sends inquire to the process it granted permission to
u when a process receives inquire, it replies with yield if it did not succeed in getting permissions from the other processes
F either it got a failed, or it sent a yield and did not get a reply
Path-pushing deadlock detection
• a site sends its local WFG to other sites; to decrease network traffic the message is sent only when Pex1 > Pex2
• assumption: the identifier of a process spanning the sites is the same!
F If site Sj receives such a message, it updates its local WFG graph, and reevaluates the graph (possibly pushing a path again)
n Can report a false deadlock
u every such site sends messages to all other sites, thus
F n(n–1)/2 messages to detect deadlock, for n sites

Edge-chasing (probe-based) deadlock detection
n When a blocked process receives a probe, it propagates the probe to the process(es) holding the resources that it has requested
u the ID of the blocked process stays the same, the other two values are updated as appropriate
F (unclear why the latter two identifiers are necessary)
u If the blocked process receives its own probe, there is a deadlock
n size of a message is O(1)
u size of a message is 3 integers
• m processes, n sites

Hierarchical deadlock detection
n Interior controllers are responsible for deadlock detection
u Each maintains a global WFG that is the union of the WFGs of the controllers below it
Ho and Ramamoorthy’s hierarchical deadlock detection
n Sites are grouped into disjoint clusters
n Periodically, a site is chosen as a central control site
u The central control site chooses a control site for each cluster
n A control site collects status tables from its cluster, and uses the Ho and Ramamoorthy one-phase centralized deadlock detection algorithm

Estimating performance of deadlock detection algorithms
n Usually measured as the number of messages exchanged to detect deadlock
u Deceptive, since messages are also exchanged when there is no deadlock
u Doesn’t account for the size of the messages
Deadlock resolution
n resolution – aborting at least one process (victim) in the cycle and granting its resources to others
n efficiency issues of deadlock resolution
u fast – after deadlock is detected, the victim should be quickly selected
u minimal – abort the minimum number of processes; ideally abort the less “expensive” processes (with respect to completed computation, consumed resources, etc.)
u complete – after the victim is aborted, info about it is quickly removed from the system (no phantom deadlocks)
u no starvation – avoid repeated aborting of the same process
n problems
u the detecting process may not know enough info about the victim (propagating enough info makes detection expensive)
u multiple sites may simultaneously detect deadlock
u cache modification
u cache validation
[figure: clients with local caches and file servers; an NFS server exports /export (etc, usr, bin, nfs) and clients mount the remote directories (people, students, profs, users)]

n session semantics
u session - a series of file accesses made between open and close operations
u changes made to the file are visible only to the client process (possibly to processes on the same client)
u the changes are visible only to sessions that open after the close

• Upload/download model
u Must move the entire file
u Needs local disk space
F What happens when the user modifies the file? Does each cached copy change? Does the original file change?
F Is the cached copy out of date?
– If another client has the file open, discard it when its …

• Stateless server
F Connectionless (open and close are implied)
F clients resend requests until it recovers
u No server optimizations like above
u File operations must be idempotent
u Significant overhead at the server

• UNIX semantics — writes are immediately visible to others
u Clients specify the type of access they want when they open a file, so if two clients open the same file for writing, that file is not cached
Distributed shared memory
u motivation and the main idea
u consistency models
F strict and sequential
F causal
F PRAM and processor
F weak and release
u implementation of sequential consistency
u implementation issues
F granularity
F thrashing
F page replacement

DSM idea
n all computers share a single paged, virtual address space
n pages can be physically located on any computer
n when a process accesses data in the shared address space, a mapping manager maps the request to the physical page
n mapping manager – kernel or runtime library
n if the page is remote – block the process and fetch it
PRAM and processor consistency
n PRAM (Lipton & Sandberg 1988)
u All processes see memory writes done by a single process in the same (correct) order
u PRAM = pipelined RAM
F Writes done by a single process can be pipelined; it doesn’t have to wait for one to finish before starting another
F writes by different processes may be seen in a different order by a third process
u Easy to implement — order writes on each processor independently of all others
n Processor consistency (Goodman 1989)
u PRAM +
u coherency on the same data item - all processes agree on the order of write operations to the same data item

Weak and release consistency
n Weak consistency (Dubois 1988)
u Consistency need only apply to a group of memory accesses rather than individual memory accesses
u Use synchronization variables to make all memory changes visible to all other processes (e.g., exiting a critical section)
F all accesses to synchronization variables must be sequentially consistent
F write operations are completed before access to a synchvar
F access to non-synchvars is allowed only after the synchvar access is completed
n Release consistency (Gharachorloo 1990)
u two synchronization vars
F acquire - all changes to synchronized vars are propagated to the process
F release - all changes to synchronized vars are propagated to other processes
F the programmer has to write accesses to these variables
n Nonreplicated, nonmigrating pages
u All requests for the page have to be sent to the owner of the page
u Easy to enforce sequential consistency — the owner orders all access requests
u little concurrency
n Nonreplicated, migrating pages (most common approach)
u All requests for the page have to be sent to the owner of the page
u Each time a remote page is accessed, it migrates to the processor that accessed it
n Replicated, migrating pages
u Each time a remote page is accessed, it’s copied to the processor that accessed it
u Multiple read operations can be done concurrently
u invalidate or update other copies of the page during a write operation
n Replicated, nonmigrating pages
u Replicated at fixed locations
u All requests to the page have to be sent to one of the owners
Page replacement
n What to do when local memory is full?
u swap on disk?
Address space transfer
n Process state (a few kilobytes):
u contents of registers, program counter, I/O buffers, interrupt signals, etc.
n Address space (several megabytes) - dominates:
u program’s code, data, and stack
n Several approaches to address space transfer:
u total freezing - no execution is done while the address space is transferred - simplest, slowest
u pretransferring - the address space is transferred while the process is still running on the source node; after the transfer, the modified pages are picked up
u transfer on reference - the process is restarted before the address space is migrated - the pages are fetched from the source node as the process needs them

Message forwarding
n Three types of messages:
1. received when the process execution is stopped on the source node and has not restarted on the destination node
2. received on the source node after the execution started on the destination node
3. sent to the migrant process after it started execution on the destination node
n approaches:
u re-sending - messages of types 1 and 2 are either dropped or negatively ack-ed; the sender is notified and needs to locate the migrant process - nontransparent
u origin site - the origin node keeps the info on the current location of the process created there; all messages are sent to the origin, which forwards them to the migrant process - expensive, not fault tolerant
u forwarding address - a forwarding address (link) is left on the source node to redirect messages of types 2 and 3; the link contains the system-wide unique id of the process and its last known location - may not be efficient or fault tolerant
u link update - during the transfer, the source node sends a notification (link update) of the transfer to all the nodes with which the process communicates:
F messages of types 1 and 2 are forwarded by the source node

Handling co-processes:
u migrate children together with the process
F logical host concept - co-processes are always executed on one logical host, and the logical host is migrated atomically
n home node (origin site):
u all communication between co-processes is handled through the home node - expensive
Introduction to cryptography
n Main concepts
n Design principles
n Cryptosystems
u conventional
F Caesar’s cipher
F Simple substitution
u modern
F symmetric: DES
F asymmetric: RSA
n Authentication
u one way
u two way
n Two-way authentication and secure channel setup
u symmetric cryptosystems
u asymmetric cryptosystems

Cryptography, main concepts
n P clear (plain) text, message - readable (intelligible) information
n C ciphertext - encrypted information
n E encryption (enciphering) - transforming clear text into ciphertext
n D decryption (deciphering) - transforming ciphertext back into the original clear text
n encryption algorithm – a mathematical function of the following form:
C = E(P, Ke), where Ke is the encryption key
n decryption algorithm:
P = D(C, Kd), where Kd is the decryption key
n attacker (cryptanalyst, intruder) - a person who tries to discover P (compromise the encryption algorithm)
n two entities (users, programs) A and B need to communicate
u if A has Ke and B has a matching Kd, A and B have a one-way private secure communication channel
u if also B has Ke and A has a matching Kd, A and B have a two-way secure communication channel
n simple substitution cipher: an alphabet can be mapped to any permutation of letters
u each permutation is a key - there are 26! (> 10^26) keys. Exhaustive search is very expensive.
n substitution preserves the frequency distribution of the letters of the alphabet - statistical analysis is possible.

n asymmetric - Ke and Kd are dissimilar. It is (computationally) hard to derive Kd from Ke. Ke does not need to be kept secret.
u computationally expensive and cannot be used for bulk data encryption
u can use an insecure channel for both key and message transmission
u can encrypt with the public key and sign (digital signature) with the private key
DES
n Data Encryption Standard
u developed by IBM, widely used
u symmetric
n Used to encrypt 64-bit data blocks with a 56-bit key; the key is expanded to 64 bits for error correction
n Encryption algorithm:
u Initial permutation
u 16 identical iterations with a 48-bit key Ki derived from the encryption key:
F the 32-bit right side is expanded to 48 bits by duplicating some bits
F these bits are x-or-ed with Ki
F the output is shrunk to 32 bits
u Inverse of the initial permutation
n Decryption is encryption in reverse
n Finding the key requires exhaustive search over 2^56 values

RSA
n Invented by Rivest, Shamir, and Adleman
n Asymmetric
n Encryption (public) key is the pair (e, n)
n Decryption (private) key is the pair (d, n)
n Computing the keys:
u n = p×q, where p and q are large primes
u Pick d such that GCD(d, (p−1)×(q−1)) = 1, i.e., d and (p−1)×(q−1) are relatively prime
u Pick e such that e×d ≡ 1 (mod (p−1)×(q−1))
n Even though e and n are public, to determine d an intruder needs to factor n into primes; if n is large (say 200 digits), factorization can be done by exhaustive search only
n example: if (p−1)×(q−1) = 20 (e.g., p = 3, q = 11), then e = 7 and d = 23 work, since 23 × 7 = 161 ≡ 1 (mod 20)
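A runnable toy version of the example above (p = 3, q = 11; real RSA uses enormous primes, this only checks the arithmetic):

```python
from math import gcd

p, q = 3, 11
n, phi = p * q, (p - 1) * (q - 1)          # n = 33, phi = 20
d = 23
assert gcd(d, phi) == 1                     # d relatively prime to phi
e = pow(d, -1, phi)                         # e*d = 1 (mod phi) -> e = 7

msg = 4                                     # must be < n
cipher = pow(msg, e, n)                     # encrypt: C = P^e mod n
plain = pow(cipher, d, n)                   # decrypt: P = C^d mod n
print(e, cipher, plain)                     # 7 16 4
```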
Symmetric systems, two-way authentication and channel setup
[figure: A obtains a key from the KDC (m1, m2), then talks to B (m3, m4, m5)]
n two phases for authentication:
u obtaining a shared (conversation) key
u communicating the conversation key
n Obtaining the conversation key:
u m1 = (Ra, IDa, IDb) - where Ra – id of the request (different every time), IDa - id of process A, IDb - id of process B
u m2 = E((Ra, IDb, Kab, C1), Ka) - where Kab – conversation key, C1 = E((Kab, IDa), Kb), Ka - private key of A
n Communicating the conversation key:
F m3 = C1, where C1 = E((Kab, IDa), Kb)
u key communicated
u problem: an intruder can play back A’s message to B, forcing it to reuse the conversation key
u solution:
F m4 = C2 = E(Nr, Kab), where Nr is a nonce (never repeating number)
u handshake

Asymmetric systems, two-way authentication and channel setup
u m2 = Esdc(PB, B)
n handshake:
u m3 = Epb(Na, A)
u m4 = Epa(Na, Nb)
u m5 = Epb(Nb)
Kerberos
n Kerberos is a network authentication system
n developed at MIT in the late eighties
n features:
u authenticates users on an untrusted network
u clear passwords are never sent over the network
u communication between machines is encrypted - multiple encryption algorithms are available (the algorithms may be automatically selected)

Secure shell (SSH)
u authentication of users on the untrusted network
u the server has a host key; a server key-pair is generated when a client contacts the server
u when a client contacts a server, the server sends the server and host public keys
u the client stores the host’s public key between sessions. The client …
Clusters
§ Distributed system def. review
§ Cluster definition
§ Clusters vs. distributed systems
§ Cluster example 1 – reliable file service

What is a distributed system (again)
§ “True” Distributed Operating System
u Loosely-coupled hardware
• No shared memory, but provides the “feel” of a single memory
u Tightly-coupled software
• One single OS, or at least the feel of one
u Machines are somewhat, but not completely, autonomous
[figure: a network connecting machines M4/P4 and M5/P5, with Printer4 and Disk5]

§ security
u [C] - nodes trust each other
u [D] - nodes do not trust each other
§ typical cluster tasks:
u batch processing
u database
High-availability and disaster recovery
§ Dependability concepts:
u fault-tolerance, high-availability
§ High-availability classification
§ Types of outages
§ Failover
u Replication

Clusters
§ a subclass of distributed systems
§ a small-scale, (mostly) homogeneous (the same hardware and OS) array of computers (usually located in one site), dedicated to a small number of well-defined tasks, in solving of which the cluster acts as one single whole
§ typical tasks for “classic” distributed systems:
u file services from/to distributed machines over a (college) campus
Dependability concepts
§ two aspects of dependability:
u reliability – the probability that the system survives till a certain time t
• mean time to failure MTTF (expected life)
u availability - the probability that the system operates correctly at a given point in time
• mean time to repair MTTR – speed of repairing
• availability = MTTF / (MTTF + MTTR)
§ can a system be available but not reliable?
§ does higher reliability improve the system’s availability?
§ what kind of systems need to be available? reliable?

High-availability scale
§ a system’s availability is usually expressed as a percentile of uptime, or as a class
§ a system is classified by the amount of downtime it allows

availability   total accumulated outage per year   class (# of 9s)
90%            more than a month                   0/1
99%            under 4 days                        1/2
99.9%          under 9 hours                       2/3
99.99%         about 1 hour                        3/4
99.999%        over 5 minutes                      4/5
99.9999%       about half a minute                 5/6
99.99999%      about 3 seconds                     6

u class 1 - campus networks
u class 5 - telephone switches
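A quick computation connecting the formula to the table (a sketch; the 1000-hour MTTF and 1-hour MTTR are invented numbers): an availability of MTTF/(MTTF + MTTR) ≈ 99.9% corresponds to just under nine hours of accumulated outage per year, the class 2/3 row.

```python
def outage_hours_per_year(availability):
    """Accumulated downtime per year implied by an availability fraction."""
    return (1 - availability) * 365 * 24

mttf, mttr = 1000.0, 1.0                 # hours between failures, hours to repair
a = mttf / (mttf + mttr)                 # availability = MTTF / (MTTF + MTTR)
print(f"{a:.4%} available, about {outage_hours_per_year(a):.1f} h down per year")
```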
Scalability terminology (Bill Devlin, Jim Gray, Bill Laing, George Spix, Microsoft Research, Dec. 1999)
u Clone (RACS)
u Partition (RAPS)
u Geoplex
Clone
§ A replica of a server or a service
§ Allows load balancing
§ External to the clones:
u an IP sprayer (like Cisco LocalDirector™) dispatches (sprays) requests to different nodes in the clone to achieve load-balancing
§ Internal to the clones:
u an IP sieve (like Network Load Balancing in Windows 2000)
u every request arrives at every node in the clone; each node intelligently accepts a part of these requests

RACS
§ RACS (Reliable Array of Cloned Services) – a collection of clones for a particular service
§ two types:
u Shared-nothing RACS – each node duplicates all the storage locally
u Shared-disk RACS
RACS advantages
§ scalable – a good way to add processing power, network bandwidth, and storage bandwidth to a farm
§ available
u nodes can act as backup for one another: one node fails, other nodes continue to offer service (probably with degraded performance)
u failures can be masked if node- and application-failure detection mechanisms are integrated with the load-balancing system or with client applications
§ easy to manage – administrative operations on one service instance at one node can be replicated to all others

Problems with RACS
§ Shared-nothing RACS
u not a good way to grow storage capacity: updates at one node must be applied to all other nodes’ storage
u problematic for write-intensive services: all clones must perform all writes (no throughput improvement) and need subtle coordination
§ Shared-disk RACS
u the storage server should be fault-tolerant for availability (only one copy of data)
u still requires subtle algorithms to manage updates (such as cache validation, lock managers, transaction logs, etc.)
Partitions and Packs
§ Partition – service is grown by dividing data among nodes
u Only one copy of the data in each partition – availability is not improved
§ Pack – each partition is implemented by a set of servers
u For update-intensive and large database applications
u data-tier: SQL (database) servers (update …

How to partition, RAPS
§ typically, the application middleware partitions the data and workload by object:
u Mail servers partition by mailboxes
u Sales systems partition by customer accounts or product lines
§ challenges
u When a partition (node) is added, the data should be automatically repartitioned among the nodes to balance the storage and computational load.
u The partitioning should automatically adapt as new data is added
Summary
§ scalability technique:
u replicate a service at many nodes
§ against disaster:
u the entire farm is replicated to form a geoplex
Fault tolerance in distributed systems
n Motivation
n robust and stabilizing algorithms
n failure models
n robust algorithms
u decision problems
u impossibility of consensus in asynchronous networks with crash-failures
u consensus and agreement with initially-dead processes - knot calculation algorithm
n stabilization
u Dijkstra’s K-state algorithm

Why fault tolerance
n Distributed systems encompass more and more individual devices
n the chance of failure in a distributed system can grow arbitrarily large when the number of its components increases
n distributed systems can hardly be restarted after failure
n distributed systems are subject to the partial failure property: when one of the components fails, the system may still be able to function at a decreased capacity
n as the system grows in size:
u it becomes more likely that one component fails
u it becomes less likely that the failure occurs in all components
n thus, systems able to deal with failures are attractive