
UNIVERSITY OF MASSACHUSETTS
Dept. of Electrical & Computer Engineering

Fault Tolerant Computing
ECE 655
Part 6: Networks - 1

C. M. Krishna
Fall 2006

ECE655/Krishna Part.6 .1

Copyright 2004 Koren & Krishna

Fault-Tolerant Networks
Multiple paths connecting the source to the destination of a message
Spare nodes that can be switched in to replace failed units
Fault-tolerant topologies
  Extra-Stage Multi-Stage Networks
  Interstitial Mesh
  Redundant Crossbar
  Hypercube
  Point-to-Point Networks


Multi-Stage Network
Non-fault-tolerant multi-stage network (butterfly network) - typically built out of 2x2 switches - two inputs and two outputs
Switch has four settings
  S - Straight - top input line connected to top output and bottom input line to bottom output
  C - Cross - top input line connected to the bottom output and bottom input line to top output
  UB - Upper Broadcast - top input line connected to both output lines
  LB - Lower Broadcast - bottom input line connected to both output lines
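The four settings can be sketched as a small function. This is only an illustration: the names `Setting` and `switch_2x2` are made up here and do not come from the slides.

```python
from enum import Enum


class Setting(Enum):
    """The four settings of a 2x2 switchbox (names are illustrative)."""
    STRAIGHT = "S"
    CROSS = "C"
    UPPER_BROADCAST = "UB"
    LOWER_BROADCAST = "LB"


def switch_2x2(setting, top_in, bottom_in):
    """Return (top_out, bottom_out) for the given setting."""
    if setting is Setting.STRAIGHT:
        return top_in, bottom_in
    if setting is Setting.CROSS:
        return bottom_in, top_in
    if setting is Setting.UPPER_BROADCAST:
        return top_in, top_in          # top input drives both outputs
    if setting is Setting.LOWER_BROADCAST:
        return bottom_in, bottom_in    # bottom input drives both outputs
    raise ValueError(f"unknown setting: {setting!r}")
```

For example, `switch_2x2(Setting.CROSS, "a", "b")` returns `("b", "a")`.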



Butterfly Network
k-stage network ($k \ge 3$)
  $2^k$ inputs and $2^k$ outputs
  k stages of $2^{k-1}$ switches each
  Connections follow a recursive pattern from input to output
  Input stage - top output line of each switchbox connected to the input lines of a $2^{k-1} \times 2^{k-1}$ butterfly, and the bottom output line of each switchbox connected to the input lines of another $2^{k-1} \times 2^{k-1}$ butterfly
2-stage butterfly
  Input stage - top output line of each of its two switchboxes connected to one 2x2 switchbox and the bottom output line to another 2x2 switchbox
1-stage butterfly
  Single 2x2 switchbox



Butterfly Network - Details

A switchbox in stage i has lines numbered $2^i$ apart
Output line j of every stage goes into input line j of the following stage ($j = 0, \ldots, 2^k - 1$)
Numbers in any box other than at the output stage are both of the same (even or odd) parity
Butterfly is not fault-tolerant: there is only one path from any given input to a specific output
If a switchbox in stage i fails - $2^{k-i}$ inputs will no longer be connected to $2^{i+1}$ outputs
The system can still operate but in a degraded mode
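The single-path property and the effect of one failed switchbox can be checked by brute force. The sketch below assumes that stages are traversed from stage k-1 at the input down to stage 0 at the output, and that a switchbox is identified by its stage and its lower-numbered line. The function names are illustrative, not from the slides.

```python
def butterfly_path(src, dst, k):
    """Switchboxes visited from input `src` to output `dst`.

    Assumed numbering: stage k-1 is the input stage, stage 0 the output
    stage; the two lines of a stage-i box differ by 2**i, and the box is
    identified as (stage, lower-numbered line).
    """
    line, boxes = src, []
    for stage in range(k - 1, -1, -1):
        bit = 1 << stage
        boxes.append((stage, line & ~bit))
        # The switch setting selects the line whose bit `stage` matches dst.
        line = (line & ~bit) | (dst & bit)
    assert line == dst
    return boxes


def cut_off(failed_box, k):
    """Inputs and outputs whose only path runs through `failed_box`."""
    n = 1 << k
    pairs = [(s, d) for s in range(n) for d in range(n)
             if failed_box in butterfly_path(s, d, k)]
    return {s for s, _ in pairs}, {d for _, d in pairs}


# Stage-2 box carrying lines 0 and 4 in an 8x8 (k = 3) butterfly.
inputs, outputs = cut_off((2, 0), k=3)
print(sorted(inputs), len(outputs))
```

For the stage-2 box carrying lines 0 and 4, the affected inputs are 0 and 4 and all 8 outputs are affected, in line with the $2^{k-i}$ and $2^{i+1}$ counts.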

Extra-Stage Networks - Fault Tolerant

Extra stage - duplicating stage 0 at the input
Bypass multiplexors around switchboxes at the input and output stages - a failed switch can be bypassed by routing around it
Examples:
  Stage-0 switchbox carrying lines 2,3 fails - duplicated by the extra stage - failed box is bypassed by the multiplexor
  Switchbox in stage-2 carrying lines 0,4 fails - extra stage is set so that input line 0 is switched to output line 1 and input line 4 to output line 5 - bypassing the failed switchbox
Exercise - Network can remain connected despite the failure of up to one switchbox anywhere in the system
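The single-failure claim can be explored with an exhaustive check on a simplified model. The model is an assumption, not taken from the slides: the extra stage pairs lines j and j+1, a bypassed box passes each line straight through, and only the extra stage and stage 0 have bypass multiplexors. All names are hypothetical.

```python
def connected(src, dst, k, failed=None):
    """True if `src` can still reach `dst` when switchbox `failed` is down.

    Boxes are ("extra", low_line) for the extra input stage, or
    (stage, low_line) for butterfly stages k-1, ..., 0.
    """
    for flip in (0, 1):                      # extra stage: straight or cross
        if flip and failed == ("extra", src & ~1):
            continue                         # a bypassed extra box cannot cross
        line, ok = src ^ flip, True
        for stage in range(k - 1, -1, -1):
            bit = 1 << stage
            if failed == (stage, line & ~bit):
                if stage == 0 and (line & 1) == (dst & 1):
                    continue                 # bypass: line already equals dst
                ok = False
                break
            line = (line & ~bit) | (dst & bit)
        if ok and line == dst:
            return True
    return False


def survives_any_single_failure(k):
    """Check every single switchbox failure against every (src, dst) pair."""
    n = 1 << k
    boxes = [("extra", b) for b in range(0, n, 2)]
    boxes += [(s, b) for s in range(k) for b in range(n) if not b & (1 << s)]
    return all(connected(s, d, k, f)
               for f in boxes for s in range(n) for d in range(n))
```

Under these assumptions the two settings of the extra stage enter the butterfly on lines that differ in bit 0, so the two candidate paths share no switchbox in stages k-1 down to 1.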


Dependability Measures of a Multi-Stage Network


Multi-stage interconnection network - connects N processors to N memory units in a shared memory architecture ($N = 2^k$)
In the presence of faulty elements - the system can operate - possibly in a degraded mode
System's resilience as it degrades can be measured
Resilience Measures:
  Bandwidth
  Average number of operational paths
  Metrics of connectivity among processors and memories
All measures are a function of time t - assuming faults occur and are possibly repaired during [0,t]

Definition of Measures

Bandwidth BW(t) - expected number of processors, at time t, which are communicating with some memory
Connectivity Q(t) - expected number, at time t, of operational processor-to-memory paths - an operational path includes a processor, memory, and the links connecting them - all fault-free
A processor (memory) is accessible (at time t) if it is fault-free and is connected to at least one fault-free memory (processor)
Additional connectivity measure - tuple $(A_r(t), A_m(t))$ - expected number at time t of accessible processors (memories)
Another measure - $(N_r(t), N_m(t))$ - expected number at time t of fault-free processors (memories) to which an accessible memory (processor) is connected

Shortcomings of Measures
BW(t) - depends not only on network conditions but also on the memory requirements of processors
Q(t) - number of paths does not indicate how many distinct processors and memories are still accessible
$(A_r(t), A_m(t))$ - does not imply that a fully connected fault-free $A_r(t) \times A_m(t)$ interconnection network exists; does not indicate how many fault-free memories are connected, on the average, to an accessible processor
Combining Q(t) and $(A_r(t), A_m(t))$ - a more complete characterization of system
  $N_m(t) = Q(t) / A_r(t)$ ; $N_r(t) = Q(t) / A_m(t)$
$(N_r(t), N_m(t))$ - an upper bound on the expected maximal fully-connected operational system at time t

Dependability Analysis
Assumption - mean time between component failures (and possible repairs) is much larger than the average length of the communication period between a processor and a memory
Status (faulty or fault-free) of system components is constant for a large enough period of time - allowing the analysis of the system's behavior under a statistical steady-state
System is observed at some arbitrary time t - fixed throughout the analysis
All measures are functions of t - through $p_r(t)$, $p_m(t)$, $p_l(t)$ - the probabilities that a processor, a memory, a link is fault-free at time t
t is omitted from notations for simplicity - $p_r$, $p_m$, $p_l$
$p_q$ - probability that a processor has a request for a memory connection

Bandwidth Analysis
Bandwidth BW - expected number of processors actively communicating with some memory
Simplifying assumption - destinations of memory requests by processors are statistically independent and uniformly distributed among the N memories
Network bandwidth - product of the number of memories N and the probability that a given memory (say memory 0) is non-faulty and has a request at its input
This probability is calculated iteratively - following a path leading to this memory
Probability of a request on an output link of a switch - calculated from the probability that a request has been accepted at the input links to this switch

Bandwidth Calculation
A link is in state X=1 (X=0) if it has (does not have) a request for the specific memory - a faulty link is in state X=0
Assigning numbers to the k+1 stages ($k = \log_2 N$)
  Stage 0 is the last stage - its output links are connected to the memories
  Stage k is the first - the outputs of the processors
$X_i, Y_i$ - outputs of a switch in stage i
$X_{i+1}, Y_{i+1}$ - inputs of a switch in stage i - outputs of two different switches in stage i+1
(Figure: network with stages labeled 3, 2, 1, 0)

Bandwidth - Non Redundant Network


Memory requests are uniformly distributed among the memories - an incoming request will be routed to any output link with equal probability (0.5)
It is sufficient to consider only a single output link - calculating $P(X_i = 1)$
A memory module request can reach the output link of a switch through any of the two input links
  $P(X_i = 1) = \sum_{u,v \in \{0,1\}} P(X_i = 1 \mid X_{i+1} = u, Y_{i+1} = v) \, P(X_{i+1} = u, Y_{i+1} = v)$
  $= 0 \cdot P(X_{i+1} = 0, Y_{i+1} = 0) + \frac{1}{2} p_l \, P(X_{i+1} = 0, Y_{i+1} = 1) + \frac{1}{2} p_l \, P(X_{i+1} = 1, Y_{i+1} = 0) + (p_l - \frac{1}{4} p_l^2) \, P(X_{i+1} = 1, Y_{i+1} = 1)$
Only input link faults are taken into account - faults at the output links are considered as input link faults at the next stage

Bandwidth - Cont.
Inputs into switch are statistically independent
  $P(X_i = u, Y_i = v) = P(X_i = u) \, P(Y_i = v)$
  $P(Y_i = 0) = P(X_i = 0) = 1 - P(X_i = 1)$
After some algebraic manipulations
  $P(X_i = 1) = p_l \, P(X_{i+1} = 1) - 0.25 \, p_l^2 \, P(X_{i+1} = 1)^2$
For the processors' output links
  $P(X_k = 1) = p_q \, p_r$
$P(X_0 = 1)$ is calculated recursively
The memory and its input link can be faulty
  Probability that memory 0 is non-faulty and has a request at its input - $P(X_0 = 1) \, p_l \, p_m$
  $BW = N \, P(X_0 = 1) \, p_l \, p_m$
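The recursion is easy to evaluate numerically. The sketch below also compares the closed-form step with the explicit sum over the four input states, assuming both inputs of a switch have the same request probability. Function names and the parameter values in the usage lines are illustrative.

```python
import math


def step_enumerated(p_next, p_l):
    """P(X_i = 1) as the explicit sum over the four input states (u, v)."""
    cond = {(0, 0): 0.0,
            (0, 1): 0.5 * p_l,
            (1, 0): 0.5 * p_l,
            (1, 1): p_l - 0.25 * p_l ** 2}
    prob = {0: 1.0 - p_next, 1: p_next}       # independent, identical inputs
    return sum(cond[u, v] * prob[u] * prob[v]
               for u in (0, 1) for v in (0, 1))


def step_closed_form(p_next, p_l):
    """P(X_i = 1) = p_l P(X_{i+1}=1) - 0.25 p_l^2 P(X_{i+1}=1)^2."""
    return p_l * p_next - 0.25 * (p_l * p_next) ** 2


def bandwidth(n, p_r, p_m, p_l, p_q):
    """BW = N * P(X_0 = 1) * p_l * p_m for an N x N network, N a power of 2."""
    k = n.bit_length() - 1                    # k = log2(N)
    p = p_q * p_r                             # P(X_k = 1)
    for _ in range(k):                        # stages k-1, ..., 0
        p = step_closed_form(p, p_l)
    return n * p * p_l * p_m


# The two forms of one recursion step agree (illustrative values).
assert math.isclose(step_enumerated(0.3, 0.9), step_closed_form(0.3, 0.9))
print(bandwidth(n=8, p_r=0.8, p_m=0.9, p_l=1.0, p_q=0.7))
```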

Non Redundant Interconnection Network - Connectivity Analysis


Q - average number of operational paths (connected processor-memory pairs)
Exactly one path between a processor and a memory
Q = product of number of processor-memory pairs and the probability of a fault-free path
Latter probability - $p_r \, p_l^{k+1} \, p_m$
  k+1 - number of links along the path = number of stages + 1 ($k = \log_2 N$)
$N^2$ - number of processor-memory paths
$Q = N^2 \, p_r \, p_l^{k+1} \, p_m$

Non Redundant Network - Additional Measures Calculation


$A_r$ - expected number of accessible processors - product of N and the probability that a processor (say processor 0) is accessible
Link is in state X=0 (X=1) if all (not all) paths from it to the memories are faulty
A faulty path is a path with at least one faulty link
A faulty link is in state X=0
Stage numbers - k to 0; $X_i$ - state of link in stage i
$P(X_0 = 1) = p_m$ ; $P(X_0 = 0) = 1 - p_m$
Recursive equation - $P(X_{i+1} = 1) = p_l \, [1 - P(X_i = 0)^2]$
Probability that processor 0 is accessible - $p_r \, p_l \, P(X_k = 1)$
$A_r = N \, p_r \, p_l \, P(X_k = 1)$
$A_m$ - obtained similarly by interchanging $p_r$ and $p_m$
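A minimal sketch of the accessibility recursion follows. The same routine serves both $A_r$ and $A_m$ by swapping the roles of $p_r$ and $p_m$. The function names and the parameter values in the usage lines are assumptions made for illustration.

```python
def p_link_up(k, p_end, p_l):
    """P(X_k = 1): at least one fault-free path from a stage-k link.

    p_end is the fault-free probability of the far endpoints:
    p_m when starting from a processor, p_r when starting from a memory.
    """
    p_up = p_end                               # P(X_0 = 1)
    for _ in range(k):
        # Link is fault-free and not both onward links are in state 0.
        p_up = p_l * (1.0 - (1.0 - p_up) ** 2)
    return p_up


def accessible(n, p_self, p_other, p_l):
    """Expected number of accessible units: N * p_self * p_l * P(X_k = 1)."""
    k = n.bit_length() - 1                     # k = log2(N), N a power of 2
    return n * p_self * p_l * p_link_up(k, p_other, p_l)


a_r = accessible(8, p_self=0.8, p_other=0.9, p_l=1.0)   # accessible processors
a_m = accessible(8, p_self=0.9, p_other=0.8, p_l=1.0)   # accessible memories
print(a_r, a_m)
```

With fault-free links ($p_l = 1$) the recursion squares $P(X_i = 0)$ at each stage, so $P(X_k = 0) = (1 - p_m)^N$.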


Example - Bandwidth Calculation - Fault Free Links


N = 8 ; k = 3 ; for some fixed t
$p_r = 0.8$ ; $p_m = 0.9$ ; $p_l = 1$ (links are fault-free) ; $p_q = 0.7$
BW calculation:
  $P(X_3 = 1) = p_q \, p_r = 0.56$
  $P(X_2 = 1) = 0.56 - 0.25 \times 0.56^2 = 0.482$
  $P(X_1 = 1) = 0.482 - 0.25 \times 0.482^2 = 0.424$
  $P(X_0 = 1) = 0.424 - 0.25 \times 0.424^2 = 0.379$
  $BW = 0.379 \, N \, p_m = 0.379 \times 8 \times 0.9 = 2.73$

Example - Additional Measures Calculation



$p_r = 0.8$ ; $p_m = 0.9$ ; $p_l = 1$
$Q = N^2 \times 0.8 \times 0.9 = 0.72 \, N^2$
$A_r = N p_r [1 - (1 - p_m)^N] = 0.8 \, N (1 - 0.1^N)$
$A_m = N p_m [1 - (1 - p_r)^N] = 0.9 \, N (1 - 0.2^N)$
$N_r = Q / A_m$
$N_m = Q / A_r$
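These closed forms can be evaluated for a concrete size. The sketch below assumes N = 8, the size used in the bandwidth example, and fault-free links. The function name `example_measures` is made up for illustration.

```python
def example_measures(n, p_r=0.8, p_m=0.9):
    """Connectivity measures for fault-free links (p_l = 1)."""
    q = n * n * p_r * p_m                      # Q = N^2 p_r p_m when p_l = 1
    a_r = n * p_r * (1 - (1 - p_m) ** n)       # accessible processors
    a_m = n * p_m * (1 - (1 - p_r) ** n)       # accessible memories
    return {"Q": q, "A_r": a_r, "A_m": a_m,
            "N_r": q / a_m, "N_m": q / a_r}


for name, value in example_measures(8).items():
    print(f"{name} = {value:.3f}")
```

For large N both $A_r$ and $A_m$ approach $N p_r$ and $N p_m$, so $N_r$ and $N_m$ approach the same two values in the same order.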
