
Lecture 4

• Fault tolerance
• Hardware fault tolerance
• Redundancy

1
Error Detection
(cont. from Lecture 3)

2
Erroneous events
• The basic principle: in each state the
controller expects certain events. If the
event that arrives is not one of those, the
controller detects an error

• Sometimes it is possible to trace the
error back to its cause, sometimes not.
But in any case, the error should not go
unnoticed
3
Example: conveyor belt

• Conveyor belt example:


• Assume that it takes at most 10 sec to travel from
the beginning to the end of the conveyor belt
• There are sensors at the beginning and the end of the belt
• We need to stop the belt if the table is not ready
4
Erroneous events: example
• Scenario: The controller issued a signal for
the motor to stop and waits for the table
to become ready. The motor failed to stop.
The exit sensor registered the part
passing it.


IF exit_sensor = OFF & state = waiting THEN
    state := failed; output('motor failed to stop')
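The "expected events per state" principle can be sketched in Python. This is only an illustration: the state names, the event names and the `EXPECTED` table are assumptions made for the example, not part of the lecture's controller.

```python
from enum import Enum, auto


class State(Enum):
    RUNNING = auto()
    WAITING = auto()  # motor was told to stop; waiting for the table
    FAILED = auto()


# Events the controller expects in each state (illustrative assumption).
EXPECTED = {
    State.RUNNING: {"entry_sensor", "exit_sensor", "table_busy"},
    State.WAITING: {"table_ready"},
    State.FAILED: set(),
}


def step(state, event):
    """Return (next_state, message); any unexpected event is reported as an error."""
    if event not in EXPECTED[state]:
        return State.FAILED, f"unexpected event {event!r} in state {state.name}"
    if state is State.RUNNING and event == "table_busy":
        return State.WAITING, "motor stop requested"
    if state is State.WAITING and event == "table_ready":
        return State.RUNNING, "motor restarted"
    return state, "ok"
```

In this sketch an exit-sensor event while waiting is not in the expected set, so the controller moves to `FAILED` instead of letting the error pass unnoticed.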
6
Fault tolerance
• The idea: do not allow a fault to result in the
failure of the entire system
• In other words: try to provide the required
services (possibly degraded) in the presence of
faults
• Fault tolerance is achieved by redundancy

7
Redundancy
• Redundancy is the use of additional
elements within the system that would not
be required in a system that was free from all
faults
• I.e. redundancy is the overhead required to
tolerate faults
• It is needed to detect or mask a fault and to
continue operating even if some redundant
component has failed
8
Forms of redundancy
• Hardware redundancy
• Software redundancy
• Information redundancy
• Temporal (time) redundancy

9
Hardware fault tolerance
• Hardware redundancy: static, dynamic and hybrid
• Static systems use fault masking rather than fault
detection to achieve fault tolerance
• Dynamic redundancy relies on the detection of
faults and on the system taking appropriate actions to
nullify their effects (this involves reconfiguration)
• Hybrid techniques use fault masking to prevent
errors from being propagated within the system, and
fault detection and reconfiguration to remove faulty
units from the system

10
Static redundancy - Triple modular
redundancy
• 3 identical modules
receive identical inputs
and should produce 3
identical outputs
• a voter compares the
outputs from the three
modules
• if all agree, the voter
produces the unanimous
output
• if a single fault occurs,
the voter gives the output
produced by the majority
11
Simulation of TMR
Program TMR
var in1, in2, in3 : NAT
out :NAT
get(in1) || get (in2) || get (in3);
if in1=in2 then out:= in1
elseif in1=in3 then out:=in3
elseif in2=in3 then out:=in2
else ERROR;
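A minimal runnable sketch of the same voter in Python. The function name `tmr_vote` and the use of an exception for the ERROR case are choices made for this example.

```python
def tmr_vote(in1, in2, in3):
    """Return the value that at least two of the three modules agree on."""
    if in1 == in2 or in1 == in3:
        return in1
    if in2 == in3:
        return in2
    # Corresponds to the ERROR branch: no two outputs agree.
    raise ValueError("no majority: all three module outputs differ")
```

For example, `tmr_vote(5, 7, 5)` returns 5, masking the single faulty output 7.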
12
If the match is not perfect
if in2-delta ≤ in1≤ in2+delta \/
in1-delta ≤ in2 ≤ in1+delta
then out:= in1
similarly the other branches

Possible modifications: out:= (in1+in2)/2
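The tolerance-based comparison can be sketched as follows, assuming numeric outputs. The helper name `close` and the choice to return the average of the first agreeing pair (the "possible modification" above) are assumptions of this example.

```python
def close(a, b, delta):
    """True if a and b differ by at most delta."""
    return abs(a - b) <= delta


def tmr_vote_approx(in1, in2, in3, delta):
    """Return the average of the first pair of outputs that agree within delta."""
    for a, b in ((in1, in2), (in1, in3), (in2, in3)):
        if close(a, b, delta):
            return (a + b) / 2
    raise ValueError("no two module outputs agree within delta")
```

Note that `abs(a - b) <= delta` expresses the same condition as either of the two inequalities in the slide.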

13
Binary voting
Program BinaryVoting
var x,y,z : array [0..N-1] of boolean; /*input*/
u,v,w : array [0..N-1] of boolean; /*intermediate*/
out: array [0..N-1] of boolean; /*output*/
Init: initialisation of arrays
for i=0 to N-1 do
Begin
u(i) := x(i) /\ y(i);
v(i) := x(i) /\ z(i);
w(i) := y(i) /\ z(i);
out(i) := u(i) \/ v(i) \/ w(i)
End
End
14
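The same bitwise majority can be sketched in Python, once over lists of booleans and once over whole integer words. Both function names are chosen for this example.

```python
def majority_bits(x, y, z):
    """Bit-by-bit majority of three equal-length boolean sequences."""
    return [(a and b) or (a and c) or (b and c) for a, b, c in zip(x, y, z)]


def majority_word(x, y, z):
    """The same vote applied to all bits of three integers at once."""
    return (x & y) | (x & z) | (y & z)
```

Each output bit is true exactly when at least two of the three input bits are true, so a fault in any single input word is masked.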
TMR
• Used to prevent the failure of a single
component from causing the failure of the complete
system – a so-called single-point failure

• Problems
• Input signals – there might be delays, conversion
errors etc.
• Voting element: either guarantee a very
dependable voting element or replicate the voting
arrangement
15
TMR with triplicate voting

16
Multistage TMR

17
TMR: analysis
• Once a module has failed, the ability of the
system to tolerate further faults is reduced,
and in a TMR system it may be non-existent
• It is vital to notice the presence of a faulty module
so that it can be repaired or replaced
• Hence it is useful to record every
disagreement in a log
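One way to record disagreements is to have the voter report which module dissented. The sketch below uses Python's standard `logging` module; the logger name, the function name and the return convention are assumptions of this example.

```python
import logging

log = logging.getLogger("tmr")  # logger name is an arbitrary choice


def tmr_vote_logged(outputs):
    """Vote over three module outputs.

    Returns (value, dissenter), where dissenter is the index of the module
    that disagreed with the majority, or None if all three agreed.
    """
    a, b, c = outputs
    if a == b == c:
        return a, None
    if a == b:
        value, dissenter = a, 2
    elif a == c:
        value, dissenter = a, 1
    elif b == c:
        value, dissenter = b, 0
    else:
        log.error("no majority: outputs=%r", outputs)
        raise ValueError("no majority: all three module outputs differ")
    log.warning("module %d disagrees with the majority: outputs=%r", dissenter, outputs)
    return value, dissenter
```

The log entry lets maintenance staff find and replace the faulty module before a second fault removes the system's remaining tolerance.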

18
NMR

19
NMR
• An odd number of modules is used
• A module failure does not result in a
system failure provided the majority of
modules are ok.
• The system will tolerate the failure of
⌊(N-1)/2⌋
modules.
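A short Python sketch of an N-input majority voter and of the tolerated-fault count. The function names are chosen for this example, and module outputs are assumed to be hashable values so that `collections.Counter` can tally them.

```python
from collections import Counter


def max_tolerated_faults(n):
    """Number of faulty modules an N-module voter can outvote: floor((N-1)/2)."""
    return (n - 1) // 2


def nmr_vote(outputs):
    """Return the value produced by a strict majority of the modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise ValueError("no majority among module outputs")
    return value
```

With N = 5, `max_tolerated_faults(5)` is 2, and `nmr_vote([4, 4, 9, 4, 1])` still returns 4 despite two faulty outputs.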
20
Dynamic redundancy
• Static redundancy is expensive: we need 3
modules to tolerate 1 fault, 5 modules to
tolerate 2 faults etc.
• Dynamic redundancy uses fault detection
instead of fault masking
• One module is operational and one or more
standby modules are available to take over if
the operational module fails
• So we need 2 modules to tolerate 1 fault, 3 to
tolerate 2 faults.
• Success depends on fault detection
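The module counts quoted above follow a simple pattern, sketched here in Python. The function names are chosen for this example, and the dynamic figure assumes that fault detection itself is perfect.

```python
def modules_static(faults):
    """Modules needed to mask `faults` faults by majority voting."""
    return 2 * faults + 1


def modules_dynamic(faults):
    """Modules needed with standby spares: one operational plus one spare per fault."""
    return faults + 1
```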
21
Dynamic redundancy
• It does not mask faults but rather tries to
confine them and reconfigure the system to
achieve fault tolerance
• Observe that an error must be produced
before the dynamic system can recognize the
fault and take action
• Hence we can use it in systems which
can tolerate temporary errors within their
operation
22
Standby spares
• While no fault is
detected, the operating
module drives the output
through the switch
• When a fault is detected,
the switch takes the
output from a
standby module
• The switch is controlled
by the fault detection
scheme
• Reconfiguring the
system causes
disruption of the system
while outputs are
switched
23
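A minimal sketch of a standby-spare arrangement in Python. Modules are modelled as plain callables, and `fault_detected` is a hypothetical predicate standing in for the fault detection scheme; neither comes from the lecture.

```python
class StandbySwitch:
    """Routes the output of the active module; switches to a spare on a detected fault."""

    def __init__(self, modules, fault_detected):
        self.modules = list(modules)          # modules[0] is operational, the rest are spares
        self.fault_detected = fault_detected  # hypothetical detector: result -> bool
        self.active = 0

    def output(self, x):
        while self.active < len(self.modules):
            result = self.modules[self.active](x)
            if not self.fault_detected(result):
                return result
            self.active += 1  # reconfigure: take the output from the next spare
        raise RuntimeError("all modules have failed")
```

The retry inside `output` corresponds to the disruption mentioned above: the faulty result has to be produced and detected before the spare takes over.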
Standby spares: Hot or Cold
• In hot standby the spare runs continuously in
parallel with the active unit
• Hot standby minimizes disruption
• Problems with hot standby
– increased power consumption
– the spare is subject to the same operating stress
• Cold standby: the spare is unpowered until
called into service
– Pros and cons are the opposite of hot standby

24
Self-checking pair
(fig 6.13)
• Two identical modules
are fed with identical
inputs and their results
are compared
• The output from one
module is passed on
to the next stage and the
output of the
comparator is used as a
failure detection signal
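The self-checking pair can be sketched in a few lines of Python, with modules modelled as callables. The function name and the tuple return value are choices made for this example.

```python
def self_checking_pair(module_a, module_b, x):
    """Run two identical modules on the same input.

    Returns (output, failure): the output of module_a is passed on, and
    failure is True when the comparator sees the two results differ.
    """
    out_a = module_a(x)
    out_b = module_b(x)
    return out_a, out_a != out_b
```

The pair detects a disagreement but cannot tell which module is wrong, which is why the comparator output is only a failure detection signal.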

25
A self-checking pair using
software comparison

26
A self-checking component

27
Hybrid fault tolerance
• Can take different forms.
• Often it is a form of N-modular redundancy
with spares
• N modules produce outputs. If all of them
match, the unanimous output is produced. If a
mismatch is detected, the voter produces the
majority view. A disagreement detector identifies
the faulty component and switches it off. A
standby module is used instead of the faulty one.
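A rough Python sketch of N-modular redundancy with spares. Modules are callables with hashable outputs; the class name and the policy of simply dropping a faulty module when no spare is left are assumptions of this example.

```python
from collections import Counter


class HybridNMR:
    """Majority voter whose disagreement detector swaps faulty modules for spares."""

    def __init__(self, active, spares):
        self.active = list(active)
        self.spares = list(spares)

    def output(self, x):
        results = [m(x) for m in self.active]
        value, count = Counter(results).most_common(1)[0]
        if count <= len(results) // 2:
            raise RuntimeError("no majority among active modules")
        survivors = []
        for module, result in zip(self.active, results):
            if result == value:
                survivors.append(module)
            elif self.spares:
                survivors.append(self.spares.pop(0))  # spare replaces the faulty module
            # with no spare left, the faulty module is simply dropped (assumption)
        self.active = survivors
        return value
```

The vote masks the fault immediately, and the replacement restores the full set of modules for later inputs.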
28
