Reliable System Design: Hardware Design Checklist Testing Embedded Systems Critical Systems
[Figure: dependability diagram. Attributes: Reliability, Safety, Security. Impairments: Faults, Errors, Failures. Other labels: Fault Prevention, Fault Forecasting, Verification, Testing.]
Fault, Error and Failure
Incorrectness in a system may be described in three different terms:
fault, error, and failure.
A fault is a physical defect, imperfection, or flaw that occurs in
hardware or software.
An error is a deviation from correctness or accuracy.
Errors are usually associated with incorrect values in the system state.
A failure occurs when the system does not deliver the service expected of it.
Fault Example:
short between wires
break in transistor
infinite program loop
Error Example:
Suppose a line is physically shorted to 0 (there is a fault).
As long as the value on the line is supposed to be 0, there is no
error.
Failure Example:
Suppose a circuit controls a lamp (0 = turn off, 1 = turn on)
and the output is physically shorted to 0 (there is a fault).
As long as the user wants the lamp off, there is no failure.
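The lamp example can be sketched in a few lines of Python. This is only an illustrative model: the names `lamp_output` and `classify` and the `stuck_at_zero` flag are assumptions made for this sketch, not part of the original material.

```python
def lamp_output(commanded: int, stuck_at_zero: bool) -> int:
    """Return the value actually present on the lamp line (0 = off, 1 = on)."""
    # A line shorted to 0 always reads 0, whatever value is commanded.
    return 0 if stuck_at_zero else commanded


def classify(commanded: int, stuck_at_zero: bool) -> dict:
    """Report whether a fault, an error, and a failure are present."""
    actual = lamp_output(commanded, stuck_at_zero)
    fault = stuck_at_zero          # the physical defect exists
    error = actual != commanded    # line value deviates from the correct value
    # In this one-line model the erroneous value is directly the output,
    # so the user observes a failure exactly when there is an error.
    failure = error
    return {"fault": fault, "error": error, "failure": failure}


print(classify(commanded=0, stuck_at_zero=True))  # fault only
print(classify(commanded=1, stuck_at_zero=True))  # fault, error, and failure
```

The fault is present in both calls, but it only becomes an error and a failure when the lamp is supposed to be on.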
Fault Sources
Design Problems
Software or Hardware
Manufacturing Problems
Damage and Deterioration
External disturbances
Harsh environmental conditions, electromagnetic
interference, and ionizing radiation
System Misuse
People
Fault Classifications
Extent:
Local (independent)
Distributed (related)
Duration: Transient, Intermittent, Permanent
Transient: appear and disappear quickly, and are
not correlated with each other. They are most
commonly induced by random environmental
disturbances such as electromagnetic interference.
Many faults in communication systems are transient.
Intermittent: appear, disappear, and reappear
repeatedly. They are difficult to predict, but their
effects are highly correlated. Most intermittent
faults are due to marginal design,
e.g., a heat-sensitive hardware component.
Permanent: remain in existence indefinitely if no
corrective action is taken. Though many are
design or manufacturing faults, they can also be
caused by catastrophic events such as an accident.
E.g., a broken wire or a software design error.
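The three duration classes can be illustrated with a rough heuristic over a trace of observations. This is a sketch under stated assumptions: the function `classify_duration` and the idea of sampling the fault as a list of booleans are hypothetical, and a finite observation window cannot truly distinguish a permanent fault from a long-lived transient one.

```python
from typing import Sequence


def classify_duration(observed: Sequence[bool]) -> str:
    """Classify a fault from a trace of samples (True = fault observed)."""
    if not any(observed):
        return "none"
    # Count separate bursts: a burst starts where a True sample
    # follows a False sample (or sits at the start of the trace).
    bursts = sum(
        1
        for i, present in enumerate(observed)
        if present and (i == 0 or not observed[i - 1])
    )
    if bursts > 1:
        return "intermittent"   # appears, disappears, and reappears
    if observed[-1]:
        return "permanent"      # appeared once and never went away
    return "transient"          # appeared once and disappeared


print(classify_duration([False, True, False, False]))        # transient
print(classify_duration([False, True, False, True, False]))  # intermittent
print(classify_duration([False, False, True, True]))         # permanent
```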
Methods for Minimizing Faults
1) Fault Prevention
attempts to eliminate any possibility of faults in a system before
it goes operational.
Fault prevention has two stages: fault avoidance and fault
removal.
A. Fault Avoidance:
» attempts to limit the introduction of faults during system
construction (e.g., by use of the most reliable components)
B. Fault Removal:
» Procedures for finding and removing the causes of errors
» e.g. design reviews, program verification, and system
testing
Fault Tolerance
2) Fault tolerance is the ability of a computing system to
survive in the presence of faults.
All fault-tolerant techniques rely on extra elements
introduced into the system to detect and recover from faults.
These elements are redundant in the sense that a perfect
system would not need them; this is often called redundancy.
Two types:
Static (or masking)
Dynamic redundancy
Masking
Static/Masking: redundant components are used
inside a system to hide the effects of faults without
explicit error detection.
E.g., Triple Modular Redundancy (TMR):
3 identical subcomponents and majority voting circuits.
The outputs are compared, and if one differs from the other
two, that output is masked out.
Assumes the fault is not common to all modules (as a design
error would be) but is either transient or due to component deterioration.
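The majority vote at the heart of TMR can be sketched in software as follows. This is a minimal illustration, not a hardware voter design; the function name `tmr_vote` and the choice to raise an exception when all three outputs disagree are assumptions made for this sketch.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the majority value of three module outputs."""
    # most_common(1) yields the single most frequent value and its count.
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        # No two outputs agree, so the voter cannot mask the fault.
        raise ValueError("no majority: all three outputs disagree")
    return value


# One faulty module (output 0 instead of 1) is masked by the other two.
print(tmr_vote(1, 1, 0))  # 1
```

If all three modules share the same design error, they agree on the same wrong value and the voter passes it through, which is why the technique assumes the fault is not common to the modules.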
Dynamic Redundancy
$R_k(t) = e^{-\lambda_k t}$
where $\lambda_k$ is the constant failure rate of component k
Assuming the failures of the components are statistically
independent, the overall serial system reliability $R_{ser}(t)$ is
$R_{ser}(t) = R_1(t) \times R_2(t) \times R_3(t) \times \dots \times R_n(t)$
Failure probability of component k:
$Q_k(t) = 1 - e^{-\lambda_k t}$, where $\lambda_k$ is a constant failure rate
Overall parallel system reliability: $R_{par}(t) = 1 - \prod_{i=1}^{n} (1 - R_i(t))$
Example
Example 1:
If a serial system is built from 100 components, each
of which has a reliability of 0.999, the overall system
reliability is
$R_{ser} = (0.999)^{100} \approx 0.905$
Example 2:
Consider a system with 4 identical modules connected in
parallel.
If the reliability of each module is 0.95,
the overall system reliability is
$R_{par} = 1 - (1 - 0.95)^4 = 0.99999375$
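Both examples can be checked numerically. The sketch below assumes statistically independent components, as the formulas do; the function names are chosen for illustration only.

```python
import math


def component_reliability(failure_rate: float, t: float) -> float:
    """R_k(t) = exp(-lambda_k * t) for a constant failure rate."""
    return math.exp(-failure_rate * t)


def serial_reliability(reliabilities) -> float:
    """A serial system works only if every component works."""
    return math.prod(reliabilities)


def parallel_reliability(reliabilities) -> float:
    """A parallel system fails only if every component fails."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


# Example 1: 100 components in series, each with reliability 0.999.
print(round(serial_reliability([0.999] * 100), 4))   # approximately 0.9048

# Example 2: 4 modules in parallel, each with reliability 0.95.
print(round(parallel_reliability([0.95] * 4), 8))    # approximately 0.99999375
```

The comparison shows the two effects: chaining many highly reliable components in series still lowers overall reliability noticeably, while a few moderately reliable modules in parallel raise it well above that of any single module.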
Example