
Chapter Seven

Reliable System Design


Contents
 Hardware Design Checklist
 Testing Embedded Systems
 Critical Systems
 Dependability
Hardware Design Checklist
 A complete and reliable design requires all of the
innumerable details to be evaluated and analyzed
correctly.
 The following checklist is intended to provide a guide
for the designer to ensure that all the important
design aspects have been evaluated.
 Define Power Supply Requirements
 Verify Voltage Level Compatibility
 Check DC Fan-Out: Output Current Drive vs. Loading
 AC (Capacitive) Output Drive vs. Capacitive Load
and De-rating
Cont’d…
 Verify Worst Case Timing Conditions
 Determine if Transmission Line Termination is Required
 Clock Distribution
 Power and Ground Distribution
 Asynchronous Inputs
 Guarantee Power-On Reset State
 Programmable Logic Devices
 Deactivate Interrupt and Other Requests on Power-Up
 Electromagnetic Compatibility Issues
 Manufacturing and Test Issues
 source: Embedded Controller Hardware Design by Ken Arnold,
“Appendix A, Hardware Design Checklist”
Testing Embedded Systems
 Testing:
 is an organized process to verify the behavior, performance, and
reliability of a device or system against designed specifications.
 is a manufacturing step to ensure that the manufactured device is defect-free.
 is one of the detective measures of quality.
 Testing is different from Verification.
 Verification or debugging
 is the process of removing defects (bugs) in the design phase
to ensure that the synthesized design, when manufactured,
will behave as expected.
 is one of the corrective measures of quality.
Verification vs. Testing
 Verification verifies the correctness of the design; testing verifies the correctness of the manufactured system.
 Verification is performed once, prior to manufacturing; test application is performed on every manufactured device.
 Verification is responsible for the quality of the design; testing is responsible for the quality of the devices.
Critical Systems
 In some embedded systems failure can result in significant
economic losses, physical damage or threats to human life.
 These systems are called critical systems.
 If critical systems fail to deliver their services as expected then
serious problems and significant losses may result.
 There are three main types of critical systems:
 Safety-critical systems: A system whose failure may result in injury, loss of life or serious environmental damage.
 Mission-critical systems: A system whose failure may result in the failure of some goal-directed activity.
 Business-critical systems: A system whose failure may result in very high costs for the business using that system.
Cont’d

 Modern electronic systems increasingly make use of embedded computer systems to add functionality and to increase flexibility, controllability and performance.
 The increased use of embedded software to control systems brings with it certain risks.
Dependability

 The most important property of a critical system is its dependability.
 A dependable system provides trustworthy operation to its users.
Dependability Terminology
 Attributes: Availability, Reliability, Safety, Security
 Means: Fault Prevention, Fault Tolerance, Fault Forecasting
 Impairments: Faults, Errors, Failures
Fault, Error and Failure
 Incorrectness in systems may be described in different terms
as fault, error and failure.
 Fault is a physical defect, imperfection or flaw that occurs in hardware or software.
 Error is a deviation from correctness or accuracy.
 Errors are usually associated with incorrect values in the system state.
 Failure is the non-performance of some action as expected.
 There is a cause-and-effect relationship between faults, errors, and failures.
 Faults result in errors, and errors can lead to system failures.
Relationship between Fault, Error and Failure
(Figure: fault → error → failure chain. Source: Fault, Failure & Reliability by Lee, Kyoungwoo)
Example

 Fault Example:
 short between wires
 break in transistor
 infinite program loop
 Error Example:
 Suppose a line is physically shorted to 0 (there is a fault). As long as the value on the line is supposed to be 0, there is no error.
 Failure Example:
 Suppose a circuit controls a lamp (0 = turn off, 1 = turn on) and its output line is shorted to 0 (there is a fault). As long as the user wants the lamp off, there is no failure.
Fault Sources

 Design Problems
 Software or Hardware
 Manufacturing Problems
 Damage and Deterioration
 External disturbances
 Harsh environmental conditions, electromagnetic interference and ionizing radiation
 System Misuse
 People
Fault Classifications
 Extent:
Local (independent)
Distributed (related)
 Duration: Transient, Intermittent, Permanent
Transient: appear and disappear quickly, and are not correlated with each other. They are most commonly induced by random environmental disturbances such as electromagnetic interference.
 Many faults in communication systems are transient.
Fault Classifications
Intermittent: appear, disappear, and reappear repeatedly. They are difficult to predict, but their effects are highly correlated. Most intermittent faults are due to marginal design.
 e.g., a heat-sensitive hardware component
Permanent: remain in existence indefinitely if no corrective action is taken. Though many are design or manufacturing faults, they can also be caused by catastrophic events such as an accident.
 e.g., a broken wire, a software design error
Methods for Minimizing Faults
1) Fault Prevention
 attempts to eliminate any possibility of faults in a system before
it goes operational.
 Fault Prevention has two stages: Fault Avoidance and Fault Removal
A. Fault Avoidance:
» attempts to limit the introduction of faults during system construction (by use of the most reliable components)
B. Fault Removal:
» Procedures for finding and removing the causes of errors
» e.g. design reviews, program verification, and system
testing
Fault Tolerance
2) Fault-Tolerance is the ability of a computing system to
survive in the presence of faults.
 All fault-tolerant techniques rely on extra elements introduced into the system to detect and recover from faults.
 These components are redundant in that they are not required in a perfect system; the approach is therefore often called redundancy.
 Two types:
 Static (or masking) redundancy
 Dynamic redundancy
Masking
 Static/Masking: redundant components are used inside a system to hide the effects of faults without explicit error detection.
 e.g. Triple Modular Redundancy (TMR):
 3 identical subcomponents and majority voting circuits;
 the outputs are compared and, if one differs from the other two, that output is masked out (a minimal voter sketch follows below);
 assumes the fault is not common to all three copies (such as a design error) but is either transient or due to component deterioration.
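As an illustration (not from the source slides), a minimal Python sketch of the TMR voting idea; the three arguments stand for the outputs of the identical subcomponents:

```python
def tmr_vote(a, b, c):
    """Majority vote over the outputs of three identical modules.

    If at least two outputs agree, any single disagreeing output is
    masked; if all three disagree, no majority exists and the fault
    cannot be masked.
    """
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise ValueError("no majority: all three module outputs differ")

# Example: module 'b' suffers a transient fault; its output is masked.
print(tmr_vote(1, 0, 1))  # -> 1
```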
Dynamic Redundancy

 Reconfiguration/Dynamic redundancy: eliminating a faulty entity from the system and restoring the system to an operational state.
 Error detection: recognizing that an error occurred
 Error location: identifying the module with the error
 Error containment: preventing errors from propagating
 Error recovery: regaining operational status
Dynamic redundancy
 e.g. Standby sparing: one module is operational while one or more modules serve as spares.
error detection + error location + containment + error recovery
faulty modules are removed and replaced by a spare (a simplified sketch follows below)
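A minimal sketch of the standby-sparing idea in Python; the Module class, its self_test() check and run() method are hypothetical placeholders, and error detection is reduced to a boolean self-test:

```python
class Module:
    """Hypothetical module with a simple built-in self-test."""
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def self_test(self):          # error detection (simplified)
        return self.healthy

    def run(self, job):
        return f"{self.name} handled {job}"


class StandbySpared:
    """One operational module plus a pool of standby spares."""
    def __init__(self, active, spares):
        self.active = active
        self.spares = list(spares)

    def run(self, job):
        # Detection + location: only the active module runs, so a failed
        # self-test locates the fault there. Containment + recovery:
        # the faulty module is removed and a spare is switched in.
        while not self.active.self_test():
            if not self.spares:
                raise RuntimeError("no spares left: system failure")
            self.active = self.spares.pop(0)
        return self.active.run(job)


system = StandbySpared(Module("M0", healthy=False), [Module("M1")])
print(system.run("job-42"))  # -> "M1 handled job-42"
```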
Attributes of Dependability
 Availability:
 The probability that the system can provide the services
requested by users at any time.
 Reliability:
 The probability, over a given period of time, that the
system will correctly deliver services.
 Safety:
 Shows the extent of damage that may be caused by the system to people or its environment.
 Security:
 Shows how well the system can resist accidental or deliberate unauthorized intrusions.
System Reliability

 The Reliability, RF(t), of a system is the probability that no fault of the class F occurs (i.e. the system survives) during time t:
RF(t) = Pr[tinit ≤ t < tf for all f ∈ F]
 where tinit is the time of introduction of the system to service and tf is the time of occurrence of the first failure f drawn from F.
 The Failure Probability, QF(t), is complementary to RF(t):
RF(t) + QF(t) = 1
 Dropping the F subscript from RF(t) and QF(t):
R(t) + Q(t) = 1
Component Reliability Model
 When the lifetime of a system is exponentially
distributed, the reliability of the system is:
R(t) = e^(−λt)
 The parameter λ is called the failure rate.
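As an illustrative check (the failure rate and mission time below are made-up values, not from the slides):

```python
import math

def reliability(failure_rate, t):
    """R(t) = exp(-lambda * t) for an exponentially distributed lifetime."""
    return math.exp(-failure_rate * t)

# Assumed values: lambda = 1e-4 failures per hour, mission time 1000 hours.
lam, t = 1e-4, 1000
print(reliability(lam, t))        # R(t) ~= 0.905
print(1 - reliability(lam, t))    # Q(t) = 1 - R(t) ~= 0.095
```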
Reliability of System of Components

 Minimal Path Set: a minimal set of components whose functioning ensures the functioning of the system.
 For the example block diagram (figure not shown), the minimal path sets are {1,3,4}, {2,3,4}, {1,5}, {2,5}.
Serial System Reliability
 For Serially Connected Components the overall
system reliability depends on the proper working of
each component.
 Let Rk(t) be the reliability of a single component k, given as
Rk(t) = e^(−λk t)
where λk is the constant failure rate of component k.
 Assuming the failure rates of the components are statistically independent, the overall serial system reliability Rser(t) is
Rser(t) = R1(t) × R2(t) × R3(t) × … × Rn(t)
 Serial failure rate:
λser = λ1 + λ2 + … + λn
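A short numeric sketch of the serial formulas (the failure rates below are assumed values):

```python
import math

def serial_reliability(failure_rates, t):
    """Rser(t) as the product of component reliabilities Rk(t) = exp(-lambda_k * t)."""
    return math.prod(math.exp(-lam * t) for lam in failure_rates)

# Assumed failure rates (per hour) for three components in series.
rates = [1e-5, 2e-5, 5e-6]
t = 10_000
print(serial_reliability(rates, t))   # product form
print(math.exp(-sum(rates) * t))      # same value via the summed serial failure rate
```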


Parallel System Reliability
 For parallel-connected components, the system will operate correctly provided at least one module is operational.
 Qk(t) = 1 − Rk(t), where Rk(t) is the reliability of a single component k
 Qk(t) = 1 − e^(−λk t), where λk is a constant failure rate
 Assuming the failure rates of the components are statistically independent:
QPar(t) = Q1(t) × Q2(t) × … × Qn(t)
 Overall system reliability:
RPar(t) = 1 − (1 − R1(t)) × (1 − R2(t)) × … × (1 − Rn(t))
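A corresponding sketch for the parallel case (the component reliabilities used here are assumed values):

```python
def parallel_reliability(reliabilities):
    """RPar = 1 - product of (1 - Ri): the system fails only if every branch fails."""
    q_total = 1.0
    for r in reliabilities:
        q_total *= (1.0 - r)   # Qi = 1 - Ri, multiplied across independent branches
    return 1.0 - q_total

print(parallel_reliability([0.9, 0.8, 0.95]))  # -> 1 - 0.1*0.2*0.05 = 0.999
```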
Example

 Example 1:
 If one were to build a serial system with 100 components, each of which has a reliability of 0.999, the overall system reliability would be
Rser = (0.999)^100 ≈ 0.905
 Example 2:
 Consider a system with 4 identical modules connected in parallel.
 If the reliability of each module is 0.95, the overall system reliability is
Rpar = 1 − (1 − 0.95)^4 = 0.99999375
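Both figures can be reproduced numerically (rounded as on the slide):

```python
# Example 1: 100 components in series, each with reliability 0.999.
r_ser = 0.999 ** 100
print(round(r_ser, 3))      # -> 0.905

# Example 2: four identical modules in parallel, each with reliability 0.95.
r_par = 1 - (1 - 0.95) ** 4
print(round(r_par, 8))      # -> 0.99999375
```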
Example

 Example 3: Parallel-Serial Reliability
 Given R1 = 0.9, R2 = 0.9, R3 = 0.99, R4 = 0.99, R5 = 0.87, with R1 and R2 in parallel forming the first half, and the series combination of R3 and R4 in parallel with R5 forming the second half.
 The total reliability is the reliability of the first half in series with that of the second half:
Rtotal = (1 − (1 − 0.9)(1 − 0.9)) × (1 − (1 − 0.99 × 0.99)(1 − 0.87)) ≈ 0.987
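The same value can be reproduced by composing the serial and parallel rules (a sketch using the reliabilities given above):

```python
def serial(*rs):
    """Series combination: product of reliabilities."""
    out = 1.0
    for r in rs:
        out *= r
    return out

def parallel(*rs):
    """Parallel combination: 1 - product of unreliabilities."""
    q = 1.0
    for r in rs:
        q *= (1.0 - r)
    return 1.0 - q

R1 = R2 = 0.9
R3 = R4 = 0.99
R5 = 0.87

first_half = parallel(R1, R2)                # 1 - 0.1 * 0.1 = 0.99
second_half = parallel(serial(R3, R4), R5)   # 1 - (1 - 0.9801) * (1 - 0.87)
print(round(serial(first_half, second_half), 3))  # -> 0.987
```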
