
Dependable Computer Systems
Part 5: Failure Modes and Models

Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Contents

• Recap from Part 3: Canonical Failure Classification
• Failure Mode Hierarchy
• Fault Hypothesis, Failure Semantics, and Assumption Coverage
• Failure Hypothesis Estimation
• Overview of Safety Analysis Methods
• Comparison of Safety Analysis Methods

Recap from Part 3: Canonical
Failure Classification

Failures

Recap:
A (service) failure is an event that occurs when the delivered
service deviates from correct service.
• Thus, a failure is a transition from correct service to incorrect
service.

Failure Mode Classification –
Overview
• Domain:
• content failure, early timing failure, late timing failure,
halt failure, erratic failure
• Detectability:
• signaled failures, unsignaled failures
• Consistency:
• consistent failure, inconsistent failure
• Consequences:
• minor failure, ..., catastrophic failure

Failure Mode Classification –
Domain
• Content failure
• Early timing failure
• Late timing failure
• Halt failure
– the external state becomes constant, i.e., system activity
is no longer perceptible to the users
– the silent failure mode is a special kind of halt failure in which
no service at all is delivered
• Erratic failure
– not a halt failure, e.g., a babbling-idiot failure

Failure Mode Classification –
Consistency
Applies when a service has more than one user.

• Consistent failure:
• All users experience the same incorrect service.

• Inconsistent failure:
• Different users experience different incorrect services.

Failure Mode Classification –
Consequences, e.g., Aircraft
Minor: 10⁻⁵ per flight hour or greater
no significant reduction of aeroplane safety, a slight reduction in the
safety margin
Major: between 10⁻⁵ and 10⁻⁷ per flight hour
significant reduction in safety margins or functional capabilities,
significant increase in crew workload or discomfort for occupants
Hazardous: between 10⁻⁷ and 10⁻⁹ per flight hour
large reduction in safety margins or functional capabilities, causes
serious or fatal injury to a relatively small number of occupants
Catastrophic: less than 10⁻⁹ per flight hour
these failure conditions would prevent the continued safe flight and
landing of the aircraft

Failure Mode Hierarchy

[Figure: nested failure-mode classes within the universe of possible behavior — Byzantine (fail-uncontrolled) ⊇ authentication-detectable Byzantine ⊇ performance ⊇ omission ⊇ crash ⊇ fail-stop.]

Based on the strength of the assumptions, the failure modes form a
hierarchy.
• Byzantine failures are based on the weakest assumption
(a non-assumption)
• fail-stop failures are based on the strongest assumptions
(only correct results, information about the last correct state in case
of a failure)
Failure Mode Hierarchy (cont.)

• Byzantine or arbitrary failures:
there is no restriction on the behavior at the system interface;
this mode is often called fail-uncontrolled
(“two-faced” behavior, forging of messages)

• Authentication-detectable Byzantine failures:
the only restriction on the behavior at the system interface is
that messages of other systems cannot be forged
(this failure mode applies only to distributed systems)

Failure Mode Hierarchy (cont.)

• Performance failures:
under this failure mode systems deliver correct results in the
value domain; in the time domain, results may be early or
late (early or late failures)

• Omission failures:
a special class of performance failures where results are
either correct or infinitely late (for distributed systems,
subdivided into send and receive omission failures)

Failure Mode Hierarchy (cont.)

• Crash failures:
a special class of omission failures where a system does not
deliver any subsequent results once it has exhibited an omission
failure
(the system is said to have crashed)

• Fail-stop failures:
in addition to the restriction to crash failures, it is required that
other (correct) systems can detect whether the system has
failed and can read the last correct state from stable storage

Fault Hypothesis, Failure Semantics,
and Assumption Coverage

Fault Hypothesis, etc.
Concepts

• Fault hypothesis:
The fault hypothesis specifies the anticipated faults which a
server must be able to handle (also called fault assumption).
• Failure semantics:
A server exhibits a given failure semantics if the probability of
failure modes which are not covered by the failure semantics
is sufficiently low.
• Assumption coverage:
Assumption coverage is defined as the probability that the
possible failure modes defined by the failure semantics of a
server prove to hold in practice, conditional on the fact that
the server has failed.
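
Written as a conditional probability (a sketch of the definition above; the notation is ours, not from the slides):

```latex
% Assumption coverage: the probability that, given the server has
% failed, its actual failure mode lies within the assumed failure
% semantics.
C = P(\text{failure mode} \in \text{failure semantics} \mid \text{server has failed})
```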
Fault Hypothesis, etc. (cont.)
Importance of assumption coverage
• The definition of a proper fault hypothesis and failure semantics,
and the achievement of sufficient assumption coverage, is one of the
most important factors in the design of a dependable system.

• If the fault hypothesis (or failure semantics) is violated, the
system may fail as a whole.

Fault Hypothesis, etc. (cont.)
Assumption Coverage Example
If component 1 or 2 violates its failure semantics, the system
fails, although it was designed to tolerate one component failure.

[Figure: input 1 → Component 1 (crash semantics) and input 2 → Component 2 (crash semantics), both feeding a voter that delivers the output.]
Fault Hypothesis, etc. (cont.)
The Titanic or: violated assumption coverage
• The fault hypothesis:
The Titanic was built to stay afloat if four or fewer of its
underwater compartments were flooded.
• Rationale of the fault hypothesis:
This assumption was reasonable since previously there had
never been an incident in which more than four
compartments of a ship were damaged.
• But:
Unfortunately, the iceberg ruptured five compartments, and the
following events went down in history.

Failure Hypothesis Estimation

Life-characteristics curve (Bathtub curve)
• For semiconductors, of the three components of the life-characteristics
curve, only infant mortality and the constant-failure-rate
region are of concern

[Figure: life-characteristics (bathtub) curve, showing the three components of failure.]

Semiconductor failure rate
• a typical failure rate distribution for semiconductors shows
that wear-out is of no concern

[Figure: typical semiconductor failure rate distribution.]

Stress Tests

• semiconductor failures are stress dependent
• the most influential stress factor is temperature

Stress Tests (cont.)
Arrhenius equation

• the basic relationship between the activation rate of failures
and temperature is described by the Arrhenius equation

$R = R_0 \, e^{-E_A / (kT)}$

R0 .. constant
T .. absolute temperature (K)
EA .. activation energy (eV)
k .. Boltzmann’s constant, 8.6·10⁻⁵ eV/K

[Figure: Arrhenius plot (EA = 1 eV).]

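Since R0 cancels in a ratio, the equation directly yields the acceleration factor between a use temperature and a stress temperature. A small Python sketch (temperatures illustrative, EA = 1 eV as in the plot above):

```python
from math import exp

K_BOLTZMANN_EV = 8.6e-5   # Boltzmann's constant in eV/K (as on the slide)

def arrhenius_rate(r0, e_a, t_kelvin):
    """Failure activation rate R = R0 * exp(-EA / (k*T))."""
    return r0 * exp(-e_a / (K_BOLTZMANN_EV * t_kelvin))

def acceleration_factor(e_a, t_use_k, t_stress_k):
    """R(T_stress) / R(T_use); the constant R0 cancels out."""
    return exp((e_a / K_BOLTZMANN_EV) * (1.0 / t_use_k - 1.0 / t_stress_k))

# Illustrative: field use at 55 degC, accelerated test at 125 degC.
print(acceleration_factor(1.0, 55 + 273.15, 125 + 273.15))  # ~ 500x
```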
Stress Tests (cont.)
Accelerated stress testing of semiconductors

• to remove freaks and infant-mortality failures (screening)
• to determine the expected failure rate

Accelerated conditions:
• accelerated temperature
• lowering of temperature
• cycling of temperature
• high temperature and current
• temperature and voltage stress
• α particles
• temperature, voltage and humidity stress
• high voltage gradients

Stress Tests (cont.)
Software stress

• For software there is no sound empirical or mathematical basis for
using stress as a method to characterize the behavior of components:
• it is currently unknown how to characterize stress for software
• it is impossible to carry out accelerated stress tests to examine
failure rates for software
• for software there is no relation such as the Arrhenius equation
which describes the activation rate of failures
• there is no general possibility to “over-engineer” a system to
handle conditions which are more stressful

Hardware/Software
Interdependence
• software depends on hardware:
• software requires hardware to execute
(e.g. Intel’s Pentium bug)

• hardware depends on software:
• VLSI design uses software tools
• PCB layout and routing by software tools
• EMC analysis by software tools
• hardware testers are software driven

Overview of Safety Analysis
Methods

Safety Analysis (cont.)
Concepts
System safety: a subdiscipline of systems engineering that
applies scientific, management, and engineering principles to
ensure adequate safety, throughout the operational life cycle,
within the constraints of operational effectiveness, time and cost.

Safety: has been defined as “freedom from those conditions that
can cause death, injury, occupational illness, or damage to or
loss of equipment or property”. Safety has to be regarded as a
relative term.

Software safety: ensuring that the software will execute within a
system context without resulting in unacceptable risk.
Safety analysis
Overview

• includes the complete life cycle of the project/product
(specification, design, maintenance, modification, ...)
• definition of responsibilities
• communication with other groups
• complete documentation
• analysis of complex processes
• management procedures
(specialists, meetings, action reviews, time schedule, ...)

Safety Analysis (cont.)
Major topics of safety analysis

• which hazards? (hazard analysis)
• how? (accident sequencing)
• how likely? (quantitative analysis)

Safety Analysis (cont.)
Safety analysis methodologies

• Preliminary Hazards Analysis (PHA)
• Hazards and Operability Study (HAZOP)
• Action Error Analysis (AEA)
• Fault Tree Analysis (FTA)
• Event Tree Analysis (ETA)
• Failure Modes and Effect Analysis (FMEA) /
Failure Modes, Effect and Criticality Analysis (FMECA)
• Cause-consequence analysis

Safety Analysis (cont.)
Preliminary hazard analysis (PHA)

• The first step in any safety program is to identify hazards and
to categorize them with respect to criticality and probability:
• define system hazards
• define critical states and failure modes
• identify critical elements
• determine consequences of hazardous events
• estimate likelihood of hazardous events
• identify issues to be analyzed in more detail

Safety Analysis (cont.)
Hazards and Operability Study (HAZOP)
Based on a systematic search to identify deviations that may
cause hazards during system operation
Intention: for each part of the system a specification of the
“intention” is made
Deviation: a search for deviations from intended behavior which
may lead to hazards
Guide Words: Guide words on a check list are employed to
uncover different types of deviations
(NO, NOT, MORE, LESS, AS WELL AS, PART OF,
REVERSE, OTHER THAN)
Team: the analysis is conducted by a team, comprising different
specialists
Safety Analysis (cont.)
Example for HAZOP

• Intention: pump a specified amount of A to reaction tank B.
Pumping of A is complete before B is pumped over.

[Figure: piping diagram with tank A, reaction tank B, tank C, valves V1–V5, and a pump.]

NO or NOT
– the tank containing A is empty
– one of the pipe’s two valves V1 or V2 is closed
– the pump is blocked, e.g. with frozen liquid
– the pump does not work (switched off, no power, ...)
– the pipe is broken
CONSEQUENCE: serious, a possible explosion

MORE
– the pump has too high a capacity
– the opening of the control valve is too large
CONSEQUENCE: not serious, the tank gets overfilled

AS WELL AS
– valve V3 is open, another liquid or gas gets pumped
– contaminants in the tank
– A is pumped to another place (leak in the connecting pipe)
CONSEQUENCE: serious, a possible explosion

...
Safety Analysis (cont.)
Action Error Analysis (AEA)
Considers the operational, maintenance, control and supervision
actions performed by human beings. The potential mistakes in
individual actions are studied.
• list steps in operational procedures (e.g. “press button A”)
• identification of possible errors for each step, using a check-list of
errors
• assessment of the consequences of the errors
• investigations of causes of important errors
(action not taken, actions taken in wrong order, erroneous actions,
actions applied to wrong object, late or early actions, ... )
• analysis of possible actions designed to gain control over these
processes
• relevant for software in the area of user interface design
Safety Analysis (cont.)
Fault Tree Analysis (FTA)
A graphical representation of logical combinations of causes that
may lead to a hazard (top-event). Can be used as a quantitative
method.
• identification of hazards (top-events)
• analysis to find credible combinations which can lead to the
top-event
• graphical tree model of parallel and sequential faults
• uses a standardized set of symbols for Boolean logic
• expresses top-event as a consequence of AND/OR
combination of basic events
• minimal cut sets are used for quantitative analysis (see the sketch below)

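A minimal sketch of the quantitative step (event names and probabilities are ours, purely illustrative): assuming rare, independent basic events, the top-event probability is approximated by summing, over all minimal cut sets, the product of the probabilities of the events in each set:

```python
# Sketch: top-event probability from minimal cut sets, under the
# rare-event approximation with independent basic events.
from math import prod

basic_event_prob = {"relay_stuck": 1e-4, "sensor_fail": 5e-4,
                    "software_fault": 1e-3, "valve_stuck": 2e-4}

# A minimal cut set is a smallest set of basic events whose joint
# occurrence causes the top event (AND within a set, OR across sets).
minimal_cut_sets = [{"relay_stuck"},
                    {"sensor_fail", "software_fault"},
                    {"valve_stuck", "software_fault"}]

p_top = sum(prod(basic_event_prob[e] for e in cs) for cs in minimal_cut_sets)
print(f"P(top event) ~ {p_top:.2e}")   # ~1.0e-4 for these values
```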
Safety Analysis (cont.)
[Figure: symbols used in fault tree analysis.]

Safety Analysis (cont.)
An Example for fault tree analysis

In a container two chemicals react with each other over a period of 10 hours at a
temperature of 125 °C. If the temperature exceeds 175 °C, toxic gas is emitted.
The temperature is controlled by a computer system.

[Figure: schematic of the controlled container — a computer system reads the temperature (T = 125 °C), raises an alarm, and switches the heating via a relay, valve and power supply.]
Safety Analysis (cont.)
An Example for fault tree analysis (cont.)

Identification of the top event:
emission of poisonous gas is the top event.

[Figure: the upper part of the fault tree.]
Safety Analysis (cont.)
An Example for fault tree analysis (cont.)

[Figure: subtree for the temperature measurement failure.]

Safety Analysis (cont.)
An Example for fault tree analysis (cont.)

[Figure: subtree for the heating cut-off failure.]

Safety Analysis (cont.)
Event Tree Analysis (ETA)

Models the potential consequences of faults, which are
considered as events. Can be used as a quantitative method.

• identification of basic events
• start with a basic event and describe the possible consequences
of this event
• binary decision for the consequences of events
• the opposite of FTA, which starts with the top event
(a quantitative sketch follows below)

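A small sketch of the quantitative use (barrier names and probabilities are ours, purely illustrative): starting from an initiating event, each subsequent safety function branches binarily into success or failure, and multiplying the probabilities along a path yields the frequency of that consequence:

```python
# Sketch: event tree as binary branching from an initiating event.
initiating_event_freq = 1e-3                       # per year, illustrative
barriers = [("alarm works", 0.99), ("operator shuts down", 0.9)]

def event_tree(freq, remaining):
    """Enumerate all branch sequences with their resulting frequencies."""
    if not remaining:
        yield [], freq
        return
    (name, p_success), *rest = remaining
    for path, f in event_tree(freq * p_success, rest):        # barrier works
        yield [f"{name}: yes"] + path, f
    for path, f in event_tree(freq * (1 - p_success), rest):  # barrier fails
        yield [f"{name}: no"] + path, f

for path, f in event_tree(initiating_event_freq, barriers):
    print(f"{' -> '.join(path)}: {f:.2e} / year")
```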
Safety Analysis (cont.)
Failure Modes and Effect Analysis (FMEA)

A common method where the designer, in a systematic way,
has to answer the questions “How can the component fail?” and
“What happens then?”.

• the system is divided up into different components in the form
of a block diagram
• failure modes are identified for all components
• causes, consequences and the significance of failures are
assessed for each failure mode

Safety Analysis (cont.)
Failure Modes and Effect Analysis (FMEA) (cont.)

• an investigation is made into how the failure can be detected
• if necessary, recommendations for suitable control measures
are made
• the analysis is supported by tabular sheets (e.g. IEC
standard, 1985)
• failure modes, effects and criticality analysis (FMECA) puts
special emphasis on the criticality aspect

Safety Analysis (cont.)
An example FMEA hazard assessment

Severity of consequence:
• 10–9 Very severe: system operation has to be abandoned or even a
safety-critical state may be reached
• 8–7 Severe: failure causes disturbance of the end user (no safety-critical
failures or violations of regulations)
• 6–4 Moderate: failure causes inconvenience to the end user; restricted
system operation will be perceived by the customer
• 3–2 Minor: failure causes only minor inconvenience to the end user; only
minor restrictions of system operation are perceivable
• 1 Improbable: it is very improbable that the failure will be perceived by
the end user

Probability of occurrence:
• 10–9 High (500·10⁻⁶): it is almost certain that the failure will occur
with high probability
• 8–7 Moderate (50·10⁻⁶): the component is similar to component designs
which have already caused problems in the past
• 6–4 Small (5·10⁻⁶): the component is similar to component designs which
have caused problems in the past, but the extent of the problems was
relatively low
• 3–2 Very small (100·10⁻⁹): the component is similar to component designs
which had very low failure rates in the past
• 1 Improbable (1·10⁻⁹): it is very improbable that a failure occurs

Probability of detection:
• 10 Improbable: it is impossible or very improbable that the failure can
be detected
• 9 Very low: it is possible to detect the fault before the system fails
• 8–6 Small
• 5–3 Moderate
• 2 High
• 1 Very high: it is certain that the fault gets detected before the
system fails
Safety Analysis (cont.)
An example FMEA hazard assessment (cont.)

Function: speed sensor
Columns: Failure Mode | Cause | Effect | Controls | Severity | Probability | Detection | Product

open | connector or harness | no operation possible | supplier quality control and end-of-line testing | 9 | 4 | 3 | 108
open | computer | no operation possible | computer supplier quality control and end-of-line testing | 9 | 3 | 3 | 81
open | sensor | no operation possible | sensor module supplier quality control and end-of-line testing | 9 | 4 | 3 | 108
short to supply | connector or harness | no operation possible | supplier quality control and end-of-line testing | 9 | 2 | 3 | 54
short to supply | computer | no operation possible | computer supplier quality control and end-of-line testing | 9 | 2 | 3 | 54
short to supply | sensor | no operation possible | sensor module supplier quality control and end-of-line testing | 9 | 2 | 3 | 54
short to ground | connector or harness | no operation possible | supplier quality control and end-of-line testing | 9 | 1 | 3 | 27
short to ground | computer | no operation possible | computer supplier quality control and end-of-line testing | 9 | 1 | 3 | 27
short to ground | sensor | no operation possible | sensor module supplier quality control and end-of-line testing | 9 | 1 | 3 | 27

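The “Product” column is the risk priority number, RPN = severity × occurrence (probability) × detection. A minimal sketch checking the first and last rows above:

```python
# Sketch: risk priority number as used in the FMEA table above.
def rpn(severity, occurrence, detection):
    """RPN = S * O * D; higher numbers are addressed first."""
    return severity * occurrence * detection

assert rpn(9, 4, 3) == 108   # speed sensor, open, connector or harness
assert rpn(9, 1, 3) == 27    # the short-to-ground rows
```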
Safety Analysis (cont.)
Cause-consequence analysis

A combination of fault tree analysis and event tree analysis:

• starts at a critical event
• works forward by using event tree analysis (consequences)
• works backward by using fault tree analysis (causes)
• very flexible
• well-documented method

Comparison of Safety Analysis
Methods

Comparison of Safety Analysis
Methods
Method: Preliminary hazards analysis
Advantages: A required first step.
Restrictions and deficiencies: None.

Method: Hazards and operability study
Advantages: Suitable for large chemical plants. Results in a list of
actions, design changes and cases identified for more detailed study.
Enhances the information exchange between system designers, process
designers and operating personnel.
Restrictions and deficiencies: The technique is not well standardized and
described in the literature. Most often applied to continuous processes.

Method: Action error analysis
Advantages: Gives the computer system designer proposals for proper
interface design. Helps the personnel or users to monitor the process
during operation and helps to prevent operator mistakes.
Restrictions and deficiencies: AEA is an analysis of the technical system,
and does not analyze the behavior of operators. The thoughts and intentions
of human beings, i.e. the reasons for mistakes, are not considered.
Comparison of Safety Analysis
Methods (cont.)
Method: Fault tree analysis
Advantages: A well-accepted technique. Very good for finding failure
relationships. A fault-oriented technique which looks for the ways a system
can fail. Makes it possible to verify requirements which are expressed as
quantitative risk values.
Restrictions and deficiencies: Large fault trees are difficult to
understand, bear no resemblance to system flow charts, and are
mathematically not unique. It assumes that all failures are of a binary
nature, i.e. a component completes successfully or fails completely.

Method: Event tree analysis
Advantages: Can identify effect sequences and alternative consequences of
failures. Allows analysis of systems with stable sequences of events and
independent events.
Restrictions and deficiencies: Fails in the case of parallel sequences. Not
suitable for detailed analysis due to combinatorial explosion. Pays no
attention to extraneous, incomplete, early or late actions.

Comparison of Safety Analysis
Methods (cont.)
Method: Failure modes and effects analysis
Advantages: An easily understood, well-accepted, standardized technique.
Non-controversial, non-mathematical. Studies potential failures and their
effects on the function of the system.
Restrictions and deficiencies: Examines non-dangerous failures and is
therefore time consuming. Combinations of failures and human factors are
often not considered. It is difficult to consider multiple and simultaneous
failures.

Method: Cause-consequence analysis
Advantages: An extremely flexible and all-encompassing methodology. Well
documented. Sequential paths for critical events are clearly shown.
Restrictions and deficiencies: Cause-consequence diagrams become too large
very quickly (as for FTA and ETA). They have many of the disadvantages of
fault tree analysis.

Problems with software safety
analysis
• relatively new field
• lack of systematic engineering discipline
• no agreed or proven methodologies
• time and cost
• complexity
(understanding of the problem domain, separation of knowledge)
• discrete nature of software
(difficulties with large discrete state spaces)
• real-time aspects
(concurrency and synchronization)
• (partially) invalid assumption of independent failures

