Part 5: Failure Modes and Models
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Contents
Recap from Part 3: Canonical
Failure Classification
Failures
Recap:
A (service) failure is an event that occurs when the delivered
service deviates from correct service.
• Thus, a failure is a transition from correct service to incorrect
service.
Failure Mode Classification –
Overview
• Domain:
• content, early timing failure, late timing failure, halt failure,
erratic failure
• Detectability:
• signaled failures, unsignaled failures
• Consistency:
• consistent failure, inconsistent failure
• Consequences:
• minor failure, ..., catastrophic failure
Failure Mode Classification –
Domain
• Content
• Early timing failure
• Late timing failure
• Halt failure
the external state becomes constant, i.e., system activity
is no longer perceptible to the users
silent failure mode is a special kind of halt failure in that
no service at all is delivered
• Erratic failure
a service is delivered but is erratic (not a halt failure), e.g., a
babbling-idiot failure
Failure Mode Classification –
Consistency
Applies when a service has more than one user.
• Consistent failure:
• All users experience the same incorrect service.
• Inconsistent failure:
• Different users experience different incorrect services.
Failure Mode Classification –
Consequences, e.g., Aircraft
Minor: 10^-5 per flight hour or greater
no significant reduction of aeroplane safety; a slight reduction in the
safety margin
Major: between 10^-5 and 10^-7 per flight hour
significant reduction in safety margins or functional capabilities;
significant increase in crew workload or discomfort for occupants
Hazardous: between 10^-7 and 10^-9 per flight hour
large reduction in safety margins or functional capabilities; causes
serious or fatal injury to a relatively small number of occupants
Catastrophic: less than 10^-9 per flight hour
these failure conditions would prevent the continued safe flight and
landing of the aircraft
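These bands amount to a simple threshold classification. A minimal sketch (a hypothetical helper, not part of the course material) that maps a failure probability per flight hour onto the consequence classes above:

```python
def severity_class(prob_per_flight_hour: float) -> str:
    """Map a failure probability per flight hour onto the
    consequence classes used in aircraft certification."""
    if prob_per_flight_hour >= 1e-5:
        return "minor"
    if prob_per_flight_hour >= 1e-7:
        return "major"
    if prob_per_flight_hour >= 1e-9:
        return "hazardous"
    return "catastrophic"

print(severity_class(1e-4))   # minor
print(severity_class(1e-10))  # catastrophic
```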
Failure Mode Hierarchy
[Figure: nested failure-mode classes, from the universe of possible
behavior (Byzantine / fail-uncontrolled) down through omission and
crash to fail-stop failures]
Failure Mode Hierarchy (cont.)
• Performance failures:
under this failure mode a system delivers correct results in the
value domain, but in the time domain results may be early or
late (early or late failures)
• Omission failures:
a special class of performance failures where results are
either correct or infinitely late (for distributed systems
subdivision in send and receive omission failures)
Failure Mode Hierarchy (cont.)
• Crash failures:
a special class of omission failures where a system does not
deliver any subsequent results if it has exhibited an omission
failure once
(the system is said to have crashed)
• Fail-Stop failures:
in addition to the restriction to crash failures, it is required
that other (correct) systems can detect whether the system
has failed and can read its last correct state from stable
storage
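The relationship between omission and crash failures can be illustrated with a small simulation (hypothetical code, not from the course): a service whose results are either correct or omitted, and that never delivers another result after its first omission, exhibits crash semantics.

```python
import random

class CrashSemanticsService:
    """Toy service with crash semantics: results are either correct
    or omitted (None), and after the first omission the service
    stays silent forever."""

    def __init__(self, omission_prob: float, seed: int = 1):
        self.rng = random.Random(seed)
        self.omission_prob = omission_prob
        self.crashed = False

    def request(self, x: int):
        if self.crashed:          # crash semantics: silent forever
            return None
        if self.rng.random() < self.omission_prob:
            self.crashed = True   # first omission => crashed
            return None
        return 2 * x              # correct result in the value domain

svc = CrashSemanticsService(omission_prob=0.3)
results = [svc.request(i) for i in range(10)]
# Every result before the first None is correct; everything from
# the first None onwards is None.
```

A fail-stop service would, in addition, let other components detect the crash and read the last correct state from stable storage.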
Fault-Hypothesis, Failure Semantics,
and Assumption Coverage
Fault-Hypothesis, etc.
Concepts
• Fault hypothesis:
The fault hypothesis specifies the anticipated faults which a
server must be able to handle (also called fault assumption).
• Failure semantics:
A server exhibits a given failure semantics if the probability of
failure modes which are not covered by the failure semantics
is sufficiently low.
• Assumption coverage:
Assumption coverage is defined as the probability that the
failure modes defined by the failure semantics of a server
prove to hold in practice, conditional on the fact that the
server has failed.
Fault-Hypothesis, etc. (cont.)
Importance of assumption coverage
• The definition of a proper fault hypothesis and failure
semantics, and the achievement of sufficient assumption
coverage, are among the most important factors in designing
a dependable system.
Fault-Hypothesis, etc. (cont.)
Assumption Coverage Example
If component 1 or 2 violates its failure semantics, the system
fails, although it was designed to tolerate one component failure.
[Figure: duplex system; input 1 → Component 1 (crash semantics) and
input 2 → Component 2 (crash semantics) both feed a voter, which
produces the output]
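The design can be sketched in a few lines (hypothetical code, not from the slides). Because each component is assumed to have crash semantics, i.e. it delivers either a correct value or nothing at all, the voter may simply take any result it receives; a component that violates this semantics defeats the design:

```python
def voter(result_1, result_2):
    """Duplex voter that relies on crash semantics: a failed
    component is assumed to return None, never a wrong value."""
    return result_1 if result_1 is not None else result_2

# Component 2 has crashed: the system still delivers the correct output.
assert voter(42, None) == 42
assert voter(None, 42) == 42

# Component 1 violates its failure semantics and returns a wrong value
# instead of crashing; the voter cannot detect this, so the system
# fails even though only one component is faulty.
assert voter(13, 42) == 13
```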
Fault-Hypothesis, etc. (cont.)
The Titanic or: violated assumption coverage
• The fault hypothesis:
The Titanic was built to stay afloat if four or fewer of its
underwater compartments were flooded.
• Rationale of fault hypothesis:
This assumption was reasonable since previously there had
never been an incident in which more than four
compartments of a ship were damaged.
• But:
Unfortunately, the iceberg ruptured five compartments, and the
subsequent events went down in history.
Failure Hypothesis Estimation
Life-characteristics curve (Bathtub curve)
• For semiconductors, of the three regions of the life-characteristics
curve (infant mortality, constant failure rate, wear-out), only infant
mortality and the constant-failure-rate region are of concern
Semiconductor failure rate
• A typical failure-rate distribution for semiconductors shows
that wear-out is of no concern
Stress Tests
Stress Tests (cont.)
Arrhenius equation
R = R0 · e^(−EA / (k·T))

R0 .. constant
T .. absolute temperature (K)
EA .. activation energy (eV)
k .. Boltzmann’s constant, 8.6 × 10^-5 eV/K
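The ratio of the reaction rates at two temperatures gives the temperature acceleration factor used in accelerated testing. A sketch of the computation (the activation energy and temperatures are illustrative values, not from the slides):

```python
import math

K_BOLTZMANN = 8.6e-5  # Boltzmann's constant in eV/K, as given above

def acceleration_factor(e_a_ev: float, t_use_c: float, t_stress_c: float) -> float:
    """Arrhenius acceleration factor R(T_stress)/R(T_use), for an
    activation energy in eV and temperatures in degrees Celsius."""
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    return math.exp((e_a_ev / K_BOLTZMANN) * (1.0 / t_use_k - 1.0 / t_stress_k))

# E.g. a mechanism with EA = 0.7 eV, stressed at 125 degC instead of
# operating at 55 degC, ages roughly 80 times faster.
af = acceleration_factor(0.7, 55.0, 125.0)
```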
Stress Tests (cont.)
Accelerated stress testing of semiconductors
Accelerated conditions:
• elevated temperature
• lowering of temperature
• temperature cycling
• high temperature and current
• temperature and voltage stress
• particles
• temperature, voltage and humidity stress
• high voltage gradients
Stress Tests (cont.)
Software stress
Hardware/Software
Interdependence
• software depends on hardware:
• software requires hardware to execute
(e.g. Intel’s Pentium bug)
Overview of Safety Analysis
Methods
Safety Analysis (cont.)
Concepts
System Safety: is a subdiscipline of system engineering that
applies scientific, management, and engineering principles to
ensure adequate safety, throughout the operational life cycle,
within the constraints of operational effectiveness, time and cost.
Safety Analysis (cont.)
Major topics of Safety analysis
Safety Analysis (cont.)
Safety analysis methodologies
Safety Analysis (cont.)
Preliminary hazard analysis (PHA)
Safety Analysis (cont.)
Hazards and Operability Study (HAZOP)
Based on a systematic search to identify deviations that may
cause hazards during system operation
Intention: for each part of the system a specification of the
“intention” is made
Deviation: a search for deviations from intended behavior which
may lead to hazards
Guide Words: Guide words on a check list are employed to
uncover different types of deviations
(NO, NOT, MORE, LESS, AS WELL AS, PART OF,
REVERSE, OTHER THAN)
Team: the analysis is conducted by a team, comprising different
specialists
Safety Analysis (cont.)
Example for HAZOP
• Intention: pump a specified amount of A to reaction tank B.
Pumping of A is complete before B is pumped over.

[Figure: piping diagram with tanks A, B and C, a pump, and valves V1–V5]

NO or NOT
– the tank containing A is empty
– one of the pipe’s two valves V1 or V2 is closed
– the pump is blocked, e.g. with frozen liquid
– the pump does not work (switched off, no power, ...)
– the pipe is broken
CONSEQUENCE: serious, a possible explosion

MORE
– the pump has too high a capacity
– the opening of the control valve is too large
CONSEQUENCE: not serious, the tank gets overfilled

AS WELL AS
– valve V3 is open, another liquid or gas gets pumped
– contaminants in the tank
– A is pumped to another place (leak in the connecting pipe)
CONSEQUENCE: serious, a possible explosion
...
Safety Analysis (cont.)
Action Error Analysis (AEA)
Considers the operational, maintenance, control and supervision
actions performed by human beings. The potential mistakes in
individual actions are studied.
• list steps in operational procedures (e.g. “press button A”)
• identification of possible errors for each step, using a check-list of
errors
• assessment of the consequences of the errors
• investigations of causes of important errors
(action not taken, actions taken in wrong order, erroneous actions,
actions applied to wrong object, late or early actions, ... )
• analysis of possible actions designed to regain control over the
process
• relevant for software in the area of user interface design
Safety Analysis (cont.)
Fault Tree Analysis (FTA)
A graphical representation of logical combinations of causes that
may lead to a hazard (top-event). Can be used as a quantitative
method.
• identification of hazards (top-events)
• analysis to find credible combinations which can lead to the
top-event
• graphical tree model of parallel and sequential faults
• uses a standardized set of symbols for Boolean logic
• expresses top-event as a consequence of AND/OR
combination of basic events
• minimal cut set is used for quantitative analysis
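As a sketch of the quantitative side (event names and probabilities are hypothetical, not from the slides), the top-event probability can be approximated from the minimal cut sets with the rare-event approximation: sum, over all minimal cut sets, the product of the basic-event probabilities in each set.

```python
from math import prod

# Hypothetical basic events with illustrative failure probabilities.
basic_events = {
    "relay_stuck": 1e-4,
    "power_supply_fails": 1e-5,
    "sensor_drift": 1e-3,
    "alarm_fails": 1e-3,
}

# Each minimal cut set is a smallest set of basic events that
# together cause the top event.
minimal_cut_sets = [
    {"relay_stuck", "alarm_fails"},
    {"sensor_drift", "alarm_fails"},
    {"power_supply_fails"},
]

# Rare-event approximation of the top-event probability.
p_top = sum(prod(basic_events[e] for e in cut) for cut in minimal_cut_sets)
```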
Safety Analysis (cont.)
Symbols used in fault tree analysis
Safety Analysis (cont.)
An Example for fault tree analysis
In a container two chemicals react with each other over a period of 10 hours at a
temperature of 125 °C. If the temperature exceeds 175 °C toxic gas is emitted.
The temperature is controlled by a computer system.
[Figure: container held at T = 125 °C; a computer system monitors the
temperature and controls an alarm and, via a relay and power supply,
a valve]
Safety Analysis (cont.)
An Example for fault tree analysis (cont.)
Safety Analysis (cont.)
Event Tree Analysis (ETA)
Safety Analysis (cont.)
Failure Modes and Effect Analysis (FMEA)
Safety Analysis (cont.)
Failure Modes and Effect Analysis (FMEA) (cont.)
Safety Analysis (cont.)
An example FMEA hazard assessment
[Table: rating scales for severity of consequence, probability of
occurrence, and probability of detection]
Safety Analysis (cont.)
An example FMEA worksheet row
Item: speed sensor harness
Failure mode: open connector
Effect: no operation possible
Current controls: supplier quality control and end-of-line testing
Severity: 9, Occurrence: 4, Detection: 3, RPN: 9 × 4 × 3 = 108
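The risk priority number (RPN) in such a worksheet row is simply the product of the three ratings; a minimal helper (hypothetical code, not from the slides):

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk priority number: product of the three 1-10 FMEA ratings."""
    for rating in (severity, occurrence, detection):
        if not 1 <= rating <= 10:
            raise ValueError("FMEA ratings are on a 1-10 scale")
    return severity * occurrence * detection

# The speed-sensor-harness example: severity 9, occurrence 4, detection 3.
assert rpn(9, 4, 3) == 108
```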
Safety Analysis (cont.)
Cause-consequence analysis
Comparison of Safety Analysis
Methods
Method: Hazards and operability study (HAZOP)
Advantages: Suitable for large chemical plants. Results in a list of
actions, design changes and cases identified for more detailed study.
Enhances the information exchange between system designers, process
designers and operating personnel.
Restrictions and deficiencies: Technique is not well standardized and
described in the literature. Most often applied to continuous
processes.

Method: Fault tree analysis
Advantages: Well accepted technique. Very good for finding failure
relationships. A fault-oriented technique which looks for the ways a
system can fail. Makes it possible to verify requirements which are
expressed as quantitative risk values.
Restrictions and deficiencies: Large fault trees are difficult to
understand, bear no resemblance to system flow charts, and are
mathematically not unique. It assumes that all failures are of a
binary nature, i.e. a component completes successfully or fails
completely.

Method: Event tree analysis
Advantages: Can identify effect sequences and alternative consequences
of failures. Allows analysis of systems with stable sequences of
events and independent events.
Restrictions and deficiencies: Fails in case of parallel sequences.
Not suitable for detailed analysis due to combinatorial explosion.
Pays no attention to extraneous, incomplete, early or late actions.
Comparison of Safety Analysis
Methods (cont.)
Problems with software safety
analysis
• relatively new field
• lack of systematic engineering discipline
• no agreed or proven methodologies
• time and cost
• complexity
(understanding of the problem domain, separation of knowledge)
• discrete nature of software
(difficulties with large discrete state spaces)
• real-time aspects
(concurrency and synchronization)
• (partially) invalid assumption of independent failures