Adelard Safety Case Development Manual: ASCAD
/az-kad/
Adelard, 1998
Foreword
Following research for the UK HSE/NII in the 1990s, Adelard published its Safety
Case Development Manual (ASCAD) in 1998. It has since been used successfully in
many organisations worldwide.
In support of the safety community, Adelard has decided to make the manual
publicly available. It can be downloaded, after registration, from our website:
http://www.adelard.com/resources/ascad
While now available free of charge to individuals, copyright is retained by
Adelard. Conditions of use are:
The manual may only be used by the individual who downloads the
document. It may not be passed on to anyone else without permission
from Adelard. Other interested parties should download the document
from our website. Anyone who has difficulty downloading the document
should contact Adelard to discuss other options.
The manual may be used freely by registered users, both for commercial
and non-commercial use.
While Adelard believes the content to be accurate, it accepts no
responsibility for any consequence of use, either direct or indirect. Use of
the manual implies acceptance of this and all other conditions.
The content of the manual may not be reproduced in any format (other
than for backup purposes) without agreement from Adelard in writing.
The document may be used in support of both academic teaching and
research, and in both cases some of the above restrictions may be
waived. Contact <office@adelard.com> for more information.
The document is available free of charge in softcopy only. Hard copy
versions are available at a nominal reproduction charge. Contact
<office@adelard.com> for more information.
ISBN 0 9533771 0 5
Adelard
Adelard is an independent consultancy founded in 1987 by Robin Bloomfield and
Peter Froome. Adelard works on a wide spectrum of problems in the area of the
assurance and development of safety-related computer-based systems, ranging
from formal machine assisted verification to the human and social vulnerabilities
of organisations. We also apply this specialist knowledge to the development and
verification of real industrial systems.
http://www.adelard.com
Adelard of Bath
Adelard takes its name from Adelard of
Bath, a medieval mathematician and
natural philosopher, a crucial figure in the
development of early European thought,
and a major influence in the
revolutionary adoption of the Arabic
notation for numbers instead of the
intractable Roman numerals.
Adelard's most influential works were on mathematics. He translated Euclid's
Elements (still the basis of much of today's mathematics) from Arabic into Latin,
the international language of European scholarship. He was also the author of a
Latin version of a treatise on Arabic arithmetic by al-Khwarizmi, the great Saracen
mathematician whose name, corrupted to "algorism", became the European word
for the new system of numbers.
Version: 1.0
Contents
Part 1 Introduction
1 Scope
2 What is a safety case?
3 The importance of a good safety case
4 Basis of the ASCAD methodology
5 How to use the manual
6 Feedback
7 Acknowledgements
Part 2 Description of the safety case methodology
1 Introduction
2 Overview of approach
2.1 Safety case principles
2.2 Safety case structure
2.3 Types of claim
2.4 Sources of evidence
2.5 Style of argument
3 Safety case development
3.1 Safety case elements
4 Developing Preliminary safety case elements
4.1 Definition of system and project
4.1.1 Operating context
4.1.2 Identify any defined PES (Programmable Electronic System) or component safety requirements
4.1.3 Existing safety and project information
Part 1 Introduction
1 Scope
This manual defines the Adelard safety case development methodology (ASCAD),
which seeks to minimise safety and commercial risks by constructing a
demonstrable safety case. The ASCAD methodology places the main emphasis
on claims about the behaviour of the system (i.e. functional behaviour and system
attributes) and on methods for structuring safety arguments so that they are both
understandable and traceable.
The overall approach used in ASCAD is generic and applicable across a wide
range of technologies. The details of the approach are concerned with safety
cases for computer-based command, control and protection systems, such as
those found in railway signalling, nuclear reactor protection, air traffic control and
safety-critical medical devices, as well as many diverse military applications.
ASCAD can be applied both to new systems, using bespoke or COTS
components, and to the retrospective development of safety cases.
Many problems in producing an acceptable safety case arise from an
attitude that regards the safety case as a bolt-on accessory to the system
(often produced after the system has been built). At that stage it is often
discovered that retro-fitting the supporting safety case is expensive and
time-consuming, because the design does not minimise the scope of assessment
and evidence is costly to produce retrospectively. The overall ASCAD
approach can be applied to existing systems, but the safety case options are
more constrained.
The manual assumes that the reader is familiar with the concepts of safety
management systems, quality management systems and safety analysis in
general. There is already a large body of guidance in these areas and the
uniqueness of this manual is its emphasis on addressing the construction of safety
cases. We also assume a familiarity with the system safety context as elaborated
in Appendix A.
A generalised form in Def Stan 00-42 Part 2, as the software reliability case
The approach has evolved during this period, but largely through extensions to
the methodology rather than changes to earlier ideas. While the methodology is
likely to evolve further, we believe that the current ASCAD provides a good basis
for safety case development.
6 Feedback
We are keen to receive feedback on this manual. Please send comments to
ascad@adelard.co.uk, see our www page at http://www.adelard.co.uk or write
to Robin Bloomfield, Adelard, 3 Coborn Road, London E3 2DA.
7 Acknowledgements
The manual was produced by Peter Bishop, Robin Bloomfield, Luke Emmet, Claire
Jones and Peter Froome. Some of the underlying technical work was undertaken
in the CEC sponsored SHIP project (ref. EV5V 103). More recent material has come
from the Quarc project funded by the UK (Nuclear) Industrial Management
Committee (IMC) Nuclear Safety Research Programme under Scottish Nuclear
contracts 70B/0000/006384 and PP/74851/HN/MB.
2 Overview of approach
2.1 Safety case principles
We define a safety case as:
a documented body of evidence that provides a demonstrable and
valid argument that a system is adequately safe for a given application
and environment over its lifetime.
To implement a safety case we need to:
make an explicit set of claims about the system
produce the supporting evidence
provide a set of safety arguments that link the claims to the
evidence
make clear the assumptions and judgements underlying the
arguments
allow different viewpoints and levels of detail
The following sections describe how we think a safety case should be structured
to meet these goals.
A safety case consists of the following elements: a claim about a property of the
system or some subsystem; evidence which is used as the basis of the safety
argument; an argument linking the evidence to the claim, and an inference
mechanism that provides the transformational rules for the argument. This is
summarised in the figure below.
[Figure: argument structure. A claim is supported via an inference rule by evidence and by subclaims, which are themselves supported by further inference rules and evidence.]
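As an illustration, the elements of this structure can be represented as a simple tree; the following Python sketch (the class names are illustrative, not part of the methodology) shows a claim supported by arguments that cite evidence or subclaims:

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    description: str                 # e.g. "statistical test results"

@dataclass
class Claim:
    statement: str                   # property claimed of the system or subsystem
    arguments: list["Argument"] = field(default_factory=list)

@dataclass
class Argument:
    inference_rule: str              # transformational rule linking support to claim
    support: list = field(default_factory=list)   # Evidence and/or subsidiary Claims

# A top-level claim supported by a subclaim, which is in turn backed by evidence.
sub = Claim("Dangerous failure rate is below target",
            [Argument("statistical inference", [Evidence("reliability test results")])])
top = Claim("The system is adequately safe for the given application",
            [Argument("all identified hazards are mitigated", [sub])])
```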
The choice of argument will depend on the available evidence and the type of
claim. For example, claims for reliability would normally be supported by statistical
arguments, while other claims (e.g. for maintainability) might rely on more
qualitative arguments such as adherence to codes of practice.
In addition, the overall argument should be robust, i.e. it should remain valid even if
there are uncertainties or errors. For example, two independent arguments could
be used to support the top-level safety claim about a given system. Alternatively,
if there are two independent systems that can assure safety, it may only be
necessary to have a single argument for each one. Typically the strength of the
argument will depend on the integrity level associated with the specific system. At
the highest integrity level (Level 4) we might expect two independent arguments
for a single system regardless of the existence of other systems, as illustrated in
the figure below.
[Figure: a claim supported by two independent arguments: Argument 1 based on Evidence A and Evidence B, and Argument 2 based on Evidence C]
assumptions, which are necessary to make the argument, but may not
always apply in the real world
sub-claims, derived from a lower-level sub-argument
fail-safety
functional correctness
accuracy
time response
robustness to overload
maintainability
modifiability
The relevant attributes should be identified and, where possible, quantified. Note
that the attributes listed are only examples and further attributes may be safety-relevant. This is elaborated later in Section 4.2.1.
the design
The choice of argument will depend in part on the availability of such evidence,
e.g. claims for reliability might be based on field experience for an established
design, and on development processes and reliability testing for a new design.
[Figure: failure behaviour state model. Fault activation takes the OK state to the error state; error correction returns it to the OK state; safe failure moves the error state to the safe state, and dangerous failure moves it to the danger state. The transitions depend on fail-safe design, partitioning and the existence of safe states.]
A particular safety argument can focus on claims about particular transition arcs.
The main approaches are listed below:
A fault elimination argument can increase the chance of being in the
perfect state and can hence reduce or eliminate the OK → erroneous
transition. This is the reasoning behind the requirement to use formal methods
(e.g. in MOD DS 00-55), which essentially supports a claim that the error
transition rate is zero because the software correctly implements the
specified logical behaviour.
A failure containment argument can strengthen the erroneous → OK or
erroneous → safe transition. An example would be a strongly fail-safe design
which quantifies the fail-safe bias. This, coupled with test evidence bounding
the error activation rate, would be sufficient to bound the dangerous failure
rate (see the sketch after this list).
A failure rate estimation argument can estimate the OK → dangerous
transition. The whole system is treated as a black box, and probabilistic
arguments are made about the observed failure rate based on past
experience or extensive reliability testing.
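As a minimal numeric sketch of the failure containment argument (the rates and bias below are illustrative assumptions, not values from the manual):

```python
# Test evidence bounds the rate at which errors are activated; the quantified
# fail-safe bias gives the fraction of activated errors driven to the safe state.
error_activation_rate = 1e-3   # activations per hour (assumed bound from testing)
fail_safe_bias = 0.999         # assumed fraction of failures that end safely

# The dangerous failure rate is then bounded by the activated errors that
# escape the fail-safe mechanism.
dangerous_failure_rate = error_activation_rate * (1.0 - fail_safe_bias)
print(f"{dangerous_failure_rate:.1e} dangerous failures per hour")  # 1.0e-06
```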
It is also possible to apply the arguments selectively to particular components or
fault classes, e.g.:
A design incorporates a safety barrier, which can limit dangerous failures
occurring in the remainder of the system. The safety argument would then
focus on the reliability of the barrier rather than the whole system.
Different countermeasures might be utilised for different classes of fault.
Each fault class then represents a separate link in the argument chain, and
all fault classes would have to be covered to complete the argument chain.
For example, design faults might be demonstrated to be absent by formal
development, while random hardware failures are covered by hardware
redundancy.
While normally applied to incorrect logical behaviour, the same approach can
be applied to many of the other safety attributes. For instance, to ensure
timeliness, timing errors could be:
the safe and hazardous plant states (or equipment states) and target
failure probabilities
hazardous / safe states of the interfaces
anticipated changes in external equipment, interfaces and operating
modes
any operational or maintenance requirements such as maintenance
levels, repair times, manning intervals
safety functions
reliability requirements
System attributes
Accuracy
Availability
Fail-safety
Logical correctness
Maintainability (e.g. MTTR)
Maximum input and output data rates
Maximum response time
Maximum storage capacity (e.g. permanent records)
Modifiability (with respect to identified functional changes)
Real-time performance
Reliability (e.g. MTTF, pfd)
Response to hardware failures
Response to internal failures
Response to overload (data rate, internal storage)
Security
Timeliness
Usability
Table 1: Computer system attributes
software or may be addressed by other parts of the system (e.g. fault tolerance
may be implemented entirely in hardware). In addition, the software
implementation must cope with the constraints imposed by the specific choice of
hardware.
Software attributes
Accuracy
Compliance with hardware constraints (e.g. memory
capacity)
Fail-safety
Fault tolerance
Logical correctness (sometimes represented by the
software integrity level)
Maintainability
Modifiability (with respect to identified functional
changes)
Reliability
Response to hardware failures
Response to internal failures
Response to overload (data rate, internal storage)
Time response
Table 2: Software attributes
[Figure: hierarchy of claims. The plant safety requirement (accident probability) is decomposed into safety functions 1 and 2; these are allocated through the system architecture to hardware functions and computer system functions (with dangerous failure rate and availability targets), and the computer system functions are in turn decomposed into software functions.]
[Table fragment: example design features supporting attributes. Modifiability: additional software to allow parameterisation. Availability: recovery routines, redundant channels, voting algorithms. Security: data encryption mechanisms, password authentication, network isolation (hardware and software measures).]
In this way layered safety cases are developed, i.e. a top-level safety case with
subsidiary traceable safety cases for subsystems.
[Table: format of the Architectural safety case tables, with columns Design Features, Assumption/Evidence and Subsystem Requirements. The Assumption/Evidence column records the evidence either needed (assumption) or used to substantiate the claim (see Section 5.2), and is used to document and trace assumptions (see Section 5.3). A following table fragment lists the argument headings Fault Avoidance, Error Tolerance and Fail-safe bias.]
[Figure: design with a pressure limit switch connected to the safety logic over a 1,000 metre link]
comparisons between sensors can identify failures and hence improve fault
diagnosis and availability. However the safety justification would be extremely
difficult without detailed analysis of the smart sensor software and hardware. In
fact it is possible to produce a simple design which meets the safety and
operational requirements without excessive reliance on computer-based
elements as shown below.
[Figure: simple design: a high-pressure pipe monitored by a limit switch and an analogue pressure sensor (4-20 mA signal), with isolated repeater signals carried 1,000 metres to the safety logic]
Having identified the risks, the options and possible trade-offs should be reviewed.
This review will include the viewpoints of the developer, operator, licenser,
purchaser and maintainer. Also, the candidate design, system requirements,
safety case evidence and arguments, and the long term support requirements,
should be agreed with these stakeholders.
Appendix F provides a checklist for safety case reviewing.
The completed implementation safety case for a subsystem will provide evidence
that:
the design features, V&V and safety analysis demonstrate that the
required attributes were implemented
all sub-contracted components have been implemented to specification,
and implement their required attributes
all deviations are documented, and their impact has been analysed and
justified
As the project evolves the results of this subsidiary safety case will be incorporated
in the higher level system safety case. The actual subsystem components would
then be integrated into the overall system according to an integration plan. As
part of this process, the safety case may require evidence from:
Attribute: Correctness
Claim: there is no logical fault in the software implementation
Argument: formal proof of specified safety properties

Attribute: Reliability
Claim: software reliability exceeds the system requirement
Argument/Assumption: reliability can be assessed under simulated operational conditions
The completed system safety case (including subsystem evidence and system-level evidence) should be reviewed to assess whether:
The Preliminary safety case element will have defined requirements in this
area and identified any operating constraints that might apply (see
Section 4)
The Architecture safety case element will have addressed the need to
design for usability, maintainability and modifiability (see Section 5)
The Implementation safety case element will have implemented these
features and assessed whether there are any new operating constraints
or procedures required as well as adding the detail now available to the
maintenance, use and support aspects (see Section 6)
The types of information that will be new to this safety case are those aspects of
operation, installation and maintenance that the developer may not be
competent to define: for example, the specific grades of staff to undertake the
different types of maintenance, training requirements for operators, the exact
user-specific permit-to-work system that should be used, the identified operating
Over the lifetime of the system, there will almost inevitably be changes to the
safety case to accommodate changes in regulations, technology and
organisations so it will be necessary to establish a safety case maintenance
infrastructure (see Section 10).
System structure
We also need to address any special considerations that apply when there are
subsidiary safety cases for components and sub-systems. The documentation
structure is addressed in Section 11.
contractual boundary is crossed the safety responsibilities are handed over via a
safety case for that stage. This is the practice for civil and military air traffic control,
where there are four-part safety cases reflecting the purchaser/developer/
operator/user/maintainer boundaries.
The following table illustrates the different safety case components for the example
of a simple command system that consists of a database and an interface. Note
that not all the project phases are shown.
Project Phase | Safety case element | Produced by
Invitation to Tender | Preliminary |
Preliminary design | Preliminary |
System Design | Architectural | Designer
Subsystem Requirements | Preliminary |
Subsystem Requirements | Preliminary, Architectural |
Database Subsystem Design | Architectural |
Subsystem Implementation | Implementation |
Subsystem Implementation | Implementation |
Systems Integration | Implementation |
Operation | Operational |
For a new system the design can take into account the need to
demonstrate safety and the safety case production can be incorporated
into the project following a design for assessment approach.
For a COTS product, the design freedom is more constrained. There is
design freedom in the choice of COTS, so that a system can be chosen
where there is sufficient generic evidence to demonstrate safety. There is
also design freedom in the way the product is configured and used in a
particular application.
For a pre-existing system, there is very little design freedom, but there
may be scope for additional testing and analysis to demonstrate safety
attributes.
whether the design features will achieve the attributes and whether a
design for assurance approach has been adopted
the project risk arising from novelty, complexity and project stress
whether the hazards identified in the design have been tracked and
controlled (e.g. by hazard elimination, protective features, or operational
procedures)
the impact of changes made during development (and whether this
affects the arguments in the Preliminary and Architectural safety cases)
whether the operational and maintenance requirements to maintain the
system and the safety case are likely to be reasonable
Over the lifetime of the system, there will almost inevitably be changes to the
safety case to accommodate changes in regulations, technology and
organisations.
Appendix F provides a generic safety case review checklist that can be used at
all project phases.
The independent assessments should also look broadly at the available evidence
to ensure that any evidence contradicting the claims is properly incorporated into
the safety arguments.
10 Long-term maintenance
An important part of many safety cases is their potential longevity. This part of the
manual looks at the issues raised by this longevity and the supporting
organisational and management processes that are needed. The maintenance
implications of the safety case have been incorporated into the overall safety
case methodology in Section 5, so that the long-term costs and risks of
maintaining the safety case can be considered at an early stage in the system
design. There is little published data on the costs of safety case maintenance. The
costs of maintaining the overall safety cases in the nuclear industry are significant,
roughly 2% of operating costs per year, so a methodology that considers support
implications could have a considerable impact on costs as well as safety.
Control and protection systems are long-lived in comparison with the lifetimes of
the implementation technologies, which are typically electronic and computer-based.
Developments in these technologies are rapid, with typical products becoming
obsolete within a few years. This has led to the special provision of spares and to
the planned refurbishment of systems, and considerable effort is expended to
address the long term operational requirements. There are however wider issues
than this to be addressed when looking at the long term maintenance of safety
cases. These include the need to maintain the safety case in the light of external
changes which may affect it, e.g.:
We also need to consider internal changes which affect the long-term integrity of
the safety case maintenance process, e.g.:
Monitor the integrity of the safety case and the support infrastructure. (Is
the safety case still valid? Have the outstanding concerns been
addressed? Can anticipated changes be implemented?)
safety functions
reliability requirements
design analyses
provide at least one safety argument for each requirement which relates
evidence from design features, subsystem requirements and
development processes to a claim about the requirement
identify all design assumptions used in the argument, e.g.:
failure modes
failure rates
fail-safe bias
reliability and availability (e.g. failures per demand, spurious trip rate)
timeliness, accuracy
design assumptions
subsystems
outstanding concerns
unresolved hazards
In order to track the evolution of the safety case, it is also desirable to record
significant events during the construction of the safety case. This would include:
justification of changes
results of QA audits
11.10 References
The document will include references to related documents. These could include:
environment descriptions
design documents
hazard log
The overall plant or system safety case makes an overall claim for safety based on
all these risk reduction approaches. Targets would be set for the tolerable
accident frequency and severity, and the top-level safety case would argue that
the implemented safety features ensured the accident frequency was within
limits. There is also a requirement to show the risk is ALARP (as low as reasonably
practicable) so further risk reduction should be implemented provided the costs
do not outweigh the gains.
The systems we are discussing fall mainly into the category of hazard control (i.e.
reducing accident frequency). They would be used to implement the basic safety
functions (e.g. preventing excess mass entry or flooding the cell). Of course there
is no actual need to use a computer-based system to implement a safety
function; other mechanisms such as mechanical interlocks, discrete logic or a
human operator could be used instead. In addition the same safety function
maintainability
modifiability
security
usability
replaceability
These tend to be treated as softer attributes, but they are necessary to maintain
the integrity of the original design against potential sources of attack (even if
these are unintentional). Essentially the attributes relate to threats from different
sources (such as maintenance staff, the operator, unauthorised personnel, or
ageing and obsolescent equipment). These might be addressed using more
qualitative arguments (e.g. number of defences or conformity to ergonomic rules
and design standards).
MOD DS 00-56
MOD DS 00-55
DIN V-19250
DIN VDE-801
ISO 9001
ISO 9000-3
[Table: integrity levels and associated probability ranges]
A similar limit scheme is used in MOD DS 00-56, but the probability ranges are not
pre-determined; they have to be defined for a specific application.
For diverse subsystems implementing the same function (or functions), MOD DS 00-56 allows the subsystem integrity level to be reduced by one level. Common faults
can limit the reliability improvement of diverse systems. The reduction in integrity
level reflects empirical experience that diversity can yield an order of magnitude
improvement. Other examples of claim limits for other design features are:
At least two different safety functions to protect against the most critical
accidents.
[Table: example hardware design features per attribute, grouped under the headings Fault Avoidance, Error Tolerance and Fail-safe bias.]

Availability: stable sensors; stable and accurate input-output system; high-reliability components; multiple channels + voting; feedback mechanisms to minimise long-term error; compliance with environmental standards (EMI, temperature, etc.)

Logical correctness: design simplicity; design diversity; formally proved hardware (e.g. VIPER); hardware watchdogs; fail-safe bias on inputs and outputs; mature hardware (stable, extensive field experience)

Maintainability: interface labelling; keyed connectors to avoid errors; simple, standard interfaces; modular design; multiple channels + voting

Response to overload: ensuring processor capacity is sufficient for maximum input-output data rates; prioritising functions so that the least important functions can be discarded

Security: locked cabinets; access indicators (e.g. light on if door open); encryption

Timeliness: time budgets assigned to functions; hardware watchdogs
[Table: example software design features per attribute, grouped under the headings Fault Avoidance, Error Tolerance and Fail-safe bias.]

Accuracy: use of floating point; integer calculations with worst-case error and overflow analysis; algorithm stability analysis; comparison against computation with a small input perturbation; alternative data sources; fail-safe response to failure conditions

Compliance with hardware constraints (e.g. memory): pre-allocation of resources; memory exhaustion checks

Fault tolerance: alternative output devices

Logical correctness: design simplicity; formal development; design diversity; isolation from failures in non-critical functions; safety kernels; assertion checks in code

Modifiability: design simplicity; information hiding; code assertions to detect errors
Response to overload: mechanisms for limiting throughput; overload detection; graceful degradation (e.g. discarding old data in a real-time system)

Time response: bounded execution time; preference given to safety-critical tasks; software timers; watchdogs
[Table: defences against operation and maintenance errors, with columns Error Avoidance and Fault Detection/Recovery.]

Operator error. Avoidance: training; procedures. Detection/recovery: status displays; capability for cancelling or returning to the original state.

Calibration error. Avoidance: training; independent checks; status recording. Detection/recovery: pre-start tests; on-line monitoring of configuration integrity.

Repair error. Avoidance: training; independent checks; status recording. Detection/recovery: pre-start tests; on-line monitoring of configuration integrity.

Update error (parameter data, redesign). Avoidance: training; independent checks; status recording. Detection/recovery: pre-start tests; on-line monitoring of configuration integrity.

Malicious damage. Avoidance: restriction of access (locked cabinets, passwords, authorisation procedures). Detection/recovery: on-line monitoring of configuration integrity.
C.1 Planning
Safety plan: a document specifying the steps to produce a structured safety case
over the system's lifetime, covering quality management, safety management
and functional and technical safety.
Other plans: the safety aspects of other plans may also be relevant, e.g. the Quality
Plan, Configuration Management Plan, Integrated Logistic Support Plan,
Operation and Maintenance Plan, V&V Plan and Overall Project Plan.
[Table: example Architectural safety case entry.]

Design Features: identification of safety-related functions; partitioning according to criticality; design simplicity
Assumption/Evidence: assumption that segregated functions cannot affect each other
Subsystem Requirements: subsystem integrity level; functional segregation requirements
Attribute: Fail-safety
Design Features: use of functional diversity; fail-safe architectures
Assumption/Evidence: system hazard analysis; fault tree analysis
Subsystem Requirements: fail-safety requirements on subsystems (response to failure conditions)
Attribute: Reliability/availability
Claims: a reliability claim based on reliability modelling and CMF assumptions, together with fault detection and repair assumptions; a reliability claim based on experience with similar systems
Design Features: architecture, levels of redundancy, segregation; fault-tolerant architectures; design simplicity
Assumption/Evidence: reliability of components and CMF assumptions; failure rate, diagnostic coverage, test intervals, repair time, chance of successful repair; prior field reliability in similar applications
Subsystem Requirements: hardware component reliability; software integrity level; component segregation requirements; fault detection and diagnostic requirements; maintenance requirements
Design Features: the design ensures the overall response time is bounded
Assumption/Evidence: assumes subsystem time budgets can be met
Subsystem Requirements: time budgets for hardware interfaces and software
Attribute: Security
Claims: a defence exists for all identified attacks; defence in depth for critical attacks
Design Features: system-level access controls; external interfaces; physical barriers
Assumption/Evidence: knowledge of the likelihood of different forms of attack; assumption that all forms of attack are identified
Subsystem Requirements: subsystem integrity checks; interface credibility checks; subsystem segregation
Attribute: Modifiability
Claim: anticipated changes do not pose a safety risk
Design Features: functional segregation and design structure; design simplicity
Assumption/Evidence: identification of features likely to change; impact assessment of incorrect modification
Subsystem Requirements: explicit identification of features likely to change in software and hardware specifications
Attribute: Maintainability
Claim: maintenance actions can be performed reliably, or are at least fail-safe (based on analysis, and on past systems with similar features)
Design Features: time to repair; limits on maintenance actions (access, calibration, repair, reconfiguration)
Assumption/Evidence: identification of possible maintenance errors; assessment of incorrect action and of its impact on dangerous failure
Subsystem Requirements: subsystem failure reporting and self-test functions
Attribute: Usability
Claim: the operator cannot affect the safety of the system
Design Features: on-line help; ergonomic design; credibility checks; limits on operator action
Assumption/Evidence: human error rates and types of error; usability tests
Subsystem Requirements: operator interface requirements
Attribute: Correctness
Claim: there is no logical fault in the software implementation
Argument: formal proof of specified safety properties

Attribute: Reliability
Claim: software reliability exceeds the system requirement
Argument/Assumption: reliability can be assessed under simulated operational conditions
Attribute: Timeliness
Evidence/Assumptions: maximum timing determined by static code analysis; dynamic tests of worst-case time response

Argument: the software design is such that memory use is bounded and statically decidable
Evidence/Assumptions: analysis of memory usage; stress testing of the system

Argument: identified hardware failures (computer interfaces and computer system) are either tolerated or result in a fail-safe response
Evidence/Assumptions: all failure modes have been identified; fault injection tests to check the response
Argument: the design can detect overload conditions and either maintain a degraded service or perform a fail-safe action
Evidence/Assumptions: there is sufficient processing power to cope with credible levels of overload; overload tests

Attribute: Maintainability
Claim: parameter adjustments can be made without affecting safety
Argument: software-imposed limits ensure parameters remain in the safe range
Evidence/Assumptions: systems-level analysis of allowable safe ranges; validation tests
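As a minimal sketch of the kind of software-imposed limit this argument relies on (the parameter name and range below are illustrative, not from the manual):

```python
# Safe ranges would come from the systems-level analysis cited as evidence.
SAFE_RANGES = {"trip_threshold_bar": (80.0, 120.0)}   # illustrative values

def set_parameter(params: dict, name: str, value: float) -> None:
    """Apply a parameter adjustment only if it stays within its safe range."""
    low, high = SAFE_RANGES[name]
    if not low <= value <= high:
        raise ValueError(f"{name}={value} outside safe range [{low}, {high}]")
    params[name] = value

settings = {}
set_parameter(settings, "trip_threshold_bar", 95.0)    # accepted
# set_parameter(settings, "trip_threshold_bar", 150.0) # would raise ValueError
```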
Attribute: Operability
Claims: the system is robust to faulty operator actions; the system is designed to minimise user error
Argument: the design conforms to human factors standards
Evidence/Assumptions: interface prototyping; validation tests
[Table: impact of small and larger changes on the argument types (statistical testing, deterministic arguments, experience, process); the surviving entries note that the process-based argument requires the process to be re-run or the changed/new parts to be re-developed, and that obsolescence could be a problem]
[Table: impact of small and larger changes on the argument types (statistical testing, deterministic arguments, experience, process); the surviving entry notes that process data must be collected as for the initial development]
[Table: argument types (statistical testing, deterministic arguments, experience, process)]
The extent of the change to software will obviously have a profound effect on the
changes to the safety case. The following table indicates the potential impact:
[Table: potential impact of software change on the argument types (statistical testing, deterministic arguments, experience, process)]
The issue of obsolescent tools needs to be addressed in the periodic review and
an appropriate response formulated. This might involve:
The costs and risks of the approaches need to be considered, and this would
have to include the issue of maintaining expertise in the tools.
F.2 Demonstrable
Understandable. The safety case (or a component part) has to be presented to
and understood by different audiences, such as the developer, the operator and
the regulator.
Evolutionary. The safety case has to be presented at different phases in the
system lifetime, i.e.: system concept, system development, acceptance,
operation and replacement.
F.3 Valid
Accurate. As a prerequisite for a valid argument, the evidence presented should
be accurate, i.e.:
Internally consistent.
Be available to all interested parties. We have termed these the
stakeholders; they could include the regulator, developer, subcontractor, and customer departments (e.g. engineering, health and
safety, operations and maintenance).
Be up-to-date and relate to the actual system design.
This is achieved by producing the safety case within an established safety and
quality management system which tracks the status of the various components of
the safety case and system design and controls the release of documents.
Related to safety properties. The arguments should directly support claims about
the required safety properties of the system (reliability, fail-safety, etc.). Arguments
of good practice (e.g. "we tried hard") are not sufficient.
Designed for assurance. The construction of a valid safety case may not be
feasible unless an appropriate design is used. A "design for assurance" approach
is advocated, where the system design and safety are developed in parallel to
ensure that:
KISS (keep it simple). The risk of flaws in the system design and the associated
safety case will increase with complexity. Complexity should be minimised
wherever possible (see Section 5.1.1).
Traceable. Safety properties at one level will be translated into design features at
a lower level. It should be possible to demonstrate a clear link between top-level
safety goals and the functional behaviour and attributes of implemented
subsystems.
Robust. Arguments may contain flaws. The overall claims should not be sensitive to
individual flaws.
operational requirements
The infrastructure requirements have to be feasible over the long term. This
requires an assessment in the design phase of the costs and risks of maintaining
the safety case over the long term.
Are the mechanisms for eliminating faults and dealing with failure
adequate?
Is there coverage of operations and maintenance risks and adequacy of
defences?
Does the argument conform to the design criteria?
Design rules
the system design and the associated safety argument are kept simple (see
Section 5.1.1)
Is there a scarcity of skills for performing the safety case analysis and
updates?
segregation
[Figure: field reliability data: MTTF (years, 0.1 to 10 000, with a measurement bound indicated) plotted against operational use (years, 0.1 to 100 000)]
The data relate mainly to commercial products used for real-time applications
(control, protection, telephone switching, etc.). For one of the protection systems,
the MTTF approaches 1000 years. However, MTTF may be an unsuitable measure
for such systems, as the important attribute is the probability of failure on demand,
and demands may be infrequent (e.g. less than one per year). Nevertheless, clear
trends were found in the study:
1. The reliability seems to be higher with increased operational use
2. Small software applications have higher reliability than large programs
given the same level of operational use.
Using IEC 61508 terminology, the field reliability results indicate that a Safety
Integrity Level target of SIL 2 (10 to 100 years mean time to failure) is achievable
for some commercial real-time products, but many fall below this. It also
provides some justification for treating SIL 3 and SIL 4 as very onerous requirements
requiring special measures.
Another independent study of reliability in PLC applications [4] yielded the
following results.
[Table: PLC failure data by industry sector (safety significant, production significant, minor and total failures). Years of operation: Nuclear 924.0 (16 failures), Chemical 74.5, Oil and gas 64.5, Electricity 54.4 (10 failures); totals 1117.4 years and 30 failures, of which 11 were safety significant.]
Note that all the failures observed were due to faults in the application software
rather than the underlying PLC operating system. The average failure rate of the
application software is about once in 35 years, and about once in 100 years for
safety-related failures. Again this is consistent with a SIL2 target (10 to 100 years).
Like the previous study, this study found a correlation between application
complexity and unreliability.
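The quoted averages follow directly from the table totals, as this trivial Python check shows:

```python
years_of_operation = 1117.4   # total PLC operating experience in the study
total_failures = 30
safety_related_failures = 11

print(years_of_operation / total_failures)           # ~37 years per failure
print(years_of_operation / safety_related_failures)  # ~102 years per safety-related failure
```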
the theory predicts that for a system with N faults, the achieved reliability after a
usage time T will be bounded by:
MTTF ≥ e·T / N
where MTTF is the mean time to failure, and e is the exponential constant
(2.71828…). Studies of empirical reliability seem to indicate that this result applies in
practice, and the empirical results shown in the section above are consistent with
this finding. For example [3] discusses the application to a number of data sets.
One of these is from three generations of teleswitch equipment. Most of the
detailed data are confidential, but information is available about: the number of
known faults; the software size; and the failure rate over time. Most of the reliability
growth data are based on operation in the field. One complicating factor is that
new systems were being progressively installed on different sites, each with a
different operational profile and possibly different software options, so that new
parts of the input space could be covered for each new installation. The results for
one generation of teleswitch are shown below. We have used a fault estimate
which is 50% greater than the known faults.
[Figure: teleswitch reliability growth: MTTF (years, 0.001 to 10) against usage time (years, 0.1 to 100), with the predicted bound for N=175]
[Figure: successive times to failure, TTF (cycles, 1 to 1 000 000), with the predicted lower bound for N=31]
The predicted lower bound is also plotted on the figure, assuming N=31. It can be
seen that most TTFs lie above the bound. The bound actually relates to the
average TTF, so statistically some TTFs could fall outside the limits. The one point
that falls a long way below the line is known to be a correction-induced fault, but
this has little impact on subsequent reliability growth.
To perform the calculation, we also require the overall usage time of the product
and an estimate of the number of residual faults. The usage time can be inferred
from the number of units sold, and a reasonably good estimate of residual faults
can be obtained by multiplying the software size by the expected fault density.
The fault density might be provided by the developer, or a generic figure could
be used. Relatively conservative generic values of fault density are:
1 fault per kilobyte of binary code
For example, a small PLC with 20 kilobytes of code might have 20 residual faults. If
the PLC had 10 000 years of prior usage we might expect the MTTF for operating
system faults (excluding hardware failures) to be better than:
MTTF ≥ e·T / N = (2.718 × 10 000) / 20 ≈ 1300 years
This is consistent with the empirical evidence (i.e. that no PLC operating system
failures were observed in 1000 years of operation). More complex systems will
have more faults so the expected level of reliability growth will be lower. For
example a teleswitching system might contain ten to a hundred times as many
faults, so the reliability after a similar level of usage might be one or two orders of
magnitude lower (i.e. between 10 and 100 years MTBF, which is broadly consistent
with empirical observation).
The theory and empirical results support the KISS principle (Keep It Simple). Simpler
systems should contain fewer faults and hence become reliable more rapidly
than large systems.
It also follows that rapidly evolving designs will be more unreliable than stable
designs. Some systems may be subject to continuous change to incorporate new
functions. These changes can reduce the reliability to a much lower level since
the new faults will have been exposed to relatively little usage and hence can
have much higher failure rates. Under conditions of continuous change, the
failure rates of the new faults can be the dominant factor, i.e. the limit will always
be worse than e·T/N, where N is the number of new faults introduced in the
periodic upgrades which occur after a usage time T. So for a system that
introduces 100 new faults in each upgrade, and upgrades once per year over
1000 sites (i.e. N=100, T=1000 years), the best reliability that can be expected at
the end of the year is at most:
e·T / N = (2.718 × 1000) / 100 ≈ 27 years
and in the early stages in the upgrade period the MTTF bound will be much
smaller. It would therefore be sensible not to upgrade to a new version until
extensive field experience has been gained.
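A minimal sketch of these calculations (in Python, using the illustrative fault-density and usage figures above):

```python
import math

def mttf_bound_years(usage_years: float, residual_faults: float) -> float:
    """Worst-case reliability growth bound: MTTF >= e*T/N."""
    return math.e * usage_years / residual_faults

# Small PLC: 20 kilobytes of code at ~1 fault/kilobyte, 10 000 years of prior usage.
print(mttf_bound_years(10_000, 20))   # ~1359 years (the text rounds to ~1300)

# Continuous change: 100 new faults per annual upgrade across 1000 sites.
print(mttf_bound_years(1_000, 100))   # ~27 years
```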
kloc (kilo-lines of code) and studies show the post-delivery fault density might lie
between 1 and 5 faults per kloc for conventional development processes.
For in-house software development more accurate estimates of fault density may
be feasible. For a well-established development process applied to large systems,
more precise estimates might be obtained from process profiling. This involves
estimating the fault detection profile for previous projects, and the early
developmental fault data can be scaled to derive accurate estimates of residual
faults. To illustrate, the following table shows the process profile of the PODS
software diversity experiment [1]. The results are probably not typical of larger
projects, but the example illustrates the overall approach.
[Table: PODS process profile: faults created, detected and remaining by detection method (customer requirements review, design review, code review, acceptance test)]
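A minimal sketch of the scaling idea, with hypothetical counts rather than the PODS data:

```python
# If past projects show what fraction of all faults is typically found at each
# phase, counts observed so far on a new project can be scaled to estimate the
# total faults injected, and hence the residual faults.
detection_profile = {"design review": 0.30, "code review": 0.40, "acceptance test": 0.20}
observed = {"design review": 24, "code review": 33, "acceptance test": 14}

fraction_covered = sum(detection_profile.values())           # 0.90 of faults typically found
estimated_total = sum(observed.values()) / fraction_covered  # ~79 faults injected
estimated_residual = estimated_total - sum(observed.values())
print(round(estimated_total), round(estimated_residual))     # 79 total, ~8 remaining
```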
G.6 References
[1]
[2]
[3]
[4] R.I. Wright and A.F. Pilkington, An Investigation into PLC Reliability, HSE Software Reliability Study, GNSR/CI/21, Risk Management Consultants (RMC), Report R94-1(N) Issue B, Nov. 1995
These processes should be extended to include the integrity of the safety case
infrastructure. In the case of the system modification procedure, this may require
a redefinition of a modification to include changes in the safety case
maintenance environment (i.e. people, structures, resources and procedures),
and an extension to the scope of periodic reviews and audits.
We propose two forms of assessment:
We also identify the set of documents needed to support these assessments, and
propose that a mechanism is put in place which updates the assessment
guidance in the light of practical experience and new technical knowledge.
It will also be necessary to consider what activities should be implemented at the
system level and the corporate level. It would be logical to make the long-term
monitoring of new technical knowledge a corporate function so there should be
some central corporate activity which:
collates system experience (e.g. failure data, common cause failure and
incident analyses)
monitors technical advances, standards and regulatory requirements
alerts system sites to immediate problems (e.g. to other sites with similar
systems)
analyses past experience and updates the design and safety case
assessment guidance and checklists
analyse the data and derive the lessons learned (these could be more
general than the incident itself)
review and validate the new rules
incorporate the rules in design criteria, claim limits, checklists, assessment
procedures, etc., for use on subsequent projects
verify the application of the new rules in each project
It should be noted that this long-term improvement process need not result in an
increased assessment and maintenance burden. Optimisation is part of the
process: the performance and costs of the existing rules and recommendations
should be assessed. If these prove to be ineffective or irrelevant or there are more
cost-efficient alternatives, the rules should be changed to reflect this.
Greater knowledge of the maintenance effort could have a significant impact on
the approach to the design of new safety systems (e.g. by employing simple
designs or additional defences which reduce the safety case maintenance
requirements). This could be reflected in updated design guidance and design
criteria.
Anticipated change list. This is a list of possible changes that have been allowed
for in the current system design and safety case, and may need to be
updated over time.
Current concerns list. This would include lists of safety issues that require either
resolution, monitoring or further analysis (similar in principle to the Hazard Log
defined in Defence Standard 00-56). This list can change with time (e.g.
problems can be resolved, and new concerns can be identified).
Safety case infrastructure status report. Produced at periodic reviews to assess the
adequacy of the safety case infrastructure, i.e.:
staff competencies
documentation
technical resources
I.3 Violations
Past incidents and accidents may provoke restrictions and prescriptive
procedures on the actions of users of the system. Increased maturity adds further
restrictions as time goes by, perhaps resulting in procedural over-specification to
the point where user violations are the only way to actually get the job done. An
overly-prescriptive safety case procedure set against time demands may
therefore encourage violations and result in unsafe acts. Thus it is important to
consider, in an open-minded manner, any difference between the prescribed
procedures for the process and the actual procedures followed by the users.
Many analyses of incidents (e.g. the Challenger and Herald of Free Enterprise
disasters) that have naively been attributed to human error have shown that
organisational context and culture are central in assuring the safety of a process.
An inappropriate organisational context surrounding a process may provide
latent errors that lie dormant until a particular set of coinciding events come
together to form a safety-critical incident. Furthermore, organisational, cultural
and communication structures (e.g. [6]) determine the extent to which corporate
knowledge and good practice may be reused; otherwise old problems may have
to be revisited and solved afresh each time. Organisational weaknesses may be
associated with:
I.7 References
[1]
[2]
[3]
[4]
[5]
[6]
[7]
Reason, J., Human Error. Cambridge University Press, Cambridge, UK, 1990.
[Figure: maintaining the safety case: responding to changes in the environment, equipment and technical knowledge while keeping the safety case demonstrable, consistent, valid and adaptable, supported by human resources, technical resources and documentation]
J.2.1 Demonstrable
The safety case should be demonstrable: for each stakeholder there should be
adequate human resources, documentation and technical resources to
understand and evaluate the safety case.
Adequate human resources
The safety case is not demonstrable unless there are people available who
understand the safety case and its relationship to the safety system. This
requirement applies to each stakeholder. Some of the issues involved in
maintaining safety case knowledge and skills are discussed below.
Maintaining Skills: There will be a need to identify the skills and knowledge
necessary for the stakeholders. The required skills and knowledge should be
documented, together with the staff who provide these capabilities (e.g. in a
competence matrix). This should include any key sub-contract staff who
provide maintenance support. This matrix should define the required depth of
understanding; in some cases it may only be necessary to have sufficient
knowledge to understand what others have done, while in other cases there
should be the in-depth knowledge needed to create acceptable documents or
designs.
The safety case infrastructure status assessment should:
Tacit knowledge: The safety case may rely on unexpressed knowledge and
expertise within the safety team or supporting experts. Some of this may be in the
form of implicit assumptions and background rationale for design decisions that
have been made. However, some of the deep expertise of domain experts may
be in the form of know-how that is difficult to express (see Appendix I). This form of
tacit knowledge is hard to formalise and codify, and can be a vulnerability once
key personnel retire or move on.
To address this vulnerability, the review should assess the extent of tacit
knowledge, and recommend how it may be converted to explicit knowledge, or
maintained for the future.
Adequate documentation
The safety case documentation set should meet the needs of the various
stakeholders. It should be written with a clear understanding of who the target
audience is, their likely tasks, and how the safety case documentation set is going
to support these tasks. In particular it should be:
J.2.2 Consistent
For each stakeholder (e.g. operation, regulator or safety department) the safety
case documentation set(s) should be consistent, i.e.:
the documentation should be internally consistent (in terms of cross-references and dependencies)
The available records can be reviewed and stakeholder sites can be audited to
see if the latest versions have been distributed. Even if the documents remain
unchanged, responsibilities and organisations may alter, and it may be necessary
to check that the current stakeholders have the relevant documentation.
An audit can also be performed to check whether the safety case has taken into
account any changes to the system and the operational environment, e.g.:
Any mismatches should be identified, the causes analysed and, where necessary,
changes in procedures implemented. This may require an analysis of existing
processes, and should take into account human factors aspects (see also
Appendix I). For example, a private marked-up copy may exist which reflects
the true configuration of the system. One response is to make the procedures
more strict. However a human-centred analysis might conclude that the existing
J.2.3 Valid
Issues to be addressed include:
J.2.4 Adaptable
The safety case and supporting process should be capable of responding to
anticipated changes. As part of the overall safety case methodology, a list of
anticipated changes should be identified, and the system design and safety case
should be able to accommodate those changes.
The capacity to adapt to change should be periodically assessed, e.g. by:
For any change, there should be processes in place to assess and manage the
impact of these changes on the safety case. This will include:
analysing safety significance (both for the system and more globally)
identifying what changes are required to the safety case and the system
design
assessing commercial aspects of the change: risks and costs
(implementation cost, outage delays and lifecycle support costs)
negotiating the proposed changes (e.g. to procedures, equipment, and
safety case) with appropriate stakeholders (the relevant stakeholders will
loss of skilled staff and associated tacit knowledge and know-how (see
Appendix I)
changes of responsibility and organisational restructuring
loss of ready access to key resources (e.g. documentation, technical
equipment or expertise).
A change proposal should be prepared which identifies the organisational
changes, maps out the changes in resources, and shows how the new structure is
to be aligned with the safety case maintenance tasks.
For both remedial actions and major changes there should be a process involving
the system stakeholders which can accept the proposed change and approve
the resulting implementation. This would typically be part of the normal safety
management process (e.g. involving the plant safety committee, the corporate
safety departments and the licensors).
The impact of any new knowledge on the safety case can be either positive or
negative. The information should be assessed to establish whether:
the safety case is still valid, or whether changes are required to the safety
case or the system
the safety case is too conservative (e.g. pessimistic design assumptions for
fail-safe bias, failure rates, etc.); the new information may permit stronger
claims to be made about the system
the safety case is still ALARP (e.g. are any of the new methods
reasonably practicable?)
J.5 Demonstrable
The requirement:
The safety case should be demonstrable: for each stakeholder there should be
adequate human and technical resources and documentation to understand
and evaluate the safety case.
The following sets of questions address this requirement.
Maintaining skills:
For each stakeholder identify:
Who is able to read and understand the details of the safety case?
Tacit knowledge.
Assess the extent to which the safety case relies on the tacit knowledge or
know-how of experts. Look for indications such as:
Develop a strategy to maintain and transfer this tacit knowledge in the future, for
example by:
J.5.2 Documentation
hazard analyses
Is there evidence of the use of the tools by the stakeholders for the safety case?
J.6 Consistent
The requirement:
For each stakeholder (operation, regulator, safety dept) the safety case
documentation sets should be consistent with the current configuration of
the system.
J.7 Valid
The requirement:
The safety case should remain valid.
Consider the following questions:
Is the environment (in its broadest sense) stable so that the safety case
assumptions and evidence remain valid?
What monitoring is there to detect any changes that invalidate the safety case?
Consider the following:
operational modes
interfaces
new information, e.g. from the analysis of failure data, incidents, and
periodic tests
Are concerns or caveats in the initial safety case tracked (e.g. things to fix later,
questionable assumptions, continuing investigations or supporting analyses)?
Consider:
integrity of PES
J.8 Adaptable
The requirement:
Has the need for change been addressed in the design of the safety
case?
Is there an anticipated change list? Is it reviewed and updated in the
light of operating experience, changing requirements (e.g. changed
modes of system operation, such as a change from base load operation
to load following), and developments in technology (e.g. test
methods, understanding of diversity, sensors or obsolescence)?
Is it possible to adapt the safety case to change? Review the safety case
with respect to the anticipated change list. Assess cost of different types
of change. Document the areas that are difficult to change.
Does the safety case documentation structure and architecture support
its own evolution and development?
Are there processes in place to:
analyse safety significance (both for the system and more globally)?
identify what changes are required to the safety case and the system
design, i.e. what is changed and to what extent? Consider:
When assessing a proposed organisational change, consider:
the coverage of current tasks (are all tasks covered? are some tasks
duplicated?)
the organisational fit (e.g. are tasks spread across organisational
boundaries? will the speed of response be acceptable?)
the loss of expertise and domain knowledge (will past knowledge be
diluted, or split across separate organisations? do we have a
lobotomised organisation?)
inter-group communication and number of contractual barriers (e.g.
the number of interfaces involved in implementing a given activity
or change)
access to documentation and expertise
The impact of any new knowledge on the safety case can be either positive or
negative. The information should be assessed to establish whether:
the safety case is still valid, or whether changes are required to the safety
case or the system
the safety case is too conservative (e.g. pessimistic design assumptions
for fail-safe bias, failure rates, etc.); the new information may permit
stronger claims to be made about the system
the safety case is still ALARP (e.g. are any of the new methods
reasonably practicable?)
Appendix K Reactor protection example
Trip the reactor if the temperature is too high in any gas duct
[Figure K1: Reactor protection example: system architecture. Four channels,
each comprising thermocouple inputs feeding a PAC and DCL that turn a square
wave signal into a coded output signal; the four coded outputs feed 2oo4
fail-safe guardline logic, and a monitor computer is connected to the channels
by serial lines.]
Each design feature addresses one or more of the safety requirements as
described below.
[Figure: single channel detail. The PAC takes 800 thermocouple readings
(T1, T2, ...); the DCL converts the square wave signal into the coded output
signal; a test source and test mode selector allow the channel to be exercised
in test mode.]
The monitor computer can be used for pre-start checks on the consistency of the
software configurations in the four channels (R.SEC), and for on-line diagnosis of
channel failures and failures of thermocouples (R.TST, R.FIX). By comparing outputs
from the channels it is possible to decide whether the fault resides in a channel or
the thermocouple input system. It can also be used to monitor long-term
degradation of thermocouples. If degradation is severe, availability can be
maintained by replacement or a veto.
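The discrimination between channel and thermocouple faults can be sketched as
follows. This is an illustration only, not the monitor software: the tolerance,
the 50 degree "implausibly low" threshold and the reading layout are invented
for the example.

  # Hypothetical sketch of the monitor computer's fault localisation.
  # For each input, the four channel readings are compared against the
  # median of the four. A single persistently disagreeing channel points
  # to a channel fault; four channels agreeing on an implausibly low
  # value points to the thermocouple (thermocouples tend to fail low).

  def localise_faults(readings, tolerance=2.0):
      """readings: dict input_id -> [r1, r2, r3, r4], one reading per channel."""
      disagreements = [0, 0, 0, 0]      # per-channel disagreement counts
      suspect_inputs = []
      for input_id, values in readings.items():
          centre = sum(sorted(values)[1:3]) / 2.0      # median of four
          outliers = [ch for ch, v in enumerate(values)
                      if abs(v - centre) > tolerance]
          if len(outliers) == 1:
              disagreements[outliers[0]] += 1          # one channel disagrees
          elif not outliers and centre < 50.0:         # all agree, reading low
              suspect_inputs.append(input_id)
      return disagreements, suspect_inputs

  print(localise_faults({"duct1/T1": [400.1, 399.8, 400.3, 250.0]}))

A channel that disagrees across many inputs indicates a channel fault; a single
input flagged by all four channels indicates a thermocouple fault.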
K.3.4 Simplicity
The design has no intercommunication between channels and the A/D
conversion is performed within the PAC. There is no need for interrupt handling or
buffering so the software can be implemented as a simple cyclic program. This
should be easy to test and verify (R.TRIP) and alter (R.UPD).
Since the program is simple and cyclic, the worst case response time is bounded,
and the worst case time is readily determined via timing tests or code analysis. The
time delays in the interfaces can also be measured to determine the overall
response time (R.TIM).
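For illustration, the worst-case figure used later in the safety case (a worst
measured time of 2.4 seconds against the 5 second requirement, claim C.TIM.TEST)
could be gathered by instrumenting the cycle over a long soak test. A minimal
sketch, with an empty stand-in for the real scan body:

  # Hypothetical sketch of a worst-case cycle time measurement.
  # Because the program is a simple cycle with no interrupts or
  # buffering, the maximum observed cycle time over a long soak test
  # is a credible estimate of the worst case, to be confirmed by
  # code-level timing analysis.

  import time

  def scan_cycle():
      pass    # stand-in: read inputs, vote, write outputs

  def measure_worst_case(cycles=100_000):
      worst = 0.0
      for _ in range(cycles):
          start = time.perf_counter()
          scan_cycle()
          worst = max(worst, time.perf_counter() - start)
      return worst

  print(f"worst measured cycle time: {measure_worst_case():.6f} s")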
[Table: mapping of design features to safety requirements. Requirements: PFD,
STR, TRIP, TIM, F1, F2, SEC, UPD, TST, FIX. Design features include modular
hardware replacement, mature hardware and software tools, and access
constraints.]
Requirements affected: R.PFD, R.TIM, R.FIX.
With an alternative diverse channel architecture using PLCs, we may not be able
to perform formal proofs but we might be able to claim an order of magnitude
reduction in failures per demand beyond that demonstrated in the statistical tests.
change of sensors
R.PFD
Claim: PFD < 10⁻³ pa
C.PFD.RAND
Argument: hardware reliability analysis (redundancy + monitor + self-tests); see
Probabilistic Fault Tree Analysis.
Assumptions: common mode factor; no systematic faults (sub-claim C.NO-FLT);
component failure rates; fault detection coverage and fail-safe bias of inputs;
repair times.
C.PFD.SYST.1
Claim: even if there are systematic faults, the chance of failure per demand is
less than 10⁻³.
C.PFD.SYST.2
Claim: fail-safe design will ensure that at least 90% of failures due to
systematic faults are fail-safe. (Note: design assessment criteria might impose
a claim limit of 90%.)
Arguments and assumptions:
a) Double thermocouple disconnection or veto will cause a trip. Assumption:
thermocouples fail low in 90% of cases.
b) Compiler, loader and processor flaws are protected by the reversible
computing technique. Assumption: tests indicate a 99.995% fail-safe bias.
c) ADC and application software and configuration flaws are covered by dynamic
on-line tests. Assumptions: the requirements are correct; the tests detect 90%
of systematic failures.
Sub-claims under C.NO-FLT:
C.NO-FLT.HW
Argument: established designs + system tests + reliability tests imply that
there will be no systematic hardware flaws.
A further sub-claim for the software argues that it has undergone functional
tests to reveal compiler-induced faults, assuming that the requirements are
correct and that the functional tests can reveal all compiler-induced faults.
R.TIM
Claim: time < 5 secs
C.TIM.STATIC
Argument: timing measurements, plus the argument that the execution time is
bounded and relatively constant.
Assumptions/evidence: instruction execution times are correct; ADC conversions
and output time are correct; test results.
C.TIM.TEST
Evidence: worst measured time is 2.4 seconds.
C.TIM.REV
Argument: an excessive or infinite loop will be detected by the reversible
computer implementation.
Assumption: the reversible computer implementation is OK.

R.UPD
Claim: updating the system should not introduce faults
C.UPD.DATA, C.UPD.PROG
Argument: there is sufficient protection to prevent updates of program or data
introducing dangerous faults.
Evidence: adequate support infrastructure; see Anticipated Change analysis.
Assumptions
10% of sensor failures are unrevealed
10% of buffer failures are unrevealed
Common mode failures are 10% of individual failures
10% of channel failures are unrevealed by a channel trip
10% of channel failures are unrevealed by the monitor
Channel failure rate (CPU + ADC + DCL): 1 pa
Sensor failure rate: 10⁻³ pa
Probability estimation
The system is unsafe if a dangerous fault exists but is unrevealed. Internal checks,
monitor checks and proof tests are the main methods for revealing failures.
Systematic faults are mainly deemed to be incredible (see the sub-claim C.NO-FLT).
For random failures we have to include the risk of common cause failures, and the
chance they will remain undetected until the 3-monthly proof test. Taking the
case of the sensors, the basic failure rate is estimated to be 10⁻³ per annum. We
assume that the common mode failures are 10% of this (10⁻⁴ per annum), and 10%
of these will be undetected until the 3-monthly proof test (10⁻⁵ per annum). On
average the dangerous sensor measurement failure will be unrevealed for one
and a half months (0.125 of a year), so the probability of unrevealed unavailability is
0.125 × 10⁻⁵. The unavailability of temperature measurements due to two
unrevealed random failures in one duct is negligible (around 10⁻¹⁰). Since the
demand is only made on one duct, we only need to consider the unavailability of
a single duct measurement.
A similar argument can be applied to the isolation amplifiers and buffers. The
dominant factor is again common mode failure, which is assumed to affect all
buffers simultaneously, so the calculation is identical to the one used for the
thermocouples.
For the hardware channel failures we assume the common mode failure rate is
10% of the single channel failure rate (10⁻¹ per annum). Of these, 10% are
unrevealed by a channel trip (10⁻² per annum), and 10% of the remainder are not
detected by the monitor (10⁻³ per annum). An unrevealed failure persists for an
average of 0.125 years, so the overall probability is 12.5 × 10⁻⁵.
The probability assignments for the fault tree events are summarised below,
including those which are assumed to be incredible (probability zero); a worked
calculation follows the table.
[duct-specific fault]
Event: probability (basis)
Demand(i) and 2oo2 Sensor (I) failed unrevealed: 0.125 × 10⁻⁵ (proof tests)
or 3oo4 [Buffer (A,I) and Buffer (B,I) fail unrevealed]: 0.125 × 10⁻⁵ (proof
tests, analysis)
or software reads input J instead of input I: 0 (analysis)
or unrevealed channel failures: 12.5 × 10⁻⁵ (proof test + monitor + DCL)
or high trip logic flawed: 0 (analysis + test + on-line test)
or multiplexor hardware latches past values: 0 (analysis, fault injection)
or operating on stale copy of input data: 0
or sends old copy of output data: 0
or execution time too long: 0
PFD: 12.7 × 10⁻⁵
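The arithmetic behind the table can be reproduced directly. The sketch below
simply restates the assumptions above in Python; it adds nothing beyond the
figures already given.

  # Reproduces the PFD estimate from the stated assumptions.

  EXPOSURE = 0.125   # mean unrevealed period: half the 3-month proof
                     # test interval, expressed in years

  # Sensors: 10^-3 pa base rate, 10% common mode, 10% unrevealed.
  sensors = 1e-3 * 0.1 * 0.1 * EXPOSURE        # 0.125 x 10^-5

  # Buffers: the same assumptions, hence the same figure.
  buffers = sensors                            # 0.125 x 10^-5

  # Channels: 1 pa base rate, 10% common mode, 10% unrevealed by a
  # channel trip, 10% of the remainder undetected by the monitor.
  channels = 1.0 * 0.1 * 0.1 * 0.1 * EXPOSURE  # 12.5 x 10^-5

  print(f"PFD = {sensors + buffers + channels:.3e}")   # 1.275e-04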
program which can be verified by proof testing and by testing in conjunction with
the modified DCL.
Change of computer hardware or software tools. The fail-safe integrity checks
provide protection against flaws in the new hardware and software tools. The
separate channel structure and simple input-output interfaces permit selective
upgrading on a per-channel basis (phased commissioning).
Change in functional requirements. This would require repeating the formal proof
and re-developing the formally proven software. Proof tools have to be available
(or be re-implementable on another system). Formal proof requires relatively
scarce expertise and could represent a risk in terms of greater implementation
delays and higher update costs. However, licensing risks and the associated costs
are likely to be reduced.
Change of sensors. Relatively simple technology. Changes can be
accommodated by re-scaling the buffer amplifiers or changing the scaling
constants in the software. Verifiable via proof testing, dynamic on-line tests and
the monitor output.
Regulatory changes. If the requirements for diversity become more stringent,
diversely implemented channels can be used to protect against systematic
hardware and software flaws. This is relatively simple as each channel is
independent. Diverse sensors and buffers are also feasible. Requirements for more
rigorous system testing should be feasible as each channel is a standalone unit,
and tests can be performed individually without the need to test for interaction
effects.
safety reviews
problem analysis
Special tools/skills:
DCL design
test environments
test suites
Domain knowledge:
sensor characteristics
CMF mechanisms
Anticipated changes:
trip parameters
trip logic
fault detection
number of inputs
processor hardware
interface hardware
implemented from scratch using different formal notations and support tools.
Obsolescence of the dynamic coded logic could be a problem, but the basic
structure should be re-implementable in a new technology, and the fail-safety
can be reviewed by independent specialists and tested directly by fault injection.
As a fall-back, the system could be re-implemented with diverse hardware and
software in the channels.
MTTR
The impact of these results on the safety case should be assessed. If the results
undermine the safety case, changes to the system design, operating procedures,
or monitoring systems may be necessary.
requirements for the subsystems. In the specific reactor trip example there might
be requirements for the following.
D.ARCH
D.ENV
D.POW
D.DCL
D.INP
D.ADC
D.MON
D.CPU
D.SW
Note that the subsystem requirements will include any evidence required for the
safety case (e.g. environmental test evidence, timing, fault tolerance tests, fault
injection tests, etc.). This evidence could be part of the subsystem deliverable.
As an example of how the subsystem requirements are elaborated, the
requirements for the software (D.SW) are given below. The requirements placed
on the software are based on an apportionment of the top-level safety functions
together with additional requirements imposed by lower level design decisions.
The requirements include the basic functional requirements for the software,
specific design constraints on the implementation method, and requirements for
safety case evidence.
From (R.SEC and R.UPD). Every complete scan cycle, send the software
configuration data (number of inputs, input scale factors, trip limit
values, software version number and sumchecks).
SW.TRIP
Scan the two temperature readings (Ta, Tb) from the ADC. Perform the
1oo2 voted high temperature trip (HiTrip = max(Ta, Tb) > Tlimit). (A
sketch of the scan cycle follows this list.)
SW.IO
Satisfy the specified interface requirements for the ADC, DCL, and
Monitor ports (from D.DCL, D.ADC, D.MON).
SW.CHK
SW.TIM
From R.TIM. The software scan cycle should be less than 5 seconds
including the time required for all input and output operations.
SW.FM
SW.DIV
SW.CYC
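To make the shape of the required program concrete, here is a minimal sketch of
one scan cycle. This is illustrative Python, not the formally developed
implementation: read_adc, send and the additive sumcheck are stand-ins for the
real ADC, DCL and monitor interfaces, and the trip limit is an invented value.

  # Hypothetical sketch of the cyclic scan program. Each cycle reads
  # the two temperature readings, performs the 1oo2 voted high trip
  # (SW.TRIP), and sends the software configuration data with a
  # sumcheck (from R.SEC and R.UPD).

  T_LIMIT = 650.0     # trip limit, degrees C (illustrative value)
  VERSION = "1.1"

  def read_adc():
      """Stand-in for the ADC interface: the two readings Ta, Tb."""
      return 412.0, 408.5

  def send(port, message):
      """Stand-in for the DCL and monitor output interfaces."""
      print(port, message)

  def sumcheck(data):
      """Simple additive sumcheck over the configuration data."""
      return sum(str(data).encode()) % 65536

  def scan_cycle():
      ta, tb = read_adc()
      hi_trip = max(ta, tb) > T_LIMIT     # 1oo2 voted high trip
      send("DCL", "TRIP" if hi_trip else "HEALTHY")
      config = {"inputs": 2, "scale": 1.0, "limit": T_LIMIT,
                "version": VERSION}
      send("MONITOR", (config, sumcheck(config)))

  scan_cycle()

Because the cycle is straight-line code with no interrupts or buffering, its
worst-case execution time is bounded, which is what the timing claims (SW.TIM,
C.TIM.TEST) rely on.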
SW.TRIP.CASE
SW.TIM.CASE
SW.V&V.CASE1 SW.FM.VER
SW.REV.CASE
SW.V&V.CASE2 SW.ETST.CASE
SW.DIV.CASE
SW.DES.CASE
SW.TOOL.CASE
Provide evidence for the integrity of the delivered system and the
development process: safety plan, safety audit records, quality
plan, QA records, plans, design documents, software, proof files,
V&V records.
SW.PRODUCT
Provide all necessary items for use and long-term support: design
documents, software, proof scripts, test environment, support
tools.
Appendix L Index
accident mitigation .................... 18, 61
adequate .................... 8
assumptions .................... 14
certification .................... 37
commercial risk .................... 8
conservatism .................... 49
correctness .................... 41
costs .................... 29
design options .................... 69
deterministic argument .................... 15, 86
diversity .................... 19
fault elimination .................... 18
FMEA .................... 41
Hazops .................... 39
IAEA-367 .................... 65
independent assessment .................... 49
integrity checks .................... 37
interlocking .................... 36
maintenance .................... 51
management .................... 52
modifiability .................... 62
MTTF .................... 15
MTTR .................... 22
novelty .................... 36
operator .................... 39
PES .................... 55
preliminary safety case .................... 22, 23
probabilistic arguments .................... 15
probabilistic criteria .................... 65
project lifecycle .................... 22, 45
purchaser .................... 39
qualitative argument .................... 16
robustness .................... 16
security .................... 21, 62
support tools .................... 31
teams .................... 117
tolerable .................... 61
tools .................... 89, 90
traceability .................... 28, 54
training .................... 23, 73
usability .................... 63
validation .................... 40, 56
verification .................... 40, 56
voting .................... 71
watchdogs .................... 70, 72