Critical Software for Nuclear Reactors: 11 Years of Field Experience Analysis


Jean-Cyril Laplace
Technicatome
BP34000 Aix-en-Provence
jlaplace@tecatom.fr
Phone: (33) 4 42 60 27 60
Fax: (33) 4 42 60 25 00

Michel Brun
Technicatome
BP34000 Aix-en-Provence
mbrun@tecatom.fr
Phone: (33) 4 42 60 27 16
Fax: (33) 4 42 60 25 00
Abstract
Technicatome designs the nuclear reactors of the
submarines and aircraft carriers of the French Navy.
To improve the software development process of its
new generation of digital instrumentation and control
systems, and to evaluate their actual dependability, an
analysis of data on operating experience has been
performed. It covers 10 years of operation, more than
5.5 million hours, and 350 versions of 30 critical applications.
The following conclusions can be drawn from this
experience:
Classical methods are efficient and sufficient if they are applied according to high-quality requirements.
The analysis of data on operating experience is an
efficient means to improve the development
processes.
Emerging methods, such as formal methods, would have been of little help in preventing the errors actually encountered in operation, since the latter concern HW/SW interactions and real-time issues that are extremely difficult to model.
1. Introduction
Fifteen years ago, Technicatome introduced a new
generation of Instrumentation and Control systems (I&C)
based on a full authority digital technology. To guarantee
a very low unsafe failure rate, Technicatome developed a
system architecture based on a highly dependable
proprietary computer and the associated methodology
and means to support the development of critical
software.
Today, this technology and these computers are used in five operational reactors, as well as in railway and space applications. In all cases, the critical software applications have been developed by Technicatome.
1.1. I&C architecture and dependable computers
The I&C is composed of three main sub-systems:
The safety-critical I&C, performing the critical
functions for the protection of the reactor (e.g.,
reactor shutdown) or for the availability of the steam
(e.g., the regulation of the steam generator level);
The man-machine interface, also delivering important services for safety and availability;
The auxiliary I&C, performing non-critical functions.
This article focuses on the first item.

The safety-critical I&C sub-system relies on the
following main technical choices:
The system-level architecture uses 3 identical channels, each one involving about 10 different computers. A channel acquires its input data via its own sensors and exchanges them with the other channels. After an input consolidation phase and a processing phase, the 3 channels deliver their own logical command signals to a 2-out-of-3 voter that actually drives the actuator (a minimal sketch of this scheme is given after this list).
With respect to design faults, dependability relies on fault prevention, in particular through the choice of simple design alternatives and high-quality development processes. This allows the hardware and the software of the 3 channels to be strictly identical.
With respect to physical faults, dependability relies on the cross monitoring of hardware and software and on temporal redundancy: data processing is done twice per computer cycle and the resulting two logical commands are compared before being transmitted. The unsafe failure rate achieved is very low (<10⁻⁸ unsafe failure per hour), allowing the interval between functional tests to be extended up to one year.
Dependability evaluation of the computers is based on fault injection (see [2]).
Without being nuclear specialists, developers have a good knowledge of the process controlled by the I&C. This favours a critical review of the specifications during design, which has proved extremely beneficial to the early identification of specification errors and ambiguities, and finally to the overall quality of the software product.
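To make this concrete, the following minimal C sketch illustrates the duplicated per-cycle computation inside a channel and the 2-out-of-3 vote described above. All names, types and signatures are assumptions made for this illustration; they do not reproduce the actual Technicatome implementation.

    #include <stdbool.h>

    /* Illustrative sketch only: per-channel temporal redundancy and the
     * 2-out-of-3 vote; identifiers are hypothetical. */

    typedef struct {
        bool command;   /* logical command computed by one channel        */
        bool valid;     /* false if the duplicated computations disagreed */
    } channel_output_t;

    /* Temporal redundancy inside one channel: the processing is run twice
     * per cycle and the two results are compared before transmission.     */
    channel_output_t channel_cycle(bool (*process)(const int *inputs),
                                   const int *consolidated_inputs)
    {
        bool first  = process(consolidated_inputs);
        bool second = process(consolidated_inputs);

        channel_output_t out;
        out.command = first;
        out.valid   = (first == second);  /* disagreement => discard output */
        return out;
    }

    /* 2-out-of-3 vote across the three identical channels: the actuator is
     * driven only if at least two valid channels command it.               */
    bool vote_2oo3(const channel_output_t channels[3])
    {
        int votes = 0;
        for (int i = 0; i < 3; i++) {
            if (channels[i].valid && channels[i].command) {
                votes++;
            }
        }
        return votes >= 2;
    }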
1.2. Critical software development process
Critical software is developed according to the IEC 880 standard, which governs the development of safety-critical software in the nuclear domain. In this framework, the main characteristics of our software development process are the following:
Rigorous and early configuration management. A
particular objective is to support the collection of
pertinent data on the quality of the software
development process from its very early phases.
Use of computer-aided specification tools to express specifications in a formalised but non-mathematical notation. This is aimed at (i) preventing and identifying specification inconsistencies and incompleteness and (ii) favouring the understanding of the specification.
Complete cross-checking of all the development outputs. This operation is performed by engineers from the same team as the developers, so as to benefit from the competence of the checkers (since they usually develop the same kind of software). This is preferred to a strict hierarchical independence.
Execution of unit tests both on the host and on the
target computers.
Evaluation of all software with respect to multiple
metrics, such as structural complexity, test coverage,
etc. Emphasis is put on the analysis of the causes of
discrepancies.
Great care taken over software validation ("white box" testing performed by the software development team) and functional validation ("black box" testing performed by the team in charge of the functional specification). These steps are performed using test-beds supporting the automatic execution of scenarios.
Coding done in C to support long-term (several decades) code maintenance, with strong coding constraints to ensure fault prevention (the style such constraints enforce is sketched after this list).
Enforcement of knowledge transfer between the
software development teams and those in charge of
the functional specification and field operation (by the
sharing of validation test-beds, for example).
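The coding constraints themselves are not listed in this paper; the following fragment is only an assumed illustration of the defensive style such constraints typically enforce in long-lived critical C code (fixed-size buffers instead of dynamic allocation, explicit range checks on every input). It is a sketch, not an excerpt from the actual applications.

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative only: no dynamic allocation, statically sized buffers,
     * and explicit range checking before any value is used.              */

    #define NB_SENSORS 4u

    static int32_t sensor_values[NB_SENSORS];  /* fixed-size storage */

    /* Every input is range-checked; out-of-range values are clamped and
     * signalled through the return code instead of silently propagated. */
    bool acquire_sensor(uint32_t index, int32_t raw, int32_t min, int32_t max)
    {
        if (index >= NB_SENSORS) {
            return false;               /* reject invalid sensor index  */
        }
        if (raw < min) {
            sensor_values[index] = min; /* clamp and report the anomaly */
            return false;
        }
        if (raw > max) {
            sensor_values[index] = max;
            return false;
        }
        sensor_values[index] = raw;
        return true;
    }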
2. Collection of data on operating experience
2.1. The projects
The data analysed hereafter concern software designed
for the safety-critical I&C of three categories of reactors:
The land-based RNG reactor used for qualification
(including I&C qualification) under actual
operational conditions and crew training.
SNG reactors used for the "Le Triomphant" class
SSBN. Two such reactors are currently operational.
PAN reactors used for the "Charles de Gaulle" class aircraft carrier. Two such reactors are currently operational.
About 150 computers and 30 different software applications are currently operational. They have been developed according to identical processes and they perform very similar functions; this homogeneity simplifies the data analysis and enhances the significance of the results.
2.2. The data collection process
The first and main phase of the data analysis is the
exhaustive and redundancy-free collection of data, in
particular for the first steps of the development.
In our context, this is ensured by the configuration
management which identifies each and every software
evolution and maintains a log of the origin of the
modifications. Five causes of modification are
considered:
Modification (M): a functional modification has been
requested.
Development (D): an error has been detected during
the integration phase.
Receipt (R): an error has been detected during the
software validation and functional validation phases.
Field check (S): an error has been detected in the field
(S1) during the installation control procedure planned by the software team,
(S2) during the functional verification procedure planned by the specification team (and especially performed during the reactor trial phase).
Exploitation (E): an error has been detected during the operation phase.
(E1) The error has triggered an erroneous signalling on the man-machine interface, without triggering a logical command.
(E2) The error has triggered a logical command without activating a safety function (thanks to the 2-out-of-3 voting).
(E3) The error has triggered at least two logical commands and thus led to a spurious activation of a safety function.
(E4) The error has put the I&C in an unsafe state (e.g., inability to perform a safety function).
Reports of the M class are due to evolutions of functional needs; they are not studied further. Reports of the D, R and S classes are due to errors detected before operation. Reports of the E class correspond to the activation of the actual residual faults of the software.
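As an illustration of how such reports can be recorded for analysis, the following C declarations sketch one possible classification structure. The type and field names are assumptions made for this sketch and do not reproduce the actual configuration management records.

    /* Hypothetical record for one modification report, following the
     * classification described above.                                 */

    typedef enum {
        CAUSE_M,    /* functional modification requested                    */
        CAUSE_D,    /* error detected during integration                    */
        CAUSE_R,    /* error detected during software/functional validation */
        CAUSE_S1,   /* error detected during installation control           */
        CAUSE_S2,   /* error detected during functional field verification  */
        CAUSE_E1,   /* erroneous signalling, no logical command             */
        CAUSE_E2,   /* one logical command, filtered by the 2-out-of-3 vote */
        CAUSE_E3,   /* spurious activation of a safety function             */
        CAUSE_E4    /* unsafe state of the I&C                              */
    } report_cause_t;

    typedef struct {
        const char    *application;  /* software application concerned      */
        int            version;      /* version in which the fault appeared */
        int            year;         /* creation date of that version       */
        report_cause_t cause;        /* classification used in the analysis */
    } modification_report_t;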
Data related to the D, R, S and E classes are analysed according to 3 complementary points of view:
The consequences, which give an indication of the severity of the error.
The origin, which provides an input to improve the development process (in particular through the analysis of the efficiency of a new method against the errors actually encountered).
The number of occurrences with respect to time. This provides an indication of the evolution (degradation or improvement) of the quality of the software development process.
Results of these three analyses are given hereafter.
3. Results and analyses
From the end of 1988 to the middle of 1997, 363 versions of 30 software applications have been produced for the 3 projects mentioned before. These different applications are based on the same firmware, for which 54 different versions have been produced. The collected data cover more than 5.5 million hours of operation.
3.1. Analysis of consequences
The distribution of the modification reports given in Table 1 prompts several preliminary remarks.
Very few errors were detected during operation (E class). Moreover,
none had an impact on the safety of the reactor,
none had an impact on the availability of the reactor.
Concerning the D, R and S classes, the identification of the phases during which the errors are detected shows that:
The integration tests are highly efficient (the D
class represents 40% of the errors detected during
the development process). In fact, this has been
especially true for the first "tuning" of the
hardware/software interface on the RNG project.
The software and functional validation tests are highly efficient (the R class represents nearly 40% of the errors detected during the development process). This owes much to the close relationship between the development and functional specification teams.
The tests performed in the field are important (the S class represents nearly 15% of the errors detected during the development). This is especially clear for the RNG prototype reactor, which was used for the validation of the I&C (before the divergence of the reactor). This is explained in 3.2.
The high ratio of errors in S2 or E for the RNG
prototype demonstrates the efficiency of a life-size
trial experiment to "tune" the I&C system. Note
that the S2 and E errors detected during the SNG
project are due to evolutions affecting the
hardware or firmware after the initial RNG tuning.
On the contrary, there is no error in S2 or E for the PAN project for which, even though the applications were different from those of the RNG/SNG, the hardware and firmware were identical.

Project \ Class    M    D    R   S1   S2   E1   E2   E3   E4   Total
RNG              157    0   35    5   10    5    3    0    0     215
SNG               55   32   10    0    4    1    1    0    0     103
PAN               20   20    5    0    0    0    0    0    0      45
Total            232   52   50    5   14    6    4    0    0     363
(M: functional modification; D, R, S1, S2: errors detected during the development process; E1 to E4: errors detected in operation)
Table 1. Distribution of detected errors
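For reference, the percentages quoted above follow directly from the totals row of Table 1 (D = 52, R = 50, S1+S2 = 19, i.e. 121 errors detected before operation); a minimal check with these figures hard-coded is sketched below.

    #include <stdio.h>

    /* Shares of the D, R and S classes among the errors detected during
     * development, using the totals row of Table 1.                     */
    int main(void)
    {
        const int d = 52, r = 50, s = 5 + 14;   /* totals from Table 1         */
        const int dev = d + r + s;              /* 121 errors before operation */

        printf("D: %.0f%%\n", 100.0 * d / dev); /* about 43%, quoted as 40%        */
        printf("R: %.0f%%\n", 100.0 * r / dev); /* about 41%, quoted as nearly 40% */
        printf("S: %.0f%%\n", 100.0 * s / dev); /* about 16%, quoted as nearly 15% */
        return 0;
    }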
3.2. Analysis of causes
This paragraph identifies the (technical) origins of the
errors that have been detected, and proposes some
prevention means.
Concerning the errors detected late in the life cycle, the main conclusions are the following:
A high percentage of these errors originates from the man-machine interface functions included in the safety-critical computers. It actually appeared that these functions were harder to specify than expected. The reinforcement of the verification effort, initiated after the RNG experience, led to a significant reduction of the occurrence of such errors on subsequent projects. This effort has been further accentuated with the introduction of tools ensuring the correct specification of the data exchanges related to signalling functions.
The errors that led to the triggering of logical commands (5 in S2, 4 in E2) originate from the interface between hardware and software. These errors concerned more specifically:
issues related to the real-time constraints (even though a cyclic executive is used),
issues related to the I/O components (some of the components showed a behaviour not compliant with their user manual).
The faults leading to these errors are very difficult to activate, in spite of the controls performed in the laboratory. In fact, as it is very difficult to reproduce the real load and asynchronous situations in a laboratory environment, some very particular situations can only be encountered in a life-size trial. It is worth noting that these anomalies are beyond the reach of formal methods and, often, of laboratory tests.
Although the same software has been used for the 3 channels, most of the errors (>80%) only led to the activation of one logical command out of three. This demonstrates the efficiency of the asynchronism between the 3 processing channels against such software errors.
4. Temporal analysis
The analysis is done with respect to the creation date of each version; this allows the evolution of the quality of the design/validation process to be monitored. Such monitoring is very important: the consequences of the evolutions of the design methodologies and of the continuous turnover of the development team must be mastered for software with such a long lifetime. This monitoring must also cover the (potentially negative) effects of the constant improvement of productivity imposed by economic constraints.
A special emphasis is put on errors detected late in the life cycle (S1+S2 and E classes).
The analysis is carried out according to two complementary points of view (see Figure 1):
absolute time, for all projects,
relative time with respect to the reactor's first divergence.
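As an illustration of this tallying, the sketch below counts the late-detected errors (S1, S2 and E classes) per year relative to a reactor's first divergence, reusing the illustrative modification_report_t type introduced in section 2.2. The window size and the bucket convention are assumptions made for this sketch only.

    #define MAX_OFFSET_YEARS 10

    /* Histogram of late-detected errors indexed by the offset between the
     * creation year of the faulty version and the reactor's first
     * divergence (bucket 0 = one year before divergence).                 */
    void tally_late_errors(const modification_report_t *reports, int count,
                           int divergence_year,
                           int histogram[MAX_OFFSET_YEARS])
    {
        for (int i = 0; i < MAX_OFFSET_YEARS; i++) {
            histogram[i] = 0;
        }
        for (int i = 0; i < count; i++) {
            if (reports[i].cause < CAUSE_S1) {
                continue;                    /* keep only S and E classes */
            }
            int bucket = reports[i].year - divergence_year + 1;
            if (bucket >= 0 && bucket < MAX_OFFSET_YEARS) {
                histogram[bucket]++;
            }
        }
    }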
This analysis leads to the following conclusions:
The number of errors found during development (integration, validation, ...) decreases strongly with time, thanks to the improvements mentioned in 3.2.
There is an important decrease in the number of errors detected after the SNG first divergence (both S and E classes), thanks to the improvement of the production process and the elaboration of solutions to the generic hardware/software tuning problems mentioned before.

Figure 1. Errors identified in the RNG and SNG projects (errors/year): E and S1+S2 error counts per year, from t0 to t0+9, with the RNG and SNG first divergences marked.
Figure 1 illustrates the effect of the "tuning" period (initial tuning of the complex hardware/software and real-time issues mentioned previously), during which an important number of errors detected in the field were corrected. This phase covers about one year before and one year after the reactor start-up. Note that one error (dated t0+8) was corrected a long time after its detection, despite strong investigation efforts.
Over the overall period of 10 years, the global yearly error ratio is decreasing despite the turnover of the teams. This indicator is very important for gaining confidence in the production process (organisation, methodology, tools, ...).
5. Conclusions
The analysis of the origin of software errors from data on operating experience is a very fruitful operation: it supports the identification of weak points in the software production process and provides hints to improve it. In particular, it allows investments to be focused on the classes of errors that have actually been shown to occur.
This analysis requires (i) a large sample of software in operation and (ii) a systematic process for the collection of field data. Note that the validity of the conclusions of such an analysis depends on the homogeneity of the sample. Moreover, special care must be taken to cover the very early phases of the software development process, where an improvement of error avoidance/elimination is the most cost-effective.
The approach described in this paper, based on the
software configuration management, has demonstrated
its effectiveness for more than ten years. The experience
gained during this period leads to the following
conclusions:
The hardware problems experienced on the I&C, on both the digital and analog parts, represent the major contribution to the problems actually experienced on reactors. This underlines the high quality of the software development process, but also points out that effort must also be put on hardware-related issues.
The major part of the software errors discovered late concerns the initial tuning of the hardware/software interface and real-time issues. Therefore, special care must be taken over the tuning phase, especially when modifications involving the hardware or having an impact on real time are made.
A test phase involving the real process is mandatory.
Concerning the residual errors actually found:
Asynchronism has demonstrated its efficiency against hardware/software interface errors. Indeed, no simultaneous spurious logical commands (which would not be filtered out by the 2-out-of-3 vote) have occurred.
On the kind of hardware used in our I&C, diversification of the hardware/software interface and of the handling of real-time issues would be difficult to achieve. Furthermore, software diversification on identical hardware components would certainly not be sufficient. Some of the residual errors would probably be avoided by a complete diversification (of hardware and software), but this approach may lead to an increase in interface issues (not considering the operational constraints and costs).
It is not certain that formal specification and design methods would have been able to address this kind of error. We are convinced that the introduction of formal methods would have shown a very low cost/efficiency ratio with respect to our current semi-formal method.
The high-quality development process is sufficient to get a high level of confidence in the application part of the critical software (the part not concerned with hardware/software interfacing and real-time issues).
Economically effective improvements of the design process can be achieved using field data analysis. The efficiency of such improvements can be verified with the use of simple indicators.
Finally, we underline that the development team should be concerned not only with the functions to develop but also with the role of the system in the final installation. This can be reinforced by the proximity of the software development teams to the specification and operation teams. Design is basically a human activity; multiplying methods and tools cannot compensate for the quality and the motivation of the people in charge of the design.
References
[1] ASTREE Odometric Safety Control Unit. Safecomp '92, Zurich, 28-30 October 1992.
[2] Fault Injection Technique. International Workshop on FIT, Göteborg, 17-18 June 1993.
