DTD 5 ARTICLE IN PRESS

Reliability Engineering and System Safety xx (2005) 1–10


www.elsevier.com/locate/ress

A method for evaluating fault coverage using simulated fault injection for digitalized systems in nuclear power plants
Suk Joon Kim (a), Poong Hyun Seong (a), Jun Seok Lee (b,*), Man Cheol Kim (b), Hyun Gook Kang (c), Seung Cheol Jang (c)
(a) Department of Nuclear and Quantum Engineering, Korea Advanced Institute of Science and Technology, 373-1 Guseong-dong, Yuseong-gu, Daejeon 305-701, South Korea
(b) Center for Advanced Reactor Research, Korea Advanced Institute of Science and Technology, 373-1 Guseong-dong, Yuseong-gu, Daejeon 305-701, South Korea
(c) Integrated Safety Assessment Team, Korea Atomic Energy Research Institute, 150 Deokjin-dong Yuseong-gu, Daejeon, 305-353, South Korea

Received 1 April 2004; accepted 3 May 2005

Abstract
The fault coverage of digital systems in nuclear power plants is evaluated using a simulated fault injection method. Digital systems have numerous advantages, such as the sharing of hardware elements and hardware replication of the needed number of independent channels. However, the application of digital systems to safety-critical systems in nuclear power plants has been limited due to reliability concerns. Among the reliability issues, fault coverage is one of the most important factors. In this study, we propose a method for evaluating the fault coverage of safety-critical digital systems in nuclear power plants. The system under assessment is a local coincidence logic processor for a digital plant protection system at Ulchin nuclear power plant units 5 and 6. The assessed system is simplified, and a simulated fault injection method is then applied to evaluate the fault coverage of two fault detection mechanisms. In the simulated fault injection experiment, the fault detection coverage of the watchdog timer is 44.2% and that of the read-only memory (ROM) checksum is 50.5%. Our experiments show that the fault coverage of a safety-critical digital system can be effectively quantified using the simulated fault injection method.
© 2005 Published by Elsevier Ltd.

Keywords: Digital plant protection system; Local coincidence logic processor; Fault coverage; Simulated fault injection; Heartbeat-watchdog timer; ROM
checksum

* Corresponding author. Tel.: +82 42 869 3860; fax: +82 42 869 3849.
E-mail addresses: sukjoonkim@kaist.ac.kr (S.J. Kim), phseong@kaist.ac.kr (P.H. Seong), wahrheit@kaist.ac.kr (J.S. Lee), charleskim@kaist.ac.kr (M.C. Kim), hgkang@kaeri.re.kr (H.G. Kang), scjang@kaeri.re.kr (S.C. Jang).
0951-8320/$ - see front matter © 2005 Published by Elsevier Ltd.
doi:10.1016/j.ress.2005.05.002

1. Introduction

Modern technologies based on both digital hardware and advanced software algorithms are being rapidly developed and widely used. Owing to the progress of instrumentation and control (I & C) technologies for process engineering, modern digital technology is expected to significantly improve the performance and the safety of nuclear power plants.

However, the migration from analog to digital I & C systems within nuclear power plants has increased the complexity of such systems. The I & C systems that are being developed are computer-based, comprising digital hardware and software components. These systems perform complex functions that are essential to the safety-critical requirements of nuclear power plants. To prevent significant risks from arising, these systems must be dependable [1].

The development of a methodology for the probabilistic safety assessment (PSA) of digital I & C systems is a critical issue. Present PSA techniques are used to evaluate the relative effects of contributing events on system-level safety or reliability. In addition, PSA provides a unified means of assessing physical faults, recovery processes, contributing effects, human actions, and other events that have a high degree of uncertainty [2]. However, conventional PSA

techniques cannot adequately evaluate all features of digital systems. Kang and Sung found that fault coverage, common cause failures, and software reliability are the three most critical factors in the safety assessment of digital systems [3]. Among these factors, this study focuses on a method for evaluating the fault coverage of an actual nuclear power plant digital system.

The probability of a fault being properly removed from a fault-tolerant system is referred to as fault coverage. The fault coverage value crucially affects the dependability of a system. Thus, fault coverage is one of the most critical factors in a PSA. There are mathematical and qualitative expressions for the fault coverage. Mathematically, the fault coverage C is defined as the fault processed correctly divided by the fault existence:

$$C = P(\text{fault processed correctly} \,/\, \text{fault existence}) \qquad (1)$$

Qualitatively, coverage is a measure of the system's ability to detect, locate, contain, and recover from the presence of a fault. There are four primary types of fault coverage: (1) fault detection coverage, (2) fault location coverage, (3) fault containment coverage, and (4) fault recovery coverage. Thus, the term 'fault processed correctly' refers to one or more of the four coverage types [4].

Most safety-critical systems manage faults in a fail-safe manner when they detect a fault. For example, the digital protection systems of the Ulchin nuclear power units generate safety signals when they detect a fault. In this case, the fault detection coverage is the matter of interest. The purpose of this study is to introduce a quantitative fault-detection coverage evaluation method for a fail-safe digital system by using simulated fault injection.

This paper is structured as follows. In Section 2, we describe the fault-detection coverage evaluation method. The target system and the local coincidence logic (LCL) in the DPPS are introduced in Section 3. The experiment setup is presented in Section 4. In Section 5, we present some application results from the experiment. We conclude the paper in Section 6.

2. Overview of the evaluation method

2.1. Coverage evaluation method

Several studies have considered a quantitative evaluation of fault detection coverage by using fault injection methods. Khoche et al. proposed a deductive fault simulator for the fault coverage evaluation [5]. Levendel and Menon used hardware description languages to describe small circuits, to which faults such as function variables stuck at 0 or 1 and control faults were applied [6]. Mao and Gulati proposed an RTL fault model and simulation methodology [7]. Hayne and Johnson evaluated coverage by using hardware description and fault injection [8].

However, the foregoing studies concentrated only on testing a single semiconductor component in the manufacturing process. A nuclear power plant digital instrumentation and control (I & C) system is a very large system with many components, which precludes the use of conventional evaluation methods.

Therefore, in this study, we propose an evaluation method for the fault detection coverage of a system, represented as follows:

$$\text{System coverage} = \frac{\text{Total number of detected faults in the system}}{\text{Total number of faults in the system}} \qquad (2)$$

For a given time $\Delta t$, the number of faults at component $i$ is $\lambda_i \Delta t$, where $\lambda_i$ is the fault rate of component $i$ ($i = 1, 2, 3, \ldots, N$).¹ Therefore, the total number of faults in the system can be represented as

$$\text{Total number of faults in the system} = \sum_{i=1}^{N} \lambda_i \Delta t \qquad (3)$$

As shown in the previous section, fault detection coverage is defined as the fault processed correctly divided by the fault existence. The detected faults in component $i$ can be obtained by multiplying the fault detection coverage ($C_{i,d}$) by the total number of faults in component $i$: $C_{i,d} \cdot \lambda_i \Delta t$. The total number of detected faults in the system is represented as follows:

$$\text{Total number of detected faults in the system} = \sum_{i=1}^{N} C_{i,d} \cdot \lambda_i \Delta t \qquad (4)$$

Hence, substituting (3) and (4) into (2), where $\Delta t$ cancels, gives

$$\text{System fault detection coverage} = \frac{\sum_{i=1}^{N} C_{i,d} \cdot \lambda_i}{\sum_{i=1}^{N} \lambda_i} \qquad (5)$$

The fault detection coverage of each component is evaluated using the simulated fault injection experiments.

¹ In this paper, we treat the fault rate as the failure rate from MIL-HDBK-217F. That is, all fault occurrences in the system lead to a failure state.

2.2. Fault injection method

There are three types of faults: permanent faults, transient faults, and intermittent faults.

Permanent faults are related to irreversible physical defects in the circuit. These defects can be produced during the manufacturing process or during normal operation. Transient faults appear during the operation of a circuit, and the duration of this type of fault is very short. Intermittent faults are similar to transient faults, being temporary, but this type of fault appears and disappears repeatedly in time, without

periodical behavior. In this study, we consider only permanent stuck-at faults as possible faults in the system. A safety-critical system utilizes multiple redundancies. Therefore, the effect of transient faults or intermittent faults can be neglected.

There are three methods by which to evaluate the fault detection coverage of a system: hardware-implemented fault injection, software-implemented fault injection, and simulated fault injection. Each method is summarized as follows [10], [11].

2.2.1. Hardware implemented fault injection

This is carried out at the physical level, disturbing the hardware with parameters of the environment (such as heavy ion radiation and electromagnetic interference) or modifying the value of integrated circuit pins.

2.2.2. Software implemented fault injection

The objective of this technique is to reproduce, at the software level, errors that would have been produced by faults occurring in either hardware or software. This technique is based on different practical types of injection, such as the modification of memory data, or the mutation of the application software or the lowest service layers (at operating system level, for example).

2.2.3. Simulated fault injection

In this technique, the system being assessed is simulated on another computer system. Faults are induced by altering the logical values of the model elements during the simulation.

This work focuses on the simulated fault injection technique to evaluate fault detection coverage, because the hardware-implemented fault injection technique is difficult to use, requires expensive hardware, and the faults cannot be controlled and are limited by the complexity of the system. In addition, the software-implemented fault injection technique concentrates on software rather than hardware. For these reasons, we applied the simulated fault injection technique to a digital system model, whereby faults are simulated by modifying the hardware description code.

3. Target system

The target system for the fault coverage evaluation is an LCL processor in the digital plant protection system (DPPS). Fig. 1 shows a block diagram of the DPPS [12]. The DPPS protects the core fuel design limits and the reactor coolant system pressure boundary by tripping the reactor when monitored plant conditions exceed design limits. It is one of the safety-critical digital systems in Ulchin nuclear power plant units 5 and 6.

Fig. 1. DPPS trip path block diagram.

Each LCL processor in the DPPS performs the coincidence logic function. The LCL processors receive digitized trip demand signals from four bistable processors and provide the binary outputs that indicate whether two or more channels are in a trip condition. The individual 2 out of 4 outputs of an LCL processor are appropriately combined to generate an initiation signal for the particular function. Fig. 2 shows the 2 out of 4 coincidence logic. The 2 out of 4 coincidence logic function is coded with a C compiler for the 8051 and stored in ROM.

For the experiment, the LCL processor is realized in a hardware description language. The coincidence logic with self-checking is programmed and stored in the ROM. The results of the local coincidence logic operation and the error detection signals are sent to the output pins of the 8051 CPU.

4. Application experiment setup

The LCL system is a digitalized system, and the major hardware consists of a CPU, RAM, and ROM. An actual LCL system is very complex and, for convenience, must be simplified. Fig. 3 shows a simplified block diagram of the LCL system used. This simplified system is designed to imitate the LCL processor in the DPPS. The simplified system comprises a CPU, RAM, and ROM.
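
As a worked illustration of Eq. (5), the system-level figure is a failure-rate-weighted average of the component coverages. The following Python sketch is not the authors' code: the function name `system_coverage` and the dictionary layout are assumptions made for illustration, and the numeric inputs are the component coverages and failure rates as they appear in Eq. (7) for the heartbeat-watchdog timer case.

```python
def system_coverage(coverage, failure_rate):
    """Failure-rate-weighted average of component coverages, as in Eq. (5).

    coverage:     component name -> fault detection coverage C_i,d (0..1)
    failure_rate: component name -> failure rate lambda_i (any common unit)
    """
    total_rate = sum(failure_rate.values())
    detected_rate = sum(coverage[name] * failure_rate[name] for name in failure_rate)
    return detected_rate / total_rate


# Component coverages reported for the heartbeat-watchdog timer case.
coverage = {"CPU": 0.743011, "ROM": 0.132071, "RAM": 0.019047}
# Failure rates in faults per 10^6 h, taken as printed in Eq. (7).
failure_rate = {"CPU": 0.273680, "ROM": 0.083558, "RAM": 0.133009}

c_system = system_coverage(coverage, failure_rate)
print(f"System fault detection coverage: {c_system:.4f}")
```

With these inputs the sketch gives roughly 0.4425, in line with the 44.2% reported for the heartbeat-watchdog timer. Because the time interval cancels in Eq. (5), only the relative failure rates matter.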

Fig. 2. 2 out of 4 coincidence logic (inputs: parameter trip signals and bypass signals for channels A, B, C, and D).
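
The coincidence function itself is small enough to sketch. The Python below is a minimal illustration, not the C code stored in the ROM: the function name `two_out_of_four` is hypothetical, and the treatment of bypass signals (a bypassed channel is simply ignored, so two of the remaining channels must still demand a trip) is an assumption, since the paper does not spell out the bypass logic.

```python
def two_out_of_four(trips, bypasses=(False, False, False, False)):
    """Return True when at least two non-bypassed channels demand a trip.

    trips:    four booleans, trip demand of channels A, B, C, D
    bypasses: four booleans, True if the channel is bypassed (assumed semantics)
    """
    if len(trips) != 4 or len(bypasses) != 4:
        raise ValueError("expected exactly four channels (A, B, C, D)")
    # A bypassed channel cannot contribute to the vote.
    active = [trip and not bypass for trip, bypass in zip(trips, bypasses)]
    return sum(active) >= 2


print(two_out_of_four((True, True, False, False)))   # two channels tripped
print(two_out_of_four((True, False, False, False)))  # only one channel tripped
```

A single spurious trip demand therefore does not initiate the protective action, while any two coincident demands do.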


Fig. 3. Simplified LCL system.

4.1. Coverage evaluation for the experiment system

In Section 2, we suggested that the fault coverage of a system could be evaluated by weighting the component fault coverages with their relative fault rates. For the experiment, we simplify the system to a CPU, RAM, and ROM. Therefore, expression (5) is expanded to the following:

$$\text{System fault detection coverage} = \frac{\sum_{i=1}^{N} C_{i,d} \cdot \lambda_i}{\sum_{i=1}^{N} \lambda_i} = \frac{\lambda_{CPU} C_{CPU} + \lambda_{RAM} C_{RAM} + \lambda_{ROM} C_{ROM}}{\lambda_{CPU} + \lambda_{RAM} + \lambda_{ROM}} \qquad (6)$$

where,


$C_{CPU}$, $C_{RAM}$, $C_{ROM}$: fault detection coverage of the CPU, RAM, and ROM
$\lambda_{CPU}$, $\lambda_{RAM}$, $\lambda_{ROM}$: failure rate of the CPU, RAM, and ROM

Failure rates of the CPU, RAM, and ROM are evaluated with MIL-HDBK-217F [9]. Table 1 shows the failure rates of the respective components.

Table 1
Failure rate of each component

| | CPU | ROM | RAM |
|---|---|---|---|
| Failure rate (faults/10^6 h) | 0.273860 | 0.083558 | 0.133009 |

4.2. Simulation system

As described in the previous section, the LCL processor system consists of a CPU, PROM, SRAM, and I/O devices. Because of the system's high complexity, it is very difficult to describe the entire LCL processor system with a hardware description language only. In this work, the LCL processor system is simplified to three main components: CPU, RAM, and ROM. The I/O is replaced with the CPU's input and output pins. We selected the 8051 as the CPU because, relative to other CPUs, it is easy to describe the hardware and to program the functions of the LCL processor. An 8051 is capable of addressing 64 KB of program memory and 64 KB of data memory. The LCL processor, with a self-checking algorithm, is programmed with the C compiler for the 8051 microcontroller and is stored in the ROM. The simplified computer hardware description is used to simulate the fault injection and the error detection of the LCL processor in the DPPS.

Fig. 4 shows a block diagram of the 8051 processor simulator [13]. A typical 8051 contains a CPU with a Boolean processor, five or six interrupts, two or three 16-bit timers/counters, RAM, and ROM. In the CPU, there are three important parts: the arithmetic logic unit (ALU), the control unit (CU), and the registers. The ALU performs operations such as addition and Boolean AND. The CU is responsible for fetching instructions from the main memory and determining their type. The registers form a small, high-speed memory used to store temporary results and certain control information, such as the program counter, accumulator, or instruction register.

The simplified computer performs its functions according to the following steps [14]:

(1) Fetch the next instruction from memory into the instruction register.
(2) Change the program counter so that it points to the following instruction.
(3) Determine the type of instruction just fetched.
(4) If the instruction uses data in memory, determine where they are.
(5) Fetch the data, if any, into internal CPU registers.
(6) Execute the instruction.
(7) Store the results.
(8) Go to step (1) to begin executing the following instruction.

We select the Visual C++ language as the hardware description language for the experiment. The structures of the hardware description are as follows.

4.2.1. The 8051 header file

The 8051 header file defines the 8051 internal structure. Fig. 5 shows the structure of the header file. Instructions, registers, and memory, such as RAM and ROM, are declared in the header file. All of the declarations are used for the function of the 8051.

4.2.2. The 8051 source file

The 8051 source file performs the functions of the 8051. Fig. 6 shows the structure of the source file. It fetches and determines the appropriate types of instructions from the program, reads data, if necessary, and executes the instructions. Most of those functions are located in the Simulate function. Fig. 7 shows the structure of the Simulate function. This function continually performs the following processes: fetch the instruction, determine its type, locate the data, fetch the data from memory, perform the arithmetic or logical operation, and store the data.
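
The fetch, decode, and execute cycle described above can be sketched on a toy machine. This is only an illustration of the loop structure, not the authors' Visual C++ simulator: the four opcodes, the single accumulator, and the `simulate` signature are hypothetical and do not correspond to real 8051 encodings.

```python
# Hypothetical opcodes for a toy accumulator machine (not 8051 encodings).
LOAD, ADD, STORE, HALT = 0x01, 0x02, 0x03, 0xFF


def simulate(rom, ram, max_steps=1000):
    """Run the toy program in `rom` against `ram`; return (accumulator, ram)."""
    pc = 0    # program counter
    acc = 0   # accumulator
    for _ in range(max_steps):
        ir = rom[pc]        # fetch the next instruction into the instruction register
        pc += 1             # point to the following byte
        if ir == HALT:      # determine the instruction type
            return acc, ram
        addr = rom[pc]      # locate the data the instruction uses
        pc += 1
        if ir == LOAD:
            acc = ram[addr]                    # fetch data into a CPU register
        elif ir == ADD:
            acc = (acc + ram[addr]) & 0xFF     # execute an 8-bit addition
        elif ir == STORE:
            ram[addr] = acc                    # store the result
        else:
            raise ValueError(f"unknown opcode {ir:#04x}")
    # A faulted program may never reach HALT, so the loop is bounded.
    raise RuntimeError("step limit reached; possible infinite loop")


program = [LOAD, 0, ADD, 1, STORE, 2, HALT]
memory = [5, 7, 0]
acc, memory = simulate(program, memory)
print(acc, memory)
```

The step limit is a design choice for fault injection work: a corrupted instruction stream can loop forever, and a bounded loop lets the experiment record that outcome instead of hanging.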
Fig. 4. Block diagram of the 8051 processor simulator.

4.3. Fault type

Permanent stuck-at faults are used in this study, because they can change the order of an execution or inhibit an execution. Permanent faults are related to irreversible physical defects in the circuit that remain indefinitely. Such defects can originate during the manufacturing process or during normal operation. Permanent faults occurring during normal operation can be the result of wear, through any of numerous mechanisms, that initially causes intermittent

faults until finally provoking permanent faults. These cases are considered to be possible defects of the LCL processor.

Fig. 5. The 8051 header file.

Fig. 6. The 8051 source file.

The stuck-at faults are circuit failures equivalent to one or several circuit nodes being fixed at logic 0 (stuck-at 0), logic 1 (stuck-at 1), or wrong data. The relative simplicity of this model has led to its wide use in the industry. Fig. 8 shows an example of the stuck-at fault injection. For the experiment, the stuck-at fault injection method is realized by data modification. In the RAM and ROM cases, one byte of original data is modified by AND or OR operations so that one bit is fixed to 0 or 1. In the CPU cases, one byte of original data is replaced with wrong data.

The stuck-at fault operation is applied continuously to each address in the simple computer system components until the program is completed. This injection method provides an easy way to implement a permanent stuck-at fault by modifying signals and variables. Stuck-at faults that occur in a system can cause infinite loops or instruction errors that lead to wrong results.

4.4. Number of experiments

In this study, many experiments are performed to evaluate the fault detection coverage. In the simulations, an input case '0000 0000' and the stuck-at fault are considered. The fault cases can be divided into CPU, ROM, and RAM cases.

Table 2 presents the number of fault cases in the CPU and Fig. 9 shows how to inject the stuck-at fault in the CPU. The fault is injected into three components of the CPU: the address decoder, the instruction register, and the program counter; the CPU function is very sensitive to a fault occurrence in any of these components. For the address decoder fault, a normal instruction is replaced with another instruction. The simulator has 111 instructions, and 110 possible wrong instruction assignment cases can be generated for each, making the total possible cases for the decoder 12,210. The instruction register is a small, high-speed memory that holds the instruction currently being executed. In the 8051, the instruction register has 8 bits; therefore, the



number of possible stuck-at fault cases is 256. The program counter is a register that points to the next instruction to be executed. A 16-bit program counter is used in the 8051, and the number of possible stuck-at fault cases is 65,536. Therefore, the total number of experiments on the CPU is 78,002.

Fig. 7. Structure of the Simulate function. Steps shown: fetch next instruction, point to the next instruction, decode the instruction, locate data, fetch data from memory, advance the process. Functions shown: LoadHex(), Load(), Hex2short(), Decode(), Setbit(), GetBit(), ClearBit(), GetRegisterBank(), ProgramCompletion(), PrintPorts().

Table 3 shows the number of experiments on RAM and ROM. In the RAM and the ROM, both stuck-at 1 and stuck-at 0 faults are injected. The ROM has 434 bytes and the RAM has 384 bytes. Therefore, the total number of experiments on them is 13,088. All faults are injected into the system sequentially. For instance, one permanent stuck-at 0 fault is injected into RAM and simulated in the hardware description; the result is then validated from the description's output ports, after which the fault is moved to another bit and the simulation is repeated.
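
The bit-level injection and the experiment counts above can be checked with a few lines of Python. This is a sketch under stated assumptions rather than the authors' implementation: the helper names `stuck_at_1`, `stuck_at_0`, and `memory_fault_cases` are hypothetical, and the memory sizes are the 434-byte and 384-byte figures listed in Table 3.

```python
def stuck_at_1(byte, bit):
    """Force one bit of an 8-bit value to 1 by OR-ing with a one-hot mask."""
    return (byte | (1 << bit)) & 0xFF


def stuck_at_0(byte, bit):
    """Force one bit of an 8-bit value to 0 by AND-ing with the inverted mask."""
    return byte & ~(1 << bit) & 0xFF


def memory_fault_cases(num_bytes, bits_per_byte=8):
    """One stuck-at-1 and one stuck-at-0 experiment per memory bit."""
    return 2 * num_bytes * bits_per_byte


# Bit D0 of 1010 0100 stuck at 1, then bit D0 of 1010 0101 stuck at 0.
print(f"{stuck_at_1(0b10100100, 0):08b}")
print(f"{stuck_at_0(0b10100101, 0):08b}")

# Experiment counts: decoder (111 x 110), instruction register (2^8),
# program counter (2^16), plus the two memories.
cpu_cases = 111 * 110 + 2**8 + 2**16
memory_cases = memory_fault_cases(434) + memory_fault_cases(384)
print(cpu_cases, memory_cases)
```

To model a permanent fault, the mask would be applied on every access to the faulted address for the whole run, not once at the start.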

4.5. Fault detection mechanism

In order to detect those faults, two fault detection mechanisms are considered: the heartbeat-watchdog timer and the ROM checksum.

Fig. 10 shows the heartbeat-watchdog timer fault-detection mechanism. The heartbeat-watchdog timer method involves heartbeat signals, which are emitted at regular intervals after executing the logic, and a watchdog timer that counts the signals. When faults are injected into the simplified system, a fault occurrence can be detected by checking for a watchdog timer time-out.
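
A heartbeat-watchdog pair can be modeled in a few lines. The sketch below is an assumption-laden illustration, not the mechanism of Fig. 10 as implemented by the authors: the class name `WatchdogTimer`, the tick-based timing, and the helper `run` are all hypothetical. It treats the watchdog as a counter that is reset by each heartbeat and flags a time-out if too many ticks pass without one.

```python
class WatchdogTimer:
    """Flags a time-out when no heartbeat arrives within `timeout_ticks` ticks."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.ticks_since_heartbeat = 0
        self.timed_out = False

    def heartbeat(self):
        # A heartbeat from the monitored logic restarts the count.
        self.ticks_since_heartbeat = 0

    def tick(self):
        # Called once per time step by the simulation loop.
        self.ticks_since_heartbeat += 1
        if self.ticks_since_heartbeat > self.timeout_ticks:
            self.timed_out = True   # latched: stays set once raised
        return self.timed_out


def run(heartbeat_pattern, timeout_ticks=3):
    """Feed a sequence of booleans (heartbeat present per tick) to a watchdog."""
    watchdog = WatchdogTimer(timeout_ticks)
    for beat in heartbeat_pattern:
        if beat:
            watchdog.heartbeat()
        watchdog.tick()
    return watchdog.timed_out


healthy = [True, False, False] * 4   # a heartbeat every third tick
stalled = [True] + [False] * 5       # logic stops after the first heartbeat
print(run(healthy), run(stalled))
```

A fault that halts the program or traps it in a loop that never reaches the heartbeat output is caught this way; a fault that corrupts data while the heartbeat keeps arriving is not.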
Table 2
The number of fault cases in the CPU

| Fault | Number of fault cases |
|---|---|
| Address decoder fault | 111 × 110 = 12,210 (the number of addresses in the decoder is 111) |
| Instruction register fault | 2^8 = 256 (the instruction register has 8 bits) |
| Program counter fault | 2^16 = 65,536 (the program counter has 16 bits) |
| Total number of experiments | 78,002 |

Fig. 8. An example of stuck-at fault injection: (a) stuck-at 1 fault (for RAM and ROM), in which the original data are OR-ed with the reference data 0000 0001; (b) stuck-at 0 fault (for RAM and ROM), in which the original data are AND-ed with the reference data 1111 1110; (c) stuck-at fault (for CPU), in which the original byte is replaced by wrong data.

The ROM checksum is useful in safety-critical systems or in other applications that require a very high degree of

reliability [15]. The checksum adds all of the program memory locations within the ROM address range. The final data value that is accumulated from the addition of the ROM locations is compared with the checksum result in Intel hexadecimal format. If the result matches the checksum value, the program ROM is validated. Otherwise, a possible error condition is flagged.

Fig. 9. An example of stuck-at fault injection in CPU: (a) fault example (address decoder), in which a normal instruction such as 01110100 (MOV) is replaced by a faulted instruction such as 00000001 (AJMP), 00000110 (INC), or 11110001 (ACALL); (b) fault example (instruction register, program counter), in which the register content is replaced by faulted data.

Fig. 10. Heartbeat-watchdog timer fault detection mechanism.

When the permanent stuck-at faults injected into the hardware description are running, one byte cell will be accessed with an incorrect address in the decoder, and the byte cell will be accessed with an incorrect instruction in the instruction register (IR). In the program counter (PC), the faults affect the PC such that it is not able to perform the count function. In the heartbeat-watchdog timer case, the heartbeat signals are emitted at regular intervals after executing the 2 out of 4 coincidence logic. The heartbeat signals are then used to detect faults. In the ROM checksum case, a checksum error signal is emitted before executing the 2 out of 4 coincidence logic, if the ROM has any error. The checksum error signal is used to detect the fault.

5. Application experiment result

5.1. Heartbeat-watchdog timer case

We can calculate the fault detection coverage of the system by the simulation experiment.

Table 3
The number of experiments in RAM and ROM

| | Number of fault cases | Total number of experiments |
|---|---|---|
| ROM | Stuck-at 1: 434 byte × 8 bit/byte; Stuck-at 0: 434 byte × 8 bit/byte | 6944 |
| RAM | Stuck-at 1: 384 byte × 8 bit/byte; Stuck-at 0: 384 byte × 8 bit/byte | 6144 |

Table 4
Results of the simulation with injected stuck-at faults

| Component | Fault detection coverage |
|---|---|
| CPU | 0.743011 (Decoder: 0.229538, Instruction register: 0.999496, Program counter: 0.999999) |
| RAM | 0.019047 (s-a-1: 0.026049, s-a-0: 0.012045) |
| ROM | 0.132071 (s-a-1: 0.154378, s-a-0: 0.109764) |

Table 4 shows the fault detection coverage results, including those for the three fault targets in the CPU. When 12,210 stuck-at faults are injected into the decoder, about 23.0% of them are detected. When 256 stuck-at faults are injected into the instruction register and 65,536 stuck-at faults are injected into the program counter, nearly 100.0% of the faults are detected in each case. The fault detection coverage of the CPU is found to be 74.3% by calculating the average of the three cases above.

Tables 5 and 6 show the results of the fault detection coverage of the eight sections in the RAM and the ROM, respectively. The fault simulations for the RAM and the ROM are performed with two cases: stuck-at 1 and stuck-at 0. These divisions are each subdivided into eight cases, because 8-bit RAM and ROM are selected for the simulation experiment. The fault detection coverage for each of the 8 cases is evaluated by experiment. Then, the fault detection coverage for the stuck-at 0 case and the stuck-at 1 case are obtained from the average of the 8 cases. Finally, the fault detection coverage of the RAM and the ROM cases are obtained from the average of the two cases: stuck-at

0 and stuck-at 1. Accordingly, the fault detection coverage of the RAM is 1.9%, and that of the ROM is 13.2%.

Table 5
The results of the fault detection coverage of the eight sections in RAM

| Fault detection coverage per bit | Stuck-at 1 | Stuck-at 0 |
|---|---|---|
| 1st bit | 0.028646 | 0.007813 |
| 2nd bit | 0.020888 | 0.007813 |
| 3rd bit | 0.026042 | 0.015625 |
| 4th bit | 0.028646 | 0.013021 |
| 5th bit | 0.031250 | 0.013021 |
| 6th bit | 0.031250 | 0.007813 |
| 7th bit | 0.024380 | 0.015625 |
| 8th bit | 0.018229 | 0.015625 |
| Average | 0.026049 | 0.012045 |

Fault detection coverage: 0.019047

Table 6
The results of the fault detection coverage of the eight sections in ROM

| Fault detection coverage per bit | Stuck-at 1 | Stuck-at 0 |
|---|---|---|
| 1st bit | 0.147465 | 0.099307 |
| 2nd bit | 0.156682 | 0.110599 |
| 3rd bit | 0.129032 | 0.092166 |
| 4th bit | 0.112903 | 0.094470 |
| 5th bit | 0.184332 | 0.073733 |
| 6th bit | 0.163594 | 0.206770 |
| 7th bit | 0.218894 | 0.105991 |
| 8th bit | 0.122120 | 0.092166 |
| Average | 0.154378 | 0.109764 |

Fault detection coverage: 0.132071

The results of the fault detection coverage and the failure rate are applied to Eq. (6) to evaluate the fault detection coverage of the simplified LCL system. The calculation process and the results of the fault detection coverage are as follows.

$$C_{system} = \frac{\lambda_{CPU} C_{CPU} + \lambda_{ROM} C_{ROM} + \lambda_{RAM} C_{RAM}}{\lambda_{CPU} + \lambda_{ROM} + \lambda_{RAM}} = \frac{0.743011 \times 0.273680 + 0.132071 \times 0.083558 + 0.019047 \times 0.133009}{0.273680 + 0.083558 + 0.133009} = 0.442463\ (\approx 44.2\%) \qquad (7)$$

5.2. Comparison with ROM checksum case

In the above-mentioned experiments, heartbeat signals are emitted at regular intervals after executing the logic without a self-checking algorithm. After faults are injected into the target system, a watchdog timer can detect a fault occurrence by checking whether heartbeat signals are emitted from the simplified system. In this case, slightly less than 50% of the faults are detected. This result shows that the fault detection coverage of the target system is relatively low. In this experiment, the checksum method is used to increase the fault detection coverage of the ROM.

Fig. 11 provides a comparison of the checksum case result with that of the watchdog timer case. The fault detection coverage of the ROM is increased to about 50.0% with the checksum method, compared to 13.2% fault detection coverage by the heartbeat-watchdog timer. In addition, the fault detection coverage of the simplified LCL system is increased to 50.5%, compared with fault detection coverage of 44.2% for the heartbeat-watchdog timer. This is because the checksum method is a self-checking method that adds all of the program memory locations within the ROM address range and then compares its result to the original one. Therefore, a logic mutation of one bit can lead to a calculation result problem and cause a system error. Such problems can be detected using the checksum method. However, if a stuck-at 0 fault is injected into an unused part of the ROM, that fault does not affect the checksum result. Therefore, stuck-at 0 faults that are injected into unused locations cannot be detected. The fault detection coverage of the ROM and the system are about 50% when using the ROM checksum method.

6. Conclusions

In this study, we introduced a fault-detection coverage evaluation method to increase the safety of nuclear power plant digital systems. To evaluate the proposed method, a simulated fault injection experiment is performed on a simplified system with a program. The LCL system in the DPPS is selected for assessment. Because of its complexity, the LCL system was simplified. The simplified system consists of a CPU, RAM, and ROM. The permanent stuck-at fault is selected as a possible fault in the system and injected using code modification. The two out of four coincidence logic with a fault detection algorithm is installed in the ROM. The heartbeat-watchdog timer and the ROM checksum method are selected as fault detection methods for the system. Then, the fault detection coverage of each component is evaluated. The fault detection coverage of the system is calculated as the sum of each component's coverage weighted by its failure rate, divided by the system's failure rate.

The application result of the experiment is as follows. In the case of the heartbeat-watchdog timer, the fault detection coverage is approximately 44.2%. Fault detection

coverage of digital systems can be improved by using self-checking algorithms. In this study, the ROM checksum method is used as a self-checking algorithm. If the ROM checksum method is applied, the fault detection coverage of the system increases to 50.5%.

Fig. 11. Comparison of the results (bar chart; vertical axis from 0.0% to 60.0%).

| | Fault detection coverage of ROM | Fault detection coverage of simplified system |
|---|---|---|
| Heartbeat-watchdog timer | 13.2% | 44.2% |
| Checksum | 50.0% | 50.5% |

Our results show that the proposed method is a useful tool for the design of safety-critical digital systems used in nuclear power plants.

Acknowledgements

This work is partly supported by the Korean National Research Laboratory (NRL) Program.

References

[1] Kaufman LM, Johnson BW. Embedded digital system reliability and safety analyses, NUREG/GR-0200; 2001.
[2] National Research Council. Digital instrumentation and control systems in nuclear power plants. Washington, DC: National Academy Press; 1997.
[3] Kang Hyun Gook, Sung Taeyong. An analysis of safety-critical digital systems for risk-informed design. Reliab Eng Syst Saf 2002;78(3):307–14.
[4] Dugan Joanne B, Trivedi Kishor S. Coverage modeling for dependability analysis of fault-tolerant systems. IEEE Trans Comput 1989;38(6):775–87.
[5] Khoche A, Sherlekar SD, Venkatesh G, Venkateswaran R. A behavioral fault simulator for Ideal. IEEE Design Test Comput 1992;9(4):14–21.
[6] Levendel YH, Menon PR. Test generation algorithms for computer hardware description languages. IEEE Trans Comput 1982;C-31:577–89.
[7] Mao W, Gulati R. Improving gate level fault coverage by RTL fault grading. Proceedings of the international test conference; 1996. pp. 150–159.
[8] Hayne RJ, Johnson BW. Behavioral fault modeling in a VHDL synthesis environment. Proceedings of the 17th VLSI test symposium; 1999. pp. 333–340.
[9] MIL-HDBK-217F. Reliability prediction of electronic equipment; Dec 2 1991.
[10] Hsueh M, Tsai T, Iyer RK. Fault injection techniques and tools. IEEE Comput 1997;30(4):75–82.
[11] Clark JA, Pradhan DK. Fault injection: a method for validating computer-system dependability. IEEE Comput 1995;28:47–56.
[12] Technical Manual for Digital Plant Protection System (DPPS) for Ulchin 5 and 6, Westinghouse Electric Company LLC; 2002.
[13] AT89 Series Hardware Description, Atmel Corporation; 2000.
[14] Tanenbaum Andrew S. Structured computer organization. Englewood Cliffs, NJ, USA: Prentice-Hall International; 1984.
[15] Siewiorek DP, Swarz RS. Reliable computer systems: design and evaluation. A K Peters; 1998.
