Safety System Availability

1oo2D and TMR

Stephen R. Wilton
Engineering Associate
Central Technical Services
Syncrude Canada Limited
Fort McMurray, AB
Canada, T9H 3L1

KEYWORDS

Safety Instrumented System, Mean Time to Failure, Risk Reduction Factor, 1oo2D, TMR

ABSTRACT

Three simplifications in the calculation of safety instrumented system mean time to failure and risk
reduction factor clarify the dependence on the various factors involved. These include failure rate, fail
safe fraction, diagnostics coverage and common cause. Approximate MTTF and RRF formulas are
developed for 1oo2D and TMR systems.

INTRODUCTION

Safety Instrumented Systems Standard ISA-S84.01 requires evaluation of the target safety integrity level
(SIL) for the process. The SIL defines the range in which the average probability of failure on demand
(PFD) must fall. The safety instrumented system (SIS) that protects the process must meet this target.
Probabilities for the SIS are calculated from Markov models using multiplication and inversion of large
matrices. This makes it difficult to retain an understanding of the relative importance of the various
factors involved.

Three simplifications help to clarify the relationships:

1 States due to single channel failures detected by the diagnostics are short lived if quickly repaired.
They can be ignored. This reduces the dimensions of the matrices.

2 If all coverage factors, all common cause factors and all overt failure factors are assumed equal,
summation over channel components is eliminated and the transition probability expressions are shorter.

3 Transition probabilities for failures are very small and only the first few terms of the matrix
binomial expansion are significant. This greatly simplifies the hazard or risk reduction factor expression
which otherwise involves a matrix raised to a very high power.

1oo2D SYSTEMS

These systems have two parallel channels, each with diagnostics that disable the channel output when a
channel failure is detected. They have two modes [1]: calculate / calculate (cc) and calculate / verify (cv).
The Markov model shows the failure rates and repair rates associated with system transitions from
any state to any other state. States for this dual redundant system with diagnostics include:

1 both channels OK

2 one channel failed safe and detected by diagnostics

3 one channel failed dangerous and detected by diagnostics

4 one channel failed safe and undetected by diagnostics (cc)

4a one channel failed dangerous and undetected by diagnostics (cv)

5 both channels failed safe

6 one or both channels failed dangerous and undetected by diagnostics

Repair rates for detected single channel failures are much higher (shorter time) than the failure rates. The
system is in these states (2 and 3) for such a relatively short time that these states can be ignored. The
transition probability matrix P [2] reduces to

\[
P = \begin{bmatrix}
1-\lambda_{s1} & \lambda_{1-4} & \lambda_{1-5} & \lambda_{1-6} \\
0 & 1-\lambda_{s4} & \lambda_{4-5} & \lambda_{4-6} \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
\]

$\lambda_{i-j}$ = transition probability from state i to state j

$\lambda_{si}$ = sum of the other terms in the i-th state row

The system failure rates are a function of the single channel parameters:

λ      failure rate per hour
S      fail safe fraction (overt)
1-S    fail dangerous fraction (covert)
C      fraction detected by diagnostics (coverage)
1-C    fraction undetected
β      fraction failing two channels at once (common cause)
1-β    fraction failing only one channel

assuming all Cs equal, all βs equal, and all Ss equal. Typical order of magnitude is 10^-5 for λ, 0.01 for β,
0.99 for C, and 0.5 for S. The transition probabilities for the two modes are listed in the appendix along
with approximations obtained by deleting the less significant terms.
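
As a rough numerical illustration, the sketch below evaluates the unapproximated 1oo2D appendix expressions at the typical values quoted above and assembles the reduced matrix over states 1, 4, 5 and 6. The function names and the use of Python are assumptions made for illustration only; they are not part of the original analysis.

```python
def rates_1oo2d(lam, S, C, beta, mode="cc"):
    """Per-hour transition probabilities (unapproximated appendix forms)."""
    if mode == "cc":
        r = {
            "1-4": 2 * (1 - C) * (1 - beta) * S * lam,
            "1-5": beta * (S + C * (1 - S)) * lam,
            "1-6": (1 - C) * (beta + 2 * (1 - beta)) * (1 - S) * lam,
            "4-5": (S + C * (1 - S)) * lam,
            "4-6": (1 - C) * (1 - S) * lam,
        }
    elif mode == "cv":
        r = {
            "1-4": 2 * (1 - C) * (1 - beta) * (1 - S) * lam,
            "1-5": (C * beta + (1 - C) * beta * S
                    + 2 * (1 - C) * (1 - beta) * S) * lam,
            "1-6": (1 - C) * beta * (1 - S) * lam,
            "4-5": (1 - C) * S * lam,
            "4-6": ((1 - C) * (1 - S) + C) * lam,
        }
    else:
        raise ValueError("mode must be 'cc' or 'cv'")
    # Row sums: total probability per hour of leaving state 1 or state 4.
    r["s1"] = r["1-4"] + r["1-5"] + r["1-6"]
    r["s4"] = r["4-5"] + r["4-6"]
    return r


def p_matrix_1oo2d(r):
    """Reduced transition matrix over states 1, 4, 5, 6 (5 and 6 absorbing)."""
    return [
        [1 - r["s1"], r["1-4"], r["1-5"], r["1-6"]],
        [0.0, 1 - r["s4"], r["4-5"], r["4-6"]],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]


if __name__ == "__main__":
    lam, S, C, beta = 1e-5, 0.5, 0.99, 0.01  # typical values from the text
    for mode in ("cc", "cv"):
        r = rates_1oo2d(lam, S, C, beta, mode)
        approx_s1 = (beta + 2 * (1 - C)) * lam  # appendix approximation
        print(mode, "s1:", r["s1"], "approx:", approx_s1, "s4:", r["s4"])
```

With these inputs the row sums should land close to the appendix approximations, [β+2(1-C)]λ for state 1 and λ for state 4, in both modes.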

MTTF1oo2D

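Q is the upper left partition of the P matrix,

\[
Q = \begin{bmatrix}
1-\lambda_{s1} & \lambda_{1-4} \\
0 & 1-\lambda_{s4}
\end{bmatrix}
\]

I is the identity matrix,

\[
I = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
\]

and

\[
N = [I - Q]^{-1}
  = \frac{1}{\lambda_{s1}\lambda_{s4}}
    \begin{bmatrix}
    \lambda_{s4} & \lambda_{1-4} \\
    0 & \lambda_{s1}
    \end{bmatrix}
\]

The mean time to failure is the sum across the top row of the N matrix.

\[
\mathrm{MTTF} = \frac{1}{\lambda_{s1}} \left( 1 + \frac{\lambda_{1-4}}{\lambda_{s4}} \right)
\approx \frac{1}{[\, 2(1-C) + \beta \,]\,\lambda}
\]

substituting expressions from the appendix.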
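
To give a feel for how close the approximation is, the sketch below evaluates both MTTF expressions for the calculate / calculate mode at the typical parameter values. The function name is hypothetical and the inputs are the order-of-magnitude figures quoted earlier, not measured data.

```python
def mttf_1oo2d_cc(lam, S, C, beta):
    """MTTF in hours for 1oo2D (cc): top-row sum of N, and the approximation."""
    # Unapproximated appendix expressions for the cc mode.
    l14 = 2 * (1 - C) * (1 - beta) * S * lam
    l15 = beta * (S + C * (1 - S)) * lam
    l16 = (1 - C) * (beta + 2 * (1 - beta)) * (1 - S) * lam
    ls1 = l14 + l15 + l16
    ls4 = (S + C * (1 - S)) * lam + (1 - C) * (1 - S) * lam

    # Top row of N = [I - Q]^-1 is [1/ls1, l14/(ls1*ls4)]; MTTF is its sum.
    from_n_matrix = (1 + l14 / ls4) / ls1
    # Approximation: drop l14/ls4 << 1 and use ls1 =~ [2(1-C) + beta]*lam.
    approx = 1 / ((2 * (1 - C) + beta) * lam)
    return from_n_matrix, approx


if __name__ == "__main__":
    exact, approx = mttf_1oo2d_cc(lam=1e-5, S=0.5, C=0.99, beta=0.01)
    print("N-matrix MTTF (h):", exact)
    print("approximate MTTF (h):", approx)
    print("relative difference:", abs(exact - approx) / exact)
```

For these inputs the two figures should agree to within a few percent, which is well inside the uncertainty of the input data.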


RRF1oo2D

Partitioning

\[
P = I + \Delta
\]

where I is the identity matrix and

\[
\Delta = \begin{bmatrix}
-\lambda_{s1} & \lambda_{1-4} & \lambda_{1-5} & \lambda_{1-6} \\
0 & -\lambda_{s4} & \lambda_{4-5} & \lambda_{4-6} \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0
\end{bmatrix}
\]

Since all terms of ∆ are small, only the first few terms of the binomial theorem are needed:

\[
P^n = [I + \Delta]^n
    = I^n + n I^{n-1} \Delta + \frac{n(n-1)}{2} I^{n-2} \Delta^2 + \cdots
    \approx I + n \Delta + \frac{n(n-1)}{2} \Delta^2
\]

n is the average number of hours a failure remains undetected by on-line diagnostics prior to off-line proof
testing and is typically on the order of 10^4.

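The probability of failure on demand is the upper right term of the P^n matrix.

\[
\mathrm{PFD} = n \lambda_{1-6}
  + \frac{n(n-1)}{2} \left( -\lambda_{s1} \lambda_{1-6} + \lambda_{1-4} \lambda_{4-6} \right) + \cdots
\]

Substituting expressions from the appendix,

\[
\mathrm{PFD}_{cc} \approx 2n(1-C)(1-S)\lambda \qquad \text{if } n\lambda < 1
\]

and

\[
\mathrm{PFD}_{cv} \approx n(1-C)(1-S)(\beta + n\lambda)\lambda
\]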

The risk (or hazard) reduction factor is the inverse of the PFD.

RRF = 1 / PFD

Note the very strong dependence on the diagnostic coverage C and the relatively weaker dependence on
the common cause factor β. Recall that by order of magnitude C ~ 0.99, S ~ 0.5, β ~ 0.01, and nλ ~ 0.1.
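
The truncated expansion can be checked against a direct evaluation of the matrix power. The sketch below is one way to do that in plain Python. It assumes n is an integer number of hours, and all function names are illustrative assumptions rather than the author's implementation.

```python
def mat_mul(A, B):
    """Product of two square matrices stored as lists of lists."""
    size = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(A, n):
    """A**n for integer n >= 0 by repeated squaring."""
    size = len(A)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    while n > 0:
        if n & 1:
            result = mat_mul(result, A)
        A = mat_mul(A, A)
        n >>= 1
    return result


def pfd_1oo2d(lam, S, C, beta, n, mode="cc"):
    """Return (upper-right term of P**n, closed-form approximation)."""
    if mode == "cc":
        l14 = 2 * (1 - C) * (1 - beta) * S * lam
        l15 = beta * (S + C * (1 - S)) * lam
        l16 = (1 - C) * (beta + 2 * (1 - beta)) * (1 - S) * lam
        l45 = (S + C * (1 - S)) * lam
        l46 = (1 - C) * (1 - S) * lam
        approx = 2 * n * (1 - C) * (1 - S) * lam
    else:  # calculate / verify
        l14 = 2 * (1 - C) * (1 - beta) * (1 - S) * lam
        l15 = (C * beta + (1 - C) * beta * S
               + 2 * (1 - C) * (1 - beta) * S) * lam
        l16 = (1 - C) * beta * (1 - S) * lam
        l45 = (1 - C) * S * lam
        l46 = ((1 - C) * (1 - S) + C) * lam
        approx = n * (1 - C) * (1 - S) * (beta + n * lam) * lam
    P = [
        [1 - (l14 + l15 + l16), l14, l15, l16],
        [0.0, 1 - (l45 + l46), l45, l46],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]
    return mat_pow(P, n)[0][3], approx


if __name__ == "__main__":
    for mode in ("cc", "cv"):
        full, approx = pfd_1oo2d(1e-5, 0.5, 0.99, 0.01, 10_000, mode)
        print(mode, "PFD:", full, "approx:", approx, "RRF:", 1 / full)
```

At the typical values the closed forms evaluate to about 1 x 10^-3 (cc) and 5.5 x 10^-5 (cv). The matrix-power results should sit within a few percent of these, the cv case being slightly further off because nλ ~ 0.1 is not negligible.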

2oo3 SYSTEMS

Triple-modular redundant (TMR) or 2oo3 systems have three parallel channels with two out of three
voting on the outputs. With more channels, there are more states. They have eleven states [2, 3] or more [4, 5].
The states and transition probabilities are listed in the appendix.

Some states can be eliminated since they are quickly repaired. The transition probability matrix reduces to

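\[
P = \begin{bmatrix}
1-\lambda_{s1} & \lambda_{1-3} & \lambda_{1-5} & 0 & \lambda_{1-10} & \lambda_{1-12} \\
0 & 1-\lambda_{s3} & 0 & \lambda_{3-9} & \lambda_{3-10} & \lambda_{3-12} \\
0 & 0 & 1-\lambda_{s5} & \lambda_{5-9} & \lambda_{5-10} & \lambda_{5-12} \\
0 & 0 & 0 & 1-\lambda_{s9} & \lambda_{9-10} & \lambda_{9-12} \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}
\]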

MTTF2oo3

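\[
N = [I - Q]^{-1}
\]

Summing across the top row

\[
\mathrm{MTTF} = \frac{1}{\mathrm{Det}} \left( N_{11} + N_{12} + N_{13} + N_{14} \right)
\]

where

\[
\begin{aligned}
\mathrm{Det} &= \lambda_{s1} \lambda_{s3} \lambda_{s5} \lambda_{s9} = \lambda_{s1} N_{11} \\
N_{11} &= \lambda_{s3} \lambda_{s5} \lambda_{s9} \\
N_{12} &= \lambda_{1-3} \lambda_{s5} \lambda_{s9} \\
N_{13} &= \lambda_{s3} \lambda_{1-5} \lambda_{s9} \\
N_{14} &= \lambda_{1-3} \lambda_{s5} \lambda_{3-9} + \lambda_{s3} \lambda_{1-5} \lambda_{5-9}
\end{aligned}
\]

\[
\mathrm{MTTF} = \frac{1}{\lambda_{s1}} \left[ 1
  + \frac{\lambda_{1-3}}{\lambda_{s3}}
  + \frac{\lambda_{1-5}}{\lambda_{s5}}
  + \frac{\lambda_{1-3} \lambda_{3-9}}{\lambda_{s3} \lambda_{s9}}
  + \frac{\lambda_{1-5} \lambda_{5-9}}{\lambda_{s5} \lambda_{s9}} \right]
\approx \frac{1}{[\, 3(1-C) + \beta S \,]\,\lambda}
  \left[ 1 + \frac{3(1-S)}{2 + \beta S/(1-C)} \right]
\]

substituting expressions from the appendix.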
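
The same kind of numerical check can be made for the 2oo3 MTTF. The sketch below evaluates the five-term bracket with the unapproximated appendix expressions and compares it with the approximate formula. The function name is hypothetical and the inputs are the typical values quoted earlier.

```python
def mttf_2oo3(lam, S, C, beta):
    """MTTF in hours for 2oo3: five-term expression and its approximation."""
    # Transitions out of state 1 (all channels OK).
    l13 = 3 * (1 - C) * (1 - beta) * S * lam
    l15 = 3 * (1 - C) * (1 - beta) * (1 - S) * lam
    l1_10 = beta * S * lam
    l1_12 = 3 * (1 - C) * beta * (1 - S) * lam
    ls1 = l13 + l15 + l1_10 + l1_12

    # Transitions out of state 3 (one channel failed safe, undetected).
    l39 = 2 * (1 - C) * (1 - beta) * (1 - S) * lam
    l3_10 = beta * S * lam + 2 * (1 - beta) * S * lam
    l3_12 = (1 - C) * beta * (1 - S) * lam
    ls3 = l39 + l3_10 + l3_12

    # Transitions out of state 5 (one channel failed dangerous, undetected).
    l59 = 2 * (1 - C) * (1 - beta) * S * lam
    l5_10 = beta * S * lam
    l5_12 = ((1 - C) * beta * (1 - S) * lam
             + 2 * (1 - C) * (1 - beta) * (1 - S) * lam)
    ls5 = l59 + l5_10 + l5_12

    # Transitions out of state 9.
    ls9 = S * lam + (1 - C) * (1 - S) * lam

    bracket = (1 + l13 / ls3 + l15 / ls5
               + l13 * l39 / (ls3 * ls9) + l15 * l59 / (ls5 * ls9))
    full = bracket / ls1
    approx = ((1 + 3 * (1 - S) / (2 + beta * S / (1 - C)))
              / ((3 * (1 - C) + beta * S) * lam))
    return full, approx


if __name__ == "__main__":
    full, approx = mttf_2oo3(lam=1e-5, S=0.5, C=0.99, beta=0.01)
    print("five-term MTTF (h):", full)
    print("approximate MTTF (h):", approx)
```

For these inputs both values should come out in the range of a few million hours and agree to within a few percent.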


RRF2oo3

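\[
\mathrm{PFD} = n \lambda_{1-12}
  + \frac{n(n-1)}{2} \left( -\lambda_{s1} \lambda_{1-12}
  + \lambda_{1-3} \lambda_{3-12} + \lambda_{1-5} \lambda_{5-12} \right) + \cdots
\]

\[
\approx 3n(1-C)(1-S) \left[ \beta + (1-C)(1-S)n\lambda \right] \lambda
\]

\[
\approx 3n(1-C)(1-S)\beta\lambda \qquad \text{if } (1-C)(1-S)n\lambda/\beta \ll 1
\]

substituting expressions from the appendix.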

RRF = 1 / PFD

Note the very strong dependence on both the common cause factor β and the diagnostic coverage C.
Recall that by order of magnitude C ~ 0.99, S ~ 0.5, β ~ 0.01, and nλ ~ 0.1.
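
As a hedged numerical check of the 2oo3 expressions, the sketch below evaluates the second-order series with the unapproximated appendix terms and compares it with the two closed forms. The function name is an assumption for illustration, and the series value is itself a truncation, not an exact matrix power.

```python
def pfd_2oo3(lam, S, C, beta, n):
    """Return (second-order series, closed form, beta-only closed form)."""
    # Unapproximated appendix expressions needed by the series.
    l13 = 3 * (1 - C) * (1 - beta) * S * lam
    l15 = 3 * (1 - C) * (1 - beta) * (1 - S) * lam
    l1_10 = beta * S * lam
    l1_12 = 3 * (1 - C) * beta * (1 - S) * lam
    ls1 = l13 + l15 + l1_10 + l1_12
    l3_12 = (1 - C) * beta * (1 - S) * lam
    l5_12 = ((1 - C) * beta * (1 - S) * lam
             + 2 * (1 - C) * (1 - beta) * (1 - S) * lam)

    series = (n * l1_12
              + n * (n - 1) / 2 * (-ls1 * l1_12 + l13 * l3_12 + l15 * l5_12))
    closed = (3 * n * (1 - C) * (1 - S)
              * (beta + (1 - C) * (1 - S) * n * lam) * lam)
    beta_only = 3 * n * (1 - C) * (1 - S) * beta * lam
    return series, closed, beta_only


if __name__ == "__main__":
    series, closed, beta_only = pfd_2oo3(1e-5, 0.5, 0.99, 0.01, 10_000)
    print("series PFD:", series, "RRF:", 1 / series)
    print("closed form:", closed)
    print("beta-only form:", beta_only)
```

With the typical values the three results should all be near 1.5 x 10^-5, which illustrates that the common cause term β dominates the 2oo3 PFD.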

CONCLUSION

If λ, S, C and β are the same for the 1oo2D and the 2oo3 system, there is little to choose between them
on the basis of MTTF.

For 1oo2D systems, the calculate / verify form has a lower PFD and higher RRF than the calculate /
calculate form if

β + nλ < 2

as is always the case if proof testing is done at appropriate intervals.

The 1oo2D calculate / verify system has a lower PFD and higher RRF than the 2oo3 if

\[
3 \left[ \beta + (1-C)(1-S)n\lambda \right] > \beta + n\lambda
\]

or

\[
\beta > \frac{n\lambda}{2} \left[ 1 - 3(1-C)(1-S) \right] \approx \frac{n\lambda}{2}
\]

Common cause is critical for 2oo3 systems.
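
The comparisons above can be tabulated directly from the approximate PFD formulas. The sketch below does this for two values of β. The function names are hypothetical and the numbers are only the order-of-magnitude inputs used throughout, so the output is illustrative rather than a design result.

```python
def approx_pfds(lam, S, C, beta, n):
    """Approximate PFDs for the three architectures (closed forms from the text)."""
    k = n * (1 - C) * (1 - S) * lam
    return {
        "1oo2D cc": 2 * k,
        "1oo2D cv": k * (beta + n * lam),
        "2oo3": 3 * k * (beta + (1 - C) * (1 - S) * n * lam),
    }


def cv_beats_2oo3(lam, S, C, beta, n):
    """Criterion from the conclusion: beta > (n*lam/2) * [1 - 3(1-C)(1-S)]."""
    return beta > 0.5 * n * lam * (1 - 3 * (1 - C) * (1 - S))


if __name__ == "__main__":
    lam, S, C, n = 1e-5, 0.5, 0.99, 10_000
    for beta in (0.01, 0.1):
        pfds = approx_pfds(lam, S, C, beta, n)
        print("beta =", beta)
        for name, pfd in pfds.items():
            print("  ", name, "PFD:", pfd, "RRF:", 1 / pfd)
        print("   1oo2D cv better than 2oo3:", cv_beats_2oo3(lam, S, C, beta, n))
```

With nλ ~ 0.1 the crossover sits near β ~ 0.05, so the low-β case should favour 2oo3 and the high-β case should favour 1oo2D calculate / verify.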

The three simplifications used here, while not exact, lead to an easier understanding of the common
measures of system availability. Simple formulas replace large dimension and high power matrices and
computer programs. State transition probability data used as input to these calculations is likely to be only
a rough approximation to the true figures so simplified calculations will often be adequate.

REFERENCES

1 Bukowski, J. V., and Goble, W. M., Using Markov Models for Safety Analysis of Programmable
Electronic Systems, Proceedings of the 50th A & M Conference on Process Control, ISA 1995 or ISA
Transactions 34 (1995).

2 Goble, W. M., Evaluating Control Systems Reliability - Techniques and Applications, Raleigh, NC:
ISA, 1992.

3 Bukowski, J. V., and Goble, W. M., Comparing Control Systems Reliability - Architecture,
Diagnostics, and Common Cause, ISA 1994.

4 Goble, W. M., Safety of Programmable Electronic Systems - Critical Issues, Diagnostics and
Common Cause Strength, Fourth Conference on Advances in Process Control (IChemE) York, UK
Sept. 1995.

5 Goble, W. M., Bukowski, J. V., Brombacher, Prof. Dr. Ir. A. C., How Common Cause Ruins the
Safety Rating of Fault Tolerant PES, ISA 1996.

APPENDIX

State transition probabilities and approximations follow.

1oo2D

Term          cc                           cv

λ1-4       =  2(1-C)(1-β)Sλ                2(1-C)(1-β)(1-S)λ
           =~ 2(1-C)Sλ                     2(1-C)(1-S)λ
λ1-5       =  β[S+C(1-S)]λ                 [Cβ+(1-C)βS+2(1-C)(1-β)S]λ
           =~ βλ                           [β+2(1-C)S]λ
λ1-6       =  (1-C)[β+2(1-β)](1-S)λ        (1-C)β(1-S)λ
           =~ 2(1-C)(1-S)λ                 0
λs1        =~ [β+2(1-C)]λ                  [β+2(1-C)]λ

λ4-5       =  [S+C(1-S)]λ                  (1-C)Sλ
           =~ λ                            (1-C)Sλ
λ4-6       =  (1-C)(1-S)λ                  [(1-C)(1-S)+C]λ
           =~ (1-C)(1-S)λ                  λ
λs4        =~ λ                            λ

λ1-4/λs4   =  2(1-C)S                      2(1-C)(1-S)
           << 1                            << 1

2oo3

λ1-3       =  3(1-C)(1-β)Sλ                          =~ 3(1-C)Sλ
λ1-5       =  3(1-C)(1-β)(1-S)λ                      =~ 3(1-C)(1-S)λ
λ1-10      =  βSλ
λ1-12      =  3(1-C)β(1-S)λ                          =~ 0
λs1        =~ [3(1-C)+βS]λ

λ3-9       =  2(1-C)(1-β)(1-S)λ                      =~ 2(1-C)(1-S)λ
λ3-10      =  βSλ + 2(1-β)Sλ                         =~ 2Sλ
λ3-12      =  (1-C)β(1-S)λ                           =~ 0
λs3        =~ 2Sλ
λ1-3/λs3   =  3(1-C)/2                               << 1

λ5-9       =  2(1-C)(1-β)Sλ                          =~ 2(1-C)Sλ
λ5-10      =  βSλ
λ5-12      =  (1-C)β(1-S)λ + 2(1-C)(1-β)(1-S)λ       =~ 2(1-C)(1-S)λ
λs5        =~ [2(1-C)+βS]λ
λ1-5/λs5   =  3(1-C)(1-β)(1-S)/[2(1-C)+βS]           =~ 3(1-C)(1-S)/[2(1-C)+βS]

λ9-10      =  Sλ
λ9-12      =  (1-C)(1-S)λ
λs9        =~ Sλ
λ3-9/λs9   =  2(1-C)(1-β)(1-S)/S                     << 1
λ5-9/λs9   =  2(1-C)(1-β)                            << 1

States for 2oo3 or TMR systems include

1 All 3 channels OK

2 1 channel failed safe and detected (SD)
3 1 channel failed safe and undetected (SU)
4 1 channel failed dangerous and detected (DD)
5 1 channel failed dangerous and undetected (DU)

6 1 channel failed SD, another failed DD
7 1 channel failed SU, another failed DD
8 1 channel failed SD, another failed DU
9 1 channel failed SU, another failed DU

10 System failed safe
11 System failed dangerous and detected
12 System failed dangerous and undetected
