Dependable Software Systems

Topics in
Software Reliability

Material drawn from [Sommerville, Mancoridis]


What is Software Reliability?
•  A Formal Definition: Reliability is the
probability of failure-free operation of a
system over a specified time within a
specified environment for a specified
purpose.


What is Software Reliability?
•  An Informal Definition: Reliability is a
measure of how closely a system matches
its stated specification.
•  Another Informal Definition: Reliability
is a measure of how well the users perceive
a system provides the required services.


Software Reliability
•  It is difficult to define the term objectively.
•  Difficult to measure user expectations,
•  Difficult to measure environmental factors.
•  It’s not enough to consider simple failure
rate:
–  Not all failures are created equal; some have
much more serious consequences.
–  Might be able to recover from some failures

Failures and Faults
•  A failure corresponds to unexpected run-time
behavior observed by a user of the software.
•  A fault is a static software characteristic
which causes a failure to occur.


Failures and Faults (Cont’d)
•  Not every fault causes a failure:
–  Code that is “mostly” correct.
–  Dead or infrequently-used code.
–  Faults that depend on a set of circumstances to
occur.
•  If a tree falls ...
–  If a user doesn’t notice wrong behavior, is it a
fault?
–  If a user expects “wrong” behavior, is it a fault?


Improving Reliability
•  Primary objective: Remove faults with the
most serious consequences.
•  Secondary objective: Remove faults that
are encountered most often by users.


Improving Reliability (Cont’d)
•  Fixing N% of the faults does not, in general,
lead to an N% reliability improvement.
•  90-10 Rule: 90% of the time you are
executing 10% of the code.
•  One study showed that removing 60% of
software “defects” led to a 3% reliability
improvement.


The Cost of Reliability
•  In general, reliable systems take the slow,
steady route:
–  trusted implementation techniques
–  few uses of short-cuts, sneak paths, tricks
–  use of redundancy, run-time checks, type-safe
•  Users value reliability highly.
•  “It is easier to make a correct program
efficient than to make an efficient program
correct.”

The Cost of Reliability (Cont’d)
•  Cost of software failure often far outstrips
the cost of the original system:
–  data loss
–  down-time
–  cost to fix


Measuring Reliability
•  Hardware failures are almost always
physical failures (i.e., the design is correct).
•  Software failures, on the other hand, are due
to design faults.
•  Hardware reliability metrics are not always
appropriate to measure software reliability
but that is how they have evolved.


Reliability Metrics (POFOD)
•  Probability Of Failure On Demand
–  Likelihood that system will fail when a request
is made.
–  E.g., POFOD of 0.001 means that 1 in 1000
requests may result in failure.
•  Any failure is important; doesn’t matter
how many if > 0
•  Relevant for safety-critical systems.

Reliability Metrics (ROCOF)
•  Rate Of Occurrence Of Failure
–  Frequency of occurrence of failures.
–  E.g., ROCOF of 0.02 means 2 failures are
likely in each 100 time units.
•  Relevant for transaction processing systems.


Reliability Metrics (MTTF)
•  Mean Time To Failure (MTTF):
–  Measure of time between failures.
–  E.g., MTTF of 500 means an average of 500
time units passes between failures.
•  Relevant for systems with long transactions.


Reliability Metrics (Availability)
•  Availability:
–  Measure of how likely a system is available for
use, taking into account repairs and other
downtime.
–  E.g., Availability of .998 means that system is
available 998 out of 1000 time units.
•  Relevant for continuously running systems.
–  E.g., telephone switching systems.
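
The four metrics above can each be computed from simple observed counts. The following is a minimal sketch, not a standard API: the class name `ReliabilityMetrics` and its method names are assumptions made for illustration, and the sample figures in `main` are the ones quoted on the slides.

```java
/** Hypothetical helper that computes the four reliability metrics from observed data. */
class ReliabilityMetrics {
    /** POFOD: failed requests divided by total requests. */
    static double pofod(long failedRequests, long totalRequests) {
        return (double) failedRequests / totalRequests;
    }

    /** ROCOF: number of failures per time unit of observation. */
    static double rocof(long failures, double observedTimeUnits) {
        return failures / observedTimeUnits;
    }

    /** MTTF: average of the observed times between failures. */
    static double mttf(double[] timesBetweenFailures) {
        double sum = 0.0;
        for (double t : timesBetweenFailures) {
            sum += t;
        }
        return sum / timesBetweenFailures.length;
    }

    /** Availability: fraction of total time during which the system was usable. */
    static double availability(double upTime, double downTime) {
        return upTime / (upTime + downTime);
    }

    public static void main(String[] args) {
        System.out.println(pofod(1, 1000));                       // 0.001
        System.out.println(rocof(2, 100));                        // 0.02
        System.out.println(mttf(new double[] {400, 500, 600}));   // 500.0
        System.out.println(availability(998, 2));                 // 0.998
    }
}
```

Which "time unit" is passed in (seconds, transactions, calendar days) is left to the caller, since the appropriate unit depends on the kind of system.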


Time Units
•  What is an appropriate time unit?
•  Some examples:
–  Raw execution time, for non-stop real-time
systems.
–  Number of transactions, for transaction-based
systems.


Types of Failures
•  Not all failures are equal in their
–  Transient vs permanent
–  Recoverable vs non-recoverable
–  Corrupting vs non-corrupting
•  Consequences of failure:
–  Malformed HTML document.
–  Inode table trashed.
–  Incorrect radiation dosage reported.
–  Incorrect radiation dosage given!

Steps to a
Reliability Specification
•  For each subsystem, analyze the
consequences of possible system failures.
•  Partition the failures into appropriate
classes.
•  For each failure identified, describe the
acceptable reliability using an appropriate
metric.


Automatic Bank Teller Example
•  Bank has 1000 machines; each machine in
the network is used 300 times per day.
•  Lifetime of software release is 2 years.
•  Therefore, there are about 300,000 database
transactions per day, and each machine
handles about 200,000 transactions over the
2 years.
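
The figures on this slide are rounded. A short sketch of the arithmetic, assuming 365-day years and that every machine is in service every day (the class and constant names are made up for illustration):

```java
/** Workload arithmetic for the bank teller example (names are hypothetical). */
class AtmWorkload {
    static final int MACHINES = 1000;
    static final int USES_PER_MACHINE_PER_DAY = 300;
    static final int LIFETIME_DAYS = 2 * 365;   // assumed: two 365-day years

    static long transactionsPerDay() {
        return (long) MACHINES * USES_PER_MACHINE_PER_DAY;
    }

    static long transactionsPerMachineOverLifetime() {
        return (long) USES_PER_MACHINE_PER_DAY * LIFETIME_DAYS;
    }

    static long totalTransactionsOverLifetime() {
        return transactionsPerDay() * LIFETIME_DAYS;
    }

    public static void main(String[] args) {
        System.out.println(transactionsPerDay());                  // 300000
        System.out.println(transactionsPerMachineOverLifetime());  // 219000
        System.out.println(totalTransactionsOverLifetime());       // 219000000
    }
}
```

The exact per-machine figure is 219,000, which the slide rounds to "about 200,000"; the network-wide total of roughly 219 million is the basis for "200 million" style bounds.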


Example Reliability Specification
Failure class               Example                                   Reliability metric
Permanent, non-corrupting   The system fails to operate with any      ROCOF = 1 occ./1000 days
                            card; must be restarted.
Transient, non-corrupting   The magnetic strip on an undamaged card   POFOD = 1 in 1000 trans.
                            cannot be read.
Transient, corrupting       A pattern of transactions across the      Should never happen
                            network causes DB corruption.


Validating the Specification
•  It’s often impossible to empirically validate
high-reliability specifications.
–  No database corruption would mean
POFOD < 1 in 200 million
–  If a transaction takes one second, then it would
take 3.5 days to simulate one day’s
transactions.
•  It would take longer than the system’s
lifespan to test it for reliability!
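
The time estimates above follow from simple division. A rough sketch, assuming transactions are replayed one after another on a single test rig at one second each (the class and method names are hypothetical):

```java
/** Back-of-the-envelope test duration estimate (names are hypothetical). */
class ValidationTime {
    static final double SECONDS_PER_DAY = 24 * 60 * 60;

    /** Days needed to run the given number of transactions sequentially. */
    static double daysToSimulate(long transactions, double secondsPerTransaction) {
        return transactions * secondsPerTransaction / SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        // One day's load of 300,000 transactions.
        System.out.printf("%.2f days%n", daysToSimulate(300_000, 1.0));
        // 200 million transactions, expressed in 365-day years.
        System.out.printf("%.2f years%n", daysToSimulate(200_000_000, 1.0) / 365);
    }
}
```

Under these assumptions one day's load takes about 3.5 days to replay, and 200 million transactions take more than six years, well beyond the two-year lifetime of the release.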

Validating the Specification (Cont’d)
•  It may be more economic to accept
unreliability and pay for failure costs.
•  However, what constitutes an acceptable
risk depends on nature of system, politics,
reputation, ...


Increasing Cost of Reliability




Steps in a
Statistical Testing Process
–  Determine the desired levels of reliability for
the system.
–  Generate substantial test input data based on
predicted usage of system.
–  Run the tests and measure the number of errors
encountered, and the amount of “time” between
each failure.
–  If levels are unacceptable, go back and repair
some faults.
–  Once a statistically-valid number of tests have
been run with acceptable results, you’re done.
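
The steps above can be sketched as a small test loop. This is only an illustration under strong assumptions: the "usage profile" is a uniform random draw over inputs 0..999, the system under test is a stand-in predicate, and the names `StatisticalTest` and `estimatePofod` are invented here.

```java
import java.util.Random;
import java.util.function.IntPredicate;

/** Toy statistical test harness (all names are hypothetical). */
class StatisticalTest {
    /** Runs `runs` randomly drawn inputs and returns the observed failure fraction. */
    static double estimatePofod(IntPredicate behavesCorrectly, int runs, long seed) {
        Random profile = new Random(seed);        // stands in for the predicted usage profile
        int failures = 0;
        for (int i = 0; i < runs; i++) {
            int input = profile.nextInt(1000);    // assumed: inputs 0..999, equally likely
            if (!behavesCorrectly.test(input)) {
                failures++;
            }
        }
        return (double) failures / runs;
    }

    public static void main(String[] args) {
        // Stand-in system under test: misbehaves only on input 7.
        IntPredicate system = input -> input != 7;
        double target = 0.01;                     // desired reliability level (POFOD)
        double observed = estimatePofod(system, 100_000, 42L);
        System.out.println("observed POFOD = " + observed);
        System.out.println(observed <= target ? "acceptable" : "repair faults and retest");
    }
}
```

A real profile would weight inputs by how often users are expected to supply them, which is exactly the data that is hard to obtain for a new system.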

Problems with Statistical Testing
•  Generating “typical” test data is difficult,
time consuming, and expensive, especially
if the system is new.
•  Idea of what constitutes “acceptable” failure
rate may be hard to determine:
–  sometimes, expectations are unrealistic.
•  Some ideas may be hard to measure:
–  they’re hard to specify concretely.
–  small sample size implies results not
statistically valid.

Programming for Reliability
•  As we have seen, squeezing the last few
bugs out of a system can be very costly.
–  For systems that require high reliability, this
may still be a necessity.
–  For most other systems, eventually you give up
looking for faults and ship it.
•  We will now consider several methods for
dealing with software faults:
–  Fault avoidance
–  Fault detection
–  Fault tolerance, recovery and repair

Fault Avoidance
•  The basic idea is that if you are REALLY
careful as you develop the software system,
no faults will creep in.
•  Can be done in degrees:
–  Basic fault avoidance:
•  Use of information-hiding, strong typing, good
engineering principles.
–  Fault-free software development:
•  Use of formal specification, code verification,
strictly followed software development process.

Basic Fault Avoidance
•  Basically, you use all of the “holy” ideas we
have discussed in class:
–  Programming techniques:
•  information-hiding, modularity, interfaces
•  object-oriented techniques (where appropriate)
•  “structured programming”: top-down design, no
GOTOs, ...
–  System structure:
•  scrutable system structure
•  simple interactions between components
•  low coupling and high cohesion

Basic Fault Avoidance (Cont’d)
•  Use of languages and tools that give good
support to these ideas!
•  No cheats, hacks, or other “risky behavior”.


“High-Risk Behavior”
•  Magic numbers vs declared constants.
•  Type cheats.
•  Compiler quirks, bit ordering, (hardware)
architectural dependencies.
•  Use of “special knowledge”
–  E.g., implicit shared context between software
components.
•  Features that are likely not to be portable or
robust, or make system evolution difficult.

Error-Prone Programming
Constructs
•  Raw pointers:
–  are a notorious cause of crashes,
–  make code incomprehensible to others,
–  it’s difficult to make them truly safe.
–  On the other hand, object-oriented
programming languages use typed, safe
pointers to refer to abstract programming


Error-Prone Programming
Constructs (Cont’d)
•  Dynamic memory allocation:
–  In C, another major cause of crashes and
confusion, e.g., ever forget to malloc()?
–  Again, object-oriented programming basically
solves this.
•  Parallelism/concurrency:
–  Logic is hard to get correct.


Error-Prone Programming
Constructs (Cont’d)
•  Floating-point numbers:
–  Many possible problems; may get unexpected
results from comparisons leading to unexpected
execution paths.
•  Recursion:
–  Again, logic is often hard to get completely
correct. Also, infinite recursions exhaust
the stack.
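
The floating-point comparison problem is easy to reproduce. A minimal sketch; the helper `nearlyEqual` and the tolerance value are illustrative choices, not a recommendation for every application:

```java
/** Shows why exact equality on doubles can send execution down the wrong path. */
class FloatCompare {
    /** Compares two doubles using an absolute tolerance chosen by the caller. */
    static boolean nearlyEqual(double a, double b, double tolerance) {
        return Math.abs(a - b) <= tolerance;
    }

    public static void main(String[] args) {
        double sum = 0.1 + 0.2;
        System.out.println(sum == 0.3);                   // false
        System.out.println(nearlyEqual(sum, 0.3, 1e-9));  // true
    }
}
```

An `if (sum == 0.3)` branch would silently be skipped here, which is the kind of unexpected execution path the slide warns about.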


Error-Prone Programming
Constructs (Cont’d)
•  Interrupts:
–  Can make logic of program difficult to follow.
•  This isn’t to say you should never use these
constructs, just that there are real and
inherent risks that tend to introduce faults
and decrease program reliability.


Fault-Free Software
•  Uses formal specifications, logic, & related
•  Extensive and frequent reviews of all
software artifacts.
•  Requires a programming language with
strong typing and run-time checking, plus
good abstraction mechanisms.
•  Requires serious time and effort.


Cost of Fault-Free
Software Development
•  The cost of this approach can be very high!
–  Must have management buy-in on costs.
–  Developers must be experienced and highly
trained, not only in traditional software
development techniques, but also in
mathematics, logic, and special tools.
•  Usually, the cost is so prohibitive that it is
simply not feasible.


Cleanroom Software
•  Analogy to hardware manufacturing: avoid
introducing faults in the first place by
operating in a dedicated “clean”
environment.
–  What may and may not be used is rigidly
–  Well-defined process.
–  Attention to minute details.


Cleanroom Software
Development (Cont’d)
•  Some impressive (but limited) results:
–  Specialized projects done by IBM, academia.
–  Claim is very high reliability at roughly the
same “cost”.
•  Not clear this will work in industry:
–  Requires highly-trained and patient developers.
–  Potentially costly in time and human resources.


Cleanroom Development Process
[Figure: process flow with the stages “Define software requirements”,
“Develop structured program”, “Formally verify code”, “Design statistical
tests”, and “Test integrated system”, plus an “Error, rework” feedback path.]

Cleanroom Process is Based on ...
•  incremental development
•  formal specification
•  static verification based on correctness
•  statistical testing to determine reliability


Cleanroom Process Teams
•  Specification Team:
–  Responsible for developing and maintaining
system specification.
–  Produces a formal specification of the system
requirements for use by the ...


Cleanroom Process Teams (Cont’d)
•  Development Team:
–  Develops the software based on formal
specification provided.
–  Only allowed to use a handful of trusted
implementation techniques.
–  Code may be type-checked by tools, but no
executables are generated.
–  Once code is written, it is formally verified
against the specification. (static verification)
•  E.g., all loops are guaranteed to terminate.
•  Once done, pass code off to ...

Cleanroom Process Teams (Cont’d)
•  Certification Team:
–  While the development team is working, the
certification team is working in parallel to
devise a set of statistical tests to “exercise” the
eventual compiled code.
–  Statistical modeling determines when to stop
testing.


Fault Tolerance
•  It is not enough for reliable systems to avoid
faults; they must also be able to tolerate faults.
•  Faults occur for many reasons:
–  Incorrect requirements.
–  Incorrect implementation of requirements.
–  Unforeseen situations.
•  Roughly speaking, fault tolerance means
“able to continue operation in spite of
software failure.”


Steps to Fault Tolerance
•  Detection of failure:
–  How can you determine if there is a fault?
–  Can we determine what fault caused the
failure?
•  Damage assessment:
–  Which parts of the system have been affected?
–  How serious is the damage?
–  Can we ignore it? Fix it?


Steps to Fault Tolerance (Cont’d)
•  Fault recovery:
–  Restore the program state to a previous “safe”
state.
–  E.g., Restoring a database and unfolding the
most recent transactions.
–  May require “snapshots” of old states to be
kept.
•  OR


Steps to Fault Tolerance (Cont’d)
•  Fault repair:
–  Can you simply fix the damage?
–  Can you stop the fault from occurring again
without shutting down the system?
–  E.g., mail router: just try a different route


Software Fault Tolerance
Programming Techniques
•  N-Version programming (NVP)
•  Exception Handling
•  Subtypes
•  Run-time Assertions


N-Version Programming (NVP)
•  Basic idea from hardware TMR: use three
identical components that vote; majority
wins.
•  With NVP, have three distinct programs
that vote.
•  Claimed success has been criticized in some
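
The voting step can be sketched in a few lines. This is an illustration only: the three "versions" are trivial lambdas standing in for independently developed programs, and the names `MajorityVoter` and `vote` are assumptions.

```java
import java.util.function.IntUnaryOperator;

/** Two-out-of-three majority voter over integer results (names are hypothetical). */
class MajorityVoter {
    /** Returns the value at least two versions agree on; throws if all three disagree. */
    static int vote(int a, int b, int c) {
        if (a == b || a == c) {
            return a;
        }
        if (b == c) {
            return b;
        }
        throw new IllegalStateException("no majority among versions");
    }

    public static void main(String[] args) {
        // Three stand-in versions of "absolute value"; the third is deliberately faulty.
        IntUnaryOperator v1 = x -> Math.abs(x);
        IntUnaryOperator v2 = x -> x < 0 ? -x : x;
        IntUnaryOperator v3 = x -> x;
        int in = -5;
        System.out.println(vote(v1.applyAsInt(in), v2.applyAsInt(in), v3.applyAsInt(in))); // 5
    }
}
```

The voter masks a fault in one version, but it cannot help if two versions share the same misreading of the requirements.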


NVP Criticism
•  Software failure is a design fault, not a
mechanical failure.
•  How do you know which team interpreted
the requirements correctly?
•  This technique has been used in the real
world:
–  The Airbus A320 uses NVP in its on-board
software.


Exception Handling
•  Sometime during execution you detect that
some kind of failure has occurred (i.e., “an
exception has been raised”).
•  While you could repair it in-line, usually
these kinds of failures can occur in multiple
places.
•  It makes sense to define common handling
routines in one place.


Exception Handling (Cont’d)
•  An exception handler is a specialized
routine for dealing with an error. Usually,
handlers are grouped together.
•  Both C++ and Java support exception
handling (in roughly the same manner).
–  Java provides a large variety of predefined
exceptions; standard C++ provides a smaller set
(std::exception and its subclasses).


Exception Handling (Cont’d)
•  We’ll assume the Java model; C++ is
similar.
•  Exceptions are objects.
•  The only interesting thing about them is the
name of the defining class and its place in
the overall exception class hierarchy:


Java Exception Class Hierarchy

Exception
    IOException
        EOFException
        FileNotFoundException
    NoSuchMethodException
    RuntimeException

Exception Handling in Java
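public static String get () throws EOFException {
    // curLine, inStream, tokenizer, and put() are members of the enclosing class.
    if (tokenizer == null || !tokenizer.hasMoreTokens()) {
        try {
            curLine = inStream.readLine ();
            if (curLine != null) {            // readLine() returns null at end of input
                tokenizer = new StringTokenizer(curLine);
            }
        } catch (IOException e) {
            if (!(e instanceof EOFException)) {
                put ("Error: " + e.getMessage());
            }
        }
    }
    if (tokenizer != null && tokenizer.hasMoreTokens()) {
        return tokenizer.nextToken();
    } else {
        throw new EOFException ();            // no token available: report EOF to the caller
    }
}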

Exception Handling in Java
•  In Java and C++, you raise an exception
explicitly via a throw statement.
•  The function that throws the exception may
choose to handle it locally; more likely, it
will let the caller decide what to do.
•  If the caller handles the exception
completely, then all is fine.
•  However, it is common to handle only some
cases, and then let the caller of the caller
handle the rest.

Exception Handling in Java (Cont’d)
•  Any function that raises a checked exception
(either explicitly via a throw or implicitly by
calling another function that raises one)
must either:
–  Explicitly handle all possible exceptions.
–  Declare itself as (possibly) raising an exception.
•  Unchecked exceptions (RuntimeException and
its subclasses) are exempt from this rule.


Exception Handling in Java (Cont’d)
•  If you have a block of code that might raise
an exception, embed it in a try block.
•  Right after the try block, you can try to
resolve those exceptions by a sequence of
catch statements.
•  If the exception does not get resolved by
one of these catch clauses, then the exception is
sent back to the caller for resolution.
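
A small sketch of the try/catch behavior described above, in which one case is handled locally and the rest is sent back to the caller. The class `CatchDemo` and its methods are invented for illustration; only the exception classes come from the Java library.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

/** Partial handling: one exception is resolved locally, others propagate. */
class CatchDemo {
    /** Hypothetical low-level routine; declares the checked exception it may raise. */
    static String load(String name) throws IOException {
        if (name.isEmpty()) {
            throw new FileNotFoundException("no file name given");
        }
        if (name.equals("broken")) {
            throw new IOException("device error");
        }
        return "contents of " + name;
    }

    /** Resolves the missing-file case here; any other IOException goes to the caller. */
    static String loadOrDefault(String name) throws IOException {
        try {
            return load(name);
        } catch (FileNotFoundException e) {
            return "default";
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(loadOrDefault("a.txt"));   // contents of a.txt
        System.out.println(loadOrDefault(""));        // default
        try {
            loadOrDefault("broken");
        } catch (IOException e) {
            System.out.println("caller saw: " + e.getMessage());   // caller saw: device error
        }
    }
}
```

Because FileNotFoundException is a subclass of IOException, the catch clause matches only the more specific case and leaves the general one unresolved.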


Exception Handling in Java (Cont’d)
•  Use of inheritance to structure exception
classes is very intuitive and useful.
•  Exceptions often don’t have a very
interesting state.
–  Usually, there’s an error message,
–  Perhaps some parameters to indicate what/
where things went wrong.


Exception Handling in Java (Cont’d)
•  Exception handlers are a way to separate
out the basic functionality from the
exceptional conditions.
•  The old-fashioned way would be to pass
various status flags with every function call.


Exception Handling in Java (Cont’d)
•  Try to limit exceptions to exceptional
circumstances and real error conditions.
•  Expected behavior (such as normal EOF)
may be better handled in-line, if you have a
choice.
•  The basic Java libraries make extensive use
of exceptions, so it’s hard to avoid them.
•  Yes, exceptions usually entail some
overhead. Most feel it is a worthwhile
trade-off.

Other Software Fault Tolerance
Programming Techniques
•  Subtypes: E.g.,
–  integer sub-ranges: 1..10,
–  even integers, > 0,
–  proper enumerated types, subclasses
•  Run-time Assertions: E.g.,
–  Module/class invariants, pre/post-conditions
•  All of these must be enforceable by
language or system to be practically useful.
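
Java has no built-in integer sub-range types, so the sketch below emulates the sub-range 1..10 with a small class whose constructor enforces the invariant at run time. The class name `Rating` is made up for illustration. Explicit checks are used instead of the `assert` statement because Java assertions are disabled unless the JVM is started with `-ea`.

```java
/** Emulates the integer sub-range 1..10 with a run-time checked invariant. */
class Rating {
    private final int value;   // class invariant: 1 <= value <= 10

    Rating(int value) {
        if (value < 1 || value > 10) {
            throw new IllegalArgumentException("value out of range 1..10: " + value);
        }
        this.value = value;
    }

    int value() {
        return value;
    }

    /** Post-condition: the result is again in 1..10, enforced by the constructor. */
    Rating plus(int delta) {
        return new Rating(value + delta);
    }

    public static void main(String[] args) {
        System.out.println(new Rating(5).plus(2).value());   // 7
        try {
            new Rating(9).plus(4);                            // 13 is outside 1..10
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Since every `Rating` is created through the constructor and the field is final, no object of this class can ever hold an out-of-range value.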


Cost of Using
Subtypes & Assertions
•  The performance overhead may be
significant.
•  Might want to disable run-time checks after
sufficient testing.
•  E.g.,
–  Anna is a superset of Ada;
–  Anna adds special run-time checks embedded
in what the Ada compiler thinks are comments.


Damage Assessment
•  Once a failure has been detected, must try to
analyze extent of damage:
–  What parts of the state have been infected?
–  Is there any built-in redundancy to check for
–  Are values within legal range?
–  Are invariants preserved?


Common Assessment Techniques
•  Checksums are used in data transmission
(e.g., parity)
•  Redundant pointers check validity of data
•  Watchdog timers: If process doesn’t
respond after a certain period, check for
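
As an illustration of the first technique, a single even-parity bit is enough to detect any one-bit corruption in a byte. The class and method names below are assumptions made for this sketch.

```java
/** Even-parity check over one byte (names are hypothetical). */
class Parity {
    /** Parity bit: 1 if the byte has an odd number of 1 bits, otherwise 0. */
    static int parityBit(int dataByte) {
        return Integer.bitCount(dataByte & 0xFF) % 2;
    }

    /** True if the data still matches the parity bit stored with it. */
    static boolean isConsistent(int dataByte, int storedParity) {
        return parityBit(dataByte) == storedParity;
    }

    public static void main(String[] args) {
        int sent = 0b1011_0010;              // four 1 bits, so the parity bit is 0
        int parity = parityBit(sent);
        int received = sent ^ 0b0000_0100;   // one bit flipped in transit
        System.out.println(isConsistent(sent, parity));       // true
        System.out.println(isConsistent(received, parity));   // false
    }
}
```

Parity detects the damage but cannot say which bit is wrong, and it misses errors that flip an even number of bits; stronger checksums trade more redundancy for better detection.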


Forward Fault Recovery
•  Try to fix what’s broken based on
understanding of program and then carry
on.
•  Often requires redundancy in
•  If you can figure out which value is
obviously wrong, you may be able to derive
its real value from the rest of the state.


For Example ...
•  Data transmission techniques.
•  Redundant pointers
–  If enough pointers are OK, can rebuild
structures such as file system or database.
•  Sometimes, domain understanding is
enough to help figure out what “real” values
should be.


Backward Fault Recovery
•  Restore state to a known safe state.
–  E.g., Restore file system from tape backup.
•  Some systems are transaction oriented:
–  Changes are not committed until computation is
complete.
–  If an error is detected, changes are not applied.
•  Requires periodic “snapshots” to be taken and
saved.
•  Simple approach, but may entail loss of
data.
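
A minimal in-memory sketch of backward recovery: a snapshot of the state is taken before an update, and the snapshot is restored if the update fails part-way. The `Accounts` class is hypothetical, and copying the whole map is an assumption that only makes sense for small states.

```java
import java.util.HashMap;
import java.util.Map;

/** Snapshot-and-restore around a multi-step update (names are hypothetical). */
class Accounts {
    private Map<String, Integer> balances = new HashMap<>();

    void deposit(String owner, int amount) {
        balances.merge(owner, amount, Integer::sum);
    }

    int balance(String owner) {
        return balances.getOrDefault(owner, 0);
    }

    /** Moves money between accounts; on failure the pre-transfer state is restored. */
    void transfer(String from, String to, int amount) {
        Map<String, Integer> snapshot = new HashMap<>(balances);   // known safe state
        try {
            balances.merge(to, amount, Integer::sum);               // step 1: credit
            int remaining = balance(from) - amount;
            if (remaining < 0) {
                throw new IllegalStateException("insufficient funds");
            }
            balances.put(from, remaining);                          // step 2: debit
        } catch (RuntimeException e) {
            balances = snapshot;   // backward recovery: discard the partial update
            throw e;
        }
    }

    public static void main(String[] args) {
        Accounts bank = new Accounts();
        bank.deposit("a", 100);
        bank.transfer("a", "b", 30);
        try {
            bank.transfer("a", "b", 500);   // fails after the credit step
        } catch (IllegalStateException e) {
            System.out.println("rolled back: " + e.getMessage());
        }
        System.out.println(bank.balance("a") + " " + bank.balance("b"));   // 70 30
    }
}
```

Anything done after the snapshot is lost on rollback, which is the price of the simplicity noted above.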

