SIL2 Assessment of an Active/Standby COTS-Based Safety-Related System

Giovanni Mazzeo (a,∗), Luigi Coppolino (a), Salvatore D'Antonio (a), Claudio Mazzariello (b), Luigi Romano (a)
(a) University of Naples 'Parthenope' - Centro Direzionale, Isola C4, 80133 Napoli, IT
(b) Hitachi Ansaldo STS - Via Argine, 425, 80147 Napoli, IT
(∗) Corresponding author. Email addresses: giovanni.mazzeo@uniparthenope.it (Giovanni Mazzeo), luigi.coppolino@uniparthenope.it (Luigi Coppolino), salvatore.dantonio@uniparthenope.it (Salvatore D'Antonio), claudio.mazzariello@ansaldo-sts.com (Claudio Mazzariello), luigi.romano@uniparthenope.it (Luigi Romano)

Abstract
The need to reduce costs and shorten development time is resulting in an increasingly pervasive use of Commercial-Off-The-Shelf components also for the development of Safety-Related systems, which traditionally relied on ad-hoc designs. This technology trend exacerbates the inherent difficulty of satisfying – and certifying – the challenging safety requirements imposed by certification standards, since the complexity of individual components (and consequently of the overall system) has increased by orders of magnitude. To bridge this gap, this paper proposes an approach to safety certification that is rigorous yet practical. The approach is hybrid, meaning that it effectively combines analytical modelling and field measurements. The techniques are presented and the results validated with respect to an Active/Standby COTS-based industrial system, namely the Train Management System of Hitachi Ansaldo STS, which has to satisfy Safety Integrity Level 2 requirements. A modelling phase is first used to identify COTS safety bottlenecks. For these components, a mitigation strategy is proposed and then validated in an experimental phase conducted on the real system. The study demonstrates that, with relatively little effort, the target system can be configured in such a way that it achieves SIL2.

Keywords: Dependability, Reliability, Safety-Related, COTS, ICS, SIL.

1. Rationale and Contribution

The key building blocks of IT infrastructures responsible for the management, control, and regulation of industrial operations are generally referred to as Industrial Control Systems (ICS). Among these, a wide variety of critical systems exists, notably Safety-Critical Systems (SCS) and Safety-Related Systems (SRS). An SCS has full responsibility for controlling hazards and consequently its failure or malfunction may result in catastrophic outcomes, such as death or serious injury to people, loss or severe damage to equipment/property, or environmental harm. SRSs support SCSs, since they include the hardware and software that carry out one or more safety functions. Thus, failure of an SRS increases the risk for the safety of people and/or of the environment (EN50129 says: "SRS carries responsibility for safety"). The focus of this paper is on SRSs.
Due to their importance, SRSs must be proven to be reliable through rigorous and internationally accepted methodologies. Standards (e.g. EN50129) exist that classify quantitatively the likelihood of a failure through the concept of Safety Integrity Level (SIL). Specifications include four SIL levels, where SIL0 indicates that there are no safety requirements and SIL4 is typically reserved for SCSs. Table 1 shows the classification of the different SIL levels. For each of them, we report the associated Tolerable Hazard Rate (THR) bounds, the failure mode, the consequent hazard, and the related system typology (i.e., SCS or SRS). Making a system compliant with a specific SIL means providing evidence of the achievement of THR thresholds, which – for a complex system – is by no means a trivial task.
The assessment process is even more difficult when Commercial-Off-The-Shelf (COTS) components – whose internals are partially or totally unknown – come into play. COTS are increasingly used by industry to reduce costs and to shorten development (and possibly deployment) time. However, since COTS are general-purpose components that have not been designed and developed for robust operation, obtaining a predictable operation profile (for individual components and – even more – for the resulting system) is a challenging endeavour. The research community has proposed techniques – mostly guidelines – to help reliability engineers carry out the assessment of safety properties [1][2][3][4][5]. However, the aforementioned studies are mainly qualitative. It is also worth emphasizing that the existing literature – with a few exceptions (e.g. [1]) – does not refer to real industry applications, which limits the applicability of the proposed techniques to commercial setups. Conversely, we contribute a methodological framework that can be used in practice as a reference for certifying a wide class of emerging critical systems, virtually any system for which: i) the general architecture has already been designed, ii) business constraints impose that (radical) changes to the architecture be avoided, and iii) the main COTS components that must be integrated have already been chosen.
Table 1: Classification of SIL levels with associated THR and system typology

| THR | Failure Mode | SIL | Hazard | System Typology |
| ≥ 10−9 to < 10−8 | The system cannot be recovered by the operator | 4 | Catastrophic and Fatal Outcome | SCS |
| ≥ 10−8 to < 10−7 | Only an experienced operator can recover the system | 3 | Critical and Fatal Outcome | SCS |
| ≥ 10−7 to < 10−6 | An operator can recover the system | 2 | Marginal, Injuries may occur | SRS |
| ≥ 10−6 to < 10−5 | Minor availability issues. Always recoverable | 1 | Negligible, Minor injuries may occur | SRS |
| – | There is no safety requirement on the system | 0 | Nuisance, Dissatisfying to the user | – |
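For quick reference, the THR-to-SIL mapping of Table 1 can be written as a small lookup (an illustrative sketch, not part of the original paper; the band edges are the per-hour THR bounds listed above):

```python
def sil_from_thr(thr_per_hour: float) -> int:
    """Map a Tolerable Hazard Rate (failures/hour) to its SIL band, as in Table 1."""
    bands = [
        (1e-9, 1e-8, 4),  # SIL4 (SCS)
        (1e-8, 1e-7, 3),  # SIL3 (SCS)
        (1e-7, 1e-6, 2),  # SIL2 (SRS)
        (1e-6, 1e-5, 1),  # SIL1 (SRS)
    ]
    for lo, hi, sil in bands:
        if lo <= thr_per_hour < hi:
            return sil
    return 0  # no tabulated band; Table 1 reserves SIL0 for systems with no safety requirement

print(sil_from_thr(5e-7))  # -> 2, the band addressed in this paper
```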

We demonstrate, with respect to a real industrial system, that it is possible to achieve SIL2 via proper configuration of system parameters and set-up of rejuvenation procedures. The paper in fact addresses the SIL2 assessment of a real COTS-based SRS, specifically the two-node cluster server hosting the Train Management System (TMS) application of Hitachi Ansaldo STS (ASTS). The study enabled ASTS to identify the conditions under which the architecture of the Active/Standby cluster – incorporating COTS software and hardware – can be certified as SIL2 compliant, as specified in EN50129.
In order to achieve this goal, the system is assessed by means of a hybrid approach, i.e., a combination of analytical modelling and experimental evaluation. Analytical modelling is done using PRISM (the well-known and widely used tool for formal model checking), while experimental evaluation relies on direct measurements on the real system. The hybrid approach we present consists of several phases, which can be grouped into five main iterative stages, i.e.: 1) identifying the THR and safety bottlenecks of the system in its current configuration through formal models; 2) defining possible corrective actions; 3) weighing their effectiveness, i.e. evaluating the potential impact on the THR cluster model; 4) validating the results of the previous phases through an experimental campaign on the real system; 5) using the experimental estimates within the models to calculate the final THR.
In the specific case of the ASTS TMS system, the safety bottlenecks turned out to be the COTS Operating System (OS) and the Cluster Resource Manager (CRM). Extended versions of the cluster model were drawn to evaluate the effectiveness of mitigation actions. The least costly one (in terms of resources and effort), which also yielded a THR within SIL2, was software rejuvenation [21]. A Quantitative Accelerated Life Test (QALT) and an Accelerated Degradation Test (ADT) were used to estimate the experimental Mean Time To Failure (MTTF) and Time To Aging-Related Failure (TTARF) of a single server node in short-term and stressed executions, respectively. These measurements were used as input to the rejuvenation-extended cluster model, and the final value of the THR was found.
This paper makes three important contributions:
• It presents a comprehensive methodology, providing a practical path to SIL2 certification of COTS-based systems.
• It provides quantitative measurements of failure rates in a real setup, which can be reused in other industrial contexts.
• It quantifies the impact of some best practices that can be implemented in a wide class of systems to increase their dependability.
The remainder of this work is organized as follows. Section 2 gives an insight into previous work. Section 3 describes the ASTS use case. Section 4 presents the approach adopted. Sections 5-6 describe the cluster modelling and the mitigation strategies proposed, respectively. These are followed by Section 7, which describes the experimental validation phase. Finally, Section 8 concludes the document with some final remarks.

2. Related Work

The goal of this section is to review related work on methods for assessing the reliability of clustered COTS-based systems.
Connelly et al. [1] propose an approach to delineate safety assurance for the use of COTS OSs in safety-related applications that must fall within SIL2. Their solution focuses on developing encapsulation mechanisms able to isolate the influence of a COTS failure. They analyzed OS failure modes which may affect safety functions, and then proposed mitigation techniques whose impact is not reported. Unlike [1], our work provides an entire methodology with tangible and quantified results that can actually be reused elsewhere. Qualitative estimations are definitely needed, but they are not enough. Our contribution is a rigorous analysis of models and experimental results of the clustered system that gives solid evidence on the system reliability.
Jones et al. [5] survey available practical methods for assessing the safety integrity of COTS-based systems against the IEC 61508 standard. The authors propose the adoption of testing techniques such as stress, interface and statistical testing (black-box methods), or tools to check software, analyze data flows, and inject faults (white-box methods). Limitations of both families of methods are presented, i.e., the lack of failure data, which is confidential to the supplier; the difficulty of computing test results without automated mechanisms; the problem of optimizing time-consuming tests; and the difficulty of covering a wide range of software faults. Our paper does more than that: it defines an assessment methodology flow and shows an implementation on a real use case, including on-field evaluations.
Park et al. [4] evaluate the effect of rejuvenation actions on the availability of an Active/Standby cluster. The authors defined a generic Markov model through which to estimate how the system availability varies by changing the repair time and the rejuvenation period. The main weaknesses of this paper that we contribute to addressing are: i) a purely theoretical estimation, without evidence or proof from the real world; ii) an estimation of the best rejuvenation period that does not consider that the aging of the system depends on aging factors and therefore needs empirical evidence; iii) the underestimation of the possibility that the cluster resource manager fails, which is something that can certainly occur.
Skramstad et al. [2] study the possible solutions proposed by the academic and industrial community to address the certification of critical systems composed of COTS to a specific SIL level. They arrive at three different considerations: one could supervise the memory through memory-mapped storage or by calculating checksums regularly; another option is to test the system, even though this is an unfeasible solution when the COTS component is a whole operating system; and, finally, one could diversify the adopted COTS components to avoid common failure modes. This study, unlike ours, is fairly superficial: the authors report a few approaches available in the literature without providing a comprehensive analysis.
Finally, Pierce et al. [3] present a detailed report on how to assess Linux for SRSs. The document provides guidelines that should give more guarantees on the OS reliability. Many OS features are identified as possible sources of failure and for this reason should be disabled, i.e., the developer should create a monolithic kernel with a minimum number of functionalities. The conclusion of the study is that Linux, properly tuned, would be suitable for use in many safety-related applications with SIL1 and SIL2 requirements. Such a work is indeed valuable for the amount of information provided, but at the same time the effective impact of the mitigation techniques on reliability is not predicted.
What emerges from the literature is the lack of quantitative estimations and practical examples of certification assessment in industrial applications. Several techniques or approaches are presented, but none of them can actually support the certification. This work aims to bridge the gap by providing an entire, verified approach to face the SIL2 certification of already existing system setups.

3. Overall System Architecture

In this section we provide a description of the system under study, and of its surrounding environment, on an "as is" basis. The architecture presented in the following is the one currently adopted on-field by ASTS (e.g., at the Rome train station).
The Train Management System (TMS) – investigated in this work – is part of a larger and more complex set of subsystems needed for train monitoring that goes under the name of Railway Traffic Control System (RTCS). The RTCS used by ASTS is a hierarchically structured system designed to monitor railway traffic and to control railroad signals and track switches in order to regulate train movements and prevent fatal accidents. The overall architecture of the RTCS (Figure 1) is composed of the following layers:

[Figure 1: The ASTS Railway Traffic Control System – HMI workstations and the Train Management System Application Server (Node 1 Active / Node 2 Standby, each running the ASTS TMS application on Pacemaker & Corosync and Linux SUSE OS, with NIC, fans, HDDs, power supplies, RAM and CPU on a motherboard; this cluster is the SIL2 target), above the SIL4 Interlocking System (IXL), the RBC, the ERTMS Euroradio link, and the field sensors/actuators.]

Sensors/Actuators - Multiple sensors and actuators are deployed on rail tracks, trains, traffic lights, and the surrounding environment to receive monitoring information and send controlling signals.
European Rail Traffic Management System (ERTMS) Euroradio - A safety communication protocol used to transfer data between sensors and the Radio Block Center.
Radio-Block Center (RBC) - A component responsible for collecting signals coming from or directed to field sensors through a GSM-R radio transmission (at 900MHz).
Interlocking (IXL) - A Safety-Critical System made of multiple signal apparatus able to prevent trains from conflicting movements, by allowing trains to receive authority to proceed only when routes have been set, locked and detected in safe combinations. The IXL is a vital subsystem with hard real-time constraints. A missed deadline of the IXL may cause catastrophic failures. For this reason, ASTS designed the IXL with HW&SW dedicated to hard real-time systems, which were certified at SIL4 level.
Train Management System (TMS) - The TMS provides non-vital functions that oversee and automatize train movements and support the dispatcher in operations of train traffic control and management over a wide area of a railway network. The TMS communicates with the distributed IXL signalling system to monitor the railway area and send control signals. Data received from the IXL is, e.g., stored in databases or sent to multiple Human Machine Interface (HMI) workstations. The TMS is installed in a central control centre where operators take decisions.
The TMS is considered failed either if the monitoring application provides the operator with wrong information that does not reflect the real railway status (functional failure), or if the service completely stops being provided (non-functional failure).
The architecture currently adopted by ASTS, therefore, aims at achieving a high degree of availability through redundancy. However, the mere usage of redundant HW subsystem units is not sufficient for SIL2 certification, especially if the system brings COTS components into it.
Unlike the IXL, the TMS provides non-vital functions. A failure of the TMS could lead, for example, to the planning of colliding routes. In such a case the underlying SIL4 fail-safe IXL would prevent loss of life and avoid the actual impact, e.g. by stopping the trains. However, the shutdown of trains would result in degradation of QoS and reduction of service availability, which turn into loss of money and reputation for the train company. For this reason, the TMS is classified as a Safety-Related System.
The TMS is based on a client-server architecture. It is a collection of commercial redundant equipment (dual application servers, dual database servers and at least dual workstations) connected to a high-speed LAN. The core of the TMS subsystem is the Application Server, which runs the most important SW modules. The TMS Application Server – for simplicity, from now on only TMS – is composed of two server machines clustered in an Active/Standby configuration to provide uninterrupted service. To further enhance reliability, redundancy is applied not only in terms of replicated machines but also in terms of subsystem units (e.g. RAM, disks, fans, power supplies) in order to eliminate Single Points of Failure (SPF). Unit replicas are equal, that is, ASTS does not enforce a diversity fault-tolerance mechanism, which would involve the usage of subsystems of different technologies. Each server node, connected to its motherboard, has: 2 hot-swap power supplies, 2 hot-swap fans, 2 RAMs, 1 CPU, 2 HDDs in RAID1 configuration, and 1 Network Interface Card (NIC).
On top of the hardware, ASTS employs Linux SUSE with the High-Availability (HA) Extension: a COTS OS provided with two clustering software components (the Pacemaker Cluster Resource Manager (CRM) and Corosync) responsible for resource orchestration, failure diagnosis, node coordination, and fail-over management. The Pacemaker CRM allows for monitoring the health and status of node resources, managing dependencies, and automatically stopping and starting services. It relies on the Corosync messaging layer, in charge of the reliable messaging, membership and quorum information needed by the cluster for node orchestration. For simplicity, in the rest of this paper, Pacemaker and Corosync are indicated as a single entity under the CRM notation.
The topmost layer of the system is the ASTS proprietary TMS application. This consists of several SW modules that communicate with the IXL through a CORBA message broker. Since the ASTS proprietary application software is beyond the scope of our work, the reliability of the broker itself is not addressed in our study (except for aspects related to its interaction with other software modules). The decision of using a CORBA broker was taken by the ASTS proprietary application software development team. In this study, we assume that the ASTS proprietary application software (which includes the broker) is the result of a rigorous development process and of a thorough reliability testing activity. The ASTS proprietary application software uses FT-CORBA, which embeds fault tolerance mechanisms that help to ensure a resilient and highly-available message broker service. FT-CORBA was used, e.g., for an Air Traffic Control System, which is indeed a safety-critical system [12].
The focus of this work is to perform a thorough reliability evaluation on the software layers below the ASTS proprietary application.

4. Concepts and Approach

The Safety Integrity Level (SIL) is a way to indicate the tolerable failure rate – interchangeably defined as Tolerable Hazard Rate (THR) – of a particular Safety Instrumented Function (SIF). The TMS implements one main SIF, i.e., the continuous monitoring and controlling of a particular railway area, which must be kept in a safe state with respect to the functional or non-functional failures defined in Section 3.
The correct provision of the SIF does not depend on the time in which it is performed, i.e., the TMS does not have stringent real-time requirements. This work assumes that functional or non-functional failures can be generated by (Figure 3): 1) Hardware Faults, caused by a hard or soft malfunction within the electronic circuits or electromechanical components, which require a repair intervention or a reboot, respectively; 2) Software Faults, caused by errors influenced by the total time the system has been running, i.e., Aging-Related Mandelbugs, or by any other SW fault not caused by aging errors, i.e., Non-Aging-Related Mandelbugs [16].
Software aging is often the cause of Linux failures at both kernel and user level, as Cotroneo et al. [21] demonstrate. Aging-related faults – caused by exhaustion of OS resources (e.g. memory leaks) – can reduce performance or culminate in a system hang/crash.
If a fail-stop failure of a node occurs, that is, the server node exhibits a crash or a hang, then the CRM detects it, unless the CRM itself has failed. The TMS cluster stops providing services (it is down) when: the active node fails and the CRM fail-over operation is in action, or the active node and the CRM fail, or the active and standby nodes fail together.
The ratio between the number of manifested TMS failures and the operating time is the failure rate, which is the inverse of the Mean Time Between Failures (MTBF): the mean operating time (uptime) between failures of the cluster.
The overall goal of the assessment process is to provide evidence able to demonstrate that the TMS failure rate falls within SIL2 bounds: 10−7 ≤ THR_SIL2 < 10−6. The assessment would be facilitated if it had been possible to build the system from scratch using hardware and software components dedicated to mission-critical systems (e.g. VxWorks, PikeOS, QNX). However, this would have meant a drastic increase in costs for ASTS, which cannot be justified for a system without stringent real-time requirements. For this reason, the fulfillment of SIL2 is pursued within the context of the already defined TMS architecture, and using the COTS components that have already been selected. The paper shows how it is possible to configure the existing design so as to achieve the desired level of reliability.
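To make the quantities above concrete: the failure rate is the number of observed failures divided by the cumulative operating time (the reciprocal of the MTBF), and SIL2 compliance means that this rate lies in the [10−7, 10−6) per-hour band. A minimal sketch with hypothetical numbers (not measurements from the paper):

```python
def failure_rate(num_failures: float, operating_hours: float) -> float:
    """Failure rate in failures/hour; its reciprocal is the MTBF."""
    return num_failures / operating_hours

def within_sil2(thr: float) -> bool:
    """SIL2 band for Safety-Related Systems: 1e-7 <= THR < 1e-6 failures/hour."""
    return 1e-7 <= thr < 1e-6

# Hypothetical fleet data: 5 failures over 10 million aggregated operating hours.
lam = failure_rate(5, 1e7)
print(f"rate = {lam:.1e}/h, MTBF = {1/lam:.1e} h, SIL2: {within_sil2(lam)}")
```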
[Figure 2: Assessment Methodology Flow Diagram – a preliminary study and the definition of model inputs feed the units, nodes and cluster modelling and the reliability evaluation; if THR_TMS ≥ THR_SIL2, mitigation strategies are enforced and the modelling loop repeats; an experimental validation phase (accelerated tests with QALT and ADT analyses) then re-evaluates reliability until THR*_TMS < THR_SIL2.]

[Figure 3: Typologies of faults that could cause TMS failure – software faults (non-aging-related and aging-related Mandelbugs) and hardware faults (soft and hard malfunctions).]

To do that, in this paper a hybrid approach is pursued (Figure 2), composed of the following three main phases:
System Analysis and Modeling - In this work, a three-level hierarchical modeling – composed of combinatorial and state-space models – is used: 1) first, the failure rates (λ_U) of the TMS servers' subsystem units are estimated through continuous Markov chains; 2) then, the single server node failure rate (λ_N) is produced through a Fault Tree, which ORs the results of the previous phase; 3) finally, at the top, the overall cluster is modelled. Such a model is specified and analysed using a formal verification method. In particular, in this paper, the probabilistic model checking tool PRISM [18] has been used. This allows an automatic verification of specific properties of the probabilistic model defined, useful to determine the compliance of the cluster with SIL2 requirements.
Mitigation Strategies Enforcement - The most relevant outcomes coming from the TMS modelling are leveraged to propose possible ways to enhance the final cluster THR. In this work, mitigation strategies are proposed to address both software and hardware failures. The possible solutions identified were: i) increasing the number M of active nodes and enforcing an M-to-1 configuration; ii) reducing Single Points of Failure (SPF); iii) enforcing a software rejuvenation of the server nodes to reduce the impact of aging-related failures, and taking advantage of the system reboot to alternate the Active node. The latter is chosen as it is a good compromise between costs and impact on the overall THR.
Experimental Validation - Experiments are really important in the certification process. Standards (e.g. EN50129) clearly state that "Fail-safe behaviour of component under adverse conditions shall be demonstrated", and it is desirable to obtain "Evidences that the failure mode will not occur as a result of component ratings being exceeded". The experimental phase, in this paper, aims at: 1) demonstrating the soundness of the TMS node failure rate estimation; 2) defining the TMS TTARF in order to determine a proper rejuvenation period. The system is subjected to a stress loading scheme through a workload generator. Then, failure and degradation data sets are used for a QALT/ADT analysis. QALT and ADT are proven solutions to measure reliability metrics and, at the same time, to understand and quantify the effects of stress. QALT/ADT are usually leveraged for HW components; however, they have been demonstrated to be feasible for observing the behavior of SW suffering from software aging [20][22].

5. Modeling and Analysis

The three-level hierarchical modelling of the TMS cluster is detailed here. In the remainder of this section different rates – having an exponential distribution – are used. While some, like failure rates, depend on the unit/system under study, others are fixed. The MTTR, e.g., is based on ASTS service agreements, which guarantee unit replacement within 18Hrs → µ = 1/64800s. The Time to Switch – equal to 30s – is evaluated through on-field tests of Active-Standby switches → θ = 1/30s. In the same manner, the Time to Reboot has been calculated and is equal to 302s → γ = 1/302s.
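The unit-level building block used throughout the next subsection is a small Markov chain of a replicated, repairable unit. Below is a minimal sketch under simplifying assumptions (exponential failure and repair times, one repair at a time); the per-replica failure rate is hypothetical, and the paper's published per-unit values come from its own, more detailed sub-models:

```python
# 1-out-of-2 repairable unit: state 2 = both replicas OK, state 1 = one failed
# (under repair), state 0 = both failed (unit failure, absorbing state).
# Transitions: 2 -> 1 at 2*lam, 1 -> 0 at lam, 1 -> 2 at mu.
# Solving the embedded equations of this chain gives the closed form
#   MTTF = (3*lam + mu) / (2*lam**2),
# and the effective long-run failure rate of the duplex unit is roughly 1/MTTF.

def duplex_mttf(lam: float, mu: float) -> float:
    """Mean time to failure (hours) of a duplex unit; lam and mu are per-hour rates."""
    return (3.0 * lam + mu) / (2.0 * lam ** 2)

lam = 1e-4        # hypothetical per-replica failure rate (1/h)
mu = 1.0 / 18.0   # repair rate from the 18-hour replacement agreement quoted above (1/h)
mttf = duplex_mttf(lam, mu)
print(f"duplex MTTF ~ {mttf:.2e} h, effective failure rate ~ {1/mttf:.1e} /h")
```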
5.1. Subsystem Units

Estimating the subsystem units' failure rates is fundamental for the final THR definition. In this regard, dedicated Markov chains – needed to model the replication of a specific unit – have been used. Reporting a diagram for each element would be excessive, as in most cases these are trivial. What is important, instead, is the definition of the failure rate values used as input to the models, since these are particularly difficult to define.
Unfortunately, getting faithful estimates is non-trivial, as it is difficult to be sure of their accuracy. Nevertheless, field studies, which are also strongly recommended by standards, represent a good option. The CEI-EN-50129 standard, e.g., says: "Random HW failure rates, or probabilities of component failure, should be based on field data if possible".

5.1.1. Memory
TMS memories are replicated and equipped with IBM Chipkill technology, which protects memory from any single chip failure and from multi-bit errors. Therefore, the Markov model related to this subsystem should model the ChipKill behaviour. However, Sridharan et al. [13] already bring the ChipKill mechanism into their memory failure estimation. For this reason, the RAM Markov chain can take into account only the effect of replication. The researchers estimated through an experimental campaign that the failure rate – in accordance with [8] – is λ = 1 × 10−5. Using this estimate, the memory Markov model yields λ_RAM = 2.6 × 10−8.

5.1.2. Disks
The disk model needs to take into account the RAID1 mirrored replication, as in Eckart et al. [19]. Schroeder et al. [7] performed an exhaustive 5-year test campaign on 100,000 disks of different types in high-performance computing sites. The authors drew empirical conclusions on field replacement rates of drives less than five years old, and on failure rate trends over the years. They obtained a failure rate λ = 2 × 10−4. Such a value, given as input to a RAID1-aware model, produces λ_HDD = 3.1 × 10−8.

5.1.3. CPU
The failure rate estimate for this unit has been extracted from the work of Nightingale et al. [9], who – in their field study – made a thorough analysis of hardware failure rates for consumer PCs, investigating failures coming from CPU, DRAM, and disks over 8 months of tests. They found that CPU and DRAM failures are strongly dependent. The authors state that the numerical reliability results obtained are consistent with [7][8]. In the worst case, the estimated failure rate of CPUs is λ_CPU = 5.8 × 10−5 [9]. This is the final value used, as the CPU has no replicas.

5.1.4. Fans
Jin et al. [14] conducted a deep study on the reliability of fan systems. They report fan reliability using different parameter values of the Weibull distribution, acceleration factors, and reliability tests. The authors conducted accelerated tests with higher levels of stress in order to estimate the failure rate, obtaining 4 × 10−4. This – given as input to a trivial Markov chain needed to model the unit replication – yields λ_Fan = 3.2 × 10−6.

Table 2: Reliability Orders of Magnitude

| | RAM | CPU | HDD | Fans | M.Board | NIC | PWR | OS | Node | ASTS Cluster |
| λ | 10−8 | 10−5 | 10−8 | 10−6 | 10−5 | 10−6 | 10−8 | 10−5 | 10−5 | 10−6 |

5.1.5. Motherboard
Even if a rigorous statistical study on motherboards is not available, an estimation of the motherboard failure rate is needed. In this work a rough measure of the motherboard failure rate has been carried out: starting from [11], the rate of motherboard components failed in one year is equal to 0.7%. Knowing that the number of ASUS (one of the market leaders) motherboards produced in one year is 17M, the number of failed units is around 119K. Hence, the evaluated failure rate, supposing a mean usage time of 8Hrs per day, is λ_Mboard = 1.6 × 10−5.

5.1.6. NIC and Power Supply
Smith et al. [6] report a credible range of vendor-confidential failure rates for different hardware units based on IBM predictions. The upper bounds, representing the worst measure, may be adopted in this case. Hence, the failure rates of NIC and Power Supply are, respectively: λ_NIC = 5.6 × 10−6 and λ_PWR = 6 × 10−4. The latter is replicated and therefore its final value – produced through a Markov model – is λ_PWR = 8.7 × 10−8.

5.1.7. Operating System and CRM
Modelling the OS and the CRM is not feasible, as their intrinsic "stateful" characteristics make such a task impractical. However, their contribution to the overall reliability of the server node must be taken into account, as failures in the OS or the CRM can certainly lead to an outage of the TMS. For this reason, in this paper, the worst estimates of λ_OS and λ_CRM that can be found in the literature are considered for the modelling phase. An estimate of the OS failure rate is obtainable from [15]. In that work, different server operating systems were tested over one year. The authors argue that Linux SUSE, the one on the ASTS TMS servers, had 13 minutes of outages. This means that in one year the failure rate is λ_OS = 2.4 × 10−5. Regarding the CRM, instead, the failure rate estimation is extracted from the work of Mendiratta et al. [17], which statistically determines λ_CRM = 2.2 × 10−4.

5.2. Nodes
The evaluation of the node's failure rate is realized through a Fault Tree which merges the previous results. It is assumed that a single node fails in case at least one of the replicated/non-replicated units analysed before goes down. Hence, the units are in an OR relation with each other. This means that the resulting failure rate is strongly influenced by the worst unit.
It is worth noting that the server fault tree also brings in the COTS OS contribution. This, in fact, cannot be excluded from the final evaluation, as its effects certainly impact the overall reliability. The CRM failure rate, instead, is not taken into account in the node estimation since it is part of the cluster model shown in the next subsection.
The estimated node failure rate is λ_Node = 6.7 × 10−5.
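Since the node-level Fault Tree ORs independent, exponentially distributed unit failures, the node behaves as a series system: for a mission time t the node failure probability is 1 minus the product of the unit survival probabilities, and to first order the node rate is the sum of the unit rates, which is why the worst unit dominates. A minimal sketch with hypothetical unit rates (the paper's λ_Node = 6.7 × 10−5 is obtained from its own unit estimates):

```python
from math import exp

# Hypothetical per-hour unit failure rates feeding an OR (series) fault-tree gate.
unit_rates = {"cpu": 5e-5, "os": 2e-5, "board": 1e-5, "fans": 3e-6, "nic": 5e-6}

def node_failure_probability(rates: dict[str, float], hours: float) -> float:
    """Exact OR-gate combination: 1 minus the product of unit reliabilities over 'hours'."""
    survival = 1.0
    for lam in rates.values():
        survival *= exp(-lam * hours)
    return 1.0 - survival

lambda_sum = sum(unit_rates.values())        # first-order series approximation (1/h)
p_year = node_failure_probability(unit_rates, 8760)
print(f"sum of rates ~ {lambda_sum:.1e}/h, P(node fails within a year) ~ {p_year:.2f}")
```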
5.3. Cluster
A formal modelling and analysis of the overall cluster is performed through the probabilistic model checker PRISM [18], developed at the University of Birmingham. PRISM provides automatic verification of Continuous Stochastic Logic (CSL) properties for Continuous-Time Markov Chains (CTMCs) properly defined in a state-based language. It allows constructing a mathematical model able to capture the system's behaviour, and then using it to analyse formally-specified quantitative properties. PRISM parses one or more temporal logic CSL properties – properly defined by the user – and performs model checking, determining whether the model satisfies each property.
The model reflecting the TMS cluster (available on GitHub at https://github.com/dzobbe/cluster-reliability-model) comprises three components that need to be evaluated: the active and standby nodes, and the CRM. As already stated, these can manifest software or hardware faults leading to a crash. The TMS is considered down when the active node fails and the CRM is performing the failover operation, or both the active and standby nodes fail, or the active node and the CRM fail. The failure rates of active and standby nodes are assumed to be the same, since the standby is in Hot configuration, i.e., it is running and continuously updating the application state from the active node. The TMS analysis in PRISM focuses upon a CTMC with exponentially distributed transition rates. To accurately delineate the different states in which the cluster can be, four PRISM modules are defined: active nodes, standby nodes, CRM, and cluster. The number of states (or nodes) belonging to the active and standby modules is kept generic (N_A, N_S) in order to subsequently verify how results vary by changing the number of active and standby nodes (S_A = {sa_0, ..., sa_NA}; S_S = {ss_0, ..., ss_NS}). A transition s_i → s_{i−1} represents a node failure with rate λ, while the transition s_i → s_{i+1} stands for a node repair with rate µ.
The CRM module, instead, is represented by only two states (S_C = {sc_0, sc_1}) needed to indicate which of the two possible execution conditions the CRM is in, i.e. whether it is working, or it has failed due to a software fault. The failure occurs with rate λ_CRM, while the recovery, i.e. a node restart, is realized with rate µ_CRM.
The cluster module, instead, is used to define the overall state of the TMS at each instant of time. In PRISM jargon, it is synchronized on the state transitions of the other modules: when a module changes its state, the cluster module reacts accordingly. When the failure of the last active node occurs, the [active_failure] action triggers the associated states of the cluster module.
Figure 4 reports the Markov chain representing the cluster module. Gray nodes mean that the cluster is temporarily down. The TMS starts in state 7, where both nodes and the CRM are working correctly. The cluster may experience a failure of the active node (with rate λa), moving the system to an intermediate state in which the CRM is performing the failover (in time 1/θ). After that, the cluster goes to state 8, where it is up but with only one node available. A repair intervention (with rate µa = 1/MTTR) could bring the system back to state 7. If, before the repair is accomplished, the CRM or the standby node fail, then the cluster goes down in state 3 or 5. The rest of the model is quite similar and does not require further description.

[Figure 4: TMS Cluster Markov Model – states 0-8 with transitions labelled by λa, λs, λcrm, µa, µs, µcrm, and θ.]

The TMS behaviour is then checked through a set of CSL properties defined in PRISM syntax. A set of rewards is also assigned to states. Through these, a wider range of quantitative measures relating to the model behaviour can be observed. For the TMS case study, three rewards have been defined to evaluate three cluster conditions: up, danger, down.
The following properties are model checked for the TMS:
(1) The expected TMS failure rate until time T.
(2) The expected time the TMS cluster is in state down, danger, up until T.
(3) The probability of any failure occurring within T days.
(4) The probability that the Active/Standby/CRM cause the failure of the TMS.

5.4. Results
Figure 5 reports the main outcomes of the TMS model checking activity.

Figure 5: TMS Formal Model Results (panels a-d). Panel (d) reports the expected number of failures and the corresponding failure rate over time:

| Time (Hrs) | Num Failures | Failure Rate |
| 1080 | 0.003 | 3.43E-7 |
| 2880 | 0.008 | 9.17E-7 |
| 5040 | 0.014 | 1.59E-6 |
| 6840 | 0.019 | 2.17E-6 |
| 8760 | 0.024 | 2.75E-6 |

A first result of interest comes from the evaluation of property 1, which relates to the time spent by the system in the conditions of up (state 7), down (states 6, 5, 4, 3), or danger (states 0, 1, 2, 8). Figure 5a shows the evaluation in a time window of 365 days. The TMS is down 4.38 × 10−4 days, it is in danger 1.75 × 10−2 days, while it is up 364.999 days. This means that the system Availability = Uptime/(Uptime + Downtime) = 0.99999.
It is interesting to emphasize another remarkable outcome: the inverse proportionality relation between the number of active nodes and the probability of TMS failure. The result – an estimate of property 2 – is plotted in Figure 5b over a time window of 30 days. The increased level of redundancy, in fact, leads to a reduced downtime. When one active node fails, the TMS does not experience any downtime caused by the active-standby switch.
Then, property 3 is evaluated. The table in Figure 5d reports the number of failures occurred – and the related failure rate – as the time increases. The numbers tell that, with the current configuration, the cluster has a failure rate (2.75 × 10−6 in 8760Hrs) larger than the SIL2 threshold.
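As a cross-check of the reported figures, both the availability and the failure rate follow directly from the reward values quoted above and from the last row of the table in Figure 5d:

```python
# Time spent up/down over a 365-day window, as reported by the model checker.
up_days, down_days = 364.999, 4.38e-4
availability = up_days / (up_days + down_days)
print(f"availability ~ {availability:.7f}")   # ~0.9999988, i.e. the reported ~0.99999

# Expected failures over one year of operation (last row of Figure 5d).
expected_failures, horizon_h = 0.024, 8760
rate = expected_failures / horizon_h
print(f"failure rate ~ {rate:.2e} /h")        # ~2.7e-6 /h, above the SIL2 upper bound of 1e-6
```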
Finally, by evaluating property 4, it is possible to notice how much the COTS CRM impacts the overall reliability. Figure 5c, in fact, shows the probability of each failure type (the CRM, the active or the standby node) occurring first. It is evident that the CRM has the highest effect on the overall reliability. At the same time, the standby node has the lowest impact, due to its non-running condition.

6. Mitigation Strategies

The results coming from the TMS formal model verification proved that the current system configuration is not compliant with the SIL2 bounds. Hence, mitigation strategies need to be defined and applied in order to reach the desired level of reliability. A first approach consists in using different cluster configurations. In this sense, two possibilities may be pursued, i.e.:
• Enforce an M-to-1 structure, that is, keeping one standby node and increasing the number M of active nodes. In fact, as mentioned in 5.4, the probability of failure could drastically decrease by adding only one node. In that case, the PRISM modeling tool produces a failure rate of 3.41 × 10−8, which could be enough to satisfy SIL2 requirements. However, such a solution may be expensive for ASTS in terms of equipment purchasing costs and on-field server maintenance.
• Replace the Active/Standby configuration with Active/Active. In this way, both nodes actively run the TMS application and the workload is balanced between them. This reduces the load on each server, with less stressful situations for the HW components, which get equally worn out. However, this solution would entail that both nodes accumulate software errors – e.g., due to memory leaks – which could remain dormant or not. A different solution that also ensures a uniform wear of the HW units is explained later.
An additional approach aims at mitigating the failure probability of the most critical components – i.e., the COTS OS and its CRM – which were proven to be the weak points of the cluster (Figure 5c). Regarding the OS, measures can be taken, for example, at kernel level, as suggested by Pierce et al. [3], who provide thorough guidelines in this sense. The idea is to configure the kernel to serve only the critical application, disabling unused modules, peripheral drivers, the graphical X Window System interface, and unused user processes. Other actions can be enforced on Linux SUSE and, in particular, on Pacemaker/Corosync. In fact, there are CRM settings that could affect the response of the system in case of failure conditions and that define policies on the management of the critical service. All of these, however, can provide only a small improvement, which is also difficult to quantify.
Hence, what is proposed here is a technique highly used in reliability engineering: Software Rejuvenation [21][4], a proactive fault tolerance technique in which the system is periodically rebooted to clean the memory. In fact, it is well known that most critical SW failures are transient. These may be caused by error conditions due to the software aging phenomenon, that is, the issue that SW can exhibit data corruption or unlimited resource consumption during its execution time. Although the faults left by developers still remain, the periodic rejuvenation can help to remove or at least minimize transient failures, thus reducing possible outages.
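In practice, the rejuvenation policy amounts to a periodic, alternating switch-and-reboot of the two nodes, triggered only in low-workload windows. The following is a minimal, illustrative sketch (the helper functions are hypothetical placeholders, not ASTS's actual implementation, which is driven by the Pacemaker/Corosync stack):

```python
REJUVENATION_PERIOD_H = 336      # alpha from the modelling phase (revised empirically in Sec. 7)
LOW_LOAD_HOURS = range(1, 5)     # 01:00-05:00, when rail traffic is assumed negligible

def switch_over(to: str) -> None:
    print(f"switching the TMS service to {to}")   # placeholder for the CRM fail-over

def reboot(node: str) -> None:
    print(f"rebooting {node}")                    # placeholder for a clean node reboot

def should_rejuvenate(node_uptime_h: float, hour_of_day: int) -> bool:
    """Rejuvenate once the node has run past the period, but only off-peak."""
    return node_uptime_h >= REJUVENATION_PERIOD_H and hour_of_day in LOW_LOAD_HOURS

def rejuvenate(active: str, standby: str) -> tuple[str, str]:
    """Move the service to the standby node, reboot the old active one, swap roles."""
    switch_over(to=standby)
    reboot(active)
    return standby, active

active, standby = "node1", "node2"
if should_rejuvenate(node_uptime_h=400, hour_of_day=2):
    active, standby = rejuvenate(active, standby)
```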
In order to evaluate the rejuvenation impact, an extension to the PRISM CTMC model seen in 5.3 is provided. Figure 6 shows the new states added to the cluster model. It is assumed that each node, after a certain time interval (α), moves to the failure-prone state 9. From that moment, an unexpected shutdown of the Active node (with rate λa) may be experienced, thus requiring the switch-over to the Standby node (S_i), which will be completed after the Time to Switch (θ). Only a rejuvenation action – occurring with rate β – can bring the cluster back from state 9 to the state of higher robustness.
The process of rejuvenation is leveraged as an opportunity to periodically switch the Active node running the TMS application, thus avoiding a situation where one node always runs, accumulates SW errors, and stresses the same HW units. Hence, when the rejuvenation begins, the cluster enforces the switch to the other node within time θ. Therefore, the cluster will have only one server available in the time interval required to implement the hardware reboot (rate γ) of the node under rejuvenation. This is acceptable since the rejuvenation action may be triggered in time windows of low workload.

[Figure 6: Rejuvenation Extension – additional states 9-11 attached to the cluster model, with transitions labelled by α, β, γ, θ, λa, λs, and µa.]

Rates α and β have been tuned through PRISM formal checking. The results, in line with [10], demonstrate that the optimal values are α = 336Hrs and β = 168Hrs. The rejuvenation yields a better failure rate result: 2.28 × 10−7. Such an estimate – compliant with SIL2 requirements – now needs an on-field verification.

7. Experimental Validation

This section illustrates the experimental validation conducted on a TMS cluster test bed having the same architecture seen in Section 3. In such a test bed, the TMS receives messages on the railway status from a simulated IXL actually used by ASTS for test purposes. The experiments aim at proving the validity of the model results and of the proposed rejuvenation strategy. The analysis of the experiments moves in two directions. The idea is to test the TMS under stressful conditions by looking at:
(1) Functional and crash/hang failures caused by aging/non-aging related Mandelbugs, to evaluate the on-field Mean Time To Failure (MTTF) of single server nodes.
(2) Aging degradation factors, to estimate the Time To Aging-Related Failure (TTARF) and thus be able to define an empirical rejuvenation period.
In both cases, two widely accepted techniques [20][22] have been used to obtain valid estimates and, at the same time, minimise the test duration.
A Quantitative Accelerated Life Test (QALT) is the technique adopted to evaluate (1). QALT is designed to provide reliability information on a component or system through failure data obtained from an accelerated test. It is usually adopted for hardware component testing, but it has also been demonstrated by Matias et al. [20] to be an accurate solution to observe – in short-term and stressed executions – the behavior of systems suffering from software aging. QALT allows a quantification of the MTTF by applying controlled stresses useful to reduce the test period. Then, QALT uses the lifetime data obtained under stress conditions to estimate the lifetime distribution of the system in its normal use condition.
An Accelerated Degradation Test (ADT) is the method used to analyse (2). ADT extends the QALT technique. Rather than looking at failure times, ADT uses degradation measures to produce a time to failure. ADT fits well for aging-related failures, as these are particularly difficult to observe empirically. However, ADT is designed for physical systems. Hence, this paper makes use of the approach of Matias et al. [22], who made ADT applicable to software aging studies.
Both QALT and ADT make use of a life-stress relationship to model the connection between the failure/degradation times observed at some stress levels and the system lifetime distribution in its normal use conditions.
Following [20][22], the relationship model adopted in this work to estimate the time to failure is the Inverse Power Law (IPL): L(s) = 1/(k·s^w), where L is the SUT life characteristic (e.g. the mean time to failure), s represents the stress level, while k and w are model-related parameters to be determined from the observed experiments. The particularity of the IPL is that scaling s by a constant causes a proportionate scaling of L. A linear relationship exists when both L and s are on a log-log scale: ln(L) = −ln(k) − w·ln(s).
Clearly, experimental results have a certain variability which needs to be considered. This means that a plain IPL is not enough. Therefore, the IPL is usually integrated with a specific pdf that allows the definition of confidence intervals. The idea is to make the pdf's mean-time-to-failure parameters dependent on the stress variable. In this work, an IPL-Weibull distribution is used. This choice seems reasonable as the Weibull, unlike the Exponential distribution, has a non-constant hazard function, which is appropriate in the case of experimental evaluations as systems get old. The two-parameter Weibull is used, having the following pdf and CDF: f(t) = (β/η)·(t/η)^(β−1)·e^−(t/η)^β; F(t) = 1 − e^−(t/η)^β, where β is the shape parameter, i.e., the slope of the Weibull CDF, and η is the scale parameter (or characteristic life). In reliability terms, f(t) is the time-to-failure density, while R(t) = 1 − F(t) is the reliability (survival) function. The ratio λ(t) = f(t)/(1 − F(t)) is the hazard function, i.e., the instantaneous failure rate as a function of age. β governs the trend of λ(t), which can increase (β > 1) or decrease (0 < β < 1).
A common approach is to evaluate the β and η parameters using Maximum-Likelihood Estimation (MLE), which generates those numbers from the observed failure data set. The scale parameter, e.g., is calculated for n observations as η = [(Σ_{i=1}^{n} T_i^β)/r]^{1/β}, where T is the observation time and r represents the number of failures.
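Because the IPL is linear on a log-log scale, k and w can also be obtained from the per-stress-level lifetimes with an ordinary least-squares fit. A simplified sketch with hypothetical lifetimes is shown below (the paper itself fits the full IPL-Weibull model by maximum likelihood):

```python
import numpy as np

# Hypothetical mean lifetimes (hours) observed at three accelerated stress levels (msg/s).
stress = np.array([35.0, 25.0, 20.0])
life = np.array([400.0, 900.0, 1500.0])

# IPL on a log-log scale: ln(L) = -ln(k) - w*ln(s)  ->  fit a straight line.
slope, intercept = np.polyfit(np.log(stress), np.log(life), deg=1)
w = -slope
k = np.exp(-intercept)

use_stress = 2.0  # normal-use rate of Update Messages (msg/s)
life_at_use = 1.0 / (k * use_stress ** w)
print(f"w ~ {w:.2f}, k ~ {k:.2e}, extrapolated life at {use_stress} msg/s ~ {life_at_use:.0f} h")
```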
The IPL-Weibull is derived by setting η = L(s) in the Weibull pdf defined before, yielding:

f(t, s) = k·s^w·β·(k·s^w·t)^(β−1)·e^−(k·s^w·t)^β      (1)

So, using (1), the mean life of the system in its use condition can be evaluated: MTTF(s) = (1/(k·s^w))·Γ(1/β + 1).
Once the approach and the pdf distribution to use have been defined, the workload and the stress schemes need to be characterized.

7.1. Workload and Stress Loading Scheme Characterization
The enforcement of QALT/ADT requires the definition of the accelerating test variable, the workload, and the stress loading scheme (i.e. stress levels, constant/varying stress). It is important to determine a proper stress variable that can really expose the system to: functional and crash/hang failures caused by non-aging-related Mandelbugs, and those failures triggered by aging-related errors.
The TMS continuously receives Update Messages (UMs) on the railway occupation status from the underlying IXL. The data obtained is then sent to other systems (e.g. workstations, databases) through a CORBA message broker. Each UM triggers a large and diverse number of libc syscalls issued by the TMS monitoring application to the Linux kernel. Furthermore, the richer the content of the UMs (i.e. the number of trains), the larger the number of TMS application states visited. This inevitably yields more syscalls and a stronger use of resources (e.g. memory).
In the normal operating regime – following the statistical measurements provided by ASTS and reported in Table 3 – at most 1150 trains/day are in the area of interest. Assuming zero trains in the night hours (1-5am), this means a rate of 0.95 trains/min.

Table 3: ASTS Statistical Data

| | Naples | Genoa | Rome | Florence |
| Trains/day | 1150 | 650 | 1100 | 440 |
| Extension (Km) | 137 | 205 | 250 | 350 |

Therefore, the most appropriate choice for the accelerating test variable (or stress variable) is the rate of UMs sent by the IXL, containing at least 0.95 trains/min.
The selection of stress levels is also a key aspect, since using levels outside the TMS design limits may generate failures not present in the system's normal operation. In normal usage, the TMS receives 2 msg/s. Based on this, three levels of constant stress are applied to the system: S1 = 35 msg/s, S2 = 25 msg/s, S3 = 20 msg/s. In fact, it is observed that when the rate of UMs exceeds 20 msg/s, system resources start being considerably used. In particular, the CPU usage – measured in relation to the power of one core with the top tool – exceeds 600%, which means that more than 6 out of 8 CPU cores are used. At the same time, the overall system memory utilization goes beyond 50%. The identified rate at which the TMS seems to misbehave is 45 msg/s. For this reason, the stress levels chosen are well below the critical level.
In order to implement automatic tests and reduce testing time, a workload generator is needed. This is realized by using: i) the well-known TCPDump tool to capture TCP packets in the flow of messages sent by the IXL to the TMS within a specific time window T; ii) the TCPReplay tool to replay the recorded traffic against the TMS at the S1, S2, S3 rates defined before.

7.2. Results Analysis
7.2.1. MTTF Evaluation
Before showing the results obtained, it is important to recall that the MTTF analysis aims to observe node failures caused by aging- and non-aging-related faults. In particular, the analysis here focuses on both functional and non-functional failures, that is, those causing wrong responses from the SUT, and those causing a crash or hang of the node.
To automatically verify the functional correctness of TMS operations, an Oracle is used. As already said, the TMS receives UMs from the IXL and then provides the railway status to a Human Machine Interface (HMI). It is here that the functional correctness is checked, by verifying that the HMI has received Responses (R) from the TMS coherent with the IXL UMs. To this end – during normal execution – a dictionary of correct R in a one-to-one relation with the corresponding UM is filled by dumping the UMs and intercepting the correct R to the HMI. Then, during the tests, the output of the TMS application is compared with the specific entry in the dictionary in order to establish the functionally correct behavior.
The verification of crash or hang failures, instead, is made through a log analysis conducted on system messages and on CRM logs. When the Active/Standby nodes do not reply to the heartbeat messages, the CRM reports the event in its logs and also records the switch operation enforced.
The TMS has been subjected to accelerated tests at the identified levels for a period T = 1080Hrs. At the end, no functional failure of the TMS monitoring application was detected. Conversely, a total of five hang failures were found: two for S1, two for S3, and one for S2. Table 4 shows the estimates for the IPL-Weibull model parameters obtained through a Maximum-Likelihood (ML) estimation using the failure data collected.

Table 4: IPL-Weibull Parameters

| Parameter | ML Estimate | CI-Lower | CI-Upper |
| k | 5.8626E-7 | 1.0833E-7 | 3.7533E-6 |
| w | 2.496 | 1.9776 | 3.0143 |
| β | 5.3171 | 2.8095 | 10.8062 |

Using those parameters, the SUT mean life in its use condition (2 msg/s) is estimated through the IPL-Weibull. Figure 7a reports the Life-Stress graph obtained. The green line is generated by linearizing the IPL-Weibull, and the blue lines represent its confidence intervals. The point at which 2 msg/s, on the x-axis, intersects the green line is the node MTTF estimated for the normal condition level.
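Plugging the ML estimates of Table 4 into MTTF(s) = Γ(1/β + 1)/(k·s^w) gives the point estimate of the node mean life at the normal message rate (a quick sketch; as discussed next, the paper conservatively uses the lower bound of the 90% confidence interval rather than this point estimate):

```python
from math import gamma

k, w, beta = 5.8626e-7, 2.496, 5.3171   # ML estimates from Table 4
s_use = 2.0                              # normal-use stress level: 2 msg/s

mttf_use = gamma(1.0 / beta + 1.0) / (k * s_use ** w)
print(f"MTTF point estimate at {s_use} msg/s ~ {mttf_use:.2e} h")
# ~2.8e5 h; the lower 90% confidence bound used in the paper is ~74,738 h,
# i.e. an empirical node failure rate of about 1.3e-5 per hour.
```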
S 1. On the contrary, using levels S 2-S 3, five aging trends have
been detected in the testing period T = 1080Hrs. Using the ob-
tained degradation data in input to the ADT it is possible to esti-
mate the TTARF. The values for the IPL-Weibull model param-
eters realized through the ML estimation are: k = 0.00005; w =
1.1916; β = 9.2835.
Using those estimates, the TMS time to aging in its use con-
dition*) (2msg/s) is evaluated. Figure 7b, reports then the Life-
(a) (b) Stress graph useful to evaluate the TTARF. The ADT says that
the TMS server node may encounter an aging-related failure af-
Figure 7: IPL-Weibull Life-Stress Graphs ter 8170Hrs of execution. Since the lower bound of the 90%-
Confidence interval is 1710Hrs, the period for the rejuvenation
action to be enforced is set to 1500Hrs which is 3× bigger than
the normal condition level. The node failure rate can be now the one estimated during the modelling phase.
produced using the MTTF lower bound of the 90%-Confidence This empirical rejuvenation period can be now used in conjunc-
interval as this seems the most conservative choice. In that tion with the empirical node failure rate into the PRISM cluster
case, the result obtained is λl90%
Node = 1/MT T F = 1/74738Hrs = model. Results reveal that the TMS system reaches a THR value
1.3 × 10−5 . that is highly SIL2-compliant:
The empirical failure rate estimate of the TMS server node is
better than the one obtained during the modelling phase (λNode = T HRT MS = 1.45 × 10−7 < T HRS IL2 (2)
6.7 × 10−5 ). Although this might seem strange, it was quite ex-
pected as during the modelling phase pessimistic choices were 8. Conclusions
made in order to be as conservative as possible. The empirical
node failure rate, given in input to the cluster model (sec. 5.3) – This paper presented a SIL2 assessment strategy conducted
with no rejuvenation enforced – yields: T HRTMS = 8.73 × 10−7 . on a commercial SRS. Differently from most of the existing
Such a value is below SIL2 upper bound. However, this out- literature (which does not refer to real industry applications),
come lies on the edge of the boundary and therefore needs to we contribute a methodological framework that can be used in
be further improved to obtain a more robust evaluation. Hence, practice as a reference for certifying a wide class of emerging
the rejuvenation strategy is leveraged and the period of rejuve- critical systems, virtually any system for which: i) the general
nation is empirically estimated. architecture has already been designed, ii) business constraints
During the log analysis, the MC has been inspected every 60 Hrs. No degradation has been encountered for stress level S1. On the contrary, using levels S2-S3, five aging trends have been detected in the testing period T = 1080 Hrs. Using the obtained degradation data as input to the ADT, it is possible to estimate the TTARF. The values of the IPL-Weibull model parameters obtained through the ML estimation are: k = 0.00005; w = 1.1916; β = 9.2835.

Using those estimates, the TMS time to aging in its use condition (2 msg/s) is evaluated. Figure 7b reports the Life-Stress graph used to evaluate the TTARF. The ADT indicates that the TMS server node may encounter an aging-related failure after 8170 Hrs of execution. Since the lower bound of the 90% confidence interval is 1710 Hrs, the period for the rejuvenation action to be enforced is set to 1500 Hrs, which is 3× larger than the one estimated during the modelling phase.
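The extrapolation from the accelerated stress levels down to the 2 msg/s use condition can be sketched as follows. The snippet assumes the standard IPL-Weibull parametrisation, with characteristic life η(V) = 1/(k·V^w) and shape β, so its output is only indicative: the TTARF of 8170 Hrs and the 1710 Hrs confidence bound reported above come from the full ML analysis, not from this simplified calculation.

```python
import math

# IPL-Weibull parameters from the ADT maximum-likelihood estimation
K, W, BETA = 0.00005, 1.1916, 9.2835
USE_STRESS = 2.0             # msg/s, TMS use condition
REJUVENATION_PERIOD = 1500   # hours, chosen below the 90% lower bound of 1710 h

def characteristic_life(stress: float) -> float:
    """Weibull scale parameter at a given stress level (standard IPL life-stress form)."""
    return 1.0 / (K * stress ** W)

def weibull_cdf(t: float, scale: float, shape: float) -> float:
    """Probability of an aging-related failure within t hours."""
    return 1.0 - math.exp(-((t / scale) ** shape))

eta_use = characteristic_life(USE_STRESS)
p_fail = weibull_cdf(REJUVENATION_PERIOD, eta_use, BETA)
print(f"characteristic life at {USE_STRESS:.0f} msg/s: {eta_use:.0f} h")
print(f"P(aging failure before rejuvenation at {REJUVENATION_PERIOD} h): {p_fail:.1e}")
```

Under these assumptions the characteristic life comes out in the same range as the reported TTARF, and the probability of hitting an aging-related failure before the 1500 Hrs rejuvenation point is negligible, which is consistent with setting the period below the confidence lower bound.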
This empirical rejuvenation period can now be used, in conjunction with the empirical node failure rate, as input to the PRISM cluster model. The results reveal that the TMS system reaches a THR value that is comfortably SIL2-compliant:

THR_TMS = 1.45 × 10^−7 < THR_SIL2        (2)
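For readers without PRISM at hand, the following deliberately simplified sketch shows how empirical rates of this kind feed a continuous-time Markov chain of an Active/Standby pair. The three-state structure, the repair rate (an arbitrary placeholder MTTR of 8 hours), and the omission of rejuvenation and coverage factors are all assumptions made for illustration; the snippet does not reproduce the THR figures obtained from the actual PRISM cluster model of Sec. 5.3.

```python
import numpy as np

LAMBDA = 1.3e-5   # empirical node failure rate, per hour
MU = 1.0 / 8.0    # repair rate; the 8 h MTTR is a placeholder, not a value from the paper

# Toy CTMC states: 0 = both nodes up, 1 = one node up, 2 = service down (hazardous state)
Q = np.array([
    [-2 * LAMBDA,      2 * LAMBDA,     0.0   ],
    [        MU, -(MU + LAMBDA),     LAMBDA  ],
    [       0.0,             MU,        -MU  ],
])

# Steady-state distribution pi: solve pi @ Q = 0 together with sum(pi) = 1
A = np.vstack([Q.T, np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

# The frequency of entering the hazardous state approximates a hazard rate
print(f"hazardous-state entry rate: {pi[1] * LAMBDA:.2e} per hour")
```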
8. Conclusions

This paper presented a SIL2 assessment strategy conducted on a commercial SRS. Unlike most of the existing literature, which does not refer to real industrial applications, we contribute a methodological framework that can be used in practice as a reference for certifying a wide class of emerging critical systems: virtually any system for which i) the general architecture has already been designed, ii) business constraints impose that (radical) changes to the architecture be avoided, and iii) the main COTS components to be integrated have already been chosen. We demonstrate, with respect to a real industrial system (namely, the Train Management System by Hitachi Ansaldo STS), that it is possible to achieve SIL2 via proper configuration of system parameters and set-up of rejuvenation procedures. The system is designed as an Active/Standby cluster based on COTS components. A hybrid approach, i.e., one based on a combination of modelling and experiments, was taken to estimate the reliability figures of interest. In a first phase, by means of formal model checking, it was proven that the original configuration of the cluster, used on-field by Ansaldo, was not compliant with SIL2 requirements. Then, based on the results of the modelling analysis, a set of mitigation strategies was proposed to improve the safety of the system and ultimately make it SIL2-compliant. The strategies mainly rely on software rejuvenation, which was proven to provide the best trade-off between cost and reliability improvement. Finally, the effectiveness of the proposed rejuvenation strategy, as well as of the other modelling predictions, was validated by means of an experimental campaign performed on a real ASTS test-bed.

References

[1] Simon Connelly, Holger Becht. "Developing a Methodology for the Use of COTS Operating Systems with Safety-Related Software"; Queensland: Australian System Safety Conference, 2011.
[2] Skramstad, Torbjorn. "Assessment of Safety Critical Systems with COTS Software and Software of Uncertain Pedigree"; Trondheim.
[3] Pierce, R.H. "Preliminary Assessment of Linux for Safety Related Systems"; Health & Safety Executive; 2002.
[4] Kiejin Park, Sungsoo Kim. "Availability Analysis and Improvement of Active/Standby Cluster Systems Using Software Rejuvenation"; Journal of Systems and Software, Volume 61, Issue 2, 15 March 2002, Pages 121-128, ISSN 0164-1212, http://dx.doi.org/10.1016/S0164-1212(01)00107-8.
[5] C. Jones, R.E. Bloomfield, P.K.D. Froome, P.G. Bishop. "Methods for Assessing the Safety Integrity of Safety-Related Software of Uncertain Pedigree (SOUP)"; Report No. CRR337, HSE Books, 2001.
[6] W.E. Smith, K.S. Trivedi, L.A. Tomek, J. Ackaret. "Availability Analysis of Blade Server Systems"; IBM Systems Journal, 2008.
[7] Bianca Schroeder, Garth A. Gibson. "Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?"; In Proceedings of the 5th USENIX Conference on File and Storage Technologies.
[8] Bianca Schroeder, Eduardo Pinheiro, Wolf-Dietrich Weber. "DRAM Errors in the Wild: A Large-Scale Field Study"; Seattle: SIGMETRICS/Performance, 2009.
[9] Edmund B. Nightingale, John R. Douceur, Vince Orgovan. "Cycles, Cells and Platters: An Empirical Analysis of Hardware Failures on a Million Consumer PCs"; EuroSys, 2011.
[10] Hoang Pham. "Handbook of Reliability Engineering"; Springer, 2003; doi: 10.1007/b97414.
[11] Matt Bach. "Rates of Retired Components"; Puget Systems. http://www.hardware.fr/articles/944-1/taux-retour-composants-13.html
[12] Luca Montanari. "Online Failure Prediction in Air Traffic Control Systems"; https://www.dis.uniroma1.it/~dottoratoii/media/students/documents/tesi_143.pdf
[13] Vilas Sridharan, Dean Liberty. "A Study of DRAM Failures in the Field"; In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '12), IEEE Computer Society Press, Los Alamitos, CA, USA, 2012, Article 76, 11 pages.
[14] Xiaohang Jin, E.W.M. Ma, T.W.S. Chow, M. Pecht. "An Investigation into Fan Reliability"; Prognostics and System Health Management (PHM), 2012 IEEE Conference on, Beijing, 2012, pp. 1-7. doi: 10.1109/PHM.2012.6228836.
[15] "ITIC Global Server Hardware, Server OS Reliability Report"; 2014.
[16] M. Grottke, A. Nikora, K. Trivedi. "An Empirical Investigation of Fault Types in Space Mission System Software"; in Proc. IEEE/IFIP Conf. on Dependable Systems and Networks, 2010, pp. 447-456.
[17] Veena B. Mendiratta. "Reliability Analysis of Clustered Computing Systems"; Software Reliability Engineering, 1998, Proceedings of the Ninth International Symposium on.
[18] Kwiatkowska M., Norman G., Parker D. "PRISM: Probabilistic Symbolic Model Checker"; In: Field T., Harrison P.G., Bradley J., Harder U. (eds) Computer Performance Evaluation: Modelling Techniques and Tools, TOOLS 2002, Lecture Notes in Computer Science, vol 2324, 2002.
[19] B. Eckart, X. Chen, X. He, S.L. Scott. "Failure Prediction Models for Proactive Fault Tolerance within Storage Systems"; 2008 IEEE International Symposium on Modeling, Analysis and Simulation of Computers and Telecommunication Systems, Baltimore, MD, 2008, pp. 1-8. doi: 10.1109/MASCOT.2008.4770560.
[20] R. Matias Jr., K.S. Trivedi, P.R.M. Maciel. "Using Accelerated Life Tests to Estimate Time to Software Aging Failure"; 2010 IEEE 21st International Symposium on Software Reliability Engineering, San Jose, CA, 2010, pp. 211-219. doi: 10.1109/ISSRE.2010.42.
[21] Domenico Cotroneo, Roberto Natella, Roberto Pietrantuono, Stefano Russo. "A Survey of Software Aging and Rejuvenation Studies"; J. Emerg. Technol. Comput. Syst. 10, 1, Article 8 (January 2014), 34 pages. doi: http://dx.doi.org/10.1145/2539117.
[22] R. Matias, P.A. Barbetta, K.S. Trivedi, P.J.F. Filho. "Accelerated Degradation Tests Applied to Software Aging Experiments"; IEEE Transactions on Reliability, vol. 59, no. 1, pp. 102-114, March 2010. doi: 10.1109/TR.2009.2034292.

Appendix A. Acronyms

ADT Accelerated Degradation Test
ASTS Ansaldo STS
COTS Commercial-Off-The-Shelf
CRM Cluster Resource Manager
CSL Continuous Stochastic Logic
CTMC Continuous-Time Markov Chains
DOE Design Of Experiments
HMI Human Machine Interface
ICS Industrial Control System
IPL Inverse Power Law
IXL Interlocking
MC Memory Consumption
ML Maximum-Likelihood
MTBF Mean Time Between Failures
MTTF Mean Time To Fail
MTTR Mean Time To Repair
NIC Network Interface Card
QALT Quantitative Accelerated Life Tests
RTCS Railway Traffic Control System
SCS Safety Critical Systems
SIF Safety Instrumented Function
SIL Safety Integrity Level
SPF Single Point of Failure
SRS Safety Related Systems
THR Tolerable Hazard Rate
TMS Train Management System
TTARF Time To Aging Related Failures
UM Updates Messages