
Insights into the Sensitivity of the BRAIN (Braided Ring Availability Integrity

Network)—On Platform Robustness in Extended Operation

Michael Paulitsch, Brendan Hall


Honeywell Aerospace
michael.paulitsch@honeywell.com; brendan.hall@honeywell.com

Abstract

Low-cost fault-tolerant systems design presents a continual trade-off between improving fault-tolerant properties and accommodating cost constraints. With limited hardware options and to justify the system design rationale, it is necessary to formulate a fault hypothesis to bound failure assumptions. The system must be built on a foundation of real-world relevance and the assumption of coverage of the fault hypothesis.

This paper discusses a study that examines the sensitivity of a BRAIN (braided ring availability integrity network) design to different fault types and failure rates in a safety-relevant application. It presents a Markov-based model (using ASSIST, SURE, and STEM analysis tools) and a series of experiments that were run to analyze the overall dependability of the BRAIN approach. The study evaluates the mission reliability and safety in the context of a hypothetical automotive integrated x-by-wire architecture on top of the BRAIN. Drawing from experience in the aerospace domain, the authors investigate the possibility of continued operation for a limited period after a detected critical electronic failure. Continued operation would allow a driver to reach repair facilities rather than stopping the vehicle to call for roadside assistance or “limping home.”

1. Introduction

Commercial cost pressure and the maintenance and repair capabilities at remote airports have made extended operation despite faults and aircraft dispatch with faults common in the aerospace domain. Extended-range Twin-engine Operation Performance Standards (ETOPS) is an example of this practice that regulates twin-engine airplane operations and defines clear limits on operational duration in the event of engine failure. Similarly, minimum equipment lists (MEL) in aircraft provide clear guidelines to pilots when a dispatch, in spite of a fault in a subsystem, is permitted. MELs are established through detailed analysis of the safety effect on the aircraft. For example, ARP5107 provides specific guidelines for engine electronics [31].

This paper extrapolates the concept of extended operation despite faults to a hypothetical electronic x-by-wire platform in cars [1]. Continued operation for longer periods is not necessarily required for the mission success (such as safe landing of aircraft), but allows a level of comfort or commercial benefits, for example, the ability to drive the car home or to the garage in case of a failure to avoid towing. For safety-relevant systems, such as fail-operational by-wire systems, the safety implications of continued operation despite faults require careful consideration.

Generally, safety and reliability analyses of systems have been published and mature industry practices for conducting safety case assessment (ARP4754 [9], ARP4761 [35]) are in place. Often, the safety aspect is seen in the context of available hardware or effects of certain external failure modes [36]. This paper acknowledges the value of these analyses, but extends the safety analysis in several dimensions: 1) in addition to available hardware, the effect of integrity faults is considered; 2) effects of platform algorithms are considered; 3) systems purely evaluating reliability do not consider periods of extended operation until repair.

We wished to examine extended operation because repair may not be immediately available, and the system needs to be operational for some time. During this interval, the system may be especially vulnerable to additional faults, both hardware exhaustion and integrity violation. A good example in automotive systems is the “limp home mode,” where the car operates in a degraded mode, but is still operational [14]. While the effects of additional failures may be reduced, safety-relevant functionality and associated guarantees need to be maintained.

Our goal in this paper is to look at the reliability from a platform perspective. System safety can be truly contemplated only within a systems context, including application, hazards, environmental factors, user influence, etc. However, since a platform builds the foundation of a system, the strength and dependability vulnerabilities of the platform directly impact hosted applications.

We believe that our hypothesis is especially interesting to the automotive domain because efforts are underway to apply safety standards from other domains, such as IEC61508 [7], ARP 4754 [9], to the automotive domain and/or to create special platform and automotive

37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07)
0-7695-2855-4/07 $20.00 © 2007
safety standards, such as AUTOSAR [34], ISO26262 [10].

We do recognize that safety stretches far beyond availability, integrity, and reliability numbers and refer readers to known literature such as [8]. A more general discussion on dependability and terms used in this paper can be found in [6]. The actual numbers are of less consequence; our intent is to examine the relative strength of different architectural policies in relation to integrity and availability guarantees.

2. Other Related Work

Hammett and Babcock evaluated redundancy schemes for by-wire systems [11]. Wilwert et al. quantify external electromagnetic interference (EMI) [3] in by-wire architectures. Wilwert et al. provide a good overview of x-by-wire systems in [1]. Navet et al. present an overview of automotive communication systems, the basis of a by-wire platform, in [2]. Bertoluzzo et al. look at by-wire applications and networks in [23]. Latronico and Koopman look at automotive communication protocol algorithms deployed under hybrid fault scenarios [12]. Bridal performs reliability estimates for repairable fault-tolerant systems and network topologies [15][16].

There is a multitude of reliability evaluation tools available. A good tools overview is given in [4]. A recent integrated tool approach is Möbius [5]. Model evaluations in this paper use ASSIST/SURE/STEM [13] because of its fast model evaluation abilities.

3. Overview of BRAIN

The BRAIN (braided ring availability integrity network) is an alternative topology and guardian (fault-containment strategy) for ensuring high-integrity data propagation. A braided ring augments the standard ring topology with increased connectivity. In addition to neighboring connections, a node is also connected to its neighbor’s neighbor via a link called the braid or skip link (see Figure 1).

The BRAIN is a flooding network that minimizes the propagation delay of rings. Each node propagates a message in real time, leading to only a few bits delay for each hop. As described in detail in [17][18], each node is monitored for correct data propagation by the next node downstream through bit-for-bit comparison between the data received on the direct and the skip link. Data corruption is signaled to nodes downstream with special integrity fields in the dataflow or indicated via truncation (namely by stop-forwarding the message). The action depends on the configuration of the ring (full-duplex or half-duplex links). Because data flows in two directions, each node receives correct data despite any arbitrary failure. To tolerate multiple faults, each end node compares data received from two directions, and accepts it if it is bit-for-bit identical but not signaled with high data propagation integrity. This comparison makes the system tolerant to multiple benign faults with high integrity.

Figure 1: Braided-ring communication topology

Each node performs guardian enforcement for its topological neighbors based on a synchronized global time and TDMA (Time Division Multiple Access) schedule information to ensure medium availability.

The propagation comparison of each node can be leveraged to support high-integrity sources, namely message-based self-checking pairs (see [18] for details). For this, two neighboring nodes of the BRAIN form a pair and send their version of the message at the same point in time. The propagation logic performs the comparison as for any other message, and each receiving node receives a high-integrity message if both sources (the pair) have sent the same data.

Initial versions of the ring have used this mechanism only for application data. We have extended this idea to protocol-level mechanisms. Each startup and synchronization message is sent by a pair of nodes. The resulting extension simplifies the synchronization algorithm greatly because distributed algorithms reduce to multi-master-based algorithms with trustable sources. In addition, the precision of the BRAIN is improved due to more frequent synchronization using the same amount of bandwidth and the removal of the Byzantine error term [19] normally present in clock synchronization algorithms. The reader is referred to [20] for insights on the cause of the error term removal. SAFEbus/ARINC659 uses a similar algorithmic approach. Self-checking-pair-based synchronization also allows a very scalable tolerance to multiple fault scenarios by simply adding master pairs.

This paper focuses on permanent faults and their impact on BRAIN dependability claims. We understand the requirement for tolerating multiple transient faults or high intensity radiated field (HIRF) effects. Next to shielding approaches, quick restart after communication loss (about two communication rounds in the BRAIN for self-checking pair startup and integration approach) is the first algorithmic defense. Secondly, the self-stabilizing clique aggregation algorithm [18] quickly converges if
cliques (subgroups of synchronized end systems) emerge due to multiple transients or other faults.

4. Model Description

To produce reliability approximations efficiently using existing tools, we must make assumptions about the models and their parameters: constant failure rates of components, constant repair rates, and neglect of certain types of failure causes. In addition, reduction of less likely scenarios (such as multiple failure scenarios) dramatically reduces the model state space and the model’s solution time for given parameters. Some assumptions may need to be revisited, depending on the overall deployment of the by-wire architecture and in light of recent development of electronics (e.g., aging silicon) [21]. Also, model parameters must be refined to reflect precise components and component reliability numbers and algorithmic or platform configuration specifics; e.g., the reliability of the communication component depends highly on whether it is integrated with a complex, high-power compute core or deployed as standalone. Nevertheless, despite some approximations, the given models should show the sensitivity of the models to the parameters and produce an interesting result giving insights into:
• sensitivity of platform reliability and impact on safety to extended operation after a failure,
• sensitivity of the platform to reliability results,
• algorithmic and configuration impact, and
• impact of integrity detection mechanisms; the full coverage approach of the BRAIN compared to inline integrity approaches used in alternative architectures, e.g., dual star topologies with redundant active central guardians [24][25].

The model is representative of the underlying platform, but does not include application-specific assignments of functions. Such assignments are important for the final safety assessment; however, the presented model and parameters evaluate the foundation and present insights about whether the foundation is strong enough for application assignments.

The model evaluates the reliability of the platform. Assessing correct operation of a platform means looking at availability and integrity and their implications for safety. From a platform perspective, safety is affected when either an integrity violation or redundancy exhaustion has occurred. Integrity violation means that the data has been corrupted during transmission or in computation. If voting is applied, such data corruptions could be “voted out” at the application. If high-integrity computing and communication end systems are used, data can be used immediately without voting. Thus, any corruption of data by the platform may have severe consequences. High-integrity compute and communication platforms (such as AIMS, aircraft information management systems, in Boeing 777 and its communication backbone SAFEbus [20]) are very enabling because they reduce the overhead of traditional fault tolerance schemes from voting in software to selecting the first redundant data copy. In addition, the partitioning and fault detection guarantees of such approaches are excellent and application software is independent from platform software.

Seeing the advantage of high-integrity compute for the application software development (as the automotive domain does with the use of fail-silent ECUs), the model assumes that any single integrity violation by the platform may have safety implications, because a computed and distributed data value may be used by multiple, possibly replicated, actuators.

Looking at safety from an availability perspective, the model is concerned about redundancy exhaustion leading to isolation of components and loss of communication, e.g., due to component or link loss. If communication between distributed components cannot be guaranteed anymore, we assume safety is affected. A detailed description of the model and what leads to loss of communication (availability) or decreased integrity is given below.

4.1. Model Parameters

A by-wire architecture typically consists of several components connected to the network. The number of components depends on the detailed architectural approach. We expect 8 to 12 nodes directly connected to the network to support connections to the units supporting transmission, engine, distributed actuators and sensors, and control computers [1][11][23][29].

Commercial transport airplanes are operational for missions averaging 3-10 hours, with major checks every couple hundred hours with different levels of overhauls. In the automotive space, a car is driven on average for 4000 hours [28]. Guaranteeing safety over such long mission periods without checks may not be economically viable. We also expect that the car electronics may not be equipped or may not be able to perform the necessary scrubbing activities to detect latent faults. Thus, we assume that at every major service (similar to service procedures in the aerospace domain, and as is more common because of increased levels of diagnosis [28]) the car’s by-wire electronics and wire loom will be checked for correctness and latent faults. Major service intervals where all latent errors can be detected are assumed to be in the order of 150 hours. This equates to about 5250 miles (8450 km) at 35 mph (56 km/h). Such service intervals are currently not mandated in the automotive industry in most countries, though they are recommended during the vehicle warranty period. The results of this paper could be used to consider impacts of service intervals on safety. Error detection coverage at service is assumed to be largely perfect as manufacturing-level testing can be deployed for critical circuits. The service is expected to scrub all essential FDIR (fault
detection, isolation, and recovery) circuits using scan chains [32], similar to production-level tests.

Component failures are detected with coverage that is based on the underlying network architecture, as described below. Once detected, the operation of the by-wire system will be extended for a certain interval. The average transit time by car from home to work is ½ hour according to U.S. census data [37]. It is assumed that an extension period of 1 hour should be sufficient to return home and/or drive to a repair facility in case of a detected error. The models will contrast the approach by running models with near immediate repair (1 minute), no repair (modeled as a very low repair rate of 10^-10) and some repair intervals around one hour. For extended operation reference, Boeing 777’s ETOPS rating allows the airplane to fly for up to 180 minutes on a single engine. The maximum operating time for engine electronics is 125 hours if the time of fault occurrence is known [31].

Component failure rate parameter ranges were chosen based on our experience with similar technologies, and reliability models were determined by CALCE [27] and according to MIL-HDBK-217 [26] (note that MIL-HDBK-217 is no longer maintained, but is a good public source for reliability data). Connector failure rates depend on the chosen connector type. MIL-HDBK-217 is used to determine connector reliability ranges. Connectors may also not perfectly fit the exponential distribution (i.e., constant failure rate), but the model and the requirement to quickly solve the model for many different model parameters force us to make this assumption. We assume link failure rates are dominated by connector failure rates.

We use a hybrid component fault model where the fail-stop failure rates are assumed to be higher than the arbitrary node fault rates, as a node is likely to be part of an ECU (electronics control unit) or an LRU (line replaceable unit). The reliability of such parts (LRU or ECU) is often driven by the power supply unit and a significant number of supply components, leading to a low MTBF but with a benign system behavior like fail-stop. The fail-arbitrary behavior is driven by the communication chip that is performing forwarding, checking, protocol activities, etc. This behavior is probably unboundable; thus, the arbitrary behavior in a failure case. The failure rates of those components are assumed to be similar to the reliability numbers of the communication chip (or the chip of which the communication chip is a part).

Table 1 is an overview of the parameter values used for the BRAIN and dual-star model and the values that we expect to be most likely (most representative value, MRV in the table). These most representative values are used when several other parameters are varied to show the sensitivity to variation.

Table 1: Overview of parameters used in models

Parameter                                            Values                                                                          MRV
Number of active nodes                               7, 8, 9, 10, 11, 12, 13, 14                                                     10
Mission interval [hours]                             ½, 1, 20, 50, 100, 120, 150, 200, 250, 3k, 4k, 5k, 6k, 7k                       150
Link failure rate [failures/hour]                    5x10^-7, 10^-6, 2x10^-6, 5x10^-6                                                10^-6
Component failure rate (fail-stop) [failures/hour]   5x10^-6, 10^-5, 2x10^-5                                                         10^-5
Component failure rate (arbitrary) [failures/hour]   5x10^-8, 10^-7, 5x10^-7                                                         10^-7
Repair rate (extended operation time) [1/hours]      10^-10 (no repair), 0.1 (10 h), 0.5 (2 h), 1 (1 h), 60 (1 min.; near immediate repair)   1

4.2. Description of BRAIN Model

The BRAIN can guarantee platform availability and integrity, and thus safety, as long as there is either (1) one communication path with full propagation integrity from the sender to all receivers or (2) two paths from the sender to the receiver where the receiver can perform bit-for-bit comparison between the two paths. In case 1, a single arbitrarily faulty component can be tolerated on the BRAIN because each receiver has one path from the sender. Each node on the path is checking the direct and skip links bit-for-bit for agreement, then signaling the result at the end via an integrity signaling field [18]. The bit-for-bit comparison of skip and direct link prevents any arbitrarily faulty node from corrupting data during propagation without being detected. Similarly, for single link or other benign component faults, the data will also reach each node on the ring with full integrity. In case 2, for multiple benign faults (fail-stop), all receiving nodes detect the multiple fault scenario because the integrity field at the end of the data indicates loss of integrity from both directions on the ring. Once detected, the receiver can perform bit-for-bit comparison of the two copies received from each direction and still assure full integrity of the data.

Medium availability is enforced by the guardian mechanisms, which are performed by each node for its two direct neighbors. Synchronization is guaranteed as long as enough self-checking pairs can send synchronization messages.

The SURE model defines state spaces for the fail-silent and arbitrary component failures, link failures, and self-checking pair failures (i.e., the link between the pairs or at least one of the two nodes has failed). Link failures are assumed to be benign (e.g., a link is broken or not, but does not “corrupt” data integrity). A special state to model the loss of connectivity in one ring direction is also modeled. Transitions between states are guided by failure or repair rates. Details of the state space and its
transitions are not in the scope of this paper, but we modeled the above described behaviors.

ASSIST/SURE/STEM requires definition of “death states,” which define when the integrity and availability of the communication can no longer be guaranteed due to faults. These states are at a higher level:
• More than one arbitrary component fault present,
• An arbitrary fault with any other fault combination (link or benign component fault),
• The connectivity from a sender to each receiver is less than two paths (that is, either two fail-stop component faults or multiple link failures have occurred such that connectivity between sender and receivers via two paths is broken), or
• All self-checking pairs failed and protocol execution is compromised (due to link failure between self-checking pairs or component failure).

As the BRAIN performs bit-for-bit comparisons at each node, any error is immediately detected. From the detection of an error, extended operation is allowed until the faulty sub-component is repaired. The time to repair is a parameter, also referred to as the time for extended operation.

In addition to the model parameters in Table 1, the BRAIN-specific model parameter (Table 2) is the number of self-checking pairs needed for protocol operation, such as clock synchronization, startup, and integration.

Table 2: Overview of BRAIN-specific parameters

Parameter                        Values        MRV
Number of self-checking pairs    1, 2, 3, 4    3

4.3. Description of Dual-Star Model

To compare the BRAIN to alternative architectures, we evaluated a commonly used architectural alternative, a dual-star model. Given the cost constraints, a dual-star model seems to be the best of the alternatives of ring/star/bus dual replicated architectures for the following reasons. Pure bus-based architectures suffer from spatial proximity faults and are likely excluded for by-wire architectures. Ring architectures (without skip links) have low reliability due to the missing path to circumvent (benign) faulty nodes [33] and masquerading faults for forwarded data. Combinations such as ring/star architectures (e.g., wagon wheel architecture) can be powerful as they remove some possibility of masquerading faults, but they can introduce reliability loss due to serialization of one communication path.

In the star model, benign and arbitrary component faults and link failures are modeled. Given the protocol dependencies of solutions on the market [24][25], solutions are thought to be single-fault tolerant to arbitrary failures from a protocol perspective.

For a dual star model, evaluation of inline integrity approaches and their effect on safety is especially hard to quantify, as it is very hard to evaluate the effects of an undetected error on the application. For example, what are the effects of an erroneous guardian (star) on data integrity? The model assumes that any undetected error may have a safety impact. As the two communication paths in a dual-star network architecture are used for availability with inline integrity protection, any arbitrarily faulty star may have an effect.

If a star is arbitrarily faulty, the faulty star is detected with a near 100% probability due to the integrity check (e.g., a CRC). Yet, despite the high probability of detecting the star error, the probability of undetected errors and, thus, data integrity violations remains. This probability of undetected errors depends on the strength of the inline integrity mechanism deployed to cover failures of the intermediate device. Currently deployed dual-star networks [24][25] use a 24-bit CRC for error detection. Assuming a uniform failure distribution for failures of the central guardian device affecting a frame, the probability of an undetected integrity failure of a frame is 2^-24 (about 5.96x10^-8).

At 5 Mbit/s, it takes 100 µs to send a frame with an average frame length of 500 bits. If the network is 50% loaded, then 1.8x10^7 frames would be sent per hour. This rate would lead to about 1 (= 2^-24 x 1.8x10^7) undetected erroneous frame per hour once a star is faulty.

Internal Honeywell explorations of the CRC32 polynomial used for Ethernet indicated that the probability of undetected errors is increased from 2^-32 to 2^-28 for reasonable failure modes in intermediate relaying devices (such as switches or guardians) [30]. Such failure modes are characterized by the relaying device introducing systematic errors (such as a stuck-at-0 or stuck-at-1 bit every 32 bits of a frame). Such faults may be common for implementations deploying 32-bit-based computing architectures handling frames. In [22], Paulitsch et al. argue that the error detection coverage credit may be even less.

We recognize that the described effects are specific to a special CRC polynomial. But, in order to capture such weaknesses of inline integrity, this paper assumes that the undetected error rate of a 24-bit CRC is degraded by a factor of 10, resulting in a rate of 10 integrity violations per hour. Such integrity errors have platform safety implications. If the CRC size were increased to 32 bits, the rate would decrease to 0.07 (= 2^-28 x 1.8x10^7).

It should also be understood that the safety analysis covers only passive devices and propagation errors, as is the case in FlexRay and TTP/C central guardians. If such relaying devices were to perform active protocol activities, the safety effects may be more severe.

Given these arguments, the model introduces a transition probability for integrity errors once a central guardian is arbitrarily faulty. The death states for the star model are defined as:
• An integrity fault occurred after one star became arbitrarily faulty,
• More than one arbitrary end system component fault (not the star) occurred because protocol operation (clock synchronization, integration, or startup) cannot be guaranteed to work correctly anymore, or
• Two faulty stars are present.

Table 3: Overview of dual star-specific parameters

Parameter                                  Values
Rate of undetected errors [frame / hour]   10 (24 bit CRC), 0.07 (32 bit CRC), 0 (ideal reference model)

In addition to the model parameters of Table 1, the star-specific model parameter introduces a rate for the undetected integrity errors per hour (integrity violation) (see Table 3).

Error detection of faulty stars is assumed to be near perfect. The CRC checks will likely signal a faulty star propagation behavior most of the time, enabling near perfect detection for the indication of a failure condition to the driver. Again, please note that the model is only valid for propagation failures and architectures like [24][25]. Once the star needs to perform active protocol activities or stores whole frames, the model may need to be adapted.

5. Results and Discussion

This section gives the result of the sensitivity of the BRAIN to certain parameters and a comparison to a dual star architecture. Reliability is the continuity of correct service. Reliability results are traditionally given over a specific mission time. We have argued that the loss of correct operation (missing availability or integrity of the platform) has safety effects. Thus, for our purposes, safety and reliability are the same. Safety numbers are often expressed as probability of failing in an hour. We will present the reliability/safety probability numbers normalized to a per hour number; i.e., the reliability number is divided by the service interval (the mission time) when the by-wire architecture is assumed to be inspected for failures in detail (scrubbing of any faults, including latent faults). The service interval is 150 hours for all experiments except for the experiment examining the sensitivity to different service intervals. The normalization of the probability eases comparison with industry standards like IEC61508. A typical safety number is 10^-9 failures/hour for highly critical operation (in aerospace, this 10^-9 number is also applied to a mission, such as a flight of 4-10 hours, resulting in a lower per hour number). We assume 10^-9 failures/hour as the target for x-by-wire safety in this paper, inspired by [11]. It is important to note that such numbers must be evaluated in the context of accepted safety requirements, environmental factors, and other factors influencing safety. Aerospace has recognized similar tradeoffs for engine electronics and time-limited dispatch [31].

All experiments show the (normalized) reliability number for different extended operations (the time it takes to get to the repair facility). The one minute extended operation is assumed to be the (near) immediate repair, which is likely the time it takes for the driver to react to any indications of fault scenarios and to pull the car into a safe place to await towing or repair vehicles.

On the other side of the spectrum is “no repair” during service intervals, which means that the driver keeps driving despite failures.

The difference between the near immediate repair value and the extended operation value under consideration is the “cost” for the extended operation of the by-wire platform. It is the decreased safety due to prolonged operation despite subsystem failure. The increased comfort comes at a price. We will discuss targets for safety and the comfort tradeoff in the next sections.

Unless varied or mentioned otherwise, the values used to produce the reliability numbers are the representative values of Table 1 and Table 2, namely the number of active nodes is 10, the time between perfect detection of faults (service or mission interval) is 150 hours, the link failure rate is 10^-6, the arbitrary and fail-silent component failure rates are 10^-7 and 10^-5 respectively, and the extended operation time (repair rate) is 1 hour.

Note that the graphs include lines between the different values, although the x-axis results are not drawn proportionally to their value. This design makes it easier to identify different scenarios or parameters in the graph. The reader should not infer a direct trend from the lines. Also watch the logarithmic y-axis scales.

5.1. Comparison BRAIN versus Dual Star

Figure 2: BRAIN configurations. a) Full-Duplex BRAIN Configuration; b) Half-Duplex BRAIN Configuration

The BRAIN comes in two configurations having slightly different forwarding algorithms. The full-duplex (FD) BRAIN configuration deploys full-duplex links for direct and skip links, so nodes have dedicated point-to-point links in both directions. The other configuration is a half-duplex (HD) BRAIN where nodes are connected with one shared wire pair, and only one node of the two sharing a link can send at a time to avoid collisions.
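As a minimal sketch (not the authors' model or tooling), the braided-ring connectivity that both configurations share can be enumerated as follows. The sketch assumes nodes are numbered 0 to n-1 around the ring and treats every link as an undirected pair, regardless of whether it is realized as a full-duplex or half-duplex link. The helper names braided_ring_links and neighbors are hypothetical and introduced only for illustration.

```python
def braided_ring_links(n):
    """Return (direct, skip) sets of undirected links for an n-node braided ring.

    Assumption: nodes are numbered 0..n-1 around the ring. A direct link joins
    a node to its neighbor; a skip (braid) link joins it to its neighbor's
    neighbor.
    """
    if n < 5:
        # With fewer than 5 nodes, skip links coincide with each other or
        # with direct links, so the two sets would not be distinct.
        raise ValueError("need at least 5 nodes")
    direct = {frozenset((i, (i + 1) % n)) for i in range(n)}
    skip = {frozenset((i, (i + 2) % n)) for i in range(n)}
    return direct, skip


def neighbors(n, node):
    """Return the direct and skip neighbors of one node in an n-node ring."""
    return {
        "direct": sorted({(node - 1) % n, (node + 1) % n}),
        "skip": sorted({(node - 2) % n, (node + 2) % n}),
    }


if __name__ == "__main__":
    direct, skip = braided_ring_links(10)
    print(len(direct), "direct links,", len(skip), "skip links")
    print("node 0:", neighbors(10, 0))
```

For the 10-node ring used as the most representative value, this gives 10 direct and 10 skip links, and node 0 has direct neighbors 1 and 9 and skip neighbors 2 and 8. Wire and pin counts are deliberately not derived here, because they depend on physical-layer details that the text does not state.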

37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07)
0-7695-2855-4/07 $20.00 © 2007
The HD BRAIN is the preferred solution for automotive systems, as it has fewer pins (80 versus 120) and wires (60 versus 80) compared to a 10-node FD BRAIN. The configurations deploy slightly different protocol mechanisms, but essentially aim at the same goals: prevention of fault propagation and error detection coverage for propagation. The model of the FD BRAIN is nearly the same, except that more hardware (e.g. redundant links) allows tolerance of more faults, although the greater amount of hardware means that more can fail.

This section compares the extended operation capabilities for two BRAIN configurations and three dual-star configurations.

In the dual-star configurations, frames are protected with a 24-bit CRC for data integrity in Configuration 1 (called “Star (24 bits)”) and with a 32-bit CRC in Configuration 2 (called “Star (32 bits)”). In Configuration 3, the protection of inline integrity is assumed to be perfect (perfect isolation) and end systems are able to choose the correct data from the correct star, leading to no integrity violation and no safety implication in case of an arbitrarily faulty star. A self-checking guardian may be a real implementation of such a near-perfect guardian. Alternatively, diagnosis algorithms at the end systems may provide increased protection. The third configuration is supplied only for reference, to evaluate the impact of inline integrity (CRCs) on the reliability of the platform. Similar reliability numbers may be achievable for triplex stars without reliance on inline integrity if voting is deployed to mask a faulty star.

Figure 3 shows the results of comparing different architectures. As mentioned above, the 1-minute extended operation is probably the optimal safety number one can achieve. The numbers for the BRAIN are below the 10^-9 target for up to the 2-hour extended period. One hour was the proposed extended operation that would be needed to achieve the comfort of driving to the next garage or home. For the HD BRAIN, this “decreases” the safety number from 3.92x10^-12 (1-minute repair) to 2.34x10^-10 failures/hour, which still satisfies the 10^-9 failures/hour target. Thus, the increased comfort of continuous operation leads to a safety number that is still acceptable.

One might initially (but wrongly) conclude that the FD BRAIN would be more reliable because additional links support full-duplex operation, which should also make the FD BRAIN more robust to redundancy exhaustion. Yet, this is not the case; the HD BRAIN is actually slightly more robust than the FD BRAIN despite having less hardware. The benefit of the additional hardware of the FD BRAIN is offset by more parts failing.

Figure 3: Normalized reliability for BRAIN and dual-star networks
(y-axis: unreliability in failures/hour, logarithmic; x-axis: extended operation, i.e. time to repair)

Configuration               no repair   10 h       2 h        1 h        1 min
Star (24 bits)              2.14E-07    2.00E-07   1.91E-07   1.82E-07   2.86E-08
Star (32 bits)              1.96E-07    8.09E-08   2.47E-08   1.32E-08   2.36E-10
Star (perfect isolation)    1.47E-08    1.83E-09   3.87E-10   1.95E-10   3.27E-12
BRAIN FD                    2.01E-08    2.37E-09   4.97E-10   2.50E-10   4.18E-12
BRAIN HD                    1.84E-08    2.21E-09   4.65E-10   2.34E-10   3.92E-12

Overall, BRAIN is very strong compared to dual-star architectures. While the reference model with perfect inline error detection coverage (called “Star (perfect isolation)”) is at the same safety level as the BRAIN variants, indicating a correct model, the actual dual-star approaches have significantly worse reliability numbers because the imperfect inline integrity (CRC) error coverage results in some integrity violations per hour. The results show that dual stars with a 32-bit CRC can meet the 10^-9 target only for immediate repair, and the 24-bit version does not meet the 10^-9 target at all.

Given that the number of connectors and links for a dual star and the HD BRAIN is the same and no additional star component is needed, the BRAIN achieves a significant increase in system dependability.

5.2. Sensitivity to Component Failure Rate

5.2.1. Arbitrary Mode. Figure 4 presents the sensitivity of reliability numbers to arbitrary component failures. Such data can support decisions about whether to integrate communication functionality into single chips. According to reliability models of chips, the larger the die area, the more likely arbitrary failure modes are.

5.2.2. Fail-Silent Mode. Figure 5 depicts the sensitivity of the BRAIN to fail-silent component failures. With a low MTBF of 50000 hours (failure rate of 2x10^-5 failures/hour), the two-hour extended operation is very close to the safety target of 10^-9, probably too close once model inaccuracies are taken into account.

5.3. Sensitivity to Link Failure Rate

Figure 6 depicts the reliability for varying link failure rates. With an increasing link failure rate, the sensitivity of the reliability seems to increase. At a link failure rate of 5x10^-6 failures/hour and one hour of extended operation, the reliability is close to the 10^-9 safety target.
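The threshold comparisons made in Sections 5.1 to 5.3 amount to checking each normalized value against the 10^-9 failures/hour target. The following minimal sketch does this for the Figure 3 values; the data are copied from Figure 3, while the helper name and the "less than or equal" acceptance rule are assumptions made for illustration.

```python
TARGET = 1e-9  # failures/hour

# Columns of Figure 3, ordered from longest to shortest extended operation.
REPAIR_TIMES = ["no repair", "10 h", "2 h", "1 h", "1 min"]

# Normalized unreliability (failures/hour) as reported in Figure 3.
FIGURE_3 = {
    "Star (24 bits)": [2.14e-07, 2.00e-07, 1.91e-07, 1.82e-07, 2.86e-08],
    "Star (32 bits)": [1.96e-07, 8.09e-08, 2.47e-08, 1.32e-08, 2.36e-10],
    "Star (perfect isolation)": [1.47e-08, 1.83e-09, 3.87e-10, 1.95e-10, 3.27e-12],
    "BRAIN FD": [2.01e-08, 2.37e-09, 4.97e-10, 2.50e-10, 4.18e-12],
    "BRAIN HD": [1.84e-08, 2.21e-09, 4.65e-10, 2.34e-10, 3.92e-12],
}


def repair_times_meeting_target(values, target=TARGET):
    """Return the extended-operation periods whose value does not exceed the target."""
    return [t for t, v in zip(REPAIR_TIMES, values) if v <= target]


for name, values in FIGURE_3.items():
    ok = repair_times_meeting_target(values)
    print(f"{name}: {', '.join(ok) if ok else 'none'}")
```

Run on these values, the check reproduces the statements in the text: both BRAIN variants meet the target for extended operation of up to 2 hours, the 32-bit dual star only for the 1-minute case, and the 24-bit dual star not at all.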

Figure 4: Normalized reliability dependent on component failure rate (arbitrary mode)
(y-axis: unreliability (1/hour), logarithmic; x-axis: component failure rate, arbitrary mode (failures/hour))

Extended operation   5.00E-06   1.00E-05   2.00E-05
no repair            1.24E-08   2.01E-08   3.86E-08
10 h                 1.51E-09   2.37E-09   4.15E-09
2 h                  3.18E-10   4.97E-10   8.56E-10
1 h                  1.60E-10   2.50E-10   4.30E-10
1 min                2.68E-12   4.18E-12   7.18E-12

Figure 5: Normalized reliability dependent on component failure rate (fail-silent mode)
(y-axis: unreliability (1/hour), logarithmic; x-axis: component failure rate, fail-silent mode (failures/hour))

Extended operation   5.00E-06   1.00E-05   2.00E-05
no repair            1.24E-08   2.01E-08   3.86E-08
10 h                 1.51E-09   2.37E-09   4.15E-09
2 h                  3.18E-10   4.97E-10   8.56E-10
1 h                  1.60E-10   2.50E-10   4.30E-10
1 min                2.68E-12   4.18E-12   7.18E-12

Figure 6: Normalized reliability dependent on link failure rate
(y-axis: unreliability (1/hour), logarithmic; x-axis: link failure rate (failures/hour))

Extended operation   2.50E-07   5.00E-07   1.00E-06   2.50E-06   5.00E-06
no repair            1.45E-08   1.56E-08   1.84E-08   3.16E-08   7.01E-08
10 h                 1.77E-09   1.89E-09   2.21E-09   3.72E-09   8.14E-09
2 h                  3.74E-10   3.99E-10   4.65E-10   7.80E-10   1.70E-09
1 h                  1.88E-10   2.01E-10   2.34E-10   3.92E-10   8.55E-10
30 min               9.44E-11   1.01E-10   1.17E-10   1.97E-10   4.28E-10
1 min                3.16E-12   3.37E-12   3.92E-12   6.57E-12   1.43E-11

5.4. Sensitivity to Active Components

Figure 7 depicts the sensitivity of safety to the number of components. The BRAIN is a ring network and, thus, in addition to a larger number of components that can fail, ring serialization does have a slight impact. To a large extent, the skip links compensate for faulty components. Overall, the reliability decreases only slightly.

Figure 7: Normalized reliability dependent on number of components
(y-axis: unreliability (1/hour), logarithmic; x-axis: # of components)

Extended operation   7          8          9          10         11         12         13         14
no repair            9.07E-09   1.18E-08   1.49E-08   1.83E-08   2.22E-08   2.63E-08   3.09E-08   3.58E-08
10 h                 1.08E-09   1.41E-09   1.79E-09   2.21E-09   2.67E-09   3.19E-09   3.74E-09   4.34E-09
2 h                  2.26E-10   2.96E-10   3.76E-10   4.65E-10   5.63E-10   6.73E-10   7.87E-10   9.13E-10
1 h                  1.14E-10   1.49E-10   1.89E-10   2.34E-10   2.83E-10   3.37E-10   3.97E-10   4.60E-10
30 min               5.70E-11   7.47E-11   9.47E-11   1.17E-10   1.42E-10   1.69E-10   1.99E-10   2.31E-10
1 min                1.91E-12   2.50E-12   3.17E-12   3.92E-12   4.75E-12   5.66E-12   6.65E-12   7.73E-12

5.5. Sensitivity to Platform Algorithm Parameters

Figure 8: Normalized reliability dependent on number of self-checking pairs
(y-axis: unreliability (1/hour), logarithmic; x-axis: # of self-checking pairs)

Extended operation   1          2          3          4
no repair            2.40E-05   1.04E-07   1.84E-08   1.81E-08
10 h                 2.40E-05   1.29E-08   2.21E-09   2.20E-09
2 h                  2.40E-05   2.74E-09   4.65E-10   4.64E-10
1 h                  2.40E-05   1.38E-09   2.34E-10   2.34E-10
30 min               2.40E-05   6.91E-10   1.17E-10   1.17E-10
1 min                2.40E-05   2.31E-11   3.92E-12   3.92E-12

Protocol mechanisms such as self-checking clock synchronization, startup, and clique aggregation deployed on the ring are easily extensible, which proved critical for the reliability performance. Figure 8 indicates that with the deployment of only one self-checking pair, this pair dominates the safety impact. With an increasing number of self-checking pairs deployed, the reliability impact of a self-checking pair failure on the system diminishes. For three and four pairs, the numbers are largely the same. While two neighboring nodes can easily be paired to supply self-checking protocol functionality without creating a swamping effect, losing hardware because of failures equivalent to more than two self-checking pair failures is probably already a

death state for redundancy-constrained applications such as those in the automotive domain. Yet, it is interesting that one or two self-checking pairs are not enough for optimal platform dependability performance.

5.6. Sensitivity to Service Interval Time

This paper argues that removal of latent faults can only occur when a car is serviced, approximately every 150 hours. Alternatively, techniques could be deployed that perform sufficient testing more often, say after every trip (e.g. when the car electronics are powered down). Similar self-test techniques are deployed in aerospace systems and, assuming that independent devices self-test and have sufficient error detection coverage, could decrease the “vulnerability” window drastically. The low service interval numbers of 0.5 or 1 hours are meant to model such alternative approaches.

In the past, a car’s operational life was approximately 4000 hours [28]. Today, some manufacturers’ goals are even higher, approaching 6000 hours. The long service intervals are meant to show the effect if no scrubbing (latent fault detection) is done during the vehicle lifetime. Similarly, such long service intervals may address some effects of silicon wear-out [21], as the failure rate may no longer be assumed to be constant. With the loss of constant failure rates, the “memory-less” properties of failure rates vanish and the actual age of the electronics comes into consideration.

Figure 9: Normalized reliability dependent on service interval
(y-axis: unreliability (1/hour), logarithmic; x-axis: service interval in hours)

Extended operation   0.5    1      20     50     100    120    150    200    250    3000   4000   5000   6000   7000
no repair            5.9E-  1.2E-  2.4E-  6.0E-  1.2E-  1.5E-  1.8E-  2.5E-  3.2E-  1.1E-  1.9E-  2.9E-  4.1E-  5.4E-
10 h                 5.8E-  1.1E-  1.3E-  1.9E-  2.1E-  2.2E-  2.2E-  2.2E-  2.3E-  2.4E-  2.4E-  2.4E-  2.4E-  2.4E-
2 h                  5.4E-  1.0E-  4.2E-  4.5E-  4.6E-  4.6E-  4.6E-  4.7E-  4.7E-  4.7E-  4.7E-  4.7E-  4.7E-  4.7E-
1 h                  5.0E-  8.7E-  2.2E-  2.3E-  2.3E-  2.3E-  2.3E-  2.3E-  2.3E-  2.4E-  2.4E-  2.4E-  2.4E-  2.4E-
30 min               4.3E-  6.7E-  1.1E-  1.2E-  1.2E-  1.2E-  1.2E-  1.2E-  1.2E-  1.2E-  1.2E-  1.2E-  1.2E-  1.2E-
1 min                3.8E-  3.9E-  3.9E-  3.9E-  3.9E-  3.9E-  3.9E-  3.9E-  3.9E-  3.9E-  3.9E-  3.9E-  3.9E-  3.9E-

Figure 9 is meant to show some of the platform-level effects. As expected, the safety is largely independent of service intervals for the “immediate repair” scenario. Recall that the BRAIN has perfect fault detection properties because of the bit-for-bit comparison for the platform communication propagation service, which explains the outcome.

Except for the “no repair” scenario, all scenarios “stabilize” (or stay nearly constant) at a safety level from a 50-hour service interval onwards. Once a failure occurs, another failure is unlikely within the interval to the next service. For the “no repair” scenario, the safety decreases at a higher rate and the dependability decreases with increasing service time, reflecting the propensity for faults when the platform is not scrubbed for latent faults and the “reliability clocks” of components are not reset. Similar behavior might be observed with effects such as aging silicon [21].

6. Conclusions

The results presented in this paper illustrate the benefits of the BRAIN’s hybrid behavior, using the added ‘skip links’ for both integrity and availability augmentation. The full coverage of the high-integrity data propagation of the BRAIN offers a significant improvement over the inline error coverage of the dual-star architectures. The half-duplex BRAIN also has a slightly better reliability overall, with fewer components and similar connectivity requirements. From our analysis, we conclude that extended operation with a fault is possible with certain configurations of the BRAIN architecture.

Extended operation with a dual-star configuration also looks promising in relation to x-by-wire. However, the star architecture’s sensitivity to the quality of inline error-detection mechanisms has also been illustrated. The ability to augment this with improved, higher-level diagnosis functions such as error strike counters may considerably improve the system dependability claims. In addition, the introduction of strike counters into the BRAIN may also increase system dependability; the refinement of such strike-counting policies will be the subject of future work. Over-zealous indictment must also be carefully considered to reduce the risk of resource exhaustion from the impact of transient errors.

The reader is finally cautioned that this work has assumed a constant failure rate for the electronic components examined. As the impact of technology improvements such as decreasing geometries and the associated vulnerabilities of silicon wear-out [21] are considered, some of the assumptions that underpin the reliability assessment may need to be revisited. Reliability assessment when such effects are considered may be a considerable challenge. However, the full coverage and fault detection presented by architectures such as BRAIN may help mitigate such effects.

Similarly, frequent and regular service intervals for testing automotive electronics may not be accepted by customers, as prevention of component failures may not be perceived as an immediate added value to passenger safety. The model in this paper could be extended to include more frequent self-test diagnosis (e.g. at power-down), which typically achieves less error detection coverage, but may achieve better safety numbers for similar service intervals.
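The caution about the constant-failure-rate assumption can be made concrete with a small numerical sketch of the “memory-less” property. With a constant failure rate, the probability of failing in the next service interval does not depend on the age of the component; with a wear-out model it does. The Weibull distribution and its parameters below are assumptions chosen purely for illustration and are not part of the model used in this paper; the failure rate of 10^-5 failures/hour, the 150-hour interval, and the 4000-hour age are values that appear in the text.

```python
import math


def exp_conditional_failure(rate: float, age_h: float, window_h: float) -> float:
    """P(fail within window | survived to age) for a constant failure rate."""
    survive_age = math.exp(-rate * age_h)
    survive_end = math.exp(-rate * (age_h + window_h))
    return 1.0 - survive_end / survive_age


def weibull_conditional_failure(scale_h: float, shape: float,
                                age_h: float, window_h: float) -> float:
    """Same conditional probability for a Weibull lifetime (shape > 1 models wear-out)."""
    def survival(t: float) -> float:
        return math.exp(-((t / scale_h) ** shape))
    return 1.0 - survival(age_h + window_h) / survival(age_h)


RATE = 1e-5      # failures/hour, one of the component failure rates examined
WINDOW = 150.0   # hours, the service interval used in most experiments

# Constant rate: a new part and a 4000-hour-old part are equally likely to fail.
new_part = exp_conditional_failure(RATE, 0.0, WINDOW)
old_part = exp_conditional_failure(RATE, 4000.0, WINDOW)
print(f"constant rate: new={new_part:.4e}, aged={old_part:.4e}")

# Hypothetical wear-out model: the aged part is more likely to fail in the same window.
new_wear = weibull_conditional_failure(1e5, 2.0, 0.0, WINDOW)
old_wear = weibull_conditional_failure(1e5, 2.0, 4000.0, WINDOW)
print(f"wear-out:      new={new_wear:.4e}, aged={old_wear:.4e}")
```

Under the constant-rate assumption, resetting the “reliability clock” at each service is justified because age carries no information; once wear-out is modeled, the age of the electronics enters the result, which is why the assessment would need to be revisited.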

7. References

[1] Wilwert, C., N. Navet, Y. Song, and F. Simonot-Lion. “Design of automotive x-by-wire systems”, The Industrial Communication Technology Handbook, Dec. 2004.
[2] Navet, N., Y. Song, F. Simonot-Lion, and C. Wilwert. “Trends in Automotive Communication Systems”, Proc. of the IEEE 93, 6 (2005).
[3] Wilwert, C., F. Simonot-Lion, Y. Song, and F. Simonot. “Quantitative Evaluation of the Safety of X-by-Wire Systems Architectures Subject to EMI Perturbations.” 3rd Nancy-Saarbruecken Workshop on Logic, Proofs, and Programs, Nancy, Oct. 13-14, 2005.
[4] Geist, R. and K. Trivedi. “Reliability Estimation of Fault-Tolerant Systems: Tools and Techniques.” IEEE Computer, Vol. 23, No. 7, July 1990.
[5] Courtney, T., S. Derisavi, S. Gaonkar, M. Griffith, V. Lam, M. McQuinn, E. Rozier, and W.H. Sanders. “The Mobius Modeling Environment: Recent Extensions - 2005”, Proc. of the 2nd Int. Conf. on the Quantitative Evaluation of Systems (QEST'05), IEEE, Washington, DC, USA, 2005.
[6] Avizienis, A., J.-C. Laprie, B. Randell, and C. Landwehr. “Basic Concepts and Taxonomy of Dependable and Secure Computing”, Trans. on Dependable and Secure Computing, Vol. 1, No. 1, IEEE, Jan-Mar 2004.
[7] IEC. IEC61508 Functional Safety. Parts 0 to 7. 1998, 2000, and 2005.
[8] Leveson, N.G. System Safety Engineering: Back to the Future. Aeronautics and Astronautics. Massachusetts Institute of Technology. Draft. 2002.
[9] SAE. ARP 4754 (Aerospace Recommended Practice). Certification Considerations for Highly Integrated or Complex Aircraft Systems. Society of Automotive Engineers. Nov. 1996.
[10] International Standards Organization. ISO 26262. Road Vehicles. Functional Safety. In preparation. 2006.
[11] Hammett, R.C. and P.S. Babcock. Achieving 10^-9 Dependability with Drive-by-Wire Systems. Society of Automotive Engineers (SAE) Technical Paper Series, Paper 2003-01-1290, 2003.
[12] Latronico, E. and P. Koopman. “Design time reliability analysis of distributed fault tolerance algorithms”, Proc. Int. Conf. on Dependable Systems and Networks, IEEE, pp. 486-495. 2005.
[13] Butler, R. “The SURE Approach to Reliability Analysis”, IEEE Trans. on Reliability, Vol. 41, No. 2, June 1992.
[14] The EASIS Consortium. EASIS Project Glossary. Electronic Architecture and System Engineering for Integrated Safety Systems, Deliverable D0.1.1. http://www.easis.org/. Aug. 2004.
[15] Bridal, O. “Reliability Estimates for Repairable Fault-Tolerant Systems”, Nordic Seminar for Repairable Fault-Tolerant Systems. Lyngby, Denmark, 1994.
[16] Bridal, O. “A methodology for reliability analysis of fault-tolerant systems with repairable subsystems”, In Proc. of the 2nd Int. Conf. on Mathematics of Dependable Systems II (Univ. of York, England). V. Stavridou, Ed. Oxford University Press, New York, NY, 195-208. 1997.
[17] Hall, B., M. Paulitsch, and K. Driscoll. FlexRay BRAIN Fusion - A FlexRay-Based Braided Ring Availability Integrity Network, submitted to SAE Congress. 2007.
[18] Hall, B., K. Driscoll, M. Paulitsch, and S. Dajani-Brown. “Ringing out fault tolerance. A new ring network for superior low-cost dependability”, In Proc. of Int. Conf. on Dependable Systems and Networks. pp. 298-307. 28 June-1 July 2005.
[19] Lamport, L. and P.M. Melliar-Smith. “Byzantine clock synchronization”, In Proc. of ACM Symp. on Principles of Distributed Computing. Vancouver, British Columbia, Canada, ACM Press. Aug. 27-29, 1984.
[20] Hoyme, K. and K. Driscoll. “SAFEbus”, IEEE AES Magazine, March 1993.
[21] Condra, L. The Impact of semiconductor device trends on aerospace systems. Report. Boeing. 2002.
[22] Paulitsch, M., J. Morris, B. Hall, K. Driscoll, and P. Koopman. “Coverage and the Use of Cyclic Redundancy Codes in Ultra-Dependable systems”, In Proc. of Int. Conf. on Dependable Systems and Networks. pp. 346-355. 28 June-1 July 2005.
[23] Bertoluzzo, M., G. Buja, and A. Zuccollo. “Communication Networks for Drive-By-Wire Applications”, 11th Int. Conf. on Power Electronics and Motion Control. European Power Electronics & Drives Ass. Riga, Latvia, 2004.
[24] Kopetz, H. and G. Bauer. “The Time-Triggered Architecture”, Proc. of IEEE. Vol. 91(1). pp. 112-126. 2003.
[25] FlexRay Consortium. FlexRay Communications System. Protocol Specification. Version 2.1. Dec. 2005.
[26] Department of Defense. U.S. MIL-HDBK-217 Reliability Prediction of Electronic Equipment. Version F. 1991.
[27] CALCE. Center for Advanced Life Cycle Engineering. University of Maryland. http://www.calce.umd.edu/.
[28] Lupini, C.A. Vehicle Multiplex Communication - Serial Data Networking Applied to Vehicular Engineering, 2004.
[29] Allied Business Intelligence. X-By-Wire. A Strategic Analysis of In-Vehicle Multiplexing and Next-Generation Safety-Critical Control Systems. 2003.
[30] Personal conversation with Dan Johnson, Honeywell Aerospace, Advanced Technology. Nov. 2006.
[31] SAE. ARP 5107 (Aerospace Recommended Practice). Guidelines for Time-Limited-Dispatch Analysis for Electronic Engine Control Systems. Rev. B. Society of Automotive Engineers. Nov. 2006.
[32] IEEE. IEEE standard test access port and boundary-scan architecture. 21 May 1990.
[33] Kanoun, K. and D. Powell. “Dependability evaluation of bus and ring communication topologies for the Delta-4 distributed fault-tolerant architecture”, In Proc. of 10th Symp. on Reliable Distributed Systems. Pisa, Italy. 1991.
[34] AUTOSAR (AUTomotive Open System ARchitecture). http://www.autosar.org/. Accessed Dec. 2006.
[35] SAE. ARP 4761 (Aerospace Recommended Practice). Guidelines and Methods for Conducting the Safety Assessment Process on Civil Airborne Systems and Equipment. Society of Automotive Engineers. Dec. 1996.
[36] Constantinescu, C. “Dependability evaluation of a fault-tolerant processor by GSPN modeling”, IEEE Transactions on Reliability. Vol. 54, No. 3, pp. 468-474. 2005.
[37] Reschovsky, C. Journey to Work: 2000. Census 2000 Brief. United States Census 2000. U.S. Dept. of Commerce. March 2004.

