
RTOS for Fault Tolerant Application

Abstract:

The increasing complexity of safety-critical systems that support real-time
multitasking applications calls for the concurrency management offered by
real-time operating systems (RTOS). Real-time systems can suffer severe
consequences if either the functional or the timing specifications are not met. In
addition, real-time systems are subject to transient errors originating from
several sources, including the impact of high energy particles on sensitive areas
of integrated circuits. Therefore, the evaluation of the sensitivity of RTOS to
transient faults is a major issue. This paper explores the sensitivity of RTOS kernels
in safety-critical systems. We characterize and analyze the consequences of
transient faults on key components of the kernel of MicroC, a popular RTOS.
We specifically focus on its task scheduling and context switching modules.
Classes of fault syndromes specific to safety-critical real-time systems are
identified. Results reported in this paper demonstrate that 34% of faults that
affect the scheduling and context switching functions led to scheduling
dysfunctions. This represents an important fraction of faults that cannot be
ignored during the design phase of safety-critical applications running under an
RTOS.

Index Terms: Context switch, fault injection, fault syndromes, real-time
operating systems (RTOS), scheduler, safety-critical systems.

Introduction:

Today, many safety-critical embedded systems support real-time
multitasking applications (e.g., nuclear power station, aerospace, traffic
control, or medical life support applications). The complexity of
these systems requires real-time operating systems (RTOS). Because of their time
criticality, the design of real-time systems is challenging. In real-time
systems, critical tasks must never miss their deadlines and never produce
incorrect output results. If their response times exceed a given time period
(deadline) or if they provide incorrect results, the consequences can be
catastrophic (e.g., loss of human lives or economic disaster). Therefore, the
correct real-time functionality of safety-critical systems is mandatory in order to
guarantee the correctness of output results and the required response time of
critical tasks, even in the worst situations. Real-time systems, like all electronic
systems, are subject to transient errors due to cosmic rays and alpha particles.
These errors can cause undesired modifications of storage memory cells. The
consequences of transient errors are currently a well-known concern in
microelectronic systems. The International Technology Roadmap for Semiconductors
(ITRS) predicts increasing system failure rates due to transient errors for future
technology generations.
These errors affect applications running on embedded systems as well as
the RTOS under which they execute. Consequently, they affect both the correctness
of output results and the timing of task responses. In real-time applications,
the time correctness can be more important than the correctness of output
results. For instance, if a system is able to provide correct output results, but
later than some deadline, the system behaviour may be incorrect, with
consequences more significant than if a result with a minor error is provided on
time. The main services provided by an RTOS kernel are task scheduling
(taking into account several factors such as task priorities, resources, and time
management) and context switching. The scheduler decides which task is
to be executed, while the context switch module loads the context (variables,
stack, etc.) of the selected task. If these two services do not work properly, the
task execution order may be affected, and some critical tasks could miss their
deadlines or provide incorrect output results. A real-time scheduler must be
extremely reliable and safe, in order to ensure correctness of the real-time
system response. This is a major concern to RTOS providers, and several
standards for safe and reliable implementations have been proposed. For instance,
RTCA DO-178B is a standard for software used in avionics equipment. This
standard approaches reliability and safety from the software development
perspective, ensuring RTOS fault tolerance in case of software bugs. However,
implementations respecting this standard may also be subject to transient errors,
and the study of their sensitivity to these errors becomes an important issue for
safety-critical real-time applications. The majority of existing works propose
fault injection techniques to evaluate the robustness of kernels that are not
real-time. In one such work, a fault injection tool was developed to study error
propagation in UNIX systems. Reported results show that most injected faults lead
to system failure. A similar result has been reported elsewhere, where the authors
propose a fault injection tool that corrupts system call parameters. The results
show a high failure rate of POSIX¹ functions. Other representative studies propose
the MAFALDA tool to inject faults in the microkernel object code and the
application data segment. The results report not only system crashes, but also
error propagation to the application level.

However, none of the cited works addresses the real-time aspect, which is
the key reason for using an RTOS in safety-critical real-time systems. There is a
lack of contributions in the specialized literature that consider the sensitivity
of the real-time features of RTOSs subject to transient faults. To our knowledge,
only one existing work investigates the temporal aspects of injected faults. In
this work, the authors propose a tool that aims at evaluating the time correctness
of the Chorus microkernel. They study the consequences of faults injected in the
scheduler code. Experimental results show that about 7% of injected faults are
propagated to the application level. Notably, modern RTOSs are PROMable, which
means that a CPU can execute the RTOS services directly from the PROM. Since
PROMs are less sensitive to transient errors than RAMs, faults in the scheduler
code are less of a concern.

¹ POSIX (Portable Operating System Interface) is a set of standards specified by
the IEEE to define the API (application programming interface) for software
designed to run on variants of the UNIX OS.

However, the PROMable RTOSs are still subject to transient errors
during their execution, as the CPU registers are intensively used. Therefore, to
assess the robustness of RTOSs to transient faults, it is mandatory to investigate
their sensitivity to faults injected in CPU registers. With respect to the presented
state of the art, the main contributions of this paper are: 1) the definition of
different types of syndromes caused by transient faults occurring in
safety-critical systems that include an RTOS; 2) the proposal of a fault injection
methodology for assessing the sensitivity of the MicroC RTOS to register-level
transient faults, taking into account both functional correctness and real-time
aspects; and 3) a detailed analysis of the reasons for scheduling dysfunctions caused
by transient errors.



The choice of MicroC in order to evaluate the sensitivity of RTOS kernels in
safety-critical systems was motivated by several aspects. MicroC is an open
source kernel and it is widely used in real-time applications. In addition,
MicroC was certified for use in safety-critical systems (in conformity to RTCA
DO-178B). Moreover, the current trend in real-time systems is to adopt less
complex RTOSs running on multiprocessor-based architectures, instead of
using a complex RTOS running on a single processor.




The transient fault model considered in our experiments is bit-flips in the
processor registers while the key components of the MicroC kernel (the task
scheduling and the context switch) are active. Compared with previously reported
results, we observed that faults corrupting the CPU registers during the
execution of the scheduling and context switching functions have a significant
impact on the reliability of real-time systems. In our experiments, we recorded that
34% of injected faults caused scheduling dysfunctions while an additional 17%
led to system crashes. This represents an important fraction of faults that
cannot be ignored during the design stage of safety-critical applications running
under an RTOS.

The paper is structured as follows. Section II identifies fault
syndromes for safety-critical systems including an RTOS. Section III briefly
describes the main features of the MicroC kernel. The conceptual framework of
the proposed fault injection technique is depicted in Section IV. Fault injection
results are analyzed and discussed in Section V. Section VI provides some
lessons learned concerning the fault injection experiments and results analysis.
Finally, Section VII presents our concluding remarks.




Fault Syndromes For Safety-Critical Systems Including An RTOS:

Transient faults in the RTOS kernel of a safety-critical system may cause
several syndromes. The main classes of syndromes caused by the transient
faults occurring in an RTOS kernel are presented in Fig. 1. As illustrated in the
figure, when affected by transient faults, an RTOS may present two main
classes of syndromes.

Syndromes that may also be observed in classical systems:
Effect-less: no observable effect on system functionality;
Application hang: the system application stops responding (e.g., it enters an
infinite loop);
Exception: the program triggers some exception routine (e.g., illegal
instruction, division by zero, etc.);
Memory access dysfunction: the system tries to access a non-valid physical
memory address;
System crash: the system stops functioning. This syndrome may be a
consequence of a memory access dysfunction;
Incorrect output results: the system provides results, but they are different
from the expected ones.

Syndromes specific to real-time systems using an RTOS may be classified as
follows:
Real-time problem: the real-time constraints specified for the system are not
respected;
Scheduling dysfunction: the scheduling of the tasks composing the
application running on the system is not correct. This syndrome may cause
real-time problems, incorrect output results, or system crashes.
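
The two classes of syndromes can be encoded as a small C sketch for labelling the outcome of one fault injection run. All names here (syndrome_t, run_obs_t, classify) are hypothetical. The precedence order is also an assumption: it reports the earliest cause in the causal chains described in the list (a scheduling dysfunction may cause the other problems, and a memory access dysfunction may cause a crash). It is not the classification procedure used in the paper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One label per syndrome named in the text. */
typedef enum {
    SYN_EFFECTLESS,
    SYN_APP_HANG,
    SYN_EXCEPTION,
    SYN_MEM_ACCESS,
    SYN_CRASH,
    SYN_INCORRECT_OUTPUT,
    SYN_REALTIME,          /* real-time constraints not respected */
    SYN_SCHED_DYSFUNCTION  /* task scheduling differs from the fault-free run */
} syndrome_t;

/* Hypothetical observations collected from one faulty run and compared
 * against a fault-free reference run. */
typedef struct {
    bool crashed;
    bool bad_mem_access;
    bool exception_raised;
    bool hung;
    bool task_order_ok;    /* task execution order matches the reference */
    bool deadlines_met;
    bool output_ok;        /* output results match the reference */
} run_obs_t;

/* Assumed precedence: report the likely root cause first. */
syndrome_t classify(const run_obs_t *o)
{
    assert(o != NULL);
    if (!o->task_order_ok)   return SYN_SCHED_DYSFUNCTION;
    if (o->bad_mem_access)   return SYN_MEM_ACCESS;
    if (o->crashed)          return SYN_CRASH;
    if (o->exception_raised) return SYN_EXCEPTION;
    if (o->hung)             return SYN_APP_HANG;
    if (!o->deadlines_met)   return SYN_REALTIME;
    if (!o->output_ok)       return SYN_INCORRECT_OUTPUT;
    return SYN_EFFECTLESS;
}
```

Returning a single label is a simplification: since one syndrome may cause another, a real campaign might record every observed syndrome for each run.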


MicroC/OS-II Real-Time Kernel: Basic Considerations:

MicroC is a reliable, flexible, pre-emptive, real-time multitasking kernel.
It has been certified by the Federal Aviation Administration for use in
commercial aircraft. The source code of the MicroC kernel is mainly written in
standard C, which makes it portable to different processor architectures. Only a
small portion of the code has to be adapted to the target processor. The main
services offered by MicroC are task scheduling, intertask communication by
semaphores, message mailboxes and message queues, time management
functions, etc. MicroC can manage up to 64 tasks; each task is associated
with a unique priority. A task can be in one of five states (dormant, ready to run,
running, waiting and interrupted). The dormant state corresponds to a task that
has not been made available to the multitasking kernel. The waiting state
corresponds to a task that waits for the occurrence of an event.

A task is in the interrupted state when an interrupt has occurred and the
CPU is handling the interrupt service routine (ISR). A task is running when it
has exclusive control of the CPU. The ready to run state corresponds to a task
that can be executed once the CPU becomes available (the running task
terminates). Generally, a task is an infinite loop function that executes user
code. MicroC associates with each task a task control block (TCB) that contains
essential information about the task (e.g., delay, state, priority, address of the
current top of the stack, etc.). MicroC uses the TCB to preserve the task's state
when it is suspended, and to resume its execution exactly where it left off when the
task becomes ready to run again. All TCBs are located in RAM. Another
characteristic of the considered multitasking application is that each task has its
own stack, which contains the task's variables and the task's running context (the
content of all the CPU registers).
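
As a rough illustration of the TCB described above, the following C sketch holds the fields named in the text (delay, state, priority, top-of-stack address) and shows how a saved stack pointer lets a task resume where it stopped. The type, field, and function names are hypothetical and simplified; they are not the kernel's actual definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The five task states listed in the text. */
typedef enum {
    TASK_DORMANT,
    TASK_READY,
    TASK_RUNNING,
    TASK_WAITING,
    TASK_INTERRUPTED
} task_state_t;

/* Simplified, hypothetical task control block. */
typedef struct {
    uint32_t    *stk_top;  /* address of the current top of the task's stack */
    uint16_t     delay;    /* remaining delay, in ticks */
    task_state_t state;
    uint8_t      prio;     /* unique priority of the task */
} tcb_t;

/* Suspend: the CPU registers are assumed to have been pushed onto the
 * task's own stack already; the TCB only records where they ended up. */
void tcb_suspend(tcb_t *tcb, uint32_t *saved_top, task_state_t next)
{
    assert(tcb != NULL && saved_top != NULL);
    assert(next != TASK_RUNNING);
    tcb->stk_top = saved_top;
    tcb->state   = next;
}

/* Resume: hand back the saved stack top so the context switch can
 * restore the registers from it. */
uint32_t *tcb_resume(tcb_t *tcb)
{
    assert(tcb != NULL && tcb->state == TASK_READY);
    tcb->state = TASK_RUNNING;
    return tcb->stk_top;
}
```

In this sketch the saved stack pointer is the only link between a suspended task and its context, which suggests why a corrupted register value during a context switch could lead to a task resuming from the wrong state.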

The scheduling function is activated every time a task calls the kernel's
services and when the system returns from an interrupt service routine. When
invoked, the scheduling function checks whether a task with a higher priority than
the currently running task is ready to run. If so, a context switch is
performed. The context switch saves the context of the task being suspended
and loads into the CPU the values of the registers for the task to resume. The
ready to run tasks are placed in the ready list, which is stored in memory in two
structures: OSRdyGrp and OSRdyTbl. OSRdyTbl is a table in which each bit is
associated with a priority level. OSRdyGrp is an 8-bit vector. Each bit in
OSRdyGrp corresponds to a row in OSRdyTbl. If at least one of the tasks whose
priorities are grouped in a row is ready to run, the corresponding bit in
OSRdyGrp is set to 1. The scheduler uses the OSRdyGrp and OSRdyTbl structures to
determine the highest priority task allowed to run. The value of OSRdyGrp and the
value of the OSRdyTbl row corresponding to the first 1 in OSRdyGrp are used as
indexes in a lookup table that helps determine the highest priority task. This
operation is deterministic (its execution time is
constant for all contexts). Taking into account this functionality, transient faults
occurring in the OSRdyGrp and OSRdyTbl structures may have major
implications on the correct behavior of the MicroC RTOS and consequently on
the global system (as explained in Section II in the definition of scheduling
dysfunction syndromes).
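
The ready-list lookup described above can be sketched in C as follows. The names OSRdyGrp and OSRdyTbl come from the text; everything else is an assumption made for illustration: the 8x8 layout (64 priorities in 8 rows of 8 bits), the convention that a lower number means a higher priority, the helper names, and the lookup table unmap_tbl, which is built at start-up here instead of being a constant table.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t OSRdyGrp;        /* bit y set => row y holds a ready task    */
static uint8_t OSRdyTbl[8];     /* bit x of row y => priority y * 8 + x     */
static uint8_t unmap_tbl[256];  /* hypothetical: index of the lowest set bit */

/* Clear the ready list and fill the lookup table. */
void rdy_init(void)
{
    OSRdyGrp = 0;
    for (int i = 0; i < 8; i++) {
        OSRdyTbl[i] = 0;
    }
    unmap_tbl[0] = 0;  /* only reached if the ready list is inconsistent */
    for (int v = 1; v < 256; v++) {
        uint8_t bit = 0;
        while (((v >> bit) & 1) == 0) {
            bit++;
        }
        unmap_tbl[v] = bit;
    }
}

/* Mark the task with priority prio as ready to run. */
void rdy_set(uint8_t prio)
{
    assert(prio < 64);
    OSRdyGrp            |= (uint8_t)(1u << (prio >> 3));
    OSRdyTbl[prio >> 3] |= (uint8_t)(1u << (prio & 7));
}

/* Remove the task from the ready list; clear the group bit if its row
 * becomes empty. */
void rdy_clear(uint8_t prio)
{
    assert(prio < 64);
    OSRdyTbl[prio >> 3] &= (uint8_t)~(1u << (prio & 7));
    if (OSRdyTbl[prio >> 3] == 0) {
        OSRdyGrp &= (uint8_t)~(1u << (prio >> 3));
    }
}

/* Two table lookups, no loop: constant execution time. */
uint8_t rdy_highest(void)
{
    uint8_t y = unmap_tbl[OSRdyGrp];     /* first row with a ready task */
    uint8_t x = unmap_tbl[OSRdyTbl[y]];  /* first ready task in that row */
    return (uint8_t)((y << 3) + x);
}
```

With priorities 12 and 37 marked ready, rdy_highest() returns 12. If only priority 37 is ready and bit 0 of OSRdyGrp is then flipped, the lookup in this sketch returns priority 0, a task that is not ready at all, which is one way a single bit-flip can turn into a scheduling dysfunction.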



Fault Injection Framework:

In order to assess the robustness of the MicroC RTOS kernel scheduler,
we developed an environment able to inject faults that corrupt the CPU's registers at
random instants, while the scheduler and the context switch functions are
executing. The studied system architecture is organized as illustrated in Fig. 3.
The adopted system architecture is simulated by an Instruction Set Simulator
(ISS) tool. The fault injection tool uses temporal breakpoint features available in
the ISS to inject faults by software means. Once a temporal breakpoint is
reached, global execution is suspended and the ISS tool activates a Fault
Injection Manager (FIM) that comprises three modules: a fault parameters
generator, a fault tracer and a results analyzer. After the fault has been injected,
the global execution is resumed.


The injection process is depicted in Fig. 4. The fault parameters generator
calculates when and where the fault will be injected. In our experiments, faults
consist of single bit-flips affecting only the main MicroC kernel features: task
scheduling and context switching. Accordingly, the fault instant must coincide
with the time intervals when these functions are active, as illustrated.
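
A minimal C sketch of the two steps described above: choosing when and where to inject, then flipping a single bit in a simulated register file. The register count and width, the type and function names, and the small linear congruential generator are all assumptions made for illustration; the actual tool relies on the temporal breakpoints of the ISS, which are not modelled here.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_REGS 16u  /* assumed size of the register file */
#define REG_BITS 32u  /* assumed register width */

/* Simulated CPU register file. */
typedef struct {
    uint32_t r[NUM_REGS];
} cpu_regs_t;

/* Fault parameters: when (cycle) and where (register, bit). */
typedef struct {
    uint64_t cycle;
    unsigned reg;
    unsigned bit;
} fault_t;

/* Small linear congruential generator so that a campaign can be
 * reproduced from its seed; the upper bits are returned because the
 * low bits of an LCG are weak. */
static uint32_t next_rand(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 8;
}

/* Pick an instant inside [win_start, win_end), the interval during
 * which the scheduler or the context switch is assumed to be active. */
fault_t fault_generate(uint32_t *seed, uint64_t win_start, uint64_t win_end)
{
    assert(seed != 0 && win_start < win_end);
    fault_t f;
    f.cycle = win_start + next_rand(seed) % (win_end - win_start);
    f.reg   = next_rand(seed) % NUM_REGS;
    f.bit   = next_rand(seed) % REG_BITS;
    return f;
}

/* Inject the fault: a single bit-flip in the selected register. */
void fault_inject(cpu_regs_t *regs, const fault_t *f)
{
    assert(regs != 0 && f != 0);
    assert(f->reg < NUM_REGS && f->bit < REG_BITS);
    regs->r[f->reg] ^= (uint32_t)1u << f->bit;
}
```

Because the generator returns 24-bit values, this sketch only covers windows shorter than 2^24 cycles; a wider window would need a wider random source. Injecting the same fault twice restores the original register value, which is a convenient sanity check for the XOR-based flip.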

Conclusion:

Today, many safety-critical embedded systems execute real-time
multitasking applications. The complexity of these systems typically sets a
requirement for an RTOS. These systems are subject to transient errors induced
by parasitic phenomena that may affect both the correctness of logical results
and the timing of task responses. In this paper, we analyzed the sensitivity of
MicroC RTOS to transient faults. We presented a classification of syndromes
caused by the transient faults occurring in safety-critical systems including
RTOSs. We identified syndromes specific to real-time systems that include an
RTOS: real-time problems (when real-time constraints specified for the system
are not respected) and scheduling dysfunctions (when the scheduling of the different
tasks composing the application running on the system is not correct). We also
presented a methodology based on fault injection for assessing the sensitivity of
the MicroC RTOS to transient faults, taking into account both logical correctness
and real-time aspects. In addition, this paper presents an analysis of the reasons
for scheduling dysfunctions, which may allow designers to improve RTOS
robustness to transient faults.
