
Towards Self-Adaptive Scheduling in Time- and Space-Partitioned Systems

João Pedro Craveiro, Joaquim Rosa and José Rufino


Universidade de Lisboa, Faculdade de Ciências, LaSIGE / Lisbon, Portugal
Email: jcraveiro@lasige.di.fc.ul.pt, jrosa@lasige.di.fc.ul.pt, ruf@di.fc.ul.pt

Abstract: Time- and space-partitioned systems (TSP) are a current trend in safety-critical systems, most notably in the aerospace domain; applications of different criticalities and origins coexist in an integrated platform, while being logically separated in partitions. To schedule application activities, TSP systems employ a hierarchical scheduling scheme. Flexible (self-)adaptation to abnormal events empowers the survivability of unmanned space vehicles, by preventing these events from becoming severe faults and/or permanent failures. In this paper, we propose adding self-adaptability mechanisms to the TSP scheduling scheme to cope with timing faults. Our approach consists of having a set of feasible and specially crafted partition scheduling tables, among which the system shall switch upon detecting temporal faults within a partition. We also present the preliminary experimental evaluation of the proposed mechanisms.
Keywords: Adaptive control; adaptive scheduling; adaptive systems; aerospace and electronic systems; hierarchical systems; real-time systems.

I. INTRODUCTION
The computing infrastructures supporting onboard
aerospace systems, given the criticality of the mission
being pursued, have strict dependability and real-time
requisites. They also require flexible resource reallocation,
and reduced size, weight and power consumption (SWaP). A
typical spacecraft onboard computer hosts multiple avionics
functions, to each of which separate dedicated hardware
resources were traditionally allocated. To cater to resource
allocation and SWaP requirements, there has been a trend in
the aerospace industry towards integrating several functions
in the same computing platform. As these functions may
have different degrees of criticality and predictability, and
originate from multiple providers or development teams,
safety issues might arise, which are mitigated by employing
time and space partitioning (TSP) [1], whereby applications
are separated into logical partitions.
Achieving time partitioning by scheduling partitions in a
cyclic fashion, assigning them predefined execution time windows (in a partition scheduling table, PST), has advantages
related to system verification, validation and certification.
Time windows are defined at system configuration time,
This work was developed in the context of ESA ITI project AIR-II
(ARINC 653 In Space RTOS, http://air.di.fc.ul.pt). This work was partially
supported by the EC, through project IST-FP7-STREP-288195 (KARYON),
and by FCT (Fundação para a Ciência e a Tecnologia), through the Multiannual and CMU|Portugal programs and the Individual Doctoral Grants
SFRH/BD/60193/2009 and SFRH/BD/80026/2011.

according to the applications' known timing requirements, namely estimated worst-case execution times (WCETs). However, if the time assigned to a partition with real-time requirements turns out to be insufficient due to unforeseen
events or conditions, the application hosted in this partition
will probably miss its deadlines. Depending on the function
performed by the application, these timing faults may have
serious consequences for the mission.
In this paper, we propose a mechanism to mitigate these
timing faults in TSP systems through self-adaptation, while
still maintaining the advantages of cyclic partition scheduling
for verification, validation and certification purposes. The system reacts to the detection of deadline violations by switching
among multiple predefined PSTs, specially configured to
meet the detected additional demand. Our approach takes
advantage of functionality provided by the AIR architecture
for TSP systems. AIR (ARINC 653 In Space Real-time operating system) was developed by an international consortium
sponsored by ESA with the requirements of the aerospace
industry in mind [2], but is applicable to other safety-critical
systems as well (such as aviation and automotive).
In Section II we describe the AIR architecture for time- and space-partitioned systems, with emphasis on the features
which empower our current proposal. Then, in Section III,
we present the proposed self-adaptive scheduling mechanism.
In Section IV, we describe our experimental validation and
the results obtained. In Section V, we present already developed features, based on reconfiguration, that increase the
usefulness of our present approach. Section VI surveys related
work. Finally, Section VII closes the paper with concluding
remarks and future work.
II. AIR ARCHITECTURE
The AIR architecture has been designed to fulfil the
requirements for robust TSP, and foresees the use of different operating systems among the partitions, either real-time
operating systems or generic non-real-time ones.
The modular design of the AIR architecture is shown
in Figure 1. The AIR Partition Management Kernel (PMK)
ensures robust temporal and spatial partitioning. The Application Executive (APEX) provides a standard interface between
applications and the underlying core software layer [3]. The
AIR Health Monitor is responsible for handling hardware and software errors (such as missed deadlines, memory protection violations, or hardware failures). The aim is to isolate errors within their domain of occurrence: process-level errors will cause an application error handler to be invoked, while partition-level errors trigger a response action defined at system integration time [2].

Figure 1. AIR architecture for TSP systems
Mode-based schedules: Temporal partitioning is achieved through two-level hierarchical scheduling (Figure 2). In the first level, partitions are scheduled cyclically over a Major Time Frame (MTF), according to a partition scheduling table (PST). In each partition, processes compete in the second hierarchy level according to the native process scheduler of each partition's operating system [2]. AIR supports mode-based partition schedules, among which the system can switch throughout its execution. This feature may be used for (self-)adaptation to different mission phases, or to unforeseen events [4]. After being requested, a schedule switch becomes effective at the end of the current MTF.

Figure 2. AIR two-level hierarchical scheduling, with support for mode-based schedules
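To make the first scheduling level concrete, the following Java sketch shows one possible representation of a PST as an ordered sequence of time windows over the MTF, together with the cyclic first-level dispatching decision. The class and method names are ours, chosen purely for illustration; they do not reproduce the actual AIR PMK data structures.

    // Illustrative only: a PST as a cyclic sequence of time windows over the MTF.
    import java.util.List;

    final class TimeWindow {
        final int partitionId;  // partition scheduled in this window
        final long start, end;  // offsets within the MTF, in milliseconds
        TimeWindow(int partitionId, long start, long end) {
            this.partitionId = partitionId;
            this.start = start;
            this.end = end;
        }
    }

    final class PartitionScheduleTable {
        final long mtf;                 // major time frame length, in milliseconds
        final List<TimeWindow> windows; // ordered, non-overlapping windows within [0, mtf)
        PartitionScheduleTable(long mtf, List<TimeWindow> windows) {
            this.mtf = mtf;
            this.windows = windows;
        }
        /** First-level scheduling decision: which partition owns instant t (-1 if unassigned). */
        int partitionAt(long t) {
            long offset = t % mtf;      // the PST repeats cyclically every MTF
            for (TimeWindow w : windows) {
                if (offset >= w.start && offset < w.end) return w.partitionId;
            }
            return -1;                  // unassigned (spare) time
        }
        public static void main(String[] args) {
            // Hypothetical 300 ms MTF with three windows; the rest of the MTF is unassigned.
            PartitionScheduleTable pst = new PartitionScheduleTable(300, List.of(
                    new TimeWindow(1, 0, 19), new TimeWindow(3, 19, 24), new TimeWindow(2, 24, 43)));
            System.out.println(pst.partitionAt(310)); // 310 mod 300 = 10, inside P1's window: prints 1
        }
    }

Within the window returned by partitionAt, the second scheduling level would then apply the native process scheduler of the active partition's operating system.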
Process deadline violation monitoring: During system execution, it may be the case that a process exceeds its deadline. This can be caused by a malfunction, by transient overload (e.g., due to abnormally high event signalling rates), or by the underestimation of a process's WCET. AIR features a lightweight approach to the timely detection of process deadline violations. The upper bound on the deadline violation detection latency is, for each partition, the maximum interval between two consecutive time windows [2], [4].
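Under one reading of this bound (our interpretation, for illustration only, and using a plain array representation rather than AIR's internal structures), the worst-case detection latency of a partition can be computed offline from its PST as the largest gap between the end of one of its time windows and the start of the next one, wrapping around the MTF:

    // Illustrative computation of an upper bound on the deadline violation
    // detection latency for one partition: the maximum interval between two
    // consecutive time windows of that partition in the cyclically repeating MTF.
    final class DetectionLatency {
        /**
         * starts[i] and ends[i] delimit the i-th time window of the partition, as
         * offsets within the MTF (sorted, non-overlapping); mtf is the MTF length.
         */
        static long worstCase(long[] starts, long[] ends, long mtf) {
            long worst = 0;
            for (int i = 0; i < starts.length; i++) {
                int next = (i + 1) % starts.length;
                long gap = (next == 0)
                        ? (mtf - ends[i]) + starts[0]   // wrap around into the next MTF
                        : starts[next] - ends[i];
                worst = Math.max(worst, gap);
            }
            return worst;
        }
        public static void main(String[] args) {
            // Hypothetical partition holding windows [0,19) and [150,169) in a 300 ms MTF:
            // the largest gap, and hence the bound, is (300 - 169) + 0 = 131 ms.
            System.out.println(worstCase(new long[]{0, 150}, new long[]{19, 169}, 300));
        }
    }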
III. SCHEDULING SELF-ADAPTATION TO TIMING FAULTS

The self-adaptation mechanisms we hereby propose aim for the ability to cope with timing faults in TSP systems, without abandoning the two-level hierarchical scheme (which brings benefits in terms of certification, verification and validation). Our proposal profits from AIR's mode-based schedules and the process deadline violation monitoring described above.

In our approach, the system integrator includes several PSTs that fulfil the same set of temporal requirements for the partitions, but distribute additional spare time inside the MTF among these partitions in different ways. This set of PSTs, coupled with an appropriate Health Monitoring (Section II) handler for the event of a process deadline violation, can be used to temporarily or permanently assign additional processing time to a partition hosting an application which has repeatedly been observed to miss deadlines [4].
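The reaction logic itself can be kept very simple. The sketch below is a minimal illustration, assuming a hypothetical health-monitoring callback interface (it is not the actual AIR Health Monitor API): a process deadline violation merely flags a request for an alternative PST, and the first-level scheduler applies any pending request at the next MTF boundary.

    // Illustrative Health Monitor handler: a deadline violation in a partition
    // requests a switch to an alternative, precomputed PST that grants that
    // partition more processing time; the switch only takes effect at the end
    // of the current MTF. Interface and names are hypothetical.
    import java.util.Map;

    final class DeadlineViolationHandler {
        private final Map<Integer, String> alternativePstFor; // partition id -> PST name
        private String requestedPst = null;                   // pending schedule switch, if any

        DeadlineViolationHandler(Map<Integer, String> alternativePstFor) {
            this.alternativePstFor = alternativePstFor;
        }

        /** Invoked by the (hypothetical) health monitoring layer. */
        void onProcessDeadlineViolation(int partitionId) {
            String candidate = alternativePstFor.get(partitionId);
            if (candidate != null) {
                requestedPst = candidate;   // only flag the request here
            }
        }

        /** Invoked by the first-level scheduler at every MTF boundary. */
        String scheduleForNextMtf(String currentPst) {
            if (requestedPst == null) return currentPst;
            String next = requestedPst;
            requestedPst = null;
            return next;                    // the schedule switch becomes effective now
        }
    }

In the experiments reported in Section IV, such a map would associate partition P1 with PST 2, so that a deadline miss inside P1 triggers a switch from PST 1 to PST 2 at the end of the current MTF.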
This approach with static precomputed PSTs is a first step towards a lightweight, safe, and certifiable dynamic approach to self-adaptability in TSP systems. In this paper, we resort to experimental evaluation to validate this initial proof of concept. The next steps to enhance the coverage and efficacy of our proposal are already planned and are described in Section VII.
IV. EXPERIMENTAL EVALUATION

A. Sample generation
We ran our experiments on a set of 200 samples, each of which corresponds to a TSP system comprising 3 partitions. Each partition in a sample system was generated by (i) drawing a taskset from a pool of 500 random tasksets with total utilizations between 0.19 and 0.25; task periods were randomly drawn from the harmonic set {125, 250, 500, 1000, 2000}; these values are in milliseconds, and correspond to typical periods found in real TSP systems; and (ii) computing the partition's timing requirements from the timing characteristics of the taskset, using Shin and Lee's periodic resource model [5] with the partition cycle from the set {25, 50, 75, 100} which yielded the lowest utilization. The sample's MTF is set to the least common multiple of the partitions' cycles.
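As a rough illustration of this generation step (a simplified reconstruction, not the exact generator behind the 500-taskset pool), the sketch below draws task periods from the harmonic set and scales the WCETs so that the total utilization lands inside the target range; the derivation of the partition interface via the periodic resource model [5] is left out.

    // Simplified sketch: draw one random taskset with harmonic periods and a
    // total utilization inside a target range (here [0.19, 0.25]).
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    final class TasksetGenerator {
        static final int[] PERIODS_MS = {125, 250, 500, 1000, 2000};

        /** A task is a tuple (C, T): WCET and period, both in milliseconds. */
        record Task(double wcetMs, int periodMs) {}

        static List<Task> draw(int numTasks, double minU, double maxU, Random rng) {
            double targetU = minU + rng.nextDouble() * (maxU - minU);
            double[] shares = new double[numTasks];
            double sum = 0;
            for (int i = 0; i < numTasks; i++) { shares[i] = rng.nextDouble(); sum += shares[i]; }
            List<Task> taskset = new ArrayList<>();
            for (int i = 0; i < numTasks; i++) {
                int period = PERIODS_MS[rng.nextInt(PERIODS_MS.length)];
                double utilization = targetU * shares[i] / sum;   // this task's share of targetU
                taskset.add(new Task(utilization * period, period));
            }
            return taskset;
        }

        public static void main(String[] args) {
            List<Task> taskset = draw(3, 0.19, 0.25, new Random(42));
            double totalU = taskset.stream().mapToDouble(t -> t.wcetMs() / t.periodMs()).sum();
            System.out.println(taskset + " (total utilization " + totalU + ")");
        }
    }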
B. Simulation process
We implemented a simulator of the described two-level
scheduling scheme, with AIR's mode-based scheduling and
deadline violation monitoring mechanisms [4], supporting
the proposed self-adaptive schedule switch mechanisms, in
the Java language. The choice of Java was driven by the
development of a more complete modular simulator, taking
advantage of the object-oriented paradigm.
For each sample, the first step of the simulator is to generate two PSTs using an extension of the algorithm we propose in [6]. The algorithm nearly emulates rate monotonic scheduling, assigning available time slots on a cycle basis to fulfil the minimum duration requirements. In the end, there are typically unassigned time slots in the MTF covered by the PST. The extension we implemented for these experiments generates two PSTs from the assignment made by the algorithm from [6]: PST 1, where all the unassigned time is given to partition P2, and PST 2, where all the unassigned time is given to partition P1.

Figure 3. Example: (a) minimal PST; (b) PST 1; (c) PST 2
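The following sketch illustrates this extension in isolation: starting from a minimal assignment in which some time windows are left idle, it derives a variant PST in which every unassigned window is handed to a chosen partition. The one-owner-per-window array is a simplification of ours, not the simulator's actual data structure.

    // Derive a PST variant from a minimal assignment by giving every unassigned
    // time window (-1) to one designated partition.
    import java.util.Arrays;

    final class PstVariants {
        static int[] giveUnassignedTimeTo(int[] windowOwners, int partition) {
            int[] variant = Arrays.copyOf(windowOwners, windowOwners.length);
            for (int i = 0; i < variant.length; i++) {
                if (variant[i] == -1) variant[i] = partition;
            }
            return variant;
        }

        public static void main(String[] args) {
            // Hypothetical minimal assignment over six windows (owners 1..3, -1 = idle).
            int[] minimal = {3, 1, -1, 3, 2, -1};
            System.out.println(Arrays.toString(giveUnassignedTimeTo(minimal, 2))); // PST 1: spare time to P2
            System.out.println(Arrays.toString(giveUnassignedTimeTo(minimal, 1))); // PST 2: spare time to P1
        }
    }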
We simulated the execution of each of the samples for 1 minute (60000 ms), with the following varying conditions:
Task temporal faultiness: The percentage by which the jobs of the tasks in partition P1 exceed the parent task's WCET was varied between 0% and 100%.
Self-adaptation mechanisms: For each considered task temporal faultiness, simulations were performed with our proposed self-adaptive scheduling mechanisms respectively disabled and enabled. In both cases, the initial PST by which partitions are scheduled is PST 1. With self-adaptive scheduling disabled, this PST is used throughout the entire simulation. Otherwise, when a task deadline miss is notified to the Health Monitor, a flag is activated, which will cause the system to switch, at the end of the current MTF, from PST 1 to PST 2.
We chose to limit (i) the timing fault injection to P1, and (ii) the execution time trade to happen between P2 and P1, so as to see how the mechanism behaves in the presence of one timing fault on a known partition. This simplification shall be lifted in future experiments, where we will analyse the behaviour when faults are multiple or can occur in any partition.
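As a small illustration of this fault injection model (our own sketch of the simulation input, not the simulator's code), jobs of the faulty partition's tasks simply execute for the given percentage longer than the task's estimated WCET, while jobs of the other partitions are unaffected:

    // Fault injection sketch: a faultiness of 0.0 means jobs take exactly the
    // task's WCET; 1.0 (100%) means jobs take twice the WCET.
    final class FaultInjection {
        static double injectedExecutionTime(double wcetMs, int partitionId,
                                            int faultyPartitionId, double faultiness) {
            return partitionId == faultyPartitionId ? wcetMs * (1.0 + faultiness) : wcetMs;
        }

        public static void main(String[] args) {
            // A job of a P1 task with a 44 ms WCET under 100% faultiness takes 88 ms.
            System.out.println(injectedExecutionTime(44, 1, 1, 1.0));
        }
    }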
Example: To help illustrate our approach and our simulation, let us look at an example sample, comprising a set of three partitions (P1 to P3) with tasksets {⟨44, 500⟩, ⟨62, 1000⟩, ⟨58, 2000⟩}, {⟨29, 250⟩, ⟨28, 1000⟩, ⟨50, 1000⟩, ⟨89, 2000⟩}, and {⟨35, 250⟩, ⟨99, 2000⟩}, respectively.1 From each of the tasksets, we derived the following partition timing requirements for the generation of PSTs 1 and 2:
P1: 19 ms per 100 ms cycle;
P2: 19 ms per 75 ms cycle;
P3: 5 ms per 25 ms cycle.
An excerpt of the minimal PST to serve as the basis of both PSTs 1 and 2 is pictured in Figure 3a. The analogous excerpts of PSTs 1 and 2, with unassigned time given to, respectively, partitions P2 and P1, are shown in Figures 3b and 3c.
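To make the example concrete, the short sketch below redoes the arithmetic behind these figures: each partition's interface utilization (budget over cycle) and the MTF obtained as the least common multiple of the three cycles (300 ms). It merely restates the numbers above; it does not re-run the periodic resource model analysis of [5].

    // Worked arithmetic for the example: partition interface utilizations
    // (budget / cycle) and the MTF as the least common multiple of the cycles.
    final class ExampleInterfaces {
        static long gcd(long a, long b) { return b == 0 ? a : gcd(b, a % b); }
        static long lcm(long a, long b) { return a / gcd(a, b) * b; }

        public static void main(String[] args) {
            long[] budgets = {19, 19, 5};    // ms per cycle, for P1, P2, P3
            long[] cycles  = {100, 75, 25};  // partition cycles, in ms
            long mtf = 1;
            for (int i = 0; i < budgets.length; i++) {
                System.out.printf("P%d: %d ms per %d ms cycle (utilization %.3f)%n",
                        i + 1, budgets[i], cycles[i], (double) budgets[i] / cycles[i]);
                mtf = lcm(mtf, cycles[i]);
            }
            System.out.println("MTF = lcm(100, 75, 25) = " + mtf + " ms"); // prints 300 ms
        }
    }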
C. Results
We selected two deadline miss measures to evaluate the
efficacy of our approach. First we measured the percentage
of all jobs generated for the considered interval (60000 ms) which missed their deadline: the deadline miss rate.

1 Each task τm,q is a tuple ⟨Cm,q, Tm,q⟩ of WCET and period.

Figure 4. Experimental results: deadline miss rate

Figure 5. Experimental results: deadline misses over time

The graph in Figure 4 shows the average deadline miss rate (over
all samples) for each value of injected temporal faultiness,
both without and with our self-adaptation mechanisms. As
expected, the deadline miss rate grows approximately linearly
with the injected faultiness. However, when self-adaptation
based on the detected deadline misses is employed, the
growth of this rate is controlled. Even when the jobs of the
tasks in P1 are taking twice the estimated WCET to finish,
fewer than 5% of the released jobs miss their deadlines, whereas the deadline miss rate amounts to more than 60% if self-adaptation is disabled.
The second measure captures how long the timing faults' effects take to wear off (i.e., until deadline misses cease to occur) in a system with 100% faultiness (jobs take twice the task's WCET). The graph in Figure 5 plots the average number
(over all samples) of deadline misses detected in each 1000
ms interval. When self-adaptation is not used, the timing fault
persists over time, with about 4 misses detected each second.
With our self-adaptation mechanism, P1's tasks cease to have
their jobs miss deadlines after, at most, 15 seconds.
These results are encouraging and strongly suggest the
efficacy of our self-adaptation mechanisms. This outcome
is, however, for now limited to soft real-time systems, since it still allows some deadline misses to occur. Furthermore, the scope of our experiments is limited in terms of fault coverage. Although addressing this still requires some future work (Section VII), let us look at already supported extensions to the self-adaptation mechanisms which enable broader and more widely applicable improvements.

V. INCREASING SURVIVABILITY WITH ONLINE RECONFIGURATION
As shown in the results, the use of self-adaptation mechanisms significantly decreases the number of deadline misses, when compared to the same system execution without such mechanisms. However, despite their contribution to mitigating this problem, these self-adaptation mechanisms neither cover all possible cases nor ensure perpetual stability of the system. For example, temporal faultiness may increase further, to the point where the additional execution time provided to the faulty partition is no longer enough. Furthermore, this problem can be even more serious if the system encounters deadline misses in more than one partition.
As it is impossible to foresee all the cases, i.e., all the suitable PSTs associated with different temporal requirements, we need to provide ways to reconfigure the system, during its execution, when facing this kind of problem. This reconfiguration can be done by replacing the original set of PSTs with a redefined and updated set that covers the new temporal requirements and the encountered issues [7]. Self-adaptability can, in such more serious scenarios, provide a temporary mitigation to allow time for a more targeted solution through online reconfiguration, which enables mission survival.
VI. RELATED WORK
Previous adaptive approaches to overload management
were based on load adjustment [8] and task rate (period)
adaptation [9], [10]. These are however not applicable to the
safety-critical TSP systems we target. In [11], tasks migrate between being a software task (running on a CPU) and a hardware task (running on a reconfigurable hardware device, an FPGA), to adaptively cope with a dynamic workload. The notion of switching among predefined PSTs in reaction to process deadline misses was first mentioned
in [4], where we presented the mode-based schedules and
process deadline violation monitoring features in the context
of AIR. Later, Khalilzad et al. [12] independently proposed
an adaptive hierarchical scheduling framework based on
feedback control loops. The hierarchical scheduling model is
similar to most server-based approaches, with each subsystem
(analogous to partition) being temporally characterized by a
budget and a period. For each subsystem, two variables are
controlled (number of deadline misses and the ratio between
total budget and consumed budget), and the total budget
is manipulated accordingly. Recently, Bini et al. [13] have
proposed the ACTORS (Adaptivity and Control of Resources
in Embedded Systems) software architecture.
VII. CONCLUSION AND FUTURE WORK

In this paper, we have proposed self-adaptability mechanisms to cope with temporal faults in time- and space-partitioned (TSP) systems. Our approach consists of having
a set of feasible and specially crafted partition scheduling
tables, among which the system shall switch upon detecting temporal faults within a partition. The added self-adaptability
does not void the certification advantages brought by the
use of precomputed partition schedules. The results of our
experimental evaluation show the self-adaptation to timing
faults we propose achieves a degree of temporal fault control
suitable for soft real-time guarantees. Future work includes a
more complete evaluation with a wider fault coverage, namely
faults in more than one partition. Furthermore, as mentioned,
this preliminary approach shall evolve towards a dynamic
approach that does not compromise the mission's safety and the system's certifiability. Other enhancements include better heuristics, such as taking advantage of multicore platforms (detecting that
some temporal fault is bound to a core and performing a
schedule switch that simply reassigns the activities of that
core to another) and attempting to foresee (and thus avoid,
not mitigate) imminent deadline misses. The latter should lead
us closer to a self-adaptation policy for hard real-time.
REFERENCES
[1] J. Rushby, "Partitioning in avionics architectures: Requirements, mechanisms and assurance," SRI International, California, USA, NASA Contractor Rep. CR-1999-209347, Jun. 1999.
[2] J. Rufino, J. Craveiro, and P. Veríssimo, "Architecting robustness and timeliness in a new generation of aerospace systems," in Architecting Dependable Systems VII, ser. LNCS, A. Casimiro, R. de Lemos, and C. Gacek, Eds. Springer Berlin / Heidelberg, 2010, vol. 6420, pp. 146–170.
[3] AEEC, "Avionics application software standard interface, part 1," Aeronautical Radio, Inc., ARINC Spec. 653P1-2, 2006.
[4] J. Craveiro and J. Rufino, "Adaptability support in time- and space-partitioned aerospace systems," in Proc. 2nd Int. Conf. Adaptive and Self-adaptive Systems and Applications, 2010.
[5] I. Shin and I. Lee, "Periodic resource model for compositional real-time guarantees," in Proc. 24th IEEE Real-Time Systems Symp. (RTSS '03), Dec. 2003.
[6] J. P. Craveiro and J. Rufino, "Applying real-time scheduling theory to achieve parallelism in time- and space-partitioned systems," AIR-II Tech. Rep. RT-11-01, 2011, in submission.
[7] J. Rosa, J. Craveiro, and J. Rufino, "Safe online reconfiguration of time- and space-partitioned systems," in Proc. 9th IEEE Int. Conf. Industrial Informatics, Lisbon, Portugal, Jul. 2011.
[8] T.-W. Kuo and A. Mok, "Incremental reconfiguration and load adjustment in adaptive real-time systems," IEEE Trans. Computers, vol. 46, no. 12, pp. 1313–1324, Dec. 1997.
[9] G. Buttazzo, G. Lipari, and L. Abeni, "Elastic task model for adaptive rate control," in Proc. 19th IEEE Real-Time Systems Symp. (RTSS '98), Dec. 1998.
[10] G. Beccari, S. Caselli, and F. Zanichelli, "A technique for adaptive scheduling of soft real-time tasks," Real-Time Systems, vol. 30, pp. 187–215, 2005.
[11] R. Pellizzoni and M. Caccamo, "Adaptive allocation of software and hardware real-time tasks for FPGA-based embedded systems," in Proc. 12th IEEE Real-Time and Embedded Technology and Applications Symp. (RTAS '06), Apr. 2006.
[12] N. M. Khalilzad, M. Behnam, T. Nolte, and M. Åsberg, "On adaptive hierarchical scheduling of real-time systems using a feedback controller," in 3rd Workshop on Adaptive and Reconfigurable Embedded Systems (APRES '11), Apr. 2011.
[13] E. Bini, G. Buttazzo, J. Eker et al., "Resource management on multicore systems: The ACTORS approach," IEEE Micro, vol. 31, no. 3, pp. 781, May–Jun. 2011.
