

15th Annual IEEE International Conference and Workshop on the Engineering of Computer Based Systems

A Case Study: A Model-Based Approach to Retrofit a Network Fault Management System with Self-Healing Functionality

Yan Liu Jing Zhang Michael Jiang David Raymer John Strassner
Motorola Labs, Motorola Inc.
Schaumburg, IL 60133, USA
{yanliu, j.zhang, michael.jiang, david.raymer, john.strassner}@motorola.com

978-0-7695-3141-0/08 $25.00 © 2008 IEEE
DOI 10.1109/ECBS.2008.30

Abstract

Adding self-healing capabilities to network management systems holds great promise for delivering important goals, such as QoS, while simultaneously lowering capital expenditure, operation cost, and maintenance cost. In this paper, we present a model-based approach to add self-healing capabilities to a fault management system for cellular networks. We propose a generic modeling framework to categorize software failures and specify their dispositions at the model level for the target system. This facilitates the deployment of a control loop for adding autonomic capabilities into the system architecture, which include self-monitoring, self-healing, and self-adjusting functionality. While self-monitoring oversees the environmental conditions and system behavior, self-healing is accomplished by instrumenting the system with self-adjusting operations. We include a case study on a prototype Intelligent Network Fault Management system to illustrate this approach by showing how these autonomic capabilities can be added and deployed. Specifically, these autonomic capabilities are derived from self-* model specifications, and are used to mitigate the risk of specified failures and maintain the health of the system in response to different types of faults encountered.

1 Introduction

The highly competitive nature of the telecommunications industry demands an extremely high level of system availability, reliability, and survivability for network management systems. The presence of a large variety of services, each with its own needs and interactions, further complicates the design and performance of network management systems. Fault management is considered one of the most crucial features for reliable and efficient network performance. As a network fault management system involves extensive data processing and troubleshooting under real-time constraints, care must be taken to avoid resource exhaustion. However, events such as alarm storms and unexpected transient network faults caused by unforeseen events result in sudden loss of functionality, performance degradation, and sometimes even the failure of the entire system under extreme conditions. Hence, the ability of a network fault management system to capture software failures effectively, react to them rapidly, and recover elegantly when failures occur is critical to assuring highly dependable network system performance.

As an emerging discipline, the purpose of autonomic computing is to manage complexity. By transferring more manual functions to involuntary control, additional resources are made available to manage higher-level processes. The fundamental management element of an autonomic computing architecture is a control loop. Figure 1 illustrates the concept. The idea is to instrument a Managed Resource so that an Autonomic Manager can communicate with it. This is done using sensors that retrieve data, which is then analyzed to determine if any correction to the managed resource(s) being monitored is needed (e.g., to correct “non-optimal”, “failed” or “error” states). If so, then those corrections are planned, and appropriate actions are executed using effectors that translate commands back to a form that the managed resource(s) can understand. This usually results in the reconfiguration of that managed resource, though it can cause the reconfiguration of other managed resources that are affecting the state of the managed resource that is being monitored. The Autonomic Computing Element embodies the control loop, and enables the autonomic manager to communicate with other types of autonomic and non-autonomic managers using its sensors and effectors.

Autonomic computing and networking are both characterized by the set of capabilities often represented by the phrase self-*, which include self-protection, self-configuration, self-healing, and self-optimization, all aiming to reduce the business, system, and operational complexity of a system [1, 2]. From a system perspective, an
autonomic computing system is an adaptive system, which reacts to changes in user needs, business goals, environmental conditions, and system functionality, and automatically adapts its functionality to meet new demands on its functionality while protecting business goals. Self-awareness is necessary for supporting the adaptive behavior of complex systems. This notion of behavior includes expected behavior, in terms of the correct, safe, and expected operation of the system as well as unexpected deviations from such behavior. Self-healing capabilities provide rapid recovery and recuperation against system failures to support survivability and enhance reliability in real-time system operation.

Figure 1. An Autonomic Computing Element (ACE)

In this paper, we propose a model-based approach for adding selected autonomic capabilities, which adds self-healing functionality, into the critical components of a network fault management system that have a significant impact on system dependability. Self-monitoring oversees the current environmental conditions and system behavior, building a baseline to add self-awareness capabilities. It is responsible for monitoring the system states and environmental conditions, analyzing these states and the changes, and thus identifying and detecting system failures. Once a failure is detected and identified, self-healing operations are enabled for dynamically responding to the identified failure to either fix the failure or mitigate it in some way. These actions are usually accomplished by self-adjusting the corresponding system configurations and operations. Together, all self-* approaches complete an adaptive self-healing framework and offer a sound solution towards achieving high system assurance.

Furthering our previous case study [3], a modeling framework is facilitated to retrofit a prototype network fault management system, called the Intelligent Network Fault Management (INFM) system, by specifying and implementing the self-healing capabilities to correct the software failures. UML models are first extracted from the critical components of the software system. Then, model constructs are used to classify software failures, specify the dispositions of such failures apart from the models that capture the base functionality of the system, and define the healing actions to be taken in the form of self-adjusting operations. The intended self-healing functionality is thus achieved by integrating the base code with the needed self-* functionality instrumented with run-time failure detections and mitigation strategies that are derived from the self-* model specifications.

The paper is organized as follows. Section 2 describes the background of the targeted system and the proposed architectural alteration for adding autonomic capabilities. Section 3 presents the proposed framework for facilitating model-driven autonomic computing for failure mitigation. Section 4 presents the application of the model-based approach to deploying the autonomic control loop within INFM and implementing the self-monitoring, self-healing, and self-adjusting functionality using the model-driven framework. Section 5 summarizes related work, and Section 6 concludes the paper and describes some future work.

2 Adding Self-Healing Functionality to INFM

To illustrate our approach, a network fault management system for a real world telecommunication network is studied. In this section, we introduce the targeted system and describe the architectural alteration needed for supporting the autonomic capabilities needed for self-healing.

2.1 Background

We study the fault management framework of a CDMA cellular network management system, namely the Integrated Network Management System (INMS). The main components of the INMS provide an integrated solution to the core network management working objectives, i.e., FCAPS (fault-management, configuration, accounting, performance, and security). Alarm correlation is the most important feature of the fault management system. On a daily basis, an operation console can receive more than 100,000 network alarms, all of which are processed by the fault management module. The functionalities provided by the alarm correlation feature consist of 1) filtering out the informational alarms, such as “Cleared” and “Warning” types of alarms; 2) reporting meaningful alarms that are regarded as actionable or as requiring operator attention; and 3) providing assistance in troubleshooting. Network operators rely
on the alarm correlation feature to reduce the number of alarms to a limited number that could be handled within the required time constraints. The significant reduction, usually greater than 80% on average, is achieved by correlating the alarms using alarm patterns and hidden correlations discovered by machine learning algorithms.

Figure 2. Adding self-healing functionality to the Intelligent Network Fault Management architecture

Figure 2 shows the architecture of the INFM component that carries out the alarm correlation using frequent pattern discovery algorithms. By design, there exist two types of correlation algorithms, on-line and off-line, for real-time and after-hours correlation operations, respectively. The on-line algorithm is employed during real-time operation to correlate the incoming alarm streams using a sliding time window, usually measured in minutes. It is used to find transient patterns as well as time and domain specific correlations among specific alarms. The off-line algorithm runs on a much larger set of alarms that have been recorded during a certain period of time, usually measured in days, and discovers the common patterns and correlations among certain types of alarms. Both pattern discovery algorithms are parameterized by minimum support and minimum confidence. Minimum support is the minimum value in terms of frequency a pattern must exceed to be calculated while mining from the learning data. With the pattern in the common form of A ⇒ B, a confidence value is defined as the percentage of instances in the learning data containing B in addition to A with regard to the overall number of instances containing A in the data set. The minimum confidence value is hence used to define the minimum value of confidence a pattern must exceed to be calculated. Usually, both parameters are set on a scale of 0..1 and are used to control how candidate patterns are constructed. The analysis provides their frequency of occurrence as well as a confidence factor in the discovered patterns [4, 5]. When the number of instances is fairly large, the time complexity of the algorithm increases dramatically, although the value of the minimum support only decreases by a small magnitude. For the on-line algorithm, this causes a violation of real-time constraints and fails to provide prompt correlation operation. For the off-line algorithm, the large volume of alarms would likely cause resource exhaustion or a crash of the algorithm. This motivates us to add autonomic capabilities to resolve such failures and maintain the healthy state of the fault management system.

2.2 Adding Autonomic Capabilities

Inspired by adaptive control schemes, we add an autonomic control loop to the architecture to support self-healing functionality, taking the form of “monitor” and “control” as shown in Figure 2. Inside this loop, the self-monitoring activities collect information from the system on algorithm executions and check it against pre-defined failure conditions. When a failure is detected and identified, a signal is sent to the autonomic controller to determine the proper healing action. The autonomic controller takes control and directs the necessary actions to be taken on the algorithms. The proposed self-healing mechanisms are mainly different forms of rejuvenation, which is composed of shutting down the current correlation process properly, adjusting the parameters and configuring the algorithm input if necessary, and restarting the execution of the algorithm. As long as the targeted system is well modularized and maintains an architectural consistency, recent studies in modeling profiles and fault tolerance using UML [6, 7] suggest that the adaptive framework could be realized with the assistance of modeling techniques. More specifically, for a legacy fault management system, we propose a three-step approach as follows:

1. Identify the critical features and extract UML models from the legacy code for these features;

2. By extending the UML specifications and capabilities with profiles and domain-specific modeling constructs, model the self-monitoring, self-healing, and self-adjusting aspects for these models and validate the integrated models;

3. Generate the code for these autonomic aspects and insert it into the legacy code.

All three steps can be either automated or semi-automated with tool support. These three steps are explained in detail in later sections.

3 Modeling Self-* Capabilities

We propose a generic modeling framework to facilitate the integration of self-* capabilities with legacy software
systems. The aim of this framework is to define failures from high-level model specifications in order to provide a unified platform to integrate various existing self-healing techniques (failure mitigation strategies in particular), as well as to add self-* capabilities to existing software systems.

3.1 Failure Specification and Modeling

Figure 3 shows a general model representation for software failures and self-healing mechanisms. Different types of failures may be present for a piece of software, which is modeled as either an atomic software component or a composite of software component(s) [8]. A software failure can be detected by one or more different types of detectors; failures are then analyzed by one or more failure analyzers. More than one failure might be analyzed by an analyzer to discover the relationships among these failures and help better characterize these failures and their underlying causes. A failure analyzer also predicts and/or determines the risk raised by the failure(s). In accordance with the different extents of damage that a failure might bring to the system, the mitigator invokes different failure mediation strategies and takes different sets of actions that transform the software from its failure state to a specified operational state [9].

Figure 3. The model of failure, risk, and mitigation

A number of software failure classifications are defined in reliability engineering literature. Some of them simply classify failures into different types based on their risk levels, while others focus on the causes of the failures. In our proposed self-healing framework for failure detection and mitigation, we need a classification that can be used to effectively categorize the failures while stressing the domain specific constraints and operation critical requirements that a failure impacts. This is consistent with Pumfrey’s classification of failures [10], which considers the component or system as the provider of a service or a set of services, each consisting of a particular type of value delivered within a defined interval. The model shown in Figure 3 specifies Pumfrey’s categories of software failures that may occur: provision failures, timing failures, and value failures.

• Provision Failure. Both omission failures and commission failures are considered provision failures, as the former indicates no service is delivered while the latter delivers services that are not required.

• Timing Failure. This includes late or early delivery of service. The timing failure is usually associated with certain time constraints, which can be a real time constraint or a relative deadline with respect to certain events. For example, a time limit can be imposed on the alarm correlation process in a fault management system, as opposed to the relative deadline which states that “the alarm correlation must be completed before the next batch of alarms is received”.

• Value Failure. This type of failure is defined by the incorrect value provided by the service.

Each type of failure requires its specific failure detectors. The effects of such failures can be divided into three categories based on their risk index assigned by the analyzer: insignificant, moderate, and significant. Although most methods proposed and used for failure risk analyses are mainly analytical, thresholds and criteria for such divisions are very domain specific and usually obtained empirically. This implies that certain adaptive behavior including learning and reasoning is needed to support the decision making. Similarly, the invocation of failure mitigation strategies also requires adaptivity. In the proposed model, based on the notion of self-healing, we divide the existing mitigation strategies into three categories: report, rejuvenate, and recover.

• Report. When the risk of a failure is considered insignificant (i.e., it will not affect the normal operations of the system), it is only required that such a failure be reported and recorded. No immediate action is necessary in this case.

• Rejuvenate. Software rejuvenation, defined by Huang et al. [11], is basically resolving the failure by elegantly stopping the current operation and restarting the application. This strategy is usually more powerful when a failure that is hazardous yet not fatal to the system operation is highly likely or present. Predictive models are thus particularly useful for rejuvenation actions.

• Recover. Recovery is needed after a failure occurs. This usually includes failure quarantine/masking, consequence/risk mitigation, and removal of the causal faults.
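
The failure categories, risk levels, and mitigation strategies above can be sketched in code. The Python fragment below is only an illustration and is not taken from INFM: the numeric risk index, the two thresholds, and the one-to-one mapping from risk level to strategy (insignificant to report, moderate to rejuvenate, significant to recover) are assumptions made for the example. The text itself states only that such thresholds are domain specific and usually obtained empirically.

```python
from dataclasses import dataclass
from enum import Enum


class FailureType(Enum):
    PROVISION = "provision"  # omission or commission of a service
    TIMING = "timing"        # service delivered too early or too late
    VALUE = "value"          # service delivered an incorrect value


class Risk(Enum):
    INSIGNIFICANT = 1
    MODERATE = 2
    SIGNIFICANT = 3


class Strategy(Enum):
    REPORT = "report"
    REJUVENATE = "rejuvenate"
    RECOVER = "recover"


@dataclass(frozen=True)
class Failure:
    kind: FailureType
    risk_index: float  # hypothetical score in [0, 1] assigned by an analyzer


# Assumed thresholds for illustration; real values would be found empirically.
MODERATE_THRESHOLD = 0.3
SIGNIFICANT_THRESHOLD = 0.7


def classify_risk(failure: Failure) -> Risk:
    """Map the analyzer's risk index onto the three risk categories."""
    if failure.risk_index >= SIGNIFICANT_THRESHOLD:
        return Risk.SIGNIFICANT
    if failure.risk_index >= MODERATE_THRESHOLD:
        return Risk.MODERATE
    return Risk.INSIGNIFICANT


def select_strategy(risk: Risk) -> Strategy:
    """Assumed fixed mapping; an adaptive mitigator could choose differently."""
    return {
        Risk.INSIGNIFICANT: Strategy.REPORT,
        Risk.MODERATE: Strategy.REJUVENATE,
        Risk.SIGNIFICANT: Strategy.RECOVER,
    }[risk]
```

For example, under these assumed thresholds a timing failure with a risk index of 0.5 is classified as moderate and is handled by rejuvenation. A static lookup table is the simplest possible mitigator; the adaptivity called for in the text would replace it with a component that learns or reasons about the choice.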

3.2 Modeling and Deploying Self-*

Figure 4 shows the flow chart for implementing the proposed model-based approach. A self-* model is described as a set of constructs that are instantiated and integrated with the base system models. For many legacy systems whose models are unavailable, the base model can be extracted from the code. It should be noted that the self-healing feature is kept separate from the base functionality of the target software system. This decomposition of the base functionality and the self-healing feature improves modularity and re-usability of the self-* modules. The main advantage of this approach is that it allows developing and maintaining the self-* models independently from the base model. Furthermore, many of these self-* models can be applied to all levels in the software system hierarchy through model composition.

Integrating and deploying a self-* model to the base application model is done by model composition techniques. Typically, model composition involves merging two or more models to obtain a single integrated model. Aspect-Oriented Modeling (AOM) [12] is one of the promising means to support model composition. Self-* functionalities can be specified in separate aspect modules, which can be instantiated and bound to the base models via a specialized aspect weaver.

It is thus suggested that the self-* models be specified in a platform-independent manner. For instance, to support a complete self-healing application, a family of code generators needs to be employed for transforming platform-independent models into platform-specific code, each defining how the self-healing features are implemented on a different platform. As shown in Figure 4, the self-* model can be translated into platform-specific implementations by a code generator or by manual development (when certain features are beyond model-driven software capabilities). Specifically, the representation of failures and their corresponding resolutions are transformed into low-level implementations that detect the failures and take the appropriate actions to restore the system to the specified operational state. The base code is then augmented with the self-healing instrumentation by integrating both code bases. As a result, autonomic capabilities are added to the legacy software system by first constructing the self-* models, followed by directly mapping the self-* aspects to the representation of the composed self-healing enabled software model, upon which model-based analysis can be performed for system verification and validation.

Figure 4. Adding self-* capabilities: a flow chart

4 Modeling and Deploying Self-Healing for INFM

Following the modeling approach, we first specify the failures and their corresponding resolutions. Then, with the help of reverse modeling tools, we extract the UML models from the existing source code of INFM. After that, we add the self-healing aspects into the base models and complete the new implementation of the monitor and control loop. The code used to implement the self-monitoring functionality is automatically generated using model driven architecture (MDA)-like techniques. The self-healing code for failure mitigation, on the other hand, is manually developed and integrated with the base code. The following subsections explain this process in detail.

4.1 Failures and Resolution

The performance of alarm correlation algorithms is very sensitive to the settings of both parameters of the pattern discovery algorithms, namely the minimum support value and the minimum confidence value, denoted by supp and conf in the rest of the paper. Table 1 lists the run-time failures that have occurred during operations that were not covered in the original system design, the possible causes of these failures, and the intended resolutions for the failures. These three major failure conditions listed in the table are considered dependability critical because:

• it is imperative not to violate the real time constraints on operations for the online correlation algorithm as well as to avoid the possible operational problems
caused by resource exhaustion for the off-line correlation algorithm.

• relatively small values of supp and conf might cause the algorithms to return a large number of patterns. Although small values of both parameters would provide a more comprehensive list of candidate patterns, they raise issues in handling the large quantity of patterns. Furthermore, low supp and conf values increase the likelihood of applying false positive patterns to the succeeding correlation procedures.

• relatively large values of supp and conf, on the other hand, might cause the algorithms to return too few patterns. This can negate the effectiveness of the correlation algorithms and increase the false negative ratio in pattern validation.

Table 1. INFM Software Failures and Resolution Table for Alarm Correlation

Failure                            Possible Causes             Resolution
Timeout                            supp set too low            Increase supp and re-run
Too many (> U) patterns returned   supp or conf set too low    1) Increase conf and recheck patterns;
                                                               if too many (> U) patterns returned,
                                                               2) increase supp and re-run.
Too few (< L) patterns returned    supp or conf set too high   1) Decrease conf and recheck patterns;
                                                               if too few (< L) patterns returned,
                                                               2) decrease supp and re-run.
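
The two-step resolutions that Table 1 gives for the "too many" and "too few" pattern failures can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the INFM implementation: the `Pattern` record, the `mine` callback standing in for the pattern discovery algorithm, the fixed adjustment `step`, and the `max_reruns` limit are all hypothetical. Only the order of the steps (adjust conf and recheck, then adjust supp and re-run) follows the table.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Pattern:
    rule: str          # e.g. "A => B"
    support: float     # frequency of the pattern in the learning data
    confidence: float  # share of instances containing A that also contain B


# Hypothetical stand-in for the discovery algorithm: mine(supp) -> patterns.
Miner = Callable[[float], List[Pattern]]


def recheck_patterns(patterns: List[Pattern], conf: float) -> List[Pattern]:
    """Filter already-mined patterns by the current minimum confidence."""
    return [p for p in patterns if p.confidence >= conf]


def _clamp(x: float) -> float:
    """Keep a parameter on the 0..1 scale."""
    return min(1.0, max(0.0, x))


def resolve_pattern_count(
    mine: Miner, supp: float, conf: float, lower: int, upper: int,
    step: float = 0.1, max_reruns: int = 3,
) -> Tuple[List[Pattern], float, float]:
    """Adjust conf, then supp, until lower <= pattern count <= upper."""
    raw = mine(supp)
    reruns = 0
    while True:
        kept = recheck_patterns(raw, conf)
        if lower <= len(kept) <= upper:
            return kept, supp, conf
        # Step 1: move conf up (too many) or down (too few) and recheck.
        direction = 1 if len(kept) > upper else -1
        conf = _clamp(conf + direction * step)
        kept = recheck_patterns(raw, conf)
        if lower <= len(kept) <= upper:
            return kept, supp, conf
        if reruns == max_reruns:
            raise RuntimeError("pattern failure not resolved; report it")
        # Step 2: the failure persists, so move supp the same way and re-run.
        direction = 1 if len(kept) > upper else -1
        supp = _clamp(supp + direction * step)
        raw = mine(supp)
        reruns += 1
```

Rechecking with a new conf value reuses the patterns that were already mined, which is why it is tried before the more expensive re-run with a new supp value. A fixed step is the simplest choice; the adaptive failure analyzer described in the text would instead derive the adjustment from operating conditions and past performance.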

Based on the failure classification in Section 3, the timeout failure can be classified as a typical timing failure and is considered critical for on-line execution. Since it is less hazardous for off-line alarm correlation, the allowed number of re-runs could be adjusted to a larger number. Similarly, too few patterns and too many patterns might have different risks under different operational modes. The adjustment of parameters in both situations needs further analysis, whereby a failure analyzer is required to carry out the predictive reasoning based on current environmental conditions and constraints as well as the historical performance of the algorithms. In other words, the failure analyzer is indeed an adaptive component that fine tunes the behavior and strategy of the self-healing actions.

4.2 Modeling Timeout Failure Management Aspect

Figure 5 shows an aspect model for managing timeout failure. The reserved keyword proceed refers to a certain operation (e.g., the alarm correlation algorithm in the INFM case study) that has sensitive timing concerns and needs to be analyzed upon the timeout failure. The aspect model intercepts and wraps this operation with a sequence of failure management activities. The aspect process first initiates the counter for the allowed number of re-runs. If the counter already exceeds the allowed number of re-runs, it means that the failure cannot be resolved and has to be reported. The whole application thus must be aborted. Otherwise, the process will start a timer with a value T, in sync with the execution of the proceed operation. The proceed operation and the timer are surrounded by an interruptible activity region (denoted as a dashed rectangle with rounded corners), representing that whenever the flow leaves the region via interrupting edges, all of the activities in the region will be terminated. Specifically, if the proceed operation completes execution successfully before time out, the control of the flow will return to the base process (via a bull’s eye symbol) and continue the next activity that follows the proceed operation. Otherwise, the proceed process will be shut down properly and a timeout failure will be captured and passed to the failure analyzer/mitigator, which is responsible for determining the failure risk and calculating the corresponding mitigation strategies for reconfiguring algorithm parameters. The control loop is thus realized by re-running the algorithm with the new parameter values.

Figure 5. Timeout failure management aspect model
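
The timeout management flow described above (re-run counter, timer T, interruption, mitigation, re-run) can be approximated by a small wrapper. The Python sketch below is illustrative only and is not the code generated for INFM: the function names, the `mitigate` callback, and the dictionary of parameters are hypothetical. It also simplifies one point: a Python thread cannot be forcibly stopped, so the sketch abandons a timed-out run instead of shutting it down properly, which a real system would have to do through cooperative cancellation or a separate process.

```python
import concurrent.futures as cf
from typing import Callable, Dict, TypeVar

R = TypeVar("R")
Params = Dict[str, float]


class UnresolvedFailure(RuntimeError):
    """Raised when the allowed number of re-runs is exceeded."""


def run_with_timeout_management(
    proceed: Callable[[Params], R],     # the wrapped operation
    params: Params,                     # e.g. {"supp": 0.1, "conf": 0.5}
    timeout_s: float,                   # the timer value T, in seconds
    max_reruns: int,                    # allowed number of re-runs
    mitigate: Callable[[Params], Params],  # hypothetical analyzer/mitigator
) -> R:
    reruns = 0
    while True:
        if reruns > max_reruns:
            # The failure cannot be resolved: report it and abort.
            raise UnresolvedFailure("timeout persists after %d re-runs" % max_reruns)
        pool = cf.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(proceed, params)
        try:
            # Finished before the timer expired: return to the base flow.
            return future.result(timeout=timeout_s)
        except cf.TimeoutError:
            # Timeout failure: ask the mitigator for new parameters, re-run.
            params = mitigate(params)
            reruns += 1
        finally:
            # Do not block on a run that is still in progress.
            pool.shutdown(wait=False)
```

In this sketch the wrapper plays the role of the aspect and `proceed` plays the role of the intercepted operation, so the base operation needs no knowledge of the retry logic. For the timeout row of Table 1, `mitigate` would simply return a copy of the parameters with a larger supp value.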

4.3 Modeling Pattern Failure Management Aspect

Similarly, the pattern failure management aspect is modeled as in Figure 6. In contrast to the timeout failure, the pattern failure is identified by a special component after the proceed algorithm is completed. If a failure is detected, RecheckPatterns will filter the returned patterns by adjusting the value of conf. If the failure still exists, the control of the flow will be passed to the pattern failure analyzer/mitigator, which will adjust the algorithm parameters based on the reasoning of different failure types, as shown in Figure 7.

Figure 6. Pattern failure management aspect model

Figure 7. Pattern failure analyzer/mitigator model

4.4 Applying Autonomic Aspects on Extracted Base Models

The autonomic aspects (e.g., timeout and pattern failure management) are applied to the base models of INFM that are extracted from the source code with the aid of UML reverse engineering tools. Figure 8 illustrates a fragment of the extracted UML activity diagram, which models the flow of the alarm correlation process logic in INFM. The control flow starts with database connection, followed by the initiation of the parameters (i.e., supp and conf) for the pattern discovery algorithm. After the algorithm execution is completed, a special operation will be performed to process the patterns that are generated. The timeout and pattern failure management aspect models are attached to the ExecuteAlgo operation with particular parameter values specified. The base models and the autonomic aspect models are then integrated through an underlying aspect-modeling weaver, as shown in Figure 9. The proceed operation is replaced by the ExecuteAlgo operation. The initial and return symbol of the aspect models are connected to the pre- and post- flow of the ExecuteAlgo, respectively.

Figure 8. Applying failure management aspects to the extracted base models of alarm correlation process in INFM

4.5 Code Generation from the Integrated Models

We apply Aspect-Oriented Programming (AOP) [13] technique to support source code integration. AOP is an effective way that enables code instrumentation by encapsulating additional (mostly crosscutting) functionalities in a self-contained aspect module. In AOP terminology, an
well-defined self-healing functionality is centralized to one
particular location. By adopting the aspect-oriented ap-
proach to specifying self-healing activities, the aspect mod-
els can be defined independently from the base functional-
ity. Such kind of separation of concerns greatly improves
the reusability, changeability, and maintainability of the
system. Our experience has led us to believe that modeling
based retrofitting is a promising approach to support adding
self-healing functionality to legacy softwares in a more ef-
ficient and effective way based on modularization. In the
meantime, our study also shows that a moderate amount of
effort has to be put onto the model extraction and verifica-
tion process. In general, the required effort increases when
the level of modularization of the legacy code decreases.

5 Related Work

In the arena of fault tolerant computing, there are a num-


ber of approaches that employ on-line monitoring tech-
niques for failure detection and identification, followed by
failure resolution external to the monitored system. Projects
in the DARPA DASADA program [15, 16] describe an
architecture that uses probes and gauges to monitor the execution of programs, generate events containing measurements, and take actions based on the interpretation of events. Effectors interact with the system to ensure that the system’s run-time behavior fits within the envelope of acceptable behavior. The authors of [17] propose the generation of proxies and wrappers to add autonomic functionalities to object-oriented applications to cope with failures without source code adaptation. In [18], the authors describe the use of code transformation to instrument Java byte-code by adding fault analysis and healing actions. The authors of [7] present a connector-based self-healing mechanism for software components. A component in a self-healing system is designed as a layered architecture, structured with the healing layer and the service layer. The role of connectors between tasks in a component is extended to support the self-healing mechanism in detecting, re-configuring, and repairing anomalous objects in components. In our proposed approach, we separate the autonomic capabilities from base model functionalities and use the modeling framework to integrate the self-* specifications with functional specifications.

Figure 9. Integrating autonomic behavior with INFM alarm correlation base model

aspect is composed of pointcuts and advice. Pointcuts denote a set of specific points in the control flow of the system, and advice contains the code that is woven into the source base. The aspect and the base modules are integrated by an underlying aspect weaver.

The autonomic aspect models are translated into an aspect by a specialized self-* code generator. The pointcut in this case is the operation ExecuteAlgo. The autonomic behavior is captured in advice. The base code for the alarm correlation process is then augmented with the self-* instrumentation via an aspect weaver. (In this particular case study, we choose AspectJ [14] as the underlying weaver, because INFM is implemented in Java and AspectJ is considered the most popular and mature aspect weaver available.) As a result, a self-healing enabled INFM system is constructed from the high-level model specifications.
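
As a rough illustration of the pointcut and advice roles described above for ExecuteAlgo, the following plain-Java sketch wraps a base operation with monitoring and retry logic without touching the base code. It is not the generated AspectJ aspect from the case study: a JDK dynamic proxy stands in for the aspect weaver, and the names AlarmCorrelator, FlakyCorrelator, SelfHealingAdvice, SelfHealingWeaver, and the retry-based healing action are all hypothetical, chosen only for this example.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

// Hypothetical base-model interface; only the operation name comes from the text.
interface AlarmCorrelator {
    String executeAlgo(String alarmId);
}

// Hypothetical base code: fails a fixed number of times, then succeeds.
class FlakyCorrelator implements AlarmCorrelator {
    private int failuresLeft;

    FlakyCorrelator(int failures) {
        this.failuresLeft = failures;
    }

    @Override
    public String executeAlgo(String alarmId) {
        if (failuresLeft > 0) {
            failuresLeft--;
            throw new IllegalStateException("transient correlation failure");
        }
        return "correlated:" + alarmId;
    }
}

// Plays the role of the advice: records each call and retries on failure.
class SelfHealingAdvice implements InvocationHandler {
    private final AlarmCorrelator target;
    private final int maxAttempts;
    final List<String> events = new ArrayList<>(); // self-monitoring log

    SelfHealingAdvice(AlarmCorrelator target, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.target = target;
        this.maxAttempts = maxAttempts;
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        // Plays the role of the pointcut: only executeAlgo is intercepted.
        if (!method.getName().equals("executeAlgo")) {
            return method.invoke(target, args);
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                String result = target.executeAlgo((String) args[0]);
                events.add("success on attempt " + attempt);
                return result;
            } catch (RuntimeException e) {
                last = e;
                events.add("failure on attempt " + attempt + ": " + e.getMessage());
            }
        }
        throw last; // retrying was not enough; surface the last failure
    }
}

// Stand-in for the weaver: combines base code and advice behind the same interface.
final class SelfHealingWeaver {
    private SelfHealingWeaver() {}

    static AlarmCorrelator weave(SelfHealingAdvice advice) {
        return (AlarmCorrelator) Proxy.newProxyInstance(
                AlarmCorrelator.class.getClassLoader(),
                new Class<?>[] {AlarmCorrelator.class},
                advice);
    }
}

class WeavingSketch {
    public static void main(String[] args) {
        SelfHealingAdvice advice = new SelfHealingAdvice(new FlakyCorrelator(2), 3);
        AlarmCorrelator woven = SelfHealingWeaver.weave(advice);
        System.out.println(woven.executeAlgo("alarm-42"));
        advice.events.forEach(System.out::println);
    }
}
```

Callers see the same AlarmCorrelator interface whether or not the self-healing behavior is attached, which mirrors the separation of autonomic concerns from base functionality that the approach aims for. A real weaver such as AspectJ would achieve this at compile or load time rather than through a run-time proxy.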

4.6 Discussion

One of the ultimate goals of software engineering is to construct software that is easily modified and extended. Retrofitting legacy software systems is never an easy task, as it usually involves extensive rework of the legacy code. A desired solution is to leverage the benefits of modularization such that a change in a design decision such as adding

A fault tolerant network usually refers to a network that is resilient against the failure of a component such as a network element, for example, a router or a base station controller. However, the dependability of a telecommunication network system relies not only on the reliability and availability of its hardware components but also on the survivability and sustainability of its software components. It has been proposed in the past that fault tolerance could be designed into the system architecture to improve network reliability and performance [19]. A variety of service failures and hardware failures are pre-defined and associated with well-categorized faults and then addressed by fault isolation, detection, and restoration [20]. Although some of the issues with software failures can be tackled using service replication as proposed by the authors in [21], adaptable online monitoring and self-healing approaches are still rarely defined for software failures, simply because many unexpected and unforeseen software failures can only be detected during run-time execution.

The IEEE Standard Classification for Software Anomalies [22] provides a comprehensive list of software anomaly classifications and related data items that are helpful to identify and track anomalies. The methodology of this standard is based on a process (sequence of steps) that pursues a logical progression from the initial recognition of an anomaly to its final disposition. [23] describes the taxonomy of failures, errors, and faults for dependable and secure computing as a basis for attaining dependability and security. In [24], tree-based techniques are proposed for the classification of software failures based on execution profiles. The UML profile for modeling QoS and fault tolerance [7] defines a set of UML extensions to represent QoS and fault tolerance concepts based on object replications.

The novelty of our model based approach for adding self-healing functionality to a network fault management system lies in the modeling framework to integrate the self-* specifications with functional specifications, as well as the application of model integration to realize these capabilities. The presented modeling framework aims to be more general than the prior approaches in that we synthesize self-healing techniques and employ model composition and transformation mechanisms to support adding autonomic capabilities to software systems from the abstract model specifications down to the low-level system implementations.

6 Conclusion

Autonomic aspects like self-monitoring, self-healing, and self-adjusting are highly desirable capabilities to meet the dependability requirements for network fault management systems. We propose a model based approach to retrofit a network fault management system with a monitor and control loop to achieve self-healing. Although our work focuses on network fault management, we believe the modeling framework can be generalized to add autonomic capabilities extensively for legacy systems in other application domains.

Our future work will be focused on modeling and deploying policy directed self-* capabilities in the context of model based policy based network management [25], for instance, specifying and modeling policies at high levels and then mapping them with various parameters and constraints for self-healing operations, all of which will be based on more general and advanced techniques for failure prediction, classification, reasoning, and intelligent strategy selection for risk mitigation.

References

[1] John Strassner and Jeffrey O. Kephart. Autonomic systems and networks: Theory and practice. In Proceedings of the Network Operations and Management Symposium (NOMS) 2006, 2006.

[2] Jeffrey O. Kephart and David M. Chess. The vision of autonomic computing. IEEE Computer, 36(1), 2003.

[3] Yan Liu, Jing Zhang, Michael Jiang, David Raymer, and John Strassner. A model-based approach to adding autonomic capabilities to network fault management system. To be published in Proceedings of the IEEE/IFIP Network Operations and Management Symposium 2008, Salvador, Brazil, 2008.

[4] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining association rules between sets of items in large databases. SIGMOD, 22(2), 1993.

[5] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules. VLDB, 22(2), 1994.

[6] Object Management Group. UML profile for modeling quality of service and fault tolerance characteristics and mechanisms. http://www.omg.org/cgi-bin/doc?formal/06-05-02, 2006.

[7] Michael E. Shin and Daniel Cooke. Connector-based self-healing mechanism for components of a reliable system. SIGSOFT Softw. Eng. Notes, 30(4):1–7, 2005.

[8] John Strassner. Policy-Based Network Management: Solutions for the Next Generation (The Morgan Kaufmann Series in Networking). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003.

[9] Vina Ermagan, Jun ichi Mizutani, Kentaro Oguchi, and David Weir. Towards model-based failure-management for automotive software. In SEAS ’07: Proceedings of the 4th International Workshop on Software Engineering for Automotive Systems, page 8, Washington, DC, USA, 2007. IEEE Computer Society.

[10] David John Pumfrey. The principled design of computer system safety analyses. Department of Computer Science, University of York, 2000. PhD Thesis.

[11] Yennun Huang, Chandra Kintala, Nick Kolettis, and N. Dudley Fulton. Software rejuvenation: analysis, module and applications. In Twenty-Fifth International Symposium on Fault-Tolerant Computing, FTCS-25. Digest of Papers, volume 30, pages 381–390, 1995.

[12] Aspect-oriented modeling. http://www.aspect-modeling.org/.

[13] Gregor Kiczales, John Lamping, Anurag Mendhekar, Chris Maeda, Cristina Lopes, Jean-Marc Loingtier, and John Irwin. Aspect-oriented programming. In Proceedings European Conference on Object-Oriented Programming, volume 1241, pages 220–242. Springer-Verlag, 1997.

[14] Gregor Kiczales, Erik Hilsdale, Jim Hugunin, Mik Kersten, Jeffrey Palm, and William Griswold. Getting started with AspectJ. Communications of the ACM, 44(10):59–65, 2001.

[15] David Garlan, Bradley Schmerl, and Jichuan Chang. Using gauges for architecture-based monitoring and adaptation. December 2001.

[16] Janak Parekh, Gail Kaiser, Philip Gross, and Giuseppe Valetto. Retrofitting autonomic capabilities onto legacy systems. Cluster Computing, 9(2):141–159, 2006.

[17] A.R. Haydarlou, B.J. Overeinder, and F.M.T. Brazier. A self-healing approach for object-oriented applications. In Proceedings of the Sixteenth International Workshop on Database and Expert Systems Applications, pages 191–195, 2005.

[18] M. Muztaba Fuad and Michael J. Oudshoorn. Transformation of existing programs into autonomic and self-healing entities. In ECBS ’07: Proceedings of the 14th Annual IEEE International Conference and Workshops on the Engineering of Computer-Based Systems, pages 133–144, Washington, DC, USA, 2007. IEEE Computer Society.

[19] Tomohiro Fujisaki, Masaki Hamada, and Katsuyoshi Kageyama. A scalable fault-tolerant network management system built using distributed object technology. In EDOC ’97: Proceedings of the 1st International Conference on Enterprise Distributed Object Computing, pages 140–148, Washington, DC, USA, 1997. IEEE Computer Society.

[20] D. Medhi. Network reliability and fault tolerance, 1999.

[21] Rachid Guerraoui and André Schiper. Software-based replication for fault tolerance. Computer, 30(4):68–74, 1997.

[22] Roy Sterritt and Michael G. Hinchey. Autonomicity - an antidote for complexity? pages 283–291, 2005.

[23] IEEE standard classification for software anomalies, 1993.

[24] Patrick Francis, David Leon, Melinda Minch, and Andy Podgurski. Tree-based methods for classifying software failures. In ISSRE ’04: Proceedings of the 15th International Symposium on Software Reliability Engineering (ISSRE’04), pages 451–462, Washington, DC, USA, 2004. IEEE Computer Society.

[25] Dave Raymer, John Strassner, Elyes Lehtihet, and Sven van der Meer. End-to-end model driven policy based network management. In POLICY ’06: Proceedings of the Seventh IEEE International Workshop on Policies for Distributed Systems and Networks (POLICY’06), pages 67–70, Washington, DC, USA, 2006. IEEE Computer Society.
