Genetic Algorithm Techniques To Solve Routing and Wavelength Assignment Problem in Wavelength Division Multiplexing All-Optical Networks
Yan Liu Jing Zhang Michael Jiang David Raymer John Strassner
Motorola Labs, Motorola Inc.
Schaumburg, IL 60133, USA
{yanliu, j.zhang, michael.jiang, david.raymer, john.strassner}@motorola.com
on the alarm correlation feature to reduce the number of alarms to a limited number that could be handled within the required time constraints. The significant reduction, usually greater than 80% on average, is achieved by correlating the alarms using alarm patterns and hidden correlations discovered by machine learning algorithms.

large, the time complexity of the algorithm increases dramatically, although the value of the minimum support only decreases by a small magnitude. For the on-line algorithm, this causes a violation of real-time constraints and fails to provide prompt correlation operation. For the off-line algorithm, the large volume of alarms would likely cause resource exhaustion or crash of the algorithm. This motivates us to add autonomic capabilities to resolve such failures and maintain the healthy state of the fault management system.
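The sensitivity to the minimum support described above can be illustrated with a small sketch. It assumes the correlation algorithm mines frequent alarm sets in the style of association-rule mining; the brute-force enumeration, the function name `frequent_itemsets`, and the alarm names and windows are hypothetical and are not taken from the INFM system.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support.

    Brute-force illustration only (not an optimized Apriori implementation).
    """
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count / n >= min_support:
                result[candidate] = count / n
                found_any = True
        if not found_any:
            # Anti-monotonicity: if no itemset of this size is frequent,
            # no larger itemset can be frequent either.
            break
    return result


# Hypothetical alarm windows: each set holds the alarm types seen together.
alarm_windows = [
    {"LINK_DOWN", "LOS", "BER_HIGH"},
    {"LINK_DOWN", "LOS"},
    {"LINK_DOWN", "BER_HIGH"},
    {"LOS", "BER_HIGH", "FAN_FAIL"},
    {"LINK_DOWN", "LOS", "BER_HIGH", "FAN_FAIL"},
]

if __name__ == "__main__":
    for supp in (0.75, 0.5, 0.3):
        print(supp, len(frequent_itemsets(alarm_windows, supp)))
```

Even on this toy data, each modest reduction of the threshold roughly doubles the number of itemsets that must be counted and kept, which is the effect that leads to timeouts and resource exhaustion on large alarm volumes.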
systems. The aim of this framework is to define failures from high-level model specifications in order to provide a unified platform to integrate various existing self-healing techniques (failure mitigation strategies in particular), as well as to add self-* capabilities to existing software systems.

3.1 Failure Specification and Modeling

Figure 3 shows a general model representation for software failures and self-healing mechanisms. Different types of failures may be present in a piece of software, which is modeled as either an atomic software component or a composite of software component(s) [8]. A software failure can be detected by one or more different types of detectors; failures are then analyzed by one or more failure analyzers. An analyzer may examine more than one failure to discover the relationships among these failures and to better characterize them and their underlying causes. A failure analyzer also predicts and/or determines the risk raised by the failure(s). Depending on the extent of damage that a failure might bring to the system, the mitigator invokes different failure mediation strategies and takes different sets of actions that transform the software from its failure state to a specified operational state [9].

Each type of failure requires its specific failure detectors. The effects of such failures can be divided into three categories based on the risk index assigned by the analyzer: insignificant, moderate, and significant. Although most methods proposed and used for failure risk analysis are analytical, the thresholds and criteria for such divisions are very domain specific and usually obtained empirically. This implies that certain adaptive behavior, including learning and reasoning, is needed to support the decision making. Similarly, the invocation of failure mitigation strategies also requires adaptivity. In the proposed model, based on the notion of self-healing, we divide the existing mitigation strategies into three categories: report, rejuvenate, and recover.

sification of failures [10], which considers the component or system as the provider of a service or a set of services, each consisting of a particular type of value delivered within a defined interval. The model shown in Figure 3 specifies Pumfrey’s categories of software failures that may occur: provision failures, timing failures, and value failures.

• Provision Failure. Both omission failures and commission failures are considered provision failures, as the former indicates that no service is delivered while the latter delivers services that are not required.
• Timing Failure. This includes late or early delivery of service. A timing failure is usually associated with certain time constraints, which can be a real-time constraint or a relative deadline with respect to certain events. For example, a time limit can be imposed on the alarm correlation process in a fault management system, as opposed to a relative deadline which states that “the alarm correlation must be completed before the next batch of alarms is received”.
• Value Failure. This type of failure is defined by the incorrect value provided by the service.
• Recover. Recovery is needed after a failure occurs. This usually includes failure quarantine/masking, consequence/risk mitigation, and removal of the causal faults.
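The vocabulary of the failure model in Section 3.1 (failure types, risk categories, and mitigation strategies) can be written down as a small data model. The following is a minimal sketch, not the authors' implementation: the numeric risk index range, the two thresholds, and the one-to-one mapping from risk category to strategy are assumptions made only for illustration, since the text notes that such thresholds are domain specific and obtained empirically.

```python
from dataclasses import dataclass
from enum import Enum


class FailureType(Enum):
    PROVISION = "provision"  # omission or commission of a service
    TIMING = "timing"        # late or early delivery
    VALUE = "value"          # incorrect value delivered


class Risk(Enum):
    INSIGNIFICANT = 1
    MODERATE = 2
    SIGNIFICANT = 3


class Strategy(Enum):
    REPORT = "report"
    REJUVENATE = "rejuvenate"
    RECOVER = "recover"


@dataclass
class Failure:
    kind: FailureType
    risk_index: float  # hypothetical score in [0, 1] assigned by an analyzer


def classify_risk(risk_index, moderate_at=0.3, significant_at=0.7):
    """Map a risk index to a category; both thresholds are assumed values."""
    if risk_index >= significant_at:
        return Risk.SIGNIFICANT
    if risk_index >= moderate_at:
        return Risk.MODERATE
    return Risk.INSIGNIFICANT


# Assumed mapping for illustration; a real mitigator would decide adaptively.
STRATEGY_FOR_RISK = {
    Risk.INSIGNIFICANT: Strategy.REPORT,
    Risk.MODERATE: Strategy.REJUVENATE,
    Risk.SIGNIFICANT: Strategy.RECOVER,
}


def choose_strategy(failure):
    """Pick a mitigation strategy from the failure's risk category."""
    return STRATEGY_FOR_RISK[classify_risk(failure.risk_index)]
```

Keeping the thresholds as parameters reflects the point made in the text that the division into risk categories has to be tuned per domain rather than fixed in the model.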
Table 1. INFM Software Failures and Resolution Table for Alarm Correlation
Failure | Possible Causes | Resolution
Timeout | supp set too low | Increase supp and re-run.
Too many (> U) patterns returned | supp or conf set too low | 1) Increase conf and recheck patterns; 2) if too many (> U) patterns are still returned, increase supp and re-run.
Too few (< L) patterns returned | supp or conf set too high | 1) Decrease conf and recheck patterns; 2) if too few (< L) patterns are still returned, decrease supp and re-run.
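The resolution rules of Table 1 amount to a small decision procedure over the two parameters supp and conf. The sketch below is one possible reading of the table, not the INFM implementation: the function name `next_parameters`, the failure labels, the `conf_tried` flag, and the fixed adjustment `step` are all hypothetical, and the paper does not state how large an adjustment should be.

```python
def next_parameters(failure, supp, conf, conf_tried=False, step=0.05):
    """Return (supp, conf, action) following the Table 1 resolution rules.

    failure    -- "timeout", "too_many" or "too_few" (hypothetical labels)
    conf_tried -- True once adjusting conf alone has already been attempted
    step       -- assumed adjustment size; not specified in the source
    """
    def clamp(x):
        # Keep parameters inside [0, 1] and trim floating-point noise.
        return round(min(1.0, max(0.0, x)), 4)

    if failure == "timeout":
        return clamp(supp + step), conf, "re-run"
    if failure == "too_many":
        if not conf_tried:
            return supp, clamp(conf + step), "recheck"
        return clamp(supp + step), conf, "re-run"
    if failure == "too_few":
        if not conf_tried:
            return supp, clamp(conf - step), "recheck"
        return clamp(supp - step), conf, "re-run"
    raise ValueError(f"unknown failure: {failure!r}")
```

Trying conf first for the pattern-count failures follows the ordering of the two steps in the table: rechecking the already mined patterns under a new conf avoids a full re-run, which is only needed when supp changes.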
Based on the failure classification in Section 3, the timeout failure can be classified as a typical timing failure and is considered critical for on-line execution. Since it is less hazardous for off-line alarm correlation, the allowed number of re-runs could be adjusted to a larger number. Similarly, too few patterns and too many patterns might have different risks under different operational modes. The adjustment of parameters in both situations needs further analysis, whereby a failure analyzer is required to carry out the predictive reasoning based on current environmental conditions and constraints as well as the historical performance of the algorithms. In other words, the failure analyzer is indeed an adaptive component that fine-tunes the behavior and strategy of the self-healing actions.

4.2 Modeling Timeout Failure Management Aspect

Figure 5 shows an aspect model for managing timeout failure. The reserved keyword proceed refers to a certain operation (e.g., the alarm correlation algorithm in the INFM case study) that has sensitive timing concerns and needs to be analyzed upon the timeout failure. The aspect model intercepts and wraps this operation with a sequence of failure management activities. The aspect process first initializes the counter for the allowed number of re-runs. If the counter already exceeds the allowed number of re-runs, the failure cannot be resolved and has to be reported; the whole application must then be aborted. Otherwise, the process starts a timer with a value T, in sync with the execution of the proceed operation. The proceed operation and the timer are surrounded by an interruptible activity region (denoted as a dashed rectangle with rounded corners), representing that whenever the flow leaves the region via interrupting edges, all of the activities in the region will be terminated. Specifically, if the proceed operation completes execution successfully before the timeout, control returns to the base process (via a bull’s eye symbol) and continues with the next activity that follows the proceed operation. Otherwise, the proceed process will be shut down properly and a timeout failure will be captured and passed to the failure analyzer/mitigator, which is responsible for determining the failure risk and calculating the corresponding mitigation strategies for reconfiguring algorithm parameters. The control loop is thus realized by re-running the algorithm with the new parameter values.

Figure 5. Timeout failure management aspect model
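The control flow of the timeout aspect (re-run counter, timer T running alongside proceed, shutdown on timeout, re-run with adjusted parameters) can be sketched as a wrapper function. This is a minimal illustration under stated assumptions, not code generated from the authors' models: it uses a worker thread with cooperative cancellation through a `threading.Event`, so the wrapped operation must check the event itself, and the names `run_with_timeout_management`, `fake_correlation`, and `raise_supp` are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class UnresolvedFailure(RuntimeError):
    """Raised when the allowed number of re-runs is exhausted."""


def run_with_timeout_management(proceed, params, timeout_s, max_reruns, adjust):
    """Wrap `proceed` with the timeout-handling loop described in the text.

    proceed(params, cancel) -- wrapped operation; must stop when cancel is set
    adjust(params)          -- stand-in for the failure analyzer/mitigator
    """
    reruns = 0
    while True:
        if reruns > max_reruns:
            # Failure cannot be resolved: report it and abort.
            raise UnresolvedFailure(f"still timing out after {max_reruns} re-run(s)")
        cancel = threading.Event()
        with ThreadPoolExecutor(max_workers=1) as pool:
            future = pool.submit(proceed, params, cancel)
            try:
                # The timer runs in parallel with the proceed operation.
                return future.result(timeout=timeout_s), params
            except FutureTimeout:
                cancel.set()  # ask proceed to shut down; the pool waits for it
        params = adjust(params)  # new parameter values for the next run
        reruns += 1


def fake_correlation(params, cancel):
    """Hypothetical stand-in: a lower supp means a longer running time."""
    if cancel.wait(0.05 / params["supp"]):
        return None  # cancelled before finishing
    return f"patterns at supp={params['supp']}"


def raise_supp(params):
    """Assumed mitigation: increase supp by a fixed amount."""
    return {**params, "supp": round(params["supp"] + 0.4, 2)}


if __name__ == "__main__":
    print(run_with_timeout_management(fake_correlation, {"supp": 0.1}, 0.3, 2, raise_supp))
```

Cooperative cancellation is chosen here because Python threads cannot be terminated from outside; an operation that never checks the event would block the wrapper, so a separate process would be the safer choice for code that cannot be modified.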
Figure 6. Pattern failure management aspect model

4.5 Code Generation from the Integrated Models
well-defined self-healing functionality is centralized to one particular location. By adopting the aspect-oriented approach to specifying self-healing activities, the aspect models can be defined independently from the base functionality. This separation of concerns greatly improves the reusability, changeability, and maintainability of the system. Our experience has led us to believe that modeling-based retrofitting is a promising approach to adding self-healing functionality to legacy software in a more efficient and effective way based on modularization. At the same time, our study also shows that a moderate amount of effort has to be put into the model extraction and verification process. In general, the required effort increases as the level of modularization of the legacy code decreases.
5 Related Work
liability and performance [19]. A variety of service failures and hardware failures are pre-defined and associated with well-categorized faults and then addressed by fault isolation, detection, and restoration [20]. Although some of the issues with software failures can be tackled using service replication as proposed by the authors in [21], adaptable on-line monitoring and self-healing approaches are still rarely defined for software failures, simply because many unexpected and unforeseen software failures can only be detected during run-time execution.

The IEEE Standard Classification for Software Anomalies [22] provides a comprehensive list of software anomaly classifications and related data items that help to identify and track anomalies. The methodology of this standard is based on a process (sequence of steps) that pursues a logical progression from the initial recognition of an anomaly to its final disposition. [23] describes the taxonomy of failures, errors, and faults for dependable and secure computing as a basis for attaining dependability and security. In [24], tree-based techniques are proposed for the classification of software failures based on execution profiles. The UML profile for modeling QoS and fault tolerance [7] defines a set of UML extensions to represent QoS and fault tolerance concepts based on object replications.

The novelty of our model-based approach for adding self-healing functionality to a network fault management system lies in the modeling framework to integrate the self-* specifications with functional specifications, as well as the application of model integration to realize these capabilities. The presented modeling framework aims to be more general than the prior approaches in that we synthesize self-healing techniques and employ model composition and transformation mechanisms to support adding autonomic capabilities to software systems from the abstract model specifications down to the low-level system implementations.

6 Conclusion

for self-healing operations, all of which will be based on more general and advanced techniques for failure prediction, classification, reasoning, and intelligent strategy selection for risk mitigation.

References

[1] John Strassner and Jeffrey O. Kephart. Autonomic systems and networks: Theory and practice. In Proceedings of the Network Operations and Management Symposium (NOMS) 2006, 2006.
[2] Jeffrey O. Kephart and David M. Chess. The vision of autonomic computing. IEEE Computer, 36(1), 2003.
[3] Yan Liu, Jing Zhang, Michael Jiang, David Raymer, and John Strassner. A model-based approach to adding autonomic capabilities to network fault management system. To be published in Proceedings of the IEEE/IFIP Network Operations and Management Symposium 2008, Salvador, Brazil, 2008.
[4] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining association rules between sets of items in large databases. SIGMOD, 22(2), 1993.
[5] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules. VLDB, 22(2), 1994.
[6] Object Management Group. UML profile for modeling quality of service and fault tolerance characteristics and mechanisms. http://www.omg.org/cgi-bin/doc?formal/06-05-02, 2006.
[7] Michael E. Shin and Daniel Cooke. Connector-based self-healing mechanism for components of a reliable system. SIGSOFT Softw. Eng. Notes, 30(4):1–7, 2005.
[11] Yennun Huang, Chandra Kintala, Nick Kolettis, and N. Dudley Fulton. Software rejuvenation: analysis, module and applications. In Twenty-Fifth International Symposium on Fault-Tolerant Computing, FTCS-25. Digest of Papers, volume 30, pages 381–390, 1995.
[12] Aspect-oriented modeling. http://www.aspect-modeling.org/.
[13] Gregor Kiczales, John Lamping, Anurag Mendhekar, Chris Maeda, Cristina Lopes, Jean-Marc Loingtier, and John Irwin. Aspect-oriented programming. In Proceedings European Conference on Object-Oriented Programming, volume 1241, pages 220–242. Springer-Verlag, 1997.
[14] Gregor Kiczales, Erik Hilsdale, Jim Hugunin, Mik Kersten, Jeffrey Palm, and William Griswold. Getting started with AspectJ. Communications of the ACM, 44(10):59–65, 2001.
[15] David Garlan, Bradley Schmerl, and Jichuan Chang. Using gauges for architecture-based monitoring and adaptation. December 2001.
[21] Rachid Guerraoui and André Schiper. Software-based replication for fault tolerance. Computer, 30(4):68–74, 1997.
[22] Roy Sterritt and Michael G. Hinchey. Autonomicity - an antidote for complexity? pages 283–291, 2005.
[23] IEEE standard classification for software anomalies, 1993.
[24] Patrick Francis, David Leon, Melinda Minch, and Andy Podgurski. Tree-based methods for classifying software failures. In ISSRE ’04: Proceedings of the 15th International Symposium on Software Reliability Engineering (ISSRE’04), pages 451–462, Washington, DC, USA, 2004. IEEE Computer Society.
[25] Dave Raymer, John Strassner, Elyes Lehtihet, and Sven van der Meer. End-to-end model driven policy based network management. In POLICY ’06: Proceedings of the Seventh IEEE International Workshop on Policies for Distributed Systems and Networks (POLICY’06), pages 67–70, Washington, DC, USA, 2006. IEEE Computer Society.