A Multi-Agent Systems Approach To Autonomic Computing
1 In a commercial-grade version of this technique, each element would also be provided with the security credentials needed to prove its identity to the other elements in the system.

2 In the current Unity implementation, only some of these policies are actually stored in and retrieved from the policy repository. We intend to increase that fraction in the coming year.
Permission to make digital or hard copies of all or part of
this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or
commercial advantage and that copies bear this notice and the
full citation on the first page. To copy otherwise, to republish,
to post on servers or to redistribute to lists, requires prior
specific permission and/or a fee.
AAMAS'04, July 19-23, 2004, New York, New York, USA.
Copyright 2004 ACM 1-58113-864-4/04/0007...$5.00
4. Self-Healing for Clusters

Another main goal of Unity is to demonstrate and study self-healing clusters of autonomic elements. We have initially implemented this style of self-healing in a single element: the policy repository.

The purpose of a self-healing system is to provide reliability and data integrity in the face of imperfect underlying software and hardware. We approach this by adding function to the policy repository to support joining an existing cluster of synchronized policy repositories, and replicating data changes within that cluster. It is also necessary for the cluster as a whole to detect the failure of one of its elements, and to create a replacement element. Care and consideration must be given to which machine should host this new element. For instance, the current policy in Unity is that two elements in the same cluster should not be hosted on the same machine, and that elements in a cluster should not be instantiated on machines that have previously hosted failed elements in that same cluster.

Unity supports clustering with two operations added to the policy repository element. First, whenever a new or modified piece of policy data is received by one of the policy repositories in the cluster, it is sent to all other repositories. This maintains a consistent view (within a few seconds) amongst all policy repositories. (The algorithm currently used does not have transactional integrity, and race conditions can lead to desynchronization in rare cases. We intend to address this in the near future.)

Second, we have modified the subscription system by which Unity elements learn of changes to their policies. In standard OGSI [13] notification, a single OGSA service (the subscriber) subscribes to a given Service Data Element on a single other OGSA service (the publisher). In Unity, the publisher would be the policy repository. In the event of that policy repository failing, even if its data is still safe and available from the other policy repositories in the cluster, the subscriber is left with no subscription, and is never notified of policy changes. In our modified system, subscriptions themselves (including the identity of the subscriber, the class of data subscribed to, and the member of the cluster currently servicing the subscription) are part of the data replicated between elements of the cluster. When a member of the cluster fails, all of its subscriptions are still recorded in the state data of the surviving cluster members. By reassigning those subscriptions to a surviving member, the system can continue providing notifications to the subscribers.

The self-healing cluster operates as follows. When the Unity system is initialized, the resource arbiter determines how many policy repositories are required (this is nominally done by consulting the system policy, but due to the obvious bootstrapping problem this policy is not stored in the policy repository). The resource arbiter then deploys the required policy repositories, each of which is supplied with its intended role (including the identifier of the cluster that it should join) and the registry address. As each one initializes, it consults the registry to contact the already registered members of the cluster and thereby joins the cluster itself, using a simple serial algorithm that avoids most race conditions. The resource arbiter also contracts with the sentinel to monitor these policy repositories.

From this point, whenever one of the policy repositories receives changes to its policy set, the changes are communicated to the rest of the cluster, as discussed above. Similarly, policy repositories within the cluster exchange information about which elements are subscribers to the policy data, and to which policy data they are subscribed.

Now, let us assume that the sentinel determines that one of the policy repositories has failed. Perhaps the software has failed, or perhaps the network connection has been severed. The resource arbiter will be notified by the sentinel of this failure, and will first choose one of the still-functioning policy repositories to take over subscriptions previously handled by the failed one, and notify all cluster members of this reassignment. Then, typically, it will determine that it should replace the failed policy repository. In this case, it will examine the available hosts, and select one upon which to deploy a replacement policy repository. The policy repository is deployed (via a standard interaction with the OSContainer on the target host), and upon initialization, the new policy repository joins the cluster and retrieves a copy of the current cluster state data (policies and subscriptions).

Such clusters are still subject to a significant data replication problem. A more complete solution could be assisted by the use of the failover and data replication features of a database management system. The method is also most effective in the case of simple single-element failures; it is not robust against network partitions or similar problems.

The above description still contains a central point of failure if the system contains only one sentinel. This problem is easily solved with a cluster of sentinels similar to the cluster of policy repositories. Sentinels within a cluster keep each other's state up-to-date, and provide resiliency against the failure of a single sentinel.

5. Self-Optimization Scenario

In this section, we illustrate how utility functions may be effectively used in autonomic systems by means of a data center scenario. The data center manages numerous resources, including compute servers, database servers, storage devices, etc., and serves many different customers using multiple applications. We focus in particular on the dynamic allocation and management of the compute servers within the data center, although our general methodology applies to multiple, arbitrary resources.
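The self-healing behavior described in Section 4 (replication of policy changes, reassignment of a failed member's subscriptions to a survivor, and placement-constrained deployment of a replacement) can be sketched as follows. This is a minimal illustration, not Unity's implementation: all class and method names are hypothetical, and the OGSI machinery, registry, sentinel, and OSContainer interactions are elided.

```python
# Sketch of the Section 4 self-healing flow. Hypothetical names throughout;
# Unity's actual OGSI-based implementation differs.

class Repository:
    """Stand-in for one policy repository in the cluster."""
    def __init__(self, host):
        self.host = host
        self.policies = {}          # replicated policy data

class Cluster:
    def __init__(self, hosts):
        self.hosts = list(hosts)    # machines available for deployment
        self.members = []           # live policy repositories
        self.failed_hosts = set()   # hosts that held a failed member
        self.subscriptions = {}     # subscriber id -> serving member

    def deploy_member(self):
        # Placement policy from the paper: never co-locate two members of
        # the same cluster, and avoid hosts that previously held a member
        # of this cluster that failed.
        used = {m.host for m in self.members}
        candidates = [h for h in self.hosts
                      if h not in used and h not in self.failed_hosts]
        if not candidates:
            raise RuntimeError("no eligible host for a new member")
        member = Repository(candidates[0])
        # A joining member retrieves current cluster state from a survivor.
        if self.members:
            member.policies = dict(self.members[0].policies)
        self.members.append(member)
        return member

    def replicate(self, key, value):
        # A policy change received by any member is sent to all members,
        # giving a consistent view within a short delay.
        for m in self.members:
            m.policies[key] = value

    def on_failure(self, failed):
        # Called when the sentinel reports `failed` as down: reassign its
        # subscriptions to a surviving member, then deploy a replacement.
        self.members.remove(failed)
        self.failed_hosts.add(failed.host)
        survivor = self.members[0]
        for subscriber, server in self.subscriptions.items():
            if server is failed:
                self.subscriptions[subscriber] = survivor
        return self.deploy_member()
```

For example, with three hosts and a two-member cluster, the loss of one repository leaves its subscribers served by the survivor, and a replacement comes up on the third host (the failed member's host being excluded) with a copy of the replicated policies.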
While utility-based resource allocation is widely studied in MAS and other fields, we find surprisingly that it is little known in real-world computer systems management. Additionally, those approaches proposed in the research literature do not fully meet the needs of autonomic computing. Some approaches [4, 7, 10] require humans to ascribe utility value to low-level resources. However, a truly autonomic [...]

[Figure: the Resource Arbiter receives resource-level utility functions U1(R) and U2(R) from the Application Managers, which derive them from service-level utilities U1(S, D) and U2(S, D) and observed demand.]
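The Resource Arbiter's allocation step — choosing server counts that maximize the sum of the resource-level utilities reported by the Application Managers — can be sketched as a search over discrete allocations. Exhaustive enumeration is an assumption made here for clarity; the paper does not commit Unity to any particular optimization algorithm.

```python
# Sketch of utility-maximizing server allocation by an arbiter.
from itertools import product

def allocate(utilities, total_servers):
    """utilities: one function per Application Environment, mapping a
    server count R to a utility value (the U(R) curves sent to the
    arbiter). Returns the best feasible allocation and its total utility."""
    best, best_value = None, float("-inf")
    for alloc in product(range(total_servers + 1), repeat=len(utilities)):
        if sum(alloc) > total_servers:
            continue  # infeasible: more servers than the data center has
        value = sum(u(r) for u, r in zip(utilities, alloc))
        if value > best_value:
            best, best_value = alloc, value
    return best, best_value

# Example with the batch utility values used later in this section
# (U2(0)=0, U2(1)=200, U2(2)=300, U2(3)=350) and a hypothetical
# transactional curve U1:
u1 = lambda r: [0, 400, 800, 900][r]   # hypothetical U1(R)
u2 = lambda r: [0, 200, 300, 350][r]   # U2(R) from the experiment
best = allocate([u1, u2], total_servers=3)   # -> ((2, 1), 1000)
```

With three servers, the arbiter gives two to the transactional workload and one to the batch workload, since 800 + 200 beats every other feasible split.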
Demand for the transactional workload D1 consists of re- [...] Squillante et al. [11] to reset the demand every 5 seconds. Given the service-level utility U1(S1), A1 uses a simple system performance model S1 = C(R1, D1) for each possible number of servers R1 to estimate the resource-level utility Û1(R1). We derived the performance model by measuring average response time with demand held constant at certain values, and performing linear interpolation between the measured points.

Application Environment A2 handles a long-running batch workload. The service level S2 is measured solely in terms of the number of servers R2 allocated to A2, and the utility function U2(S2, D2) = U2(R2) = Û2(R2) is defined as: U2(0) = 0, U2(1) = 200, U2(2) = 300, U2(3) = 350.

Results from a typical experiment with the two Application Environments and three servers are shown in Figure 3. The figure shows seven time-series plots over a period of 575 seconds. From top to bottom, they are: (1) Average demand D1 on A1; (2) Average response time S1 in A1; (3) Resource-level utility Û1(R1) for R1 = 1, 2, 3 servers for A1; (4) Total utility from the two applications (solid plot) and the utility U1(S1) obtained from A1 (dashed plot); (5) Utility U2(R2) obtained from A2; (6) Number of servers R1 allocated to A1; and (7) Number of servers R2 allocated to A2. Notable times are indicated by vertical dashed lines and labeled by letters at the top.

[Figure 3. Time series plots of transactional demand rate, attained utilities, and server allocations to each Application Environment, during a sample Unity experiment.]
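The Application Manager's transformation from service-level to resource-level utility — composing U1(S1) with a performance model C(R1, D1) built by linear interpolation between measured response times, as described above — can be sketched as follows. The measurement values and the threshold-style utility here are invented for illustration; they are not the paper's calibration data.

```python
# Sketch: resource-level utility via an interpolated performance model.
import bisect

def make_perf_model(samples):
    """samples: {demand: avg response time (ms)}, measured with demand held
    constant, for one fixed server count. Returns C(demand) by linear
    interpolation between the measured points (clamped at the ends)."""
    xs = sorted(samples)
    def C(demand):
        if demand <= xs[0]:
            return samples[xs[0]]
        if demand >= xs[-1]:
            return samples[xs[-1]]
        i = bisect.bisect_left(xs, demand)
        x0, x1 = xs[i - 1], xs[i]
        y0, y1 = samples[x0], samples[x1]
        return y0 + (y1 - y0) * (demand - x0) / (x1 - x0)
    return C

def resource_level_utility(U, models, demand):
    """Compose U(S) with each server count's performance model, yielding
    the U-hat(R) curve that is sent to the Resource Arbiter."""
    return {r: U(C(demand)) for r, C in models.items()}

# Hypothetical measurements: response time vs. demand for 1-3 servers.
models = {1: make_perf_model({50: 20.0, 100: 60.0}),
          2: make_perf_model({50: 12.0, 100: 30.0}),
          3: make_perf_model({50: 8.0, 100: 18.0})}
# Stylized sharp service-level utility: full value below a 30 ms target.
U1 = lambda s: 1000.0 if s < 30.0 else 0.0
uhat = resource_level_utility(U1, models, demand=75)
# uhat: {1: 0.0, 2: 1000.0, 3: 1000.0}
```

At this demand level the model predicts that one server misses the response-time target while two or three meet it, so the Û1(R1) curve tells the arbiter that a second server is worth 1000 units of utility and a third adds nothing.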
Initially, we set U1(S1) so that it transitions relatively sharply from 1000 to zero utility, centered at 30ms. The demand D1 begins low, allowing A1 to get a low response time and utility of nearly 1000 with one server. At time a, D1 rises, and the manager of A1 changes Û1(R1) so that two or more servers are needed to get a high utility. In response, the Arbiter reoptimizes the allocation to give two servers to A1. Just after time c, the demand rises considerably, and even three servers are not enough to reduce S1 enough to give A1 nonzero utility. Since no amount of the available servers can help A1, they are all allocated to A2. The subsequent decrease in demand allows A1 to obtain low response times with three or fewer servers, and the optimal allocation gives some servers to A1. Note that the spikes in response time for A1 at time b and just prior to 300 seconds, which are too transitory to trigger reallocation, are not caused by an increase in demand.4

4 Investigation reveals that they are due to Java garbage collection in WebSphere.

At time f, we change U1(S1) in the policy repository so that it transitions more gradually, and is centered at 40ms. A1 then determines that it can obtain high utility with fewer servers. Even at the demand peak after time f, the Manager of A1 computes Û1(R1) to reflect that three servers are sufficient to give A1 positive utility.

7. Conclusions and Future Work

Many researchers in the MAS community have recognized the advantages of an agent-based approach to building deployable solutions in a number of application domains comprising complex, distributed systems. In this paper we put forth self-managing distributed computing systems as a new and promising application domain for MAS ideas. This area is not only naturally well-suited to agent-based approaches, but it is also of immense interest and practical importance throughout the entire IT industry.

Our decentralized and distributed approach to autonomic
computing provided the basic principles underlying the development of our Unity software architecture. Through development and experimentation within Unity, we have found it to be a valuable platform for studying and testing ideas about autonomic systems. In particular, we found that utility functions provide a powerful and flexible way to allow systems to manage themselves.

We intend to expand Unity to include a wider range of functions and products, and to illuminate more of the large and interesting space of self-managing systems. Also, our prototype workload management system highlights many important practical issues that will need to be addressed in a deployed real-world system. While our implemented system shows feasibility on a small scale, there is much work remaining to be done in scaling the system to handle the complexities encountered in real-world data centers.

In future work, many of Unity's current features will be generalized. For example, we plan to expand the number of supported application environments, and learn what interface extensions may be required. Likewise, self-healing clusters will be expanded to other elements such as registries and arbiters. (This may require innovation to avoid bootstrapping problems.) We are also developing synchronization algorithms for cluster states that have transactional integrity and are robust against race conditions. Additionally, we are enhancing the user interface to include advanced policy and utility function tooling methods, to allow greater exploration within this rich space.

We plan to add flexibility to self-assembly so that the parts can form different useful wholes, closer to the ultimate dynamic vision of self-assembly. This will require standard languages and taxonomies for services offered, dependencies, registry queries, and so on. In complex environments, it will be important to avoid deadlock and circular dependencies during self-configuration. It would also be interesting to develop ways to query the parts as to what the outcome would be of a hypothetical self-assembly.

The use of utility functions will be expanded throughout the Unity system. The self-assembly process, for instance, could use utility functions to decide between various alternate system configurations. System properties like the sizes of self-healing clusters could be derived from higher-level goals (in terms of estimated reliability, say), rather than specified directly. Current hardcoded behaviors, such as bringing up each member of a self-healing cluster on a different host system, could instead be derived from higher-level principles.

Finally, we are also investigating more sophisticated approaches to resource allocation. One particular focus is on developing accurate models of switching costs that may be incurred when reassigning a server, and incorporating such costs in the optimization problem. We are also developing more complex and realistic system models combining queuing theory, simulation, and novel machine learning methods that allow the models to be learned and refined online, during the operation of the system.

References

[1] W. C. Arnold, D. W. Levine, and E. C. Snible. Autonomic manager toolkit. http://dwdemos.dfw.ibm.com/actk/common/wstkdoc/amts, 2003.
[2] D. E. Atkins, W. P. Birmingham, E. H. Durfee, E. J. Glover, T. Mullen, E. A. Rudensteiner, E. Soloway, J. M. Vidal, R. Wallace, and M. P. Wellman. Toward inquiry-based education through interacting software agents. Computer, 29(5):69-77, May 1996.
[3] C. Boutilier, R. Das, J. O. Kephart, G. Tesauro, and W. E. Walsh. Cooperative negotiation in autonomic systems using incremental utility elicitation. In Nineteenth Conference on Uncertainty in Artificial Intelligence, pages 89-97, 2003.
[4] J. S. Chase, D. C. Anderson, P. N. Thakar, and A. M. Vahdat. Managing energy and server resources in hosting centers. In 18th Symposium on Operating Systems Principles, 2001.
[5] I. Foster, C. Kesselman, J. Nick, and S. Tuecke. The physiology of the Grid: An Open Grid Services Architecture for distributed systems integration. Technical report, Open Grid Services Architecture WG, Global Grid Forum, 2002.
[6] N. R. Jennings. On agent-based software engineering. Artificial Intelligence, 117:277-296, 2000.
[7] T. Kelly. Utility-directed allocation. In First Workshop on Algorithms and Architectures for Self-Managing Systems, 2003.
[8] J. O. Kephart and D. M. Chess. The vision of autonomic computing. Computer, 36(1):41-52, 2003.
[9] S. Lalis, C. Nikolaou, D. Papadakis, and M. Marazakis. Market-driven service allocation in a QoS-capable environment. In First International Conference on Information and Computation Economies, 1998.
[10] R. Rajkumar, C. Lee, J. P. Lehoczky, and D. P. Siewiorek. Practical solutions for QoS-based resource allocation problems. In IEEE Real-Time Systems Symposium, pages 296-306, 1998.
[11] M. S. Squillante, D. D. Yao, and L. Zhang. Internet traffic: Periodicity, tail behavior and performance implications. In System Performance Evaluation: Methodologies and Applications, 1999.
[12] P. Thomas, D. Teneketzis, and J. K. MacKie-Mason. A market-based approach to optimal resource allocation in integrated-services connection-oriented networks. Operations Research, 50(4), 2002.
[13] S. Tuecke, K. Czajkowski, I. Foster, J. Frey, S. Graham, C. Kesselman, T. Maguire, T. Sandholm, P. Vanderbilt, and D. Snelling. Open Grid Services Infrastructure (OGSI) version 1.0. Technical report, Open Grid Services Infrastructure WG, Global Grid Forum, 2002.
[14] H. Yamaki, M. P. Wellman, and T. Ishida. A market-based approach to allocating QoS for multimedia applications. In Second International Conference on Multi-Agent Systems, pages 385-392, 1996.