An Autonomous Agent Approach To Query Optimization
All content following this page was uploaded by Srinath Srinivasa on 11 March 2014.
Queries may arrive at any node in the grid, requesting one or more streams. Queries are represented as relational algebra expressions over the data streams. For the purposes of grid-level optimization, this work considers three basic relational operations: projections ($q = \pi_{s_{i1}, \ldots, s_{ik}}(S_i)$), selections ($q = \sigma_{(condition)}(S_i)$) and joins ($q = S_i \bowtie S_j$). At any given grid node $x \in X$, a subset of one or more streams may be available as part of the current query execution plan. These streams can be reused to serve other queries in the vicinity without them having to go all the way to the required stream sources.

4. OPTIMIZATION OBJECTIVE
From the grid node perspective, the key optimization goal is to ensure reduced response times or latency, while from the system perspective, the key objective is to reduce the required bandwidth. These two objectives conflict with one another: to ensure minimal response times, each query would need to be satisfied with a direct connection to the required source (minimum $d$), which would lead to multiple connections to the source and increase the bandwidth usage. If the data were instead routed sequentially through all the nodes requiring it, the bandwidth required would be lower, albeit at the cost of higher latency. A bandwidth-delay product combines both requirements and is termed network usage in this work. The optimization objective is to minimize network usage.

The other parameter which influences response time is the load on a node. Nodes with heavy loads become bottlenecks, increasing the overall response time of queries. The load on a node is a combination of communication and computational load. Assuming that the computational load on a node is due to processing of incoming and outgoing data, the load on a node is proportional to the communication load itself. A simple measure used in this work to represent the communication load on a node is the number of incoming and outgoing data links of the node. Hence, the second optimization objective in this work is to balance load distribution.

Although stream optimization is a continuous process, except when referring to unfolding behavior, we shall not be concerned with the time variable while formally defining optimization objectives in the next subsections.

4.1 Network Usage
A given query $q$ at any node $x \in X$ is ultimately answered by returning a set of stream links $L(q) = \{l_1, l_2, \ldots, l_n\}$. For instance, in Figure 2, a query on node $CN_1$ requesting $s_1 \bowtie s_2 \bowtie s_3$ can be answered by forming the stream link set $\{l_1, l_2, l_3\}$. A link is a directed edge, represented as an ordered pair, $l_p = (x_p, y_p)$. Data flows from data source $y_p$ to destination $x_p$ to satisfy, in part or completely, a query at $x_p$. In Figure 2, the link $l_1$ would be represented as $(CN_1, SN_1)$.

For a link $l_i = (x_i, y_i)$, its network usage is given as

$$u(l_i) = Bandwidth(l_i) \cdot d(l_i) \tag{1}$$

where $Bandwidth(l_i)$ is the data rate of the stream $l_i$, and $d(l_i) = d(x_i, y_i)$, as described earlier, is the latency of the data stream.

[Figure 2: Query Result Generation using Stream Links]

Let $Q$ be the set of all queries incident on the grid $G$ at any instance of time. Let $L$ denote the set of all links that have been returned in response to queries in $Q$. We refer to $L$ as the “Estuary graph” or the “link graph” of the grid $G$ for the present time. The Estuary graph is formally defined as:

$$L = \bigcup_{q \in Q} L(q) \tag{2}$$

The global optimization objective on network usage is to obtain an Estuary graph such that the overall network usage is minimized. This is stated as:

$$\arg\min_{L} \sum_{l_i \in L} u(l_i) \tag{3}$$

4.2 Load Distribution
For a given link $l_p = (x_p, y_p)$, two functions $source(l_p)$ and $sink(l_p)$ are defined such that $source(l_p) = y_p$ and $sink(l_p) = x_p$.

Given a grid $G$, let $L$ be the Estuary graph at a given instance of time. For any grid node $x$, the set of incoming links $I_{\{x,L\}}$ and the set of outgoing links $O_{\{x,L\}}$ are given by

$$I_{\{x,L\}} = \{l_p \in L : sink(l_p) = x\} \tag{4}$$

$$O_{\{x,L\}} = \{l_p \in L : source(l_p) = x\} \tag{5}$$

The set of links $L_{\{x,L\}}$ contributing to the load on grid node $x$ is given by

$$L_{\{x,L\}} = I_{\{x,L\}} \cup O_{\{x,L\}} - (I_{\{x,L\}} \cap O_{\{x,L\}}) \tag{6}$$

where $I_{\{x,L\}} \cap O_{\{x,L\}}$ represents the set of self loops at $x$. A self loop is a link $l_p$ where $source(l_p) = sink(l_p)$. For instance, link $l_6$ in Figure 2 is a self loop. As the source and destination for the data are the same node, self loops are not considered for load calculation. Self loops represent situations where the stream being queried on a grid node $x$ is already available on $x$. Self loops do not add to the load or network usage, although they are required for completeness of the representation.

In Figure 2, node $CN_2$ answers the query $s_1$ by creating link $l_4$, and the subsequent query for $s_1 \bowtie s_3$ by reusing the $s_1$ data already available (local link $l_6$) and fetching data from secondary source $CN_1$ using link $l_5$. At time $t_3$ the set of links answering all queries in the grid shown in Figure 2 is given as $L = \{l_1, l_2, l_3, l_4, l_5, l_6\}$. The incoming link set for node $CN_2$ at time $t_3$ is $I_{\{CN_2,L\}} = \{l_4, l_5, l_6\}$ and the outgoing link set is $O_{\{CN_2,L\}} =$
$\{l_6\}$. Since link $l_6$ is local to node $CN_2$, it is not considered as a loading factor, resulting in the set of links contributing to load as $L_{\{CN_2,L\}} = \{l_4, l_5\}$.

Let $Q_x$ be the set of all queries incident on grid node $x$ at any given instance of time. The instantaneous load on $x$ is given by:

$$w_{\{x,L\}} = |L_{\{x,L\}}| + |Q_x| \tag{7}$$

The global optimization objective is to minimize the skew in load distribution at any given instance of time, referred to as the instantaneous load distribution. Skew in load distribution is calculated as the variance in instantaneous load across all the grid nodes. Optimization for load distribution is hence to build an Estuary graph that minimizes the variance in load distribution at any given instance. This is defined as:

$$\arg\min_{L} \sum_{x \in X} \left(w_{\{x,L\}} - \overline{w}_{\{x,L\}}\right)^2 \tag{8}$$

Here, $\overline{w}_{\{x,L\}}$ is the mean value of the instantaneous load across all nodes.

5. EMERGENT OPTIMIZATION
To enable long running, continuous query optimization in a grid, the notion of emergent optimization is proposed in this work. Grid nodes act as self-interested autonomous agents optimizing on some local property. In order to achieve local optimization, grid nodes utilize one of several strategies to connect to one another. These strategies enable a node to select the set of source nodes from which data is fetched to answer a query and are hence also called “source selection strategies.”

[Figure 3: Overview of the Emergent Optimization Process. Panel labels: Strategy Fitness; Strategy 1, Strategy 2, Strategy i; Demographics (Emergence).]

Variations in query patterns require nodes to evaluate their strategy for its effectiveness on a continuous basis. In this regard, a “fitness function” is used to determine the feasibility of a particular strategy given the grid environment. Although the fitness function works at a system-wide level, we ensure that it is simple enough to be computed by each node in a distributed fashion.

Fitness is based on a “payoff function” which determines the payoff each node receives on answering a query. The payoff depends on: (1) the query requirements, (2) the strategy used by the node and (3) the grid conditions. By selecting a payoff function which operates at the grid nodes, the system designer can control intrinsic characteristics of the grid, such as its innate preference towards one optimization objective over the other. The entire process of emergent optimization is shown in Figure 3.

Based on the payoff and fitness functions defined, some nodes may switch to a different strategy, thus altering the distribution of the different strategies in the grid. We call this the grid demographics. A strategy switch happens after every pre-defined interval called a generation. This process continues until the demographics stabilize, and the resulting demographics is said to be the emergent property which represents the best response characteristics of the grid given the open nature of the LRC queries. A more detailed explanation of the entire process follows.

5.1 Agents and Strategies
For emergent optimization, a stream grid is modeled as $G = (X, d, \chi)$, where $X$ is a set of agents representing grid nodes, $d$ is the distance function as defined earlier and $\chi$ is a pool of source selection strategies from which agents pick a given strategy in order to make decisions regarding connecting to other agents.

At any given instance of time $t$, the state of any agent $x \in X$ is given by the following attributes: $x^t = (Q_x^t, \chi_x^t)$. Here, $Q_x^t$ is the set of queries incident on node $x$ at time $t$ and $\chi_x^t \in \chi$ is the source selection strategy chosen by node $x$ at time $t$. As earlier, most of the operations are defined for a given instance of time, and we shall drop the reference to time when the context is clear.

Each source selection strategy $\chi_x$ is of the form $\chi_x : Q_x \times X \rightarrow [0, 1]$, where $\sum_{X} \chi_x = 1$. In other words, given a query, a strategy returns a probability vector over the set of all agents, using which connections are made with other agents. The unfolding behavior of an agent over time is defined as follows. For any grid node or agent $x \in X$, if $Q_x^t \neq Q_x^{t-1}$, that is, if an event (arrival of a new query or revocation of an existing query) has happened at time $t$, the agent performs one or both of the following tasks: [...]

[...] represents user queries and are not under the control of the system, $I^t(\chi_x)$ and $O^t(\chi_x)$ are dependent on (1) the state of the grid (its Estuary graph) at time $t-1$, or $L^{t-1}$, (2) the set of queries at the node $x$: $Q_x^t$, and (3) the strategy adopted by the node: $\chi_x^t$.

The temporal unfolding of an agent’s behavior can be seen as a closed-loop control system (Figure 3), where the actions taken by an agent impact grid characteristics, which in turn impact further actions taken by the agent.

5.2 Payoffs, Generations and Demographics
Once an agent $x$ chooses a source selection strategy $\chi_x \in \chi$, it retains this strategy for a pre-specified time period called a “generation.” Strategies are changed for an agent only across generations. The first time a strategy is chosen, this choice is made with a uniform or a pre-determined probability distribution across all the strategies in $\chi$. The “demographics” of the grid is the distribution of different source selection strategies adopted by the agents. The set of nodes adopting source selection strategy $\chi_i$ is given by

$$D(\chi_i) = \{x : x \in X \wedge \chi(x) = \chi_i\}$$

where $\chi(x)$ represents the strategy adopted at node $x$.
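The network usage and load definitions of Section 4 (Equations 1 and 3 to 8) can be made concrete with a small sketch. The Python below is illustrative only and is not the authors' implementation. The names `Link`, `usage`, `load_links` and `load_skew` are hypothetical, the bandwidth and distance values are made up, and the endpoints of links l2, l3 and l4 are assumptions, since the text only fixes l1 = (CN1, SN1), l5 (from CN1 to CN2) and the self loop l6 on CN2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    name: str
    sink: str                # x_p: node whose query the link serves
    source: str              # y_p: node the data is fetched from
    bandwidth: float = 1.0   # hypothetical data rate
    distance: float = 1.0    # hypothetical latency d(x_p, y_p)


def usage(link):
    """Eq. (1): bandwidth-delay product of a single link."""
    return link.bandwidth * link.distance


def total_usage(links):
    """Value of the Eq. (3) objective for one candidate Estuary graph."""
    return sum(usage(link) for link in links)


def incoming(node, links):
    """Eq. (4): names of links whose sink is `node`."""
    return {link.name for link in links if link.sink == node}


def outgoing(node, links):
    """Eq. (5): names of links whose source is `node`."""
    return {link.name for link in links if link.source == node}


def load_links(node, links):
    """Eq. (6): incoming and outgoing links, minus self loops."""
    i, o = incoming(node, links), outgoing(node, links)
    return (i | o) - (i & o)


def load(node, links, query_counts):
    """Eq. (7): contributing links plus the number of queries at the node."""
    return len(load_links(node, links)) + query_counts.get(node, 0)


def load_skew(nodes, links, query_counts):
    """Value of the Eq. (8) objective: sum of squared deviations from the mean load."""
    loads = [load(x, links, query_counts) for x in nodes]
    mean = sum(loads) / len(loads)
    return sum((w - mean) ** 2 for w in loads)


# Estuary graph of Figure 2 at time t3; endpoints marked "assumed" are not given in the text.
ESTUARY = [
    Link("l1", "CN1", "SN1"),
    Link("l2", "CN1", "SN2"),                 # assumed endpoints
    Link("l3", "CN1", "SN3"),                 # assumed endpoints
    Link("l4", "CN2", "SN1"),                 # assumed source
    Link("l5", "CN2", "CN1"),
    Link("l6", "CN2", "CN2", distance=0.0),   # self loop, adds no usage
]
QUERY_COUNTS = {"CN1": 1, "CN2": 2}           # |Q_x| per node, illustrative
```

With these inputs, `load_links("CN2", ESTUARY)` reproduces the set {l4, l5} derived in Section 4.2, because the self loop l6 appears in both the incoming and the outgoing set and is therefore subtracted.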
A payoff function $\rho$ provides a payoff for each node at every time step, based on the number of queries it has answered and/or its connections with other nodes. Agents accumulate payoffs till the end of a generation, after which the agent possibly changes its source selection strategy. A new generation then begins with all agents having a payoff of 0.

Let $\overline{P}_x$ be the average payoff per query accumulated by agent $x$ at the end of a generation with strategy $\chi_x$. From the grid perspective, the average payoff per grid node obtained by any given strategy $\chi_i$ is computed as

$$\hat{P}_{\chi_i} = \frac{\sum_{x \in X,\, \chi(x) = \chi_i} \overline{P}_x}{|D(\chi_i)|} \tag{9}$$

The “fitness” of any given strategy $\chi_i$ at the end of a generation is given by

$$\Phi(\chi_i) = \frac{\hat{P}_{\chi_i}}{\sum_{\chi_j \in \chi} \hat{P}_{\chi_j}} \tag{10}$$

At the end of a generation, agents re-evaluate their current strategies and possibly change to another strategy. The probability that a given strategy $\chi_i$ is chosen is given by $\Phi(\chi_i)$.

5.3 Source Selection Strategies
The set of source selection strategies $\chi$ is a critical element of emergent optimization. Each agent initially chooses a strategy at random from this pool. Each strategy returns a probability vector that guides the agent in making connections with other agents. The strategy itself takes the present query as input and optionally depends upon other information obtained from the grid.

A strategy may seek to optimize a single optimization parameter, like network usage or load distribution, in isolation, or consider both parameters together. In this work, we consider some simple strategies whose information requirements are small and that seek to optimize only a single optimization parameter. Such strategies are also termed singleton strategies. The three singleton strategies evaluated in the work are:

1. Distance ordering (DO): Fetch data from the nearest source having the data

2. Random ordering (RO): Fetch data from a random node having the required data

3. Load ordering (LO): Fetch data from the least loaded node having the data

Of the above, DO and LO are strategies where the probability of connecting to a given node is either 1 or 0, while in RO, given a choice of $k$ nodes that are stream sources, the probability of connecting to any one of them is $1/k$.

6. EXPERIMENTAL RESULTS
For experimental verification, a grid is simulated with 64 nodes arranged in an 8 × 8 square. 12 source nodes are distributed around the center of the grid along the periphery of a smaller square, resulting in the possibility of 4095 unique queries. 10000 queries arrive uniformly on the grid nodes at a rate of one query every unit of time. Queries seek data or a subset of the data available at the source nodes. Each query remains in the grid for a certain random amount of time and is then terminated.

In the experiments we first evaluate the effectiveness of emergent optimization using the DO strategy in reducing network usage when compared with globally optimal query plans generated by taking snapshots of the grid with every new query arrival and query revocation. We then confirm the ability of the LO strategy to distribute load evenly. We also demonstrate the ability of the grid nodes to identify the correct strategy (1) for a given global optimization objective and (2) for varying query patterns. The payoff function used in the experiments and its rationale are described in the next section.

6.1 Payoff Function
For the experiments we use a payoff function that has two payoff components, one of which is based on network usage and the other on the resulting load on data sources. We choose a payoff function that results in (1) higher payoffs for a node if its source selection leads to reduced network usage and (2) a reduction in payoffs, with the reduction proportional to the load on the data sources selected.

Any query $q_i$ on node $i$ is thought of as bringing with it a certain amount of virtual currency or income, $I(q_i)$, which is equal to the network usage if the query were to be answered directly from the source nodes. The payoff function $\rho$ is modeled on the savings incurred over the virtual currency. If $L(q_i)$ is the set of links using which $q_i$ is answered, the network usage part of $\rho$, called $\rho_U$, is the ratio of unused virtual currency to the income:

$$\rho_U = \frac{1}{I(q_i)} \cdot \Big[ I(q_i) - \sum_{l_i \in L(q_i)} u(l_i) \Big] \tag{11}$$

while the load distribution part of $\rho$, called $\rho_W$, is measured as the average load ratio over all the links required to answer a query and is given by

$$\rho_W = \frac{1}{|L(q_i)|} \sum_{l_i \in L(q_i)} \frac{w_{\{source(l_i),L\}}}{MAXLOAD(source(l_i))} \tag{12}$$

where $MAXLOAD(source(l_i))$ is the maximum number of links node $source(l_i)$ can handle. The overall payoff $\rho$ for the query is evaluated as

$$\rho = \alpha \cdot \rho_U - (1 - \alpha) \cdot \rho_W \tag{13}$$

where $\alpha \in [0, 1]$ is a configurable parameter that determines the intrinsic importance to be given by the grid to network usage or load distribution. Since $\rho_U$ and $\rho_W$ are both ratios, they can be combined in a single equation. However, they may not have the same characteristics and the impact of $\alpha$ on $\rho_U$ and $\rho_W$ need not be the same. In our grid scenario, a value of 0.2 for $\alpha$ was empirically seen to provide equal importance to network usage and load distribution.

6.2 Results
To evaluate the performance of the DO strategy, we compare network usage at any given time with the network usage resulting from the globally optimal query plan. We evaluate the LO strategy in a similar manner. Since we are interested in measuring the performance of the various strategies and the ability of the grid to select the correct strategy for a given optimization objective, we switch $\alpha$ between two extremes: 1 (payoff only for optimizing network usage) and 0 (payoff only for optimizing load distribution) to indicate the optimization objective. The grid nodes are provided with the set of three strategies: DO, RO and LO. At the start of the experiment, nodes select a strategy with equal probability.

Instantaneous Network Usage: Figure 4 compares network usage across time, between the globally optimal plan and two
varieties of emergent optimization: (a) when the objective is to optimize solely on network usage ($\alpha = 1$), and (b) when the objective is to optimize solely on load distribution ($\alpha = 0$). In these [...] network usage starts off with a high value given the random nature in which strategies are chosen initially. However, the demographics stabilize very quickly and the network usage drops very close to the optimal usage. When $\alpha = 0$, however, network usage is continually high, since the emphasis is on load distribution. This confirms the [...]

[Plot: Network Usage versus Time]
[Plot: Strategy Distribution (Optimize Network Usage); Number of Nodes per strategy (DO, RO, LO)]
[Plot: Strategy Distribution (Optimize Load Distribution); Number of Nodes per strategy (DO, RO, LO)]
[Plot: Standard Deviation of Load]

[...] of emergent optimization to select the correct strategy for a given optimization objective. It should be noted here that when $\alpha$ is 0 or 1, nodes do not select the RO strategy. This essentially shows that strategies which do not perform well for a given objective will be rooted out of the system.
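The payoff, fitness and source selection rules of Sections 5.2, 5.3 and 6.1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the simulator used in the experiments. All function names are hypothetical. Equation (11) is read as (I(q) minus the summed link usage) divided by I(q), following the description "ratio of unused virtual currency to the income". The resampling step additionally assumes that the average payoffs are non-negative with a positive sum, so that the fitness values form a valid probability distribution; the text does not say how negative payoffs are handled.

```python
import random
from collections import defaultdict


def strategy_probabilities(strategy, candidates, distance, load):
    """Probability vector over candidate source nodes for DO, RO or LO (Section 5.3)."""
    if strategy == "RO":
        return {n: 1.0 / len(candidates) for n in candidates}
    if strategy not in ("DO", "LO"):
        raise ValueError(f"unknown strategy: {strategy}")
    metric = distance if strategy == "DO" else load
    best = min(candidates, key=lambda n: metric[n])  # ties: first candidate wins (assumption)
    return {n: (1.0 if n == best else 0.0) for n in candidates}


def payoff(income, link_usages, source_loads, max_loads, alpha):
    """Eq. (13): alpha * rho_U - (1 - alpha) * rho_W for one answered query."""
    rho_u = (income - sum(link_usages)) / income                   # Eq. (11)
    ratios = [w / m for w, m in zip(source_loads, max_loads)]
    rho_w = sum(ratios) / len(ratios)                              # Eq. (12)
    return alpha * rho_u - (1.0 - alpha) * rho_w


def fitness(avg_payoff_by_agent, strategy_by_agent):
    """Eqs. (9) and (10): per-strategy mean payoff, normalised over strategies in use."""
    totals, counts = defaultdict(float), defaultdict(int)
    for agent, p in avg_payoff_by_agent.items():
        s = strategy_by_agent[agent]
        totals[s] += p
        counts[s] += 1
    p_hat = {s: totals[s] / counts[s] for s in totals}             # Eq. (9)
    norm = sum(p_hat.values())                                     # assumed > 0
    return {s: p_hat[s] / norm for s in p_hat}                     # Eq. (10)


def next_generation(agents, phi, rng):
    """Each agent picks its next strategy with probability Phi(strategy)."""
    strategies = list(phi)
    weights = [phi[s] for s in strategies]                         # assumed non-negative
    return {a: rng.choices(strategies, weights=weights)[0] for a in agents}
```

A possible use, with made-up numbers: `payoff(10.0, [2.0, 3.0], [5, 10], [25, 25], alpha=1.0)` rewards only the saved network usage, while `alpha=0.0` penalises only the load ratio of the selected sources, matching the two extremes used in Section 6.2. A strategy that no agent currently uses receives no fitness value in this sketch and therefore cannot be re-selected, which is a simplification.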
Query set 1 (Figure 8) has very little average load, and the fluctuation in load is not significant enough for the grid nodes to change their strategy. Hence, as seen in Figure 9, all the nodes select the DO strategy throughout the experiment and optimize on network usage. When the grid is subjected to query set 2, the nodes select the DO strategy initially when the average load is around 200 and switch to the LO strategy to balance the load when the average load increases to 500. Once the average load decreases to 200, the grid nodes switch back to the DO strategy and continue to optimize network usage (Figure 10). This shows that emergent optimization is able to identify and adapt to the changing grid conditions without any manual intervention. The $MAXLOAD$ parameter for each grid node in this experiment was 25 links per node. In query set 3, the initial grid load is around 500 and each node is loaded close to its $MAXLOAD$. Hence the nodes adopt the LO strategy to balance the load. The nodes continue to use the LO strategy for the entire duration of the experiment (Figure 11) as the load on the grid remains high.

This set of experiments indicates that the principle of emergent optimization is able to select the best strategy among the set of strategies it is provided with and is also able to address the varying [...]

[Figure 10: Strategy Distribution for Query Set 2. Axes: Number of Nodes versus Generations; series DO, RO, LO.]
[Plot: Strategy Distribution for Query Set 3. Axes: Number of Nodes versus Generations; series DO, RO, LO.]
[Figure 8: Query Sets. Axes: Average Number of Queries versus Time.]

7. CONCLUSION AND FUTURE WORK
This work shows that the primary issue in optimizing open-world queries is the dynamic nature of the queries and the dependency of the optimal query plan on the currently executing queries in the grid. Emergent optimization in such a scenario provides an efficient alternative to optimizing such systems. Although this work provides insights into optimizing open queries in stream grids, the problem is fundamentally one of optimization in the presence of uncertainty and a lot of scope exists for further work. We also envisage a variety of more strategies of different levels of sophistication than the basic ones considered in this work. It remains to be seen if there are specific principles behind designing strategies for handling open-world queries.
[Figure 9: Strategy Distribution for Query Set 1. Axes: Number of Nodes versus Generations; series DO, RO, LO.]