
Readings in Database Systems

4th edition

edited by Joseph M. Hellerstein and Michael Stonebraker

The MIT Press


Cambridge, Massachusetts
London, England
© 2005 Massachusetts Institute of Technology

All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including
photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

MIT Press books may be purchased at special quantity discounts for business or sales promotional use. For information, please
email special_sales@mitpress.mit.edu or write to Special Sales Department, The MIT Press, 5 Cambridge Center, Cambridge,
MA 02142.

Printed and bound in the United States of America.

ISBN: 0-262-69314-3
Library of Congress Control Number: 2004113624
Contents

Preface ix

Chapter 1: Data Models and DBMS Architecture

What Goes Around Comes Around 2
Michael Stonebraker and Joseph M. Hellerstein
Anatomy of a Database System 42
Joseph M. Hellerstein and Michael Stonebraker

Chapter 2: Query Processing

Introduction 96
Access Path Selection in a Relational Database Management System 103
P. Griffiths Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price
Join Processing in Database Systems with Large Main Memories 115
Leonard D. Shapiro
Parallel Database Systems: The Future of High Performance Database Systems 141
David DeWitt and Jim Gray
Encapsulation of Parallelism in the Volcano Query Processing System 155
Goetz Graefe
AlphaSort: A RISC Machine Sort 165
Chris Nyberg, Tom Barclay, Zarka Cvetanovic, Jim Gray, and Dave Lomet
R* Optimizer Validation and Performance Evaluation for Distributed Queries 175
Lothar F. Mackert and Guy M. Lohman
Mariposa: A Wide-Area Distributed Database System 186
Michael Stonebraker, Paul M. Aoki, Witold Litwin, Avi Pfeffer, Adam Sah, Jeff Sidell, Carl Staelin,
and Andrew Yu

Chapter 3: Data Storage and Access Methods

Introduction 202
The R*-tree: An Efficient and Robust Access Method for Points and Rectangles 207
Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger
Operating System Support for Database Management 217
Michael Stonebraker
The Five-Minute Rule Ten Years Later, and Other Computer Storage Rules of Thumb 224
Jim Gray and Goetz Graefe
A Case for Redundant Arrays of Inexpensive Disks (RAID) 230
David A. Patterson, Garth Gibson, and Randy H. Katz

Chapter 4: Transaction Management

Introduction 238
Granularity of Locks and Degrees of Consistency in a Shared Data Base 244
Jim N. Gray, Raymond A. Lorie, Gianfranco R. Putzolu, and Irving L. Traiger
On Optimistic Methods for Concurrency Control 274
H. T. Kung and John T. Robinson
Concurrency Control Performance Modeling: Alternatives and Implications 288
Rakesh Agrawal, Michael J. Carey, and Miron Livny
Efficient Locking for Concurrent Operations on B-Trees 334
Philip L. Lehman and S. Bing Yao
ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks
Using Write-Ahead Logging 355
C. Mohan, Don Haderle, Bruce Lindsay, Hamid Pirahesh, and Peter Schwarz
Transaction Management in the R* Distributed Database Management System 424
C. Mohan, Bruce Lindsay, and R. Obermarck
The Dangers of Replication and a Solution 443
Jim Gray, Pat Helland, Patrick O’Neil, and Dennis Shasha

Chapter 5: Extensibility

Introduction 453
Inclusion of New Types In Relational Data Base Systems 459
Michael Stonebraker
Generalized Search Trees for Database Systems 467
Joseph M. Hellerstein, Jeffrey F. Naughton, and Avi Pfeffer
Grammar-like Functional Rules for Representing Query Optimization Alternatives 479
Guy M. Lohman

Chapter 6: Database Evolution

Introduction 489
AutoAdmin “What-if” Index Analysis Utility 492
Surajit Chaudhuri and Vivek Narasayya
Applying Model Management to Classical Meta Data Problems 504
Philip A. Bernstein
Algorithms for Creating Indexes for Very Large Tables Without Quiescing Updates 516
C. Mohan and Inderpal Narang

Chapter 7: Data Warehousing

Introduction 526

An Overview of Data Warehousing and OLAP Technology 532
Surajit Chaudhuri and Umeshwar Dayal
Improved Query Performance with Variant Indexes 542
Patrick O’Neil and Dallan Quass
DataCube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals 554
Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don Reichart, and Murali Venkatrao
An Array-Based Algorithm for Simultaneous Multidimensional Aggregates 579
Yihong Zhao, Prasad M. Deshpande, and Jeffrey F. Naughton
Deriving Production Rules for Incremental View Maintenance 591
Stefano Ceri and Jennifer Widom
Informix under CONTROL: Online Query Processing 604
Joseph M. Hellerstein, Ron Avnur, and Vijayshankar Raman
DynaMat: A Dynamic View Management System for Data Warehouses 638
Yannis Kotidis and Nick Roussopoulos

Chapter 8: Data Mining

Introduction 650
BIRCH: An Efficient Data Clustering Method for Very Large Databases 656
Tian Zhang, Raghu Ramakrishnan, and Miron Livny
SPRINT: A Scalable Parallel Classifier for Data Mining 668
John Shafer, Rakesh Agrawal, and Manish Mehta
Fast Algorithms for Mining Association Rules 680
Rakesh Agrawal and Ramakrishnan Srikant
Efficient Evaluation of Queries with Mining Predicates 693
Surajit Chaudhuri, Vivek Narasayya, and Sunita Sarawagi

Chapter 9: Web Services and Data Bases

Introduction 705
Combining Systems and Databases: A Search Engine Retrospective 711
Eric A. Brewer
The Anatomy of a Large-Scale Hypertextual Web Search Engine 725
Sergey Brin and Lawrence Page
The BINGO! System for Information Portal Generation and Expert Web Search 745
Sergej Sizov, Michael Biwer, Jens Graupmann, Stefan Siersdorfer, Martin Theobald, Gerhard Weikum,
and Patrick Zimmer
Data Management in Application Servers 757
Dean Jacobs
Querying Semi-Structured Data 768
Serge Abiteboul

DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases 786
Roy Goldman and Jennifer Widom
NiagaraCQ: A Scalable Continuous Query System for Internet Databases 796
Jianjun Chen, David J. DeWitt, Feng Tian, and Yuan Wang

Chapter 10: Stream-Based Data Management

Introduction 808
Scalable Trigger Processing 814
Eric N. Hanson, Chris Carnes, Lan Huang, Mohan Konyala, Lloyd Noronha, Sashi Parthasarathy,
J. B. Park, and Albert Vernon
The Design and Implementation of a Sequence Database System 824
Praveen Seshadri, Miron Livny, and Raghu Ramakrishnan
Eddies: Continuously Adaptive Query Processing 836
Ron Avnur and Joseph M. Hellerstein
Retrospective on Aurora 848
Hari Balakrishnan, Magdalena Balazinska, Don Carney, Uğur Çetintemel, Mitch Cherniack, Christian
Convey, Eddie Galvez, Jon Salz, Michael Stonebraker, Nesime Tatbul, Richard Tibbetts, and Stan Zdonik

Sources 862
Preface

This fourth edition of Readings in Database Systems is being issued at an interesting time in the history
of the field. The database industry has undergone significant consolidation in recent years. It is now
dominated by three companies, two of which are not database vendors per se. IBM and Microsoft man-
age large portfolios of products and services; database systems are one—but only one—of their crown
jewels. The third major player, Oracle, is nominally a database company, but is as active in enterprise
applications as it is in core database systems. The era of the “database wars” is over, and it has been a
long time since a database startup company has made major noise. The argument is sometimes made that
database management systems—like many other components in the computing industry—are a victim of
their own success: they have reached a level of maturity at which significant industrial innovation has
become impossible.

Even if this were an accurate assessment of the entrenched database industry, it is a very narrow view of
database research. The research field itself is healthier than it has ever been. There has been a surge of
database faculty hiring in recent years, including at some of the leading research institutions that tradi-
tionally ignored the field. New conferences have emerged both in database systems design, and in more
algorithmic fields like data mining. Lessons from the database literature are being applied in a host of
forward-looking research areas at universities, from bioinformatics to sensor networks to next-generation
Internet architectures.

This external interest in database technologies is not confined to academia. Industrial software systems
are increasingly turning to database system innovations to solve other problems. Rumor has it, for exam-
ple, that Microsoft’s next operating system will have a single unified store for all files and data, based on
their database engine. Web-based e-commerce services depend to a large extent on transactional messag-
ing technologies developed in the database community. Text-based web services like search engines also
owe a debt to database innovations in parallel query processing. The list goes on.

It would seem, then, that while the core industrial database products have gelled, the ideas that they
encapsulate have become increasingly influential. A much more optimistic and realistic view of database
research is that the field is in a position to make a bigger impact on computing in the large than it ever
has before, in part because the community has solved many of its own challenges and is courting other
areas for collaborations. This cross-fertilization could result in major changes in the traditional database
industry, and in other aspects of computing.

This book is intended to provide software technologists—both professionals and students—with a


grounding in database research past and present, and a technical context for understanding new innova-
tions. It is also designed to be a reference for anyone already active in database systems. This set of read-
ings represents what we perceive to be the most important issues in the database area: the core material
for any DBMS professional to study.

The book opens with two introductory articles we wrote to set the stage for the research papers collected
here. The first article presents a historical perspective on the design of data models and query languages;
the second provides an architectural overview of the anatomy of a database system. These articles are
intended to provide an organized, modern introduction to basic knowledge of the field, which in previous
editions was represented by a sampling of seminal research papers from the late Ted Codd and the pio-
neering INGRES and System R projects. A true database aficionado should still read those original
papers [Cod70, ABC+76, SWK76, Sto80, CPS+81], since they give a snapshot of the excitement and
challenges of the time. However, we felt that after three decades it was hard for readers to get a substan-
tive basis for the field in its current richness by reading the early papers. Hence with some notable regret
we chose not to include them in this edition.

For the remaining papers we have selected, we provide chapter introductions to discuss the context, moti-
vation, and, when relevant, the controversy in the area. These introductions summarize the comments we
make during lectures in our graduate courses, and place the papers in the broader perspective of database
research. The comments are often explicitly intended to be opinions, not necessarily statements of fact—
they are intended as conversation-starters. We hope this style encourages students and independent read-
ers to critically evaluate both the papers and our editorial remarks.

This edition of the book contains a host of new papers, including a number of chapters in new areas.
Four of the papers were written expressly for the book: the two introductory articles, Brewer’s paper on
search engine architecture, and Jacobs’ paper on application servers. The remaining papers we chose
from both the classical literature and from recent hot topics. We selected papers based on our assessment
both of the quality of research and its potential for lasting importance. We have tried to assemble a col-
lection of papers that are both seminal in nature and accessible to a reader who has a basic familiarity
with database systems. We often had two or more papers to choose from. In such cases we selected what
we felt was the best one or the one discussing the broadest variety of issues. In some areas such as trans-
action management, all of the research is very detail-oriented. In these cases we tried to favor papers that
are accessible. In areas like data mining with a strong mathematical component, we tried to select papers
that are both accessible to software systems experts, and that deal non-trivially with systems challenges.

This book has been greatly improved by the input of many colleagues, including: Paul Aoki, Eric
Brewer, David DeWitt, Mike Franklin, Johannes Gehrke, Jim Gray, James Hamilton, Wei Hong, Guy
Lohman, Sam Madden, Chris Olston, Tamer Ozsu, Raghu Ramakrishnan, Andreas Reuter, and Stuart
Russell. We particularly thank Eric Brewer and Dean Jacobs for their contributions of new material.
Thanks are also due to the students of CS286 and CS262 at Berkeley, and 689.3 at MIT; their comments
have been a major influence on our choice of papers and our presentation of the material.

References

[ABC+76] Morton M. Astrahan, Mike W. Blasgen, Donald D. Chamberlin, Kapali P. Eswaran, Jim
Gray, Patricia P. Griffiths, W. Frank King III, Raymond A. Lorie, Paul R. McJones, James W. Mehl,
Gianfranco R. Putzolu, Irving L. Traiger, Bradford W. Wade, and Vera Watson. System R: Relational
Approach to Database Management. ACM Transactions on Database Systems (TODS), 1(2):97-137,
1976.

[CPS+81] Donald D. Chamberlin, Franco Putzolu, Patricia Griffiths Selinger, Mario Schkolnick, Donald
R. Slutz, Irving L. Traiger, Bradford W. Wade, Robert A. Yost, Morton M. Astrahan, Michael W. Blasgen,
James N. Gray, W. Frank King, Bruce G. Lindsay, Raymond Lorie, James W. Mehl and Thomas G.
Price. A History and Evaluation of System R. Communications of the ACM, 24(10):632-646, 1981.

[Cod70] E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Communications of the
ACM, 13(6):377-387, June 1970.

[SWK76] M.R. Stonebraker, E. Wong, and P. Kreps. The Design and Implementation of INGRES. ACM
Transactions on Database Systems (TODS), 1(3):189-222, September 1976.

[Sto80] M. Stonebraker. Retrospection on a Database System. ACM Transactions on Database Systems
(TODS), 5(2):225-240, 1980.
Chapter 1
Data Models and DBMS Architecture
What Goes Around Comes Around

Michael Stonebraker
Joseph M. Hellerstein

Abstract

This paper provides a summary of 35 years of data model proposals, grouped into 9
different eras. We discuss the proposals of each era, and show that there are only a few
basic data modeling ideas, and most have been around a long time. Later proposals
inevitably bear a strong resemblance to certain earlier proposals. Hence, it is a
worthwhile exercise to study previous proposals.

In addition, we present the lessons learned from the exploration of the proposals in each
era. Most current researchers were not around for many of the previous eras, and have
limited (if any) understanding of what was previously learned. There is an old adage that
he who does not understand history is condemned to repeat it. By presenting “ancient
history”, we hope to allow future researchers to avoid replaying history.

Unfortunately, the main proposal in the current XML era bears a striking resemblance to
the CODASYL proposal from the early 1970’s, which failed because of its complexity.
Hence, the current era is replaying history, and “what goes around comes around”.
Hopefully the next era will be smarter.

I Introduction

Data model proposals have been around since the late 1960’s, when the first author
“came on the scene”. Proposals have continued with surprising regularity for the
intervening 35 years. Moreover, many of the current day proposals have come from
researchers too young to have learned from the discussion of earlier ones. Hence, the
purpose of this paper is to summarize 35 years’ worth of “progress” and point out what
should be learned from this lengthy exercise.

We present data model proposals in nine historical epochs:

Hierarchical (IMS): late 1960’s and 1970’s
Network (CODASYL): 1970’s
Relational: 1970’s and early 1980’s
Entity-Relationship: 1970’s
Extended Relational: 1980’s
Semantic: late 1970’s and 1980’s
Object-oriented: late 1980’s and early 1990’s
Object-relational: late 1980’s and early 1990’s
Semi-structured (XML): late 1990’s to the present

In each case, we discuss the data model and associated query language, using a neutral
notation. Hence, we will spare the reader the idiosyncratic details of the various
proposals. We will also attempt to use a uniform collection of terms, again in an attempt
to limit the confusion that might otherwise occur.

Throughout much of the paper, we will use the standard example of suppliers and parts,
from [CODD70], which we write for now in relational form in Figure 1.

Supplier (sno, sname, scity, sstate)
Part (pno, pname, psize, pcolor)
Supply (sno, pno, qty, price)

A Relational Schema
Figure 1

Here we have Supplier information, Part information and the Supply relationship to
indicate the terms under which a supplier can supply a part.

II IMS Era

IMS was released around 1968, and initially had a hierarchical data model. It understood
the notion of a record type, which is a collection of named fields with their associated
data types. Each instance of a record type is forced to obey the data description
indicated in the definition of the record type. Furthermore, some subset of the named
fields must uniquely specify a record instance, i.e. they are required to be a key. Lastly,
the record types must be arranged in a tree, such that each record type (other than the
root) has a unique parent record type. An IMS data base is a collection of instances of
record types, such that each instance, other than root instances, has a single parent of the
correct record type.

This requirement of tree-structured data presents a challenge for our sample data, because
we are forced to structure it in one of the two ways indicated in Figure 2. These
representations share two common undesirable properties:

1) Information is repeated. In the first schema, Part information is repeated for
each Supplier who supplies the part. In the second schema, Supplier information
is repeated for each part he supplies. Repeated information is undesirable,
because it offers the possibility for inconsistent data. For example, a repeated
data element could be changed in some, but not all, of the places it appears,
leading to an inconsistent data base.
2) Existence depends on parents. In the first schema it is impossible for there to be
a part that is not currently supplied by anybody. In the second schema, it is
impossible to have a supplier which does not currently supply anything. There is
no support for these “corner cases” in a strict hierarchy.

Schema 1: root record type Supplier (sno, sname, scity, sstate),
          with child record type Part (pno, pname, psize, pcolor, qty, price)

Schema 2: root record type Part (pno, pname, psize, pcolor),
          with child record type Supplier (sno, sname, scity, sstate, qty, price)

Two Hierarchical Organizations
Figure 2

IMS chose a hierarchical data base because it facilitates a simple data manipulation
language, DL/1. Every record in an IMS data base has a hierarchical sequence key
(HSK). Basically, an HSK is derived by concatenating the keys of ancestor records, and
then adding the key of the current record. HSK defines a natural order of all records in
an IMS data base, basically depth-first, left-to-right. DL/1 intimately used HSK order for
the semantics of commands. For example, the “get next” command returns the next
record in HSK order. Another use of HSK order is the “get next within parent”
command, which explores the subtree underneath a given record in HSK order.
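
As a concrete illustration (our own, not taken from the IMS manuals), consider the first
schema of Figure 2, with Supplier as the root record type and Part as its child. Each
root record’s HSK is simply its own key, and each Part record’s HSK is the key of its
parent Supplier followed by its own key:

    Supplier 16            HSK = (16)
        Part 27            HSK = (16, 27)
        Part 31            HSK = (16, 31)
    Supplier 24            HSK = (24)
        Part 27            HSK = (24, 27)

A “get next” issued after reading Supplier 16 would return Part 27, then Part 31, then
Supplier 24, exactly the depth-first, left-to-right order described above.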

Using the first schema, one can find all the red parts supplied by Supplier 16 as:

Get unique Supplier (sno = 16)
Until failure do
     Get next within parent (pcolor = red)
Enddo

The first command finds Supplier 16. Then we iterate through the subtree underneath
this record in HSK order, looking for red parts. When the subtree is exhausted, an error
is returned.

Notice that DL/1 is a “record-at-a-time” language, whereby the programmer constructs an
algorithm for solving his query, and then IMS executes this algorithm. Often there are
multiple ways to solve a query. Here is another way to solve the above specification:

Until failure do
     Get next Part (pcolor = red)
Enddo

One might think that the second solution is clearly inferior to the first one; in fact, if
there is only one supplier in the data base (number 16), the second solution will
outperform the first. The DL/1 programmer must make such optimization tradeoffs.

IMS supported four different storage formats for hierarchical data. Basically, root records
can either be:

Stored sequentially
Indexed in a B-tree using the key of the record
Hashed using the key of the record

Dependent records are found from the root using either:

Physically sequential placement
Various forms of pointers.

Some of the storage organizations impose restrictions on DL/1 commands. For example
the purely sequential organization will not support record inserts. Hence, it is appropriate
only for batch processing environments in which a change list is sorted in HSK order and
then a single pass of the data base is made, the changes inserted in the correct place, and a
new data base written. This is usually referred to as “old-master-new-master” processing.
In addition, the storage organization that hashes root records on a key cannot support
“get next”, because it has no easy way to return hashed records in HSK order.

These various “quirks” in IMS are designed to avoid operations that would have
impossibly bad performance. However, this decision comes at a price: One cannot freely
change IMS storage organizations to tune a data base application because there is no
guarantee that the DL/1 programs will continue to run.

The ability of a data base application to continue to run, regardless of what tuning is
performed at the physical level, will be called physical data independence. Physical
data independence is important because a DBMS application is not typically written all at
once. As new programs are added to an application, the tuning demands may change,
and better DBMS performance could be achieved by changing the storage organization.
IMS has chosen to limit the amount of physical data independence that is possible.

In addition, the logical requirements of an application may change over time. New
record types may be added, because of new business requirements or because of new
government requirements. It may also be desirable to move certain data elements from
one record type to another. IMS supports a certain level of logical data independence,
because DL/1 is actually defined on a logical data base, not on the actual physical data
base that is stored. Hence, a DL/1 program can be written initially by defining the logical
data base to be exactly the same as the physical data base. Later, record types can be added
to the physical data base, and the logical data base redefined to exclude them. Hence, an
IMS data base can grow with new record types, and the initial DL/1 program will
continue to operate correctly. In general, an IMS logical data base can be a subtree of a
physical data base.

It is an excellent idea to have the programmer interact with a logical abstraction of the
data, because this allows the physical organization to change, without compromising the
runability of DL/1 programs. Logical and physical data independence are important
because DBMS applications have a much longer lifetime (often a quarter century or more)
than the data on which they operate. Data independence will allow the data to change
without requiring costly program maintenance.

One last point should be made about IMS. Clearly, our sample data is not amenable to a
tree structured representation as noted earlier. Hence, there was quickly pressure on IMS
to represent our sample data without the redundancy or dependencies mentioned above.
IMS responded by extending the notion of logical data bases beyond what was just
described.

Data base 1: root record type Supplier (sno, sname, scity, sstate),
             with child record type Supply (pno, qty, price)

Data base 2: a single record type Part (pno, pname, psize, pcolor)

Two IMS Physical Data Bases
Figure 3

Suppose one constructs two physical data bases, one containing only Part information
and the second containing Supplier and Supply information as shown in the diagram of
Figure 3. Of course, DL/1 programs are defined on trees; hence they cannot be used
directly on the structures of Figure 3. Instead, IMS allowed the definition of the logical
data base shown in Figure 4. Here, the Supply and Part record types from two different
data bases are “fused” (joined) on the common value of part number into the hierarchical
structure shown.

Basically, the structure of Figure 3 is actually stored, and one can note that there is no
redundancy and no bad existence dependencies in this structure. The programmer is
presented with the hierarchical view shown in Figure 4, which supports standard DL/1
programs.

Root record type: Supplier (sno, sname, scity, sstate)
Child record type: Supply (pno, qty, price), fused on pno with Part (pno, pname, psize, pcolor)

An IMS Logical Data Base
Figure 4

Speaking generally, IMS allows two different tree-structured physical data bases to be
“grafted” together into a logical data base. There are many restrictions (for example in
the use of the delete command) and considerable complexity to this use of logical data
bases, but it is a way to represent non-tree structured data in IMS.

The complexity of these logical data bases will presently be seen to be pivotal in
determining how IBM decided to support relational data bases a decade later.

We will summarize the lessons learned so far, and then turn to the CODASYL proposal.

Lesson 1: Physical and logical data independence are highly desirable

Lesson 2: Tree structured data models are very restrictive

Lesson 3: It is a challenge to provide sophisticated logical reorganizations of tree structured data

Lesson 4: A record-at-a-time user interface forces the programmer to do manual query
optimization, and this is often hard.

III CODASYL Era

In 1969 the CODASYL (Committee on Data Systems Languages) committee released
their first report [CODA69], and then followed in 1971 [CODA71] and 1973 [CODA73]
with language specifications. CODASYL was an ad-hoc committee that championed a
network data model along with a record-at-a-time data manipulation language.

This model organized a collection of record types, each with keys, into a network, rather
than a tree. Hence, a given record instance could have multiple parents, rather than a
single one, as in IMS. As a result, our Supplier-Parts-Supply example could be
represented by the CODASYL network of Figure 5.

Record types: Supplier (sno, sname, scity, sstate), Part (pno, pname, psize, pcolor),
and Supply (qty, price). The set Supplies runs from owner Supplier to child Supply;
the set Supplied_by runs from owner Part to child Supply.

A CODASYL Network
Figure 5

Here, we notice three record types arranged in a network, connected by two named arcs,
called Supplies and Supplied_by. A named arc is called a set in CODASYL, though it is
not technically a set at all. Rather it indicates that for each record instance of the owner
record type (the tail of the arrow) there is a relationship with zero or more record
instances of the child record type (the head of the arrow). As such, it is a 1-to-n
relationship between owner record instances and child record instances.

A CODASYL network is a collection of named record types and named set types that
form a connected graph. Moreover, there must be at least one entry point (a record type
that is not a child in any set). A CODASYL data base is a collection of record instances
and set instances that obey this network description.

Notice that Figure 5 does not have the existence dependencies present in a hierarchical
data model. For example, it is ok to have a part that is not supplied by anybody. This
will merely be an empty instance of the Supplied_by set. Hence, the move to a network
data model solves many of the restrictions of a hierarchy. However, there are still
situations that are hard to model in CODASYL. Consider, for example, data about a
marriage ceremony, which is a 3-way relationship between a bride, a groom, and a
minister. Because CODASYL sets are only two-way relationships, one is forced into the
data model indicated in Figure 6.

Record types Bride, Groom, and Minister each own a set (Participates-1, Participates-2,
and Participates-3, respectively) whose child record type is Ceremony.

A CODASYL Solution
Figure 6

This solution requires three binary sets to express a three-way relationship, and is
somewhat unnatural. Although much more flexible than IMS, the CODASYL data
model still had limitations.

The CODASYL data manipulation language is a record-at-a-time language whereby one
enters the data base at an entry point and then navigates to desired data by following sets.
To find the red parts supplied by Supplier 16 in CODASYL, one can use the following
code:

Find Supplier (SNO = 16)
Until no-more {
     Find next Supply record in Supplies
     Find owner Part record in Supplied_by
     Get current record
     -- check for red --
}

One enters the data base at supplier 16, and then iterates over the members of the
Supplies set. This will yield a collection of Supply records. For each one, the owner in
the Supplied_by set is identified, and a check for redness performed.

The CODASYL proposal suggested that the records in each entry point be hashed on the
key in the record. Several implementations of sets were proposed that entailed various
combinations of pointers between the parent records and child records.

The CODASYL proposal provided essentially no physical data independence. For
example, the above program fails if the key (and hence the hash storage) of the Supplier
record is changed from sno to something else. In addition, no logical data independence
is provided, since the schema cannot change without affecting application programs.

The move to a network model has the advantage that no kludges are required to
implement graph-structured data, such as our example. However, the CODASYL model
is considerably more complex than the IMS data model. In IMS a programmer navigates
in a hierarchical space, while a CODASYL programmer navigates in a multi-dimensional
hyperspace. In IMS the programmer must only worry about his current position in the
data base, and the position of a single ancestor (if he is doing a “get next within parent”).

In contrast, a CODASYL programmer must keep track of:

The last record touched by the application
The last record of each record type touched
The last record of each set type touched

The various CODASYL DML commands update these currency indicators. Hence, one
can think of CODASYL programming as moving these currency indicators around a
CODASYL data base until a record of interest is located. Then, it can be fetched. In
addition, the CODASYL programmer can suppress currency movement if he desires.
Hence, one way to think of a CODASYL programmer is that he should program looking
at a wall map of the CODASYL network that is decorated with various colored pins
indicating currency. In his 1973 Turing Award lecture, Charlie Bachman called this
“navigating in hyperspace” [BACH73].

Hence, the CODASYL proposal trades increased complexity for the possibility of easily
representing non-hierarchical data. CODASYL offers poorer logical and physical data
independence than IMS.

There are also some more subtle issues with CODASYL. For example, in IMS each data
base could be independently bulk-loaded from an external data source. However, in
CODASYL, all the data was typically in one large network. This much larger object had
to be bulk-loaded all at once, leading to very long load times. Also, if a CODASYL data
base became corrupted, it was necessary to reload all of it from a dump. Hence, crash
recovery tended to be more involved than if the data was divided into a collection of
independent data bases.

In addition, a CODASYL load program tended to be complex because large numbers of
records had to be assembled into sets, and this usually entailed many disk seeks. As
such, it was usually important to think carefully about the load algorithm to optimize
performance. Hence, there was no general purpose CODASYL load utility, and each
installation had to write its own. This complexity was much less important in IMS.

Hence, the lessons learned in CODASYL were:

Lesson 5: Networks are more flexible than hierarchies but more complex

Lesson 6: Loading and recovering networks is more complex than hierarchies

IV Relational Era

Against this backdrop, Ted Codd proposed his relational model in 1970 [CODD70]. In a
conversation with him years later, he indicated that the driver for his research was the fact
that IMS programmers were spending large amounts of time doing maintenance on IMS
applications, when logical or physical changes occurred. Hence, he was focused on
providing better data independence.

His proposal was threefold:

Store the data in a simple data structure (tables)
Access it through a high level set-at-a-time DML
No need for a physical storage proposal

With a simple data structure, one has a better chance of providing logical data
independence. With a high level language, one can provide a high degree of physical
data independence. Hence, there is no need to specify a storage proposal, as was required
in both IMS and CODASYL.
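
For example, the request that required the record-at-a-time DL/1 and CODASYL programs
shown earlier, finding the red parts supplied by Supplier 16, becomes a single declarative
statement against the schema of Figure 1. (We write it here in SQL for concreteness; SQL
itself enters the story a bit later in this section.)

    Select Part.pno, Part.pname
    From Part, Supply
    Where Part.pno = Supply.pno
    And Supply.sno = 16
    And Part.pcolor = 'red'

The programmer states what is wanted; the choice of algorithm, the issue the DL/1
programmer had to agonize over, is left to the system.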

Moreover, the relational model has the added advantage that it is flexible enough to
represent almost anything. Hence, the existence dependencies that plagued IMS can be
easily handled by the relational schema shown earlier in Figure 1. In addition, the three-
way marriage ceremony that was difficult in CODASYL is easily represented in the
relational model as:

Ceremony (bride-id, groom-id, minister-id, other-data)

Codd made several (increasingly sophisticated) relational model proposals over the years
[CODD79, CODDXX]. Moreover, his early DML proposals were the relational calculus
(data language/alpha) [CODD71a] and the relational algebra [CODD72a]. Since Codd
was originally a mathematician (and previously worked on cellular automata), his DML
proposals were rigorous and formal, but not necessarily easy for mere mortals to
understand.

Codd’s proposal immediately touched off “the great debate”, which lasted for a good part
of the 1970’s. This debate raged at SIGMOD conferences (and its predecessor,
SIGFIDET). On the one side, there was Ted Codd and his “followers” (mostly
researchers and academics) who argued the following points:

a) Nothing as complex as CODASYL can possibly be a good idea
b) CODASYL does not provide acceptable data independence
c) Record-at-a-time programming is too hard to optimize
d) CODASYL and IMS are not flexible enough to easily represent common situations
(such as marriage ceremonies)

On the other side, there was Charlie Bachman and his “followers” (mostly DBMS
practitioners) who argued the following:

a) COBOL programmers cannot possibly understand the new-fangled relational languages
b) It is impossible to implement the relational model efficiently
c) CODASYL can represent tables, so what’s the big deal?

The highlight (or lowlight) of this discussion was an actual debate at SIGMOD ’74
between Codd and Bachman and their respective “seconds” [RUST74]. One of us was in
the audience, and it was obvious that neither side articulated their position clearly. As a
result, neither side was able to hear what the other side had to say.

In the next couple of years, the two camps modified their positions (more or less) as
follows:

Relational advocates

a) Codd is a mathematician, and his languages are not the right ones. SQL [CHAM74]
and QUEL [STON76] are much more user friendly.

b) System R [ASTR76] and INGRES [STON76] prove that efficient implementations of
Codd’s ideas are possible. Moreover, query optimizers can be built that are competitive
with all but the best programmers at constructing query plans.

c) These systems prove that physical data independence is achievable. Moreover,
relational views [STON75] offer vastly enhanced logical data independence, relative to
CODASYL (a small sketch of a view appears after this list).

d) Set-at-a-time languages offer substantial programmer productivity improvements,
relative to record-at-a-time languages.
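
To make the claim about views concrete, here is a small sketch (ours, in modern SQL
syntax) against the schema of Figure 1:

    Create View Supplier_location (sno, sname, scity) As
    Select sno, sname, scity
    From Supplier

If the Supplier table is later reorganized, for example split into two tables, the DBA can
redefine Supplier_location as a join of the pieces, and every program written against the
view continues to run. This is precisely the kind of logical data independence that
CODASYL could not offer.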

CODASYL advocates

a) It is possible to specify set-at-a-time network languages, such as LSL [TSIC76], that
provide complete physical data independence and the possibility of better logical data
independence.

b) It is possible to clean up the network model [CODA78], so it is not so arcane.

Hence, both camps responded to the criticisms of the other camp. The debate then died
down, and attention focused on the commercial marketplace to see what would happen.

Fortuitously for the relational camp, the minicomputer revolution was occurring, and
VAXes were proliferating. They were an obvious target for the early commercial
relational systems, such as Oracle and INGRES. Happily for the relational camp, the
major CODASYL systems, such as IDMS from Cullinane Corp., were written in IBM
assembler, and were not portable. Hence, the early relational systems had the VAX
market to themselves. This gave them time to improve the performance of their products,
and the success of the VAX market went hand-in-hand with the success of relational
systems.

On mainframes, a very different story was unfolding. IBM sold a derivative of System R
on VM/370 and a second derivative on VSE, their low end operating system. However,
neither platform was used by serious business data processing users. All the action was
on MVS, the high-end operating system. Here, IBM continued to sell IMS, Cullinane
successfully sold IDMS, and relational systems were nowhere to be seen.

Hence, VAXes were a relational market and mainframes were a non-relational market.
At the time all serious data management was done on mainframes.

This state of affairs changed abruptly in 1984, when IBM announced the upcoming
release of DB/2 on MVS. In effect, IBM moved from saying that IMS was their serious
DBMS to a dual data base strategy, in which both IMS and DB/2 were declared strategic.
Since DB/2 was the new technology and was much easier to use, it was crystal clear to
everybody who the long-term winner was going to be.

IBM’s signal that it was deadly serious about relational systems was a watershed
moment. First, it ended once-and-for-all “the great debate”. Since IBM held vast
marketplace power at the time, they effectively announced that relational systems had
won and CODASYL and hierarchical systems had lost. Soon after, Cullinane and IDMS
went into a marketplace swoon. Second, they effectively declared that SQL was the de
facto standard relational language. Other (substantially better) query languages, such as
QUEL, were immediately dead. For a scathing critique of the semantics of SQL, consult
[DATE84].

A little known fact must be discussed at this point. It would have been natural for IBM to
put a relational front end on top of IMS, as shown in Figure 7. This architecture would
have allowed IMS customers to continue to run IMS. New applications could be written
to the relational interface, providing an elegant migration path to the new technology.
Hence, over time a gradual shift from DL/1 to SQL would have occurred, all the while
preserving the high-performance IMS underpinnings.

In fact, IBM attempted to execute exactly this strategy, with a project code-named Eagle.
Unfortunately, it proved too hard to implement SQL on top of the IMS notion of logical
data bases, because of semantic issues. Hence, the complexity of logical data bases in
IMS came back to haunt IBM many years later. As a result, IBM was forced to move to
the dual data base strategy, and to declare a winner of the great debate.

Old programs call IMS directly; new programs go through a relational interface layered
on top of IMS.

The Architecture of Project Eagle
Figure 7

In summary, the CODASYL versus relational argument was ultimately settled by three
events:

a) the success of the VAX
b) the non-portability of CODASYL engines
c) the complexity of IMS logical data bases

The lessons that were learned from this epoch are:

Lesson 7: Set-at-a-time languages are good, regardless of the data model, since they offer
much improved physical data independence.

Lesson 8: Logical data independence is easier with a simple data model than with a
complex one.

Lesson 9: Technical debates are usually settled by the elephants of the marketplace, and
often for reasons that have little to do with the technology.

Lesson 10: Query optimizers can beat all but the best record-at-a-time DBMS application
programmers.

V The Entity-Relationship Era

In the mid 1970’s Peter Chen proposed the entity-relationship (E-R) data model as an
alternative to the relational, CODASYL and hierarchical data models [CHEN76].
Basically, he proposed that a data base be thought of as a collection of instances of entities.
Loosely speaking these are objects that have an existence, independent of any other
entities in the data base. In our example, Supplier and Parts would be such entities.

In addition, entities have attributes, which are the data elements that characterize the
entity. In our example, the attributes of Part would be pno, pname, psize, and pcolor.
One or more of these attributes would be designated to be unique, i.e. to be a key. Lastly,
there could be relationships between entities. In our example, Supply is a relationship
between the entities Part and Supplier. Relationships could be 1-to-1, 1-to-n, n-to-1 or
m-to-n, depending on how the entities participate in the relationship. In our example,
Suppliers can supply multiple parts, and parts can be supplied by multiple suppliers.
Hence, the Supply relationship is m-to-n. Relationships can also have attributes that
describe the relationship. In our example, qty and price are attributes of the relationship
Supply.

A popular representation for E-R models was a “boxes and arrows” notation as shown in
Figure 8. The E-R model never gained acceptance as the underlying data model that is
implemented by a DBMS. Perhaps the reason was that in the early days there was no
query language proposed for it. Perhaps it was simply overwhelmed by the interest in the
relational model in the 1970’s. Perhaps it looked too much like a “cleaned up” version of
the CODASYL model. Whatever the reason, the E-R model languished in the 1970’s.

Entity Part (pno, pname, psize, pcolor) --- relationship Supply (qty, price) --- entity
Supplier (sno, sname, scity, sstate)

An E-R Diagram
Figure 8

There is one area where the E-R model has been wildly successful, namely in data base
(schema) design. The standard wisdom from the relational advocates was to perform data
base design by constructing an initial collection of tables. Then, one applied
normalization theory to this initial design. Throughout the decade of the 1970’s a
collection of normal forms was proposed, including second normal form (2NF)
[CODD71b], third normal form [CODD71b], Boyce-Codd normal form (BCNF)
[CODD72b], fourth normal form (4NF) [FAGI77a], and project-join normal form
[FAGI77b].

There were two problems with normalization theory when applied to real world data base
design problems. First, real DBAs immediately asked “How do I get an initial set of
tables?” Normalization theory had no answer to this important question. Second, and
perhaps more serious, normalization theory was based on the concept of functional
dependencies, and real world DBAs could not understand this construct. Hence, data base
design using normalization was “dead in the water”.
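
For readers who have not seen the construct, here is a small worked example of our own,
using the schema of Figure 1. In Supply (sno, pno, qty, price) the key is (sno, pno), and
the only non-trivial functional dependency is

    sno, pno -> qty, price

so Supply is already in third normal form. Had a designer instead started from a single
table

    SupplyBad (sno, sname, pno, qty, price)

the extra dependency sno -> sname (a supplier has exactly one name) would depend on only
part of the key, the table would not even be in second normal form, and sname would be
stored redundantly for every part a supplier supplies. Normalization theory tells the DBA
how to decompose SupplyBad; it says nothing about how to arrive at a reasonable set of
tables in the first place.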

In contrast, the E-R model became very popular as a data base design tool. Chen’s
papers contained a methodology for constructing an initial E-R diagram. In addition, it
was straightforward to convert an E-R diagram into a collection of tables in third normal
form [WONG79]. Hence, a DBA tool could perform this conversion automatically. As
such, a DBA could construct an E-R model of his data, typically using a boxes and
arrows drawing tool, and then be assured that he would automatically get a good
relational schema. Essentially all data base design tools, such as Silverrun from Magna
Solutions, ERwin from Computer Associates, and ER/Studio from Embarcadero work in
this fashion.
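
For the E-R diagram of Figure 8, such a tool would emit essentially the schema of Figure
1. A sketch in SQL DDL (the data types are our own guesses, not part of the E-R model):

    Create Table Supplier (
        sno     integer primary key,
        sname   varchar(30),
        scity   varchar(30),
        sstate  char(2));

    Create Table Part (
        pno     integer primary key,
        pname   varchar(30),
        psize   integer,
        pcolor  varchar(10));

    Create Table Supply (
        sno     integer references Supplier,
        pno     integer references Part,
        qty     integer,
        price   decimal(10,2),
        primary key (sno, pno));

Each entity becomes a table keyed on its key attributes; the m-to-n Supply relationship
becomes a table whose key is the pair of foreign keys and whose remaining columns are the
attributes of the relationship.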

Lesson 11: Functional dependencies are too difficult for mere mortals to understand.
Another reason for KISS (Keep it simple stupid).

VI R++ Era

Beginning in the early 1980’s a (sizeable) collection of papers appeared which can be
described by the following template:

Consider an application, call it X
Try to implement X on a relational DBMS
Show why the queries are difficult or why poor performance is observed
Add a new “feature” to the relational model to correct the problem

Many X’s were investigated including mechanical CAD [KATZ86], VLSI CAD
[BATO85], text management [STON83], time [SNOD85] and computer graphics
[SPON84]. This collection of papers formed “the R++ era”, as they all proposed
additions to the relational model. In our opinion, probably the best of the lot was Gem
[ZANI83]. Zaniolo proposed adding the following constructs to the relational model,
together with corresponding query language extensions:

1) set-valued attributes. In a Parts table, it is often the case that there is an attribute,
such as available_colors, which can take on a set of values. It would be nice to add a data
type to the relational model to deal with sets of values.
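
Lacking such a type, the standard relational workaround is to factor the set out into a
table of its own. A sketch (ours) for available_colors:

    Part_color (pno, color)

    Select color
    From Part_color
    Where pno = 16

with one row per (part, color) pair. This works, but it costs an extra table, and an extra
join whenever the colors are needed together with the rest of the Part record.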

2) aggregation (tuple-reference as a data type). In the Supply relation noted above,
there are two foreign keys, sno and pno, that effectively point to tuples in other tables. It
is arguably cleaner to have the Supply table have the following structure:

Supply (PT, SR, qty, price)

Here the data type of PT is “tuple in the Part table” and the data type of SR is “tuple in
the Supplier table”. Of course, the expected implementation of these data types is via
some sort of pointer. With these constructs however, we can find the suppliers who
supply red parts as:

Select Supply.SR.sno
From Supply
Where Supply.PT.pcolor = “red”

This “cascaded dot” notation allowed one to query the Supply table and then effectively
reference tuples in other tables. This cascaded dot notation is similar to the path
expressions seen in high level network languages such as LSL. It allowed one to traverse
between tables without having to specify an explicit join.

3) generalization. Suppose there are two kinds of parts in our example, say electrical
parts and plumbing parts. For electrical parts, we record the power consumption and the
voltage. For plumbing parts we record the diameter and the material used to make the
part. This is shown pictorially in Figure 9, where we see a root part with two
specializations. Each specialization inherits all of the data attributes in its ancestors.

Inheritance hierarchies were put in early programming languages such as Planner
[HEWI69] and Conniver [MCDO73]. The same concept has been included in more
recent programming languages, such as C++. Gem merely applied this well known
concept to data bases.

Part (pno, pname, psize, pcolor), with two specializations:
Electrical (power, voltage) and Plumbing (diameter, material)

An Inheritance Hierarchy
Figure 9

In Gem, one could reference an inheritance hierarchy in the query language. For example
to find the names of red electrical parts, one would use:

Select E.pname
From Electrical E
Where E.pcolor = “red”

In addition, Gem had a very elegant treatment of null values.

The problem with extensions of this sort is that while they allowed easier query
formulation than was available in the conventional relational model, they offered very
little performance improvement. For example, primary-key-foreign-key relationships in
the relational model easily simulate tuple as a data type. Moreover, since foreign keys
are essentially logical pointers, the performance of this construct is similar to that
available from some other kind of pointer scheme. Hence, an implementation of Gem
would not be noticeably faster than an implementation of the relational model.
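
For example, the Gem query above is simulated by an ordinary join (our rendering):

    Select Supply.sno
    From Supply, Part
    Where Supply.pno = Part.pno
    And Part.pcolor = 'red'

The foreign key pno plays exactly the role of the PT “pointer”, which is why the two
formulations end up with comparable performance.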

In the early 1980’s, the relational vendors were singularly focused on improving
transaction performance and scalability of their systems, so that they could be used for
large scale business data processing applications. This was a very big market that had
major revenue potential. In contrast, R++ ideas would have minor impact. Hence, there
was little technology transfer of R++ ideas into the commercial world, and this research
focus had very little long-term impact.

Lesson 12: Unless there is a big performance or functionality advantage, new constructs
will go nowhere.

VII The Semantic Data Model Era

At around the same time, there was another school of thought with similar ideas, but a
different marketing strategy. They suggested that the relational data model is
“semantically impoverished”, i.e. it is incapable of easily expressing a class of data of
interest. Hence, there is a need for a “post relational” data model.

Post relational data models were typically called semantic data models. Examples
included the work by Smith and Smith [SMIT77] and Hammer and McLeod [HAMM81].
SDM from Hammer and McLeod is arguably the more elaborate semantic data model,
and we focus on its concepts in this section.

SDM focuses on the notion of classes, which are a collection of records obeying the same
schema. Like Gem, SDM exploited the concepts of aggregation and generalization and
included a notion of sets. Aggregation is supported by allowing classes to have attributes
that are records in other classes. However, SDM generalizes the aggregation construct in
Gem by allowing an attribute in one class to be a set of instances of records in some
class. For example, there might be two classes, Ships and Countries. The Countries class
could have an attribute called Ships_registered_here, having as its value a collection of
ships. The inverse attribute, country_of_registration can also be defined in SDM.
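
In relational terms (a sketch of our own, anticipating the simulation argument below), the
same information is just a foreign key, and either direction of the relationship is a query:

    Countries (country_id, country_name)
    Ships (ship_id, ship_name, weight, country_of_registration)

    Select ship_id
    From Ships
    Where country_of_registration = 17

Here country_of_registration is the inverse attribute, stored directly, and the
Ships_registered_here collection for country 17 is the result of the query. The weight
column is our own addition; it is used again in the sketch below for Heavy_ships.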

In addition, classes can generalize other classes. Unlike Gem, generalization is extended
to be a graph rather than just a tree. For example, Figure 10 shows a generalization graph
where American_oil_tankers inherits attributes from both Oil_tankers and
American_ships. This construct is often called multiple inheritance. Classes can also be
the union, intersection or difference between other classes. They can also be a subclass
of another class, specified by a predicate to determine membership. For example,
Heavy_ships might be a subclass of Ships with weight greater than 500 tons. Lastly, a
class can also be a collection of records that are grouped together for some other reason.
For example Atlantic_convoy might be a collection of ships that are sailing together
across the Atlantic Ocean.
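
The predicate-defined subclass, at least, maps directly onto a relational view (again a
sketch of ours, not SDM syntax):

    Create View Heavy_ships As
    Select *
    From Ships
    Where weight > 500

and the union, intersection, and difference classes map onto the corresponding relational
set operators. This is exactly the sort of simulation referred to below when we argue that
SDM machinery was easy to mimic on relational systems.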

Lastly, classes can have class variables; for example, the Ships class can have a class
variable which is the number of members of the class.

Most semantic data models were very complex, and were generally paper proposals.
Several years after SDM was defined, Univac explored an implementation of Hammer
and McLeod’s ideas. However, they quickly discovered that SQL was an intergalactic
standard, and their incompatible system was not very successful in the marketplace.

Ships, with two specializations Oil_tankers and American_ships;
American_oil_tankers inherits from both Oil_tankers and American_ships.

An Example of Multiple Inheritance
Figure 10

In our opinion, SDMs had the same two problems that faced the R++ advocates. Like the
R++ proposals, they were a lot of machinery that was easy to simulate on relational
systems. Hence, there was very little leverage in the constructs being proposed. The
SDM camp also faced the second issue of R++ proposals, namely that the established
vendors were distracted with transaction processing performance. Hence, semantic data
models had little long term influence.

VIII OO Era

Beginning in the mid 1980’s there was a “tidal wave” of interest in Object-oriented
DBMSs (OODB). Basically, this community pointed to an “impedance mismatch”
between relational data bases and languages like C++.

In practice, relational data bases had their own naming systems, their own data type
systems, and their own conventions for returning data as a result of a query. Whatever
programming language was used alongside a relational data base also had its own version
of all of these facilities. Hence, to bind an application to the data base required a
conversion from “programming language speak” to “data base speak” and back. This
was like “gluing an apple onto a pancake”, and was the reason for the so-called
impedance mismatch.

For example, consider the following C++ snippet, which defines a Part structure and then
allocates an Example_part.

struct Part {
    int number;
    char* name;
    char* bigness;
    char* color;
} Example_part;

All SQL run-time systems included mechanisms to load variables in the above struct
from values in the data base. For example, to retrieve part 16 into the above struct
required the following stylized program:

Define cursor P as
Select *
From Part
Where pno = 16;

Open P into Example_part
Until no-more {
    Fetch P (Example_part.number = pno,
             Example_part.name = pname,
             Example_part.bigness = psize,
             Example_part.color = pcolor)
}

First one defined a cursor to range over the answer to the SQL query. Then, one opened
the cursor, and finally fetched a record from the cursor and bound it to programming
language variables, which did not need to be the same name or type as the corresponding
data base objects. If necessary, data type conversion was performed by the run-time
interface.

The programmer could now manipulate the Struct in the native programming language.
When more than one record could result from the query, the programmer had to iterate
the cursor as in the above example.

It would seem to be much cleaner to integrate DBMS functionality more closely into a
programming language. Specifically, one would like a persistent programming
language, i.e. one where the variables in the language could represent disk-based data as
well as main memory data and where data base search criteria were also language
constructs. Several prototype persistent languages were developed in the late 1970’s,
including Pascal-R [SCHM77], Rigel [ROWE79], and a language embedding for PL/1
[DATE76]. For example, Rigel allowed the above query to be expressed as:

For P in Part where P.pno = 16 {
    Code_to_manipulate_part
}

In Rigel, as in other persistent languages, variables (in this case P) could be declared.
However, they only needed to be declared once to Rigel, and not once to the language
and a second time to the DBMS. In addition, the predicate P.pno = 16 is part of the Rigel
programming language. Lastly, one used the standard programming language iterators
(in this case a For loop) to iterate over qualifying records.

A persistent programming language is obviously much cleaner than a SQL embedding.
However, it requires the compiler for the programming language to be extended with
DBMS-oriented functionality. Since there is no programming language Esperanto, this
extension must be done once per compiler. Moreover, each extension will likely be
unique, since C++ is quite different from, for example, APL.

Unfortunately, programming language experts have consistently refused to focus on I/O
in general and DBMS functionality in particular. Hence, all programming languages that
we are aware of have no built-in functionality in this area. Not only does this make
embedding data sublanguages tedious, but also the result is usually difficult to program
and error prone. Lastly, language expertise does not get applied to important special
purpose data-oriented languages, such as report writers and so-called fourth generation
languages.

Hence, there was no technology transfer from the persistent programming language
research efforts of the 1970’s into the commercial marketplace, and ugly data-
sublanguage embeddings prevailed.

In the mid 1980’s there was a resurgence of interest in persistent programming languages,
motivated by the popularity of C++. This research thrust was called Object-Oriented
Data Bases (OODB), and focused mainly on persistent C++. Although the early work
came from the research community with systems like Garden [SKAR86] and Exodus
[RICH87], the primary push on OODBs came from a collection of start-ups, including
Ontologic, Object Design and Versant. All built commercial systems that supported
persistent C++.

The general form of these systems was to support C++ as a data model. Hence, any C++
structure could be persisted. For some reason, it was popular to extend C++ with the
notion of relationships, a concept borrowed directly from the Entity-Relationship data
model a decade earlier. Hence, several systems extended the C++ run-time with support
for this concept.

Most of the OODB community decided to address engineering data bases as their target
market. One typical example of this area is engineering CAD. In a CAD application, an
engineer opens an engineering drawing, say for an electronic circuit, and then modifies
the engineering object, tests it, or runs a power simulator on the circuit. When he is done
he closes the object. The general form of these applications is to open a large
engineering object and then process it extensively before closing it.

Historically, such objects were read into virtual memory by a load program. This
program would “swizzle” a disk-based representation of the object into a virtual memory
C++ object. The word “swizzle” came from the necessity of modifying any pointers in
the object when loading. On disk, pointers are typically some sort of logical reference
such as a foreign key, though they can also be disk pointers, for example (block-number,
offset). In virtual memory, they should be virtual memory pointers. Hence, the loader
had to swizzle the disk representation to a virtual memory representation. Then, the code
would operate on the object, usually for a long time. When finished, an unloader would
linearize the C++ data structure back into one that could persist on the disk.
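
A minimal sketch of the swizzle/unswizzle step just described, assuming an on-disk
representation that uses part numbers as logical references; all struct layouts and names
here are illustrative, not taken from any particular product.

#include <unordered_map>

struct Part {                      // in-memory form: real virtual-memory pointers
    int   number;
    Part* contained_in;
};

struct DiskPart {                  // on-disk form: a logical reference (here, a part number)
    int number;
    int contained_in_id;           // 0 means "no parent" in this sketch
};

// The loader "swizzles": each logical ID is replaced by the virtual-memory
// address of the corresponding already-loaded object.
Part* swizzle(const DiskPart& d, std::unordered_map<int, Part*>& loaded) {
    Part* p = new Part{d.number, nullptr};
    loaded[d.number] = p;
    if (d.contained_in_id != 0) {
        auto it = loaded.find(d.contained_in_id);
        p->contained_in = (it != loaded.end()) ? it->second : nullptr;
    }
    return p;
}

// The unloader linearizes the structure back into its disk form.
DiskPart unswizzle(const Part& p) {
    return DiskPart{p.number, p.contained_in ? p.contained_in->number : 0};
}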

To address the engineering market, an implementation of persistent C++ had the
following requirements:

1) No need for a declarative query language. All one needed was a way to reference
large disk-based engineering objects in C++.
2) No need for fancy transaction management. This market is largely one-user-at-a-
time processing of large engineering objects. Rather, some sort of versioning system
would be nice.
3) The run-time system had to be competitive with conventional C++ when
operating on the object. In this market, the performance of an algorithm using
persistent C++ had to be competitive with that available from a custom load
program and conventional C++.

Naturally, the OODB vendors focused on meeting these requirements. Hence, there was
weak support for transactions and queries. Instead, the vendors focused on good
performance for manipulating persistent C++ structures. For example, consider the
following declaration:

Persistent int I;

And then the code snippet:

I = I + 1;

In conventional C++, this is a single instruction. To be competitive, incrementing a
persistent variable cannot require a process switch. Hence,
the DBMS must run in the same address space as the application. Likewise, engineering
objects must be aggressively cached in main memory, and then “lazily” written back to
disk.
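
The flavor of this requirement can be seen in a toy C++ sketch like the following, which
keeps the persistent value cached in the application's address space so that an update
costs roughly a memory write, and flushes it lazily. This is purely illustrative and not
any vendor's actual mechanism; the write-back callback name below is hypothetical.

#include <functional>

template <typename T>
class Persistent {
    T cached_;                                  // in-memory copy, in the application's address space
    bool dirty_ = false;                        // set on update, cleared on flush
    std::function<void(const T&)> write_back_;  // invoked lazily, e.g. at commit or eviction
public:
    Persistent(T initial, std::function<void(const T&)> write_back)
        : cached_(initial), write_back_(std::move(write_back)) {}

    operator T() const { return cached_; }      // read: a memory access
    Persistent& operator=(const T& v) {         // write: a memory access plus a dirty flag
        cached_ = v;
        dirty_ = true;
        return *this;
    }
    void flush() {                              // deferred write-back to disk
        if (dirty_) { write_back_(cached_); dirty_ = false; }
    }
};

// Persistent<int> I(0, write_int_to_disk);     // hypothetical callback; "I = I + 1;" stays in-process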

Hence, the commercial OODBs, for example Object Design [LAMB91], had innovative
architectures that achieved these objectives.

Unfortunately, the market for such engineering applications never got very large, and
there were too many vendors competing for a “niche” market. At the present time, all of
the OODB vendors have failed, or have repositioned their companies to offer something
other than an OODB. For example, Object Design has renamed itself Excelon and is
selling XML services.

In our opinion, there are a number of reasons for this market failure.

1) Absence of leverage. The OODB vendors presented the customer with the
opportunity to avoid writing a load program and an unload program. This is not a
major service, and customers were not willing to pay big money for this feature.
2) No standards. All of the OODB vendor offerings were incompatible.
3) Relink the world. If anything changed, for example a C++ method that operated
on persistent data, then all programs that used this method had to be relinked.
This was a noticeable management problem.
4) No programming language Esperanto. If your enterprise had a single application
not written in C++ that needed to access persistent data, then you could not use
one of the OODB products.

Of course, the OODB products were not designed to work on business data processing
applications. Not only did they lack strong transaction and query systems but also they
ran in the same address space as the application. This meant that the application could
freely manipulate all disk-based data, and no data protection was possible. Protection and
authorization is important in the business data processing market. In addition, OODBs
were clearly a throwback to the CODASYL days, i.e. a low-level, record-at-a-time
language with the programmer coding the query optimization algorithm. As a result,
these products had essentially no penetration in this very large market.

There was one company, O2, that had a different business plan. O2 supported an object-
oriented data model, but it was not C++. Also, they embedded a high level declarative
language called OQL into a programming language. Hence, they proposed what
amounted to a semantic data model with a declarative query language, but marketed it as
an OODB. Also, they focused on business data processing, not on the engineering
application space.

Unfortunately for O2, there is a saying that “as goes the United States, so goes the rest of
the world”. This means that new products must make it in North America, and that the rest
of the world watches the US for market acceptance. O2 was a French company, spun out
of Inria by Francois Bancilhon. It was difficult for O2 to get market traction in Europe
with an advanced product, because of the above adage. Hence, O2 realized they had to
attack the US market, and moved to the United States rather late in the game. By then, it
was simply too late, and the OODB era was on a downward spiral. It is interesting to
conjecture about the marketplace chances of O2 if they had started initially in the USA
with sophisticated US venture capital backing.

Lesson 13: Packages will not sell to users unless they are in “major pain”

Lesson 14: Persistent languages will go nowhere without the support of the programming
language community.

IX The Object-Relational Era

The Object-Relational (OR) era was motivated by a very simple problem. In the early
days of INGRES, the team had been interested in geographic information systems (GIS)
and had suggested mechanisms for their support [GO75]. Around 1982, the following
simple GIS issue was haunting the INGRES research team. Suppose one wants to store
geographic positions in a data base. For example, one might want to store the location of
a collection of intersections as:

Intersections (I-id, long, lat, other-data)

Here, we require storing geographic points (long, lat) in a data base. If we then want to
find all the intersections within a bounding rectangle (X0, Y0, X1, Y1), the SQL query is:

Select I-id
From Intersections
Where long > X0 and long < X1 and lat > Y0 and lat < Y1

Unfortunately, this is a two dimensional search, and the B-trees in INGRES are a one-
dimensional access method. One-dimensional access methods do not do two-
dimensional searches efficiently, so there is no way in a relational system for this query
to run fast.

More troubling was the “notify parcel owners” problem. Whenever there is a request for a
variance to the zoning laws for a parcel of land in California, there must be a public
hearing, and all property owners within a certain distance must be notified.

Suppose one assumes that all parcels are rectangles, and they are stored in the following
table.

Parcel (P-id, Xmin, Xmax, Ymin, Ymax)

Then, one must enlarge the parcel in question by the correct number of feet, creating a
“super rectangle” with co-ordinates X0, X1, Y0, Y1. All property owners whose parcels
intersect this super rectangle must be notified, and the most efficient query to do this task
is:

Select P-id
From Parcel
Where Xmax > X0 and Ymax > Y0 and Xmin < X1 and Ymin < Y1

Again, there is no way to execute this query efficiently with a B-tree access method.
Moreover, it takes a moment to convince oneself that this query is correct, and there are
several other less efficient representations. In summary, simple GIS queries are difficult
to express in SQL, and they execute on standard B-trees with unreasonably bad
performance.

The following observation motivates the OR proposal. Early relational systems
supported integers, floats, and character strings, along with the obvious operators,
primarily because these were the data types of IMS, which was the early competition.
IMS chose these data types because that was what the business data processing market
wanted, and that was their market focus. Relational systems also chose B-trees because
these facilitate the searches that are common in business data processing. Later relational
systems expanded the collection of business data processing data types to include date,
time and money. More recently, packed decimal and blobs have been added.

In other markets, such as GIS, these are not the correct types, and B-trees are not the
correct access method. Hence, to address any given market, one needs data types and
access methods appropriate to the market. Since there may be many other markets one
would want to address, it is inappropriate to “hard wire” a specific collection of data
types and indexing strategies. Rather, a sophisticated user should be able to add his own,
i.e. to customize a DBMS to his particular needs. Such customization is also helpful in
business data processing, since one or more new data types appear to be needed every
decade.

As a result, the OR proposal added

user-defined data types,
user-defined operators,
user-defined functions, and
user-defined access methods

to a SQL engine. The major OR research prototype was Postgres [STON86].

Applying the OR methodology to GIS, one merely adds geographic points and
geographic boxes as data types. With these data types, the above tables can be
expressed as:

Intersections (I-id, point, other-data)


Parcel (P-id, P-box)

Of course, one must also have SQL operators appropriate to each data type. For our
simple application, these are !! (point in rectangle) and ## (box intersects box). The two
queries now become

Select I-id
From Intersections
Where point !! “X0, X1, Y0, Y1”

and

Select P-id
From Parcel
Where P-box ## “X0, X1, Y0, Y1”

To support the definition of user-defined operators, one must be able to specify a user-
defined function (UDF), which can process the operator. Hence, for the above examples,
we require functions

Point-in-rect (point, box)

and

Box-int-box (box, box)

which return Booleans. These functions must be called whenever the corresponding
operator must be evaluated, passing the two arguments in the call, and then acting
appropriately on the result.
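
A minimal C++ sketch of these two UDFs follows; the struct layouts for the point and box
types are assumed for illustration, and the names are written with underscores since C++
identifiers cannot contain hyphens.

struct Point { double x, y; };
struct Box   { double x0, x1, y0, y1; };   // assumes x0 <= x1 and y0 <= y1

// Backs the !! operator: is the point inside the box?
bool Point_in_rect(const Point& p, const Box& b) {
    return p.x >= b.x0 && p.x <= b.x1 && p.y >= b.y0 && p.y <= b.y1;
}

// Backs the ## operator: do the two boxes intersect?
bool Box_int_box(const Box& a, const Box& b) {
    return a.x0 <= b.x1 && b.x0 <= a.x1 && a.y0 <= b.y1 && b.y0 <= a.y1;
}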

To address the GIS market one needs a multi-dimensional indexing system, such as Quad
trees [SAME84] or R-trees [GUTM84]. In summary, a high performance GIS DBMS
can be constructed with appropriate user-defined data types, user-defined operators, user-
defined functions, and user-defined access methods.

The main contribution of Postgres was to figure out the engine mechanisms required to
support this kind of extensibility. In effect, previous relational engines had hard coded
support for a specific set of data types, operators and access methods. All this hard-
coded logic must be ripped out and replaced with a much more flexible architecture.
Many of the details of the Postgres scheme are covered in [STON90].
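
One way to picture the required flexibility is an operator registry that the executor
consults at evaluation time instead of hard-coded type logic. The sketch below is purely
conceptual; it does not reflect the actual Postgres catalog structures, and every name in
it is invented for illustration.

#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

struct Datum { const void* value; };                     // an untyped argument, as the engine passes it
using BoolUdf = std::function<bool(Datum, Datum)>;       // a Boolean operator UDF

class OperatorRegistry {
    // (operator name, operand type name) -> user-defined function
    std::map<std::pair<std::string, std::string>, BoolUdf> ops_;
public:
    void define(std::string op, std::string type, BoolUdf f) {
        ops_[{std::move(op), std::move(type)}] = std::move(f);
    }
    // Called by the executor whenever the operator appears in a query predicate.
    bool evaluate(const std::string& op, const std::string& type, Datum a, Datum b) const {
        auto it = ops_.find({op, type});
        if (it == ops_.end()) throw std::runtime_error("unknown operator");
        return it->second(a, b);
    }
};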

There is another interpretation of UDFs, which we now present. In the mid 1980’s Sybase
pioneered the inclusion of stored procedures in a DBMS. The basic idea was to offer
high performance on TPC-B, which consisted of the following commands that simulate
cashing a check:

Begin transaction
    Update account set balance = balance - X
        Where account_number = Y
    Update Teller set cash_drawer = cash_drawer - X
        Where Teller_number = Z
    Update bank set cash = cash - X
    Insert into log (account_number = Y, check = X, Teller = Z)
Commit

This transaction requires 5 or 6 round trip messages between the DBMS and the
application. Since these context switches are expensive relative to the very simple
processing which is being done, application performance is limited by the context
switching time.

A clever way to reduce this time is to define a stored procedure:

Define cash_check (X, Y, Z)
    Begin transaction
        Update account set balance = balance - X
            Where account_number = Y
        Update Teller set cash_drawer = cash_drawer - X
            Where Teller_number = Z
        Update bank set cash = cash - X
        Insert into log (account_number = Y, check = X, Teller = Z)
    Commit
End cash_check

Then, the application merely executes the stored procedure, with its parameters, e.g.:

Execute cash_check ($100, 79246, 15)

This requires only one round trip between the DBMS and the application rather than 5 or
6, and speeds up TPC-B immensely. To go fast on standard benchmarks such as TPC-B,
all vendors implemented stored procedures. Of course, this required them to define
proprietary (small) programming languages to handle error messages and perform
required control flow. This is necessary for the stored procedure to deal correctly with
conditions such as “insufficient funds” in an account.

Effectively a stored procedure is a UDF that is written in a proprietary language and is
“brain dead”, in the sense that it can only be executed with constants for its parameters.

The Postgres UDTs and UDFs generalized this notion to allow code to be written in a
conventional programming language and to be called in the middle of processing
conventional SQL queries.

Postgres implemented a sophisticated mechanism for UDTs, UDFs and user-defined
access methods. In addition, Postgres also implemented less sophisticated notions of
inheritance, and type constructors for pointers (references), sets, and arrays. This latter
set of features allowed Postgres to become “object-oriented” at the height of the OO
craze.

Later benchmarking efforts such as Bucky [CARE97] proved that the major win in
Postgres was UDTs and UDFs; the OO constructs were fairly easy and fairly efficient to
simulate on conventional relational systems. This work demonstrated once more what
the R++ and SDM crowd had already seen several years earlier; namely, built-in support
for aggregation and generalization offers little performance benefit. Put differently, the
major contribution of the OR efforts turned out to be a better mechanism for stored
procedures and user-defined access methods.

The OR model has enjoyed some commercial success. Postgres was commercialized by
Illustra. After struggling to find a market for the first couple of years, Illustra caught “the
internet wave” and became “the data base for cyberspace”. If one wanted to store text and
images in a data base and mix them with conventional data types, then Illustra was the
engine which could do that. Near the height of the internet craze, Illustra was acquired
by Informix. From the point of view of Illustra, there were two reasons to join forces
with Informix:

a) Inside every OR application, there is a transaction processing sub-application. In order
to be successful in OR, one must have a high performance OLTP engine. Postgres had
never focused on OLTP performance, and the cost of adding it to Illustra would be very
high. It made more sense to combine Illustra features into an existing high performance
engine.

b) To be successful, Illustra had to convince third party vendors to convert pieces of their
application suites into UDTs and UDFs. This was a non-trivial undertaking, and most
external vendors balked at doing so, at least until Illustra could demonstrate that OR
presented a large market opportunity. Hence, Illustra had a “chicken and egg” problem.
To get market share they needed UDTs and UDFs; to get UDTs and UDFs they needed
market share.

Informix provided a solution to both problems, and the combined company proceeded
over time to sell OR technology fairly successfully into the GIS market and into the
market for large content repositories (such as those envisioned by CNN and the British
Broadcasting Corporation). However, widescale adoption of OR in the business data
processing market remained elusive. Of course, the (unrelated) financial difficulties at
Informix made selling new technology such as OR extremely difficult. This certainly
hindered wider adoption.

OR technology is gradually finding market acceptance. For example, it is more effective
to implement data mining algorithms as UDFs, a concept pioneered by Red Brick and
recently adopted by Oracle. Instead of moving a terabyte-sized warehouse up to mining
code in middleware, it is more efficient to move the code into the DBMS and avoid all
the message overhead. OR technology is also being used to support XML processing, as
we will see presently.

One of the barriers to acceptance of OR technology in the broader business market is the
absence of standards. Every vendor has his own way of defining and calling UDFs. In
addition, most vendors support Java UDFs, but Microsoft does not. It is plausible that
OR technology will not take off unless (and until) the major vendors can agree on
standard definitions and calling conventions.

Lesson 14: The major benefits of OR are two-fold: putting code in the data base (and
thereby blurring the distinction between code and data) and user-defined access methods.

Lesson 15: Widespread adoption of new technology requires standards and/or an
elephant pushing hard.

X Semi-Structured Data

There has been an avalanche of work on “semi-structured” data in the last five years. An
early example of this class of proposals was Lore [MCHU97]. More recently, the various
XML-based proposals have the same flavor. At the present time, XMLSchema and
XQuery are the standards for XML-based data.

There are two basic points that this class of work exemplifies.

1) schema last
2) complex network-oriented data model

We talk about each point separately in this section.

10.1 Schema Last

The first point is that a schema is not required in advance. In a “schema first” system
the schema is specified, and instances of data records that conform to this schema can be
subsequently loaded. Hence, the data base is always consistent with the pre-existing
schema, because the DBMS rejects any records that are not consistent with the schema.
All previous data models required a DBA to specify the schema in advance.

In this class of proposals the schema does not need to be specified in advance. It can be
specified last, or even not at all. In a “schema last” system, data instances must be self-
describing, because there is not necessarily a schema to give meaning to incoming
records. Without a self-describing format, a record is merely “a bucket of bits”.

To make a record self-describing, one must tag each attribute with metadata that defines
the meaning of the attribute. Here are a couple of examples of such records, using an
artificial tagging system:

Person:
Name: Joe Jones
Wages: 14.75
Employer: My_accounting
Hobbies: skiing, bicycling
Works for: ref (Fred Smith)
Favorite joke: Why did the chicken cross the road? To get to the other side
Office number: 247
Major skill: accountant
End Person

Person:
Name: Smith, Vanessa
Wages: 2000
Favorite coffee: Arabian
Passtimes: sewing, swimming
Works_for: Between jobs
Favorite restaurant: Panera
Number of children: 3
End Person:

As can be seen, these two records each describe a person. Moreover, each attribute has
one of three characteristics:

1) it appears in only one of the two records, and there is no attribute in the other
record with the same meaning.
2) it appears in only one of the two records, but there is an attribute in the other
record with the same meaning (e.g. passtimes and hobbies).
3) it appears in both records, but the format or meaning is different (e.g. Works_for,
Wages)

Clearly, comparing these two persons is a challenge. This is an example of semantic
heterogeneity, where information on a common object (in this case a person) does not
conform to a common representation. Semantic heterogeneity makes query processing a
big challenge, because there is no structure on which to base indexing decisions and
query execution strategies.

The advocates of “schema last” typically have in mind applications where it is natural for
users to enter their data as free text, perhaps through a word processor (which may
annotate the text with some simple metadata about document structure). In this case, it is
an imposition to require a schema to exist before a user can add data. The “schema last”
advocates then have in mind automatically or semi-automatically tagging incoming data
to construct the above semi-structured records.

In contrast, if a business form is used for data entry (which would probably be natural for
the above Person data), then a “schema first” methodology is being employed, because
the person who designed the form is, in effect, also defining the schema by what he
allows in the form. As a result, schema last is appropriate mainly for applications where
free text is the mechanism for data entry.

To explore the utility of schema-last, we present the following scheme that classifies
applications into four buckets.

Ones with rigidly structured data
Ones with rigidly structured data with some text fields
Ones with semi-structured data
Ones with text

Rigidly structured data encompasses data that must conform to a schema. In general, this
includes essentially all data on which business processes must operate. For example,
consider the payroll data base for a typical company. This data must be rigidly
structured, or the check-printing program might produce erroneous results. One simply
cannot tolerate missing or badly formatted data that business processes depend on. For
rigidly structured data, one should insist on schema-first.

The personnel records of a large company are typical of the second class of data base
applications that we consider. There is a considerable amount of rigidly structured data,
such as the health plan each employee is enrolled in, and the fringe benefits they are
entitled to. In addition, there are free text fields, such as the comments of the manager at
the last employee review. The employee review form is typically rigidly structured;
hence the only free text input is into specific comment fields. Again, schema first appears
to be the right way to go, and this kind of application is easily addressed by an Object-
Relational DBMS with an added text data type.

The third class of data is termed semi-structured. The best examples we can think of are
want ads and resumes. In each of these cases, there is some structure to the data, but data
instances can vary in the fields that are present and how they are represented. Moreover,
there is no schema to which instances necessarily conform. Semi-structured instances are
often entered as a text document, and then parsed to find information of interest, which is
in turn “shredded” into appropriate fields inside the storage engine. In this case, schema
last is a good idea.

The fourth class of data is pure text, i.e. documents with no particular structure. In this
bucket, there is no obvious structure to exploit. Information Retrieval (IR) systems have
focused on this class of data for several decades. Few IR researchers have any interest in
semi-structured data; rather they are interested in document retrieval based on the textual

content of the document. Hence, there is no schema to deduce in this bucket, and this
corresponds to “schema not at all”.

As a result, schema-last proposals deal only with the third class of data in our
classification system. It is difficult to think up very many examples of this class, other
than resumes and advertisements. The proponents (many of whom are academics) often
suggest that college course descriptions fit this category. However, every university we
know has a rigid format for course descriptions, which includes one or more text fields.
Most have a standard form for entering the data, and a system (manual or automatic) to
reject course descriptions that do not fit this format. Hence, course descriptions are an
example of the second class of data, not the third. In our opinion, a careful examination
of the claimed instances of class 3 applications will yield many fewer actual instances of
the class. Moreover, the largest web site specializing in resumes (Monster.com) has
recently adopted a business form through which data entry occurs. Hence, they have
switched from class 3 to class 2, presumably to enforce more uniformity on their data
base (and thereby easier comparability).

Semantic heterogeneity has been with enterprises for a very long time. They spend vast
sums on warehouse projects to design standard schemas and then convert operational data
to this standard. Moreover, in most organizations semantic heterogeneity is dealt with on
a data set basis; i.e. data sets with different schemas must be homogenized. Typical
warehouse projects are over budget, because schema homogenization is so hard. Any
schema-last application will have to confront semantic heterogeneity on a record-by-
record basis, where it will be even more costly to solve. This is a good reason to avoid
“schema last” if at all possible.

In summary, schema last is appropriate only for the third class of applications in our
classification scheme. Moreover, it is difficult to come up with very many convincing
examples in this class. If anything, the trend is to move class 3 applications into class 2,
presumably to make semantic heterogeneity issues easier to deal with. Lastly, class 3
applications appear to have modest amounts of data. For these reasons, we view
schema last data bases as a niche market.

10.2 XML Data Model

We now turn to the XML data model. In the past, the mechanism for describing a
schema was Document Type Definitions (DTDs), and in the future the data model will
be specified in XMLSchema. DTDs and XMLSchema were intended to deal with the
structure of formatted documents (and hence the word “document” in DTDs). As a
result, they look like a document markup language, in particular a subset of SGML.
Because the structure of a document can be very complex, these document specification
standards are necessarily very complex. As a document specification system, we have no
quarrel with these standards.

After DTDs and XMLSchema were “cast into cement”, members of the DBMS research
community decided to try and use them to describe structured data. As a data model for

structured data, we believe both standards are seriously flawed. To a first approximation,
these standards have everything that was ever specified in any previous data model
proposal. In addition, they contain features complex enough that nobody in the DBMS
community has ever seriously proposed them in a data model.

For example, the data model presented in XMLSchema has the following characteristics:

1) XML records can be hierarchical, as in IMS
2) XML records can have “links” (references) to other records, as in CODASYL,
Gem and SDM
3) XML records can have set-based attributes, as in SDM
4) XML records can inherit from other records in several ways, as in SDM

In addition, XMLSchema also has several features that are well known in the DBMS
community but were never attempted in previous data models because of their complexity.
One example is union types; that is, an attribute in a record can be one of a set of possible
types. For example, in a personnel data base, the field “works-for” could either be a
department number in the enterprise, or the name of an outside firm to whom the
employee is on loan. In this case works-for can either be a string or an integer, with
different meanings.

Note that B-tree indexes on union types are complex. In effect, there must be an index
for each base type in the union. Moreover, there must be a different query plan for each
query that touches a union type. If two union types, containing N and M base types
respectively, are to be joined, then there will be at least max(M, N) plans to co-ordinate.
For these reasons, union types have never been seriously considered for inclusion in a
DBMS.
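
The indexing complication can be sketched with the “works-for” example above, using a
C++ std::variant as a stand-in for the union type; the per-type indexes and the record
IDs used here are hypothetical.

#include <map>
#include <string>
#include <variant>
#include <vector>

using WorksFor = std::variant<int, std::string>;      // department number, or outside firm name

struct WorksForIndexes {
    std::multimap<int, long>         by_department;   // one index per base type in the union
    std::multimap<std::string, long> by_firm;

    // A lookup must dispatch to the index (and, by extension, the query plan)
    // that matches the base type of the key actually supplied.
    std::vector<long> find(const WorksFor& key) const {
        std::vector<long> record_ids;
        if (const int* dept = std::get_if<int>(&key)) {
            for (auto [it, end] = by_department.equal_range(*dept); it != end; ++it)
                record_ids.push_back(it->second);
        } else {
            const std::string& firm = std::get<std::string>(key);
            for (auto [it, end] = by_firm.equal_range(firm); it != end; ++it)
                record_ids.push_back(it->second);
        }
        return record_ids;
    }
};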

Obviously, XMLSchema is far and away the most complex data model ever proposed. It
is clearly at the other extreme from the relational model on the “keep it simple, stupid”
(KISS) scale. It is hard to imagine something this complex being used as a model for
structured data. We can see three scenarios off into the future.

Scenario 1: XMLSchema will fail because of excessive complexity

Scenario 2: A “data-oriented” subset of XMLSchema will be proposed that is vastly
simpler.

Scenario 3: XMLSchema will become popular. Within a decade all of the problems with
IMS and CODASYL that motivated Codd to invent the relational model will resurface.
At that time some enterprising researcher, call him Y, will “dust off” Codd’s original
paper, and there will be a replay of the “Great Debate”. Presumably it will end the same
way as the last one. Moreover, Codd won the Turing award in 1981 [CODD82] for his
contribution. In this scenario, Y will win the Turing award circa 2015.

In fairness to the proponents of “X stuff”, they have learned something from history.
They are proposing a set-at-a-time query language, XQuery, which will provide a certain
level of data independence. As was discovered in the CODASYL era, providing views
for a network data model will be a challenge (and will be much harder than for the
relational model).

10.3 Summary

Summarizing XML/XMLSchema/XQuery is a challenge, because it has many facets.
Clearly, XML will be a popular “on-the-wire” format for data movement across a
network. The reason is simple: XML goes through firewalls, and other formats do not.
Since there is always a firewall between the machines of any two enterprises, it follows
that cross-enterprise data movement will use XML. Because a typical enterprise wishes
to move data within the enterprise the same way as outside the enterprise, there is every
reason to believe that XML will become an intergalactic data movement standard.

As a result, all flavors of system and application software must be prepared to send and
receive XML. It is straightforward to convert the tuple sets that are produced by
relational data bases into XML. If one has an OR engine, this is merely a user-defined
function. Similarly, one can accept input in XML and convert it to tuples to store in a
data base with a second user-defined function. Hence OR technology facilitates the
necessary format conversions. Other system software will likewise require a conversion
facility.
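
As a toy illustration of such a conversion function, written here as ordinary C++ rather
than as a registered UDF; the table and field names in the usage comment are hypothetical,
and escaping of special characters is omitted.

#include <string>
#include <utility>
#include <vector>

// Render one tuple as an XML element.
std::string tuple_to_xml(const std::string& table,
                         const std::vector<std::pair<std::string, std::string>>& fields) {
    std::string xml = "<" + table + ">";
    for (const auto& [name, value] : fields)
        xml += "<" + name + ">" + value + "</" + name + ">";
    return xml + "</" + table + ">";
}

// tuple_to_xml("Part", {{"pno", "16"}, {"pname", "hinge"}}) yields
// <Part><pno>16</pno><pname>hinge</pname></Part>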

Moreover, higher level data movement facilities built on top of XML, such as SOAP, will
be equally popular. Clearly, remote procedure calls that go through firewalls are much
more useful than ones that don’t. Hence, SOAP will dominate other RPC proposals.

It is possible that native XML DBMSs will become popular, but we doubt it. It will take
a decade for XML DBMSs to become high performance engines that can compete with
the current elephants. Moreover, schema-last should only be attractive in limited
markets, and the overly complex network model are the antithesis of KISS. XMLSchema
cries out for subsetting.. A clean subset of XML-schema would have the characteristic
that it maps easily to current relational DBMSs. In which case, what is the point of
implementing a new engine? Hence, we expect native XML DBMSs to be a niche
market.

Consider now XQuery. A (sane) subset is readily mappable to the OR SQL systems of
several of the vendors. For example, Informix implemented the XQuery operator “//” as a
user-defined function. Hence, it is fairly straightforward to implement a subset of
XQuery on top of most existing engines. As a result, it is not unlikely that the elephants
will support both SQL and a subset of XMLSchema and XQuery. The latter interface
will be translated into SQL.

XML is sometimes marketed as the solution to the semantic heterogeneity problem,
mentioned earlier. Nothing could be further from the truth. Just because two people tag a
data element as a salary does not mean that the two data elements are comparable. One
can be salary after taxes in French Francs including a lunch allowance, while the other
could be salary before taxes in US dollars. Furthermore, if you call them “rubber gloves”
and I call them “latex hand protectors”, then XML will be useless in deciding that they
are the same concept. Hence, the role of XML will be limited to providing the
vocabulary in which common schemas can be constructed.

In addition, we believe that cross-enterprise data sharing using common schemas will be
slow in coming, because semantic heterogeneity issues are so difficult to resolve.
Although W3C has a project in this area, the so-called semantic web, we are not
optimistic about its future impact. After all, the AI community has been working on
knowledge representation systems for a couple of decades with limited results. The
semantic web bears a striking resemblance to these past efforts. Since web services
depend on passing information between disparate systems, don’t bet on the early success
of this concept.

More precisely, we believe that cross-enterprise information sharing will be limited to:

Enterprises that have high economic value in co-operating. After all, the airlines have
been sharing data across disparate reservation systems for years.

Applications that are semantically simple (such as e-mail) where the main data type is
text and there are no complex semantic mappings involved.

Applications where there is an “elephant” that controls the market. Enterprises like
WalMart and Dell have little difficulty in sharing data with their suppliers. They simply
say “if you want to sell to me; here is how you will interact with my information
systems”. When there is an elephant powerful enough to dictate standards, then cross
enterprise information sharing can be readily accomplished.

We close with one final cynical note. A couple of years ago OLE-DB was being pushed
hard by Microsoft; now it is “X stuff”. OLE-DB was pushed by Microsoft, in large part,
because it did not control ODBC and perceived a competitive advantage in OLE-DB.
Now Microsoft perceives a big threat from Java and its various cross platform extensions,
such as J2EE. Hence, it is pushing hard on the XML and SOAP front to try to blunt the
success of Java.

There is every reason to believe that in a couple of years Microsoft will see competitive
advantage in some other DBMS-oriented standard. In the same way that OLE-DB was
sent to an early death, we expect Microsoft to send “X stuff” to a similar fate, the minute
marketing considerations dictate a change.

Less cynically, we claim that technological advances keep changing the rules. For
example, it is clear that the micro-sensor technology coming to the market in the next
few years will have a huge impact on system software, and we expect DBMSs and their
interfaces to be affected in some (yet to be figured out) way.

Hence, we expect a succession of new DBMS standards off into the future. In such an
ever changing world, it is crucial that a DBMS be very adaptable, so it can deal with
whatever the next “big thing” is. OR DBMSs have that characteristic; native XML
DBMSs do not.

Lesson 16: Schema-last is probably a niche market

Lesson 17: XQuery is pretty much OR SQL with a different syntax

Lesson 18: XML will not solve the semantic heterogeneity problem either inside or outside the
enterprise.

XI Full Circle
This paper has surveyed three decades of data model thinking. It is clear that we have
come “full circle”. We started off with a complex data model, which was followed by a
great debate between a complex model and a much simpler one. The simpler one was
shown to be advantageous in terms of understandability and its ability to support data
independence.

Then, a substantial collection of additions were proposed, none of which gained
substantial market traction, largely because they failed to offer substantial leverage in
exchange for the increased complexity. The only ideas that got market traction were
user-defined functions and user-defined access methods, and these were performance
constructs, not data model constructs. The current proposal is now a superset of the union
of all previous proposals; i.e., we have navigated a full circle.

The debate between the XML advocates and the relational crowd bears a suspicious
resemblance to the first “Great Debate” from a quarter of a century ago. A simple data
model is being compared to a complex one. Relational is being compared to
“CODASYL II”. The only difference is that “CODASYL II” has a high level query
language. Logical data independence will be harder in CODASYL II than in its
predecessor, because CODASYL II is even more complex than its predecessor.

We can see history repeating itself. If native XML DBMSs gain traction, then customers
will have problems with logical data independence and complexity.

To avoid repeating history, it is always wise to stand on the shoulders of those who went
before, rather than on their feet. As a field, if we don’t start learning something from
history, we will be condemned to repeat it yet again.

More abstractly, we see few new data model ideas. Most everything put forward in the
last 20 years is a reinvention of something from a quarter century ago. The only concepts
noticeably new appear to be:

Code in the data base (from the OR camp)
Schema last (from the semi-structured data camp)

Schema last appears to be a niche market, and we don’t see it as any sort of watershed
idea. Code in the data base appears to be a really good idea. Moreover, it seems to us
that designing a DBMS which treated code and data as equal-class citizens would be very
helpful. If so, then add-ons to DBMSs such as stored procedures, triggers, and alerters
would become first-class citizens. The OR model got part way there; maybe it is now
time to finish that effort.

References

[ASTR76] Morton M. Astrahan, Mike W. Blasgen, Donald D. Chamberlin, Kapali P. Eswaran, Jim Gray, Patricia P. Griffiths, W. Frank King, Raymond A. Lorie, Paul R. McJones, James W. Mehl, Gianfranco R. Putzolu, Irving L. Traiger, Bradford W. Wade, Vera Watson: System R: Relational Approach to Database Management. TODS 1(2): 97-137 (1976)

[BACH73] Charles W. Bachman: The Programmer as Navigator. CACM 16(11): 635-658 (1973)

[BATO85] Don S. Batory, Won Kim: Modeling Concepts for VLSI CAD Objects. TODS 10(3): 322-346 (1985)

[CARE97] Michael J. Carey, David J. DeWitt, Jeffrey F. Naughton, Mohammad Asgarian, Paul Brown, Johannes Gehrke, Dhaval Shah: The BUCKY Object-Relational Benchmark (Experience Paper). SIGMOD Conference 1997: 135-146

[CHAM74] Donald D. Chamberlin, Raymond F. Boyce: SEQUEL: A Structured English Query Language. SIGMOD Workshop, Vol. 1, 1974: 249-264

[CHEN76] Peter P. Chen: The Entity-Relationship Model - Toward a Unified View of Data. TODS 1(1): 9-36 (1976)

[CODA69] CODASYL: Data Base Task Group Report. ACM, New York, N.Y., October 1969

[CODA71] CODASYL: Feature Analysis of Generalized Data Base Management Systems. ACM, New York, N.Y., May 1971

[CODA73] CODASYL: Data Description Language, Journal of Development. National Bureau of Standards, NBS Handbook 113, June 1973

[CODA78] CODASYL: Data Description Language, Journal of Development. Information Systems, January 1978

[CODD70] E. F. Codd: A Relational Model of Data for Large Shared Data Banks. CACM 13(6): 377-387 (1970)

[CODD71a] E. F. Codd: A Database Sublanguage Founded on the Relational Calculus. SIGFIDET Workshop 1971: 35-68

[CODD71b] E. F. Codd: Normalized Data Structure: A Brief Tutorial. SIGFIDET Workshop 1971: 1-17

[CODD72a] E. F. Codd: Relational Completeness of Data Base Sublanguages. IBM Research Report RJ 987, San Jose, California (1972)

[CODD72b] E. F. Codd: Further Normalization of the Data Base Relational Model. In Data Base Systems, ed. Randall Rustin, Prentice-Hall, 1972

[CODD79] E. F. Codd: Extending the Database Relational Model to Capture More Meaning. TODS 4(4): 397-434 (1979)

[CODD82] E. F. Codd: Relational Database: A Practical Foundation for Productivity. CACM 25(2): 109-117 (1982)

[DATE76] C. J. Date: An Architecture for High-Level Language Database Extensions. SIGMOD Conference 1976: 101-122

[DATE84] C. J. Date: A Critique of the SQL Database Language. SIGMOD Record 14(3): 8-54 (1984)

[FAGI77a] Ronald Fagin: Multivalued Dependencies and a New Normal Form for Relational Databases. TODS 2(3): 262-278 (1977)

[FAGI77b] Ronald Fagin: Normal Forms and Relational Database Operators. SIGMOD Conference 1977: 153-160

[GO75] Angela Go, Michael Stonebraker, Carol Williams: An Approach to Implementing a Geo-Data System. Data Bases for Interactive Design 1975: 67-77

[GUTM84] Antonin Guttman: R-Trees: A Dynamic Index Structure for Spatial Searching. SIGMOD Conference 1984: 47-57

[HAMM81] Michael Hammer, Dennis McLeod: Database Description with SDM: A Semantic Database Model. TODS 6(3): 351-386 (1981)

[HEWI69] Carl Hewitt: PLANNER: A Language for Proving Theorems in Robots. Proceedings of IJCAI-69, IJCAI, Washington, D.C., May 1969

[KATZ86] Randy H. Katz, Ellis E. Chang, Rajiv Bhateja: Version Modeling Concepts for Computer-Aided Design Databases. SIGMOD Conference 1986: 379-386

[LAMB91] Charles Lamb, Gordon Landis, Jack A. Orenstein, Daniel Weinreb: The ObjectStore System. CACM 34(10): 50-63 (1991)

[MCDO73] D. McDermott, G. J. Sussman: The CONNIVER Reference Manual. AI Memo 259, MIT AI Lab, 1973

[MCHU97] Jason McHugh, Serge Abiteboul, Roy Goldman, Dallan Quass, Jennifer Widom: Lore: A Database Management System for Semistructured Data. SIGMOD Record 26(3): 54-66 (1997)

[RICH87] Joel E. Richardson, Michael J. Carey: Programming Constructs for Database System Implementation in EXODUS. SIGMOD Conference 1987: 208-219

[ROWE79] Lawrence A. Rowe, Kurt A. Shoens: Data Abstractions, Views and Updates in RIGEL. SIGMOD Conference 1979: 71-81

[RUST74] Randall Rustin (ed.): Data Models: Data-Structure-Set versus Relational. ACM SIGFIDET 1974

[SAME84] Hanan Samet: The Quadtree and Related Hierarchical Data Structures. Computing Surveys 16(2): 187-260 (1984)

[SCHM77] Joachim W. Schmidt: Some High Level Language Constructs for Data of Type Relation. TODS 2(3): 247-261 (1977)

[SKAR86] Andrea H. Skarra, Stanley B. Zdonik, Stephen P. Reiss: An Object Server for an Object-Oriented Database System. OODBS 1986: 196-204

[SMIT77] John Miles Smith, Diane C. P. Smith: Database Abstractions: Aggregation and Generalization. TODS 2(2): 105-133 (1977)

[SNOD85] Richard T. Snodgrass, Ilsoo Ahn: A Taxonomy of Time in Databases. SIGMOD Conference 1985: 236-246

[SPON84] David L. Spooner: Database Support for Interactive Computer Graphics. SIGMOD Conference 1984: 90-99

[STON75] Michael Stonebraker: Implementation of Integrity Constraints and Views by Query Modification. SIGMOD Conference 1975: 65-78

[STON76] Michael Stonebraker, Eugene Wong, Peter Kreps, Gerald Held: The Design and Implementation of INGRES. TODS 1(3): 189-222 (1976)

[STON83] Michael Stonebraker, Heidi Stettner, Nadene Lynn, Joseph Kalash, Antonin Guttman: Document Processing in a Relational Database System. TOIS 1(2): 143-158 (1983)

[STON86] Michael Stonebraker, Lawrence A. Rowe: The Design of Postgres. SIGMOD Conference 1986: 340-355

[STON90] Michael Stonebraker, Lawrence A. Rowe, Michael Hirohama: The Implementation of Postgres. TKDE 2(1): 125-142 (1990)

[TSIC76] Dennis Tsichritzis: LSL: A Link and Selector Language. SIGMOD Conference 1976: 123-133

[WONG79] Eugene Wong, R. H. Katz: Logical Design and Schema Conversion for Relational and DBTG Databases. ER 1979: 311-322

[ZANI83] Carlo Zaniolo: The Database Language GEM. SIGMOD Conference 1983: 207-218
Anatomy of a Database System
Joseph M. Hellerstein and Michael Stonebraker

1 Introduction
Database Management Systems (DBMSs) are complex, mission-critical pieces of
software. Today’s DBMSs are based on decades of academic and industrial research, and
intense corporate software development. Database systems were among the earliest
widely-deployed online server systems, and as such have pioneered design issues
spanning not only data management, but also applications, operating systems, and
networked services. The early DBMSs are among the most influential software systems
in computer science. Unfortunately, many of the architectural innovations implemented
in high-end database systems are regularly reinvented both in academia and in other areas
of the software industry.

There are a number of reasons why the lessons of database systems architecture are not
widely known. First, the applied database systems community is fairly small. There are
only a handful of commercial-grade DBMS implementations, since market forces only
support a few competitors at the high end. The community of people involved in
designing and implementing database systems is tight: many attended the same schools,
worked on the same influential research projects, and collaborated on the same
commercial products.

Second, academic treatment of database systems has often ignored architectural issues.
The textbook presentation of database systems has traditionally focused on algorithmic
and theoretical issues – which are natural to teach, study and test – without a holistic
discussion of system architecture in full-fledged implementations. In sum, there is a lot
of conventional wisdom about how to build database systems, but much of it has not been
written down or communicated broadly.

In this paper, we attempt to capture the main architectural aspects of modern database
systems, with a discussion of advanced topics. Some of these appear in the literature, and
we provide references where appropriate. Other issues are buried in product manuals,
and some are simply part of the oral tradition of the community. Our goal here is not to
glory in the implementation details of specific components. Instead, we focus on overall
system design, and stress issues not typically discussed in textbooks. For cognoscenti,
this paper should be entirely familiar, perhaps even simplistic. However, our hope is that
for many readers this paper will provide useful context for the algorithms and techniques
in the standard literature. We assume that the reader is familiar with textbook database
systems material (e.g. [53] or [61]), and with the basic facilities of modern operating
systems like Solaris, Linux, or Windows.


1.1 Context
The most mature database systems in production are relational database management
systems (RDBMSs), which serve as the backbone of infrastructure applications including
banking, airline reservations, medical records, human resources, payroll, telephony,
customer relationship management and supply chain management, to name a few. The
advent of web-based interfaces has only increased the volume and breadth of use of
relational systems, which serve as the repositories of record behind essentially all online
commerce. In addition to being very important software infrastructure today, relational
database systems serve as a well-understood point of reference for new extensions and
revolutions in database systems that may arise in the future.

In this paper we will focus on the architectural fundamentals for supporting core
relational features, and bypass discussion of the many extensions present in modern
RDBMSs. Many people are unaware that commercial relational systems now encompass
enormous feature sets, with support for complex data types, multiple programming
languages executing both outside and inside the system, gateways to various external data
sources, and so on. (The current SQL standard specification stacks up to many inches of
printed paper in small type!) In the interest of keeping our discussion here manageable,
we will gloss over most of these features; in particular we will not discuss system
extensions for supporting complex code (stored procedures, user-defined functions, Java
Virtual Machines, triggers, recursive queries, etc.) and data types (Abstract Data Types,
complex objects, XML, etc.)

At heart, a typical database system has four main pieces as shown in Figure 1: a process
manager that encapsulates and schedules the various tasks in the system; a statement-at-a-
time query processing engine; a shared transactional storage subsystem that knits together
storage, buffer management, concurrency control and recovery; and a set of shared
utilities including memory management, disk space management, replication, and various
batch utilities used for administration.


Figure 1: Main Components of a DBMS

1.2 Structure of the Paper


We begin our discussion with overall architecture of DBMS processes, including coarse
structure of the software and hardware configurations of various systems, and details
about the allocation of various database tasks to threads or processes provided by an
operating system. We continue with the storage issues in a DBMS. In the next section
we take a single query’s view of the system, focusing on the query processing engine.
The subsequent section covers the architecture of a transactional storage manager.
Finally, we present some of the shared utilities that exist in most DBMSs, but are rarely
discussed in textbooks.

2 Process Models and Hardware Architectures


When building any multi-user server, decisions have to be made early on regarding the
organization of processes in the system. These decisions have a profound influence on
the software architecture of the system, and on its performance, scalability, and
portability across operating systems1. In this section we survey a number of options for
DBMS process models. We begin with a simplified framework, assuming the availability
of good operating system support for lightweight threads in a uniprocessor architecture.
We then expand on this simplified discussion to deal with the realities of how DBMSs
implement their own threads and map them to the OS facilities, and how they manage
multiprocessor configurations.

1
Most systems are designed to be portable, but not all. Notable examples of OS-specific
DBMSs are DB2 for MVS, and Microsoft SQL Server. These systems can exploit (and
sometimes add!) special OS features, rather than using DBMS-level workarounds.


2.1 Uniprocessors and OS Threads


In this subsection we outline a somewhat simplistic approach to process models for
DBMSs. Some of the leading commercial DBMSs are not architected this way today, but
this introductory discussion will set the stage for the more complex details to follow in
the remainder of Section 2.

We make two basic assumptions in this subsection, which we will relax in the
subsections to come:
1. High-performance OS threads: We assume that the operating system provides
us with a very efficient thread package that allows a process to have a very large
number of threads. We assume that the memory overhead of each thread is small,
and that context switches among threads are cheap. This is arguably true on a
number of the modern operating systems, but was certainly not true when most
DBMSs were first built. In subsequent sections we will describe how DBMS
implementations actually work with OS threads and processes, but for now we
will assume that the DBMS designers had high-performance threads available
from day one.
2. Uniprocessor Hardware: We will assume that we are designing for a single
machine with a single CPU. Given the low cost of dual-processor and four-way
server PCs today, this is an unrealistic assumption even at the low end. However,
it will significantly simplify our initial discussion.

In this simplified context, there are three natural process model options for a DBMS.
From simplest to most sophisticated, these are:

Figure 2: Process per connection model. Each gear icon represents a process.

1. Process per Connection: This was the model used in early DBMS
implementations on UNIX. In this model, users run a client tool, typically on a
machine across a network from the DBMS server. They use a database

connectivity protocol (e.g., ODBC or JDBC) that connects to a main dispatcher
process at the database server machine, which forks a separate process (not a
thread!) to serve that connection. This is relatively easy to implement in UNIX-
like systems, because it maps DBMS units of work directly onto OS processes.
The OS scheduler manages timesharing of user queries, and the DBMS
programmer can rely on OS protection facilities to isolate standard bugs like
memory overruns. Moreover, various programming tools like debuggers and
memory checkers are well-suited to this process model. A complication of
programming in this model regards the data structures that are shared across
connections in a DBMS, including the lock table and buffer pool. These must be
explicitly allocated in OS-supported “shared memory” accessible across
processes, which requires a bit of special-case coding in the DBMS.

In terms of performance, this architecture is not attractive. It does not scale very
well in terms of the number of concurrent connections, since processes are
heavyweight entities with sizable memory overheads and high context-switch
times. Hence this architecture is inappropriate for one of the bread-and-butter
applications of commercial DBMSs: high-concurrency transaction processing.
This architecture was replaced in the commercial DBMS vendors long ago,
though it is still a compatibility option in many systems (and in fact the default
option on installation of Oracle for UNIX).


Figure 3: Server Process model. The multiple-gear icon represents a multithreaded process.

2. Server Process: This is the most natural architecture for efficiency today. In this
architecture, a single multithreaded process hosts all the main activity of the
DBMS. A dispatcher thread (or perhaps a small handful of such threads) listens
for SQL commands. Typically the process keeps a pool of idle worker threads
available, and the dispatcher assigns incoming SQL commands to idle worker
threads, so that each command runs in its own thread. When a command is
completed, it clears its state and returns its worker thread to the thread pool.


Shared data structures like the lock table and buffer pool simply reside in the
process’ heap space, where they are accessible to all threads.

The usual multithreaded programming challenges arise in this architecture: the OS
does not protect threads from each other’s memory overruns and stray pointers,
debugging is tricky especially with race conditions, and the software can be
difficult to port across operating systems due to differences in threading interfaces
and multi-threaded performance. Although thread API differences across
operating systems have been minimized in recent years, subtle distinctions across
platforms still cause hassles in debugging and tuning.


Figure 4: Server process + I/O processes. Note that each disk has a dedicated, single-threaded I/O
process.

3. Server Process + I/O Processes: The Server Process model makes the important
assumption that asynchronous I/O is provided by the operating system. This
feature allows the DBMS to issue a read or write request, and work on other
things while the disk device works to satisfy the request. Asynchronous I/O can
also allow the DBMS to schedule an I/O request to each of multiple disk devices
and have the devices all working in parallel; this is possible even on a
uniprocessor system, since the disk devices themselves work autonomously, and
in fact have their own microprocessors on board. Some time after a disk request
is issued, the OS interrupts the DBMS with a notification that the request has
completed. Because of the separation of requests from responses, this is
sometimes called a split-phase programming model.

Unfortunately, asynchronous I/O support in the operating system is a fairly recent
development: Linux only included asynchronous disk I/O support in the standard
kernel in 2002. Without asynchronous I/O, all threads of a process must block
while waiting for any I/O request to complete, which can unacceptably limit both
system throughput and per-transaction latency. To work around this issue on
older OS versions, a minor modification to the Server Process model is used.
Additional I/O Processes are introduced to provide asynchronous I/O features
outside the OS. The main Server threads queue I/O requests to an I/O Process via
shared memory or network sockets, and the I/O Process queues responses back to
the main Server Process in a similar fashion. There is typically about one I/O
Process per disk in this environment, to ensure that the system can handle
multiple requests to separate devices in parallel.
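
The Server Process model above (item 2) is essentially a dispatcher feeding a pool of
worker threads. The following is a minimal sketch of that pattern in Python, included
purely for illustration; the pool size, the work queue, and the execute_sql placeholder
are inventions of the sketch, not details of any particular DBMS.

    import queue
    import threading

    NUM_WORKERS = 8             # size of the worker thread pool (arbitrary)
    work_queue = queue.Queue()  # dispatcher enqueues incoming SQL commands here

    def execute_sql(command, reply):
        # Placeholder for parsing, optimizing and executing the command.
        reply.put("result of: " + command)

    def worker_loop():
        # Each worker takes a command, runs it, and returns to the pool by
        # looping back to fetch the next command.
        while True:
            command, reply = work_queue.get()
            execute_sql(command, reply)
            work_queue.task_done()

    # Start the pool of idle worker threads.
    for _ in range(NUM_WORKERS):
        threading.Thread(target=worker_loop, daemon=True).start()

    def dispatch(command):
        # The dispatcher thread assigns an incoming command to the pool and
        # hands the caller a queue on which the result will appear.
        reply = queue.Queue()
        work_queue.put((command, reply))
        return reply

    print(dispatch("SELECT 1").get())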

2.1.1 Passing Data Across Threads


A good Server Process architecture provides non-blocking, asynchronous I/O. It also has
dispatcher threads connecting client requests to worker threads. This design raises the
question of how data is passed across these thread or process boundaries. The short
answer is that various buffers are used. We describe the typical buffers here, and briefly
discuss policies for managing them.

• Disk I/O buffers: The most common asynchronous interaction in a database is
for disk I/O: a thread issues an asynchronous disk I/O request, and engages in
other tasks pending a response. There are two separate I/O scenarios to consider:
o DB I/O requests: The Buffer Pool. All database data is staged through
the DBMS buffer pool, about which we will have more to say in Section
3.3. In Server Process architectures, this is simply a heap-resident data
structure. To flush a buffer pool page to disk, a thread generates an I/O
request that includes the page’s current location in the buffer pool (the
frame), and its destination address on disk. When a thread needs a page to
be read in from the database, it generates an I/O request specifying the
disk address, and a handle to a free frame in the buffer pool where the
result can be placed. The actual reading and writing of pages into and
out of frames is done asynchronously.
o Log I/O Requests: The Log Tail. The database log is an array of entries
stored on a set of disks. As log entries are generated during transaction
processing, they are staged in a memory queue that is usually called the
log tail, which is periodically flushed to the log disk(s) in FIFO order. In
many systems, a separate thread is responsible for periodically flushing
the log tail to the disk.

The most important log flushes are those that commit transactions. A
transaction cannot be reported as successfully committed until a commit
log record is flushed to the log device. This means both that client code
waits until the commit log record is flushed, and that DBMS server code
must hold resources (e.g. locks) until that time as well. In order to
amortize the costs of log writes, most systems defer them until enough are
queued up, and then do a “group commit” [27] by flushing the log tail.
Policies for group commit are a balance between keeping commit latency
low (which favors flushing the log tail more often), and maximizing log
throughput (which favors postponing log flushes until the I/O can be
amortized over many bytes of log tail). A small sketch of such a policy appears
at the end of this subsection.

• Client communication buffers: SQL typically is used in a “pull” model: clients
consume result tuples from a query cursor by repeatedly issuing the SQL FETCH
request, which may retrieve one or more tuples per request. Most DBMSs try to
work ahead of the stream of FETCH requests, enqueuing results in advance of
client requests.

In order to support this workahead behavior, the DBMS worker thread for a query
contains a pointer to a location for enqueuing results. A simple option is to assign
each client to a network socket. In this case, the worker thread can use the socket
as a queue for the tuples it produces. An alternative is to multiplex a network
socket across multiple clients. In this case, the server process must (a) maintain
its own state per client, including a communication queue for each client’s SQL
results, and (b) have a “coordinator agent” thread (or set of threads) available to
respond to client FETCH requests by pulling data off of the communication queue.
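
As promised above, here is a small, hypothetical group-commit policy in Python: the
log tail is flushed when either a byte threshold is reached or the oldest waiting
commit has been delayed too long. The thresholds and the flush_to_disk placeholder are
invented for the sketch and are not taken from any specific system.

    import time

    MAX_UNFLUSHED_BYTES = 64 * 1024   # flush once this much log tail is queued
    MAX_COMMIT_DELAY = 0.005          # or once the oldest commit has waited 5 ms

    def flush_to_disk(data):
        # Placeholder for a synchronous write to the log device.
        pass

    class LogTail:
        def __init__(self):
            self.buffer = bytearray()
            self.oldest_wait = None   # arrival time of the oldest unflushed commit

        def append(self, record, is_commit):
            self.buffer.extend(record)
            if is_commit and self.oldest_wait is None:
                self.oldest_wait = time.monotonic()
            if self.should_flush():
                self.flush()

        def should_flush(self):
            # Balance commit latency (flush soon after a commit arrives) against
            # log throughput (amortize the I/O over many bytes of log tail).
            if len(self.buffer) >= MAX_UNFLUSHED_BYTES:
                return True
            if self.oldest_wait is not None:
                return time.monotonic() - self.oldest_wait >= MAX_COMMIT_DELAY
            return False

        def flush(self):
            flush_to_disk(bytes(self.buffer))
            self.buffer.clear()
            self.oldest_wait = None

    tail = LogTail()
    tail.append(b"<update record>", is_commit=False)
    tail.append(b"<commit record>", is_commit=True)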

2.2 DBMS Threads, OS Processes, and Mappings Between Them
The previous section provided a simplified description of DBMS threading models. In
this section we relax the first of our assumptions above: the need for high-performance
OS thread packages. We provide some historical perspective on how the problem was
solved in practice, and also describe the threading in modern systems.

Most of today’s DBMSs have their roots in research systems from the 1970’s, and
commercialization efforts from the ’80’s. Many of the OS features we take for granted
today were unavailable to DBMS developers at the time the original database systems
were built. We touched on some of these above: buffering control in the filesystem, and
asynchronous I/O service. A more fundamental issue that we ignored above was the lack
of high-performance threading packages. When such packages started to become
available in the 1990’s, they were typically OS-specific. Even the current POSIX thread
standard is not entirely predictable across platforms, and recent OS research suggests that
OS threads still do not scale as well as one might like ([23][37][67][68], etc.).

Hence for legacy, portability, and performance reasons, many commercial DBMSs
provide their own lightweight, logical thread facility at application level (i.e. outside of
the OS) for the various concurrent tasks in the DBMS. We will use the term DBMS
thread to refer to one of these DBMS-level tasks. These DBMS threads replace the role
of the OS threads described in the previous section. Each DBMS thread is programmed
to manage its own state, to do all slow activities (e.g. I/Os) via non-blocking,
asynchronous interfaces, and to frequently yield control to a scheduling routine (another
DBMS thread) that dispatches among these tasks. This is an old idea, discussed in a
retrospective sense in [38], and widely used in event-loop programming for user
interfaces. It has been revisited quite a bit in the recent OS literature [23][37][67][68].


This architecture provides fast task-switching and ease of porting, at the expense of
replicating a good deal of OS logic in the DBMS (task-switching, thread state
management, scheduling, etc.) [64].

Using a DBMS-level thread package raises another set of design questions. Given
DBMS threads and OS process facilities (but no OS threads), it is not obvious how to
map DBMS threads into OS processes: How many OS processes should there be? What
DBMS tasks should get their own DBMS threads? How should threads be assigned to
the processes? To explore this design space, we simplify things by focusing on the case
where there are only two units of scheduling: DBMS threads and OS processes. We will
reintroduce OS threads into the mix in Section 2.2.1.

In the absence of OS thread support, a good rule of thumb is to have one process per
physical device (CPU, disk) to maximize the physical parallelism inherent in the
hardware, and to ensure that the system can function efficiently in the absence of OS
support for asynchronous I/O. To that end, a typical DBMS has the following set of
processes:
• One or more processes to host DBMS threads for SQL processing. These
processes host the worker DBMS threads for query processing. In some cases it
is beneficial to allocate more than one such process per CPU; this is often a
“tuning knob” that can be set by the database administrator.
• One or more “dispatcher” processes. These processes listen on a network port
for new connections, and dispatch the connection requests to a DBMS thread in
another process for further processing. The dispatcher also sets up session state
(e.g. communication queues) for future communication on the connection. The
number of dispatchers is typically another knob that can be set by the database
administrator; a rule of thumb is to set the number of dispatchers to be the
expected peak number of concurrent connections divided by a constant (Oracle
recommends dividing by 1000.)
• One process per database disk (I/O Process Architectures). For platforms
where the OS does not supply efficient asynchronous I/O calls, the lack of OS
threads requires multiple I/O Processes, one per database disk, to service I/O
requests.
• One process per log disk (I/O Process Architectures). For platforms with I/O
Processes, there will be a process per log disk, to flush the log tail, and to read the
log in support of transaction rollback.
• One coordinator agent process per client session. In some systems, a process
is allocated for each client session, to maintain session state and handle client
communication. In other systems this state is encapsulated in a data structure that
is available to the DBMS threads in the SQL processes.
• Background Utilities: As we discuss in Section 6, DBMSs include a number of
background utilities for system maintenance, including database statistics-
gathering, system monitoring, log device archiving, and physical reorganization.
Each of these typically runs in its own process, which is typically spawned
dynamically on a schedule.


2.2.1 DBMS Threads, OS Threads and Current Commercial Systems


The preceding discussion assumes no support for OS threads. In fact, most modern
operating systems now support reasonable threads packages. They may not provide the
degree of concurrency needed by the DBMS (Linux threads were very heavyweight until
recently), but they are almost certainly more efficient than using multiple processes as
described above.

Since most database systems evolved along with their host operating systems, they were
originally architected for single-threaded processes as we just described. As OS threads
matured, a natural form of evolution was to modify the DBMS to be a single process,
using an OS thread for each unit that was formerly an OS process. This approach
continues to use the DBMS threads, but maps them into OS threads rather than OS
processes. This evolution is relatively easy to code, and leverages the code investment in
efficient DBMS threads, minimizing the dependency on high-end multithreading in the
OS.

In fact, most of today’s DBMSs are written in this manner, and can be run over either
processes or threads. They abstract the choice between processes and threads in the code,
mapping DBMS threads to OS-provided “dispatchable units” (to use DB2 terminology),
be they processes or threads.

Current hardware provides one reason to stick with processes as the “dispatchable unit”.
On many architectures today, the addressable memory per process is not as large as
available physical memory – for example, on Linux for x86 only 3GB of RAM is
available per process. It is certainly possible to equip a modern PC with more physical
memory than that, but no individual process can address all of the memory. Using
multiple processes alleviates this problem in a simple fashion.

There are variations in the threading models in today’s leading systems. Oracle on UNIX
is configured by default to run in Process-Per-User mode, but for better performance can
run in the Server Process fashion described at the beginning of Section 2.2: DBMS
threads multiplexed across a set of OS processes. On Windows, Oracle uses a single OS
process with multiple threads as dispatchable units: DBMS threads are multiplexed
across a set of OS threads. DB2 does not provide its own DBMS threads. On UNIX
platforms DB2 works in a Process-per-User mode: each user’s session has its own agent
process that executes the session logic. DB2 on Windows uses OS threads as the
dispatchable unit, rather than multiple processes. Microsoft SQL Server only runs on
Windows; it runs an OS thread per session by default, but can be configured to multiplex
various “DBMS threads” across a single OS thread; in the case of SQL Server the
“DBMS threads” package is actually a Windows-provided feature known as fibers.

2.3 Parallelism, Process Models, and Memory Coordination


In this section, we relax the second assumption of Section 2.1 by focusing on platforms
with multiple CPUs. Parallel hardware is a fact of life in modern server situations, and
comes in a variety of configurations. We summarize the standard DBMS terminology
(introduced in [65]), and discuss the process models and memory coordination issues in
each.

2.3.1 Shared Memory

Figure 5: Shared Memory Architecture

A shared-memory parallel machine is one in which all processors can access the same
RAM and disk with about the same performance. This architecture is fairly standard
today – most server hardware ships with between two and eight processors. High-end
machines can ship with dozens to hundreds of processors, but tend to be sold at an
enormous premium relative to the number of compute resources provided. Massively
parallel shared-memory machines are one of the last remaining “cash cows” in the
hardware industry, and are used heavily in high-end online transaction processing
applications. The cost of hardware is rarely the dominant factor in most companies’ IT
ledgers, so this cost is often deemed acceptable.²

The process model for shared memory machines follows quite naturally from the
uniprocessor Server Process approach – and in fact most database systems evolved from
their initial uniprocessor implementations to shared-memory implementations. On
shared-memory machines, the OS typically supports the transparent assignment of
dispatchable units (processes or threads) across the processors, and the shared data
structures continue to be accessible to all. Hence the Server Process architecture
parallelizes to shared-memory machines with minimal effort. The main challenge is to
modify the query execution layers described in Section 4 to take advantage of the ability
to parallelize a single query across multiple CPUs.

² The dominant cost for DBMS customers is typically paying qualified people to
administer high-end systems. This includes Database Administrators (DBAs) who
configure and maintain the DBMS, and System Administrators who configure and
maintain the hardware and operating systems. Interestingly, these are typically very
different career tracks, with very different training, skill sets, and responsibilities.


2.3.2 Shared Nothing

Figure 6: Shared Nothing Architecture

A shared-nothing parallel machine is made up of a cluster of single-processor machines
that communicate over a high-speed network interconnect. There is no way for a given
processor to directly access the memory or disk of another processor. This architecture is
also fairly standard today, and has unbeatable scalability and cost characteristics. It is
mostly used at the extreme high end, typically for decision-support applications on data
warehouses. Shared nothing machines can be cobbled together from individual PCs, but
for database server purposes they are typically sold (at a premium!) as packages
including specialized network interconnects (e.g. the IBM SP2 or the NCR WorldMark
machines.) In the OS community, these platforms have been dubbed “clusters”, and the
component PCs are sometimes called “blade servers”.

Shared nothing systems provide no hardware sharing abstractions, leaving coordination
of the various machines entirely in the hands of the DBMS. In these systems, each
machine runs its own Server Process as above, but allows an individual query’s execution
to be parallelized across multiple machines. The basic architecture of these systems is to
use horizontal data partitioning to allow each processor to execute independently of the
others. For storage purposes, each tuple in the database is assigned to an individual
machine, and hence each table is sliced “horizontally” and spread across the machines
(typical data partitioning schemes include hash-based partitioning by tuple attribute,
range-based partitioning by tuple attribute, or round-robin). Each individual machine is
responsible for the access, locking and logging of the data on its local disks. During
query execution, the query planner chooses how to horizontally re-partition tables across
the machines to satisfy the query, assigning each machine a logical partition of the work.
The query executors on the various machines ship data requests and tuples to each other,
but do not need to transfer any thread state or other low-level information. As a result of
this value-based partitioning of the database tuples, minimal coordination is required in
these systems. However, good partitioning of the data is required for good performance,
which places a significant burden on the DBA to lay out tables intelligently, and on the
query optimizer to do a good job partitioning the workload.
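
As a rough illustration of horizontal partitioning, the sketch below hashes each tuple
to a “node” by one of its attributes and lets every node filter its own slice
independently; the cluster size, the toy Emp table, and the predicate are all invented
for the example and do not reflect any particular shared-nothing product.

    NUM_NODES = 4   # hypothetical cluster size

    def node_for(key):
        # Hash-based partitioning on a tuple attribute.
        return hash(key) % NUM_NODES

    # A toy Emp table, partitioned horizontally by deptno.
    partitions = [[] for _ in range(NUM_NODES)]
    for name, deptno, salary in [("Alice", 10, 52000), ("Bob", 20, 61000),
                                 ("Carol", 10, 87000)]:
        partitions[node_for(deptno)].append((name, deptno, salary))

    def local_scan(node, predicate):
        # Each node scans only its own data; no shared state is required.
        return [t for t in partitions[node] if predicate(t)]

    # "Parallel" execution: each node filters independently, results are unioned.
    result = []
    for node in range(NUM_NODES):
        result.extend(local_scan(node, lambda t: t[2] > 60000))
    print(result)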

This simple partitioning solution does not handle all issues in the DBMS. For example,
there has to be explicit cross-processor coordination to handle transaction completion, to
provide load balancing, and to support certain mundane maintenance tasks. In particular,
the processors must exchange explicit control messages for issues like distributed
deadlock detection and two-phase commit [22]. This requires additional logic, and can
be a performance bottleneck if not done carefully.

Also, partial failure is a possibility that has to be managed in a shared-nothing system.
In a shared-memory system, the failure of a processor typically results in a hardware
shutdown of the entire parallel computing machine. In a shared-nothing system, the
failure of a single node will not necessarily affect other nodes, but will certainly affect the
overall behavior of the DBMS, since the failed node hosts some fraction of the data in the
database. There are three possible approaches in this scenario. The first is to bring down
all nodes if any node fails; this in essence emulates what would happen in a shared-
memory system. The second approach, which Informix dubbed “Data Skip”, allows
queries to be executed on any nodes that are up, “skipping” the data on the failed node.
This is of use in scenarios where availability trumps consistency, but the best effort
results generated do not have any well-defined semantics. The third approach is to
employ redundancy schemes like chained declustering [32], which spread copies of
tuples across multiple nodes in the cluster. These techniques are designed to tolerate a
number of failures without losing data. In practice, however, these techniques are not
provided; commercial vendors offer coarser-grained redundancy solutions like database
replication (Section 6.3), which maintain a copy of the entire database in a separate
“standby” system.
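
The placement rule behind chained declustering is simple enough to sketch: the primary
copy of fragment i lives on node i and its backup is “chained” onto the next node, so a
single failure leaves every fragment available somewhere. The sketch below shows only
this placement and failover rule, under an assumed cluster size; it is not the full
protocol from [32].

    NUM_NODES = 4   # hypothetical cluster size

    def copies_for(fragment):
        # Primary copy of fragment i on node i; backup on the next node, wrapping.
        primary = fragment % NUM_NODES
        backup = (primary + 1) % NUM_NODES
        return primary, backup

    def serving_node(fragment, failed):
        # After a single node failure, each fragment is still served by
        # whichever of its two copies survives.
        primary, backup = copies_for(fragment)
        return primary if primary != failed else backup

    for f in range(NUM_NODES):
        print(f, copies_for(f), "served by", serving_node(f, failed=2))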

2.3.3 Shared Disk

Figure 7: Shared Disk Architecture

A shared-disk parallel machine is one in which all processors can access the same disks
with about the same performance, but are unable to access each other’s RAM. This
architecture is quite common in the very largest “single-box” (non-cluster)
multiprocessors, and hence is important for very large installations – especially for
Oracle, which does not sell a shared-nothing software platform. Shared disk has become
an increasingly attractive approach in recent years, with the advent of Network Attached
Storage devices (NAS), which allow a storage device on a network to be mounted by a
set of nodes.

One key advantage of shared-disk systems over shared-nothing is in usability, since
DBAs of shared-disk systems do not have to consider partitioning tables across machines.
Another feature of a shared-disk architecture is that the failure of a single DBMS
processing node does not affect the other nodes’ ability to access the full database. This
is in contrast to both shared-memory systems that fail as a unit, and shared-nothing
systems that lose at least some data upon a node failure. Of course this discussion puts
more emphasis on the reliability of the storage nodes.

Because there is no partitioning of the data in a shared disk system, data can be copied
into RAM and modified on multiple machines. Unlike shared-memory systems there is
no natural place to coordinate this sharing of the data – each machine has its own local
memory for locks and buffer pool pages. Hence there is a need to explicitly coordinate
data sharing across the machines. Shared-disk systems come with a distributed lock
manager facility, and a cache-coherency protocol for managing the distributed buffer
pools [7]. These are complex pieces of code, and can be bottlenecks for workloads with
significant contention.

2.3.4 NUMA

Non-Uniform Memory Access (NUMA) architectures are somewhat unusual, but available
from vendors like IBM. They provide a shared memory system where the time required
to access some remote memory can be much higher than the time to access local memory.
Although NUMA architectures are not especially popular today, they do bear a
resemblance to shared-nothing clusters in which the basic building block is a small (e.g.
4-way) multiprocessor. Because of the non-uniformity in memory access, DBMS
software tends to ignore the shared memory features of such systems, and treats them as
if they were (expensive) shared-nothing systems.

2.4 Admission Control


We close this section with one remaining issue related to supporting multiple concurrent
requests. As the workload is increased in any multi-user system, performance will
increase up to some maximum, and then begin to decrease radically as the system starts
to “thrash”. As in operating system settings, thrashing is often the result of memory
pressure: the DBMS cannot keep the “working set” of database pages in the buffer pool,
and spends all its time replacing pages. In DBMSs, this is particularly a problem with
query processing techniques like sorting and hash joins, which like to use large amounts
of main memory. In some cases, DBMS thrashing can also occur due to contention for
locks; transactions continually deadlock with each other and need to be restarted [2].
Hence any good multi-user system has an admission control policy, which does not admit
new clients unless the workload will stay safely below the maximum that can be handled
without thrashing. With a good admission controller, a system will display graceful
degradation under overload: transaction latencies will increase proportionally to their
arrival rate, but throughput will remain at peak.

Admission control for a DBMS can be done in two tiers. First, there may be a simple
admission control policy in the dispatcher process to ensure that the number of client
connections is kept below a threshold. This serves to prevent overconsumption of basic
resources like network connections, and minimizes unnecessary invocations of the query
parser and optimizer. In some DBMSs this control is not provided, under the assumption
that it is handled by some other piece of software interposed between clients and the
DBMS: e.g. an application server, transaction processing monitor, or web server.

The second layer of admission control must be implemented directly within the core
DBMS query processor. This execution admission controller runs after the query is
parsed and optimized, and determines whether a query is postponed or begins execution.
The execution admission controller is aided by information provided by the query
optimizer, which can estimate the resources that a query will require. In particular, the
optimizer’s query plan can specify (a) the disk devices that the query will access, and an
estimate of the number of random and sequential I/Os per device, (b) estimates of the
CPU load of the query, based on the operators in the query plan and the number of tuples
to be processed, and most importantly (c) estimates about the memory footprint of the
query data structures, including space for sorting and hashing tables. As noted above,
this last metric is often the key for an admission controller, since memory pressure is
often the main cause of thrashing. Hence many DBMSs use memory footprint as the
main criterion for admission control.
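
To make the memory-based criterion concrete, here is a small, hypothetical execution
admission controller in Python: a query is admitted only if the optimizer's estimated
memory footprint fits within a configured budget, and is otherwise queued. The budget,
the footprint estimates, and the interface are inventions of the sketch.

    from collections import deque

    class AdmissionController:
        def __init__(self, memory_budget):
            self.memory_budget = memory_budget   # memory available for sorts/hashes
            self.in_use = 0
            self.waiting = deque()               # postponed queries, FIFO

        def request(self, query_id, est_footprint):
            # Admit only if the workload stays safely below the thrashing point.
            if self.in_use + est_footprint <= self.memory_budget:
                self.in_use += est_footprint
                return True                      # query may begin execution
            self.waiting.append((query_id, est_footprint))
            return False                         # query is postponed

        def release(self, est_footprint):
            # Called when a query finishes; admit postponed queries that now fit.
            self.in_use -= est_footprint
            admitted = []
            while self.waiting and self.in_use + self.waiting[0][1] <= self.memory_budget:
                qid, fp = self.waiting.popleft()
                self.in_use += fp
                admitted.append(qid)
            return admitted

    ac = AdmissionController(memory_budget=1_000_000)
    print(ac.request("q1", 700_000))   # True: admitted
    print(ac.request("q2", 500_000))   # False: postponed
    print(ac.release(700_000))         # ['q2'] now admitted
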
2.5 Standard Practice
As should be clear, there are many design choices for process models in a DBMS, or any
large-scale server system. However, due both to historical legacy and the need for
extremely high performance, a few standard designs have emerged.

To summarize the state of the art for uniprocessor process models:


• Modern DBMSs are built using both “Process-per-User” and “Server Process”
models; the latter is more complex to implement but allows for higher
performance in some cases.
• Some Server Process systems (e.g. Oracle and Informix) implement a DBMS
thread package, which serves the role taken by OS threads in the model of Section
2.1. When this is done, DBMS threads are mapped to a smaller set of
“dispatchable units” as described in Section 2.2.
• Dispatchable units can be different across OS platforms as described in Section
2.2.1: either processes, or threads within a single process.

In terms of parallel architectures, today’s marketplace supports a mix of Shared-Nothing,
Shared-Memory and Shared-Disk architectures. As a rule, Shared-Nothing architectures
excel on price-performance for running complex queries on very large databases, and
hence they occupy a high-end niche in corporate decision support systems. The other
two typically perform better at the high end for processing multiple small transactions.
The evolution from a uniprocessor DBMS implementation to a Shared-Nothing
implementation is quite difficult, and at most companies was done by spawning a new
product line that was only later merged back into the core product. Oracle still does not
ship a Shared-Nothing implementation.

3 Storage Models
In addition to the process model, another basic consideration when designing a DBMS is
the choice of the persistent storage interface to use. There are basically two options: the
DBMS can interact directly with the device drivers for the disks, or the DBMS can use
the typical OS file system facilities. This decision has impacts on the DBMS’s ability to
control storage in both space and time. We consider these two dimensions in turn, and
proceed to discuss the use of the storage hierarchy in more detail.

3.1 Spatial Control


Sequential access to disk blocks is between 10 and 100 times faster than random access.
This gap is increasing quickly. Disk density – and hence sequential bandwidth –
improves following Moore’s Law, doubling every 18 months. Disk arm movement is
improving at a much slower rate. As a result, it is critical for the DBMS storage manager
to place blocks on the disk so that important queries can access data sequentially. Since
the DBMS can understand its workload more deeply than the underlying OS, it makes
sense for DBMS architects to exercise full control over the spatial positioning of database
blocks on disk.

The best way for the DBMS to control spatial locality of its data is to issue low-level
storage requests directly to the “raw” disk device interface, since disk device addresses
typically correspond closely to physical proximity of storage locations. Most commercial
database systems offer this functionality for peak performance. Although quite effective,
this technique has some drawbacks. First, it requires the DBA to devote entire disks to
the DBMS; this used to be frustrating when disks were very expensive, but it has become
far less of a concern today. Second, “raw disk” access interfaces are often OS-specific,
which can make the DBMS more difficult to port. However, this is a hurdle that most
commercial DBMS vendors chose to overcome years ago. Finally, developments in the
storage industry like RAID, Storage Area Networks (SAN), and Network-Attached
Storage (NAS) have become popular, to the point where “virtual” disk devices are the
norm in many scenarios today – the “raw” device interface is actually being intercepted
by appliances or software that reposition data aggressively on one or more physical disks.
As a result, the benefits of explicit physical control by the DBMS have been diluted over
time. We discuss this issue further in Section 6.2.

An alternative to raw disk access is for the DBMS to create a very large file in the OS file
system, and then manage positioning of data in the offsets of that file. This offers
reasonably good performance. In most popular filesystems, if you allocate a very large
file on an empty disk, the offsets in that file will correspond fairly well to physical
proximity of storage regions. Hence this is a good approximation to raw disk access,
without the need to go directly to the device interface. Most virtualized storage systems
are also designed to place close offsets in a file in nearby physical locations. Hence the
relative control lost when using large files rather than raw disks is becoming less
significant over time. However, using the filesystem interface has other ramifications,
which we discuss in the next subsection.

It is worth noting that in either of these schemes, the size of a database page is a tunable
parameter that can be set at the time of database generation; it should be a multiple of the
sizes offered by typical disk devices. If the filesystem is being used, special interfaces
may be required to write pages of a different size than the filesystem default; the POSIX
mmap/msync calls provide this facility. A discussion of the appropriate choice of page
sizes is given in the paper on the “5-minute rule” [20].
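
A minimal sketch of the “large file” approach under these assumptions: pages are a
fixed size, page n lives at byte offset n times the page size, and the DBMS reads and
writes pages with positioned I/O. The page size and file name are arbitrary choices for
the example, and the sketch assumes a POSIX platform.

    import os

    PAGE_SIZE = 8192   # an arbitrary (if typical) database page size

    def read_page(fd, page_no):
        # Page n lives at byte offset n * PAGE_SIZE in the single large file.
        return os.pread(fd, PAGE_SIZE, page_no * PAGE_SIZE)

    def write_page(fd, page_no, page_bytes):
        assert len(page_bytes) == PAGE_SIZE
        os.pwrite(fd, page_bytes, page_no * PAGE_SIZE)

    fd = os.open("dbfile", os.O_RDWR | os.O_CREAT, 0o600)
    write_page(fd, 3, bytes(PAGE_SIZE))   # a zero-filled page in slot 3
    print(len(read_page(fd, 3)))          # 8192
    os.close(fd)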

3.2 Temporal Control: Buffering


In addition to controlling where on the disk data should lie, a DBMS must control when
data gets physically written to the disk. As we will discuss in Section 5, a DBMS
contains critical logic that reasons about when to write blocks to disk. Most OS file
systems also provide built-in I/O buffering mechanisms to decide when to do reads and
writes of file blocks. If the DBMS uses standard file system interfaces for writing, the
OS buffering can confound the intention of the DBMS logic by silently postponing or
reordering writes. This can cause major problems for the DBMS.

The first set of problems regard the correctness of the database: the DBMS cannot ensure
correct transactional semantics without explicitly controlling the timing of disk writes.
As we will discuss in Section 5, writes to the log device must precede corresponding
writes to the database device, and commit requests cannot return to users until commit
log records have been reliably written to the log device.

The second set of problems with OS buffering concern performance, but have no
implications on correctness. Modern OS file systems typically have some built-in support
for read-ahead (speculative reads) and write-behind (postponed, batched writes), and
these are often poorly-suited to DBMS access patterns. File system logic depends on the
contiguity of physical byte offsets in files to make decisions about reads and writes.
DBMS-level I/O facilities can support logical decisions based on the DBMS’ behavior.
For example, the stream of reads in a query is often predictable to the DBMS, but not
physically contiguous on the disk, and hence not visible via the OS read/write API.
Logical DBMS-level read-ahead can occur when scanning the leaves of a B+-tree, for
example. Logical read-aheads are easily achieved in DBMS logic by a query thread
issuing I/Os in advance of its needs – the query plan contains the relevant information
about data access algorithms, and has full information about future access patterns for the
query. Similarly, the DBMS may want to make its own decisions about when to flush the
log buffer (often called the log “tail”), based on considerations that mix issues like lock
contention with I/O throughput. This mix of information is available to the DBMS, but
not to the OS file system.


The final performance issues are “double buffering” and the extreme CPU overhead of
memory copies. Given that the DBMS has to do its own buffering carefully for
correctness, any additional buffering by the OS is redundant. This redundancy results in
two costs. First, it wastes system memory, effectively limiting the memory available for
doing useful work. Second, it wastes time, by causing an additional copying step: on
reads, data is first copied from the disk to the OS buffer, and then copied again to the
DBMS buffer pool, about which we will say more shortly. On writes, both of these
copies are required in reverse. Copying data in memory can be a serious bottleneck in
DBMS software today. This fact is often a surprise to database students, who assume
that main-memory operations are “free” compared to disk I/O. But in practice, a well-
tuned database installation is typically not I/O-bound. This is achieved in high-end
installations by purchasing the right mix of disks and RAM so that repeated page requests
are absorbed by the buffer pool, and disk I/Os are shared across the disk arms at a rate
that can feed the appetite of all the processors in the system. Once this kind of “system
balance” is achieved, I/O latencies cease to be a bottleneck, and the remaining main-
memory bottlenecks become the limiting factors in the system. Memory copies are
becoming a dominant bottleneck in computer architectures: this is due to the gap in
performance evolution between raw CPU cycles per second (which follows Moore’s law)
and RAM access speed (which trails Moore’s law significantly).

The problems of OS buffering have been well-known in the database research literature
[64] and the industry for some time. Most modern operating systems now provide hooks
(e.g. the POSIX mmap/msync/madvise calls) for programs like database servers to
circumvent double-buffering of the file cache, ensuring that writes go through to disk
when requested and that some alternate replacement strategies can be hinted at by the
DBMS.
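
As an illustration of these hooks, the following sketch (assuming Linux and Python 3.8
or later) maps a file, hints a random-access pattern with madvise, and forces a
modified page through to disk with a synchronous flush (msync under the covers). It
shows only the interfaces, not how a real buffer manager would use them.

    import mmap
    import os

    PAGE_SIZE = 8192
    fd = os.open("dbfile", os.O_RDWR | os.O_CREAT, 0o600)
    os.ftruncate(fd, 16 * PAGE_SIZE)      # room for 16 "pages"

    mm = mmap.mmap(fd, 16 * PAGE_SIZE)
    mm.madvise(mmap.MADV_RANDOM)          # hint: OS read-ahead will not help here

    # Modify "page" 5, then force it to stable storage before continuing,
    # as a DBMS must do for a commit log record.
    offset = 5 * PAGE_SIZE
    mm[offset:offset + 4] = b"DATA"
    mm.flush(offset, PAGE_SIZE)           # msync() on the affected range

    mm.close()
    os.close(fd)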

3.3 Buffer Management


In order to provide efficient access to database pages, every DBMS implements a large
shared buffer pool in its own memory space. The buffer pool is organized as an array of
frames, each frame being a region of memory the size of a database disk block. Blocks
are copied in native format from disk directly into frames, manipulated in memory in
native format, and written back. This translation-free approach avoids CPU bottlenecks
in “marshalling” and “unmarshalling” data to/from disk; perhaps more importantly, the
fixed-sized frames sidestep complexities of external memory fragmentation and
compaction that are associated with generic memory management.

Associated with the array of frames is an array of metadata called a page table, with one
entry for each frame. The page table contains the disk location for the page currently in
each frame, a dirty bit to indicate whether the page has changed since it was read from
disk, and any information needed by the page replacement policy used for choosing
pages to evict on overflow. It also contains a pin count for the page in the frame; the
page is not a candidate for page replacement unless the pin count is 0. This allows tasks to
(hopefully briefly) “pin” pages into the buffer pool by incrementing the pin count before
manipulating the page, and decrementing it thereafter.
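
The following is a much-simplified sketch of a buffer pool's frame array and page
table, with dirty bits and pin counts as described above. The frame count and the
read_from_disk placeholder are inventions of the sketch, and no replacement policy or
write-back of dirty victims is shown.

    PAGE_SIZE = 8192
    NUM_FRAMES = 4   # a tiny pool, purely for illustration

    frames = [bytearray(PAGE_SIZE) for _ in range(NUM_FRAMES)]

    # One page-table entry per frame: which disk page it holds, whether it has
    # been modified since it was read, and how many tasks have it pinned.
    page_table = [{"page_no": None, "dirty": False, "pin_count": 0}
                  for _ in range(NUM_FRAMES)]

    def read_from_disk(page_no):
        return bytes(PAGE_SIZE)   # placeholder for an actual disk read

    def pin(page_no):
        # Return the index of the frame holding page_no, reading it in if needed.
        for i, entry in enumerate(page_table):
            if entry["page_no"] == page_no:
                entry["pin_count"] += 1
                return i
        # Not resident: reuse a frame with pin_count == 0 (a dirty victim would
        # first have to be written back; that step is omitted here).
        for i, entry in enumerate(page_table):
            if entry["pin_count"] == 0:
                frames[i][:] = read_from_disk(page_no)
                page_table[i] = {"page_no": page_no, "dirty": False, "pin_count": 1}
                return i
        raise RuntimeError("all frames are pinned")

    def unpin(frame, dirtied=False):
        page_table[frame]["pin_count"] -= 1
        if dirtied:
            page_table[frame]["dirty"] = True

    f = pin(42)
    frames[f][0:5] = b"hello"
    unpin(f, dirtied=True)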

Much research in the early days of relational systems focused on the design of page
replacement policies. The basic tension surrounded the looping access patterns resulting
from nested-loops joins, which scanned and rescanned a heap file larger than the buffer
pool. For such looping patterns, recency of reference is a pessimal predictor of future
reuse, so OS page replacement schemes like LRU and CLOCK were well known to
perform poorly for database queries [64]. A variety of alternative schemes were
proposed, including some that attempted to tune the replacement strategy via query plan
information [10]. Today, most systems use simple enhancements to LRU schemes to
account for the case of nested loops; one that appears in the research literature and has
been implemented in commercial systems is LRU-2 [48]. Another scheme used in
commercial systems is to have the replacement policy depend on the page type: e.g. the
root of a B+-tree might be replaced with a different strategy than a page in a heap file.
This is reminiscent of Reiter’s Domain Separation scheme [55][10].
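
As a rough illustration of the flavor of LRU-2 (not the exact algorithm of [48]), the
sketch below records the last two reference times of each page and evicts the unpinned
page whose second-most-recent reference is oldest, treating pages referenced only once
as the best victims.

    import itertools

    clock = itertools.count()   # logical timestamps for page references
    history = {}                # page_no -> [last, second-to-last] reference times
    pinned = set()              # pages that must not be evicted

    def reference(page_no):
        t = next(clock)
        last = history.get(page_no, [None, None])[0]
        history[page_no] = [t, last]

    def choose_victim():
        # Evict the unpinned page with the oldest second-to-last reference;
        # pages with only one recorded reference sort first.
        candidates = [p for p in history if p not in pinned]
        def second_reference(p):
            second = history[p][1]
            return -1 if second is None else second
        return min(candidates, key=second_reference)

    for p in [1, 2, 1, 3, 2, 1]:
        reference(p)
    print(choose_victim())   # page 3: it has been referenced only once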

3.4 Standard Practice


In the last few years, commercial filesystems have evolved to the point where they can
now support database storage quite well. The standard usage model is to allocate a single
large file in the filesystem on each disk, and let the DBMS manage placement of data
within that file via interfaces like the mmap suite. In this configuration, modern
filesystems now offer reasonable spatial and temporal control to the DBMS. This storage
model is available in essentially all database system implementations. However, the raw
disk code in many of the DBMS products long predates the maturation of filesystems,
and provides explicit performance control to the DBMS without any worry about subtle
filesystem interactions. Hence raw disk support remains a common high-performance
option in most database systems.

4 Query Processor
The previous sections stressed the macro-architectural design issues in a DBMS. We now
begin a sequence of sections discussing design at a somewhat finer grain, addressing each
of the main DBMS components in turn. We start with the query processor.

A relational query engine takes a declarative SQL statement, validates it, optimizes it into
a procedural dataflow implementation plan, and (subject to admission control) executes
that dataflow on behalf of a client program, which fetches (“pulls”) the result tuples,
typically one at a time or in small batches. The components of a relational query engine
are shown in Figure 1; in this section we concern ourselves with both the query processor
and some non-transactional aspects of the storage manager’s access methods. In general,
relational query processing can be viewed as a single-user, single-threaded task –
concurrency control is managed transparently by lower layers of the system described in
Section 5. The only exception to this rule is that the query processor must explicitly pin
and unpin buffer pool pages when manipulating them, as we note below. In this section
we focus on the common case SQL commands: “DML” statements including SELECT,
INSERT, UPDATE and DELETE.
4.1 Parsing and Authorization
Given an SQL statement, the main tasks for the parser are to check that the query is
correctly specified, to convert it into an internal format, and to check that the user is
authorized to execute the query. Syntax checking is done naturally as part of the parsing
process, during which time the parser generates an internal representation for the query.

The parser handles queries one “SELECT” block at a time. First, it considers each of the
table references in the FROM clause. It canonicalizes each table name into a
schema.tablename format; users have default schemas which are often omitted from the
query specification. It then invokes the catalog manager to check that the table is
registered in the system catalog; while so checking it may also cache metadata about the
table in internal query data structures. Based on information about the table, it then uses
the catalog to check that attribute references are correct. The data types of attributes are
used to drive the (rather intricate) disambiguation logic for overloaded functional
expressions, comparison operators, and constant expressions. For example, in the
expression “(EMP.salary * 1.15) < 75000”, the code for the multiplication function and
comparison operator – and the assumed data type and internal format of the strings
“1.15” and “75000” – will depend upon the data type of the EMP.salary attribute, which
may be an integer, a floating-point number, or a “money” value. Additional standard
SQL syntax checks are also applied, including the usage of tuple variables, the
compatibility of tables combined via set operators (UNION/INTERSECT/EXCEPT), the
usage of attributes in the SELECT list of aggregation queries, the nesting of subqueries,
and so on.

If the query parses correctly, the next phase is to check authorization. Again, the catalog
manager is invoked to ensure that the user has the appropriate permissions
(SELECT/DELETE/INSERT/UPDATE) on the tables in the query. Additionally,
integrity constraints are consulted to ensure that any constant expressions in the query do
not result in constraint violations. For example, an UPDATE command may have a
clause of the form “SET EMP.salary = -1”. If there is an integrity constraint specifying
positive values for salaries, the query will not be authorized for execution.

If a query parses and passes authorization checks, then the internal format of the query is
passed on to the query rewrite module for further processing.
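
A highly simplified sketch of the canonicalization and authorization checks just
described, against an invented in-memory “catalog”; a real parser is of course driven
by a full SQL grammar, and the schema names, tables, and privileges here are made up.

    CATALOG = {
        # schema.tablename -> privileges held by the (single, toy) user
        "hr.emp":  {"SELECT", "UPDATE"},
        "hr.dept": {"SELECT"},
    }
    DEFAULT_SCHEMA = "hr"

    def canonicalize(table_ref):
        # Users may omit the schema; fill in their default schema.
        return table_ref if "." in table_ref else DEFAULT_SCHEMA + "." + table_ref

    def check(table_refs, privilege):
        for ref in table_refs:
            name = canonicalize(ref)
            if name not in CATALOG:
                raise ValueError("unknown table: " + name)
            if privilege not in CATALOG[name]:
                raise PermissionError(privilege + " denied on " + name)
        return True

    print(check(["emp", "hr.dept"], "SELECT"))   # True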

4.1.1 A Note on Catalog Management


The database catalog is a form of metadata: information about the data in the system.
The catalog is itself stored as a set of tables in the database, recording the names of basic
entities in the system (users, schemas, tables, columns, indexes, etc.) and their
relationships. By keeping the metadata in the same format as the data, the system is
made both more compact and simpler to use: users can employ the same language and
tools to investigate the metadata that they use for other data, and the internal system code
for managing the metadata is largely the same as the code for managing other tables.
This code and language reuse is an important lesson that is often overlooked in early-
stage implementations, typically to the significant regret of developers later on. (One of
the authors witnessed this mistake yet again in an industrial setting within the last few
years!)

For efficiency, basic catalog data is treated somewhat differently from normal tables.
High-traffic portions of the catalog are often materialized in main memory at bootstrap
time, typically in data structures that “denormalize” the flat relational structure of the
catalogs into a main-memory network of objects. This lack of data independence in
memory is acceptable because the in-memory data structures are used in a stylized
fashion only by the query parser and optimizer. Additional catalog data is cached in
query plans at parsing time, again often in a denormalized form suited to the query.
Moreover, catalog tables are often subject to special-case transactional tricks to minimize
“hot spots” in transaction processing.

It is worth noting that catalogs can become formidably large in commercial applications.
One major Enterprise Resource Planning application generates over 30,000 tables, with
between 4 and 8 columns per table, and typically two or three indexes per table.

4.2 Query Rewrite


The query rewrite module is responsible for a number of tasks related to simplifying and
optimizing the query, typically without changing its semantics. The key in all these tasks
is that they can be carried out without accessing the data in the tables – all of these
techniques rely only on the query and on metadata in the catalog. Although we speak of
“rewriting” the query, in fact most rewrite systems operate on internal representations of
the query, rather than on the actual text of a SQL statement.

• View rewriting: The most significant role in rewriting is to handle views. The
rewriter takes each view reference that appeared in the FROM clause, and gets the
view definition from the catalog manager. It then rewrites the query to remove
the view, replacing it with the tables and predicates referenced by the view, and
rewriting any predicates that reference the view to instead reference columns from
the tables in the view. This process is applied recursively until the query is
expressed exclusively over base tables. This view expansion technique, first
proposed for the set-based QUEL language in INGRES [63], requires some care
in SQL to correctly handle duplicate elimination, nested queries, NULLs, and
other tricky details [51].
• Constant arithmetic evaluation: Query rewrite can simplify any arithmetic
expressions that do not contain tuple variables: e.g. “R.x < 10+2” is rewritten as
“R.x < 12”.
• Logical rewriting of predicates: Logical rewrites are applied based on the
predicates and constants in the WHERE clause. Simple Boolean logic is often
applied to improve the match between expressions and the capabilities of index-
based access methods: for example, a predicate like “NOT Emp.Salary >
1000000” may be rewritten as “Emp.Salary <= 1000000”. These logical rewrites
can even short-circuit query execution, via simple satisfiability tests: for example,
the expression “Emp.salary < 75000 AND Emp.salary > 1000000” can be
replaced with FALSE, possibly allowing the system to return an empty query
result without any accesses to the database. Unsatisfiable queries may seem
implausible, but recall that predicates may be “hidden” inside view definitions,
and unknown to the writer of the outer query – e.g. the query above may have
resulted from a query for low-paid employees over a view called “Executives”.

An additional, important logical rewrite uses the transitivity of predicates to
induce new predicates: e.g. “R.x < 10 AND R.x = S.y” suggests adding the
additional predicate “AND S.y < 10”. Adding these transitive predicates
increases the ability of the optimizer to choose plans that filter data early in
execution, especially through the use of index-based access methods. (A small
sketch of constant folding and transitive predicates appears at the end of this
subsection.)
• Semantic optimization: In many cases, integrity constraints on the schema are
stored in the catalog, and can be used to help rewrite some queries. An important
example of such optimization is redundant join elimination. This arises when
there are foreign key constraints from a column of one table (e.g. Emp.deptno) to
another table (Dept). Given such a foreign key constraint, it is known that there is
exactly one Dept for each Emp. Consider a query that joins the two tables but
does not make use of the Dept columns:
SELECT Emp.name, Emp.salary
FROM Emp, Dept
WHERE Emp.deptno = Dept.dno
Such queries can be rewritten to remove the Dept table, and hence the join. Again,
such seemingly implausible scenarios often arise naturally via views – for
example, a user may submit a query about employee attributes over a view
EMPDEPT that joins those two tables. Semantic optimizations can also lead to
short-circuited query execution, when constraints on the tables are incompatible
with query predicates.
• Subquery flattening and other heuristic rewrites: In many systems, queries are
rewritten to get them into a form that the optimizer is better equipped to handle.
In particular, most optimizers operate on individual SELECT-FROM-WHERE
query blocks in isolation, forgoing possible opportunities to optimize across
blocks. Rather than further complicate query optimizers (which are already quite
complex in commercial DBMSs), a natural heuristic is to flatten nested queries
when possible to expose further opportunities for single-block optimization. This
turns out to be very tricky in some cases in SQL, due to issues like duplicate
semantics, subqueries, NULLs and correlation [51][58]. Other heuristic rewrites
are possible across query blocks as well – for example, predicate transitivity can
allow predicates to be copied across subqueries [40]. It is worth noting that the
flattening of correlated subqueries is especially important for achieving good
performance in parallel architectures, since the “nested-loop” execution of
correlated subqueries is inherently serialized by the iteration through the loop.

When complete, the query rewrite module produces an internal representation of the
query in the same internal format that it accepted at its input.
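
To illustrate the constant-arithmetic and transitivity rewrites mentioned above, here
is a tiny sketch over an invented predicate representation (each predicate is a
(left, operator, right) triple, with columns as strings and constants as numbers). It
is only a toy, not a real rewrite engine.

    def fold_constants(pred):
        # "R.x < 10 + 2" arrives as ("R.x", "<", ("+", 10, 2)); fold to ("R.x", "<", 12).
        left, op, right = pred
        if isinstance(right, tuple) and right[0] == "+":
            right = right[1] + right[2]
        return (left, op, right)

    def add_transitive(preds):
        # From "R.x < 12" and "R.x = S.y", induce the new predicate "S.y < 12".
        derived = []
        for l1, op1, r1 in preds:
            if op1 != "=":
                continue
            for l2, op2, r2 in preds:
                if op2 in ("<", "<=", ">", ">=") and l2 == l1 and isinstance(r2, (int, float)):
                    derived.append((r1, op2, r2))
        return preds + [d for d in derived if d not in preds]

    preds = [fold_constants(("R.x", "<", ("+", 10, 2))), ("R.x", "=", "S.y")]
    print(add_transitive(preds))
    # [('R.x', '<', 12), ('R.x', '=', 'S.y'), ('S.y', '<', 12)]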


4.3 Optimizer
Given an internal representation of a query, the job of the query optimizer is to produce
an efficient query plan for executing the query (Figure 8). A query plan can be thought
of as a dataflow diagram starting from base relations, piping data through a graph of
query operators. In most systems, queries are broken into SELECT-FROM-WHERE
query blocks. The optimization of each individual query block is done using techniques
similar to those described in the famous paper by Selinger, et al. on the System R
optimizer [57]. Typically, at the top of each query block a few operators may be added as
post-processing to compute GROUP BY, ORDER BY, HAVING and DISTINCT clauses
if they exist. Then the various blocks are stitched together in a straightforward fashion.

Figure 8: A Query Plan. Note that only the main physical operators are shown.

The original System R prototype compiled query plans into machine code, whereas the
early INGRES prototype generated an interpretable query plan. Query interpretation was
listed as a “mistake” by the INGRES authors in their retrospective paper in the early
1980’s [63], but Moore’s law and software engineering have vindicated the INGRES
decision to some degree. In order to enable cross-platform portability, every system now
compiles queries into some kind of interpretable data structure; the only difference across
systems these days is the level of abstraction. In some systems the query plan is a very
lightweight object, not unlike a relational algebra expression annotated with the names of
access methods, join algorithms, and so on. Other systems use a lower-level language of
“op-codes”, closer in spirit to Java byte codes than to relational algebra expressions. For
simplicity in our discussion, we will focus on algebra-like query representations in the
remainder of this paper.

Although Selinger’s paper is widely considered the “bible” of query optimization, it was
preliminary research, and all systems extend it in a number of dimensions. We consider
some of the main extensions here.


• Plan space: The System R optimizer constrained its plan space somewhat by
focusing only on “left-deep” query plans (where the right-hand input to a join
must be a base table), and by “postponing Cartesian products” (ensuring that
Cartesian products appear only after all joins in a dataflow.) In commercial
systems today, it is well known that “bushy” trees (with nested right-hand inputs)
and early use of Cartesian products can be useful in some cases, and hence both
options are considered in most systems.
• Selectivity estimation: The selectivity estimation techniques in the Selinger
paper are naïve, based on simple table and index cardinalities. Most systems
today have a background process that periodically analyzes and summarizes the
distributions of values in attributes via histograms and other summary statistics.
Selectivity estimates for joins of base tables can be made by “joining” the
histograms on the join columns (a small sketch of this idea appears after this
list of extensions). To move beyond single-column histograms,
more sophisticated schemes have been proposed in the literature in recent years to
incorporate issues like dependencies among columns [52] [11]; these innovations
have yet to show up in products. One reason for the slow adoption of these
schemes is a flaw in the industry benchmarks: the data generators in benchmarks
like TPC-H generate independent values in columns, and hence do not encourage
the adoption of technology to handle “real” data distributions. Nonetheless, the
benefits of improved selectivity estimation are widely recognized: as noted by
Ioannidis and Christodoulakis, errors in selectivity early in optimization propagate
multiplicatively up the plan tree, resulting in terrible subsequent estimations [32].
Hence improvements in selectivity estimation often merit the modest
implementation cost of smarter summary statistics, and a number of companies
appear to be moving toward modeling dependencies across columns.
• Search Algorithms: Some commercial systems – notably those of Microsoft and
Tandem – discard Selinger’s dynamic programming algorithm in favor of a goal-
directed “top-down” search scheme based on the Cascades framework [17]. Top-
down search can in some instances lower the number of plans considered by an
optimizer [60], but can also have the negative effect of increasing optimizer
memory consumption. If practical success is an indication of quality, then the
choice between top-down search and dynamic programming is irrelevant – each
has been shown to work well in state-of-the-art optimizers, and both still have
runtimes and memory requirements that are exponential in the number of tables in
a query.

It is also important to note that some systems fall back on heuristic search
schemes for queries with “too many” tables. Although there is an interesting
research literature of randomized query optimization heuristics [34][5][62], the
heuristics used in commercial systems tend to be proprietary, and (if rumors are to
be believed) do not resemble the randomized query optimization literature. An
educational exercise is to examine the query “optimizer” of the open-source
MySQL engine, which (at last check) is entirely heuristic and relies mostly on
exploiting indexes and key/foreign-key constraints. This is reminiscent of early
(and infamous) versions of Oracle. In some systems, a query with too many
tables in the FROM clause can only be executed if the user explicitly directs the
optimizer how to choose a plan (via so-called optimizer “hints” embedded in the
SQL).
• Parallelism: Every commercial DBMS today has some support for parallel
processing, and most support “intra-query” parallelism: the ability to speed up a
single query via multiple processors. The query optimizer needs to get involved
in determining how to schedule operators – and parallelized operators – across
multiple CPUs, and (in the shared-nothing or shared-disk cases) multiple separate
computers on a high-speed network. The standard approach was proposed by
Hong and Stonebraker [31] and uses two phases: first a traditional single-site
optimizer is invoked to pick the best single-site plan, and then this plan is
scheduled across the multiple processors. Research has been published on this
latter phase [14][15] though it is not clear to what extent these results inform
standard practice – currently this seems to be more like art than science.
• Extensibility: Modern SQL standards include user-defined types and functions,
complex objects (nested tuples, sets, arrays and XML trees), and other features.
Commercial optimizers try to handle these extensions with varying degrees of
intelligence. One well-scoped issue in this area is to incorporate the costs of
expensive functions into the optimization problem as suggested in [29]. In most
commercial implementations, simple heuristics are still used, though more
thorough techniques are presented in the research literature [28][9]. Support for
complex objects is gaining importance as nested XML data is increasingly stored
in relational engines. This has generated large volumes of work in the object-
oriented [50] and XML [25] query processing literature.

Having an extensible version of a Selinger optimizer as described by Lohman [42]
can be useful for elegantly introducing new operators into the query engine; this is
presumably the approach taken in IBM’s products. A related approach for top-
down optimizers was developed by Graefe [18][17], and is likely used in
Microsoft SQL Server.
• Auto-Tuning: A variety of ongoing industrial research efforts attempt to
improve the ability of a DBMS to make tuning decisions automatically. Some of
these techniques are based on collecting a query workload, and then using the
optimizer to find the plan costs via various “what-if” analyses: what if other
indexes had existed, or the data had been laid out differently. An optimizer needs
to be adjusted somewhat to support this activity efficiently, as described by
Chaudhuri [8].

4.3.1 A Note on Query Compilation and Recompilation


SQL supports the ability to “prepare” a query: to pass it through the parser, rewriter and
optimizer, and store the resulting plan in a catalog table. This is even possible for
embedded queries (e.g. from web forms) that have program variables in the place of
query constants; the only wrinkle is that during selectivity estimation, the variables that
are provided by the forms are assumed by the optimizer to have some “typical” values.
Query preparation is especially useful for form-driven, canned queries: the query is
prepared when the application is written, and when the application goes live, users do not
experience the overhead of parsing, rewriting and optimizing. In practice, this feature is
used far more heavily than ad-hoc queries that are optimized at runtime.

As a database evolves, it often becomes necessary to re-optimize prepared plans. At a
minimum, when an index is dropped, any plan that used that index must be removed from
the catalog of stored plans, so that a new plan will be chosen upon the next invocation.

Other decisions about re-optimizing plans are more subtle, and expose philosophical
distinctions among the vendors. Some vendors (e.g. IBM) work very hard to provide
predictable performance. As a result, they will not reoptimize a plan unless it will no
longer execute, as in the case of dropped indexes. Other vendors (e.g. Microsoft) work
very hard to make their systems self-tuning, and will reoptimize plans quite aggressively:
they may even reoptimize, for example, if the value distribution of a column changes
significantly, since this may affect the selectivity estimates, and hence the choice of the
best plan. A self-tuning system is arguably less predictable, but more efficient in a
dynamic environment.

This philosophical distinction arises from differences in the historical customer base for
these products, and is in some sense self-reinforcing. IBM traditionally focused on high-
end customers with skilled DBAs and application programmers. In these kinds of high-
budget IT shops, predictable performance from the database is of paramount importance
– after spending months tuning the database design and settings, the DBA does not want
the optimizer to change its mind unpredictably. By contrast, Microsoft strategically
entered the database market at the low end; as a result, their customers tend to have lower
IT budgets and expertise, and want the DBMS to “tune itself” as much as possible.

Over time these companies’ business strategies and customer bases have converged so
that they compete directly. But the original philosophies tend to peek out in the system
architecture, and in the way that the architecture affects the use of the systems by DBAs
and database programmers.

4.4 Executor
A query executor is given a fully-specified query plan, which is a fixed, directed dataflow
graph connecting operators that encapsulate base-table access and various query
execution algorithms. In some systems this dataflow graph is already compiled into op-
codes by the optimizer, in which case the query executor is basically a runtime
interpreter. In other systems a representation of the dataflow graph is passed to the query
executor, which recursively invokes procedures for the operators based on the graph
layout. We will focus on this latter case; the op-code approach essentially compiles the
logic we described here into a program.

class iterator {
    iterator &inputs[];   // children of this operator: the edges of the dataflow graph
    void init();          // set up state; typically initializes the inputs
    tuple get_next();     // return the next output tuple, or NULL when exhausted
    void close();         // release resources; typically closes the inputs
}
Figure 9: Iterator superclass pseudocode.
Essentially all modern query executors employ the iterator model, which was used in the
earliest relational systems. Iterators are most simply described in an object-oriented
fashion. All operators in a query plan – the nodes in the dataflow graph – are
implemented as objects from the superclass iterator. A simplified definition for an
iterator is given in Figure 9. Each iterator specifies its inputs, which define the edges in
the dataflow graph. Each query execution operator is implemented as a subclass of the
iterator class: the set of subclasses in a typical system might include filescan, indexscan,
nested-loops join, sort, merge-join, hash-join, duplicate-elimination, and grouped-
aggregation. An important feature of the iterator model is that any subclass of iterator
can be used as input to any other – hence each iterator’s logic is independent of its
children and parents in the graph, and there is no need to write special-case code for
particular combinations of iterators.
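
To make this interface concrete, here is a minimal sketch (in C++, with invented class and
member names, not drawn from any particular system) of how a selection (filter) operator
might be written against the iterator interface of Figure 9:

    #include <functional>
    #include <memory>

    // Hypothetical tuple type; real systems pass tuple descriptors (Section 4.4.2).
    struct Tuple { /* column references would live here */ };

    // The iterator superclass of Figure 9, expressed as an abstract C++ class.
    class Iterator {
    public:
        virtual void init() = 0;         // set up state, initialize children
        virtual Tuple* get_next() = 0;   // return the next tuple, or nullptr at end
        virtual void close() = 0;        // release resources, close children
        virtual ~Iterator() = default;
    };

    // A selection (filter) operator: any iterator can serve as its input.
    class FilterIterator : public Iterator {
    public:
        FilterIterator(std::unique_ptr<Iterator> child,
                       std::function<bool(const Tuple&)> predicate)
            : child_(std::move(child)), predicate_(std::move(predicate)) {}

        void init() override { child_->init(); }

        Tuple* get_next() override {
            // Pull from the child until a tuple satisfies the predicate,
            // or the child is exhausted.
            while (Tuple* t = child_->get_next()) {
                if (predicate_(*t)) return t;
            }
            return nullptr;
        }

        void close() override { child_->close(); }

    private:
        std::unique_ptr<Iterator> child_;
        std::function<bool(const Tuple&)> predicate_;
    };

Because FilterIterator invokes only init(), get_next() and close() on its child, it composes
with any other iterator subclass; this is exactly the independence property described above.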

Graefe provides more details on iterators in his query processing survey [18]. The
interested reader is encouraged to examine the open-source PostgreSQL code base, which
includes moderately sophisticated implementations of the iterators for most standard
query execution algorithms.

4.4.1 Iterator Discussion


An important property of iterators is that they couple dataflow with control flow. The
get_next() call is a standard procedure call, returning a tuple reference to the callee via
the call stack. Hence a tuple is returned to a parent in the graph exactly when control is
returned. This implies that only a single DBMS thread is needed to execute an entire
query graph, and there is no need for queues or rate-matching between iterators. This
makes relational query executors clean to implement and easy to debug, and is a contrast
with dataflow architectures in other environments, e.g. networks, which rely on various
protocols for queueing and feedback between concurrent producers and consumers.

The single-threaded iterator architecture is also quite efficient for single-site query
execution. In most database applications, the performance metric of merit is time to
query completion. In a single-processor environment, time to completion for a given
query plan is achieved when resources are fully utilized. In an iterator model, since one
of the iterators is always active, resource utilization is maximized.3

3 This assumes that iterators never block waiting for I/O requests. As noted above, I/O
prefetching is typically handled by a separate thread. In the cases where prefetching is
ineffective, there can indeed be inefficiencies in the iterator model. This is typically not a
big problem in single-site databases, though it arises frequently when executing queries
over remote tables [16][43].

As we mentioned previously, support for parallel query execution is standard in most
modern DBMSs. Fortunately, this support can be provided with essentially no changes to
the iterator model or a query execution architecture, by encapsulating parallelism and
network communication within special exchange iterators, as described by Graefe [16].

4.4.2 Where’s the Data?


Our discussion of iterators has conveniently sidestepped any questions of memory
allocation for in-flight data; we never specified how tuples were stored in memory, or
how they were passed from iterator to iterator. In practice, each iterator has a fixed
number of tuple descriptors pre-allocated: one for each of its inputs, and one for its
output. A tuple descriptor is typically an array of column references, where each column
reference is composed of a reference to a tuple somewhere else in memory, and a column
offset in that tuple. The basic iterator “superclass” logic never dynamically allocates
memory, which raises the question of where the actual tuples being referenced are stored
in memory.

There are two alternative answers to this question. The first possibility is that base-table
tuples can reside in pages in the buffer pool; we will call these BP-tuples. If an iterator
constructs a tuple descriptor referencing a BP-tuple, it must increment the pin count of
the tuple’s page; it decrements the pin count when the tuple descriptor is cleared. The
second possibility is that an iterator implementation may allocate space for a tuple on the
memory heap; we will call this an M-tuple. It may construct an M-tuple by copying
columns from the buffer pool (the copy bracketed by a pin/unpin pair), and/or by
evaluating expressions (e.g. arithmetic expressions like “EMP.sal * 0.1”) in the query
specification.
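
The following sketch (with invented type names and a deliberately simplified buffer-pool
interface) illustrates one plausible way a tuple descriptor can reference either a BP-tuple,
pinning its page, or plain heap memory for an M-tuple; real systems differ in the details:

    #include <cassert>
    #include <cstddef>
    #include <vector>

    // Hypothetical buffer-pool page; only pin-count maintenance is shown.
    struct Page { int pin_count = 0; };
    void pin(Page* p)   { ++p->pin_count; }
    void unpin(Page* p) { assert(p->pin_count > 0); --p->pin_count; }

    // A column reference: a pointer to a tuple somewhere in memory plus a column offset.
    struct ColumnRef {
        const char* tuple_base = nullptr;  // start of the referenced tuple (BP- or M-tuple)
        std::size_t offset = 0;            // byte offset of the column within that tuple
    };

    // A tuple descriptor: an array of column references. If the referenced tuple is a
    // BP-tuple, the descriptor also remembers the page so it can unpin it when cleared.
    struct TupleDescriptor {
        std::vector<ColumnRef> columns;
        Page* pinned_page = nullptr;       // non-null only while referencing a BP-tuple

        void reference_bp_tuple(Page* page, const char* tuple_base) {
            pin(page);                     // must pin before referencing
            pinned_page = page;
            for (auto& c : columns) c.tuple_base = tuple_base;
        }

        void clear() {
            if (pinned_page) { unpin(pinned_page); pinned_page = nullptr; }
            for (auto& c : columns) c.tuple_base = nullptr;
        }
    };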

An attractive design pitfall is to always copy data out of the buffer pool immediately into
M-tuples. This design uses M-tuples as the only in-flight tuple structure, which simplifies
the executor code. It also circumvents bugs that can result from having buffer-pool pin
and unpin calls separated by long periods of execution (and many lines of code) – one
common bug of this sort is to forget to unpin the page altogether (a “buffer leak”).
Unfortunately, exclusive use of M-tuples can be a major performance problem, since
memory copies are often a serious bottleneck in high-performance systems, as noted in
Section 3.2.

On the other hand, there are cases where constructing an M-tuple makes sense. It is
sometimes beneficial to copy a tuple out of the buffer pool if it will be referenced for a
long period of time. As long as a BP-tuple is directly referenced by an iterator, the page
on which the BP-tuple resides must remain pinned in the buffer pool. This consumes a
page worth of buffer pool memory, and ties the hands of the buffer replacement policy.


The upshot of this discussion is that it is most efficient to support tuple descriptors that
can reference both BP-tuples and M-tuples.

4.4.3 Data Modification Statements


Up to this point we have only discussed queries – i.e., read-only SQL statements.
Another class of DML statements modify data: INSERT, DELETE and UPDATE
statements. Typically, execution plans for these statements look like simple straight-line
query plans, with a single access method as the source, and a data modification operator
at the end of the pipeline.

In some cases, however, these plans are complicated by the fact that they both query and
modify the same data. This mix of reading and writing the same table (possibly multiple
times) raises some complications. A simple example is the notorious “Halloween
problem”, so called because it was discovered on October 31st by the System R group.
The Halloween problem arises from a particular execution strategy for statements like
“give everyone whose salary is under $20K a 10% raise”. A naïve plan for this query
pipelines an indexscan iterator over the Emp.salary field into an update iterator (the left-
hand side of Figure 10); the pipelining provides good I/O locality, because it modifies
tuples just after they are fetched from the B+-tree. However, this pipelining can also
result in the indexscan “rediscovering” a previously-modified tuple that moved rightward
in the tree after modification – resulting in multiple raises for each employee. In our
example, all low-paid employees will receive repeated raises until they earn more than
$20K; this is not the intention of the statement.


[Figure 10 diagram: the statement is UPDATE EMP SET salary=salary*1.1 WHERE salary < 20000.
The left-hand plan pipelines IndexScan(EMP) directly into Update(EMP). The right-hand plan
runs IndexScan(EMP) into Materialize(RID); the materialized RIDs are then scanned (HeapScan)
and passed through Fetch-by-RID(EMP) into Update(EMP).]

Figure 10: Two query plans for updating a table via an IndexScan. The plan on the left is susceptible
to the Halloween problem. The plan on the right is safe, since it identifies all tuples to be updated before
doing any updates.

SQL semantics forbid this behavior: an SQL statement is not allowed to “see” its own
updates. Some care is needed to ensure that this visibility rule is observed. A simple,
safe implementation has the query optimizer choose plans that avoid indexes on the
updated column, but this can be quite inefficient in some cases. Another technique is to
use a batch read-then-write scheme, which interposes Record-ID materialization and
fetching operators between the index scan and the data modification operators in the
dataflow (right-hand side of Figure 10). This materialization operator receives the IDs of
all tuples to be modified and stores them in a temporary file; it then scans the temporary
file, fetching each physical tuple by its RID and feeding the resulting tuple to the data
modification operator. In most cases if an index was chosen by the optimizer, it implies
that only a few tuples are being changed, and hence the apparent inefficiency of this
technique may be acceptable, since the temporary table is likely to remain entirely in the
buffer pool. Pipelined update schemes are also possible, but require (somewhat exotic)
multiversion support from the storage engine [54].
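
As a rough illustration of the batch read-then-write technique, the following sketch
(hypothetical helper functions, stubbed out so it compiles; it is not the actual operator
machinery of any system) separates RID materialization from the updates themselves:

    #include <vector>

    struct RID { long page_no; int slot_no; };   // physical record identifier
    struct Tuple { double salary; };             // placeholder tuple layout

    // Hypothetical hooks into the access methods; stubbed out for the sketch.
    std::vector<RID> index_scan_salary_below(double) { return {}; }   // phase 1 scan
    Tuple fetch_by_rid(const RID&) { return {}; }
    void  write_tuple(const RID&, const Tuple&) {}

    void halloween_safe_update() {
        // Phase 1: materialize the RIDs of every qualifying tuple before any
        // modification happens. A real executor would spool this list to a
        // temporary file rather than hold it in memory.
        std::vector<RID> victims = index_scan_salary_below(20000.0);

        // Phase 2: re-fetch each tuple by RID and apply the update. Because the
        // index is no longer being scanned, a tuple that moves within the B+-tree
        // after modification cannot be "rediscovered" and raised twice.
        for (const RID& rid : victims) {
            Tuple t = fetch_by_rid(rid);
            t.salary *= 1.1;
            write_tuple(rid, t);
        }
    }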

4.5 Access Methods


The access methods are the routines for managing access to the various disk-based data
structures supported by the system, which typically include unordered files (“heaps”) of
tuples, and various kinds of indexes. All commercial database systems include B+-tree
indexes and heap files. Most systems are beginning to introduce some rudimentary
support for multidimensional indexes like R-trees [24]. Systems targeted at read-mostly
data warehousing workloads usually include specialized bitmap variants of indexes as
well [49].

The basic API provided by an access method is an iterator API, with the init() routine
expanded to take a “search predicate” (or in the terminology of System R, a “search
argument”, or SARG) of the form column operator constant. A NULL SARG is
treated as a request to scan all tuples in the table. The get_next() call at the access
method layer returns NULL when there are no more tuples satisfying the search argument.

There are two reasons to pass SARGs into the access method layer. The first reason
should be clear: index access methods like B+-trees require SARGs in order to function
efficiently. The second reason is a more subtle performance issue, but one that applies to
heap scans as well as index scans. Assume that the SARG is checked by the routine that
calls the access method layer. Then each time the access method returns from
get_next(), it must either (a) return a handle to a tuple residing in a frame in the buffer
pool, and pin the page in that frame to avoid replacement or (b) make a copy of the tuple.
If the caller finds that the SARG is not satisfied, it is responsible for either (a)
decrementing the pin count on the page, or (b) deleting the copied tuple. It must then try
the next tuple on the page by reinvoking get_next(). This logic involves a number of
CPU cycles simply doing function call/return pairs, and will either pin pages in the buffer
pool unnecessarily (generating unnecessary contention for buffer frames) or create and
destroy copies of tuples unnecessarily. Note that a typical heap scan will access all of the
tuples on a given page, resulting in multiple iterations of this interaction per page. By
contrast, if all this logic is done in the access method layer, the repeated pairs of
call/return and either pin/unpin or copy/delete can be avoided by testing the SARGs a
page at a time, and only returning from a get_next() call for a tuple that satisfies the
SARG.
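
The following sketch (an invented heap-scan class, not any system's actual access-method
code) shows the shape of a get_next() that evaluates the SARG inside the access method, so
that non-qualifying tuples never cross the iterator boundary:

    // A simple SARG of the form "column operator constant"; invented representation.
    struct SARG {
        int column_no;
        enum Op { LT, EQ, GT } op;
        long constant;
        bool matches(long column_value) const {
            switch (op) {
                case LT: return column_value <  constant;
                case EQ: return column_value == constant;
                case GT: return column_value >  constant;
            }
            return false;
        }
    };

    // Hypothetical heap-scan access method. The point of the sketch is the shape of
    // get_next(): it advances over the current page, testing the SARG itself, and only
    // returns to the caller for tuples that qualify.
    class HeapScan {
    public:
        void init(const SARG* sarg) { sarg_ = sarg; /* pin the first page */ }

        const char* get_next() {
            for (;;) {
                const char* tuple = next_slot_on_current_page();
                if (tuple == nullptr) {
                    if (!advance_to_next_page()) return nullptr;   // end of table
                    continue;
                }
                // A null SARG means "scan everything".
                if (sarg_ == nullptr ||
                    sarg_->matches(column_value(tuple, sarg_->column_no)))
                    return tuple;   // qualifying tuple: hand it to the caller
                // Non-qualifying tuples are skipped here, with no extra
                // call/return or pin/unpin traffic for the caller.
            }
        }

        void close() { /* unpin the current page */ }

    private:
        const char* next_slot_on_current_page() { return nullptr; }   // stub
        bool advance_to_next_page() { return false; }                 // stub
        long column_value(const char*, int) { return 0; }             // stub
        const SARG* sarg_ = nullptr;
    };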

A special SARG is available in all access methods to FETCH a tuple directly by its
physical Record ID (RID). FETCH-by-RID is required to support secondary indexes and
other schemes that “point” to tuples, and subsequently need to dereference those pointers.

In contrast to all other iterators, access methods have deep interactions with the
concurrency and recovery logic surrounding transactions. We discuss these issues next.

5 Transactions: Concurrency Control and Recovery


Database systems are often accused of being enormous, monolithic pieces of software
that cannot be split into reusable components. In practice, database systems – and the
development teams that implement and maintain them – do break down into independent
components with narrow interfaces in between. This is particularly true of the various
components of query processing described in the previous section. The parser, rewrite
engine, optimizer, executor and access methods all represent fairly independent pieces of
code with well-defined, narrow interfaces that are “published” internally between
development groups.

The truly monolithic piece of a DBMS is the transactional storage manager, which
typically encompasses four deeply intertwined components:
1. A lock manager for concurrency control
2. A log manager for recovery
3. A buffer pool for staging database I/Os
4. Access methods for organizing data on disk.

A great deal of ink has been spilled describing the fussy details of transactional storage
algorithms and protocols in database systems. The reader wishing to become
knowledgable about these systems should read – at a minimum – a basic undergraduate
database textbook [53], the journal article on the ARIES log protocol [45], and one
serious article on transactional index concurrency and logging [46] [35]. More advanced
readers will want to leaf through the Gray and Reuter textbook on transactions [22]. To
really become an expert, this reading has to be followed by an implementation effort!
We will not focus on algorithms here, but rather overview the roles of these various
components, focusing on the system infrastructure that is often ignored in the textbooks,
and highlighting the inter-dependencies between the components.

5.1 A Note on ACID


Many people are familiar with the term “ACID transactions”, a mnemonic due to Härder
and Reuter [26]. ACID stands for Atomicity, Consistency, Isolation, and Durability.
These terms were not formally defined, and theory-oriented students sometimes spend a
great deal of time trying to tease out exactly what each letter means. The truth is that
these are not mathematical axioms that combine to guarantee transactional consistency,
so carefully distinguishing the terms may not be a worthwhile exercise. Despite the
informal nature, the ACID acronym is useful to organize a discussion of transaction
systems.

• Atomicity is the “all or nothing” guarantee for transactions – either all of a
transaction’s actions are visible to another transaction, or none are.
• Consistency is an application-specific guarantee, which is typically captured in a
DBMS by SQL integrity constraints. Given a definition of consistency provided
by a set of constraints, a transaction can only commit if it leaves the database in a
consistent state.
• Isolation is a guarantee to application writers that two concurrent transactions will
not see each other’s in-flight updates. As a result, applications need not be coded
“defensively” to worry about the “dirty data” of other concurrent transactions.
• Durability is a guarantee that the updates of a committed transaction will be
visible in the database to subsequent transactions, until such time as they are
overwritten by another committed transaction.

Roughly speaking, modern DBMSs implement Isolation via locking and Durability via
logging; Atomicity is guaranteed by a combination of locking (to prevent visibility of
transient database states) and logging (to ensure correctness of data that is visible).

Consistency is managed by runtime checks in the query executor: if a transaction’s
actions will violate a SQL integrity constraint, the transaction is aborted and an error
code returned.

5.2 Lock Manager and Latches


Serializability is the well-defined textbook notion of correctness for concurrent
transactions: a sequence of interleaved actions for multiple committing transactions must
correspond to some serial execution of the transactions. Every commercial relational
DBMS implements serializability via strict two-phase locking (2PL): transactions acquire
locks on objects before reading or writing them, and release all locks at the time of
transactional commit or abort. The lock manager is the code module responsible for
providing the facilities for 2PL. As an auxiliary to database locks, lighter-weight latches
are also provided for mutual exclusion.

We begin our discussion with locks. Database locks are simply names used by
convention within the system to represent either physical items (e.g. disk pages) or
logical items (e.g., tuples, files, volumes) that are managed by the DBMS. Note that any
name can have a lock associated with it – even if that name represents an abstract
concept. The locking mechanism simply provides a place to register and check for these
names. Locks come in different lock “modes”, and these modes are associated with a
lock-mode compatibility table. In most systems this logic is based on the well-known
lock modes that are introduced in Gray’s paper on granularity of locks [21].

The lock manager supports two basic calls: lock(lockname, transactionID, mode),
and remove_transaction(transactionID). Note that because of the strict 2PL
protocol, there need not be an individual call to unlock resources individually – the
remove_transaction call will unlock all resources associated with a transaction.
However, as we discuss in Section 5.2.1, the SQL standard allows for lower degrees of
consistency than serializability, and hence there is a need for an unlock(lockname,
transactionID) call as well. There is also a lock_upgrade(lockname,
transactionID, newmode) call to allow transactions to “upgrade” to higher lock modes
(e.g. from shared to exclusive mode) in a two-phase manner, without dropping and re-
acquiring locks. Additionally, some systems also support a
conditional_lock(lockname, transactionID, mode) call. The conditional_lock
call always returns immediately, and indicates whether it succeeded in acquiring the lock.
If it did not succeed, the calling DBMS thread is not enqueued waiting for the lock. The
use of conditional locks for index concurrency is discussed in [46].

To support these calls, the lock manager maintains two data structures. A global lock
table is maintained to hold locknames and their associated information. The lock table is
a dynamic hash table keyed by (a hash function of) lock names. Associated with each
lock is a current_mode flag to indicate the lock mode, and a waitqueue of lock request
pairs (transactionID, mode). In addition, it maintains a transaction table keyed by
transactionID, which contains two items for each transaction T: (a) a pointer to T’s
DBMS thread state, to allow T’s DBMS thread to be rescheduled when it acquires any
locks it is waiting on, and (b) a list of pointers to all of T’s lock requests in the lock table,
to facilitate the removal of all locks associated with a particular transaction (e.g., upon
transaction commit or abort).
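
A minimal sketch of these two data structures might look as follows; the class layout, the
names, and the omitted waiting and compatibility logic are all illustrative assumptions
rather than any vendor's implementation:

    #include <list>
    #include <string>
    #include <unordered_map>

    enum class LockMode { Shared, Exclusive };    // real systems have more modes [21]

    struct LockRequest {
        long transaction_id;
        LockMode mode;
        bool granted;                             // true once the request is granted
    };

    // One entry in the global lock table, keyed by lock name.
    struct LockHead {
        LockMode current_mode = LockMode::Shared; // strongest granted mode
        std::list<LockRequest> queue;             // granted requests, then waiters
    };

    class LockManager {
    public:
        // lock(): enqueue a request; block (not shown) until it is granted.
        void lock(const std::string& lockname, long txn_id, LockMode mode) {
            LockHead& head = lock_table_[lockname];
            head.queue.push_back({txn_id, mode, /*granted=*/false});
            txn_table_[txn_id].push_back(lockname);   // remember for bulk release
            // ... compatibility check against head.current_mode, possibly wait ...
        }

        // remove_transaction(): release every lock the transaction holds (strict 2PL).
        void remove_transaction(long txn_id) {
            for (const std::string& lockname : txn_table_[txn_id]) {
                LockHead& head = lock_table_[lockname];
                head.queue.remove_if([&](const LockRequest& r) {
                    return r.transaction_id == txn_id;
                });
                // ... wake up compatible waiters at the head of the queue ...
            }
            txn_table_.erase(txn_id);
        }

    private:
        std::unordered_map<std::string, LockHead> lock_table_;       // keyed by lock name
        std::unordered_map<long, std::list<std::string>> txn_table_; // locks per transaction
    };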

Internally, the lock manager makes use of a deadlock detector DBMS thread that
periodically examines the lock table to look for waits-for cycles. Upon detection of a
deadlock, the deadlock detector aborts one of the deadlocked transactions (the decision of
which deadlocked transaction to abort is based on heuristics that have been studied in the
research literature [55].) In shared-nothing and shared-disk systems, distributed deadlock
detection facilities are required as well [47]. A more detailed description of a lock manager
implementation is given in Gray and Reuter’s text [22].

In addition to two-phase locks, every DBMS also supports a lighter-weight mutual
exclusion mechanism, typically called a latch. Latches are more akin to monitors [30]
than locks; they are used to provide exclusive access to internal DBMS data structures.
As an example, the buffer pool page table has a latch associated with each frame, to
guarantee that only one DBMS thread is replacing a given frame at any time. Latches
differ from locks in a number of ways:
• Locks are kept in the lock table and located via hash tables; latches reside in
memory near the resources they protect, and are accessed via direct addressing.
• Locks are subject to the strict 2PL protocol. Latches may be acquired or dropped
during a transaction based on special-case internal logic.
• Lock acquisition is entirely driven by data access, and hence the order and
lifetime of lock acquisitions is largely in the hands of applications and the query
optimizer. Latches are acquired by specialized code inside the DBMS, and the
DBMS internal code issues latch requests and releases strategically.
• Locks are allowed to produce deadlock, and lock deadlocks are detected and
resolved via transactional restart. Latch deadlock must be avoided; the
occurrence of a latch deadlock represents a bug in the DBMS code.
• Latch calls take a few dozen CPU cycles, whereas lock requests take hundreds of
CPU cycles.

The latch API supports the routines latch(object, mode), unlatch(object), and
conditional_latch(object, mode). In most DBMSs, the choices of latch modes
include only Shared or eXclusive. Latches maintain a current_mode, and a waitqueue
of DBMS threads waiting on the latch. The latch and unlatch calls work as one might
expect. The conditional_latch call is analogous to the conditional_lock call
described above, and is also used for index concurrency [46].
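
As a rough sketch (using a standard reader-writer mutex as the underlying primitive, which
is an assumption of convenience rather than how commercial engines necessarily implement
it), a latch supporting these three calls could look like this:

    #include <shared_mutex>

    enum class LatchMode { Shared, Exclusive };

    // A latch guarding one in-memory structure (e.g. a buffer-pool frame). Unlike a
    // lock, it lives inside the structure it protects and is not tracked in any
    // global table.
    struct Latch {
        std::shared_mutex mtx;

        void latch(LatchMode mode) {
            if (mode == LatchMode::Exclusive) mtx.lock();
            else                              mtx.lock_shared();
        }

        void unlatch(LatchMode mode) {
            if (mode == LatchMode::Exclusive) mtx.unlock();
            else                              mtx.unlock_shared();
        }

        // conditional_latch(): returns immediately, reporting success or failure,
        // so the calling thread is never enqueued waiting.
        bool conditional_latch(LatchMode mode) {
            return (mode == LatchMode::Exclusive) ? mtx.try_lock()
                                                  : mtx.try_lock_shared();
        }
    };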

5.2.1 Isolation Levels


Very early in the development of the transaction concept, there were attempts to provide
more concurrency by providing “weaker” semantics than serializability. The challenge
was to provide robust definitions of the semantics in these cases. The most influential
effort in this regard was Gray’s early work on “Degrees of Consistency” [21]. That work
attempted to provide both a declarative definition of consistency degrees, and
implementations in terms of locking. Influenced by this work, the ANSI SQL standard
defines four “Isolation Levels”:

1. READ UNCOMMITTED: A transaction may read any version of data, committed
or not. This is achieved in a locking implementation by read requests proceeding
without acquiring any locks4.
2. READ COMMITTED: A transaction may read any committed version of data.
Repeated reads of an object may result in different (committed) versions. This is
achieved by read requests acquiring a read lock before accessing an object, and
unlocking it immediately after access.
3. REPEATABLE READ: A transaction will read only one version of committed
data; once the transaction reads an object, it will always read the same version of
that object. This is achieved by read requests acquiring a read lock before
accessing an object, and holding the lock until end-of-transaction.
4. SERIALIZABLE: Fully serializable access is guaranteed.

At first blush, REPEATABLE READ seems to provide full serializability, but this is not
the case. Early in the System R project, a problem arose that was dubbed the “phantom
problem”. In the phantom problem, a transaction accesses a relation more than once with
the same predicate, but sees new “phantom” tuples on re-access that were not seen on the
first access.5 This is because two-phase locking at tuple-level granularity does not
prevent the insertion of new tuples into a table. Two-phase locking of tables prevents
phantoms, but table-level locking can be restrictive in cases where transactions access
only a few tuples via an index. We investigate this issue further in Section 5.4.3 when we
discuss locking in indexes.

4 In all isolation levels, write requests are preceded by write locks that are held until end
of transaction.
5 Despite the spooky similarity in names, the phantom problem has nothing to do with the
Halloween problem of Section 4.4.

Commercial systems provide the four isolation levels above via locking-based
implementations of concurrency control. Unfortunately, as noted by Berenson et al.
[6], neither the early work by Gray nor the ANSI standard achieves the goal of providing
truly declarative definitions. Both rely in subtle ways on an assumption that a locking
scheme is used for concurrency control, as opposed to an optimistic [36] or multi-version
[54] concurrency scheme. This implies that the proposed semantics are ill-defined. The
interested reader is encouraged to look at the Berenson paper which discusses some of the
problems in the SQL standard specifications, as well as the research by Adya et al. [1]
which provides a new, cleaner approach to the problem.

In addition to the standard ANSI SQL isolation levels, various vendors provide additional
levels that have proven popular in various cases.
• CURSOR STABILITY: This level is intended to solve the “lost update” problem
of READ COMMITTED. Consider two transactions T1 and T2. T1 runs in
READ COMMITTED mode, reads an object X (say the value of a bank account),

remembers its value, and subsequently writes object X based on the remembered
value (say adding $100 to the original account value). T2 reads and writes X as
well (say subtracting $300 from the account). If T2’s actions happen between
T1’s read and T1’s write, then the effect of T2’s update will be lost – the final
value of the account in our example will be up by $100, instead of being down by
$200 as desired. A transaction in CURSOR STABILITY mode holds a lock on
the most recently-read item on a query cursor; the lock is automatically dropped
when the cursor is moved (e.g. via another FETCH) or the transaction terminates.
CURSOR STABILITY allows the transaction to do read-think-write sequences on
individual items without intervening updates from other transactions.
• SNAPSHOT ISOLATION: A transaction running in SNAPSHOT ISOLATION
mode operates on a version of the database as it existed at the time the transaction
began; subsequent updates by other transactions are invisible to the transaction.
When the transaction starts, it gets a unique start-timestamp from a monotonically
increasing counter; when it commits it gets a unique end-timestamp from the
counter. The transaction commits only if no other transaction with an overlapping
start/end-timestamp pair wrote data that this transaction also wrote (a minimal
sketch of this validation check appears after this list).
This isolation mode depends upon a multi-version concurrency implementation,
rather than locking (though these schemes typically coexist in systems that
support SNAPSHOT ISOLATION).
• READ CONSISTENCY: This is a scheme defined by Oracle; it is subtly different
from SNAPSHOT ISOLATION. In the Oracle scheme, each SQL statement (of
which there may be many in a single transaction) sees the most recently
committed values as of the start of the statement. For statements that FETCH from
cursors, the cursor set is based on the values as of the time it is opened. This is
implemented by maintaining multiple versions of individual tuples, with a single
transaction possibly referencing multiple versions of a single tuple. Modifications
are maintained via long-term write locks, so when two transactions want to write
the same object the first writer “wins”, whereas in SNAPSHOT ISOLATION the
first committer “wins”.
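
The following is a minimal sketch of the first-committer-wins check mentioned above, with
an invented in-memory layout of write sets; production systems fold this validation into
their multi-version storage machinery rather than materializing it this way:

    #include <set>
    #include <vector>

    // A committed transaction's footprint: its timestamp interval and write set.
    struct CommittedTxn {
        long start_ts;
        long end_ts;
        std::set<long> writes;      // identifiers of objects it wrote
    };

    // First-committer-wins validation for SNAPSHOT ISOLATION: the committing
    // transaction is rejected if any transaction that committed during its
    // lifetime wrote an object it also wrote.
    bool can_commit(long my_start_ts,
                    const std::set<long>& my_writes,
                    const std::vector<CommittedTxn>& recently_committed) {
        for (const CommittedTxn& other : recently_committed) {
            if (other.end_ts < my_start_ts) continue;        // finished before we began
            for (long obj : my_writes) {
                if (other.writes.count(obj)) return false;   // write-write overlap
            }
        }
        return true;
    }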

Weak isolation schemes provide higher concurrency than serializability. As a result,
some systems even use weak consistency as the default; Oracle defaults to READ
COMMITTED, for example. The downside is that Isolation (in the ACID sense) is not
guaranteed. Hence application writers need to reason about the subtleties of the schemes
to ensure that their transactions run correctly. This is tricky given the operationally-
defined semantics of the schemes.
5.3 Log Manager
The log manager is responsible for maintaining the durability of committed transactions,
and for facilitating the rollback of aborted transactions to ensure atomicity. It provides
these features by maintaining a sequence of log records on disk, and a set of data
structures in memory. In order to support correct behavior after crash, the memory-
resident data structures obviously need to be re-createable from persistent data in the log
and the database.

36
78 Chapter 1: Data Models and DBMS Architecture

Database logging is an incredibly complex and detail-oriented topic. The canonical
reference on database logging is the journal paper on ARIES [45], and a database expert
should be familiar with the details of that paper. The ARIES paper not only explains its
protocol, but also provides discussion of alternative design possibilities, and the problems
that they can cause. This makes for dense but eventually rewarding reading. As a more
digestible introduction, the Ramakrishnan/Gehrke textbook [53] provides a description of
the basic ARIES protocol without side discussions or refinements, and we provide a set
of powerpoint slides that accompany that discussion on our website
(http://redbook.cs.berkeley.edu). Here we discuss some of the basic ideas in recovery,
and try to explain the complexity gap between textbook and journal descriptions.

As is well known, the standard theme of database recovery is to use a Write-Ahead
Logging (WAL) protocol. The WAL protocol consists of three very simple rules:
1. Each modification to a database page should generate a log record, and the log
record must be flushed to the log device before the database page is flushed.
2. Database log records must be flushed in order; log record r cannot be flushed until
all log records preceding r are flushed.
3. Upon a transaction commit request, a COMMIT log record must be flushed to the
log device before the commit request returns successfully.
Many people only remember the first of these rules, but all three are required for correct
behavior.

The first rule ensures that the actions of incomplete transactions can be undone in the
event of a transaction abort, to ensure atomicity. The combination of rules (2) and (3)
ensure durability: the actions of a committed transaction can be redone after a system
crash if they are not yet reflected in the database.
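
The following toy log manager (all names invented) marks where each of the three WAL rules
is enforced; it is a sketch of the idea, not of any real system's logging code:

    #include <cstdint>
    #include <vector>

    using LSN = std::uint64_t;        // log sequence number

    struct LogRecord { LSN lsn; /* redo/undo payload elided */ };
    struct DataPage  { LSN page_lsn = 0; /* page contents elided */ };

    class LogManager {
    public:
        // Appending a record assigns the next LSN; rule 2 is then satisfied
        // automatically because the in-memory tail is flushed strictly in order.
        LSN append(const LogRecord& rec) {
            tail_.push_back(rec);
            tail_.back().lsn = next_lsn_++;
            return tail_.back().lsn;
        }

        // Rule 1: before the buffer manager writes a dirty page, it must force
        // the log at least up to that page's pageLSN.
        void flush_before_page_write(const DataPage& page) {
            flush_up_to(page.page_lsn);
        }

        // Rule 3: a commit returns to the user only after its COMMIT record
        // (and, by rule 2, everything before it) is on the log device.
        void commit(LSN commit_record_lsn) {
            flush_up_to(commit_record_lsn);
        }

    private:
        void flush_up_to(LSN lsn) {
            // Write tail records with lsn <= target to the log device, in order.
            while (!tail_.empty() && tail_.front().lsn <= lsn) {
                /* write tail_.front() to stable storage here */
                flushed_lsn_ = tail_.front().lsn;
                tail_.erase(tail_.begin());
            }
        }

        std::vector<LogRecord> tail_;   // in-memory log tail, in LSN order
        LSN next_lsn_ = 1;
        LSN flushed_lsn_ = 0;           // highest LSN known to be on stable storage
    };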

Given these simple principles, it is surprising that efficient database logging is as subtle
and detailed as it is. In practice, however, the simple story above is complicated by the
need for extreme performance. The challenge is to guarantee efficiency in the “fast path”
for transactions that commit, while also providing high-performance rollback for aborted
transactions, and quick recovery after crashes. Logging gets even more baroque when
application-specific optimizations are added, e.g. to support improved performance for
fields that can only be incremented or decremented (“escrow transactions”.)

In order to maximize the speed of the fast path, every commercial database system
operates in a mode that Härder and Reuter call “DIRECT, STEAL/NOT-FORCE” [26]:
(a) data objects are updated in place, (b) unpinned buffer pool frames can be “stolen”
(and the modified data pages written back to disk) even if they contain uncommitted data,
and (c) buffer pool pages need not be “forced” (flushed) to the database before a commit
request returns to the user. These policies keep the data in the location chosen by the
DBA, and they give the buffer manager and disk schedulers full latitude to decide on
memory management and I/O policies without consideration for transactional
correctness. These features can have major performance benefits, but require that the log
manager efficiently handle all the subtleties of undoing the flushes of stolen pages from

37
Anatomy of a Database System 79

aborted transactions, and redoing the changes to not-forced pages of committed


transactions that are lost on crash.

Another fast-path challenge in logging is to keep log records as small as possible, in order
to increase the throughput of log I/O activity. A natural optimization is to log logical
operations (e.g., “insert (Bob, $25000) into EMP”) rather than physical operations (e.g.,
the after-images for all byte ranges modified via the tuple insertion, including bytes on
both heap file and index blocks.) The tradeoff is that the logic to redo and undo logical
operations becomes quite involved, which can severely degrade performance during
transaction abort and database recovery.6 In practice, a mixture of physical and logical
logging (so-called “physiological” logging) is used. In ARIES, physical logging is
generally used to support REDO, and logical logging is used to support UNDO – this is
part of the ARIES rule of “repeating history” during recovery to reach the crash state, and
then rolling back transactions from that point.

6 Note also that logical log records must always have well-known inverse functions if
they need to participate in undo processing.

Crash recovery performance is greatly enhanced by the presence of database checkpoints
– consistent versions of the database from the recent past. A checkpoint limits the
amount of log that the recovery process needs to consult and process. However, the
naïve generation of checkpoints is too expensive to do during regular processing, so some
more efficient “fuzzy” scheme for checkpointing is required, along with logic to correctly
bring the checkpoint up to the most recent consistent state by processing as little of the
log as possible. ARIES uses a very clever scheme in which the actual checkpoint records
are quite tiny, containing just enough information to initiate the log analysis process and
to enable the recreation of main-memory data structures lost at crash time.

Finally, the task of logging and recovery is further complicated by the fact that a database
is not merely a set of user data tuples on disk pages; it also includes a variety of
“physical” information that allows it to manage its internal disk-based data structures.
We discuss this in the context of index logging in the next section.

5.4 Locking and Logging in Indexes


Indexes are physical storage structures for accessing data in the database. The indexes
themselves are invisible to database users, except inasmuch as they improve
performance. Users cannot directly read or modify indexes, and hence user code need not
be isolated (in the ACID sense) from changes to the index. This allows indexes to be
managed via more efficient (and complex) transactional schemes than database data. The
only invariant that index concurrency and recovery needs to preserve is that the index
always returns transactionally-consistent tuples from the database.


5.4.1 Latching in B+-Trees


A well-studied example of this issue arises in B+-tree latching. B+-trees consist of
database disk pages that are accessed via the buffer pool, just like data pages. Hence one
scheme for index concurrency control is to use two-phase locks on index pages. This
means that every transaction that touches the index needs to lock the root of the B+-tree
until commit time – a recipe for limited concurrency. A variety of latch-based schemes
have been developed to work around this problem without setting any transactional locks
on index pages. The key insight in these schemes is that modifications to the tree’s
physical structure (e.g. splitting pages) can be made in a non-transactional manner as
long as all concurrent transactions continue to find the correct data at the leaves. There
are roughly three approaches to this:
• Conservative schemes, which allow multiple transactions to access the same
pages only if they can be guaranteed not to conflict in their use of a page’s
content. One such conflict is that a reading transaction wants to traverse a fully-
packed internal page of the tree, and a concurrent inserting transaction is
operating below that page, and hence might need to split it [4]. These
conservative schemes sacrifice too much concurrency compared with the more
recent ideas below.
• Latch-coupling schemes, in which the tree traversal logic latches each node before
it is visited, only unlatching a node when the next node to be visited has been
successfully latched. This scheme is sometimes called latch “crabbing”, because
of the crablike movement of “holding” a node in the tree, “grabbing” its child,
releasing the parent, and repeating. Latch coupling is used in some commercial
systems; IBM’s ARIES-IM version is well described [46]. ARIES-IM includes
some fairly intricate details and corner cases – on occasion it has to restart
traversals after splits, and even set (very short-term) tree-wide latches.
• Right-link schemes, which add some simple additional structure to the B+-tree to
minimize the requirement for latches and retraversals. In particular, a link is
added from each node to its right-hand neighbor. During traversal, right-link
schemes do no latch coupling – each node is latched, read, and unlatched. The
main intuition in right-link schemes is that if a traversing transaction follows a
pointer to a node n and finds that n was split in the interim, the traversing
transaction can detect this fact, and “move right” via the rightlinks to find the new
correct location in the tree. [39][35]
Kornacker et al. [35] provide a detailed discussion of the distinctions between latch-
coupling and right-link schemes, and point out that latch-coupling is only applicable to
B+-trees, and will not work for index trees over more complex data, e.g.
multidimensional indexes like R-trees.
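
As a minimal illustration of the latch-coupling discipline discussed above, the following
sketch (an invented node layout, with a standard shared mutex standing in for a latch)
shows a read-only descent that never holds more than two latches at a time:

    #include <shared_mutex>

    // A minimal B+-tree node for illustrating latch coupling; invented layout.
    struct BTreeNode {
        std::shared_mutex latch;
        bool is_leaf = true;               // a single default node is just a leaf
        BTreeNode* only_child = nullptr;   // stand-in for the real key/child arrays
        BTreeNode* child_for_key(long) { return only_child; }
    };

    // Latch-coupling ("crabbing") descent for a read-only lookup: latch the child
    // before releasing the parent, so no traverser ever sees a node mid-change.
    BTreeNode* descend_to_leaf(BTreeNode* root, long key) {
        BTreeNode* current = root;
        current->latch.lock_shared();                 // latch the root
        while (!current->is_leaf) {
            BTreeNode* child = current->child_for_key(key);
            child->latch.lock_shared();               // grab the child...
            current->latch.unlock_shared();           // ...then release the parent
            current = child;
        }
        return current;   // the leaf is still share-latched; the caller unlatches it
    }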

5.4.2 Logging for Physical Structures


In addition to special-case concurrency logic, indexes employ special-case logging logic.
This logic makes logging and recovery much more efficient, at the expense of more
complexity in the code. The main idea is that structural index changes need not be
undone when the associated transaction is aborted; such changes may have no effect on
the database tuples seen by other transactions. For example, if a B+-tree page is split
during an inserting transaction that subsequently aborts, there is no pressing need to undo
the split during the abort processing.

This raises the challenge of labeling some log records “redo-only” – during any undo
processing of the log, these changes should be left in place. ARIES provides an elegant
mechanism for these scenarios called nested top actions, which allows the recovery
process to “jump over” log records for physical structure modifications without any
special case code during recovery.

This same idea is used in other contexts, including in heap files. An insertion into a heap
file may require the file to be extended on disk. To capture this, changes must be made to
the file’s “extent map”, a data structure on disk that points to the runs of contiguous
blocks that constitute the file. These changes to the extent map need not be undone if the
inserting transaction aborts – the fact that the file has become larger is a transactionally
invisible side-effect, and may in fact be useful for absorbing future insert traffic.

5.4.3 Next-Key Locking: Physical Surrogates for Logical Properties


We close this section with a final index concurrency problem that illustrates a subtle but
significant idea. The challenge is to provide full serializability (including phantom
protection) while allowing for tuple-level locks and the use of indexes.

The phantom problem arises when a transaction accesses tuples via an index: in such
cases, the transaction typically does not lock the entire table, just the tuples in the table
that are accessed via the index (e.g. “Name BETWEEN ‘Bob’ AND ‘Bobby’”). In the
absence of a table-level lock, other transactions are free to insert new tuples into the table
(e.g. Name=’Bobbie’). When these new inserts fall within the value-range of a query
predicate, they will appear in subsequent accesses via that predicate. Note that the
phantom problem relates to visibility of database tuples, and hence is a problem with
locks, not just latches. In principle, what is needed is the ability to somehow lock the
logical space represented by the original query’s search predicate. Unfortunately, it is
well known that predicate locking is expensive, since it requires a way to compare
arbitrary predicates for overlap – something that cannot be done with a hash-based lock
table [2].

The standard solution to the phantom problem in B+-trees is called “next-key locking”.
In next-key locking, the index insertion code is modified so that an insertion of a tuple
with index key k is required to allocate an exclusive lock on the “next-key” tuple that
exists in the index: the tuple with the lowest key greater than k. This protocol ensures
that subsequent insertions cannot appear “in between” two tuples that were returned
previously to an active transaction; it also ensures that tuples cannot be inserted just
below the lowest-keyed tuple previously returned (e.g. if there were no ‘Bob’ on the 1st
access, there should be no ‘Bob’ on subsequent accesses). One corner case remains: the
insertion of tuples just above the highest-keyed tuple previously returned. To protect
against this case, the next-key locking protocol requires read transactions to be modified
as well, so that they must get a shared lock on the “next-key” tuple in the index as well:
the minimum-keyed tuple that does not satisfy the query predicate. An implementation
of next-key locking is described for ARIES [42].
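
A minimal sketch of the protocol, using an ordered in-memory map as a stand-in for the
B+-tree and string lock names as in the lock manager discussion above (all of it
illustrative rather than any system's actual code), might look like this:

    #include <map>
    #include <string>

    using Index = std::map<std::string, long>;   // key -> RID

    // Hypothetical lock-manager hooks, stubbed out for the sketch.
    void lock_exclusive(const std::string& /*lockname*/, long /*txn*/) {}
    void lock_shared(const std::string& /*lockname*/, long /*txn*/)    {}

    // Insert with next-key locking: X-lock the lowest existing key greater than the
    // key being inserted, so no committed range scan can have "missed" the new tuple.
    void insert_with_next_key_lock(Index& index, long txn,
                                   const std::string& key, long rid) {
        auto next = index.upper_bound(key);           // lowest key strictly greater
        if (next != index.end())
            lock_exclusive("key:" + next->first, txn);
        lock_exclusive("key:" + key, txn);            // also lock the new tuple itself
        index.emplace(key, rid);
    }

    // Range scan with next-key locking: S-lock every qualifying key and also the first
    // key past the upper bound, to protect the high end of the range.
    void scan_with_next_key_lock(Index& index, long txn,
                                 const std::string& lo, const std::string& hi) {
        auto it = index.lower_bound(lo);
        for (; it != index.end() && it->first <= hi; ++it)
            lock_shared("key:" + it->first, txn);
        if (it != index.end())
            lock_shared("key:" + it->first, txn);     // the "next-key" past the range
    }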

Next-key locking is not simply a clever hack. It is an instance of using a physical object
(a currently-stored tuple) as a surrogate for a logical concept (a predicate). The benefit is
that simple system infrastructure (e.g. hash-based lock tables) can be used for more
complex purposes, simply by modifying the lock protocol. This idea of using physical
surrogates for logical concepts is unique to database research: it is largely unappreciated
in other systems work on concurrency, which typically does not consider semantic
information about logical concepts as part of the systems challenge. Designers of
complex software systems should keep this general approach in their “bag of tricks”
when such semantic information is available.

5.5 Interdependencies of Transactional Storage


We claimed early in this section that transactional storage systems are monolithic, deeply
entwined systems. In this section, we discuss a few of the interdependencies between the
three main aspects of a transactional storage system: concurrency control, recovery
management, and access methods. In a happier world, it would be possible to identify
narrow APIs between these modules, and allow the implementation behind those APIs to
be swappable. Our examples in this section show that this is not easily done. We do not
intend to provide an exhaustive list of interdependencies here; generating and proving the
completeness of such a list would be a very challenging exercise. We do hope, however,
to illustrate some of the twisty logic of transactional storage, and thereby justify the
resulting monolithic implementations in commercial DBMSs.

We begin by considering concurrency control and recovery alone, without complicating
things further with access method details. Even with the simplification, things are deeply
intertwined. One manifestation of the relationship between concurrency and recovery is
that write-ahead logging makes implicit assumptions about the locking protocol – it
requires strict two-phase locking, and will not operate correctly with non-strict two-phase
locking. To see this, consider what happens during the rollback of an aborted transaction.
The recovery code begins processing the log records of the aborted transaction, undoing
its modifications. Typically this requires changing pages or tuples that were previously
modified by the transaction. In order to make these changes, the transaction needs to
have locks on those pages or tuples. In a non-strict 2PL scheme, if the transaction drops
any locks before aborting, it is unable to acquire the new locks it needs to complete the
rollback process!

Access methods complicate things yet further. It is an enormous intellectual challenge to
take a textbook access method (e.g. linear hashing [41] or R-trees [24]) and implement it
correctly and efficiently in a transactional system; for this reason, most DBMSs still only
implement heap files and B+-trees as native, transactionally protected access methods.
As we illustrated above for B+-trees, high-performance implementations of transactional
indexes include intricate protocols for latching, locking, and logging. The B+-trees in
serious DBMSs are riddled with calls to the concurrency and recovery code. Even simple

41
Anatomy of a Database System 83

access methods like heap files have some tricky concurrency and recovery issues
surrounding the data structures that describe their contents (e.g. extent maps). This logic
is not generic to all access methods – it is very much customized to the specific logic of
the access method, and its particular implementation.

Concurrency control in access methods has been well-developed only for locking-
oriented schemes. Other concurrency schemes (e.g. Optimistic or Multiversion
concurrency control) do not usually consider access methods at all, or if they do mention
them it is only in an offhanded and impractical fashion [36]. Hence it is unlikely that one
can mix and match different concurrency mechanisms for a given access method
implementation.

Recovery logic in access methods is particularly system-specific: the timing and contents
of access method log records depend upon fine details of the recovery protocol, including
the handling of structure modifications (e.g. whether they get undone upon transaction
rollback, and if not how that is avoided), and the use of physical and logical logging.

Even for a specific access method, the recovery and concurrency logic are intertwined.
In one direction, the recovery logic depends upon the concurrency protocol: if the
recovery manager has to restore a physically consistent state of the tree, then it needs to
know what inconsistent states could possibly arise, to bracket those states appropriately
with log records (e.g. via nested top actions). In the opposite direction, the concurrency
protocol for an access method may be dependent on the recovery logic: for example, the
rightlink scheme for B+-trees assumes that pages in the tree never “re-merge” after they
split, an assumption that requires the recovery scheme to use a scheme like nested top
actions to avoid undoing splits generated by aborted transactions.

The one bright spot in this picture is that buffer management is relatively well-isolated
from the rest of the components of the storage manager. As long as pages are pinned
correctly, the buffer manager is free to encapsulate the rest of its logic and reimplement it
as needed, e.g. the choice of pages to replace (because of the STEAL property), and the
scheduling of page flushes (thanks to the NOT FORCE property). Of course achieving
this isolation is the direct cause of much of the complexity in concurrency and recovery,
so this spot is not perhaps as bright as it seems either.

6 Shared Components
In this section we cover a number of utility subsystems that are present in nearly all
commercial DBMSs, but rarely discussed in the literature.

6.1 Memory Allocator


The textbook presentation of DBMS memory management tends to focus entirely on the
buffer pool. In practice, database systems allocate significant amounts of memory for
other tasks as well, and the correct management of this memory is both a programming
burden and a performance issue. Selinger-style query optimization can use a great deal
of memory, for example, to build up state during dynamic programming. Query operators
like hashjoins and sorts allocate significant memory for private space at runtime. In
commercial systems, memory allocation is made more efficient and easier to debug via
the use of a context-based memory allocator.

A memory context is an in-memory data structure that maintains a list of regions of
contiguous virtual memory, with each region possibly having a small header containing a
context label or a pointer to the context header structure.

The basic API for memory contexts includes calls to the following (a minimal sketch of
such an allocator appears after the list):
• Create a context with a given name or type. The type of the context might
advise the allocator how to efficiently handle memory allocation: for example, the
contexts for the query optimizer grow via small increments, while contexts for
hashjoins allocate their memory in a few large batches. Based on such
knowledge, the allocator can choose to allocate bigger or smaller regions at a
time.
• Allocate a chunk of memory within a context. This allocation will return a
pointer to memory (much like the traditional malloc call). That memory may
come from an existing region in the context; if no such space exists in any region,
the allocator will ask the operating system for a new region of memory, label it,
and link it into the context.
• Delete a chunk of memory within a context. This may or may not cause the
context to delete the corresponding region. Deletion from memory contexts is
somewhat unusual – a more typical behavior is to delete an entire context.
• Delete a context. This first frees all of the regions associated with the context,
and then deletes the context header.
• Reset a context. This retains the context, but returns it to the state of original
creation – typically by deallocating all previously-allocated regions of memory.
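
A minimal sketch of such a region-based context (invented names and sizing policy,
ignoring alignment and error handling) is shown below:

    #include <algorithm>
    #include <cstdlib>
    #include <string>
    #include <vector>

    class MemoryContext {
    public:
        explicit MemoryContext(std::string name, std::size_t region_size = 8192)
            : name_(std::move(name)), region_size_(region_size) {}

        ~MemoryContext() { reset(); }   // deleting a context frees all its regions

        // Allocate a chunk from the current region, grabbing a new region from the
        // OS allocator only when the current one is exhausted.
        void* alloc(std::size_t bytes) {
            if (regions_.empty() || used_ + bytes > region_size_) {
                regions_.push_back(static_cast<char*>(
                    std::malloc(std::max(bytes, region_size_))));
                used_ = 0;
            }
            void* p = regions_.back() + used_;
            used_ += bytes;
            return p;
        }

        // Reset: throw away everything allocated in this context with a handful of
        // free() calls, rather than one free() per small object.
        void reset() {
            for (char* r : regions_) std::free(r);
            regions_.clear();
            used_ = 0;
        }

    private:
        std::string name_;                 // context label, e.g. "optimizer"
        std::size_t region_size_;          // preferred region size for this context type
        std::vector<char*> regions_;       // regions of contiguous memory
        std::size_t used_ = 0;             // bytes used in the newest region
    };

Deleting a context corresponds here to destroying the object, which simply resets it; the
point of the design is that deallocation costs a few free() calls per context rather than
one per allocated object.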

Memory contexts provide important software engineering advantages. The most
important is that they serve as a lower-level, programmer-controllable alternative to
garbage collection. For example, the developers writing the optimizer can allocate
memory in an optimizer context for a particular query, without worrying about how to
free the memory later on. When the optimizer has picked the best plan, it can make a
copy of the plan in memory from a separate executor context for the query, and then
simply delete the query’s optimizer context – this saves the trouble of writing code to
carefully walk all the optimizer data structures and delete their components. It also
avoids tricky memory leaks that can arise from bugs in such code. This feature is very
useful for the naturally “phased” behavior of query execution, which proceeds from
parser to optimizer to executor, typically doing a number of allocations in each context,
followed by a context deletion.

Note that memory contexts actually provide more control than most garbage collectors:
developers can control both spatial and temporal locality of deallocation. Spatial control
is provided by the context mechanism itself, which allows the programmer to separate
memory into logical units. Temporal control is given by allowing programmers to issue
context deletions when appropriate. By contrast, garbage collectors typically work on all
of a program’s memory, and make their own decisions about when to run. This is one of
the frustrations of attempting to write server-quality code in Java [59].

Memory contexts also provide performance advantages in some cases, due to the
relatively high overhead for malloc() and free() on many platforms. In particular,
memory contexts can use semantic knowledge (via the context type) of how memory will
be allocated and deallocated, and may call malloc() and free() accordingly to minimize
OS overheads. In particular, some pieces of a database system (e.g. the parser and
optimizer) allocate a large number of small objects, and then free them all at once via a
context deletion. On most platforms it is rather expensive to call free() on many small
objects, so a memory allocator can instead malloc() large regions, and apportion the
resulting memory to its callers. The relative lack of memory deallocations means that
there is no need for the kind of compaction logic used by malloc() and free(). And when
the context is deleted, only a few free() calls are required to remove the large regions.

The interested reader may want to browse the open-source PostgreSQL code, which has a
fairly sophisticated memory allocator.

6.1.1 A Note on Memory Allocation for Query Operators


A philosophical design difference among vendors can be seen in the allocation of
memory for space-intensive operators like hash joins and sorts. Some systems (e.g. DB2)
allow the DBA to control the amount of RAM that will be used by such operations, and
guarantee that each query gets that amount of RAM when executed; this guarantee is
ensured by the admission control policy. In such systems, the operators allocate their
memory off of the heap via the memory allocator. These systems provide good
performance stability, but force the DBA to (statically!) decide how to balance physical
memory across various subsystems like the buffer pool and the query operators.
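A minimal sketch of the first approach, with entirely invented names (real products expose this through DBA-set memory parameters and their own admission control policies): operator memory is carved out of a fixed budget at admission time, so an admitted query is guaranteed its share.

    #include <mutex>

    // Toy admission controller: queries declare how much sort/hash-join memory the
    // configured policy promises them, and are admitted only if the fixed
    // operator-memory budget can cover that promise.
    class OperatorMemoryBudget {
    public:
        explicit OperatorMemoryBudget(long total_bytes) : free_bytes_(total_bytes) {}

        // Called at admission time; returns false if the query must wait or be rejected.
        bool try_admit(long requested_bytes) {
            std::lock_guard<std::mutex> g(mu_);
            if (requested_bytes > free_bytes_) return false;
            free_bytes_ -= requested_bytes;
            return true;
        }

        // Called when the query finishes and its operators release their memory.
        void release(long bytes) {
            std::lock_guard<std::mutex> g(mu_);
            free_bytes_ += bytes;
        }

    private:
        std::mutex mu_;
        long free_bytes_;   // statically carved out of RAM, separate from the buffer pool
    };

    int main() {
        OperatorMemoryBudget budget(256L * 1024 * 1024);   // e.g., 256 MB reserved for sorts and hash joins
        if (budget.try_admit(64L * 1024 * 1024)) {
            // ... run the query's operators with their guaranteed 64 MB ...
            budget.release(64L * 1024 * 1024);
        }
    }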

Other systems (e.g. MS SQL Server) try to manage these issues automatically, taking the
memory allocation task out of the DBA’s hands. These systems attempt to do intelligent
memory allocation across the various pieces of query execution, including caching of
pages in the buffer pool and the use of memory by query operators. The pool of memory
used for all of these tasks is the buffer pool itself, and hence in these systems the query
operators take memory from the buffer pool, bypassing the memory allocator.

This distinction echoes our discussion of query preparation in Section 4.3.1. The former
class of systems assumes that the DBA is engaged in sophisticated tuning, and that the
workload for the system will be amenable to one carefully-chosen setting of the DBA’s
memory “knobs”. Under these conditions, these systems should always perform
predictably well. The latter class assumes that DBAs either do not or cannot correctly set
these knobs, and attempts to replace the DBA wisdom with software logic. They also
retain the right to change their relative allocations adaptively, providing the possibility
for better performance on changing workloads. As in Section 4.3.1, this distinction says
something about how these vendors expect their products to be used, and about the
administrative expertise (and financial resources) of their customers.

6.2 Disk Management Subsystems


Textbooks on DBMSs tend to talk about disks as if they were homogeneous objects. In
practice, disk drives are complex and heterogeneous pieces of hardware, varying widely
in capacity and bandwidth. Hence every DBMS has a disk management subsystem that
deals with these issues, managing the allocation of tables and other units of storage across
multiple devices.

One aspect of this module is to manage the mapping of tables to devices and/or files.
One-to-one mappings of tables to files sound natural, but raised problems in early
filesystems. First, OS files traditionally could not be larger than a disk, while database
tables may need to span multiple disks. Second, it was traditionally bad form to allocate
too many OS files, since the OS typically only allowed a few open file descriptors, and
many OS utilities for directory management and backup did not scale to very large
numbers of files. Hence in many cases a single file is used to hold multiple tables. Over
time, most filesystems have overcome these limitations, but it is typical today for OS files
to simply be treated by the DBMS as abstract storage units, with arbitrary mappings to
database tables.
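As a rough illustration of such an abstract mapping (the structures and names below are invented for this sketch, not any vendor’s catalog format), the disk manager can keep a per-table list of segments and translate a table-relative page number into a file and byte offset:

    #include <cstdint>
    #include <cstdio>
    #include <stdexcept>
    #include <string>
    #include <unordered_map>
    #include <vector>

    constexpr uint32_t kPageSize = 8192;

    // A contiguous run of pages that one table occupies inside one OS file.
    struct Segment {
        std::string file_path;      // the OS file acting as an abstract storage unit
        uint64_t file_start_page;   // where this run begins inside that file
        uint64_t num_pages;         // length of the run
    };

    struct FileLocation { std::string file_path; uint64_t byte_offset; };

    class StorageMap {
    public:
        void add_segment(int table_id, Segment seg) { segments_[table_id].push_back(seg); }

        // Translate (table, logical page number) into a physical file location.
        FileLocation locate(int table_id, uint64_t page_no) const {
            for (const Segment& seg : segments_.at(table_id)) {
                if (page_no < seg.num_pages)
                    return {seg.file_path, (seg.file_start_page + page_no) * kPageSize};
                page_no -= seg.num_pages;   // fall through to the next segment
            }
            throw std::out_of_range("page beyond end of table");
        }

    private:
        std::unordered_map<int, std::vector<Segment>> segments_;
    };

    int main() {
        StorageMap map;
        map.add_segment(/*table_id=*/42, {"data/file_A", 0, 1000});
        map.add_segment(/*table_id=*/42, {"data/file_B", 500, 2000});   // one table spanning two files
        FileLocation loc = map.locate(42, 1500);                        // lands in file_B
        std::printf("%s @ offset %llu\n", loc.file_path.c_str(),
                    static_cast<unsigned long long>(loc.byte_offset));
    }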

More complex is the code to handle device-specific details for maintaining temporal and
spatial control as described in Section 3. There is a large and vibrant industry today
based on complex storage devices that “pretend” to be disk drives, but are in fact large
hardware/software systems whose API is a legacy disk drive interface like SCSI. These
systems, which include RAID boxes and Network Attached Storage (NAS) devices, tend
to have very large capacities, and complex performance characteristics. Users like these
systems because they are easy to install, and often provide easily-managed, bit-level
reliability with quick or instantaneous failover. These features provide a significant sense
of comfort to customers, above and beyond the promises of DBMS recovery subsystems.
It is very common to find DBMS installations on RAID boxes, for example.

Unfortunately, these systems complicate DBMS implementations. As an example, RAID
systems perform very differently after a fault than they do when all the disks are good,
potentially complicating the I/O cost models for the DBMS. Also, these systems – like
filesystems before them – tend to want to exercise temporal control over writes by
managing their own caching policies, possibly subverting the write-ahead logging
protocol. In the case of power failures, this can lead to consistency at the per-bit
granularity (storage-oriented consistency), without transactional consistency. It is
uncomfortable for the DBMS vendors to point their fingers at the disk vendors in such
cases; at the end of the day, DBMS vendors are expected to provide transactional
consistency on any popular storage device. Hence DBMSs must understand the ins and
outs of the leading storage devices, and manage them accordingly.

RAID systems also frustrate database cognoscenti by underperforming for database tasks.
RAID was conceived for bytestream-oriented storage (a la UNIX files), rather than the
tuple-oriented storage used by database systems. Hence RAID devices do not tend to
perform as well as database-specific solutions for partitioning and replicating data across
multiple physical devices (e.g. the chained declustering scheme of Gamma [12] that was
roughly coeval with the invention of RAID). Most databases provide DBA commands to
control the partitioning of data across multiple devices, but RAID boxes subvert these
commands by hiding the multiple devices behind a single interface.

Moreover, many users configure their RAID boxes to minimize space overheads (“RAID
level 5”), when the database would perform far, far better via simpler schemes like disk
mirroring (a.k.a. “RAID level 1”). A particularly unpleasant feature of RAID level 5 is
that writes are much more expensive than reads; this can cause surprising bottlenecks for
users, and the DBMS vendors are often on the hook to explain or provide workarounds
for these bottlenecks. For better or worse, the use (and misuse) of RAID devices is a fact
that commercial systems must take into account, and most vendors spend significant
energy tuning their DBMSs to work well on the leading RAID boxes.
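A back-of-the-envelope illustration of that write penalty, using the standard small-write arithmetic rather than measurements of any particular box: a small random write under RAID level 5 typically becomes a read-modify-write of both the data block and its parity block (four physical I/Os), while under RAID level 1 it is simply two mirrored writes.

    #include <cstdio>

    // Rough physical-I/O counts per logical request: the usual small-write arithmetic,
    // ignoring controller caches, full-stripe writes, and degraded-mode rebuilds.
    struct IoCost { int reads; int writes; };

    IoCost raid1_write() { return {0, 2}; }  // write both mirror copies
    IoCost raid5_write() { return {2, 2}; }  // read old data + old parity, write new data + new parity
    IoCost raid1_read()  { return {1, 0}; }  // either copy can serve the read
    IoCost raid5_read()  { return {1, 0}; }  // parity is untouched on a clean read

    int main() {
        IoCost w1 = raid1_write(), w5 = raid5_write();
        std::printf("physical I/Os per logical write: RAID-1 = %d, RAID-5 = %d\n",
                    w1.reads + w1.writes, w5.reads + w5.writes);
    }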

6.3 Replication Services


It is often desirable to replicate databases across a network via periodic updates. This is
frequently used for an extra degree of reliability – the replicated database serves as a
slightly-out-of-date “warm standby” in case the main system goes down. It is
advantageous to keep the warm standby in a physically different location, to be able to
continue functioning after a fire or other catastrophe. Replication is also often used to
provide a pragmatic form of distributed database functionality for large, geographically
distributed enterprises. Most such enterprises partition their databases into large
geographic regions (e.g. nations or continents), and run all updates locally on the primary
copies of the data. Queries are executed locally as well, but can run on a mix of fresh data
from their local operations, and slightly-out-of-date data replicated from remote
regions.

There are three typical schemes for replication, but only the third provides the
performance and scalability needed for high-end settings. It is, of course, the most
difficult to implement.

1. Physical Replication: The simplest scheme is to physically duplicate the entire
database every replication period. This scheme does not scale up to large
databases, because of the bandwidth for shipping the data, and the cost for
reinstalling it at the remote site. Moreover, it is tricky to guarantee a
transactionally consistent snapshot of the database; doing so typically requires the
unacceptable step of quiescing the source system during the replication process.
Physical replication is therefore only used as a client-side hack at the low end;
most vendors do not explicitly encourage this scheme via any software support.
2. Trigger-Based Replication: In this scheme, triggers are placed on the database
tables so that upon any insert, delete, or update to the table, a “difference” record
is installed in a special replication table. This replication table is shipped to the
remote site, and the modifications are “replayed” there. This scheme solves the
problems mentioned above for physical replication, but has a number of
performance problems. First, most database vendors provide very limited trigger
facilities – often only a single trigger is allowed per table. In such scenarios, it is
often not possible to install triggers for replication. Second, database trigger
systems cannot usually keep up with the performance of transaction systems. At a
minimum, the execution of triggering logic adds approximately 100% more I/Os
to each transaction that modifies a database, and in practice even the testing of
trigger conditions is quite slow in many systems. Hence this scheme is not
desirable in practice, though it is used with some regularity in the field.
3. Log-Based Replication: Log-based replication is the replication solution of
choice when feasible. In log-based replication, a log “sniffer” process intercepts
log writes and ships them to the remote site, where they are “played forward” in
REDO mode. This scheme overcomes all of the problems of the previous
alternatives. It is low-overhead, providing minimal or invisible performance
burdens on the running system. It provides incremental updates, and hence scales
gracefully with the database size and the update rate. It reuses the built-in
mechanisms of the DBMS without significant additional logic. Finally, it
naturally provides transactionally consistent replicas via the log’s built-in logic.

Most of the major vendors provide log-based replication for their own systems.
Providing log-based replication that works across vendors is much more difficult
– it requires understanding another vendor’s log formats, and driving that vendor’s
replay logic at the remote end. (A toy sketch of the log-shipping loop appears below.)
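The sketch below illustrates the log-shipping loop. The record format, the in-memory “WAL”, and the replay step are all invented stand-ins; a real implementation works from the vendor’s actual log format and reuses its REDO code.

    #include <cstdint>
    #include <vector>

    // A drastically simplified log record: just enough to illustrate sniff-ship-replay.
    struct LogRecord {
        uint64_t lsn;            // log sequence number
        uint64_t txn_id;
        bool     is_commit;      // commit records delimit consistent replay points
        int      redo_payload;   // stand-in for an opaque page/tuple change
    };

    std::vector<LogRecord> local_wal;       // the primary's write-ahead log (toy, in memory)
    std::vector<LogRecord> standby_inbox;   // what has been shipped to the remote site
    int standby_state = 0;                  // the replica "database", reduced to one integer

    // Sniffer side: forward everything past the last shipped LSN, in log order.
    uint64_t ship_once(uint64_t last_shipped_lsn) {
        for (const LogRecord& rec : local_wal)
            if (rec.lsn > last_shipped_lsn) {
                standby_inbox.push_back(rec);   // "send" over the network
                last_shipped_lsn = rec.lsn;
            }
        return last_shipped_lsn;
    }

    // Standby side: replay in REDO mode; because the log is already ordered and
    // carries commit records, the replica stays transactionally consistent.
    void replay() {
        for (const LogRecord& rec : standby_inbox)
            standby_state += rec.redo_payload;   // stand-in for applying the change
        standby_inbox.clear();
    }

    int main() {
        local_wal = {{1, 7, false, 10}, {2, 7, true, 0}, {3, 8, false, 5}};
        uint64_t shipped = ship_once(0);
        replay();
        (void)shipped;
    }
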
6.4 Batch Utilities
Every system provides a set of utilities for managing their system. These utilities are
rarely benchmarked, but often dictate the manageability of the system. A technically
challenging and especially important feature is to make these utilities run online, i.e.
while user queries and transactions are in flight. This is important in 24x7 operations,
which have become much more common in recent years due to the global reach of e-
commerce: the traditional “reorg window” in the wee hours is often no longer available.
Hence most vendors have invested significant energy in recent years in providing online
utilities. We give a flavor of these utilities here:

- Optimizer Statistics Gathering: Every DBMS has a process that sweeps the
tables and builds optimizer statistics of one sort or another. Some statistics like
histograms are non-trivial to build in one pass without flooding memory; see, for
example, the work by Flajolet and Martin on computing the number of distinct
values in a column [13] (a toy version of such a distinct-count estimator is sketched after this list).
- Physical Reorganization and Index Construction: Over time, access methods
can become inefficient due to patterns of insertions and deletions leaving unused
space. Also, users may occasionally request that tables be reorganized in the
background – e.g. to recluster (sort) them on different columns, or to repartition
them across multiple disks. Online reorganization of files and indexes can be
tricky, since it must avoid holding locks for any length of time, but still needs to
maintain physical consistency. In this sense it bears some analogies to the
logging and locking protocols used for indexes, as described in Section 5.4. This
has been the subject of a few research papers [69]. Similar issues arise in the
background construction of indexes from scratch.
- Backup/Export: All DBMSs support the ability to physically dump the database
to backup storage. Again, since this is a long-running process, it cannot naively
set locks. Instead, most systems perform some kind of “fuzzy” dump, and
augment it with logging logic to ensure transactional consistency. Similar
schemes can be used to export the database to an interchange format.
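To give a flavor of the Flajolet-Martin technique cited in the statistics-gathering item above, here is a bare-bones, one-pass distinct-count estimator: a single hash function, the classical 0.77351 correction constant, and none of the averaging over many hash functions that a production implementation would use, so its estimates on tiny inputs are rough.

    #include <cmath>
    #include <cstdint>
    #include <cstdio>
    #include <functional>
    #include <string>
    #include <vector>

    // One-pass, constant-memory Flajolet-Martin style estimate of the number of
    // distinct values in a column (single hash function, no stochastic averaging).
    class DistinctCounter {
    public:
        void observe(const std::string& value) {
            uint64_t h = std::hash<std::string>{}(value);
            int r = 0;                                    // position of the least-significant 1 bit
            while (r < 63 && ((h >> r) & 1) == 0) ++r;
            bitmap_ |= (uint64_t{1} << r);
        }
        double estimate() const {
            int R = 0;                                    // least-significant 0 bit of the bitmap
            while (R < 64 && ((bitmap_ >> R) & 1)) ++R;
            return std::ldexp(1.0, R) / 0.77351;          // classical FM bias correction
        }
    private:
        uint64_t bitmap_ = 0;
    };

    int main() {
        DistinctCounter dc;
        std::vector<std::string> column = {"red", "blue", "red", "green", "blue", "red"};
        for (const std::string& v : column) dc.observe(v);   // a single sweep over the table
        std::printf("estimated distinct values: %.1f\n", dc.estimate());
    }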

7 Conclusion
As should be clear from this paper, modern commercial database systems are grounded
both in academic research and in the experience of developing industrial-strength
products for high-end customers. The task of writing and maintaining a high-
performance, fully functional relational DBMS from scratch is an enormous investment
in time and energy. As the database industry has consolidated to a few main competitors,
it has become less and less attractive for new players to enter the main arena. However,
many of the lessons of relational DBMSs translate over to new domains: web services,
network-attached storage, text and e-mail repositories, notification services, network
monitors, and so on. Data-intensive services are at the core of computing today, and
knowledge of database system design is a skill that is broadly applicable, both inside and
outside the halls of the main database shops. These new directions raise a number of
research problems in database management as well, and point the way to new interactions
between the database community and other areas of computing.

8 Acknowledgments
The authors would like to thank Rob von Behren, Eric Brewer, Paul Brown, Amol
Deshpande, Jim Gray, James Hamilton, Wei Hong, Guy Lohman, Mehul Shah and Matt
Welsh for background information and comments on early drafts of this paper.

9 References

[1] Atul Adya, Barbara Liskov, and Patrick O'Neil. Generalized Isolation Level
Definitions. In 16th International Conference on Data Engineering (ICDE), San
Diego, CA, February 2000.

[2] Rakesh Agrawal, Michael J. Carey and Miron Livny. Concurrency control
performance modeling: alternatives and implications, ACM Transactions on
Database Systems (TODS) 12(4):609-654, 1987.

[3] Morton M. Astrahan, Mike W. Blasgen, Donald D. Chamberlin, Kapali P.
Eswaran, Jim Gray, Patricia P. Griffiths, W. Frank King III, Raymond A. Lorie,
Paul R. McJones, James W. Mehl, Gianfranco R. Putzolu, Irving L. Traiger,
Bradford W. Wade, and Vera Watson. System R: Relational Approach to
Database Management. ACM Transactions on Database Systems (TODS),
1(2):97-137, 1976.

[4] Rudolf Bayer and Mario Schkolnick. Concurrency of Operations on B-Trees.


Acta Informatica, 9:1-21, 1977.

[5] Kristin P. Bennett, Michael C. Ferris, and Yannis E. Ioannidis. A Genetic


Algorithm for Database Query Optimization. In Proceedings of the 4th
International Conference on Genetic Algorithms, pages 400-407, San Diego,
CA, July 1991.

[6] Hal Berenson, Philip A. Bernstein, Jim Gray, Jim Melton, Elizabeth J. O'Neil,
and Patrick E. O'Neil. A Critique of ANSI SQL Isolation Levels. In Proc. ACM
SIGMOD International Conference on Management of Data, pages 1-10, San
Jose, CA, May 1995.

[7] William Bridge, Ashok Joshi, M. Keihl, Tirthankar Lahiri, Juan Loaiza, and
N. MacNaughton. The Oracle Universal Server Buffer. In Proc. 23rd
International Conference on Very Large Data Bases (VLDB), pages 590-594,
Athens, Greece, August 1997. Morgan Kaufmann.

[8] Surajit Chaudhuri and Vivek R. Narasayya. AutoAdmin 'What-if' Index Analysis
Utility. In Proc. ACM SIGMOD International Conference on Management of
Data, pages 367-378, Seattle, WA, June 1998.

[9] Surajit Chaudhuri and Kyuseok Shim. Optimization of Queries with User-
Defined Predicates. ACM Transactions on Database Systems (TODS), 24(2):177-
228, 1999.

[10] Hong-Tai Chou and David J. DeWitt. An Evaluation of Buffer Management


Strategies for Relational Database Systems. In Proceedings of 11th International
Conference on Very Large Data Bases (VLDB), pages 127-141, Stockholm,
Sweden, August 1985.

[11] Amol Deshpande, Minos Garofalakis, and Rajeev Rastogi. Independence is


Good: Dependency-Based Histogram Synopses for High-Dimensional Data. In
Proceedings of the 18th International Conference on Data Engineering, San
Jose, CA, February 2001.

[12] David J. DeWitt, Robert H. Gerber, Goetz Graefe, Michael L. Heytens, Krishna
B. Kumar, and M. Muralikrishna. GAMMA - A High Performance Dataflow
Database Machine. In Twelfth International Conference on Very Large Data
Bases (VLDB), pages 228-237, Kyoto, Japan, August 1986.

[13] Philippe Flajolet and G. Nigel Martin. Probabilistic Counting Algorithms for
Data Base Applications. Journal of Computing System Science, 31(2):182-209,
1985.

[14] Sumit Ganguly, Waqar Hasan, and Ravi Krishnamurthy. Query Optimization for
Parallel Execution. In Proceedings of the ACM SIGMOD International
Conference on Management of Data, pages 9-18, San Diego, CA, June 1992.

[15] Minos N. Garofalakis and Yannis E. Ioannidis. Parallel Query Scheduling and
Optimization with Time- and Space-Shared Resources. In Proc. 23rd
International Conference on Very Large Data Bases (VLDB), pages 296-305,
Athens, Greece, August 1997.

[16] G. Graefe. Encapsulation of Parallelism in the Volcano Query Processing


System. In Proc. ACM-SIGMOD International Conference on Management of
Data, pages 102-111, Atlantic City, May 1990.

[17] Goetz Graefe. The Cascades Framework for Query Optimization. IEEE Data
Engineering Bulletin, 18(3):19-29, 1995.

[18] G. Graefe. Query Evaluation Techniques for Large Databases. Computing


Surveys 25 (2): 73-170 (1993).

[19] Goetz Graefe and William J. McKenna. The Volcano Optimizer Generator:
Extensibility and Efficient Search. In Proc. 9th International Conference on
Data Engineering (ICDE), pages 209-218, Vienna, Austria, April 1993.

[20] Jim Gray and Goetz Graefe. The Five-Minute Rule Ten Years Later, and Other
Computer Storage Rules of Thumb. ACM SIGMOD Record, 26(4):63-68, 1997.

[21] Jim Gray, Raymond A. Lorie, Gianfranco R. Putzolu, and Irving L. Traiger.
Granularity of Locks and Degrees of Consistency in a Shared Data Base. In IFIP
Working Conference on Modelling in Data Base Management Systems, pages
365-394, 1976.

[22] Jim Gray and Andreas Reuter. Transaction Processing: Concepts and
Techniques. Morgan Kaufmann, 1993.

[23] Steven D. Gribble, Eric A. Brewer, Joseph M. Hellerstein, and David Culler.
Scalable, Distributed Data Structures for Internet Service Construction. In
Proceedings of the Fourth Symposium on Operating Systems Design and
Implementation (OSDI), 2000.

[24] Antonin Guttman. R-Trees: A Dynamic Index Structure For Spatial Searching. In
Proc. ACM-SIGMOD International Conference on Management of Data, pages
47-57, Boston, June 1984.

[25] Alon Y. Halevy, editor. The VLDB Journal, Volume 11(4). The VLDB
Foundation, Dec 2002.

[26] Theo Härder and Andreas Reuter. Principles of Transaction-Oriented Database


Recovery. ACM Computing Surveys, 15(4):287-317, 1983.

[27] Pat Helland, Harald Sammer, Jim Lyon, Richard Carr, Phil Garrett, and Andreas
Reuter. Group Commit Timers and High-Volume Transaction Systems.
Technical Report TR-88.1, Tandem Computers, March 1988.

[28] Joseph M. Hellerstein. Optimization Techniques for Queries with Expensive


Methods. ACM Transactions on Database Systems (TODS), 23(2):113-157,
1998.

[29] Joseph M. Hellerstein and Michael Stonebraker. Predicate Migration: Optimizing


Queries With Expensive Predicates. In Proc. ACM-SIGMOD International
Conference on Management of Data, pages 267-276, Washington, D.C., May
1993.

[30] C. Hoare. Monitors: An operating system structuring concept. Communications


of the ACM (CACM), 17(10):549-557, 1974.

[31] Wei Hong and Michael Stonebraker. Optimization of Parallel Query Execution
Plans in XPRS. In Proceedings of the First International Conference on Parallel
and Distributed Information Systems (PDIS), pages 218-225, Miami Beach, FL,
December 1991.

[32] Hui-I Hsiao and David J. DeWitt. Chained Declustering: A New Availability
Strategy for Multiprocessor Database Machines. In Proc. Sixth International
Conference on Data Engineering (ICDE), pages 456-465, Los Angeles, CA,
November 1990.

[33] Yannis E. Ioannidis and Stavros Christodoulakis. On the Propagation of Errors in


the Size of Join Results. In Proceedings of the ACM SIGMOD International
Conference on Management of Data, pages 268-277, Denver, CO, May 1991.

[34] Yannis E. Ioannidis and Younkyung Cha Kang. Randomized Algorithms for
Optimizing Large Join Queries. In Proc. ACM-SIGMOD International
Conference on Management of Data, pages 312-321, Atlantic City, May 1990.

[35] Marcel Kornacker, C. Mohan, and Joseph M. Hellerstein. Concurrency and


Recovery in Generalized Search Trees. In Proc. ACM SIGMOD International
Conference on Management of Data, pages 62-72, Tucson, AZ, May 1997.

[36] H. T. Kung and John T. Robinson. On Optimistic Methods for Concurrency


Control. ACM Transactions on Database Systems (TODS), 6(2):213-226, 1981.

[37] James R. Larus and Michael Parkes. Using Cohort Scheduling to Enhance Server
Performance. In USENIX Annual Conference, 2002.

[38] H. C. Lauer and R. M. Needham. On the Duality of Operating System Structures.


ACM SIGOPS Operating Systems Review, 13(2):3-19, April 1979.

[39] Philip L. Lehman and S. Bing Yao. Efficient Locking for Concurrent Operations
on B-Trees. ACM Transactions on Database Systems (TODS), 6(4):650-670,
December 1981.

[40] Alon Y. Levy, Inderpal Singh Mumick, and Yehoshua Sagiv. Query
Optimization by Predicate Move-Around. In Proc. 20th International
Conference on Very Large Data Bases, pages 96-107, Santiago, September 1994.

[41] Witold Litwin. Linear Hashing: A New Tool for File and Table Addressing. In
Sixth International Conference on Very Large Data Bases (VLDB), pages 212-
223, Montreal, Quebec, Canada, October 1980.

[42] Guy M. Lohman. Grammar-like Functional Rules for Representing Query


Optimization Alternatives. In Proc. ACM SIGMOD International Conference on
Management of Data, pages 18-27, Chicago, IL, June 1988.

[43] Samuel R. Madden and Michael J. Franklin. Fjording the Stream: An


Architecture for Queries over Streaming Sensor Data. In Proc. 18th IEEE
International Conference on Data Engineering (ICDE), San Jose, February
2002.

[44] C. Mohan. ARIES/KVL: A Key-Value Locking Method for Concurrency


Control of Multiaction Transactions Operating on B-Tree Indexes. In 16th
International Conference on Very Large Data Bases (VLDB), pages 392-405,
Brisbane, Queensland, Australia, August 1990.

[45] C. Mohan, Donald J. Haderle, Bruce G. Lindsay, Hamid Pirahesh, and Peter M.
Schwarz. ARIES: A Transaction Recovery Method Supporting Fine- Granularity
Locking and Partial Rollbacks Using Write-Ahead Logging. ACM Transactions
on Database Systems (TODS), 17(1):94-162, 1992.

[46] C. Mohan and Frank Levine. ARIES/IM: An Efficient and High Concurrency
Index Management Method Using Write-Ahead Logging. In Michael
Stonebraker, editor, Proc. ACM SIGMOD International Conference on
Management of Data, pages 371-380, San Diego, CA, June 1992.

[47] C. Mohan, Bruce G. Lindsay, and Ron Obermarck. Transaction Management in


the R* Distributed Database Management System. ACM Transactions on
Database Systems (TODS), 11(4):378-396, 1986.

[48] Elizabeth J. O'Neil, Patrick E. O'Neil, and Gerhard Weikum. The LRU-K Page
Replacement Algorithm For Database Disk Buffering. In Proceedings ACM
SIGMOD International Conference on Management of Data, pages 297-306,
Washington, D.C., May 1993.

[49] Patrick E. O'Neil and Dallan Quass. Improved Query Performance with Variant
Indexes. In Proc. ACM-SIGMOD International Conference on Management of
Data, pages 38-49, Tucson, May 1997.

[50] M. Tamer Ozsu and Jose A. Blakeley. Query Processing in Object-Oriented


Database Systems. In Won Kim, editor, Modern Database Systems. Addison
Wesley, 1995.

[51] Hamid Pirahesh, Joseph M. Hellerstein, and Waqar Hasan. Extensible/Rule-


Based Query Rewrite Optimization in Starburst. In Proc. ACM-SIGMOD
International Conference on Management of Data, pages 39-48, San Diego, June
1992.

[52] Viswanath Poosala and Yannis E. Ioannidis. Selectivity Estimation Without the
Attribute Value Independence Assumption. In Proceedings of 23rd International
Conference on Very Large Data Bases (VLDB), pages 486-495, Athens, Greece,
August 1997.

[53] Raghu Ramakrishnan and Johannes Gehrke. Database Management Systems,


Third Edition. McGraw-Hill, Boston, MA, 2003.

[54] David P. Reed. Naming and Synchronization in a Decentralized Computer


System. PhD thesis, MIT, Dept. of Electrical Engineering, 1978.

[55] Allen Reiter. A Study of Buffer Management Policies for Data Management
Systems. Technical Summary Report 1619, Mathematics Research Center,
University of Wisconsin, Madison, 1976.

[56] Daniel J. Rosenkrantz, Richard E. Stearns, and Philip M. Lewis. System Level
Concurrency Control for Distributed Database Systems. ACM Transactions on
Database Systems (TODS), 3(2):178-198, June 1978.

[57] Patricia G. Selinger, M. Astrahan, D. Chamberlin, Raymond Lorie, and T. Price.


Access Path Selection in a Relational Database Management System. In Proc.
ACM-SIGMOD International Conference on Management of Data, pages 22-34,
Boston, June 1979.

[58] Praveen Seshadri, Hamid Pirahesh, and T.Y. Cliff Leung. Complex Query
Decorrelation. In Proc. 12th IEEE International Conference on Data
Engineering (ICDE), New Orleans, February 1996.

[59] Mehul A. Shah, Samuel Madden, Michael J. Franklin, and Joseph M. Hellerstein.
Java Support for Data-Intensive Systems: Experiences Building the Telegraph
Dataflow System. ACM SIGMOD Record, 30(4):103-114, 2001.

[60] Leonard D. Shapiro. Exploiting Upper and Lower Bounds in Top-Down Query
Optimization. International Database Engineering and Application Symposium
(IDEAS), 2001.

[61] Abraham Silberschatz, Henry F. Korth, and S. Sudarshan. Database System


Concepts, Fourth Edition. McGraw-Hill, Boston, MA, 2001.

[62] Michael Steinbrunn, Guido Moerkotte, and Alfons Kemper. Heuristic and
Randomized Optimization for the Join Ordering Problem. VLDB Journal,
6(3):191-208, 1997.

[63] Michael Stonebraker. Retrospection on a Database System. ACM Transactions


on Database Systems (TODS), 5(2):225-240, 1980.

[64] Michael Stonebraker. Operating System Support for Database Management.


Communications of the ACM (CACM), 24(7):412-418, 1981.

[65] Michael Stonebraker. The Case for Shared Nothing. IEEE Database Engineering
Bulletin, 9(1):4-9, 1986.

[66] M.R. Stonebraker, E. Wong, and P. Kreps. The Design and Implementation of
INGRES. ACM Transactions on Database Systems, 1(3):189-222, September
1976.

[67] Matt Welsh, David Culler, and Eric Brewer. SEDA: An Architecture for Well-
Conditioned, Scalable Internet Services. In Proceedings of the 18th Symposium
on Operating Systems Principles (SOSP-18), Banff, Canada, October 2001.

[68] Rob von Behren, Jeremy Condit, Feng Zhou, George C. Necula, and Eric
Brewer. Capriccio: Scalable Threads for Internet Services. In Proceedings of the
Nineteenth Symposium on Operating System Principles (SOSP-19), Lake George,
New York. October 2003.

[69] Chendong Zou and Betty Salzberg. On-line Reorganization of Sparsely-


populated B+trees. In Proc. ACM SIGMOD International Conference on
Management of Data, pages 115-124, Montreal, Quebec, Canada, 1996.

Chapter 2
Query Processing

This chapter presents a selection of key papers on query processing, starting with single-site
query processing, and continuing through parallel and distributed systems. In previous editions
we presented the material on parallel and distributed systems in separate sections, but the reality
today is that all systems of note have parallel processing features, and most have at least
rudimentary distributed functionality as well. Hence we fold the discussion of single-site,
parallel, and distributed systems into a single chapter. We will say more about parallelism and
distribution soon, but we begin with two foundational issues from a single-site perspective: query
optimization, and join algorithms.

Relational query optimization is well known to be a difficult problem, and the theoretical results
in the space can be especially discouraging. First, it is computationally complex: Ibaraki and
Kameda showed early on that optimizing a query that joins n relations is NP-hard [IK84].
Second, it relies on cost estimation techniques that are difficult to get right; and as
Christodoulakis and Ioannidis showed, the effects of even small errors can in some circumstances
render most optimization schemes no better than random guesses [IC91]. Fortunately, query
optimization is an arena where negative theoretical results at the extremes do not spell disaster in
most practical cases. In the field, query optimization technology has proved critical to the
performance of database systems, and serves as the cornerstone of architectures to achieve
Codd’s vision of data independence. The limitation on the number of joins is a fact that users
have learned to live with, and estimation error is an active area of research and development that
has seen significant, continuing improvements over time. In short, query optimizers today do
their jobs quite well, based on the foundations developed a quarter century ago. However, the
difficulty of the problem has left room for a steady stream of improvements from the research
community.

Early innovations in query optimization separated the technologists from the marketeers in the
database industry. This culminated in now-famous tales of Oracle’s so-called “syntactic
optimizer” (which simply ordered joins based on their lexical appearance in the query) and the
embarrassment it brought upon them in the days of the Wisconsin Benchmark [BDT83]1. In
practice, a respectable cost-based query optimizer is typically far better than any simple heuristic
scheme.

We begin this chapter with the famous Selinger, et al. paper on System R’s query optimization
scheme, which remains the fundamental reading in this area. The paper does two things
remarkably well. First, it breaks down the complex space of query optimization into manageable,
independently-addressable problems; this breakdown is not explicit in the paper, but represents its
largest contribution. Second, it provides a plausible line of attack for each of the problems. Of
course many of the techniques for attacking these problems have evolved over time – but this is
rightly seen as a tribute to the problem breakdown proposed in the paper.

1
This embarrassment was due only in part to Oracle’s poor optimizer and resulting poor performance.
Like any good scandal, it was eclipsed by attempts at a cover-up. As the story goes, in the wake of the
initial Wisconsin Benchmark results, Oracle’s CEO tried to convince the University of Wisconsin to fire
benchmark author David DeWitt. This corporate meddling apparently had little influence on UW
administrators. Subsequently, Oracle introduced a licensing clause that forbids customers from using the
system for purposes of benchmarking. Imagine Ford trying to forbid Consumer Reports from evaluating
cars! Sadly, Oracle’s competitors all adopted this “DeWitt Clause” as well, and it persists to this day.
Although some legal experts question the ability of the DeWitt Clause to stand up in court, it has not been
significantly tested to date.

At the highest level, the Selinger paper first simplifies the query optimization problem to focus on
individual SQL query “blocks” (SELECT-FROM-WHERE, or select-project-join in algebraic
terms.) For each query block, it neatly separates three concerns: the plan space of legal execution
strategies, cost estimation techniques for predicting the resource consumption of each plan, and a
search strategy based on dynamic programming for efficiently considering different options in
the search space. These three concerns have each been the subject of significant follow-on work,
as mentioned in the second paper of Section 1. Each new benchmark drives the vendors to plug
another hole in their optimizer, and commercial optimizer improvement is a slow but continuous
process. Perhaps the least robust aspect of the original System R design was its set of formulae
for selectivity estimation. But there are numerous other issues glossed over in the paper. An
excellent exercise for the research-minded reader is to compile a list of the assumptions built into
this paper. Lurking behind each assumption is a research topic: find a plausible scenario where
the assumption does not hold, see how that stresses the System R design, and propose a fix. Some
of these scenarios will be amenable to evolutionary fixes, others may require more revolutionary
changes.
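To make the dynamic-programming search concrete, here is a heavily simplified, hypothetical version restricted to left-deep join orders. The base-table cardinalities and the single uniform selectivity are made up, the cost formula is a crude nested-loops stand-in, and interesting orders, access-path selection, and real selectivity estimation are all omitted; it is meant only to show the shape of the Selinger-style enumeration.

    #include <cstdio>
    #include <vector>

    // Toy Selinger-style optimizer: enumerate left-deep join orders over subsets of
    // relations with dynamic programming, using a deliberately crude cost model.
    int main() {
        const std::vector<double> card = {1000, 50, 200, 10};   // invented base-table cardinalities
        const double selectivity = 0.01;                        // one uniform join selectivity
        const int n = static_cast<int>(card.size());
        const int subsets = 1 << n;
        const double INF = 1e30;

        std::vector<double> best_cost(subsets, INF), out_card(subsets, 0.0);
        std::vector<int> last_rel(subsets, -1);

        for (int i = 0; i < n; ++i) {                 // single relations: a scan, "free" in this toy model
            best_cost[1 << i] = 0.0;
            out_card[1 << i] = card[i];
        }
        for (int s = 1; s < subsets; ++s) {           // build larger plans from smaller ones
            if (best_cost[s] == INF) continue;
            for (int r = 0; r < n; ++r) {
                if (s & (1 << r)) continue;           // relation r already joined
                int t = s | (1 << r);
                double join_card = out_card[s] * card[r] * selectivity;
                double cost = best_cost[s] + out_card[s] * card[r];   // toy nested-loops cost
                if (cost < best_cost[t]) { best_cost[t] = cost; out_card[t] = join_card; last_rel[t] = r; }
            }
        }

        int s = subsets - 1;                          // the full set of relations
        std::printf("best estimated cost: %.0f, join order (last join to first): ", best_cost[s]);
        while (last_rel[s] != -1) {
            int r = last_rel[s];
            std::printf("R%d ", r);
            s ^= (1 << r);
        }
        for (int i = 0; i < n; ++i)
            if (s == (1 << i)) std::printf("R%d", i);
        std::printf("\n");
    }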

The paper closes with some execution tricks for nested queries. This topic received relatively
little attention for quite some time, until it was revisited in the context of query rewriting –
especially in Starburst [PHH92,SPL96], the predecessor to recent versions of DB2.

The second paper in this chapter presents Shapiro’s description of hash join and sort-merge join
algorithms. This paper reviews the earlier GRACE [KTM83] and Hybrid Hash [DKO+84]
conference papers, but we include this later paper by Shapiro since it does a nice job placing these
algorithms side-by-side with sort-merge join. Shapiro also provides a discussion of possible
interactions between hash joins and virtual memory replacement strategies, but this material is of
less interest – in practice, these algorithms explicitly manage whatever memory they are granted,
without any participation from virtual memory.

Some notes on this paper are in order. First, it presents Hybrid Hash as the most advanced hash
join variant, but in practice the advantage of Hybrid over Grace is negligible (especially in the
presence of a good optimizer), and hence most commercial systems simply implement Grace hash
join. Second, Shapiro’s paper does not cover schemes to handle hash skew, where some
partitions are much larger than others; this tricky issue is addressed in a sequence of papers
[NKT88, KNT89, PCL93]. Third, it does not discuss how hash-based schemes akin to Grace’s
can be used for unary operators like grouping, duplicate-elimination, or caching; these topics are
addressed in more detail in other work [Bra84, HN96].
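As a reference point for the algorithms Shapiro compares, here is the in-memory core they all share: a build phase and a probe phase. This bare sketch uses invented tuple types and assumes the build input fits in memory, so it omits exactly the partitioning to disk that the GRACE and Hybrid variants add.

    #include <cstdio>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    struct BuildTuple { int key; std::string payload; };
    struct ProbeTuple { int key; std::string payload; };

    // Classic two-phase hash join: build a hash table on the smaller input,
    // then stream the larger input past it.
    std::vector<std::pair<BuildTuple, ProbeTuple>>
    hash_join(const std::vector<BuildTuple>& build, const std::vector<ProbeTuple>& probe) {
        std::unordered_multimap<int, BuildTuple> table;
        table.reserve(build.size());
        for (const BuildTuple& b : build)            // build phase
            table.emplace(b.key, b);

        std::vector<std::pair<BuildTuple, ProbeTuple>> out;
        for (const ProbeTuple& p : probe) {          // probe phase
            auto range = table.equal_range(p.key);
            for (auto it = range.first; it != range.second; ++it)
                out.push_back({it->second, p});
        }
        return out;
    }

    int main() {
        std::vector<BuildTuple> dept = {{1, "Sales"}, {2, "Engineering"}};
        std::vector<ProbeTuple> emp  = {{2, "Alice"}, {1, "Bob"}, {2, "Carol"}};
        for (const auto& [d, e] : hash_join(dept, emp))
            std::printf("%s works in %s\n", e.payload.c_str(), d.payload.c_str());
    }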

Graefe’s query processing survey [Gra93] covers various subtleties inherent in the hash- and sort-
based operators used in query execution; the reader is especially directed to the discussion of the
“duality” between sorting and hashing. The description of Hash Teams in Microsoft’s SQL
Server [GBC93] covers additional details of both memory management and query optimization
that apply when many hash-based operators are used in the same query plan.

Following the Shapiro paper, the chapter continues with a survey of parallel database technology
by DeWitt and Gray, which focuses largely on query processing issues. Historically, parallel
database systems arose from earlier research on database machines: hardware/software co-
designs for database systems [BD83]. These systems investigated designing special devices to
accelerate simple database operations like selection, including disks with a processor on each
head, or on each track. This research thrust was eventually abandoned when it became clear that
specialized database hardware would never keep pace with the exponentially-improving rate of
commodity hardware.2

The research thread on database machines evolved into research on exploiting multiple
commodity processors for query execution. The DeWitt/Gray survey does an excellent job laying
out this design space, including the relevant performance metrics and the core architectural ideas.
DeWitt and Gray distill out most of the key points from the revolutionary parallel database
systems including Gamma [DGG+86], Bubba [BAC+90], XPRS [HS91] and the commercial
TeraData system. A deep study of the area should certainly consult the original papers on these
systems, but many of the major lessons are covered in the survey.

To flesh out some of the detail in parallel databases, we include two additional papers: Graefe’s
deceptively simple architectural paper on Exchange provides a flavor of high-level software
engineering elegance, whereas the AlphaSort paper gives an example of the significant benefits
available by micro-optimizing individual query operators.

Our fourth paper, on Exchange, shows how parallelism can be elegantly retrofitted into a
traditional iterator architecture for query execution. The insight is easy to understand, and hence
perhaps easy to undervalue, but it can have an important simplifying effect on software
implementation. This style of work is typically the domain of Operating Systems research; the
database systems literature is perhaps more biased towards feats of complexity than elegance of
mechanism. Exchange is a nice example of elegant mechanism design in the database research
literature. It weaves hash-partitioning, queuing, networking, and process boundaries into a single
iterator that encapsulates them all invisibly. This simplicity is especially attractive when one
considers that most parallel DBMS implementations evolved from single-site implementations.

While Exchange is elegant, the reader is also warned that things are not as simple as they appear
in the paper. Of course a parallel DBMS needs quite a bit of surrounding technology beyond
Exchange: a parallelism-aware optimizer, appropriate support for transactions as discussed in the
paper in Section 1, parallel management utilities, and so on. The query executor is probably not
the most challenging DBMS component to parallelize. Moreover, even Exchange itself requires
somewhat more subtlety in practice than is described here, as Graefe notes elsewhere [GD93],
particularly to handle starting up and shutting down the many processes for the Exchange.
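A stripped-down illustration of that encapsulation, assuming a conventional open/next/close iterator interface: one producer thread, a single consumer, an unbounded queue, and no hash partitioning across multiple consumers, all of which a real Exchange operator would add. The types and names are invented for the sketch.

    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <optional>
    #include <queue>
    #include <thread>
    #include <vector>

    struct Tuple { int key; int payload; };

    struct Iterator {
        virtual void open() = 0;
        virtual std::optional<Tuple> next() = 0;   // nullopt == end of stream
        virtual void close() = 0;
        virtual ~Iterator() = default;
    };

    // A trivial scan over an in-memory vector, standing in for a real child operator.
    struct VectorScan : Iterator {
        std::vector<Tuple> data; size_t pos = 0;
        explicit VectorScan(std::vector<Tuple> d) : data(std::move(d)) {}
        void open() override { pos = 0; }
        std::optional<Tuple> next() override {
            if (pos >= data.size()) return std::nullopt;
            return data[pos++];
        }
        void close() override {}
    };

    // Exchange: hides a producer thread and a queue behind the same iterator
    // interface, so the consumer is unaware of the parallelism.
    class Exchange : public Iterator {
        Iterator* child;
        std::queue<std::optional<Tuple>> q;   // nullopt acts as the end-of-stream marker
        std::mutex m; std::condition_variable cv;
        std::thread producer;
      public:
        explicit Exchange(Iterator* c) : child(c) {}
        void open() override {
            child->open();
            producer = std::thread([this] {
                for (;;) {
                    std::optional<Tuple> t = child->next();
                    { std::lock_guard<std::mutex> g(m); q.push(t); }
                    cv.notify_one();
                    if (!t) break;                 // forwarded end-of-stream, producer exits
                }
            });
        }
        std::optional<Tuple> next() override {
            std::unique_lock<std::mutex> g(m);
            cv.wait(g, [this] { return !q.empty(); });
            std::optional<Tuple> t = q.front(); q.pop();
            return t;
        }
        void close() override { producer.join(); child->close(); }
    };

    int main() {
        VectorScan scan({{1, 10}, {2, 20}, {3, 30}});
        Exchange ex(&scan);
        ex.open();
        while (auto t = ex.next()) std::printf("key=%d payload=%d\n", t->key, t->payload);
        ex.close();
    }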

The fifth paper in the chapter is on AlphaSort. Parallel sorting has become a competitive sport,
grounded in the database research community. Jim Gray maintains a website off of his home
page where he keeps the latest statistics on world records in parallel, disk-to-disk sorting (the
input begins on disk, and must be output in sorted runs on disk). Since the time of the AlphaSort
work, a number of other teams have come along and improved upon the work, both via new
software insights and via improvements in hardware over time. While the competition here is
stiff, the enthusiastic reader is not discouraged from entering the fray – a number of student
groups have held sorting trophies at various times, and contributed to our understanding of the
topic.

2
Perhaps the most commercially successful and technically impressive of these systems was from a
company called Britton-Lee, which was founded by a number of alumni from the INGRES research group.
As it became clear that Britton-Lee’s hardware would not be competitive in the marketplace, one of the
company’s founders, Robert Epstein, left to start a software-only database company called Sybase that was
eventually quite successful. Ironically, Sybase was rather late in joining the parallel processing game.

AlphaSort also represents a thread of research into the interactions between database systems and
computer architecture; this topic has seen increasing interest in recent years (e.g.
[ADH02,RR00,CGM01, etc.]) Sorting is an excellent benchmark of both hardware and software
data throughput – the raw ability to pump data off disk, through various network and memory
busses, through some non-trivial code, and eventually back onto disk. As noted in the AlphaSort
paper, one of the major bottlenecks to consider in such scenarios is that of processor cache
misses. This problem has become even more important since the time of the AlphaSort work.
Readers who find the AlphaSort paper intriguing are encouraged to consult a good computer
architecture textbook, like that of Patterson and Hennessy, which spells out many of the issues in
the paper in more detail.
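One of AlphaSort’s central observations is easy to show in miniature: rather than sorting whole records (and paying a cache miss per comparison), sort small (key-prefix, pointer) entries that pack densely into the cache, and dereference the pointers only at the end. The record layout and sizes below are invented for the sketch.

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    struct Record { char key[16]; char payload[84]; };   // invented 100-byte record layout

    // A small sort entry: the first 8 bytes of the key plus a pointer to the record.
    struct Entry { uint64_t prefix; const Record* rec; };

    uint64_t key_prefix(const Record& r) {               // big-endian pack so integer compare == memcmp
        uint64_t p = 0;
        for (int i = 0; i < 8; ++i) p = (p << 8) | static_cast<unsigned char>(r.key[i]);
        return p;
    }

    void cache_conscious_sort(const std::vector<Record>& records, std::vector<const Record*>& out) {
        std::vector<Entry> entries;
        entries.reserve(records.size());
        for (const Record& r : records) entries.push_back({key_prefix(r), &r});

        // The hot loop touches only 16-byte entries, not 100-byte records.
        std::sort(entries.begin(), entries.end(),
                  [](const Entry& a, const Entry& b) {
                      if (a.prefix != b.prefix) return a.prefix < b.prefix;
                      return std::memcmp(a.rec->key, b.rec->key, sizeof a.rec->key) < 0;   // rare tie-break
                  });

        out.clear();
        for (const Entry& e : entries) out.push_back(e.rec);   // dereference once, at the end
    }

    int main() {
        std::vector<Record> recs(3);
        std::strcpy(recs[0].key, "pear");
        std::strcpy(recs[1].key, "apple");
        std::strcpy(recs[2].key, "mango");
        std::vector<const Record*> sorted;
        cache_conscious_sort(recs, sorted);
        for (const Record* r : sorted) std::printf("%s\n", r->key);
    }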

We conclude the section with two papers on wide-area, distributed query processing. Distributed
database systems arose quite separately from the work on parallelism; the leading early
distributed DBMS projects were SDD-1 (at the Computer Corporation of America), INGRES* (at
Berkeley), and R* (at IBM San Jose). A main goal of the early work on distributed query
processing was to minimize network bandwidth consumption during query processing; this was
not a main goal of the parallel systems.

Our sixth paper, by Mackert and Lohman, enumerates the space of standard join algorithms for
distributed query processing. It also makes a point that was overlooked in SDD-1: bandwidth
consumption may be an important cost in distributed query processing, but it is not the only cost
that should be considered. Instead, a traditional query optimizer should be extended to weigh all
of the relevant costs, including I/Os, CPU operations, and network communication (including
per-message latencies as well as bandwidth). The Mackert/Lohman paper is ostensibly a micro-
benchmark of the R* system, but it should be read largely as an introduction to the various join
strategies – particularly the ideas of semi-joins and Bloom joins. It is worth noting that semi-join
style techniques recur in the literature even in single-site scenarios, as part of decorrelating
subqueries [MFPR90, SPL96]. The lesson there is somewhat subtle and beyond the scope of our
discussion; connoisseurs of query processing may choose to investigate those papers further.
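To give the flavor of a Bloom join, here is a minimal sketch with one small Bloom filter, two invented hash salts, and toy relations (none of the filter-sizing math a real system would do): the site holding the smaller relation builds a compact bit-vector summary of its join keys, ships only that, and the other site sends across just the tuples whose keys pass the filter.

    #include <cstdio>
    #include <functional>
    #include <string>
    #include <vector>

    // A tiny Bloom filter over integer join keys.
    class BloomFilter {
    public:
        explicit BloomFilter(size_t nbits) : bits_(nbits, false) {}
        void add(int key) { for (size_t h : hashes(key)) bits_[h] = true; }
        bool may_contain(int key) const {
            for (size_t h : hashes(key)) if (!bits_[h]) return false;
            return true;                              // "maybe": false positives are possible
        }
    private:
        std::vector<size_t> hashes(int key) const {
            std::hash<long long> h;
            return {h(key) % bits_.size(), h(key * 2654435761LL) % bits_.size()};
        }
        std::vector<bool> bits_;
    };

    struct Tuple { int key; std::string payload; };

    int main() {
        // Site B holds the smaller relation; it ships only a Bloom filter of its keys.
        std::vector<Tuple> site_b = {{3, "shipped"}, {7, "billed"}};
        BloomFilter filter(256);
        for (const Tuple& t : site_b) filter.add(t.key);

        // Site A probes the filter locally and sends across only the survivors,
        // which is where the bandwidth saving over a full shipment comes from.
        std::vector<Tuple> site_a = {{1, "a1"}, {3, "a3"}, {7, "a7"}, {9, "a9"}};
        std::vector<Tuple> to_ship;
        for (const Tuple& t : site_a)
            if (filter.may_contain(t.key)) to_ship.push_back(t);

        std::printf("shipping %zu of %zu tuples from site A\n", to_ship.size(), site_a.size());
        // Site B then performs the actual join on the shipped tuples, discarding any false positives.
    }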

We close this chapter with an overview of Mariposa, the most recent distributed database
research to be developed to a significant level of functionality. Mariposa’s main contribution in
query processing is to change the model for cost estimation during query optimization. Instead of
a unified, catalog-driven cost estimation module, Mariposa allows each site to declare its own
costs for various tasks. Mariposa’s approach introduces a computational marketplace, where sites
can declare their local costs for a query based not only on their estimates of resource
consumption, but also on runtime issues such as current system load, and economic issues such as
their reciprocal relationship with the query site, their relationship with competing sites that could
do the work, and so on. Architecturally, Mariposa’s changes to the R* optimizer are fairly
minimal – they simply add communication rounds to the cost estimation routines. More
suggestive is the way that this decoupling of cost estimation enables multiple independent parties
(e.g. different companies) to participate in federated query processing, wherein each party gets to
make autonomous decisions about their participation in any task. The Mariposa system was
commercialized as Cohera (later bought by PeopleSoft) and was demonstrated to work across
administrative domains in the field. But the flexibility and efficiency of its computational
economy ideas have yet to be significantly tested, and it is unclear whether corporate IT is ready
for significant investments in federated query processing. It is possible that we will see the ideas
from Mariposa re-emerge in the peer-to-peer space, where there is significant grassroots interest,
a few database-style query systems being proposed [HHL+03, NOTZ03, PMT03], and a number
of researchers interested in economic incentives for peer-to-peer (e.g. [Chu03]).
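A cartoon of that bidding interaction, with entirely invented names and numbers and nothing like Mariposa’s actual bid protocol: a broker asks each candidate site for a price on a query fragment, each site prices the work from purely local conditions, and the broker picks a winner instead of consulting a single catalog-driven cost model.

    #include <cstdio>
    #include <string>
    #include <vector>

    struct Bid { std::string site; double price; };   // a site's declared cost for a fragment

    // Each site prices work from purely local knowledge: its load, its charging
    // policy, its relationship with the requester, and so on.
    Bid price_fragment(const std::string& site, double base_cost, double load_factor, double markup) {
        return {site, base_cost * load_factor * markup};
    }

    int main() {
        // The broker gathers bids for one query fragment from candidate sites.
        std::vector<Bid> bids = {
            price_fragment("berkeley", 10.0, 1.2, 1.0),    // lightly loaded, friendly pricing
            price_fragment("san_diego", 10.0, 3.0, 1.0),   // busy during business hours
            price_fragment("davis", 10.0, 1.0, 2.5),       // idle but charges a premium
        };

        // A trivially greedy broker: take the cheapest bid. Real brokers would also
        // weigh delay bounds, budgets, and multi-fragment interactions.
        const Bid* winner = &bids[0];
        for (const Bid& b : bids) if (b.price < winner->price) winner = &b;
        std::printf("fragment awarded to %s at price %.2f\n", winner->site.c_str(), winner->price);
    }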

References

[ADH02] A. Ailamaki, D.J. DeWitt, and M.D. Hill. “Data Page Layouts for Relational Databases
on Deep Memory Hierarchies.” The VLDB Journal 11(3), 2002.

[BAC+90] H. Boral, W. Alexander, L. Clay, et al. “Prototyping Bubba, a Highly Parallel Database
System.” Transactions on Knowledge and Data Engineering 2(1), March 1990.

[BD83] Haran Boral and David J. DeWitt. “Database Machines: An Idea Whose Time Passed? A
Critique of the Future of Database Machines”. In Proc. International Workshop on Database
Machines (IWDM), pp 166-187, 1983

[BDT83] Dina Bitton and David J. DeWitt and Carolyn Turbyfill. “Benchmarking Database
Systems, a Systematic Approach”. In Proc. 9th International Conference on Very Large Data
Bases (VLDB), Florence, Italy, October, 1983.

[Bra84] Kjell Bratbergsengen. “Hashing Methods and Relational Algebra Operations”. In Proc.
10th International Conference on Very Large Data Bases (VLDB), Singapore, August 1984, pp.
323-333.

[CGM01] Shimin Chen, Phillip B. Gibbons, and Todd C. Mowry, “Improving Index
Performance through Prefetching”. In Proc. ACM SIGMOD International Conference on
Management of Data, 2001.

[Chu03] John Chuang, editor. First Workshop on Economics of Peer-to-Peer Systems.


Berkeley, California, June 5-6 2003.

[DGG+86] David J. DeWitt, Robert H. Gerber, Goetz Graefe, Michael L. Heytens, Krishna B.
Kumar, and M. Muralikrishna. GAMMA - A High Performance Dataflow Database Machine. In
Twelfth International Conference on Very Large Data Bases (VLDB), pages 228-237, Kyoto,
Japan, August 1986.

[DKO+84] David J. DeWitt, Randy H. Katz, Frank Olken, Leonard D. Shapiro, Michael R.
Stonebraker and David Wood. “Implementation Techniques for Main Memory Database
Systems”. In Proc. ACM-SIGMOD International Conference on Management of Data, Boston,
MA, June, 1984, pages 1-8.

[GBC93] G. Graefe, R. Bunker, and S. Cooper. “Hash joins and hash teams in Microsoft SQL
Server.” In Proceedings of 24th International Conference on Very Large Data Bases (VLDB),
August 24-27, 1998.

[GD93] G. Graefe, D.L. Davison. “Encapsulation of Parallelism and Architecture-Independence


in Extensible Database Query Execution” IEEE Transactions on Software Engineering 19(8)
749-764, August 1993.

[Gra93] G. Graefe. Query Evaluation Techniques for Large Databases. Computing Surveys 25
(2): 73-170 (1993).

[HHL+03] Ryan Huebsch, Joseph M. Hellerstein, Nick Lanham, Boon Thau Loo, Scott Shenker
and Ion Stoica. “Querying the Internet with PIER.” In Proceedings of 19th International
Conference on Very Large Databases (VLDB), Berlin, 2003.

[HN96] Joseph M. Hellerstein and Jeffrey F. Naughton. Query Execution Techniques for
Caching Expensive Methods. In Proc. ACM-SIGMOD International Conference on Management
of Data, June 1996, Montreal, pp. 423-424.

[HS91] Wei Hong and Michael Stonebraker. Optimization of Parallel Query Execution Plans in
XPRS. In Proceedings of the First International Conference on Parallel and Distributed
Information Systems (PDIS), pages 218-225, Miami Beach, FL, December 1991.

[IC91] Yannis E. Ioannidis and Stavros Christodoulakis. On the Propagation of Errors in the Size
of Join Results. In Proceedings of the ACM SIGMOD International Conference on Management
of Data, pages 268-277, Denver, CO, May 1991.

[IK84] Toshihide Ibaraki and Tiko Kameda. “Optimal Nesting for Computing N-relational
Joins.” ACM Transactions on Database Systems (TODS), 9(3) 482-502, October, 1984.

[KNT89] Masaru Kitsuregawa, Masaya Nakayama and Mikio Takagi. “The Effect of Bucket
Size Tuning in the Dynamic Hybrid GRACE Hash Join Method.” Proceedings of the Fifteenth
International Conference on Very Large Data Bases, August 22-25, 1989,
pp. 257-266.

[KTM83] Masaru Kitsuregawa, Hidehiko Tanaka and Tohru Moto-Oka. “Application of Hash to
Data Base Machine and Its Architecture”. New Generation Comput. 1(1): 63-74, 1983.

[MFPR90] Inderpal Singh Mumick, Sheldon J. Finkelstein, Hamid Pirahesh and Raghu
Ramakrishnan. “Magic is Relevant”. In Proceedings of the ACM SIGMOD International
Conference on Management of Data, pages 247-258, Atlantic City, NJ, May 1990.

[NKT88] Masaya Nakayama, Masaru Kitsuregawa and Mikio Takagi. “Hash-Partitioned Join
Method Using Dynamic Destaging Strategy”. In Proc. 14th International Conference on
Very Large Data Bases (VLDB). Los Angeles, CA, August-September 1988.

[NOTZ03] Wee Siong Ng, Beng Chin Ooi, Kian Lee Tan and AoYing Zhou. “PeerDB: A P2P-
based System for Distributed Data Sharing.” In Proc. 19th International Conference on Data
Engineering (ICDE), 2003.

[PCL93] H. Pang, M. Carey, and M. Livny. “Partially preemptible hash joins”. In Proc. ACM
SIGMOD International Conference on Management of Data, pp. 59-68, 1993.

[PHH92] Hamid Pirahesh, Joseph M. Hellerstein, and Waqar Hasan. Extensible/Rule- Based
Query Rewrite Optimization in Starburst. In Proc. ACM-SIGMOD International Conference on
Management of Data, pages 39-48, San Diego, June 1992.

[PMT03] Vassilis Papadimos, David Maier and Kristin Tufte. “Distributed Query Processing and
Catalogs for Peer-to-Peer Systems.” In Proc. First Biennial Conference on Innovative Data
Systems Research (CIDR), Asilomar, CA, January 5-8, 2003.

[RR00] Jun Rao and Kenneth Ross. “Making B+-trees Cache Conscious in Main Memory.” In
Proc. of ACM SIGMOD International Conference on Management of Data, 2000, pp. 475-486.

[SPL96] Praveen Seshadri, Hamid Pirahesh, and T.Y. Cliff Leung. Complex Query
Decorrelation. In Proc. 12th IEEE International Conference on Data Engineering (ICDE), New
Orleans, February 1996.
The VLDB Journal (1996) 5: 48–63
© Springer-Verlag 1996

Mariposa: a wide-area distributed database system


Michael Stonebraker, Paul M. Aoki, Witold Litwin1 , Avi Pfeffer2 , Adam Sah, Jeff Sidell, Carl Staelin3 , Andrew Yu4
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720-1776, USA

Edited by Henry F. Korth and Amit Sheth. Received November 1994 / Revised June 1995 / Accepted September 14, 1995

1 Present address: Université Paris IX Dauphine, Section MIAGE, Place de Lattre de Tassigny, 75775 Paris Cedex 16, France
2 Present address: Department of Computer Science, Stanford University, Stanford, CA 94305, USA
3 Present address: Hewlett-Packard Laboratories, M/S 1U-13, P.O. Box 10490, Palo Alto, CA 94303, USA
4 Present address: Illustra Information Technologies, Inc., 1111 Broadway, Suite 2000, Oakland, CA 94607, USA
e-mail: mariposa@postgres.Berkeley.edu
Correspondence to: M. Stonebraker

Abstract. The requirements of wide-area distributed database systems differ dramatically from those of local-area network systems. In a wide-area network (WAN) configuration, individual sites usually report to different system administrators, have different access and charging algorithms, install site-specific data type extensions, and have different constraints on servicing remote requests. Typical of the last point are production transaction environments, which are fully engaged during normal business hours, and cannot take on additional load. Finally, there may be many sites participating in a WAN distributed DBMS.

In this world, a single program performing global query optimization using a cost-based optimizer will not work well. Cost-based optimization does not respond well to site-specific type extension, access constraints, charging algorithms, and time-of-day constraints. Furthermore, traditional cost-based distributed optimizers do not scale well to a large number of possible processing sites. Since traditional distributed DBMSs have all used cost-based optimizers, they are not appropriate in a WAN environment, and a new architecture is required.

We have proposed and implemented an economic paradigm as the solution to these issues in a new distributed DBMS called Mariposa. In this paper, we present the architecture and implementation of Mariposa and discuss early feedback on its operating characteristics.

Key words: Databases – Distributed systems – Economic site – Autonomy – Wide-area network – Name service

1 Introduction

The Mariposa distributed database system addresses a fundamental problem in the standard approach to distributed data management. We argue that the underlying assumptions traditionally made while implementing distributed data managers do not apply to today's wide-area network (WAN) environments. We present a set of guiding principles that must apply to a system designed for modern WAN environments. We then demonstrate that existing architectures cannot adhere to these principles because of the invalid assumptions just mentioned. Finally, we show how Mariposa can successfully apply the principles through its adoption of an entirely different paradigm for query and storage optimization.

Traditional distributed relational database systems that offer location-transparent query languages, such as Distributed INGRES (Stonebraker 1986), R* (Williams et al. 1981), SIRIUS (Litwin 1982) and SDD-1 (Bernstein 1981), all make a collection of underlying assumptions. These assumptions include:

– Static data allocation: In a traditional distributed DBMS, there is no mechanism whereby objects can quickly and easily change sites to reflect changing access patterns. Moving an object from one site to another is done manually by a database administrator, and all secondary access paths to the data are lost in the process. Hence, object movement is a very "heavyweight" operation and should not be done frequently.
– Single administrative structure: Traditional distributed database systems have assumed a query optimizer which decomposes a query into "pieces" and then decides where to execute each of these pieces. As a result, site selection for query fragments is done by the optimizer. Hence, there is no mechanism in traditional systems for a site to refuse to execute a query, for example because it is overloaded or otherwise indisposed. Such "good neighbor" assumptions are only valid if all machines in the distributed system are controlled by the same administration.
– Uniformity: Traditional distributed query optimizers generally assume that all processors and network connections are the same speed. Moreover, the optimizer assumes that any join can be done at any site, e.g., all sites have ample disk
space to store intermediate results. They further assume that every site has the same collection of data types, functions and operators, so that any subquery can be performed at any site.

These assumptions are often plausible in local-area network (LAN) environments. In LAN worlds, environment uniformity and a single administrative structure are common. Moreover, a high-speed, reasonably uniform interconnect tends to mask performance problems caused by suboptimal data allocation.

In a WAN environment, these assumptions are much less plausible. For example, the Sequoia 2000 project (Stonebraker 1991) spans six sites around the state of California with a wide variety of hardware and storage capacities. Each site has its own database administrator, and the willingness of any site to perform work on behalf of users at another site varies widely. Furthermore, network connectivity is not uniform. Lastly, type extension often is available only on selected machines, because of licensing restrictions on proprietary software or because the type extension uses the unique features of a particular hardware architecture. As a result, traditional distributed DBMSs do not work well in the non-uniform, multi-administrator WAN environments of which Sequoia 2000 is typical. We expect an explosion of configurations like Sequoia 2000 as multiple companies coordinate tasks, such as distributed manufacturing, or share data in sophisticated ways, for example through a yet-to-be-built query optimizer for the World Wide Web.

As a result, the goal of the Mariposa project is to design a WAN distributed DBMS. Specifically, we are guided by the following principles, which we assert are requirements for non-uniform, multi-administrator WAN environments:

– Scalability to a large number of cooperating sites: In a WAN environment, there may be a large number of sites which wish to share data. A distributed DBMS should not contain assumptions that will limit its ability to scale to 1000 sites or more.
– Data mobility: It should be easy and efficient to change the "home" of an object. Preferably, the object should remain available during movement.
– No global synchronization: Schema changes should not force a site to synchronize with all other sites. Otherwise, some operations will have exceptionally poor response time.
– Total local autonomy: Each site must have complete control over its own resources. This includes what objects to store and what queries to run. Query allocation cannot be done by a central, authoritarian query optimizer.
– Easily configurable policies: It should be easy for a local database administrator to change the behavior of a Mariposa site.

Traditional distributed DBMSs do not meet these requirements. Use of an authoritarian, centralized query optimizer does not scale well; the high cost of moving an object between sites restricts data mobility, schema changes typically require global synchronization, and centralized management designs inhibit local autonomy and flexible policy configuration.

One could claim that these are implementation issues, but we argue that traditional distributed DBMSs cannot meet the requirements defined above for fundamental architectural reasons. For example, any distributed DBMS must address distributed query optimization and placement of DBMS objects. However, if sites can refuse to process subqueries, then it is difficult to perform cost-based global optimization. In addition, cost-based global optimization is "brittle" in that it does not scale well to a large number of participating sites. As another example, consider the requirement that objects must be able to move freely between sites. Movement is complicated by the fact that the sending site and receiving site have total local autonomy. Hence the sender can refuse to relinquish the object, and the recipient can refuse to accept it. As a result, allocation of objects to sites cannot be done by a central database administrator.

Because of these inherent problems, the Mariposa design rejects the conventional distributed DBMS architecture in favor of one that supports a microeconomic paradigm for query and storage optimization. All distributed DBMS issues (multiple copies of objects, naming service, etc.) are reformulated in microeconomic terms. Briefly, implementation of an economic paradigm requires a number of entities and mechanisms. All Mariposa clients and servers have an account with a network bank. A user allocates a budget in the currency of this bank to each query. The goal of the query processing system is to solve the query within the allotted budget by contracting with various Mariposa processing sites to perform portions of the query. Each query is administered by a broker, which obtains bids for pieces of a query from various sites. The remainder of this section shows how use of these economic entities and mechanisms allows Mariposa to meet the requirements set out above.

The implementation of the economic infrastructure supports a large number of sites. For example, instead of using centralized metadata to determine where to run a query, the broker makes use of a distributed advertising service to find sites that might want to bid on portions of the query. Moreover, the broker is specifically designed to cope successfully with very large Mariposa networks. Similarly, a server can join a Mariposa system at any time by buying objects from other sites, advertising its services and then bidding on queries. It can leave Mariposa by selling its objects and ceasing to bid. As a result, we can achieve a highly scalable system using our economic paradigm.

Each Mariposa site makes storage decisions to buy and sell fragments, based on optimizing the revenue it expects to collect. Mariposa objects have no notion of a home, merely that of a current owner. The current owner may change rapidly as objects are moved. Object movement preserves all secondary indexes, and is coded to offer as high performance as possible. Consequently, Mariposa fosters data mobility and the free trade of objects.

Avoidance of global synchronization is simplified in many places by an economic paradigm. Replication is one such area. The details of the Mariposa replication system are contained in a separate paper (Sidell 1995). In short, copy holders maintain the currency of their copies by contracting with other copy holders to deliver their updates. This contract specifies a payment stream for update information delivered within a specified time bound. Each site then runs a "zippering" system to merge update streams in a consistent way. As a result, copy holders serve data which is out of
date by varying degrees. Query processing on these divergent copies is resolved using the bidding process. Metadata management is another, related area that benefits from economic processes. Parsing an incoming query requires Mariposa to interact with one or more name services to identify relevant metadata about objects referenced in a query, including their location. The copy mechanism described above is designed so that name servers are just like other servers of replicated data. The name servers contract with other Mariposa sites to receive updates to the system catalogs. As a result of this architecture, schema changes do not entail any synchronization; rather, such changes are "percolated" to name services asynchronously.

Since each Mariposa site is free to bid on any business of interest, it has total local autonomy. Each site is expected to maximize its individual profit per unit of operating time and to bid on those queries that it feels will accomplish this goal. Of course, the net effect of this freedom is that some queries may not be solvable, either because nobody will bid on them or because the aggregate of the minimum bids exceeds what the client is willing to pay. In addition, a site can buy and sell objects at will. It can refuse to give up objects, or it may not find buyers for an object it does not want.

Finally, Mariposa provides powerful mechanisms for specifying the behavior of each site. Sites must decide which objects to buy and sell and which queries to bid on. Each site has a bidder and a storage manager that make these decisions. However, as conditions change over time, policy decisions must also change. Although the bidder and storage manager modules may be coded in any language desired, Mariposa provides a low level, very efficient embedded scripting language and rule system called Rush (Sah et al. 1994). Using Rush, it is straightforward to change policy decisions; one simply modifies the rules by which these modules are implemented.

The purpose of this paper is to report on the architecture, implementation, and operation of our current prototype. Preliminary discussions of Mariposa ideas have been previously reported (Stonebraker et al. 1994a, 1994b). At this time (June 1995), we have a complete optimization and execution system running, and we will present performance results of some initial experiments.

In Sect. 2, we present the three major components of our economic system. Section 3 describes the bidding process by which a broker contracts for service with processing sites, the mechanisms that make the bidding process efficient, and the methods by which network utilization is integrated into the economic model. Section 4 describes Mariposa storage management. Section 5 describes naming and name service in Mariposa. Section 6 presents some initial experiments using the Mariposa prototype. Section 7 discusses previous applications of the economic model in computing. Finally, Sect. 8 summarizes the work completed to date and the future directions of the project.

2 Architecture

Mariposa supports transparent fragmentation of tables across sites. That is, Mariposa clients submit queries in a dialect of SQL3; each table referenced in the FROM clause of a query could potentially be decomposed into a collection of table fragments. Fragments can obey range- or hash-based distribution criteria which logically partition the table. Alternately, fragments can be unstructured, in which case records are allocated to any convenient fragment.

Mariposa provides a variety of fragment operations. Fragments are the units of storage that are bought and sold by sites. In addition, the total number of fragments in a table can be changed dynamically, perhaps quite rapidly. The current owner of a fragment can split it into two storage fragments whenever it is deemed desirable. Conversely, the owner of two fragments of a table can coalesce them into a single fragment at any time.
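To make the fragment abstraction concrete, the sketch below models a range-partitioned table whose fragments can be split and coalesced by their current owner, as just described. All of the names in it (Fragment, split, coalesce, the key bounds) are invented for the illustration and are not Mariposa data structures.

    # Illustrative sketch only: a range-partitioned table whose fragments can be
    # split or coalesced by their current owner. All names are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class Fragment:
        table: str
        low_key: int      # inclusive lower bound of the range criterion
        high_key: int     # exclusive upper bound
        owner: str        # current owner site; fragments have no fixed "home"

    def split(frag, at_key):
        """The owner splits one storage fragment into two at a key boundary."""
        assert frag.low_key < at_key < frag.high_key
        return (Fragment(frag.table, frag.low_key, at_key, frag.owner),
                Fragment(frag.table, at_key, frag.high_key, frag.owner))

    def coalesce(f1, f2):
        """The owner of two adjacent fragments of the same table merges them."""
        assert f1.table == f2.table and f1.owner == f2.owner
        assert f1.high_key == f2.low_key
        return Fragment(f1.table, f1.low_key, f2.high_key, f1.owner)

    emp_low, emp_high = split(Fragment("EMP", 0, 100000, "site-1"), 50000)
    whole = coalesce(emp_low, emp_high)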
To process queries on fragmented tables and support buying, selling, splitting, and coalescing fragments, Mariposa is divided into three kinds of modules as noted in Fig. 1. There is a client program which issues queries, complete with bidding instructions, to the Mariposa system. In turn, Mariposa contains a middleware layer and a local execution component. The middleware layer contains several query preparation modules, and a query broker. Lastly, local execution is composed of a bidder, a storage manager, and a local execution engine.

Fig. 1. Mariposa architecture (the figure shows a Client Application above a Middleware Layer containing the SQL Parser, Single-Site Optimizer, Query Fragmenter, Broker and Coordinator, which in turn sits above a Local Execution Component containing the Bidder, Executor and Storage Manager)

In addition, the broker, bidder and storage manager can be tailored at each site. We have provided a high performance rule system, Rush, in which we have coded initial Mariposa implementations of these modules. We expect site administrators to tailor the behavior of our implementations by altering the rules present at a site. Lastly, there is a low-level utility layer that implements essential Mariposa primitives for communication between sites. The various modules are shown in Fig. 1. Notice that the client module can run anywhere in a Mariposa network. It communicates with a middleware process running at the same or a different site. In turn, Mariposa middleware communicates with local execution systems at various sites.

This section describes the role that each module plays in the Mariposa economy. In the process of describing the modules, we also give an overview of how query processing
works in an economic framework. Section 3 will explain this process in more detail.

Queries are submitted by the client application. Each query starts with a budget B(t) expressed as a bid curve. The budget indicates how much the user is willing to pay to have the query executed within time t. Query budgets form the basis of the Mariposa economy. Figure 2 includes a bid curve indicating that the user is willing to sacrifice performance for a lower price. Once a budget has been assigned (through administrative means not discussed here), the client software hands the query to Mariposa middleware. Mariposa middleware contains an SQL parser, single-site optimizer, query fragmenter, broker, and coordinator module. The broker is primarily coded in Rush. Each of these modules is described below. The communication between modules is shown in Fig. 2.

Fig. 2. Mariposa communication (the figure traces the query "select * from EMP" and its bid curve from the Client Application through the SQL Parser, Single-Site Optimizer, Query Fragmenter and Broker in the Middleware Layer, showing the parse tree, single-site plan tree, fragmented plan over SS(EMP1), SS(EMP2) and SS(EMP3), the request for bid, the bid ($$$, DELAY) and the bid acceptance, and on to the Bidder and Executor in the Local Execution Component, with the answer returned through the Coordinator)

The parser parses the incoming query, performing name resolution and authorization. The parser first requests metadata for each table referenced in the query from some name server. This metadata contains information including the name and type of each attribute in the table, the location of each fragment of the table, and an indicator of the staleness of the information. Metadata is itself part of the economy and has a price. The choice of name server is determined by the desired quality of metadata, the prices offered by the name servers, the available budget, and any local Rush rules defined to prioritize these factors. The parser hands the query, in the form of a parse tree, to the single-site optimizer. This is a conventional query optimizer along the lines of Selinger et al. (1979). The single-site optimizer generates a single-site query execution plan. The optimizer ignores data distribution and prepares a plan as if all the fragments were located at a single server site.

The fragmenter accepts the plan produced by the single-site optimizer. It uses location information previously obtained from the name server, to decompose the single site plan into a fragmented query plan. The fragmenter decomposes each restriction node in the single site plan into subqueries, one per fragment in the referenced table. Joins are decomposed into one join subquery for each pair of fragment joins. Lastly, the fragmenter groups the operations that can proceed in parallel into query strides. All subqueries in
a stride must be completed before any subqueries in the next stride can begin. As a result, strides form the basis for intra-query synchronization. Notice that our notion of strides does not support pipelining the result of one subquery into the execution of a subsequent subquery. This complication would introduce sequentiality within a query stride and complicate the bidding process to be described. Inclusion of pipelining into our economic system is a task for future research.
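As a rough illustration of the fragmenter and of query strides, the following sketch decomposes a restriction over each referenced table into one subquery per fragment and a two-table join into one subquery per pair of fragments, then groups the subqueries into strides. The plan and fragment representations are invented for the example and are not the Mariposa plan format.

    from itertools import product

    def fragment_join_of_scans(left_table, right_table, fragments):
        """Decompose 'restrict each table, then join' into two query strides."""
        # Stride 1: one restriction subquery per fragment; these run in parallel.
        scan_stride = ([("restrict", f) for f in fragments[left_table]] +
                       [("restrict", f) for f in fragments[right_table]])
        # Stride 2: one join subquery for each pair of fragments of the two tables.
        join_stride = [("join", lf, rf)
                       for lf, rf in product(fragments[left_table], fragments[right_table])]
        # All subqueries in a stride must finish before the next stride starts.
        return [scan_stride, join_stride]

    fragments = {"EMP": ["EMP1", "EMP2", "EMP3"], "DEPT": ["DEPT1"]}
    strides = fragment_join_of_scans("EMP", "DEPT", fragments)
    # strides[0] holds 4 scan subqueries; strides[1] holds 3 join subqueries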
The broker takes the collection of fragmented query plans prepared by the fragmenter and sends out requests for bids to various sites. After assembling a collection of bids, the broker decides which ones to accept and notifies the winning sites by sending out a bid acceptance. The bidding process will be described in more detail in Sect. 3.

The broker hands off the task of coordinating the execution of the resulting query strides to a coordinator. The coordinator assembles the partial results and returns the final answer to the user process.

At each Mariposa server site there is a local execution module containing a bidder, a storage manager, and a local execution engine. The bidder responds to requests for bids and formulates its bid price and the speed with which the site will agree to process a subquery based on local resources such as CPU time, disk I/O bandwidth, storage, etc. If the bidder site does not have the data fragments specified in the subquery, it may refuse to bid or it may attempt to buy the data from another site by contacting its storage manager. Winning bids must sooner or later be processed. To execute local queries, a Mariposa site contains a number of local execution engines. An idle one is allocated to each incoming subquery to perform the task at hand. The number of executors controls the multiprocessing level at each site, and may be adjusted as conditions warrant. The local executor sends the results of the subquery to the site executing the next part of the query or back to the coordinator process. At each Mariposa site there is also a storage manager, which watches the revenue stream generated by stored fragments. Based on space and revenue considerations, it engages in buying and selling fragments with storage managers at other Mariposa sites.

The storage managers, bidders and brokers in our prototype are primarily coded in the rule language Rush. Rush is an embeddable programming language with syntax similar to Tcl (Ousterhout 1994) that also includes rules of the form:

on <condition> do <action>

Every Mariposa entity embeds a Rush interpreter, calling it to execute code to determine the behavior of Mariposa.

Rush conditions can involve any combination of primitive Mariposa events, described below, and computations on Rush variables. Actions in Rush can trigger Mariposa primitives and modify Rush variables. As a result, Rush can be thought of as a fairly conventional forward-chaining rule system. We chose to implement our own system, rather than use one of the packages available from the AI community, primarily for performance reasons. Rush rules are in the "inner loop" of many Mariposa activities, and as a result, rule interpretation must be very fast. A separate paper (Sah and Blow 1994) discusses how we have achieved this goal.
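Rush itself is an embedded, Tcl-like language, and we do not attempt to reproduce its syntax. The Python sketch below only mimics the "on <condition> do <action>" structure to suggest how a site policy, here a bidder that answers bid requests only when its load is low, could be written as forward-chaining rules; every event and field name in it is hypothetical.

    # Forward-chaining rules in the spirit of "on <condition> do <action>".
    # This is not Rush; the events, fields and thresholds are made up.
    rules = []

    def on(condition):
        def register(action):
            rules.append((condition, action))
            return action
        return register

    state = {"load_average": 0.4, "outgoing": []}

    @on(lambda ev, st: ev["type"] == "receive bid request" and st["load_average"] < 2.0)
    def make_bid(ev, st):
        st["outgoing"].append({"type": "bid", "query": ev["query"], "price": 10.0})

    @on(lambda ev, st: ev["type"] == "receive bid request" and st["load_average"] >= 2.0)
    def decline(ev, st):
        pass  # an overloaded site simply declines to bid

    def dispatch(event, st):
        for condition, action in rules:
            if condition(event, st):
                action(event, st)

    dispatch({"type": "receive bid request", "query": "select * from EMP1"}, state)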
Mariposa contains a specific inter-site protocol by which Mariposa entities communicate. Requests for bids to execute subqueries and to buy and sell fragments can be sent between sites. Additionally, queries and data must be passed around. The main messages are indicated in Table 1. Typically, the outgoing message is the action part of a Rush rule, and the corresponding incoming message is a Rush event at the recipient site.

Table 1. The main Mariposa primitives

  Actions (messages)    Events (received messages)
  Request bid           Receive bid request
  Bid                   Receive bid
  Award contract        Contract won
  Notify loser          Contract lost
  Send query            Receive query
  Send data             Receive data

3 The bidding process

Each query Q has a budget B(t) that can be used to solve the query. The budget is a non-increasing function of time that represents the value the user gives to the answer to his query at a particular time t. Constant functions represent a willingness to pay the same amount of money for a slow answer as for a quick one, while steeply declining functions indicate that the user will pay more for a fast answer.

The broker handling a query Q receives a query plan containing a collection of subqueries, Q1, . . ., Qn, and B(t). Each subquery is a one-variable restriction on a fragment F of a table, or a join between two fragments of two tables. The broker tries to solve each subquery, Qi, using either an expensive bid protocol or a cheaper purchase order protocol.

The expensive bid protocol involves two phases: in the first phase, the broker sends out requests for bids to bidder sites. A bid request includes the portion of the query execution plan being bid on. The bidders send back bids that are represented as triples: (Ci, Di, Ei). The triple indicates that the bidder will solve the subquery Qi for a cost Ci within a delay Di after receipt of the subquery, and that this bid is only valid until the expiration date, Ei.
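For concreteness, one possible representation of a bid curve B(t) and of the (Ci, Di, Ei) bid triples is sketched below, together with the check that a bid fits under the curve. The linear budget and the particular numbers are made up for the example.

    from collections import namedtuple

    # Bid triple: cost C, promised delay D, expiration date E, as described above.
    Bid = namedtuple("Bid", ["cost", "delay", "expires"])

    def make_budget(b0, slope):
        """A non-increasing budget B(t): pay up to b0 for an instant answer,
        less as the answer is delayed, never below zero."""
        return lambda t: max(0.0, b0 - slope * t)

    B = make_budget(b0=100.0, slope=2.0)   # slope=0 would mean delay-insensitive

    bid = Bid(cost=55.0, delay=20.0, expires=300.0)
    acceptable = bid.cost <= B(bid.delay)  # True here: B(20) = 60 >= 55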
In the second phase of the bid protocol, the broker notifies the winning bidders that they have been selected. The broker may also notify the losing sites. If it does not, then the bids will expire and can be deleted by the bidders. This process requires many (expensive) messages. Most queries will not be computationally demanding enough to justify this level of overhead. These queries will use the simpler purchase order protocol.

The purchase order protocol sends each subquery to the processing site that would be most likely to win the bidding process if there were one; for example, one of the storage sites of a fragment for a sequential scan. This site receives the query and processes it, returning the answer with a bill for services. If the site refuses the subquery, it can either return it to the broker or pass it on to a third processing site. If a broker uses the cheaper purchase order protocol, there is some danger of failing to solve the query within the allotted budget. The broker does not always know the cost and delay which will be charged by the chosen processing
site. However, this is the risk that must be taken to use this faster protocol.

3.1 Bid acceptance

All subqueries in each stride are processed in parallel, and the next stride cannot begin until the previous one has been completed. Rather than consider bids for individual subqueries, we consider collections of bids for the subqueries in each stride.

When using the bidding protocol, brokers must choose a winning bid for each subquery with aggregate cost C and aggregate delay D such that the aggregate cost is less than or equal to the cost requirement B(D). There are two problems that make finding the best bid collection difficult: subquery parallelism and the combinatorial search space. The aggregate delay is not the sum of the delays Di for each subquery Qi, since there is parallelism within each stride of the query plan. Also, the number of possible bid collections grows exponentially with the number of strides in the query plan. For example, if there are ten strides and three viable bids for each one, then the broker can evaluate each of the 3^10 bid possibilities.

The estimated delay to process the collection of subqueries in a stride is equal to the highest bid time in the collection. The number of different delay values can be no more than the total number of bids on subqueries in the collection. For each delay value, the optimal bid collection is the least expensive bid for each subquery that can be processed within the given delay. By coalescing the bid collections in a stride and considering them as a single (aggregate) bid, the broker may reduce the bid acceptance problem to the simpler problem of choosing one bid from among a set of aggregated bids for each query stride.

With the expensive bid protocol, the broker receives a collection of zero or more bids for each subquery. If there is no bid for some subquery, or no collection of bids meets the client's minimum price and performance requirements (B(D)), then the broker must solicit additional bids, agree to perform the subquery itself, or notify the user that the query cannot be run. It is possible that several collections of bids meet the minimum requirements, so the broker must choose the best collection of bids. In order to compare the bid collections, we define a difference function on the collection of bids: difference = B(D) − C. Note that this can have a negative value, if the cost is above the bid curve.

For all but the simplest queries referencing tables with a minimal number of fragments, exhaustive search for the best bid collection will be combinatorially prohibitive. The crux of the problem is in determining the relative amounts of the time and cost resources that should be allocated to each subquery. We offer a heuristic algorithm that determines how to do this. Although it cannot be shown to be optimal, we believe in practice it will demonstrate good results. Preliminary performance numbers for Mariposa are included later in this paper which support this supposition. A more detailed evaluation and comparison against more complex algorithms is planned in the future.

The algorithm is a "greedy" one. It produces a trial solution in which the total delay is the smallest possible, and then makes the greediest substitution until there are no more profitable ones to make. Thus a series of solutions are proposed with steadily increasing delay values for each processing step. On any iteration of the algorithm, the proposed solution contains a collection of bids with a certain delay for each processing step. For every collection of bids with greater delay a cost gradient is computed. This cost gradient is the cost decrease that would result for the processing step by replacing the collection in the solution by the collection being considered, divided by the time increase that would result from the substitution.

The algorithm begins by considering the bid collection with the smallest delay for each processing step and computing the total cost C and the total delay D. Compute the cost gradient for each unused bid. Now, consider the processing step that contains the unused bid with the maximum cost gradient, B'. If this bid replaces the current one used in the processing step, then cost will become C' and delay D'. If the resulting difference is greater at D' than at D, then make the bid substitution. That is, if B(D') − C' > B(D) − C, then replace B with B'. Recalculate all the cost gradients for the processing step that includes B', and continue making substitutions until there are none that increase the difference.
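The greedy procedure can be restated compactly as follows, under two simplifying assumptions made only for this sketch: each stride has already been reduced to a set of aggregated (cost, delay) alternatives, and the aggregate delay of a solution is taken to be the sum of its stride delays, since strides run sequentially. The stopping rule is one reading of the description above.

    def choose_bids(strides, B):
        """Greedy bid acceptance. strides: one list of (cost, delay) aggregated
        alternatives per stride; B: the budget curve B(t)."""
        # Trial solution: the smallest-delay alternative in every stride.
        chosen = [min(alts, key=lambda a: a[1]) for alts in strides]

        def totals(selection):
            # Assumption: strides run one after another, so their delays add up.
            return (sum(c for c, d in selection), sum(d for c, d in selection))

        while True:
            C, D = totals(chosen)
            best, best_gradient = None, 0.0
            for i, alts in enumerate(strides):
                c0, d0 = chosen[i]
                for c, d in alts:
                    if d > d0 and c < c0:
                        gradient = (c0 - c) / (d - d0)  # cost decrease per unit of extra delay
                        if gradient > best_gradient:
                            best, best_gradient = (i, (c, d)), gradient
            if best is None:
                return chosen
            i, alternative = best
            trial = list(chosen)
            trial[i] = alternative
            C2, D2 = totals(trial)
            if B(D2) - C2 > B(D) - C:   # substitute only if the difference improves
                chosen = trial
            else:
                return chosen

    B = lambda t: max(0.0, 100.0 - t)
    strides = [[(60.0, 10.0), (40.0, 25.0)], [(30.0, 5.0), (18.0, 20.0)]]
    print(choose_bids(strides, B))   # -> [(40.0, 25.0), (30.0, 5.0)]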
Notice that our current Mariposa algorithm decomposes the query into executable pieces, and then the broker tries to solve the individual pieces in a heuristically optimal way. We are planning to extend Mariposa to contain a second bidding strategy. Using this strategy, the single-site optimizer and the fragmenter would be bypassed. Instead, the broker would get the entire query directly. It would then decide whether to decompose it into a collection of two or more "hunks" using heuristics yet to be developed. Then, it would try to find contractors for the hunks, each of which could freely subdivide the hunks and subcontract them. In contrast to our current query processing system which is a "bottom up" algorithm, this alternative would be a "top down" decomposition strategy. We hope to implement this alternative and test it against our current system.

3.2 Finding bidders

Using either the expensive bid or the purchase order protocol from the previous section, a broker must be able to identify one or more sites to process each subquery. Mariposa achieves this through an advertising system. Servers announce their willingness to perform various services by posting advertisements. Name servers keep a record of these advertisements in an Ad Table. Brokers examine the Ad Table to find out which servers might be willing to perform the tasks they need. Table 2 shows the fields of the Ad Table. In practice, not all these fields will be used in each advertisement. The most general advertisements will specify the fewest number of fields. Table 3 summarizes the valid fields for some types of advertisement.

Using yellow pages, a server advertises that it offers a specific service (e.g., processing queries that reference a specific fragment). The date of the advertisement helps a broker decide how timely the yellow pages entry is, and therefore how much faith to put in the information. A server can issue a new yellow pages advertisement at any time without
explicitly revoking a previous one. In addition, a server may indicate the price and delay of a service. This is a posted price and becomes current on the start-date indicated. There is no guarantee that the price will hold beyond that time and, as with yellow pages, the server may issue a new posted price without revoking the old one.

Table 2. Fields in the Ad Table

  query-template   A description of the service being offered. The query template is a query
                   with parameters left unspecified. For example,
                       SELECT param-1 FROM EMP
                   indicates a willingness to perform any SELECT query on the EMP table, while
                       SELECT param-1 FROM EMP WHERE NAME = param-2
                   indicates that the server wants to perform queries that perform an equality
                   restriction on the NAME column.
  server-id        The server offering the service.
  start-time       The time at which the service is first offered. This may be a future time, if
                   the server expects to begin performing certain tasks at a specific point in time.
  expiration-time  The time at which the advertisement ceases to be valid.
  price            The price charged by the server for the service.
  delay            The time in which the server expects to complete the task.
  limit-quantity   The maximum number of times the server will perform a service at the given
                   cost and delay.
  bulk-quantity    The number of orders needed to obtain the advertised price and delay.
  to-whom          The set of brokers to whom the advertised services are available.
  other-fields     Comments and other information specific to a particular advertisement.

Several more specific types of advertisements are available. If the expiration-date field is set, then the details of the offer are known to be valid for a certain period of time. Posting a sale price in this manner involves some risk, as the advertisement may generate more demand than the server can meet, forcing it to pay heavy penalties. This risk can be offset by issuing coupons, which, like supermarket coupons, place a limit on the number of queries that can be executed under the terms of the advertisement. Coupons may also limit the brokers who are eligible to redeem them. These are similar to the coupons issued by the Nevada gambling establishments, which require the client to be over 21 years of age and possess a valid California driver's license.

Finally, bulk purchase contracts are renewable coupons that allow a broker to negotiate cheaper prices with a server in exchange for guaranteed, pre-paid service. This is analogous to a travel agent who books ten seats on each sailing of a cruise ship. We allow the option of guaranteeing bulk purchases, in which case the broker must pay for the specified queries whether it uses them or not. Bulk purchases are especially advantageous in transaction processing environments, where the workload is predictable, and brokers solve large numbers of similar queries.

Besides referring to the Ad Table, we expect a broker to remember sites that have bid successfully for previous queries. Presumably the broker will include such sites in the bidding process, thereby generating a system that learns over time which processing sites are appropriate for various queries. Lastly, the broker also knows the likely location of each fragment, which was returned previously to the query preparation module by the name server. The site most likely to have the data is automatically a likely bidder.

3.3 Setting the bid price for subqueries

When a site is asked to bid on a subquery, it must respond with a triple (C, D, E) as noted earlier. This section discusses our current bidder module and some of the extensions that we expect to make. As noted earlier, it is coded primarily as Rush rules and can be changed easily.

The naive strategy is to maintain a billing rate for CPU and I/O resources for each site. These constants are to be set by a site administrator based on local conditions. The bidder constructs an estimate of the amount of each resource required to process a subquery for objects that exist at the local site. A simple computation then yields the required bid. If the referenced object is not present at the site, then the site declines to bid. For join queries, the site declines to bid unless one of the following two conditions are satisfied:

– It possesses one of the two referenced objects.
– It had already bid on a query, whose answer formed one of the two referenced objects.

The time in which the site promises to process the query is calculated with an estimate of the resources required. Under zero load, it is an estimate of the elapsed time to perform the query. By adjusting for the current load on the site, the bidder can estimate the expected delay. Finally, it multiplies by a site-specific safety factor to arrive at a promised delay (the D in the bid). The expiration date on a bid is currently assigned arbitrarily as the promised delay plus a site-specific constant.

This naive strategy is consistent with the behavior assumed of a local site by a traditional global query optimizer. However, our current prototype improves on the naive strategy in three ways. First, each site maintains a billing rate on a per-fragment basis. In this way, the site administrator can bias his bids toward fragments whose business he wants and away from those whose business he does not want. The bidder also automatically declines to bid on queries referencing fragments with billing rates below a site-specific threshold. In this case, the query will have to be processed elsewhere, and another site will have to buy or copy the indicated fragment in order to solve the user query. Hence, this tactic will hasten the sale of low value fragments to somebody else. Our second improvement concerns adjusting bids based on the current site load. Specifically, each site maintains its current load average by periodically running a UNIX utility. It then adjusts its bid, based on its current load average as follows:

actual bid = computed bid × load average

In this way, if it is nearly idle (i.e., its load average is near zero), it will bid very low prices. Conversely, it will bid higher and higher prices as its load increases. Notice that this simple formula will ensure a crude form of load balancing
among a collection of Mariposa sites. Our third improvement concerns bidding on subqueries when the site does not possess any of the data. As will be seen in the next section, the storage manager buys and sells fragments to try to maximize site revenue. In addition, it keeps a hot list of fragments it would like to acquire but has not yet done so. The bidder automatically bids on any query which references a hot list fragment. In this way, if it gets a contract for the query, it will instruct the storage manager to accelerate the purchase of the fragment, which is in line with the goals of the storage manager.
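A minimal sketch of the second improvement follows: a naive per-resource bid scaled by the site's current load average. The resource estimates and billing rates are invented; on a UNIX system the load average itself can be read with os.getloadavg().

    import os

    def computed_bid(cpu_seconds, disk_ios, cpu_rate, io_rate):
        """Naive per-resource billing, as in the naive strategy above."""
        return cpu_seconds * cpu_rate + disk_ios * io_rate

    def actual_bid(bid):
        """actual bid = computed bid x load average (crude load balancing)."""
        try:
            load_average = os.getloadavg()[0]   # 1-minute load average
        except OSError:
            load_average = 1.0                  # fall back when unavailable
        return bid * load_average

    price = actual_bid(computed_bid(cpu_seconds=2.5, disk_ios=400,
                                    cpu_rate=0.01, io_rate=0.002))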
In the future we expect to increase the sophistication of the bidder substantially. We plan more sophisticated integration between the bidder and the storage manager. We view hot lists as merely the first primitive step in this direction. Furthermore, we expect to adjust the billing rate for each fragment automatically, based on the amount of business for the fragment. Finally, we hope to increase the sophistication of our choice of expiration dates. Choosing an expiration date far in the future incurs the risk of honoring lower out-of-date prices. Specifying an expiration date that is too close means running the risk of the broker not being able to use the bid because of inherent delays in the processing engine. Lastly, we expect to consider network resources in the bidding process. Our proposed algorithms are discussed in the next subsection.

Table 3. Ad Table fields applicable to each type of advertisement

  Ad Table field    Yellow pages   Posted price   Sale price   Coupon   Bulk purchase
  query-template         √              √             √           √           √
  server-id              √              √             √           √           √
  start-date             √              √             √           √           √
  expiration-date        –              –             √           √           √
  price                  –              √             √           √           √
  delay                  –              √             √           √           √
  limit-quantity         –              –             –           √           –
  bulk-quantity          –              –             –           –           √
  to-whom                –              –             –           *           *
  other-fields           *              *             *           *           *

  –, null; √, valid; *, optional

3.4 The network bidder

In addition to producing bids based on CPU and disk usage, the processing sites need to take the available network bandwidth into account. The network bidder will be a separate module in Mariposa. Since network bandwidth is a distributed resource, the network bidders along the path from source to destination must calculate an aggregate bid for the entire path and must reserve network resources as a group. Mariposa will use a version of the Tenet network protocols RTIP (Zhang and Fisher 1992) and RCAP (Banerjea and Mah 1991) to perform bandwidth queries and network resource reservation.

A network bid request will be made by the broker to transfer data between source/destination pairs in the query plan. The network bid request is sent to the destination node. The request is of the form: (transaction-id, request-id, data size, from-node, to-node). The broker receives a bid from the network bidder at the destination node of the form: (transaction-id, request-id, price, time). In order to determine the price and time, the network bidder at the destination node must contact each of the intermediate nodes between itself and the source node.

For convenience, call the destination node n0 and the source node nk (see Fig. 3). Call the first intermediate node on the path from the destination to the source n1, the second such node n2, etc. Available bandwidth between two adjacent nodes as a function of time is represented as a bandwidth profile. The bandwidth profile contains entries of the form (available bandwidth, t1, t2) indicating the available bandwidth between time t1 and time t2. If ni and ni−1 are directly-connected nodes on the path from the source to the destination, and data is flowing from ni to ni−1, then node ni is responsible for keeping track of (and charging for) available bandwidth between itself and ni−1 and therefore maintains the bandwidth profile. Call the bandwidth profile between node ni and node ni−1 Bi and the price ni charges for a bandwidth reservation Pi.

The available bandwidth on the entire path from source to destination is calculated step by step starting at the destination node, n0. Node n0 contacts n1 which has B1, the bandwidth profile for the network link between itself and n0. It sends this profile to node n2, which has the bandwidth profile B2. Node n2 calculates min(B1, B2), producing a bandwidth profile that represents the available bandwidth along the path from n2 to n0. This process continues along each intermediate link, ultimately reaching the source node.

When the bandwidth profile reaches the source node, it is equal to the minimum available bandwidth over all links on the path between the source and destination, and represents the amount of bandwidth available as a function of time on the entire path. The source node, nk, then initiates a backward pass to calculate the price for this bandwidth along the entire path. Node nk sends its price to reserve the bandwidth, Pk, to node nk−1, which adds its price, and so on, until the aggregate price arrives at the destination, n0. Bandwidth could also be reserved at this time. If bandwidth is reserved at bidding time, there is a chance that it will not be used (if the source or destination is not chosen by the broker). If bandwidth is not reserved at this time, then there will be a window of time between bidding and bid award when the available bandwidth may have changed. We are investigating approaches to this problem.
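The forward pass just described amounts to an element-wise minimum over per-link bandwidth profiles, and the backward pass to a sum of per-link prices. The sketch below assumes, purely for simplicity, that all profiles share the same (t1, t2) boundaries.

    def min_profile(p1, p2):
        """Element-wise minimum of two bandwidth profiles that share the same
        (t1, t2) boundaries: [(bandwidth, t1, t2), ...]."""
        return [(min(b1, b2), t1, t2)
                for (b1, t1, t2), (b2, _, _) in zip(p1, p2)]

    def path_bandwidth(profiles):
        """Forward pass from destination to source: B1, then min(B1, B2), ..."""
        path = profiles[0]
        for p in profiles[1:]:
            path = min_profile(path, p)
        return path

    def path_price(prices):
        """Backward pass: each node adds its reservation price Pi."""
        return sum(prices)

    B1 = [(10, 0, 5), (6, 5, 10)]      # hypothetical profile between n1 and n0
    B2 = [(8, 0, 5), (9, 5, 10)]       # between n2 and n1
    B3 = [(4, 0, 5), (12, 5, 10)]      # between the source n3 and n2
    available = path_bandwidth([B1, B2, B3])   # [(4, 0, 5), (6, 5, 10)]
    total_price = path_price([3.0, 2.5, 4.0])  # P1 + P2 + P3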
Fig. 3. Calculating a bandwidth profile (the figure shows bandwidth-versus-time profiles B1, B2 and B3 for the links between the destination n0, the intermediate nodes n1 and n2, and the source n3, together with the accumulated profiles MIN(B1, B2) and MIN(MIN(B1, B2), B3) that result as the profile is passed toward the source)

In addition to the choice of when to reserve network resources, there are two choices for when the broker sends out network bid requests during the bidding process. The broker could send out requests for network bids at the same time that it sends out other bid requests, or it could wait until the single-site bids have been returned and then send out requests for network bids to the winners of the first phase. In the first case, the broker would have to request a bid from every pair of sites that could potentially communicate with one another. If P is the number of parallelized phases of the query plan, and Si is the number of sites in phase i, then this approach would produce a total of the sum of Si × Si−1 for i = 2, . . ., P bids. In the second case, the broker only has to request bids between the winners of each phase of the query plan. If winneri is the winning group of sites for phase i, then the number of network bid requests sent out is the sum of Swinneri × Swinneri−1 for i = 2, . . ., P. The first approach has the advantage of parallelizing the bidding phase itself and thereby reducing the optimization time. However, the sites that are asked to reserve bandwidth are not guaranteed to win the bid. If they reserve all the bandwidth for each bid request they receive, this approach will result in reserving more bandwidth than is actually needed. This difficulty may be overcome by reserving less bandwidth than is specified in bids, essentially "overbooking the flight."
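A quick check of the two message counts above, using hypothetical phase sizes for a three-phase plan.

    def bid_requests(sites_per_phase):
        """Sum over i = 2..P of S_i * S_(i-1), where S_i is the number of sites
        involved in phase i (candidates in one case, winners in the other)."""
        return sum(sites_per_phase[i] * sites_per_phase[i - 1]
                   for i in range(1, len(sites_per_phase)))

    S = [4, 3, 5]    # candidate sites in phases 1..3 (hypothetical)
    W = [2, 1, 2]    # winning sites per phase (hypothetical)
    early = bid_requests(S)   # 4*3 + 3*5 = 27 network bid requests
    late = bid_requests(W)    # 2*1 + 1*2 = 4 network bid requests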
4 Storage management

Each site manages a certain amount of storage, which it can fill with fragments or copies of fragments. The basic objective of a site is to allocate its CPU, I/O and storage resources so as to maximize its revenue income per unit time. This topic is the subject of the first part of this section. After that, we turn to the splitting and coalescing of fragments into smaller or bigger storage units.

4.1 Buying and selling fragments

In order for sites to trade fragments, they must have some means of calculating the (expected) value of the fragment for each site. Some access history is kept with each fragment so sites can make predictions of future activity. Specifically, a site maintains the size of the fragment as well as its revenue history. Each record of the history contains the query, number of records which qualified, time-since-last-query, revenue, delay, I/O-used, and CPU-used. The CPU and I/O information is normalized and stored in site-independent units.

To estimate the revenue that a site would receive if it owned a particular fragment, the site must assume that access rates are stable and that the revenue history is therefore a good predictor of future revenue. Moreover, it must convert site-independent resource usage numbers into ones specific to its site through a weighting function, as in Mackert and Lohman (1986). In addition, it must assume that it would have successfully bid on the same set of queries as appeared in the revenue history. Since it will be faster or slower than the site from which the revenue history was collected, it must adjust the revenue collected for each query. This calculation requires the site to assume a shape for the average bid curve. Lastly, it must convert the adjusted revenue stream into a cash value, by computing the net present value of the stream.

If a site wants to bid on a subquery, then it must either buy any fragment(s) referenced by the subquery or subcontract out the work to another site. If the site wishes to buy a fragment, it can do so either when the query comes in (on demand) or in advance (prefetch). To purchase a fragment, a buyer locates the owner of the fragment and requests the revenue history of the fragment, and then places a value on the fragment. Moreover, if it buys the fragment, then it will have to evict a collection of fragments to free up space, adding to the cost of the fragment to be purchased. To the extent that storage is not full, then fewer (or no) evictions will be required. In any case, this collection is called the alternate fragments in the formula below. Hence, the buyer will be willing to bid the following price for the fragment:

offer price = value of fragment − value of alternate fragments + price received

In this calculation, the buyer will obtain the value of the new fragment but lose the value of the fragments that it must evict. Moreover, it will sell the evicted fragments, and receive some price for them. The latter item is problematic to compute. A plausible assumption is that price received is equal to the value of the alternate fragments. A more conservative assumption is that the price obtained is zero. Note that in this case the offer price need not be positive.
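The buyer's side of this calculation can be written directly from the formula above. The revenue-history valuation shown here (a speed adjustment followed by a crude net present value) is only a stand-in for the fuller procedure described in the text.

    def fragment_value(revenue_history, speed_ratio, discount_rate):
        """Value a fragment from its revenue history: adjust each recorded revenue
        for this site's relative speed, then take a simple net present value.
        Both steps are simplifications of the procedure described in the text."""
        value = 0.0
        for periods_ago, revenue in enumerate(revenue_history):
            value += (revenue * speed_ratio) / ((1.0 + discount_rate) ** periods_ago)
        return value

    def buyer_offer_price(new_fragment_value, evicted_value, expected_resale):
        # offer price = value of fragment - value of alternate fragments + price received
        return new_fragment_value - evicted_value + expected_resale

    history = [12.0, 9.0, 11.0]   # hypothetical per-period revenues for the fragment
    value = fragment_value(history, speed_ratio=1.2, discount_rate=0.05)
    offer = buyer_offer_price(value, evicted_value=8.0, expected_resale=0.0)  # conservative resale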
The potential seller of the fragment performs the following calculation: the site will receive the offered price and will lose the value of the fragment which is being evicted. However, if the fragment is not evicted, then a collection of alternate fragments summing in size to the indicated fragment must be evicted. In this case, the site will lose the
value of these (more desirable) fragments, but will receive the expected price received. Hence, it will be willing to sell the fragment, transferring it to the buyer:

offer price > value of fragment − value of alternate fragments + price received

Again, price received is problematic, and subject to the same plausible assumptions noted above.

Sites may sell fragments at any time, for any reason. For example, decommissioning a server implies that the server will sell all of its fragments. To sell a fragment, the site conducts a bidding process, essentially identical to the one used for subqueries above. Specifically, it sends the revenue history to a collection of potential bidders and asks them what they will offer for the fragment. The seller considers the highest bid and will accept the bid under the same considerations that applied when selling fragments on request, namely if:

offered price > value of fragment − value of alternate fragments + price received

If no bid is acceptable, then the seller must try to evict another (higher value) fragment until one is found that can be sold. If no fragments are sellable, then the site must lower the value of its fragments until a sale can be made. In fact, if a site wishes to go out of business, then it must find a site to accept its fragments and lower their internal value until a buyer can be found for all of them.

The storage manager is an asynchronous process running in the background, continually buying and selling fragments. Obviously, it should work in harmony with the bidder mentioned in the previous section. Specifically, the bidder should bid on queries for remote fragments that the storage manager would like to buy, but has not yet done so. In contrast, it should decline to bid on queries to remote objects in which the storage manager has no interest. The first primitive version of this interface is the "hot list" mentioned in the previous section.

4.2 Splitting and coalescing

Mariposa sites must also decide when to split and coalesce fragments. Clearly, if there are too few fragments in a class, then parallel execution of Mariposa queries will be hindered. On the other hand, if there are too many fragments, then the overhead of dealing with all the fragments will increase and response time will suffer, as noted in Copeland et al. (1988). The algorithms for splitting and coalescing fragments must strike the correct balance between these two effects.

At the current time, our storage manager does not have general Rush rules to deal with splitting and coalescing fragments. Hence, this section indicates our current plans for the future.

One strategy is to let market pressure correct inappropriate fragment sizes. Large fragments have high revenue and attract many bidders for copies, thereby diverting some of the revenue away from the owner. If the owner site wants to keep the number of copies low, it has to break up the fragment into smaller fragments, which have less revenue and are less attractive for copies. On the other hand, a small fragment has high processing overhead for queries. Economies of scale could be realized by coalescing it with another fragment in the same class into a single larger fragment.

If more direct intervention is required, then Mariposa might resort to the following tactic. Consider the execution of queries referencing only a single class. The broker can fetch the number of fragments, NumC, in that class from a name server and, assuming that all fragments are the same size, can compute the expected delay (ED) of a given query on the class if run on all fragments in parallel. The budget function tells the broker the total amount that is available for the entire query under that delay. The amount of the expected feasible bid per site in this situation is:

expected feasible site bid = B(ED) / NumC

The broker can repeat those calculations for a variable number of fragments to arrive at Num*, the number of fragments to maximize the expected revenue per site.

This value, Num*, can be published by the broker, along with its request for bids. If a site has a fragment that is too large (or too small), then in steady state it will be able to obtain a larger revenue per query if it splits (coalesces) the fragment. Hence, if a site keeps track of the average value of Num* for each class for which it stores a fragment, then it can decide whether its fragments should be split or coalesced.
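A sketch of how a broker might arrive at Num*: for each candidate number of equal-sized fragments, estimate the expected delay of a fully parallel scan, evaluate the budget at that delay, and divide by the number of fragments; the count that maximizes this per-site figure is Num*. The delay model used here (work split evenly plus a fixed per-fragment overhead) is an assumption made only for the example.

    def expected_delay(num_fragments, total_work, per_fragment_overhead):
        # Assumed delay model: perfect parallelism plus a per-fragment setup cost.
        return total_work / num_fragments + per_fragment_overhead

    def best_num_fragments(B, total_work, per_fragment_overhead, max_fragments=64):
        best_num, best_bid = 1, float("-inf")
        for num in range(1, max_fragments + 1):
            ed = expected_delay(num, total_work, per_fragment_overhead)
            feasible_site_bid = B(ed) / num    # expected feasible site bid = B(ED) / NumC
            if feasible_site_bid > best_bid:
                best_num, best_bid = num, feasible_site_bid
        return best_num, best_bid

    B = lambda t: max(0.0, 120.0 - 2.0 * t)
    num_star, per_site_revenue = best_num_fragments(B, total_work=60.0,
                                                    per_fragment_overhead=1.0)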
Of course, a site must honor any outstanding contracts that it has already made. If it discards or splits a fragment for which there is an outstanding contract, then the site must endure the consequences of its actions. This entails either subcontracting to some other site a portion of the previously committed work or buying back the missing data. In either case, there are revenue consequences, and a site should take its outstanding contracts into account when it makes fragment allocation decisions. Moreover, a site should carefully consider the desirable expiration time for contracts. Shorter times will allow the site greater flexibility in allocation decisions.

5 Names and name service

Current distributed systems use a rigid naming approach, assume that all changes are globally synchronized, and often have a structure that limits the scalability of the system. The Mariposa goals of mobile fragments and avoidance of global synchronization require that a more flexible naming service be used. We have developed a decentralized naming facility that does not depend on a centralized authority for name registration or binding.

5.1 Names

Mariposa defines four structures used in object naming. These structures (internal names, full names, common names and name contexts) are defined below.
Internal names are location-dependent names used to determine the physical location of a fragment. Because these are low-level names that are defined by the implementation, they will not be described further.

Full names are completely-specified names that uniquely identify an object. A full name can be resolved to any object regardless of location. Full names are not specific to the querying user and site, and are location-independent, so that when a query or fragment moves the full name is still valid. A name consists of components describing attributes of the containing table, and a full name has all components fully specified.

In contrast, common names (sometimes known as synonyms) are user-specific, partially specified names. Using them avoids the tedium of using a full name. Simple rules permit the translation of common names into full names by supplying the missing name components. The binding operation gathers the missing parts either from parameters directly supplied by the user or from the user's environment as stored in the system catalogs. Common names may be ambiguous because different users may refer to different objects using the same name. Because common names are context dependent, they may even refer to different objects at different times. Translation of common names is performed by functions written in the Mariposa rule/extension language, stored in the system catalogs, and invoked by the module (e.g., the parser) that requires the name to be resolved. Translation functions may take several arguments and return a string containing a Boolean expression that looks like a query qualification. This string is then stored internally by the invoking module when called by the name service module. The user may invoke translation functions directly, e.g., my_naming(EMP). Since we expect most users to have a "usual" set of name parameters, a user may specify one such function (taking the name string as its only argument) as a default in the USER system catalog. When the user specifies a simple string (e.g., EMP) as a common name, the system applies this default function.

Finally, a name context is a set of affiliated names. Names within a context are expected to share some feature. For example, they may be often used together in an application (e.g., a directory) or they may form part of a more complex object (e.g., a class definition). A programmer can define a name context for global use that everyone can access, or a private name context that is visible only to a single application. The advantage of a name context is that names do not have to be globally registered, nor are the names tied to a physical resource to make them unique, such as the birth site used in Williams et al. (1981). Like other objects, a name context can also be named. In addition, like data fragments, it can be migrated between name servers, and there can be multiple copies residing on different servers for better load balancing and availability. This scheme differs from another proposed decentralized name service (Cheriton and Mann 1989) that avoided a centralized name authority by relying upon each type of server to manage their own names without relying on a dedicated name service.

5.2 Name resolution

A name must be resolved to discover which object is bound to the name. Every client and server has a name cache at the site to support the local translation of common names to full names and of full names to internal names. When a broker wants to resolve a name, it first looks in the local name cache to see if a translation exists. If the cache does not yield a match, the broker uses a rule-driven search to resolve ambiguous common names. If a broker still fails to resolve a name using its local cache, it will query one or more name servers for additional name information.

As previously discussed, names are unordered sets of attributes. In addition, since the user may not know all of an object's attributes, it may be incomplete. Finally, common names may be ambiguous (more than one match) or untranslatable (no matches). When the broker discovers that there are multiple matches to the same common name, it tries to pick one according to the policy specified in its rule base. Some possible policies are "first match," as exemplified by the UNIX shell command search (path), or a policy of "best match" that uses additional semantic criteria. Considerable information may exist that the broker can apply to choose the best match, such as data types, ownership, and protection permissions.
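A toy version of the resolution path described here: consult the local name cache first, break ties among ambiguous common names with a "first match" or "best match" policy, and fall back to a name server query (represented by a stub) when the cache has no entry. All of the structures are invented for the illustration.

    def resolve(common_name, name_cache, contexts, policy="first match",
                ask_name_server=None):
        """Resolve a common name to a full name using the local cache, a list of
        name contexts and a tie-breaking policy; fall back to a name server."""
        matches = [full for (ctx, name), full in name_cache.items()
                   if name == common_name and ctx in contexts]
        if matches:
            if policy == "first match":
                return matches[0]
            # "best match": placeholder for additional semantic criteria
            return max(matches, key=len)
        if ask_name_server is not None:
            return ask_name_server(common_name, contexts)  # stub for a metadata query
        return None

    cache = {("hr", "EMP"): "hr.berkeley.EMP",
             ("payroll", "EMP"): "payroll.sandiego.EMP"}
    full_name = resolve("EMP", cache, contexts=["hr", "payroll"])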
5.3 Name discovery

In Mariposa, a name server responds to metadata queries in the same way as data servers execute regular queries, except that they translate common names into full names using a list of name contexts provided by the client. The name service process uses the bidding protocol of Sect. 3 to interact with a collection of potential bidders. The name service chooses the winning name server based on economic considerations of cost and quality of service. Mariposa expects multiple name servers, and this collection may be dynamic as name servers are added to and removed from a Mariposa environment. Name servers are expected to use advertising to find clients.

Each name server must make arrangements to read the local system catalogs at the sites whose catalogs it serves periodically and build a composite set of metadata. Since there is no requirement for a processing site to notify a name server when fragments change sites or are split or coalesced, the name server metadata may be substantially out of date. As a result, name servers are differentiated by their quality of service regarding their price and the staleness of their information. For example, a name server that is less than one minute out of date generally has better quality information than one which can be up to one day out of date. Quality is best measured by the maximum staleness of the answer to any name service query. Using this information, a broker can make an appropriate tradeoff between price, delay and quality of answer among the various name services, and select the one that best meets its needs.

Quality may be based on more than the name server's polling rate. An estimate of the real quality of the metadata may be based on the observed rate of update. From this we predict the chance that an invalidating update will occur for a time period after fetching a copy of the data into the local
cache. The benefit is that the calculation can be made without probing the actual metadata to see if it has changed. The quality of service is then a measurement of the metadata's rate of update, as well as the name server's rate of update.

6 Mariposa status and experiments

At the current time (June 1995), a complete Mariposa implementation using the architecture described in this paper is operational on Digital Equipment Corp. Alpha AXP workstations running Digital UNIX. The current system is a combination of old and new code. The basic server engine is that of POSTGRES (Stonebraker and Kemnitz 1991), modified to accept SQL instead of POSTQUEL. In addition, we have implemented the fragmenter, broker, bidder and coordinator modules to form the complete Mariposa system portrayed in Fig. 1.

Building a functional distributed system has required the addition of a substantial amount of software infrastructure. For example, we have built a multithreaded network communication package using ONC RPC and POSIX threads. The primitive actions shown in Table 1 have been implemented as RPCs and are available as Rush procedures for use in the action part of a Rush rule. Implementation of the Rush language itself has required careful design and performance engineering, as described in Sah and Blow (1994).

We are presently extending the functionality of our prototype. At the current time, the fragmenter, coordinator and broker are fairly complete. However, the storage manager and the bidder are simplistic, as noted earlier. We are in the process of constructing more sophisticated routines in these modules. In addition, we are implementing the replication system described in Sidell et al. (1995). We plan to release a general Mariposa distribution when these tasks are completed later in 1995.

The rest of this section presents details of a few simple experiments which we have conducted in both LAN and WAN environments. The experiments demonstrate the power, performance and flexibility of the Mariposa approach to distributed data management. First, we describe the experimental setup. We then show by measurement that the Mariposa protocols do not add excessive overhead relative to those in a traditional distributed DBMS. Finally, we show how Mariposa query optimization and execution compares to that of a traditional system.

6.1 Experimental environment

Table 4. Mariposa site configurations

              WAN                                             LAN
Site   Host         Location        Model      Memory    Host         Location   Model       Memory
1      huevos       Santa Barbara   3000/600   96 MB     arcadia      Berkeley   3000/400    64 MB
2      triplerock   Berkeley        2100/500   256 MB    triplerock   Berkeley   2100/500    256 MB
3      pisa         San Diego       3000/800   128 MB    nobozo       Berkeley   3000/500X   160 MB

Table 5. Parameters for the experimental test data

Table   Location   Number of rows   Total size
R1      Site 1     50 000           5 MB
R2      Site 2     10 000           1 MB
R3      Site 3     50 000           5 MB

The experiments were conducted on Alpha AXP workstations running versions 2.1 and 3.0 of Digital UNIX. Table 4 shows the actual hardware configurations used. The workstations were connected by a 10 MB/s Ethernet in the LAN case and the Internet in the WAN case. The WAN experiments were performed after midnight in order to avoid heavy daytime Internet traffic that would cause excessive bandwidth and latency variance.

The results in this section were generated using a simple synthetic dataset and workload. The database consists of three tables, R1, R2 and R3. The tables are part of the Wisconsin Benchmark database (Bitton et al. 1983), modified to produce results of the sizes indicated in Table 5. We make available statistics that allow a query optimizer to estimate the size of (R1 join R2), (R2 join R3) and (R1 join R2 join R3) as 1 MB, 3 MB and 4.5 MB, respectively. The workload query is an equijoin of all three tables:

   SELECT *
   FROM R1, R2, R3
   WHERE R1.u1 = R2.u1
   AND R2.u1 = R3.u1

In the wide area case, the query originates at Berkeley and performs the join over the WAN connecting UC Berkeley, UC Santa Barbara and UC San Diego.

6.2 Comparison of the purchase order and expensive bid protocols

Before discussing the performance benefits of the Mariposa economic protocols, we should quantify the overhead they add to the process of constructing and executing a plan relative to a traditional distributed DBMS. We can analyze the situation as follows. A traditional system plans a query and sends the subqueries to the processing sites; this process follows essentially the same steps as the purchase order protocol discussed in Sect. 3. However, Mariposa can choose between the purchase order protocol and the expensive bid protocol. As a result, Mariposa overhead (relative to the traditional system) is the difference in elapsed time between the two protocols, weighted by the proportion of queries that actually use the expensive bid protocol.

To measure the difference between the two protocols, we repeatedly executed the three-way join query described
in the previous section over both a LAN and a WAN. The elapsed times for the various processing stages shown in Table 6 represent averages over ten runs of the same query. For this experiment, we did not install any rules that would cause fragment migration and did not change any optimizer statistics. The query was therefore executed identically every time. Plainly, the only difference between the purchase order and the expensive bid protocol is in the brokering stage.

Table 6. Elapsed times for various query processing stages

Network   Stage       Time (s)
                      Purchase order protocol   Expensive bid protocol
LAN       Parser      0.18                      0.18
          Optimizer   0.08                      0.08
          Broker      1.72                      6.69
WAN       Parser      0.18                      0.18
          Optimizer   0.08                      0.08
          Broker      4.52                      14.08

The difference in elapsed time between the two protocols is due largely to the message overhead of brokering, but not in the way one would expect from simple message counting. In the purchase order protocol, the single-site optimizer determines the sites to perform the joins and awards contracts to the sites accordingly. Sending the contracts to the two remote sites involves two round-trip network messages (as previously mentioned, this is no worse than the cost in a traditional distributed DBMS of initiating remote query execution). In the expensive bid protocol, the broker sends out request for bid (RFB) messages for the two joins to each site. However, each prospective join processing site then sends out subbids for remote table scans. The whole brokering process therefore involves 14 round-trip messages for RFBs (including subbids), six round-trip messages for recording the bids and two more for notifying the winners of the two join subqueries. Note, however, that the bid collection process is executed in parallel because the broker and the bidder are multithreaded, which accounts for the fact that the additional cost is not as high as might be thought.

As is evident from the results presented in Table 6, the expensive bid protocol is not unduly expensive. If the query takes more than a few minutes to execute, the savings from a better query processing strategy can easily outweigh the small cost of bidding. Recall that the expensive protocol will only be used when the purchase order protocol cannot be. We expect the less expensive protocol to be used for the majority of the time. The next subsection shows how economic methods can produce better query processing strategies.

6.3 Bidding in a simple economy

We illustrate how the economic paradigm works by running the three-way distributed join query described in the previous section, repeatedly in a simple economy. We discuss how the query optimization and execution strategy in Mariposa differs from traditional distributed database systems and how Mariposa achieves an overall performance improvement by adapting its query processing strategy to the environment. We also show how data migration in Mariposa can automatically ameliorate poor initial data placement.

In our simple economy, each site uses the same pricing scheme and the same set of rules. The expensive bid protocol is used for every economic transaction. Sites have adequate storage space and never need to evict alternate fragments to buy fragments. The exact parameters and decision rules used to price queries and fragments are as follows:

Queries: Sites bid on subqueries as described in Sect. 3.3. That is, a bidder will only bid on a join if the criteria specified in Sect. 3.3 are satisfied. The billing rate is simply 1.5× estimated cost, leading to the following offer price:

   actual bid = (1.5 × estimated cost) × load average

load average = 1 for the duration of the experiment, reflecting the fact that the system is lightly loaded. The difference in the bids offered by each bidder is therefore solely due to data placement (e.g., some bidders need to subcontract remote scans).

Fragments: A broker who subcontracts for remote scans also considers buying the fragment instead of paying for the scan. The fragment value discussed in Section 4.1 is set to (2 × scan cost) / load average; this, combined with the fact that eviction is never necessary, means that a site will consider selling a fragment whenever

   offer price > (2 × scan cost) / load average

A broker decides whether to try to buy a fragment or to pay for the remote scan according to the following rule:

   on (salePrice(frag) <= moneySpentForScan(frag))
   do acquire(frag)

In other words, the broker tries to acquire a fragment when the amount of money spent scanning the fragment in previous queries is greater than or equal to the price for buying the fragment. As discussed in Sect. 4.1, each broker keeps a hot-list of remote fragments used in previous queries with their associated scan costs. This rule will cause data to move closer to the query when executed frequently.

This simple economy is not entirely realistic. Consider the pricing of selling a fragment as shown above. If load average increases, the sale price of the fragment decreases. This has the desirable effect of hastening the sale of fragments to off-load a busy site. However, it tends to cause the sale of hot fragments as well. An effective Mariposa economy will consist of more rules and a more sophisticated pricing scheme than that with which we are currently experimenting.

We now present the performance and behavior of Mariposa using the simple economy described above and the WAN environment shown in Table 4. Our experiments show
how Mariposa adapts to the environment through the bidding process under the economy and the rules described above.

A traditional query optimizer will use a fixed query processing strategy. Assuming that sites are uniform in their query processing capacity, the optimizer will ultimately differentiate plans based on movement of data. That is, it will tend to choose plans that minimize the amount of base table and intermediate result data transmitted over the network. As a result, a traditional optimizer will construct the following plan:

(1) Move R2 from Berkeley to Santa Barbara. Perform R1 join R2 at Santa Barbara.
(2) Move the answer to San Diego. Perform the second join at San Diego.
(3) Move the final answer to Berkeley.

This plan causes 6.5 MB of data to be moved (1 MB in step 1, 1 MB in step 2, and 4.5 MB in step 3). If the same query is executed repeatedly under identical load conditions, then the same plan will be generated each time, resulting in identical costs.

By contrast, the simple Mariposa economy can adjust the assignment of queries and fragments to reflect the current workload. Even though the Mariposa optimizer will pick the same join order as the traditional optimizer, the broker can change its query processing strategy because it acquires bids for the two joins among the three sites. Examination of Table 7 reveals the performance improvements resulting from dynamic movement of objects. It shows the elapsed time, location of data and revenue generated at each site by running the three-way join query described in Sect. 6.1 repeatedly from site 2 (Berkeley).

Table 7. Execution times, data placement and revenue at each site

                           Steps
                           1        2        3        4        5        6
Elapsed time   Brokering   13.06    12.78    18.81    13.97    8.9      10.06
(s)            Total       449.30   477.74   403.61   428.82   394.3    384.04
Location       R1          1        1        1        1        3        3
(site)         R2          2        2        1        1        3        3
               R3          3        3        3        3        3        3
Revenue        Site 1      97.6     97.6     95.5     97.2     102.3    0.0
(per query)    Site 2      2.7      2.7      3.5      1.9      1.9      1.9
               Site 3      177.9    177.9    177.9    177.9    165.3    267.7

At the first step of the experiment, Santa Barbara is the winner of the first join. The price of scanning the smaller table, R2, remotely from Santa Barbara is less than that of scanning R1 remotely from Berkeley; as a result, Santa Barbara offers a lower bid. Similarly, San Diego is the winner of the second join. Hence, for the first two steps, the execution plan resulting from the bidding is identical to the one obtained by a traditional distributed query optimizer.

However, subsequent steps show that Mariposa can generate better plans than a traditional optimizer by migrating fragments when necessary. For instance, R2 is moved to Santa Barbara in step 3 of the experiment, and subsequent joins of R1 and R2 can be performed locally. This eliminates the need to move 1 MB of data. Similarly, R1 and R2 are moved to San Diego at step 5 so that the joins can be performed locally¹. The cost of moving the tables can be amortized over repeated execution of queries that require the same data.

The experimental results vary considerably because of the wide variance in Internet network latency. Table 7 shows a set of results which best illustrate the beneficial effects of the economic model.

¹ Note that the total elapsed time does not include the time to move the fragments. It takes 82 s to move R2 to site 1 at step 3 and 820 s to move R1 and R3 to site 3 at step 5.

7 Related work

Currently, there are only a few systems documented in the literature that incorporate microeconomic approaches to resource sharing problems. Huberman (1988) presents a collection of articles that cover the underlying principles and explore the behavior of those systems.

Miller and Drexler (1988) use the term "Agoric Systems" for software systems deploying market mechanisms for resource allocation among independent objects. The data-type agents proposed in that article are comparable to our brokers. They mediate between consumer and supplier objects, helping to find the current best price and supplier for a service. As an extension, agents have a "reputation" and their services are brokered by an agent-selection agent. This is analogous to the notion of a quality-of-service of name servers, which also offer their services to brokers.

Kurose and Simha (1989) present a solution to the file allocation problem that makes use of microeconomic principles, but is based on a cooperative, not competitive, environment. The agents in this economy exchange fragments in order to minimize the cumulative system-wide access costs for all incoming requests. This is achieved by having the sites voluntarily cede fragments or portions thereof to other sites if it lowers access costs. In this model, all sites cooperate to achieve a global optimum instead of selfishly competing for resources to maximize their own utility.

Malone et al. describe the implementation of a process migration facility for a pool of workstations connected through a LAN. In this system, a client broadcasts a request for bids that includes a task description. The servers willing to process that task return an estimated completion time, and the client picks the best bid. The time estimate is computed on the basis of processor speed, current system load, a normalized runtime of the task, and the number and length of files to be loaded. The latter two parameters are
supplied by the task description. No prices are charged for processing services and there is no provision for a shortcut to the bidding process by mechanisms like posting server characteristics or advertisements of servers.

Another distributed process scheduling system is presented in Waldspurger (1992). Here, CPU time on remote machines is auctioned off by the processing sites, and applications hand in bids for time slices. This is in contrast to our system, where processing sites make bids for servicing requests. There are different types of auctions, and computations are aborted if their funding is depleted. An application is structured into manager and worker modules. The worker modules perform the application processing and several of them can execute in parallel. The managers are responsible for funding their workers and divide the available funds between them in an application-specific way. To adjust the degree of parallelism to the availability of idle CPUs, the manager changes the funding of individual workers.

Wellman (1993) offers a simulation of multicommodity flow that is quite close to our bidding model, but with a bid resolution model that converges with multiple rounds of messages. His clearinghouses violate our constraint against single points of failure. Mariposa name service can be thought of as clearinghouses with only a partial list of possible suppliers. His optimality results are clearly invalidated by the possible exclusion of optimal bidders. This suggests the importance of high-quality name service, to ensure that the winning bidders are usually solicited for bids.

A model similar to ours is proposed by Ferguson et al. (1993), where fragments can be moved and replicated between the nodes of a network of computers, although they are not allowed to be split or coalesced. Transactions, consisting of simple read/write requests for fragments, are given a budget when entering the system. Accesses to fragments are purchased from the sites offering them at the desired price/quality ratio. Sites are trying to maximize their revenue and therefore lease fragments or their copies if the access history for that fragment suggests that this will be profitable. Unlike our model, there is no bidding process for either service purchase or fragment lease. The relevant prices are published at every site in catalogs that can be updated at any time to reflect current demand and system load. The network distance to the site offering the fragment access service is included in the price quote to give a quality-of-service indication. A major difference between this model and ours is that every site needs to have perfect information about the prices of fragment accesses at every other site, requiring global updates of pricing information. Also, it is assumed that a name service, which has perfect information about all the fragments in the network, is available at every site, again requiring global synchronization. The name service is provided at no cost and is hence excluded from the economy. We expect that global updates of metadata will suffer from a scalability problem, sacrificing the advantages of the decentralized nature of microeconomic decisions.

When computer centers were the main source of computing power, several authors studied the economics of such centers' services. The work focussed on the cost of the services, the required scale of the center given user needs, the cost of user delays, and the pricing structure. Several results are reported in the literature, in both computer and management sciences. In particular, Mendelson (1985) proposes a microeconomic model for studies of queueing effects of popular pricing policies, typically not considering the delays. The model shows that when delay cost is taken into account, a low utilization ratio of the center is often optimal. The model is refined by Dewan and Mendelson (1990). The authors assume a nonlinear delay cost structure, and present necessary and sufficient conditions for the optimality of pricing rules that charge out service resources at their marginal capacity cost. Although these and similar results were intended for human decision making, many apply to the Mariposa context as well.

On the other hand, Mendelson and Saharia (1986) propose a methodology for trading off the cost of incomplete information against data-related costs, and for constructing minimum-cost answers to a variety of query types. These results can be useful in the Mariposa context. Users and their brokers will indeed often face a compromise between complete but costly and cheaper but incomplete and partial data and processing.

8 Conclusions

We present a distributed microeconomic approach for managing query execution and storage management. The difficulty in scheduling distributed actions in a large system stems from the combinatorially large number of possible choices for each action, the expense of global synchronization, and the requirement of supporting systems with heterogeneous capabilities. Complexity is further increased by the presence of a rapidly changing environment, including time-varying load levels for each site and the possibility of sites entering and leaving the system. The economic model is well-studied and can reduce the scheduling complexity of distributed interactions because it does not seek globally optimal solutions. Instead, the forces of the market provide an "invisible hand" to guide reasonably equitable trading of resources.

We further demonstrated the power and flexibility of Mariposa through experiments running over a wide-area network. Initial results confirm our belief that the bidding protocol is not unduly expensive and that the bidding process results in execution plans that can adapt to the environment (such as unbalanced workload and poor data placement) in a flexible manner. We are implementing more sophisticated features and plan a general release for the end of 1995.

Acknowledgements. The authors would like to thank Jim Frew and Darla Sharp of the Institute for Computational Earth System Science at the University of California, Santa Barbara and Joseph Pasquale and Eric Anderson of the Department of Computer Science and Engineering of the University of California, San Diego for providing a home for the remote Mariposa sites and their assistance in the initial setup. Mariposa has been designed and implemented by a team of students, faculty and staff that includes the authors as well as Robert Devine, Marcel Kornacker, Michael Olson, Robert Patrick and Rex Winterbottom. The presentation and ideas in this paper have been greatly improved by the suggestions and critiques provided by Sunita Sarawagi and Allison Woodruff. This research was sponsored by the Army Research Office under contract DAAH04-94-G-0223, the Advanced Research Projects Agency under contract DABT63-92-C-0007, the National Science Foundation under grant IRI-9107455, and Microsoft Corp.
References

Banerjea A, Mah BA (1991) The real-time channel administration protocol. In: Proc 2nd Int Workshop on Network and Operating System Support for Digital Audio and Video, Heidelberg, Germany, November
Bernstein PA, Goodman N, Wong E, Reeve CL, Rothnie J (1981) Query processing in a system for distributed databases (SDD-1). ACM Trans Database Syst 6:602–625
Bitton D, DeWitt DJ, Turbyfill C (1983) Benchmarking data base systems: a systematic approach. In: Proc 9th Int Conf on Very Large Data Bases, Florence, Italy, November
Cheriton D, Mann TP (1989) Decentralizing a global naming service for improved performance and fault tolerance. ACM Trans Comput Syst 7:147–183
Copeland G, Alexander W, Boughter E, Keller T (1988) Data placement in Bubba. In: Proc 1988 ACM-SIGMOD Conf on Management of Data, Chicago, Ill, June, pp 99–108
Dewan S, Mendelson H (1990) User delay costs and internal pricing for a service facility. Management Sci 36:1502–1517
Ferguson D, Nikolaou C, Yemini Y (1993) An economy for managing replicated data in autonomous decentralized systems. In: Proc Int Symp on Autonomous Decentralized Syst (ISADS 93), Kawasaki, Japan, March, pp 367–375
Huberman BA (ed) (1988) The ecology of computation. North-Holland, Amsterdam
Kurose J, Simha R (1989) A microeconomic approach to optimal resource allocation in distributed computer systems. IEEE Trans Comp 38:705–717
Litwin W et al (1982) SIRIUS system for distributed data management. In: Schneider HJ (ed) Distributed data bases. North-Holland, Amsterdam
Mackert LF, Lohman GM (1986) R* optimizer validation and performance evaluation for distributed queries. In: Proc 12th Int Conf on Very Large Data Bases, Kyoto, Japan, August, pp 149–159
Malone TW, Fikes RE, Grant KR, Howard MT (1988) Enterprise: a market-like task scheduler for distributed computing environments. In: Huberman BA (ed) The ecology of computation. North-Holland, Amsterdam
Mendelson H (1985) Pricing computer services: queueing effects. Commun ACM 28:312–321
Mendelson H, Saharia AN (1986) Incomplete information costs and database design. ACM Trans Database Syst 11:159–185
Miller MS, Drexler KE (1988) Markets and computation: agoric open systems. In: Huberman BA (ed) The ecology of computation. North-Holland, Amsterdam
Ousterhout JK (1994) Tcl and the Tk Toolkit. Addison-Wesley, Reading, Mass
Sah A, Blow J (1994) A new architecture for the implementation of scripting languages. In: Proc USENIX Symp on Very High Level Languages, Santa Fe, NM, October, pp 21–38
Sah A, Blow J, Dennis B (1994) An introduction to the Rush language. In: Proc Tcl'94 Workshop, New Orleans, La, June, pp 105–116
Selinger PG, Astrahan MM, Chamberlin DD, Lorie RA, Price TG (1979) Access path selection in a relational database management system. In: Proc 1979 ACM-SIGMOD Conf on Management of Data, Boston, Mass, June
Sidell J, Aoki PM, Barr S, Sah A, Staelin C, Stonebraker M, Yu A (1995) Data replication in Mariposa (Sequoia 2000 Technical Report 95-60). University of California, Berkeley, Calif
Stonebraker M (1986) The design and implementation of distributed INGRES. In: Stonebraker M (ed) The INGRES papers. Addison-Wesley, Reading, Mass
Stonebraker M (1991) An overview of the Sequoia 2000 project (Sequoia 2000 Technical Report 91/5). University of California, Berkeley, Calif
Stonebraker M, Kemnitz G (1991) The POSTGRES next-generation database management system. Commun ACM 34:78–92
Stonebraker M, Aoki PM, Devine R, Litwin W, Olson M (1994a) Mariposa: a new architecture for distributed data. In: Proc 10th Int Conf on Data Engineering, Houston, Tex, February, pp 54–65
Stonebraker M, Devine R, Kornacker M, Litwin W, Pfeffer A, Sah A, Staelin C (1994b) An economic paradigm for query processing and data migration in Mariposa. In: Proc 3rd Int Conf on Parallel and Distributed Information Syst, Austin, Tex, September, pp 58–67
Waldspurger CA, Hogg T, Huberman B, Kephart J, Stornetta S (1992) Spawn: a distributed computational ecology. IEEE Trans Software Eng 18:103–117
Wellman MP (1993) A market-oriented programming environment and its applications to distributed multicommodity flow problems. J AI Res 1:1–23
Williams R, Daniels D, Haas L, Lapis G, Lindsay B, Ng P, Obermarck R, Selinger P, Walker A, Wilms P, Yost R (1981) R*: an overview of the architecture. (IBM Research Report RJ3325), IBM Research Laboratory, San Jose, Calif
Zhang H, Fisher T (1992) Preliminary measurement of the RMTP/RTIP. In: Proc Third Int Workshop on Network and Operating System Support for Digital Audio and Video, San Diego, Calif, November
Chapter 3
Data Storage and Access Methods

In this section we focus on data storage and access methods, postponing the issues of
transactional concurrency and recovery until the next chapter.

Files and indexes are present in most OS file systems, and an age-old controversy surrounds the
question of whether database access method services can (or should) be provided by a generic file
system. Our first paper in this chapter reflects the frustrations that arose from attempts to use the
original UNIX system services for database purposes. This paper is traditionally seen as a harsh
critique of the Operating Systems community’s work at the time, but the reader should recall that
it is a critique born out of good will: the INGRES project took a leap of faith in using UNIX and
C in their very early days, and the shortcomings of UNIX that are described here are the result of
that experience. After many years of OS research and market pressure from database vendors and
customers, one can now work around most of these shortcomings reasonably gracefully in most
modern UNIXes and other operating systems. Readers with an interest in the Operating Systems
literature are encouraged to survey the various workarounds that have emerged in that community
since the time of this paper (kernel threads, scheduler activations, memory mapped files, etc.),
and understand the degree to which they help solve these problems. Despite the near-universal
understanding of these issues today, it still remains difficult to tightly integrate database needs for
storage and buffering into the file system without sacrificing performance on file system
workloads. Microsoft is purportedly doing this for their next release of Windows, which will be
interesting to watch on two counts: whether it can be done, and whether it is usable for any
DBMS other than the one written at Microsoft.

After this introduction, we switch gears to the famous first paper on RAID, which is a low-level
storage technique that has become an industry of its own. RAID revisits the industrial-revolution
idea of using armies of cheap, replaceable labor instead of using expensive, highly-specialized
labor. In the case of RAID, large high-performance disks are replaced by arrays of smaller, cheap
disks; the challenge is to do this while maintaining reliability in the face of component disk
failures. This paper is well-known for defining five RAID “levels”, of which only two are
typically remembered: Level 1 (mirroring) and Level 5 (“full” RAID). Mirroring is a very old
technique, but is still the recommended storage scheme for database systems, since it provides
reliability while keeping the storage system’s performance overhead low; moreover, it allows the
DBMS to maintain control over disk layout. RAID 5 provides a much more storage-efficient
solution while maintaining good reliability, which makes it attractive to storage customers. But
in practice it has been observed that the raw performance penalty for writes in RAID 5 is quite
high, due to the need to read and update parity bits for every write. (The next paper in this section
quotes a 4x performance penalty for RAID writes, quite a bit worse than the back-of-the-envelope
predictions here.) This is particularly bad for the On-Line Transaction Processing applications
that form the traditional bread and butter of the database industry.
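
To see where that write penalty comes from, it helps to trace the small-write path in RAID 5: updating one block requires reading the old data and the old parity, recomputing the parity, and writing both back, i.e. four I/Os for a single logical write. The sketch below only illustrates that read-modify-write cycle; the toy Disk class and the fixed parity position are our simplifying assumptions (real RAID 5 rotates parity across the disks).

    class Disk:
        """Toy in-memory 'disk' addressed by stripe number."""
        def __init__(self, nstripes, block_size=4096):
            self.blocks = [bytes(block_size) for _ in range(nstripes)]
        def read(self, stripe):
            return self.blocks[stripe]
        def write(self, stripe, block):
            self.blocks[stripe] = block

    def raid5_small_write(disks, stripe, data_disk, parity_disk, new_block):
        # Read-modify-write of data and parity: 2 reads + 2 writes = 4 I/Os.
        old_data = disks[data_disk].read(stripe)        # I/O 1
        old_parity = disks[parity_disk].read(stripe)    # I/O 2
        new_parity = bytes(p ^ d ^ n for p, d, n in
                           zip(old_parity, old_data, new_block))
        disks[data_disk].write(stripe, new_block)       # I/O 3
        disks[parity_disk].write(stripe, new_parity)    # I/O 4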

Given this somewhat negative introduction, some notes on RAID are in order. First, since the
time of this paper various projects have proposed hybrid RAID schemes to allow for both
efficient writes and compact storage. The HP AutoRAID system [WGSS96] is perhaps the best-
known of these, using mirroring for frequently-updated items, and RAID 5 for colder data, with
policies to “promote” and “demote” between the two. A second note arises from market realities.
The storage systems industry is driven by filesystem workloads first, and database workloads
second. Hence RAID became an entrenched technology without considering database
Introduction 203

performance requirements. Despite the inefficiencies of RAID 5 for database workloads, many
customers have insisted on using RAID 5 for its storage cost benefits, often blaming the database
vendors for the resulting bad performance. As a result, most database systems now have tuning
knobs that try to mask the inefficiencies of RAID 5, by tuning buffer replacement and log
flushing policies.

Our next paper by Gray and Graefe presents some rules of thumb for buffer replacement policies
in a number of settings, including RAID environments. The content of the paper is important, but
the style is equally important, since it is an example of one popular approach to systems research.
The paper is an exercise in a “scientific” (which is to say “observational”) style of systems
research, in which developments are viewed over a long period of time to try and extract
important technology trends. To their credit, Gray and Graefe avoid calling their observations
“laws” (a la Moore’s “Law”) and stick with the more accurate term “rules of thumb”. This style
of research necessarily glosses over the specific ideas behind technological innovations, in the
interest of seeing a bigger picture. Another important aspect of this paper is its stress on choosing
appropriate metrics, a key feature of good research. It is easy for engineers, researchers and
customers to become obsessed with a performance metric like “throughput” without asking
whether that metric accurately reflects their actual needs. This paper attempts to re-define storage
performance metrics to reflect the shifting economic cost of storage systems; as a result it focuses
on cost/performance ratios rather than raw performance, and examines both the cost and
performance trends of the technologies.
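
The best-known result in the paper can itself be stated as a one-line cost/performance formula: keep a page in memory if it is re-referenced more often than once every "break-even interval," where the interval is (pages per MB of RAM / random accesses per second per disk) × (price of a disk drive / price of a MB of RAM). The little calculator below simply encodes that rule-of-thumb structure; the sample inputs are illustrative placeholders rather than the paper's measurements, and should be replaced with current prices to repeat the exercise.

    def break_even_interval_seconds(pages_per_mb_ram, accesses_per_sec_per_disk,
                                    price_per_disk, price_per_mb_ram):
        # Cache a page in RAM if it is re-read more often than once per
        # this many seconds.
        technology_ratio = pages_per_mb_ram / accesses_per_sec_per_disk
        economic_ratio = price_per_disk / price_per_mb_ram
        return technology_ratio * economic_ratio

    # Illustrative (made-up) inputs: 128 8-KB pages per MB of RAM, 64 random
    # I/Os per second per disk, a $2000 disk, $15 per MB of RAM.
    print(break_even_interval_seconds(128, 64, 2000, 15))  # ~267 s, i.e. minutes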

While we are on the topic of buffering, we note that we have not included a paper on buffer
replacement policy in this edition of the book. Database aficionados should have at least one
such scheme in their repertoire, so we discuss them briefly here. In previous editions we included
the DBMIN paper by Chou and DeWitt [CD85]; we dropped it in this edition because it is not
really a practical scheme. DBMIN is based on the idea that since the buffer replacement policy is
sensitive to the query plan, the optimizer can dictate different replacement policies to be used for
various blocks depending on the particular query. Unfortunately, in practice systems often run
more than one query at a time, and these queries may share blocks but use them in different ways.
Planning custom replacement policies for evolving, multi-query workloads becomes messy.
However the DBMIN paper remains a worthwhile read because it highlights many of the
important database access patterns that drive replacement policy decisions.

Two attractively simple replacement policies that work better than LRU are the LRU-k [OOW93]
and 2Q [JS94] schemes. Both schemes try to improve on LRU by tracking the inter-arrival rate
of requests for the same page, and using it to predict which page currently in the buffer pool will
be the last to be re-referenced. LRU-k is a generalization of LRU, in which instead of just
remembering the time of the last reference for each page, you remember the times of the last k
references. 2Q is an attempt to provide the LRU-2 behavior with a lower-overhead algorithm –
LRU-based schemes suffer from the requirement of managing a priority heap to remember what
the current “least” is. Unfortunately, inter-arrival time estimations like LRU-k and 2Q do not
help with the single-user looping access patterns mentioned at the beginning of this section: the
inter-arrival rate is the same for all pages, and if ties are broken via LRU then nothing has been gained.
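
To make the LRU-k idea concrete, here is a minimal LRU-2 victim-selection sketch. It is a simplification for illustration only: the full algorithm in [OOW93] also handles a correlated-reference period and retains history for recently evicted pages, neither of which is modeled here.

    import math

    class LRU2Buffer:
        """Toy LRU-2: evict the page whose second-most-recent reference is
        oldest; pages seen only once sort first, with ties broken by plain LRU."""
        def __init__(self, capacity):
            self.capacity = capacity
            self.hist = {}      # page -> (last_ref_time, second_last_ref_time)
            self.clock = 0

        def reference(self, page):
            self.clock += 1
            if page in self.hist:
                last, _ = self.hist[page]
                self.hist[page] = (self.clock, last)
                return None                              # buffer hit
            victim = None
            if len(self.hist) >= self.capacity:
                victim = min(self.hist, key=lambda p: (self.hist[p][1],
                                                       self.hist[p][0]))
                del self.hist[victim]
            self.hist[page] = (self.clock, -math.inf)    # first reference
            return victim                                # evicted page, if any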

We conclude the chapter by moving up from the storage and buffering layers to discuss access
methods. By now, most of the popular access methods (heap files, B+-trees, Linear Hashing) are
well covered in the better undergraduate database textbooks. Hence we only discuss the more
advanced multidimensional access methods, which index data along multiple dimensions
simultaneously. These access methods are most often used today for geographic data, to find data
within a two-dimensional spatial range. They also have use in advanced applications like image
and string searching that are driven by “similar-to” queries (e.g. “find all images similar to this
one”). Similarity search is often supported by taking a complex object like an image, and
constructing a “signature” or “feature vector” that is an array of numbers. For example, an image
can be represented by the histogram of pixel-colors that appear in the image. These signatures
can often be in very large numbers of dimensions, but statistical techniques (e.g. the Singular
Value Decomposition) can often be used to project them down to 5 or 10 dimensions, often
without significantly affecting query results. The resulting low-dimensionality data can often be
indexed effectively by a multidimensional access method.
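
A bare-bones version of that signature pipeline, assuming the images have already been reduced to fixed-length histogram vectors, might look like the following; the choice of k and the use of NumPy's SVD are ours, purely for illustration.

    import numpy as np

    def project_signatures(signatures, k=10):
        """Project high-dimensional feature vectors (one per row) down to k
        dimensions so a multidimensional access method can index them."""
        X = np.asarray(signatures, dtype=float)
        mean = X.mean(axis=0)
        U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
        basis = Vt[:k]                      # top-k right singular vectors
        low = (X - mean) @ basis.T          # k-dimensional points to index
        return low, mean, basis

    # A query signature q must be mapped with the same mean and basis before
    # probing the index:  q_low = (q - mean) @ basis.T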

Probably the most-cited multidimensional search structure for databases is the R-tree [Gutt84],
which is a generalization of the B+-tree. Numerous variants and improvements to the R-tree have
been proposed since then, and we present the R*-tree in this section as a representative of that
body of work. The R*-tree is simple and intuitive, and has been shown to out-perform the R-tree
in many experiments beyond those presented in this paper. The R*-tree modifies the R-tree in
three key ways; two are minor, and one is major. The first two proposals include a different
heuristic for choosing leaf nodes during insertion, and a different heuristic for re-apportioning
data during page splits. More radically, the R*-tree proposes a “Forced Reinsertion” heuristic
that postpones page splits in favor of reinserting data from the top of the tree. The rationale for
this is that old insertion decisions in R*-trees can be sub-optimal in the face of subsequent
decisions. A disadvantage of this scheme is the impact on concurrency: it turns a single insertion
into multiple insertions, which translates into a higher probability of conflicts with other ongoing
transactions walking the tree.

The word “heuristic” appears very often in the previous paragraph because most of the practical
schemes for multidimensional indexing are heuristics – there are typically no proofs of worst-case
or average-case performance (unlike B+-trees, which are well understood both practically and
theoretically.) In fact, for most multidimensional indexing schemes it is not difficult to construct an
“adversarial” workload (a set of insertions and queries) that makes them perform terribly – this is
an interesting exercise for the critical reader. Underlying this discussion is a need to understand
how well a truly excellent index could do if one existed; Indexability Theory [HKM+02] provides
an approach to developing such bounds, including tradeoffs between query I/Os and storage. A
certain degree of theoretical work has been done to design new multi-dimensional, disk-based
indexes that have understandable and even tight bounds (e.g. [ASV99]), but this work has yet to
be made practical for real systems.

Readers wanting to become experts in multidimensional indexing have a great deal of reading to
do, since the literature is littered with proposals that are hard to weigh against each other. A good
survey of multidimensional access methods was written a few years ago [GG98], though it does
not include many of the latest structures. We warn the reader to bring a critical eye to
multidimensional index papers. When reading these proposals, it is often worth asking (a)
whether the access method outperforms sequential scan of a heap file (many do not, due to the
overhead of random I/Os!) (b) whether the claims of performance benefits are on a realistic
workload (considering both the experimental data and the queries), and (c) whether other natural
workloads will result in poor performance. Unfortunately, this large body of work lacks
theoretical rigor, standard benchmarks, and industrial “war stories”. As a result, most
implementations stick with a few of the early structures like R-trees.

There are a number of schemes that attempt to map multi-dimensional data onto one dimension,
and translate multi-dimensional queries into B+-tree lookups [Jag90,BBK98,etc.]. These have
the attraction that they do not require new access method implementations, with the attendant
complexities of index concurrency and recovery that we will discuss in the next chapter. There
are also useful extensions to multidimensional index schemes to support similarity search [HS95]
and bulk-loading [LLE97]. A difficult challenge in this arena is for an optimizer to estimate the
number of disk I/Os that a multidimensional index will perform [BF95,Aok99].
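
The simplest such mapping is bit interleaving (the Z-order, or Morton, curve): the bits of the coordinates are shuffled together into a single integer, which then serves as an ordinary B+-tree key, and a multidimensional range query decomposes into one or more key-range scans. The linear clusterings studied in [Jag90] and the Pyramid Technique of [BBK98] are more refined, but the sketch below (with an arbitrary per-dimension bit width) shows the basic trick.

    def z_order_key(coords, bits_per_dim=16):
        """Interleave coordinate bits into one integer so that points that are
        close in space tend to be close in key order."""
        key = 0
        for bit in range(bits_per_dim):
            for dim, c in enumerate(coords):
                key |= ((c >> bit) & 1) << (bit * len(coords) + dim)
        return key

    # Example: the 2-D point (x=3, y=5) becomes the one-dimensional key 39.
    print(z_order_key((3, 5)))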

On the practical side, there has been almost no industrial-strength work on multidimensional
indexes that includes serious concurrency control and recovery. Most of the commercial database
systems do not currently have a multidimensional index tightly integrated into the system; instead
they provide “glue” to access remote multidimensional index servers, and/or they provide a table-
partitioning scheme instead of a true index. Both of these approaches translate either into poor
concurrency or the potential for inconsistent indexes. An exception is the work of Kornacker
[KMH97, Kor99], whose approach for concurrency and recovery is applicable to many
multidimensional database indexes and was implemented in Informix.

We are no longer very optimistic about research into new multidimensional indexing tricks; there
have simply been too many proposals with too little evaluation, and new ideas are unlikely to
have any impact on real systems. What is needed in this space is either theoretically optimal
indexes that are practically useful in real systems, or agreed-upon benchmarks from popular
applications that can guide heuristic index designers. In the absence of such developments,
system designers are left hedging their bets. They can either minimize their implementation
investment by adopting simple but flawed schemes like R-trees, or they can invest in an
extensible framework to allow for application-specific index schemes now, and the ability to
quickly adopt provably good ideas in the future. We will discuss extensibility in detail in Chapter
5.

References

[Aok99] Paul M. Aoki. How to Avoid Building DataBlades® That Know the Value of
Everything and the Cost of Nothing. In Proc. 11th Int'l Conf. on Scientific and Statistical
Database Management (SSDBM), Cleveland, OH, July 1999, 122-133.

[ASV99] Lars Arge, Vasilis Samoladas and Jeffrey Scott Vitter. “On Two-Dimensional
Indexability and Optimal Range Search Index”. In Proc. ACM SIGACT-SIGMOD-SIGART
Symposium on Principles of Database Systems (PODS), May-June, 1999.

[BBK98] S. Berchtold, C. Böhm and H.-P. Kriegel. “The Pyramid-Technique: Towards Breaking
the Curse of Dimensionality.” In Proc. ACM-SIGMOD International Conference on
Management of Data, 1998.

[BF95] Alberto Belussi and Christos Faloutsos. “Estimating the Selectivity of Spatial Queries
Using the ‘Correlation’ Fractal Dimension.” In Proc. International Conference on Very Large
Data Bases (VLDB), pp. 299-310, 1995.

[CD85] Hong-Tai Chou and David J. DeWitt. An Evaluation of Buffer Management Strategies
for Relational Database Systems. In Proceedings of 11th International Conference on Very Large
Data Bases (VLDB), pages 127-141, Stockholm, Sweden, August 1985.

[GG98] Volker Gaede and Oliver Günther. “Multidimensional Access Methods”. ACM
Computing Surveys, 30(2), 1998.

[Gutt84] Antonin Guttman. R-Trees: A Dynamic Index Structure For Spatial Searching. In Proc.
ACM-SIGMOD International Conference on Management of Data, pages 47-57, Boston, June
1984.

[HKM+02] Joseph M. Hellerstein, Elias Koutsoupias, Daniel P. Miranker, Christos H.
Papadimitriou and Vasilis Samoladas. “On a model of indexability and its bounds for range
queries.” Journal of the ACM (JACM) 49 (1):35-55, January, 2002.

[HS95] G. Hjaltason and H. Samet. “Ranking in Spatial Databases.” In Proc 4th Int. Symp. on
Spatial Databases (SSD), Portland, USA, pp.83-95, Aug. 1995.

[Jag90] H. V. Jagadish. “Linear Clustering of Objects with Multiple Attributes”. In Proc. ACM-
SIGMOD International Conference on Management of Data, pp. 332-342, 1990.

[JS94] T. Johnson and D. Shasha, "2Q: A low overhead high performance buffer management
replacement algorithm," In Proc. International Conference on Very Large Data Bases (VLDB),
pp. 297-306, 1994.

[KMH97] Marcel Kornacker, C. Mohan and Joseph M. Hellerstein. “Concurrency and Recovery
in Generalized Search Trees”. In Proc. ACM SIGMOD Conf. on Management of Data, Tucson,
AZ, May 1997, 62-72.

[Kor99] Marcel Kornacker. “High-Performance Extensible Indexing.” In Proc. of 25th
International Conference on Very Large Data Bases (VLDB), Edinburgh, Scotland, September
1999.

[LLE97] S.T. Leutenegger, M.A. Lopez and J.M. Edgington. STR: A Simple and Efficient
Algorithm for R-Tree Packing. In Proc. of the International Conference on Data Engineering
(ICDE), 1997.

[OOW93] Elizabeth J. O'Neil, Patrick E. O'Neil, and Gerhard Weikum. “The LRU-K Page
Replacement Algorithm For Database Disk Buffering.” In Proceedings ACM SIGMOD
International Conference on Management of Data, pages 297-306, Washington, D.C., May 1993.

[WGSS96] John Wilkes, Richard Golding, Carl Staelin, and Tim Sullivan. “The HP AutoRAID
Hierarchical Storage System.” ACM Transactions on Computer Systems, 14(1):108-136, Feb.
1996.
Chapter 4
Transaction Management

As is well known, transaction management consists of concurrency control and crash recovery.
The seminal work in this area is the 1975 paper by Jim Gray and company, which contains a good
presentation of multi-granularity two-phase locking and degrees of consistency. Multi-
granularity locking is at the heart of every serious database system’s concurrency control scheme.
The definitions of the degrees of consistency in this paper are not entirely declarative – they
depend upon a lock-based implementation of concurrency control. This has led to some
confusion in the SQL standards [BBG+95]. Adya et al. propose a more robust but complex
definition [ALO00].
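
For readers who have not seen it before, the core of the scheme is the lock-mode compatibility matrix from the Gray et al. paper: intention modes (IS, IX) are taken on coarse granules (database, file, table) on the way down to a fine-grained S or X lock, and SIX combines S with IX. The snippet below merely encodes that standard matrix; the Python representation is ours, not the paper's.

    # True means the two modes may be held concurrently on the same granule.
    COMPATIBLE = {
        ("IS", "IS"): True,  ("IS", "IX"): True,  ("IS", "S"): True,  ("IS", "SIX"): True,  ("IS", "X"): False,
        ("IX", "IS"): True,  ("IX", "IX"): True,  ("IX", "S"): False, ("IX", "SIX"): False, ("IX", "X"): False,
        ("S",  "IS"): True,  ("S",  "IX"): False, ("S",  "S"): True,  ("S",  "SIX"): False, ("S",  "X"): False,
        ("SIX","IS"): True,  ("SIX","IX"): False, ("SIX","S"): False, ("SIX","SIX"): False, ("SIX","X"): False,
        ("X",  "IS"): False, ("X",  "IX"): False, ("X",  "S"): False, ("X",  "SIX"): False, ("X",  "X"): False,
    }

    def can_grant(requested, held_modes):
        """Grant a request only if it is compatible with every mode already
        held on the same granule by other transactions."""
        return all(COMPATIBLE[(requested, held)] for held in held_modes)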

Locking is pessimistic in that a transaction is blocked if there is any possibility that a nonserializable schedule could result. On the other hand, a DBMS could use an optimistic
algorithm that allowed a transaction to continue processing when serializability could be
compromised, in the belief that it probably won’t be. We have included the original paper on
optimistic methods as the second paper in this chapter.

There have been a number of simulation studies that compare dynamic locking to alternate
concurrency control schemes including optimistic methods – we include one as the third paper in
this chapter. This paper is an example of a category of systems research that is not often covered
in collections such as this, namely performance analysis. Many system design decisions are too
complex to think through analytically, and require detailed simulation to get right. An interesting
aspect of this paper is that it not only explores a performance problem, it also explains the
contradictory conclusions of previous analyses. The net result of this study is that dynamic
locking wins except when unrealistic assumptions are made, such as the existence of an arbitrary
number of processors. Hence all commercial relational systems use dynamic locking as the
mechanism of choice. However in some client-server environments, there is sometimes enough
idle time for optimistic techniques to make sense [FCL97]. For a good treatment of the various
other concurrency control techniques, the reader is directed to [BHG87].

Only a few embellishments on dynamic locking have been shown to make any sense. First, it is
reasonable for read transactions to access the database as of a time in the recent past. This allows
such a transaction to set no locks, as explained in [Cha82]. Permitting a reader to set no locks
increases parallelism and has been implemented commercially by several vendors. The second
embellishment concerns “hot spots” in databases. In high transaction rate systems there are often
records that many transactions wish to read and write. The ultimate hot spot is a one-record
database. In order to do a large number of transactions per second to a one-record database, some
new technique must be used. One possibility is to use some form of escrow transactions, as
discussed in [One86] and implemented in a variety of IBM products. A third embellishment is
that increased parallelism can be obtained in tree-based indexes if a special (non-two-phase)
locking protocol is used for index pages. One protocol for these special cases in B+-trees is
discussed in the fourth paper in this chapter by Lehman and Yao; this protocol has been extended
to other structures like R-trees as well [KMH97]. An alternative family of protocols is
described in [Moh96], but these protocols only work for B+-trees.

Crash recovery is the second required service performed by any transaction manager, and we
have included the very readable paper by Haerder and Reuter as an introduction to this topic. This
is universally done via some form of Write Ahead Log (WAL) technique in which information
is written to a log to assist with recovery. After a crash, this log is processed backward, undoing
the effect of uncommitted transactions, and then forward, redoing the effect of committed
transactions.

There are a variety of approaches to the contents of a log. At one extreme, a physical log records
all physical changes onto secondary storage; that is, the before- and after-image of each changed
bit in the database is recorded in the log. In this case, an insert of a new record in a relation will
require part of the data page on which it is placed to be logged. In addition, an insert must be
performed for each B-tree index defined on the relation in question. Inserting a new key into a B-
tree index will cause about half the bits on the corresponding page to be moved and result in a log
record on average the size of a page. Performing an insert into a relation with K indexes will
generate a collection of log records with combined length in excess of K pages on the average.

The objective of logical or event logging is to reduce the size of the log. In this case, a log is a
collection of events for each of which the system supplies an undo and redo routine. For
example, an event might be of the form “a record with values X1, …, Xn was inserted into relation
R.” If this event must be undone, then the corresponding undo routine would remove the record
from the relation in question and delete appropriate keys from any indexes. Similarly, the redo
routine would regenerate the insert and perform index insertions.
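
As a sketch of what such an event-style record might look like, consider the class below. The db, table, and index interfaces it calls are hypothetical, invented only to show the shape of the idea: the log holds one small record per event, and the system supplies redo and undo routines that re-execute or reverse the event against the relation and all of its indexes.

    class InsertLogRecord:
        """Logical ('event') log record for inserting one tuple."""
        def __init__(self, lsn, xid, relation, values):
            self.lsn = lsn            # log sequence number
            self.xid = xid            # transaction id
            self.relation = relation  # relation name
            self.values = values      # the inserted tuple (x1, ..., xn)

        def redo(self, db):
            rid = db.table(self.relation).insert(self.values)
            for index in db.indexes_on(self.relation):
                index.insert(index.key_of(self.values), rid)

        def undo(self, db):
            db.table(self.relation).delete(self.values)
            for index in db.indexes_on(self.relation):
                index.delete(index.key_of(self.values))

The record stays a few dozen bytes no matter how many index pages the insert actually touches; the price, as noted above, is that recovery must re-run real operations rather than copy page images.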

It is obvious that logical logging results in a log of reduced size but requires a longer recovery
time, because logical undo or redo is presumably slower than physical undo or redo. However
there are problems with logical logging. For example, during page splits a B-tree will be
physically inconsistent. A concurrency scheme will hide this inconsistency during normal
operation, but if a crash happens inopportunely, the B-tree’s structure of nodes and pointers may
be corrupted. This is no problem with physical logging, because the B-tree will be recovered
utilizing the log. However, with logical logging there is no B-tree log, and restoring structural
integrity must be guaranteed in another way. One option is to achieve this via careful ordered
writes to the database: allocate two new pages for the split page and write these pages to disk
before updating the parent page. Ordered writes complicate the buffer manager: they require it to
keep track not only of which dirty pages may be replaced, but also of the order in which they can
be replaced. Alternatively, one can do physiological logging, e.g. physically log structural
modifications to an index tree, but logically log the insertion of the tuple. There are a large
number of possible systems utilizing combinations of physical and logical logging.

Haerder and Reuter categorize logging techniques as:


• Atomic vs. Not Atomic
• Force vs. No Force
• Steal vs. No Steal
Loosely, atomic means a shadow page recovery scheme, whereas not atomic represents an
update-in-place algorithm. Force represents the technique of forcing dirty pages from the buffer
pool when a transaction commits. No force is the converse. Last, steal connotes the possibility
that dirty data pages will be written to disk prior to the end of the transaction, whereas no steal is
the opposite. Any recovery scheme (e.g. some of the ones suggested in the OS literature) should
be evaluated by placing it into this taxonomy; it is then easy to decide whether the scheme is
suitable without worrying about its details. Although Haerder and Reuter discuss the subject as if
there were a collection of reasonable techniques, in fact most commercial database systems use:
• Not Atomic
• No Force
• Steal

The basic reasoning is fairly simple. Atomic writing of pages would require that the DBMS use a
shadow page technique. As discussed in [Gra81], use of this technique was one of the major
mistakes of System R and had to be corrected in DB2. All commercial systems do “update in
place”, which essentially entails that writes to the disk not be atomic. Second, a DBMS will go
vastly slower if it forces pages at commit time. Hence nobody takes “force” seriously. Also “no
steal” requires enough buffer space to hold the updates of the largest transaction. Because
“batch” transactions are still quite common, no commercial system is willing to make this
assumption. To this taxonomy one can also add Physical vs. Logical log records, though as we
discuss above hybrids are common.
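
A minimal sketch of what the "steal, no force" choice means for the buffer manager follows; it is
our own illustration, assuming a hypothetical log object with append and flush operations, and the
write-ahead-log rule is the only constraint on when a dirty page may go to disk:

    class BufferManager:
        """Illustrative steal / no-force buffer manager obeying the WAL rule."""

        def __init__(self, log, disk):
            self.log = log                 # hypothetical log with append()/flush()
            self.disk = disk               # hypothetical page store with write()
            self.frames = {}               # page_id -> (page, page_lsn, dirty)

        def evict(self, page_id):
            page, page_lsn, dirty = self.frames.pop(page_id)
            if dirty:
                # STEAL: an uncommitted page may be written out, but only after the
                # log records describing its changes are safely on disk (WAL rule).
                self.log.flush(up_to_lsn=page_lsn)
                self.disk.write(page_id, page)

        def commit(self, txn_id):
            # NO FORCE: only the commit record is forced at commit time;
            # the transaction's dirty pages can stay in the buffer pool.
            self.log.append(("COMMIT", txn_id))
            self.log.flush()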

As noted above, the conventional wisdom is to recover from crashes for which the disk is intact
by processing the log backward, performing undo, and then forward, performing redo. If the disk
is not intact, then the system must restore a dump and then process the log forward from the dump,
performing redo. Last, uncommitted transactions must be undone by processing the log
backward, performing undo. Recovering from these two kinds of crashes with different
techniques complicates transaction code. The work on ARIES, included as our next paper, shows
that a uniform algorithm can be used such that redo is always performed first, followed by undo.
The key idea in ARIES is to “repeat history”, and reconstruct the database system’s state at the
time of the crash; after that, the abort of any uncommitted transactions uses the same logic that is
used when aborting transactions during normal operation.
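
The following sketch (a drastic simplification of ARIES that omits checkpoints, compensation log
records, and most bookkeeping; the log and db objects are hypothetical) conveys the shape of the
algorithm: redo repeats history for every logged update, and only then are the loser transactions
rolled back with the ordinary abort logic.

    def recover(log, db):
        """Simplified ARIES-style recovery: analysis, redo history, then undo losers."""
        # Analysis (simplified): losers are transactions with no commit/abort record.
        losers = set()
        for rec in log:
            if rec.kind == "UPDATE":
                losers.add(rec.txn)
            elif rec.kind in ("COMMIT", "ABORT"):
                losers.discard(rec.txn)

        # Redo: repeat history for all transactions, committed or not, so the
        # database reaches exactly the state it was in at the time of the crash.
        for rec in log:
            if rec.kind == "UPDATE" and rec.lsn > db.page_lsn(rec.page_id):
                db.apply_redo(rec)

        # Undo: roll back the losers, scanning backward, using the same update-undo
        # logic that a normal transaction abort would use.
        for rec in reversed(log):
            if rec.kind == "UPDATE" and rec.txn in losers:
                db.apply_undo(rec)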

In theory, ARIES should result in simpler algorithms, but the ARIES paper is perhaps the most
complicated paper in this collection. There are two good overviews of ARIES that the reader
might consider before diving into the details: one is in Ramakrishnan and Gehrke’s undergraduate
textbook, the other is a survey paper by Mike Franklin in the Handbook of Computer Science
[Fra97]. (These overviews are summarized in a slide set available on the website for this book.) The full ARIES paper here is complicated significantly by its diversionary discussions
of the drawbacks of alternative design decisions along the way. On the first pass, the reader is
encouraged to ignore this material and focus solely on the ARIES approach; the drawbacks of
alternatives are important to understand, but should be saved for a more careful second read. The
actual ARIES protocols treat two issues that complicate the presentation further. One issue is the
support for efficiently managing internal database state like heap file free-space maps and
indexes. This leads to mechanisms for nested top actions and logical undo logging. Logical
undo logging has other uses beyond managing internal state – it is key to exotic concurrency
schemes like escrow transactions. The second issue is a set of tricks to minimize system
downtime during recovery. In practice, it is important for recovery time to appear as short as
possible, since many customers demand so-called 24×7 operation, i.e. continuous availability.

Our last two papers in this section focus on the complications that arise when implementing
concurrency and recovery in distributed databases, where coordination is done over a network,
and partial failure of the distributed system can lead to confusion if one machine tries to move
forward while disconnected from another.

Unfortunately, most concurrency control techniques discussed in the literature are not very
realistic. For a survey of available techniques consult [BG81]. For example, the SDD-1
concurrency control scheme was based on timestamps and conflict graphs [BG80]. This scheme
unfortunately does not allow a transaction to abort, assumes that transactions within a single
transaction class are sequenced outside the model, and allows a transaction to send only one
message to each site. All of these assumptions are unrealistic in a distributed environment, and
timestamp techniques have not enjoyed any measure of success. Moreover, it is clearly difficult to
design conflict graphs, as transactions can arbitrarily be assigned to classes. Even CCA, who
invented the SDD-1 algorithms, gave up on them in their next prototype, ADAPLEX [Cha82].
The reader is advised to carefully evaluate the reasonableness of the assumptions required in
many of the schemes in the literature.

In our opinion, distributed concurrency control is quite simple. In practice, distributed
concurrency schemes must allow a heterogeneous set of systems to coordinate via an open,
standard API. In an open, interoperable architecture, distributed concurrency control must be built
on top of local facilities provided by each underlying data manager. At the moment, all
commercial products use some variation on locking. Unless there is some sort of global standard
that requires a local data manager to send its local “waits-for” graph to somebody else, it will be
impossible to do any sort of global deadlock detection because the prerequisite information
cannot be assembled from the local data managers. Hence, timeout is the only deadlock detection
scheme that will work in this environment. As a result, setting locks at the local sites within the
local data manager and using timeout for deadlock detection will be the solution used.
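
A sketch of that approach, with illustrative names only: each site's local lock manager grants
locks as usual, and a lock request that waits longer than a threshold is treated as a presumed
deadlock, aborting the requesting transaction.

    import threading

    class LocalLockManager:
        """Illustrative local lock manager; distributed deadlocks resolved by timeout."""

        def __init__(self, timeout_secs=5.0):
            self.locks = {}                     # object_id -> threading.Lock
            self.timeout = timeout_secs

        def acquire(self, txn_id, object_id):
            lock = self.locks.setdefault(object_id, threading.Lock())
            if not lock.acquire(timeout=self.timeout):
                # The wait may be a genuine deadlock spanning sites whose waits-for
                # graphs we cannot see; abort rather than wait forever.
                raise TimeoutError(f"transaction {txn_id} presumed deadlocked on {object_id}")

        def release(self, object_id):
            self.locks[object_id].release()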

Crash recovery, on the other hand, is a much more complex subject. A distributed transaction
must be committed everywhere or aborted everywhere. Since there is a local data manager at each
site, it can successfully perform a local commit or abort. The only challenge is for a transaction
coordinator to ensure that all local data managers commit or all abort. The main idea is very
simple, and has come to be called a two-phase commit. When the coordinator is ready to commit
a global transaction he cannot simply send out a commit message to each site. The problem is that
site A must flush all its log pages for the local portion of the transaction and then write a commit
record. This could take one or more I/Os for a substantial transaction and consume perhaps
hundreds of milliseconds on a busy system. Add perhaps a second of message delay and
operating system overhead, and there is perhaps a two second period from the time the
coordinator sends out the commit message during which disaster is possible. Specifically, if site A
crashes then it will not have committed the transaction, and moreover, it will not be able to
commit later because the prerequisite log pages were still in main memory at the time of the crash
and therefore were lost in the crash. On the other hand, the other sites could have remained up
and successfully committed the transaction as directed. In this scenario all sites except A have
committed the transaction, and site A cannot commit. Hence, we have failed to achieve the
objective of every site committing. As a result, there is a window of uncertainty during the
commit process during which a failure will be catastrophic. Such windows of uncertainty have
been studied in [Coo82].

The basic solution to this problem is for the coordinator to send out a “prepare” message prior to
the commit. This will instruct each local site to force all the log pages for a transaction, so that the
transaction can be successfully committed even if there is a site crash. The basic algorithm is
described in the paper by Mohan et al., which is our next selection in this book. However, the idea
seems to have been simultaneously invented by several researchers. With a two-phase commit, a
distributed DBMS can successfully recover from all single site failures, all multiple site failures,
and certain cases of network partitions. The only drawback of a two-phase commit is that it
requires another round of messages in the protocol. Hence, this resiliency to crashes does not
come for free, and there is a definite “level of service” versus cost tradeoff.
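
A minimal sketch of the coordinator side of two-phase commit follows (our illustration; the
participant methods prepare/commit/abort are assumptions, and the prepare call is understood to
force the participant's log before it votes):

    def two_phase_commit(coordinator_log, participants, txn_id):
        """Illustrative 2PC coordinator: one extra round of messages buys resilience."""
        # Phase 1: every participant forces its log records for txn_id, then votes.
        votes = [site.prepare(txn_id) for site in participants]

        if all(vote == "YES" for vote in votes):
            # Force the global decision to the coordinator's log before announcing it,
            # so the decision survives a coordinator crash.
            coordinator_log.force(("COMMIT", txn_id))
            for site in participants:
                site.commit(txn_id)     # a prepared site can commit even after a crash
            return "committed"
        else:
            coordinator_log.force(("ABORT", txn_id))
            for site in participants:
                site.abort(txn_id)
            return "aborted"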

Concerning multiple copies of objects, there are a large number of algorithms that have been
proposed, e.g. [Tho79, Gif79, Sto79, Her84]. Unfortunately virtually all algorithms are of limited
utility, because they fail to deal with constraints imposed by the reality of the commercial
marketplace. The first constraint is that a multiple copy algorithm must be optimized for the case
that the number of copies is exactly two. There are few DBMS clients interested in 20 copies of a
multi-terabyte database. In general they want two, to ensure that they can stay up in the presence
of a single failure. The second constraint is that a read request must be satisfied by performing a
single physical read to exactly one copy. Any scheme that slows down reads is not likely to win
much real-world acceptance. Consequently, schemes which require a transaction to lock a
quorum of the copies will fail this litmus test. They will require read locks to be set on both
copies in a 2 copy system in order to satisfy a read request. Such an algorithm will lose to a
scheme which locks both copies only on writes and one copy on reads. Such a “read-one-write-
all” algorithm is presented in [ESC85]. An interesting survey of other algorithms can be found in
[BHG87, DGS85].
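
The read-one-write-all discipline is short to state (again as an illustrative sketch; the lock
manager and copy objects are hypothetical): a read locks and touches a single copy, while a write
locks and updates every copy inside the transaction.

    def read(txn_id, item, copies, lock_mgr):
        """Read-one: a single read lock and a single physical read suffice."""
        copy = copies[item][0]                     # any copy, e.g. the local one
        lock_mgr.acquire(txn_id, (item, copy), mode="S")
        return copy.read(item)

    def write(txn_id, item, value, copies, lock_mgr):
        """Write-all: both (all) copies are locked and updated within the transaction."""
        for copy in copies[item]:
            lock_mgr.acquire(txn_id, (item, copy), mode="X")
            copy.write(item, value)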

More recently, the experience of real-world users with replication systems has generated the
following unfortunate state of affairs. If one wants to ensure transactional consistency between a
data set and its replica, then a two-phase commit protocol must be utilized. The extra messages
required to commit a transaction entail an overhead and a delay in committing that are
unacceptable in the real world – particularly if some of the participants in a transaction are
likely to be disconnected frequently. On the other hand, if one implements a scheme that does not
include transactional consistency, then there is no semantic guarantee that can be made regarding
the relative states of the two replicas. As such, one can either implement an impossibly expensive
(but correct) replication scheme, or one that has no consistency guarantees at all. This obvious
dilemma has plagued users for some time. Our last paper in this chapter, by Gray et al.,
attempts to quantify the problems that arise with either of these approaches; the analysis is fairly
pessimistic, but the point is well taken. It also presents one simple solution to this dilemma by
sticking with a “single-mastered” database that allows users to play with copies while they are
offline, but without promising that their actions will persist.

References

[ALO00] Atul Adya, Barbara Liskov, and Patrick O'Neil. Generalized Isolation Level
Definitions. In 16th International Conference on Data Engineering (ICDE), San Diego, CA,
February 2000.

[BBG+95] Hal Berenson, Philip A. Bernstein, Jim Gray, Jim Melton, Elizabeth J. O'Neil, and
Patrick E. O'Neil. A Critique of ANSI SQL Isolation Levels. In Proc. ACM SIGMOD
International Conference on Management of Data, pages 1-10, San Jose, CA, May 1995.

[BG80] Philip A. Bernstein and Nathan Goodman. “Timestamp-Based Algorithms for
Concurrency Control in Distributed Database Systems”. In Proc. Sixth International Conference
on Very Large Data Bases (VLDB). Montreal, Canada, October 1980.

[BG81] Philip Bernstein and Nathan Goodman. "Concurrency Control in Distributed Database
Systems," Computing Surveys, June 1981, pages 185-222.

[BHG87] Philip Bernstein, Vassos Hadzilacos and Nathan Goodman. Concurrency Control and
Recovery in Database Systems, Addison-Wesley, Reading, MA, 1987.

[Cha82] Chan, A., et al. “The Implementation of an Integrated Concurrency Control and
Recovery Scheme,” Proc. 1982 ACM-SIGMOD Conference on Management of Data, Orlando,
FL, June 1982.

[Coo82] Eric C. Cooper. “Analysis of Distributed Commit Protocols.” In Proc. ACM-SIGMOD
International Conference on Management of Data. June 2-4, 1982, Orlando, Florida.

[DGS85] Susan B. Davidson, Hector Garcia-Molina and Dale Skeen. “Consistency in Partitioned
Networks,” ACM Computing Surveys 17(3): 341-370, September 1985.

[ESC85] Amr El Abbadi and Dale Skeen and Flaviu Cristian. “An Efficient, Fault-Tolerant
Protocol for Replicated Data Management.” In Proceedings of the Fourth ACM SIGACT-
SIGMOD Symposium on Principle of Database Systems (PODS), March 25-27, 1985, Portland,
Oregon, pp. 215-229.

[FCL97] M. Franklin, M. Carey, M. Livny: "Transactional Client-Server Cache Consistency:
Alternatives and Performance". ACM Transactions on Database Systems, 22(3), September 1997.

[Fra97] Franklin, M.J., “Concurrency Control and Recovery”, in The Handbook of Computer
Science and Engineering, A. Tucker, ed., CRC Press, Boca Raton, FL 1997.

[Gif79] Gifford, D., “Weighted Voting for Replicated Data,” Proc. 7th Symposium on Operating
System Principles, Dec. 1979.

[Gra81] Gray, J., et al., “The Recovery Manager of the System R Database Manager,”
Computing Surveys, June 1981.

[Her84] Herlihy, M., “General Quorum Consensus: A Replication Method for Abstract Data
Types,” Dept. of Computer Science, CMU, Pittsburgh, Pa., CMU-CS-84-164, Dec. 1984.

[KMH97] Marcel Kornacker, C. Mohan, and Joseph M. Hellerstein. “Concurrency and Recovery
in Generalized Search Trees”. In Proc. ACM SIGMOD International Conference on Management
of Data, pages 62-72, Tucson, AZ, May 1997.

[Moh96] Mohan, C., “Concurrency Control and Recovery Methods for B+-Tree Indexes:
ARIES/KVL and ARIES/IM,” Performance of Concurrency Control Mechanisms in Centralized
Database Systems, Vijay Kumar (ed.), Prentice-Hall, 1996.

[One86] O’Neil, P., “The Escrow Transactional Method,” ACM TODS 11(4), Dec. 1986.

[Sto79] Stonebraker, M., “Concurrency Control and Consistency of Multiple Copies in
Distributed INGRES,” IEEE Transactions on Software Engineering 5(3), March 1979.

[Tho79] Thomas, R., “A Majority Consensus Approach to Concurrency Control for Multiple
Copy Distributed Database Systems,” ACM Transactions On Database Systems (TODS) 4(2),
June 1979.

Efficient Locking for Concurrent Operations on B-Trees 657

Fig. 6. A node.

Another approach to concurrent operations on B-trees is currently under investigation by Kwong
and Wood.

3.3 Blink-Trees for Concurrency

The Blink-tree is a B*-tree modified by adding a single “link” pointer field to each node (see
Figure 6). (We pronounce Blink as “B-link-tree.”) This link field points to the next node at the
same level of the tree as the current node, except that the link pointer of the rightmost node on a
level is a null pointer. This definition for link pointers is consistent, since all leaf nodes lie at
the same level of the tree. The Blink-tree has all of the nodes at a particular level chained
together into a linked list, as illustrated in Figure 7.

The purpose of the link pointer is to provide an additional method for reaching a node. When a
node is split because of data overflow, a single node is replaced by two new nodes. The link
pointer of the first new node points to the second node; the link pointer of the second node
contains the old contents of the link pointer field of the first node. Usually, the first new node
occupies the same physical page on the disk as the old single node. The intent of this scheme is
that the two nodes, since they are joined by a link pointer, are functionally essentially the same
as a single node until the proper pointer from their father can be added. The precise search and
insertion algorithms for Blink-trees are given in the next two sections.

For any given node in the tree (except the first node on any level) there are (usually) two
pointers in the tree that point to that node (a “son” pointer from the father of the node and a
link pointer from the left twin of the node). One of these pointers must be created when a node is
inserted into the tree. We specify that of these two, the link pointer must exist first; that is, it
is legal to have a node in the tree that has no parent, but has a left twin. This is still defined
to be a valid tree structure, since the new “right twin” is reachable from the “left twin.” (These
two twins might still be thought of as a single node.) Of course, the pointer from the father must
be added quickly for good search time.

Link pointers have the advantage that they are introduced simultaneously with the splitting of the
node. Therefore, the link pointer serves as a “temporary fix” that allows correct concurrent
operation, even before all of the usual tree pointers are changed for a new (split) node. If the
search key exceeds the highest value in a node (as indicated by the high key), it indicates that
the tree structure has been changed, and that the twin node should be accessed using the link
pointer. While this is slightly less efficient (we need to do an extra disk read to follow a link
pointer), it is a correct method of reaching a leaf node. The link pointers should be used
relatively infrequently, since the splitting of a node is an exceptional case.
Chapter 5
Extensibility

An extensible system is one that allows components to be added to the system’s core in the field.
It usually implies that such components can be written by third-party developers (“extenders”)
who are not experts in the system’s internals. The system designers publish an extensibility API
that exposes the relevant system interfaces, and this should be sufficient information for extenders
to add their application-specific code.

Extensibility is one of the major challenges in system architecture. Of course, any system can be
extended with sufficient effort by hacking the source code. But a good extensibility interface
should expose just enough of the system’s needs to make the extender’s application logic run fast.
Database systems research dealt with extensibility relatively early on, because (a) other than
database systems, most shared servers at the time were already general-purpose timesharing
systems with Turing-complete interfaces (i.e. programming shells), and (b) since database
applications are data-intensive, there were major performance benefits available in moving
application logic into the system, rather than copying large amounts of the database out of the
system into application space. There is a sizeable literature on extensible operating systems
[RTY+87, HC92, EKO95, SESS96, etc.], but this largely emerged later than the DBMS work,
and in part as a result of lessons from the database community (e.g. the Stonebraker paper on
“Operating System Support for Database Management” in this book).

The unifying theme of extensible system research is to cleanly factor out components of a system,
identify the fundamental interactions among the components, and share components whenever
possible. The goal of achieving a “separation of concerns” is a part of system architecture
religion, but extensibility work really forces designers to observe the religion. A typical
characteristic of good extensible designs is that they teach something new about what were
thought to be well-understood ideas. When reading the papers in this chapter, it is important to
go beyond the specifics of the designs, and think about how the designs shed light on the problem
at hand: what are the key issues and how do they interact; what techniques are available to
address each issue; how can techniques be mixed and matched to achieve new points in the
design space?

Theoreticians often have a hard time appreciating these kinds of system architecture issues, but
there is an analogous exercise in mathematics: capturing a complex system via a minimal set of
simple axioms. Extensibility research is an effort to “axiomatize” system architectures. Complex
software systems are never as clean as mathematical axiom systems, but the spirit of the exercise
and the insights to be gained are analogous.

The Need to Extend Early Relational Systems

Codd’s relational model is admirably succinct; it takes a strong stand on the key issue of
normalization (“flat” relations with no pointers), but says very little else. In particular, it makes
no restrictions on the set of data types that can be used in the columns of the system, and no
restriction on the predicates that are applied to such data types. The initial relational systems
were thus free to choose whatever type systems and predicates they saw fit. However, neither of
the two famous prototype DBMSs focused on elaborate type systems. Both INGRES and System
R implemented a fixed, fairly traditional set of possible column types: basically numeric types of
various sorts, character types, and fixed- and variable-length strings – with lengths limited by a
constant upper bound (in order to ensure that each tuple would fit on a disk page). The predicates
were those that were natural to the types: arithmetic comparisons over arithmetic expressions, and
simple string matching.

The INGRES project was the first to run across the limitations of its fixed type system.
The initial work on INGRES was funded under an urban planning grant, and geographic data
(roads, land boundaries, etc.) was always a scenario of interest in the project. Unfortunately, the
natural queries in a Geographic Information System (GIS) are geometric, and these are clumsy to
express using a relational language with simple types and predicates (an example is given in the
first paper in this section). Moreover, even if one can express these queries, the features of a
typical DBMS are not designed to make these specialized queries run fast. Special indexes and
optimizations can improve performance of these queries by orders of magnitude.

GIS is a specialized application, and one could build a special data model and DBMS for it.
However very similar problems arise in Computer Aided Design (CAD), which has circuit and
chip diagrams that are not unlike road data in GIS systems. And as time progresses there seem to
be more and more unusual kinds of information that do not mesh well with simple data types:
time-series data (e.g. stock histories), network layouts, multimedia, marked-up documents, and so
on. Each of these applications has its own idiosyncratic data types and predicates, but all of them
also shared traditional database modeling needs for many of their attributes. Given the difficulty
of designing and implementing a DBMS, the market cannot support a specialized database system
for every class of application. Some kind of more flexible system is required.

Extensibility in POSTGRES

ADT-INGRES was an early incremental effort to address type extensibility in a database system.
That effort morphed into the POSTGRES project at Berkeley, which discarded the INGRES code
base and began anew. Both systems attempted to build a relational database that allowed
Abstract Data Types (ADTs) to be used in column declarations for tables, and in the comparisons
and expressions in predicates.

The initial paper in this section was a “first strike” in that agenda. It proposes metadata that a
system must manage in order to allow post-hoc additions of ADTs to the system. All the type
and predicate information in the system is table-driven from the database catalog, and commands
are introduced to the data definition language to define new ADTs and associated functions. In
addition to language and catalog issues, the paper outlines some of the challenges in making such
a system provide respectable performance for queries over ADTs. In particular, it highlights the
need for an Extensible Access Method Interface (EAMI) for ADTs, and the ability for the
optimizer to reason about these access methods. The POSTGRES system was an almost direct
incarnation of this design, including the table-driven ADTs and the EAMI interface for new
access methods.
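
To give a flavor of the table-driven approach (a hypothetical sketch, not the actual POSTGRES
catalog layout), the catalogs record each new type with its input/output functions and each
operator with an implementation and a selectivity estimate, so that the parser and optimizer
consult tables rather than hard-coded lists:

    # Hypothetical, simplified catalogs for user-defined types and operators.
    TYPE_CATALOG = {}       # type name -> {"input": fn, "output": fn}
    OPERATOR_CATALOG = {}   # (op, left type, right type) -> {"func": fn, "selectivity": fn}

    def define_type(name, input_fn, output_fn):
        TYPE_CATALOG[name] = {"input": input_fn, "output": output_fn}

    def define_operator(op, left, right, func, selectivity):
        # The selectivity estimate is what lets the optimizer reason about the operator.
        OPERATOR_CATALOG[(op, left, right)] = {"func": func, "selectivity": selectivity}

    # Example: a 2-D "box" type (xlo, ylo, xhi, yhi) with an overlap operator.
    define_type("box",
                input_fn=lambda s: tuple(float(x) for x in s.split(",")),
                output_fn=lambda b: ",".join(str(x) for x in b))
    define_operator("&&", "box", "box",
                    func=lambda a, b: a[0] <= b[2] and b[0] <= a[2] and
                                      a[1] <= b[3] and b[1] <= a[3],
                    selectivity=lambda: 0.05)   # crude constant estimate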

By any measure, the POSTGRES design was extremely influential. POSTGRES was
commercialized as the Illustra system, which was purchased by Informix, which marketed the
technology heavily. In response, IBM and Oracle stepped up their efforts to include ADT
features in their systems. Microsoft is finally catching up on this front as well. Thus essentially
all modern relational systems support ADTs in a manner analogous to the one proposed here. In
hindsight, the biggest extensibility issue that POSTGRES missed was security; although the
problem was mentioned in the paper included here, it was never addressed in the system or the
research. The possibility of server crashes and data corruption due to extensions eventually
became a big issue in the commercial marketplace. The canonical solution today is to use a Java
Virtual Machine or a scripting language interpreter (Perl, Python, etc.). Of course, hindsight is
always 20-20. There was no Java in the days of POSTGRES, interpreted languages were too
slow on the hardware of the day, and the world of computing was quite a bit more idealistic and
cooperative than it is today. Still, security is an important theme in extensibility, and it’s one that
has seen more focus in the OS and language communities than in the DB community.

GiST

To demonstrate the benefits of extensible access methods, the POSTGRES group implemented R-
trees in the system; this gave POSTGRES a distinct performance advantage over traditional
systems for a variety of non-traditional applications.

However, history shows that nobody outside the POSTGRES group ever added another access
method to POSTGRES, largely because the EAMI proposed in the paper is at too low a level.
Users with extensibility needs were simply incapable of writing their own access methods – that
task required not only inventing and implementing such an access method, but also coding the
access method so that it interacted correctly with the DBMS’ pre-existing concurrency control
and recovery system. This was essentially impossible without becoming an expert on the
POSTGRES internals.

Our second paper on Generalized Search Trees (GiST) raises the level of abstraction for an access
method extensibility interface. It begins with the observation that most of the application-specific
indexing schemes proposed in the database literature behave structurally very much like B+-trees. It
then attempts to design a minimalist extensibility API that removes all data semantics from B+-
trees, leaving only the structure of a balanced tree with data at the leaves that grows by splitting
upwards. By leaving all the structure modification logic opaque to the extender, the GiST can be
made to handle all the tricky concurrency and recovery logic internally [KMH97], with no need
for any application-specific knowledge. The result is a far more approachable extensibility
interface than the EAMI, with no sacrifice in performance. Many research groups have
implemented custom indexing schemes over GiST, something that never occurred with the
POSTGRES EAMI. In terms of performance, the flexibility of GiST often allows it to be tuned
to run faster than traditional “built-in” indexes. Informix had one of the few commercial R-tree
implementations in a DBMS. They implemented GiST in their engine as well, but discovered
that their R-tree extension over GiST ran faster than their “native” R-tree implementation
[Kor99]. The open-source PostGIS system, a GIS built over the open-source PostgreSQL
system, also uses GiST rather than the native POSTGRES R-trees (GiST was added to
PostgreSQL – via the EAMI interface – in the late 1990s).
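
In rough pseudo-Python, the division of labor looks like the sketch below; the four method names
follow key methods described in the GiST paper (Compress and Decompress are omitted), while the
classes themselves are our illustration. The extender supplies only these data-specific methods;
search, insertion, node splitting, concurrency, and recovery stay inside the tree.

    from abc import ABC, abstractmethod

    class GiSTExtension(ABC):
        """Data-specific methods supplied by the extender; tree mechanics stay in the core."""

        @abstractmethod
        def consistent(self, key, query):
            """May the subtree described by key contain answers to query?"""

        @abstractmethod
        def union(self, keys):
            """Return a key that describes (bounds) every key in keys."""

        @abstractmethod
        def penalty(self, existing_key, new_key):
            """Cost of inserting new_key below existing_key; guides the descent."""

        @abstractmethod
        def pick_split(self, keys):
            """Partition an overflowing node's keys into two groups."""

    # Example extension: R-tree-style bounding boxes (xlo, ylo, xhi, yhi).
    class BoxExtension(GiSTExtension):
        def consistent(self, key, query):
            return (key[0] <= query[2] and query[0] <= key[2] and
                    key[1] <= query[3] and query[1] <= key[3])

        def union(self, keys):
            return (min(k[0] for k in keys), min(k[1] for k in keys),
                    max(k[2] for k in keys), max(k[3] for k in keys))

        def penalty(self, existing_key, new_key):
            area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
            return area(self.union([existing_key, new_key])) - area(existing_key)

        def pick_split(self, keys):
            keys = sorted(keys)            # naive split along the x axis
            mid = len(keys) // 2
            return keys[:mid], keys[mid:]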

We include the first GiST paper here since it is a good introduction, but we note that the interface
was actually simplified over time. In particular, the reader should ignore the special-case
interfaces for simulating B+-trees efficiently. Cleaner versions of the standard interface appear in
Kornacker’s work on concurrency and recovery [KMH97], and tricks for special traversals
(including B+-tree traversal as well as nearest-neighbor traversals) are generalized and clarified
by Aoki [Aok98]. Theoretically-minded readers are also referred to [HKMPS02] for exposition
of the idea of indexability theory that is raised at the end of the paper included here.

Extensible Optimizers

As noted in our first paper, extensibility in the query language has to be supported by the query
optimizer, or efficiency suffers enormously. POSTGRES addressed this issue to some extent; for
example, the EAMI allowed the optimizer to know about relevant indexes, and subsequent
research showed how the system could be made to efficiently optimize queries with time-
consuming code embedded in the ADTs [Hel98,CS99,Aok99].

Other extensibility projects were even more aggressive in exploring the possibilities of an
extensible query optimizer. The Starburst project at IBM had an optimizer that was designed to
generalize the System R optimizer and make it extensible in two ways. First, the set of physical
operators in the query executor could be extended, and it was important to be able to easily
“teach” the optimizer to use new operators intelligently. Second, the set of logical operators in
the query language could be extended, in order to support new query language features. Our
next paper by Guy Lohman describes the Starburst design for an extensible query optimizer.
Essentially it abstracts the dynamic programming and pruning of the Selinger algorithm, and
exposes the expansions done during dynamic programming as grammar-like rules. This approach
is particularly attractive given the popularity of Selinger’s optimizer. A competing scheme was
proposed by Graefe and DeWitt in the Exodus project at Wisconsin, and was refined over the
years by Graefe in the Volcano [GM93], Cascades [Gra95], and MS SQL Server systems. The
Graefe/DeWitt approach generates a complete initial plan based on heuristics and explores the
search space from there, making simple local modifications to algebraic plans (e.g. replacing a
logical algebra operator with a physical operator, swapping physical operators, flipping the two
inputs to a join, reordering a pair of adjacent joins, etc.), until all legal physical plans are
considered. The legal modifications are expressed in Exodus as rules, somewhat similar to
Starburst, but the Exodus rules transform plans, while the Starburst rules generate them. The
subsequent work by Graefe is an interesting alternative to the Selinger approach with both pros
and cons, and anyone aspiring to be an expert on query optimization should know this work well.
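
To make the contrast concrete, the toy sketch below (ours; all names are illustrative) shows a
transformation rule in the Exodus/Volcano spirit: the rule rewrites one algebraic plan into another,
and the optimizer explores the space of plans reachable through such rewrites. A Starburst-style
rule would instead generate the alternatives for a logical operator, grammar-fashion, during
dynamic programming.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Scan:
        table: str

    @dataclass(frozen=True)
    class Join:
        left: object
        right: object

    def join_commutativity(plan):
        """Toy transformation rule: Join(A, B) -> Join(B, A)."""
        if isinstance(plan, Join):
            yield Join(plan.right, plan.left)

    def rewrites(plan, rules):
        """Plans reachable by applying one rule at the root or inside a subtree."""
        for rule in rules:
            yield from rule(plan)
        if isinstance(plan, Join):
            for alt in rewrites(plan.left, rules):
                yield Join(alt, plan.right)
            for alt in rewrites(plan.right, rules):
                yield Join(plan.left, alt)

    def explore(plan, rules):
        """Naive exhaustive exploration of the transformation space (no costing shown)."""
        seen, frontier = {plan}, [plan]
        while frontier:
            for alt in rewrites(frontier.pop(), rules):
                if alt not in seen:
                    seen.add(alt)
                    frontier.append(alt)
        return seen

    # e.g. explore(Join(Scan("R"), Join(Scan("S"), Scan("T"))), [join_commutativity])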

Extensible query optimization is particularly interesting because it points the way to a very
general future for database optimization and execution architectures. If the details of the
relational algebra can be abstracted away from a query optimizer, then it should be possible to
apply database-like optimization schemes to any dataflow-oriented programming model, of which
there are many [KMC+00, LP95, AYKJ00, etc.].

Historical Context

Before we leave this topic, some context is useful for readers who wish to pursue more of the
literature in this area. Historically, database extensibility research was undertaken in system
prototypes that were exploring enhanced data models, in particular object-oriented data models.
In retrospect, it is very useful to separate the contributions of those systems into architectural
issues involving extensibility, and data modeling issues involving language and schema design.
Unfortunately there is no systematic paper that lays out all the various ideas in extensible
database systems and the data models that went with them, and tries to show which of the
architectural ideas can be combined with which of the modeling and language features. Two
systems with reasonable overview papers are Postgres [SK91] and O2 [D90]. Designers of
systems for new data models and languages (XML-based approaches come to mind, but there will
be others in future) are encouraged to read the prior work carefully, to tease apart the architectural
ideas from the specifics of the data model research, and consider mixing and matching the ideas
to solve current needs.

As an alternative to extensible systems, there was also research into so-called “toolkits” for
generating application-specific DBMSs – notably the EXODUS project at Wisconsin, and the Genesis
project at UT-Austin. The toolkit idea was to decompose a DBMS cleanly into subsystems, so
that a collection of subsystems and extensions could be cobbled together easily to build an
application-specific DBMS. In these systems, the components had to be extensible to support
general reuse. For example, the optimizer in EXODUS had to support “any” query language and
“any” set of execution operators. In practice, neither EXODUS nor Genesis was seriously used in
any configurations other than the default. However, as noted above, the EXODUS optimizer was
influential, both as a precursor of the Cascades work [Gra95] now used in MS SQL Server,
and as the main point of contrast to the Starburst work (which lives on in IBM DB2).

References

[AYKJ00] Danielle Argiro, Mark Young, Steve Kubica, and Steve Jorgensen. “Khoros: An
Integrated Development Environment for Scientific Computing and Visualization.”
In Enabling Technologies for Computational Science: Frameworks, Middleware and
Environments, Kluwer Academic Publishers, March 2000, pp. 147-157.

[Aok98] P.M. Aoki. “Generalizing ‘Search’ in Generalized Search Trees”. In Proc. 14th IEEE
Int'l Conf. on Data Engineering (ICDE '98), Orlando, FL, Feb. 1998, 380-389.

[Aok99] P.M. Aoki. “How to Avoid Building DataBlades® That Know the Value of Everything
and the Cost of Nothing”. Proc. 11th IEEE Int'l Conf. on Scientific and Statistical Database
Mgmt. (SSDBM '99), Cleveland, OH, July 1999, 122-133.

[CS99] Surajit Chaudhuri and Kyuseok Shim. Optimization of Queries with User- Defined
Predicates. ACM Transactions on Database Systems (TODS), 24(2):177- 228, 1999.

[D90] O. Deux et al. "The Story of O2.” IEEE Trans. Knowledge and Data Eng. 2(1):91–108,
March, 1990.

[EKO95] Dawson R. Engler, M. Frans Kaashoek, and James O'Toole Jr. “Exokernel: an
operating system architecture for application-level resource management.” In Proceedings of the
15th ACM Symposium on Operating Systems Principles (SOSP), Copper Mountain Resort,
Colorado, December 1995.

[GM93] Goetz Graefe and William J. McKenna. The Volcano Optimizer Generator:
Extensibility and Efficient Search. In Proc. 9th International Conference on Data Engineering
(ICDE), pages 209-218, Vienna, Austria, April 1993.

[Gra95] Goetz Graefe. The Cascades Framework for Query Optimization. IEEE Data
Engineering Bulletin, 18(3):19-29, 1995.

[HC92] K. Harty and D. Cheriton. “Application controlled physical memory using external page
cache management.” In Fifth International Conference on Architectural Support for
Programming Languages and Operating Systems (ASPLOS), October 1992.

[Hel98] Joseph M. Hellerstein. Optimization Techniques for Queries with Expensive Methods.
ACM Transactions on Database Systems (TODS), 23(2):113-157, 1998.

[HKMPS02] J. M. Hellerstein, E. Koutsoupias, D. P. Miranker, C. H. Papadimitriou, and V.
Samoladas. “On a model of indexability and its bounds for range queries”. Journal of the ACM
(JACM) 49(1):35-55, 2002.

[KMC+00] Eddie Kohler, Robert Morris, Benjie Chen, John Jannotti and M. Frans Kaashoek. “The
Click Modular Router.” ACM Transactions on Computer Systems (TOCS) 18(3): 263-297, August
2000.

[KMH97] Marcel Kornacker, C. Mohan and Joseph M. Hellerstein. “Concurrency and Recovery
in Generalized Search Trees”. In Proc. ACM SIGMOD Conf. on Management of Data, Tucson,
AZ, May 1997, 62-72.

[Kor99] Marcel Kornacker. High-Performance Extensible Indexing. In Proceedings of 25th
International Conference on Very Large Data Bases (VLDB), September 7-10, 1999, Edinburgh,
Scotland.

[LP95] Edward A. Lee and Thomas M. Parks. “Dataflow Process Networks.” Proceedings of
the IEEE, 83(5):773-801, May, 1995.

[RTY+87] Richard Rashid, Avadis Tevanian, Jr., Michael Young, David Golub, Robert Baron,
David Black, William Bolosky, and Jonathan Chew. “Machine-Independent Virtual Memory
Management for Paged Uniprocessor and Multiprocessor Architectures”. In Proceedings of the
2nd Symposium on Architectural Support for Programming Languages and Operating Systems
(ASPLOS), October, 1987.

[SESS96] Margo Seltzer, Yasuhiro Endo, Christopher Small, and Keith A. Smith. “Dealing with
Disaster: Surviving Misbehaved Kernel Extensions”. In Proceedings of the 1996 Symposium on
Operating Systems Design and Implementation (OSDI), Seattle, WA, October 1996.

[SK91] Michael Stonebraker and Greg Kemnitz. “The POSTGRES next generation database
management system.” Communications of the ACM 34 (10): 78 – 92, October 1991.
Chapter 6
Database Evolution

A common adage of the IT world says that “there is no such thing as a static database.”
Generally speaking, a schema is developed for one application, and then over time that schema
must be modified as a result of a variety of factors, including:

• Changing business conditions (your bank buys another bank, and the IT systems must be
merged)

• Changing requirements (the government changes the rules)

• Changing application mix (the collection of access paths previously thought best are no
longer a good choice)

• The Web (you now need to exchange information with your customers and suppliers)

This is undoubtedly only a partial list of the reasons why schemas must be altered over time.

Whenever the schema must change, there are several issues to confront. These include physical
schema evolution, logical schema evolution, and change management. We discuss these three
topics in the rest of this introduction.

Physical Schema Evolution

Physical schema evolution includes modifying the tuning parameters of a DBMS as well as
changing the access paths to data. Current commercial DBMSs have scores of tuning parameters.
These include such esoterica as the size of the log buffer, how many fragments to horizontally
partition a table into, how big a scratch space to use for sorting tables, how many virtual
processes the system should allocate, and so on. Whenever the application mix changes, these
parameters must be reevaluated. In current systems, this task falls to a human data base
administrator.

Our personal experience is that there are not nearly enough competent DBAs to go around.
Moreover, in a desire to keep costs down enterprises always want to minimize the number of
DBAs they employ. As a result, when performance starts to degrade, the customer typically calls
the vendor who sold them the system. The vendor dispatches one of his system engineers (SEs)
to fix the problem. Hence, the complex tuning parameters typically get set by the SEs. Having
talked to many SEs, it is our opinion that most do not understand the complex interactions of
performance parameters. Therefore, many maintain a “crib sheet” of values that have worked in
other installations. A common practice is to start with the parameters that worked well in some
other installation that appears similar to this one. As a result, tuning parameters are often set by a
combination of folklore and what seemed to work elsewhere.

Obviously, it would be a good idea to set tuning parameters automatically using some sort of
automated logic, and all of the vendors are moving aggressively in this direction. Our first paper
in this section by Chaudhuri deals with choosing indexes with the help of a training set of queries
and special query evaluation calls to the optimizer. This technology has been added to the SQL
Server DBMS by Microsoft. Over time, we expect essentially all physical performance
parameters to be set automatically using this sort of technology.
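
The heart of such a tool can be sketched as follows (an illustration of the general idea only, not
Microsoft's actual interfaces; the optimizer.estimate call stands for a "what-if" costing of a
query under a hypothetical index configuration):

    from itertools import combinations

    def recommend_indexes(workload, candidates, optimizer, space_budget):
        """Pick the index set within a storage budget that minimizes estimated workload cost.

        workload: list of (query, weight) pairs from the training set.
        candidates: candidate index objects, each with a .size estimate.
        optimizer.estimate(query, config): hypothetical ("what-if") cost of query,
        assuming the indexes in config exist. All names here are illustrative.
        """
        best_config, best_cost = (), float("inf")
        for k in range(len(candidates) + 1):
            for config in combinations(candidates, k):
                if sum(ix.size for ix in config) > space_budget:
                    continue
                cost = sum(w * optimizer.estimate(q, config) for q, w in workload)
                if cost < best_cost:
                    best_config, best_cost = config, cost
        # Real tools prune this exponential search heuristically; the sketch is exhaustive.
        return best_config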

Of course, this capability will have to be optional, since there are expert DBAs who demand
access to the tuning knobs. In the early days of query optimization, an expert human could
usually beat an automatic program. Over time, it has become harder and harder for a human to
beat the query optimizer. Hence, few humans demand the capability to override the optimizer. In
the long run, we expect this evolution to occur in physical performance tuning; i.e. it will become
totally automatic.

Logical Schema Evolution

When the logical schema changes, the DBA has the problem of converting his existing
application from the old schema to the new schema. The view mechanisms in relational DBMSs
were designed to minimize or eliminate the need for program maintenance in this situation.
Hence, the old schema is defined as a view on top of the new schema, and in theory all of the
applications, which used to work on real tables, should continue to work on virtual tables.
Unfortunately, it is difficult to process update commands against views, as noted in the first paper
of this book. Hence, the possibility of isolating applications from change by using views is not
universal.

If views do not work, then the DBA faces two problems. She must evolve the schema through
more complex transformations, and application program maintenance will be inevitable. It is not
clear how to design applications that make such program maintenance as easy to accomplish as
possible.

In addition, some sort of schema evolution tool is required that will provide assistance with more
complex transformations. The design of such a tool is an incredibly important (and incredibly
hard) problem. A start at such a tool is documented in the second paper in this section by
Bernstein.

Online Change Management

An obvious goal is to perform physical schema changes without taking the DBMS offline. There
is a clear trend toward “7 X 24” operation, and to the extent possible physical changes should be
accomplished while the system is running.

Hence, it should be possible to add or drop indexes, change the number of disks over which a
table is horizontally partitioned, and alter the size of the buffer pool while the DBMS is
running. More challenging, but seemingly possible, would be to change an existing access path
from one implementation to another, for example from hash to B-tree.

To include such capabilities in a DBMS requires substantial changes to the code. For example,
the size of the buffer pool is usually a static array. To change the size of the buffer pool in many
systems requires that the system be taken down and rebooted. Obviously, a dynamic array
implementation is required to alleviate this issue.

For changes in storage structures, more complex data structures are required to support on-line
changes. The third paper in this section discusses the sort of data structures that designers must
think about to accomplish this goal. Hopefully, additional work will be done that improves the
technology in this important area.

The “holy grail” in this area would be to move from Version I to Version I+1 of a vendor’s
DBMS software without taking the data base offline.

Cut-Over

In the absence of truly online change facilities, DBMS installations must be occasionally taken
(partially or completely) offline by the DBA. When program maintenance is required, there is the
old code line and the new code line, and testing and cutover must be considered. Today’s
information systems have serious uptime requirements. Hence, it is often not practical to take a
system down for several days to reorganize the schema or install a new version of the DBMS.

As a result, it is common practice to have a second development and test system. The revised
application is built on this machine using the new configuration, and exhaustive testing is
performed. Once the system administrator is comfortable with the reliability of the new system,
he must cut over to the new system. Often, there is not enough “dead time” in the application to
dump the data out of the old system and load it into the new one. Hence, cutover is a daunting
problem in applications, such as airline reservations, where little or no downtime can be tolerated.
Techniques must be developed to make this procedure much more seamless. The “holy of holy
grails” would be to change the logical schema without taking the data base offline.

A second cutover issue is what to do if the new system fails. Although exhaustive testing is
usually performed on the development system, there are cases where the new system fails when
put into production. In this case, one can only restore the old system and go back to the drawing
board. System availability, of course, suffers in this situation. Again more seamless cutover and
cut back would be very helpful.
Applying Model Management to Classical Meta Data Problems
Philip A. Bernstein
Microsoft Research
One Microsoft Way
Redmond, WA 98052-6399
philbe@microsoft.com

Abstract

Model management is a new approach to meta data management that offers a higher level programming interface than current techniques. The main abstractions are models (e.g., schemas, interface definitions) and mappings between models. It treats these abstractions as bulk objects and offers such operators as Match, Merge, Diff, Compose, Apply, and ModelGen. This paper extends earlier treatments of these operators and applies them to three classical meta data management problems: schema integration, schema evolution, and round-trip engineering.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment. Proceedings of the 2003 CIDR Conference.

1 Introduction

Many information system problems involve the design, integration, and maintenance of complex application artifacts, such as application programs, databases, web sites, workflow scripts, formatted messages, and user interfaces. Engineers who perform this work use tools to manipulate formal descriptions, or models, of these artifacts, such as object diagrams, interface definitions, database schemas, web site layouts, control flow diagrams, XML schemas, and form definitions. This manipulation usually involves designing transformations between models, which in turn requires an explicit representation of mappings, which describe how two models are related to each other. Some examples are:

• mapping between class definitions and relational schemas to generate object wrappers,
• mapping between XML schemas to drive message translation,
• mapping between data sources and a mediated schema to drive heterogeneous data integration,
• mapping between a database schema and its next release to guide data migration or view evolution,
• mapping between an entity-relationship (ER) model and a SQL schema to navigate between a database design and its implementation,
• mapping source makefiles into target makefiles, to drive the transformation of make scripts from one programming environment to another, and
• mapping interfaces of real-time devices to the interfaces required by a system management environment to enable it to communicate with the device.

Following conventional usage, we classify these as meta data management applications, because they mostly involve manipulating descriptions of data, rather than the data itself.

Today's approach to implementing such applications is to translate the given models into an object-oriented representation and manipulate the models and mappings in that representation. The manipulation includes designing mappings between the models, generating a model from another model along with a mapping between them, modifying a model or mapping, interpreting a mapping, and generating code from a mapping. Database query languages offer little help for this kind of manipulation. Therefore, most of it is programmed using object-at-a-time primitives.

We have proposed to avoid this object-at-a-time programming by treating models and mappings as abstractions that can be manipulated by model-at-a-time and mapping-at-a-time operators [6]. We believe that an implementation of these abstractions and operators, called a model management system, could offer an order-of-magnitude improvement in programmer productivity for meta data applications.

The approach is meant to be generic in the sense that a single implementation is applicable to all of the data models in the above examples. This is possible because the same modeling concepts are used in virtually all modeling environments, such as UML, extended ER (EER), and XML Schema. Thus, an implementation that uses a representation of models that includes most of those concepts would be applicable to all such environments.

There are many published approaches to the list of meta data problems above and others like them. We borrow from these approaches by abstracting their algorithms into a small set of operators and generalizing them across applications and, to some extent, across data models. We thereby hope to offer a more powerful database platform for such applications than is available today.
In a model management system, models and mappings are syntactic structures. They are expressed in a type system, but do not have additional semantics based on a constraint language or query language. Despite this limited expressiveness, model management operators are powerful enough to avoid most object-at-a-time programming in meta data applications. And it is precisely this limited expressiveness that makes the semantics and implementation of the operators tractable.

Still, for a complete solution, meta data problems often require some semantic processing, typically the manipulation of formulas in a mathematical system, such as logic or state machines. To cope with this, model management offers an extension mechanism to exploit the power of an inferencing engine for any such mathematical system.

Before diving into details, we offer a short preview to see what model management consists of and how it can yield programmer productivity improvements. First, we summarize the main model management operators:

• Match – takes two models as input and returns a mapping between them
• Compose – takes a mapping between models A and B and a mapping between models B and C, and returns a mapping between A and C
• Diff – takes a model A and mapping between A and some model B, and returns the sub-model of A that does not participate in the mapping
• ModelGen – takes a model A, and returns a new model B based on A (typically in a different data model than A's) and a mapping between A and B
• Merge – takes two models A and B and a mapping between them, and returns the union C of A and B along with mappings between C and A, and C and B.
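
To make the shape of this algebra concrete, here is a hedged sketch of the operator signatures as they might appear in a host language. The lower-case function names and the placeholder Model and Mapping classes are assumptions of this sketch, not an API defined by the paper:

```python
from typing import Tuple


class Model:
    """A rooted set of objects (see Section 2.1)."""


class Mapping:
    """A model plus two morphisms to the models it relates (see Section 2.2)."""


def match(m1: Model, m2: Model) -> Mapping:
    """Return a mapping relating equal or similar objects of m1 and m2."""
    raise NotImplementedError


def compose(map_a: Mapping, map_b: Mapping) -> Mapping:
    """Combine two mappings that share a model into a mapping between the
    other two models."""
    raise NotImplementedError


def diff(m: Model, map1: Mapping) -> Tuple[Model, Mapping]:
    """Return the sub-model of m not covered by map1, plus a mapping from
    that sub-model back to m."""
    raise NotImplementedError


def modelgen(m: Model) -> Tuple[Model, Mapping]:
    """Generate a model in another meta-model, plus a mapping to it."""
    raise NotImplementedError


def merge(m1: Model, m2: Model, map12: Mapping) -> Tuple[Model, Mapping, Mapping]:
    """Return the union m3 of m1 and m2 (collapsing objects that map12
    declares equal), plus mappings m1-m3 and m2-m3."""
    raise NotImplementedError
```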
Second, to see how the operators might be used, consider the following example [7]: Suppose we are given a mapping map1 from a data source S1 to a data warehouse SW, and want to map a second source S2 to SW, where S2 is similar to S1. See Figure 1. (We use S1, SW, and S2 to name both the schemas and databases.) First we call Match(S1, S2) to obtain a mapping map2 between S1 and S2, which shows where S2 is the same as S1. Second, we call Compose(map1, map2) to obtain a mapping map3 between S2 and SW, which maps to SW those objects of S2 that correspond to objects of S1. To map the other objects of S2 to SW, we call Diff(S2, map3) to find the sub-model S3 of S2 that is not mapped by map3 to SW, and map4 to identify corresponding objects of S2 and S3. We can then call other operators to generate a warehouse schema for S3 and merge it into SW. The latter details are omitted, but we will see similar operator sequences later in the paper.

[Figure 1 – Using model management to help generate a data warehouse loading script. Given S1, S2, map1, SW: (1) map2 = Match(S1, S2); (2) map3 = Compose(map1, map2); (3) <S3, map4> = Diff(S2, map3).]
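
Expressed as a script against the sketch interface above (again, the function and variable names are assumptions of this sketch, not part of the paper), the Figure 1 walkthrough is just three operator calls followed by the omitted generate-and-merge steps:

```python
def map_second_source(s1, s2, sw, map1):
    """Reuse map1 (S1 -> SW) to start mapping a similar source S2 to SW.

    Uses the operator signatures sketched above; s1, s2, sw are models
    and map1 is a mapping from s1 to sw.
    """
    map2 = match(s1, s2)        # where is S2 the same as S1?
    map3 = compose(map1, map2)  # carry those correspondences over to SW
    s3, map4 = diff(s2, map3)   # the part of S2 not yet mapped to SW
    # Remaining steps (omitted in the paper's example): generate warehouse
    # structures for s3, e.g. with modelgen, and merge them into sw.
    return map3, s3, map4
```

Each value here is a bulk object; no per-object navigation code appears in the script.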
The main purpose of this paper is to define the semantics of the operators in enough detail to make the above sketchy example concrete, and to present additional examples to demonstrate that model management is a credible approach to solving problems of this type. Although this paper is not the first overview of model management, it is the most complete proposal to date. Past papers presented a short vision [5,6], an example of applying model management to a data warehouse loading scenario [7], an application of Merge to mediated schemas [22], and an initial mathematical semantics for model management [1]. We also studied the match operator [23], which has developed into a separate research area. This paper offers the following new contributions to the overall program:

• The first full description of all of the model management operators.
• New details about two of the operators, Diff and Compose, and a new proposed operator, ModelGen.
• Applications of model management to three well known meta data problems: schema integration, schema evolution, and round-trip engineering.

We regard the latter as particularly important, since they offer the first detailed demonstration that model management can help solve a wide range of meta data problems.

The paper is organized as follows: Section 2 describes the two main structures of model management, models and mappings. Section 3 describes the operators on models and mappings. Section 4 presents walkthroughs of solutions to schema integration, schema evolution, and round-trip engineering. Section 5 gives a few thoughts about implementing model management. Section 6 discusses related work. Section 7 is the conclusion.

2 Models and Mappings

2.1 Models

For the purposes of this paper, the exact choice of model representation is not important. However, there are several technical requirements on the representation of models, which the definitions of mappings and model management operators depend on.

First, a model must contain a set of objects, each of which has an identity. A model needs to be a set so that its content is well-defined (i.e., some objects are in the set while others are not). By requiring that objects have identity, we can define a mapping between models in terms of mappings between objects or combinations of objects.
Second, we want the expressiveness of the representation of models to be comparable to that of EER models. That is, objects can have attributes (i.e., properties), and can be related by is-a (i.e., generalization) relationships, has-a (i.e., aggregation or part-of) relationships, and associations (i.e., relationships with no special semantics). As well, there may be some built-in types of constraints, such as the min and max cardinality of set-valued properties.

Third, since a model is an object structure, it needs to support the usual object-at-a-time operations to create or delete an object, read or write a property, and add or remove a relationship.

Fourth, we expect objects, properties and relationships to have types. Thus, there are (at least) three meta-levels in the picture. Using conventional meta data terminology, we have: instances, which are models; a meta-model that consists of the type definitions for the objects of models; and the meta-meta-model, which is the representation language in which models and meta-models are expressed. We avoid using the term "data model," because it is ambiguous in the meta data world. In some contexts, it means the meta-meta-model, e.g., in a relational database system, the relational data model is the meta-meta-model. In other contexts, it means the meta-model; for example, in a model management system, a relational schema (such as the personnel schema) is a model, which is an instance of the relational meta-model (which says that a relational schema consists of table definitions, column definitions, etc.), where both the model and meta-model are represented in the meta-meta-model (such as an EER model).

Since a goal of model management is to be as generic as possible, a rich representation is desirable so that when a model is imported from another data model, little or no semantics is lost. However, to ensure model management operators are implementable, some compromises are inevitable between expressiveness and tractability.

To simplify the discussion in this paper, we define a model to be a set of objects, each of which has properties, has-a relationships, and associations. We assume that a model is identified by its root object and includes exactly the set of objects reachable from the root by paths of has-a relationships. In an implementation, we would expect a richer model comparable to EER models.

2.2 Mappings

Given two models M1 and M2, a morphism over M1 and M2 is a binary relation over the objects of the two models. That is, it is a set of pairs <o1, o2> where o1 and o2 are in M1 and M2 respectively. A mapping between models M1 and M2 is a model, map12, and two morphisms, one between map12 and M1 and another between map12 and M2. Thus, each object m in mapping map12 can relate a set of objects in M1 to a set of objects in M2, namely the objects that are related to m via the morphisms. For example, in Figure 2, Mapee is a mapping between models Emp and Employee, where has-a relationships are represented by solid lines and morphisms by dashed lines.

[Figure 2 – An example of a mapping: the mapping Mapee, with objects 1–4, relates Emp (Emp#, Name) to Employee (EmployeeID, FirstName, LastName) through a morphism between Emp and Mapee and another between Mapee and Employee.]
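
The structures just defined are small enough to sketch directly. The following is a minimal, assumed encoding (not the paper's implementation); the names Obj, Model, Morphism, and Mapping are introduced here for illustration only:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Obj:
    """An object with identity, properties, and has-a children."""
    oid: str
    properties: Dict[str, str] = field(default_factory=dict)
    children: List["Obj"] = field(default_factory=list)   # has-a relationships


@dataclass
class Model:
    """A model is identified by its root; its objects are those reachable
    from the root over has-a relationships."""
    root: Obj

    def objects(self) -> List[Obj]:
        result, stack = [], [self.root]
        while stack:
            o = stack.pop()
            result.append(o)
            stack.extend(o.children)
        return result


# A morphism is simply a binary relation over object ids of two models.
Morphism = Set[Tuple[str, str]]


@dataclass
class Mapping:
    """A mapping is itself a model, plus morphisms to the two models it relates."""
    model: Model
    to_left: Morphism    # pairs (mapping object id, left-model object id)
    to_right: Morphism   # pairs (mapping object id, right-model object id)
```

In Figure 2, for instance, Mapee's object 2 would appear in to_left paired with Name and in to_right paired with FirstName and LastName.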
In effect, a mapping reifies the concept of a relationship between models. That is, instead of representing the relationship as a set of pairs (of objects), a mapping represents it as a set of objects (each of which can relate objects in the two models). In our experience, this reification is often needed for satisfactory expressiveness. For example, if the mapping in Figure 2 were represented as a relationship, it would presumably include the pairs <Name, FirstName> and <Name, LastName>, which loses the structure in Mapee that shows FirstName and LastName as components of Name.

In addition to enabling more structural expressiveness, reifying a mapping also allows us to attach custom semantics to it. We can do this by having a property called Expression for each object m in a mapping, which is an expression whose variables include the objects that m directly or indirectly references in M1 and M2. For example, in Figure 2 we could associate an expression with object 2 that says Name equals the concatenation of FirstName and LastName. We will have more to say about the nature of these expressions at the end of Section 3.

Despite these benefits of reifying mappings as models, we expect there is value in specializing model management operators to operate directly on morphisms, rather than mappings. However, such a specialization is outside the scope of this paper. Thus, the operators discussed here work on models and mappings, but not on morphisms (separately from the mappings that contain them).

3 Model Management Algebra

3.1 Match

The operator Match takes two models as input and returns a mapping between them. The mapping identifies combinations of objects in the input models that are either equal or similar, based on some externally provided definition of equality and similarity. In some cases, the definition is quite simple. For example, the equality of two objects may be based on equality of their identifiers or names. In other cases, it is quite complex and perhaps subjective. For example, the equality of database schema objects for databases that were independently developed by different enterprises may depend on different terminologies used to name objects.
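
As an illustration of the simplest case only (matching by name equality), Match can be approximated as follows; this reuses the assumed Obj, Model, Morphism, and Mapping structures from the sketch above and is not the paper's algorithm:

```python
def match_by_name(left: Model, right: Model) -> Mapping:
    """Naive Match: pair up objects whose 'name' property is identical."""
    map_root = Obj(oid="map_root")
    to_left: Morphism = {("map_root", left.root.oid)}
    to_right: Morphism = {("map_root", right.root.oid)}

    right_by_name = {
        o.properties.get("name"): o
        for o in right.objects()
        if "name" in o.properties
    }
    candidates = (o for o in left.objects() if "name" in o.properties)
    for i, lobj in enumerate(candidates):
        robj = right_by_name.get(lobj.properties["name"])
        if robj is not None:
            m = Obj(oid=f"m{i}")            # one mapping object per correspondence
            map_root.children.append(m)     # keep the mapping a rooted model
            to_left.add((m.oid, lobj.oid))
            to_right.add((m.oid, robj.oid))
    return Mapping(Model(map_root), to_left, to_right)
```

A realistic Elementary Match would at least normalize names and respect structure; Complex Match, as discussed below, is better thought of as a design environment than as a single function.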
This range of definitions of equality leads to two versions of the match operator: Elementary Match and Complex Match. Elementary Match is based on the simple definition of equality. It is used where that simple definition is likely to yield an accurate mapping, e.g.,
when one model is known to be an incremental Second, recall that a model is the set of objects reach-
modification of another model. able by paths of has-a relationships from the root. Since
Complex Match is based on complex definitions of the result of Diff may equal any subset of the objects of
equality. Although it need not set the Expression property M1, some of those objects may not be connected to the
on mapping objects, it should at least distinguish sets of Diff result’s root. If they are not, the result of Diff is not a
objects that are equal (=) from those that are only similar model. For example, consider Diff(Employee, Mapee) on
(≅). By similar, we mean that they are related but we do the models and mapping in Figure 4. Since FirstName and
not express exactly how. For example, in Figure 3, object LastName are not referenced by Mapee’s morphism
1 says that Emp# and EmployeeID are equal, while object between Employee and Mapee, they are in the result.
2 says that Name is similar to a combination of FirstName However, Name is not in the result, so FirstName and
and LastName. A human mapping designer might update LastName are not connected to the root, Employee, of the
object 2’s Expression property to say that Name equals result and therefore are not in that model. This is undesir-
the concatenation of FirstName and LastName. able, since such objects cannot be subsequently processed
by other operators, all of which expect a model as input.
Emp Mapee Employee Therefore, to ensure that the result of Diff is a well-
= formed model, for every object o in the result, we require
EmployeeID the result to include all objects O on a path of has-a
Emp# 1
relationships from the M1 object referenced by map1’s
=
root to o. Objects in O that are referenced in map1’s
Name FirstName morphism to M1 are called support objects, because they
2 are added only to support the structural integrity of the
≅ model. For example, in Figure 5, Name is a support object
LastName
in the result of Diff(Employee, Mapee).
Figure 3 A mapping output from Complex Match
Emp Mapee Employee
In practice, Complex Match is not an algorithm that
returns a mapping but rather is a design environment to
help a human designer develop a mapping. It potentially Emp# 1 EmployeeID
benefits from using technology from a variety of fields:
graph isomorphism to identify structural similarity in Name 2 Name
large models; natural language processing to identify
similarity of names or to analyze text documentation of a
FirstName
model; domain-specific thesauri; and machine learning
and data mining to use similarity of data instances to infer
the equality of model objects. A recent survey of LastName
approaches to Complex Match is [23]. Figure 4 Diff(Employee, mapee ) includes FirstName
3.2 Diff and LastName but not Name
Intuitively, the difference between two models is the set Having made this decision, we now have a third
of objects in one model that do not correspond to any problem, namely, in the model that is returned by Diff,
object in the other model. One part of computing a how to distinguish support objects from objects that are
difference is determining which objects do correspond. meant to be in the result of Diff (i.e., that do not
This is the main function of Match. Rather than repeating participate in map1)? We could simply mark support
this semantics as part of the diff operator, we compute a objects in the result. But this introduces another structure,
difference relative to a given mapping, which may have namely a marked model. To avoid this complication, we
been computed by an invocation of Match. Thus, given a use our two existing structures to represent the result,
mapping map1 between models M1 and M2, the operator namely, model and mapping. That is, the result of Diff is
Diff(M1, map1) returns the objects of M1 that are not a pair <M1′, map2>, where
referenced in map1’s morphism between M1 and map1. • M1′ includes a copy of: the M1 object r referenced by
There are three problems with this definition of Diff, map1’s root; the set S of objects in M1 that are not
which require changing it a bit. First, the root of map1 referenced by map1’s morphism between map1 and
always references an object (often the root) of M1, so the M1; all support objects, i.e., those on a path of has-a
result of Diff(M1, map1) would not include that object. relationships from r to an object in S that are not
This is inconvenient, because it makes it hard to align the otherwise required in M1′; every has-a relationship
result of Diff with M1 in subsequent operations. We will between two objects of M1 that are also in M1′; and
see examples of this in Section 4. Therefore, we alter the every association between two objects in S or between
definition of Diff to require that the result includes the an object in S and an object outside of M1.
object of M1 referenced by map1’s root.

Employee Mapee′ Employee′ Emp′

EmployeeID Emp#

Name Name Name

FirstName FirstName
1 FirstName LastName

LastName 2 LastName Figure 6 The result of Merge applied to Figure 2


The effect of collapsing objects into a single object can
Figure 5 The result of Diff(Employee, Mapee) is cause the output of Merge to violate basic constraints that
<Employee′′, Mapee′> models must satisfy. For example, suppose map1 declares
objects m1 of M1 and m2 of M2 to be equal, and suppose
• map2 connects the root of M1′ to r in M1 and connects m1 is of type integer and m2 is of type image. The type of
each object of S to the corresponding object of M1′. the merged object m3 is both integer and image. If a
For example, given Employee and Mapee in Figure 4, the constraint on models is that each object is allowed to have
result of Diff(Employee, Mapee) is <Employee′, Mapee′> at most one type, then m3 manifests a constraint violation
as shown in Figure 5. that must be repaired, either as part of Merge or in a post-
3.3 Merge processing step. A solution to this specific problem
The merge operation returns a copy of all of the objects appears in [9]. A more general discussion of constraint
of the input models, except that objects of the input violations in merge results appears in [15].
models that are equal are collapsed into a single object in 3.4 Compose
the output. Stating this more precisely, given two models
The composition operator, represented by •, creates a
M1 and M2 and a mapping map1 between them, mapping by combining two other mappings. If map1
Merge(M1, M2, map1) returns a model M3 such that relates models M1 and M2, and map2 relates M2 and M3,
• M3 includes a copy of all of the objects of M1, M2, and then the composition map3 = map2 • map1 is a mapping
map1, except that for each object m of map1 that that relates M1 and M3 (i.e., map3(M1) ≡ map2(map1(M1)).
declares objects of M1 and M2 to be equal, those equal To explain the semantics of composition, we will use
objects are dropped from M3 and their properties and mathematical function terminology: For each object m1 in
relationships are added to m. The root of map1 must map1, we refer to the objects that m1 references in M1 as
declare the roots of M1 and M2 to be equal. its domain, and those that m1 references in M2 as its
• All relationships in M1, M2, and map1 are copied to the range. That is, domain(m1) ⊆ M1 and range(m1) ⊆ M2.
corresponding objects in M3. For example, in Figure 6 Similarly, for each object m2 in map2, domain(m2) ⊆ M2
Emp′ is the result of Merge(Emp, Employee, Mapee) and range(m2) ⊆ M3.
on the models and mappings of Figure 2.
In principle, a composition can be driven by either the
• Merge also returns two mappings, map13 between M1
left mapping (map1) or right mapping (map2). However,
and M3 and map23 between M2 and M3, which relate
in this paper we restrict our attention to right
each object of M3 to the objects from which it was
compositions, since that is enough for the examples in
derived. Thus, the output of Merge is a triple <M3,
Section 4. In a right composition, the structure of map2
map13, map23>. For example, Figure 7 shows the map
determines the structure of the output mapping.
pings between the merge result in Figure 6 and the
two input models of the merge, Emp and Employee.

Emp MapEmp-Emp′ Emp′ MapEmp′-Employee Employee

Emp# Emp# EmployeeID

Name Name FirstName

FirstName LastName LastName

Figure 7 The merge result, Emp′′, of Figure 2 with its mappings to the input models Emp and Employee

To compute the composition, for each object m2 in map2, Step 3 defines the domain of each object m3 in map3.
we identify each object m1 in map1 where range(m1) ∩ Input(m3) is the set of all objects in map1 whose range
domain(m2) ≠ ∅, which means that range(m1) can supply intersects the domain of m3. If the union of the ranges of
at least one object to domain(m2). For example, in Figure Input(m3) contains the domain of m3, then the union of the
8, the ranges of 4, 5, and 6 in map1 can each supply one domains of Input(m3) becomes the domain of m3. Other-
object to domain(11) in map2. Suppose objects m11, …, wise, m3 is not in the composition, so it is either deleted
m1n in map1 together supply all of domain(m2), and each (if it is not a support object, required to maintain the well-
m1i (1≤i≤n) supplies at least one object to domain(m2). formed-ness of map3), or its domain and range are cleared
That is, U range(m1i ) ⊇ domain(m 2 ) and (range(m1i) ∩ (since it does not compose with objects in map1).
1≤i ≤ n
Sometimes it is useful to keep every object of map2 in
domain(m2)) ≠ ∅ for 1≤i≤n. Then m2 should generate an map3 even though its Input set does not cover its domain.
output object m3 in map3 such that range(m3) = range(m2) This is called a right outer composition, because all
and domain(m3) = U domain(m1i ) . objects of the right operand, map2, are retained. Its
1≤i ≤ n

For example, in Figure 8, range(4) and range(5) can semantics is the same as right composition, except that
supply all of domain(11). That is, range(4) ∪ range(5) = step 3b is replaced by “else set domain(m3) = ∅.”
{7, 8, 9} ⊇ domain(11) = {7, 9}. Then object 11 should A definition of composition that allows a more flexible
generate an output object m3 in map3 (not shown in the choice of inputs to m2 is in [7]. It is more complex than
figure), such that range(m3) = range(m2) = {13} and the one above and is not required for the examples in
domain(m3) = domain(4) ∪ domain(5) = {1,2}. Section 4, so we omit it here.
3.5 Apply
M1 map1 M2 map2 M3
The operator Apply takes a model and an arbitrary
function f as inputs and applies f to every object of the
1 4 7 10 12 model. In many cases, f modifies the model, for example,
by modifying certain properties and relationships of each
object. The purpose of Apply is to reduce the need for
2 5 8 application programs to do object-at-a-time navigation
11 13
over a model. There can be variations of the operator for
3 different traversal strategies, such as pre-order or post-
6 9
order over has-a relationships with the proviso that it does
Figure 8 Mappings map1 and map2 can be composed not visit any object twice (in the event of cycles).
3.6 Copy
There is a problem, though: for a given m2 in map2, The operator Copy takes a model as input and returns a
there may be more than one set of objects m11, …, m1n in copy of that model. The returned model includes all of the
map1 that can supply all of domain(m2). For example, in relationships of the input model, including those that
Figure 8, {4, 5} and {4, 6} can each supply all of connect its objects to objects outside the model.
domain(11). When defining composition, which set do we One variation of Copy is of special interest to us,
choose? In this paper, rather than choosing among them, namely DeepCopy. It takes a model and mapping as
we use all of them. That is, we compose each m2 in map2 input, where the mapping is incident to the model. It
with the union of all objects m1 in map1 where range(m1) returns a copy of both the model and mapping as output.
∩ domain(m2) ≠ ∅ ({4,5,6} in the example). This seman- In essence, DeepCopy treats the input model and mapping
tics supports all of the application scenarios in Section 4. as a single model, creating a copy of both of them
Given this decision, we define the right composition together. To see the need for DeepCopy, consider how
map3 of map1 and map2 constructively as follows: complicated it would be to get its effect without it, by
1. (Copy) Create a copy map3 of map2. Note that map3 copying the model and mapping independently. Several
has the same morphisms to M2 and M3 as map2 and, other variations of Copy are discussed in [6].
therefore, the same domains and ranges. 3.7 ModelGen
2. (Precompute Input) For each object m3 in map3, let Applications of model management usually involve the
Input(m3) be the set of all objects m1 in map1 such generation of a model in one meta-model from a model in
that range(m1) ∩ domain(m2) ≠ ∅. another meta-model. Examples are the generation of a
3. (Define domains) For each m3 in map3, SQL schema from an ER diagram, interface definitions
a. if Um ∈Input ( m ) range(m1i ) ⊇ domain(m3 ) , then set from a UML model, or HTML links from a web site map.
1i 3

domain(m3) = Um1i ∈Input ( m3 ) domain(m1i ) . A model generator is usually meta-model specific. For
example, the behavior of an ER-to-SQL generator very
b. else if m3 is not needed as a support object (be- much depends on the source and target being ER and
cause none of its descendants satisfies (3a)), then SQL models respectively. Therefore, one would not
delete it, else set domain(m3) = range(m3) = ∅.

expect model generation to be a generic, i.e., meta-model- 3.10 Semantics


independent, operator. The model management operators defined in Section 3
Still, there is some common structure across all model are purely syntactic. That is, they treat models and
generators worth abstracting. One is that the generation mappings as graph structures, not as schemas that are
step should produce not only the output model but also a templates for instances. The syntactic orientation is what
mapping from the input model to the output model. This enables model and mapping manipulation operators to be
allows later operators to propagate changes from one relatively generic. Still, in most applications, to be useful,
model to the other. For example, if an application devel- models and mappings must ultimately be regarded as
oper modifies a SQL schema, it helps to know how the templates for instances. That is, they must have
modified objects relate to the ER model, so the ER model semantics. Thus, there is a semantic gap between model
can be made consistent with the revised SQL schema. management and applications that needs to be filled.
This scenario is developed in some detail in Section 4.3. The gap can be partially filled by making the meta-
A second common structure is that most model genera- meta-model described in Sections 2.1 more expressive
tors simply traverse the input model in a predetermined and extending the behavior of the operators to exploit that
order, much like Apply, and generate output model extra expressiveness. So, rather than knowing only about
objects based on the type of input object it is visiting. For has-a and association relationships, the meta-meta-model
example, a SQL generator might generate a table defini- should be extended to include is-a, data types, keys, etc.
tion for each entity type, a column definition for each Another way to introduce semantics is to use the
attribute type, a foreign key for each 1:n relationship type, Expression property in each mapping object m. Recall
and so on. In effect, the generator is a case-statement, that such an expression’s variables are the objects
where the case-statement variable is the type of the object referenced by m in the two models being related. To
being visited. If the case-statement is encapsulated as a exploit these expressions, the model management
function, it can be executed using the operator Apply. operators that generate mappings should be extended to
Since the case-statement is driven by object types, one produce expressions for any mapping objects they
can go a step further in automating model generation by generate. For example, when Compose combines several
tagging each meta-model object (which is a type objects from the two input mappings into an output
definition) by the desired generation behavior for model mapping object m, it would also generate an expression
objects of that type, as proposed in [10]. Using it, model for m based on the expressions on the input mapping
generation could be encapsulated as a model management objects. Similarly, for Diff and Merge.
operator, which we call ModelGen. The expression language is meta-model-specific, e.g.,
for the relational data model, it could be conjunctive
3.8 Enumerate
queries. Therefore, the extensions to model management
Although our goal is to capture as much model manipula-
tion as possible in model-at-a-time operators, there will be operators that deal with expressions must be meta-model-
times when iterative object-at-a-time code is needed. To specific too and should be performed by a meta-model-
specific expression manipulation engine. For example, the
simplify application programming in this case, we offer
expression language extension for Compose would call
an operator called Enumerate, which takes a model as
this engine to generate an expression for each output
input and returns a “cursor” as output. The operator Next,
mapping object it creates [16]. Some example walk-
when applied to a cursor, returns an object in the model
that was the input to Enumerate, or null when it hits the throughs of these extensions for SQL queries are given in
end of the cursor. Like Apply, Enumerate may offer [7]. However, a general-purpose interface between model
management operators and expression manipulation
variations for different traversal orderings.
engines has not yet been worked out.
3.9 Other Data Manipulation Operators
Another approach to adding semantics to mappings is to
Since models are object structures, they can be manipu-
develop a design tool for the purpose, such as Clio
lated by the usual object-at-a-time operators: read an
[17,27].
attribute; traverse a relationship, create an object, update
an attribute, add or remove a relationship, etc. In addition,
there are two other bulk database operators of interest: 4 Application Scenarios
• Select – Return the subset of a model that satisfies a In this section, we discuss three common meta data
qualification formula. The returned subset includes management problems that involve the manipulation of
additional support objects, as in Diff. Like Diff, it also models and mappings: schema integration, schema
returns a mapping between the returned model and the evolution, and round-trip engineering. We describe each
input model, to identify the non-support objects. problem in terms of models and mappings and show how
• Delete – This deletes all of the objects in a given to use model management operators to solve it.
model, except for those that are reachable by paths of 4.1 Schema Integration
has-a relationships from other models. The problem is to create: a schema S3 that represents all
of the information expressed in two given database

schemas, S1 and S2; and mappings between S1 and S3 and similar to FirstName and LastName, these objects are
between S2 and S3 (see Figure 9). The schema integration partially integrated in S12 under an object labeled ≅, which
literature offers many algorithms for doing this [1,8,23]. is a placeholder for an expression that relates Name to
They all consist of three main activities: identifying FirstName and LastName.
overlapping information in S1 and S2; using the identified
overlaps to guide a merge of S1 and S2; and resolving Emp′
conflict situations (i.e., where the same information was
represented differently in S1 and S2) during or after the Emp# ≅
merge. The main differentiator between these algorithms Name
is in the conflict resolution approaches.
Address
S3 FirstName
map13 map23 Phone LastName
S1 S2
S1 S2
Figure 11 The result of merging Emp and Employee
Figure 9 The schema integration problem based on Mapee in Figure 10
If each schema is regarded as a model, then we can The sub-structure rooted by “≅” represents a conflict
express the first two activities using model management between the two input schemas. A schema integration
operators as follows: algorithm needs rules to cope with such conflicts. In this
1. map12 = Match(S1, S2). This step identifies the equal case it could consult a knowledge base that explains that
and similar objects in S1 and S2. Since Match is first name concatenated with last name is a name. It could
creating a mapping between two independently use this knowledge to replace the sub-structure rooted by
developed schemas, this is best done with a Complex ≅ either by FirstName and LastName, since together they
Match operator (rather than Elementary Match). subsume Name, or by a nested structure Name with sub-
2. <S3, map13, map23> = Merge(S1, S2, map12). Given the objects FirstName and LastName. The latter is probably
mapping created in the previous step, Merge produces preferable in a data model that allows nested structures,
the integrated schema S3 and the desired mappings. such as XML Schema. The former is probably necessary
when nested structures are not supported, as in SQL.
For example, in Figure 10, Mapee could be the result of
Overall, the resolution strategy depends on the capabili-
Match(Emp, Employee). Notice that this is similar to
ties of the knowledge base and on the expressiveness of
Figure 3, except that Emp has an additional object
the output data model. So this activity is not captured by
Address and Employee has an additional object Phone,
the generic model management operators. Instead, it
neither of which are mapped to objects in the other model.
should be expressed in an application-specific function.
Emp Mapee Employee When application-specific conflict resolution functions
are used, the apply operator can help by executing a
1 EmployeeID conflict resolution rule on all objects of the output of
Emp#
= Merge. The rule tests for an object that is marked by ≅,
FirstName and if so applies its action to that object and its sub-
Name structure (knowledge-base lookup plus meta-model-
2 LastName specific merge). This avoids the need for the application-
≅ specific code to include logic to navigate the model.
Address
Phone To finish the job, the mappings map12 and map13 that
are returned by Merge must be translated into view defini-
Figure 10 The result of matching Emp and Employee tions. To do this, the models and mappings can no longer
be regarded only as syntactic structures. Rather, they need
Figure 11 shows the result of merging Emp and semantics. Thus, creating view definitions requires
Employee with respect to Mapee. (The mappings between semantic reasoning: the manipulation of expressions that
Emp′ and Emp and between Emp′ and Employee are explain the semantics of mappings. In Section 3.10 we
omitted, to avoid cluttering the figure.) Since Mapee says explained in broad outline how to do this, though as we
that the Emp# and EmployeeID objects are equal, they are said there, the details are beyond the scope of this paper.
collapsed into a single object Emp#. The two objects have
different names; Merge chose the name of the left object, 4.2 Schema Evolution
Emp#, one of the many details to nail down in a complete The schema evolution problem arises when a change to a
specification of Merge’s semantics. Since Address and database schema breaks views that are defined on it [3,
Phone are not referenced by Mapee, they are simply 12]. Stated more precisely, we are given a base schema
copied to the output. Since Mapee says that Name is S1, a set of view schemas V1 over S1, and a mapping map1
that maps objects of S1 to objects of V1. (See Figure 12.)

For example, if S1 and V1 are relational schemas, then we At this point we have successfully completed the task.
would expect each object m of map1 to contain a An alternative to steps 4 and 5 is to be more selective in
relational view definition that tells how to derive a view deleting view objects, based on knowledge about the
relation in V1 from some of the relations in S1; the syntax and semantics of the mapping expressions. For
morphisms of m would refer to the objects of S1 and V1 example, suppose the schemas and views are in the
that are mentioned in m’s view definition. Then, given a relational data model and S2 is missing an attribute that is
new version S2 of S1, the problem is to define a new used to populate an attribute of a view in V2. In the
version V2 of V1 that is consistent with S2 and a mapping previous approach, if each view is defined by one object
map2 from S2 to V2. in map1, then the entire view would be an orphan and
deleted. Instead, we could drop the attribute from the
V1 V2
view without dropping the entire view relation that con-
map2 tains it. To get this effect, we could replace Step 2 above
map1
by a right outer composition, so that all objects of map1
S1 S2 S2 are copied to map4, even if they connect to S1 objects that
have no counterpart in S2. Then we can write a function f
Figure 12 The schema evolution problem that encapsulates the semantic knowledge necessary to
strip out parts of a view definition and replace steps 4 and
We can solve this problem using model management 5 by Apply(f, map2). Thus, f gives us a way of exploiting
operators as follows (Figure 13): non-generic model semantics while still working within
1. map3 = Match(S1, S2). This returns a mapping between the framework of the model management algebra.
S1 and S2 that identifies what is unchanged in S2
relative to S1. If we know that S2 is an incremental 4.3 Round-Trip Engineering
modification of S1, then this can be done by Elemen- Consider a design tool that generates a compiled version
tary Match. If not, then Complex Match is required. of a high-level specification, such as an ER modeling tool
that generates SQL DDL or a UML modeling tool that
2. map4 = map1 • map3. This is a right composition. In- generates C++ interfaces. After a developer modifies the
tuitively, each mapping object in map4 describes a part generated version of such a specification (e.g., SQL
of map1 that is unaffected by the change from S1 to S2. DDL), the modified generated version is no longer
A mapping object m in map1 survives the composition consistent with its specification. Repairing the specifica-
(i.e., becomes an object of map4) if every object in S1 tion is called round-trip engineering, because the tool
that is connected to m is also connected to some object forward-engineers the specification into a generated
of S2 via map3. If so, then m is transformed into m′ in version after which the modified generated version is
map4 by replacing each reference from m to an object reverse-engineered back to a specification.
of S1 by a reference to the corresponding objects in S2. Stating this scenario more precisely, we are given a
map5 specification S1, a generated model G1 that was derived
V1 V2 V 2′ from S1, a mapping map1 from S1 to G1, and a modified
version G2 of G1. The problem is to produce a revised
map1 map4 map2 specification S2 that is consistent with G2 and a mapping
map2 between S2 and G2. See Figure 14. Notice that
S1 S2 diagrammatically, this is isomorphic to the schema evolu-
map3 tion problem; it is exactly like Figure 12, with S1 and S2
replacing V1 and V2, and G1 and G2 replacing S1 and S2.
Figure 13 Result of schema evolution solution
S1 – original spec
S1 S2
Some objects of V1 may now be “orphans” in the sense G1 – generated schema
that they are not incident to map4. An orphan arises G2 – modified
map1 map2 generated schema
because it maps via map1 to an object in S1 that has no
S2 – modified spec
corresponding object in S2 via map3. One way to deal with G1 G2 G2 for G2
orphans is to eliminate them. Since doing this would
corrupt map1, we first make a copy of V1 and then delete Figure 14 The round-trip engineering problem
the orphans from the copy:
3. <V2, map2> = DeepCopy(V1, map4). This makes a As in schema evolution, we start by matching G1 and
copy V2 of V1 along with a copy map2 of map4. G2, composing the resulting mapping with map1, and
doing a deep copy of the mapping produced by Compose:
4. <V2′, map5> = Diff(V2, map2). Identify the orphans.
1. map3 = Match(G1, G2). This returns a mapping that
5. For each e in Enumerate(map5), delete domain(e) from identifies what is unchanged in G2 relative to G1.
V2. This enumerates the orphans and deletes them. Since G2 is an incremental modification of G1,
Notice that we are treating map5 as a model. Elementary Match should suffice. See Figure 15a.

2. map4 = map1 • map3. Mapping map4, between S1 and column into an ER attribute, each table into either an
G2, includes a copy of each object in map1 all of whose entity type or relationship type (depending on the key
incident G1 objects are still present in G2. structure of the table), etc.
3. <S3, map5> = DeepCopy(S1, map4). This makes a copy We need to merge S3 and S3′ into a single model S2,
S3 of S1 along with a copy map5 of map4. which is half of the desired result. (The other half is map2,
Steps 2 and 3 eliminate from the specification S3 all coming soon.) To do this, we need to create a mapping
objects that do not correspond to generated objects in G2. between S3 and S3′ that connects objects of S3 and S3′ that
One could retain these objects by replacing the composi- represent the same thing. Continuing the example after
tion in step 2 by outer composition. The remaining steps step 4 above, where G2′ introduces a new column C into
in this section would then proceed without modification. table T, the desired mapping should connect the reverse
Next, we need to reverse engineer the new objects that engineered object for T in S3′ (e.g., an entity type) with
were introduced in G2 and merge them with S3. Here is the original object for T in S3 (e.g., the entity type that
one way to do it (see Figure 15a): was used to generate T in G2 in the first place). By
4. <G2′, map6> = Diff(G2, map3). This produces a model contrast, the reverse engineered object for C in S3′ will not
G2′ that includes objects of G2 that do not participate map to any object in S3 because it is a new object that was
in the mapping map3, which are exactly the new introduced in G2′, and therefore was not present S3. We
objects of G2, plus support objects O that are needed can create the desired mapping by a Match followed by
to keep G2′ well-formed. Mapping map6 maps each ob- two compositions, after which we can do the merge, as
ject of G2′ not in O to the corresponding object of G2. follows (see Figure 15b):
S3 - deep copy
6. map8 = Match(G2, G2′). This matches every object in
S1 S3 S3′ G2′ with its corresponding copy in G2. Unlike map6,
of S1 objects
that map to G2 map8 connects to all objects in G2′, including support
map1 map4 map5 map7
G2′ - new objects objects.
of G2 7. map9 = map7 • map8. This right composition creates a
map3 map6 S3′ - reverse eng’d mapping map9 between the objects of G2 that are also
G1 G2 G 2′
spec for G2′ in G2′ and their corresponding objects of S3′. Since
S2 - merge of S3
(a) After Step 5 map8 is incident to all objects of G2′, every object of
and S3′ (= modi-
map7 generates a map9 object that connects to G2.
fied spec for G2)
8. map10 = map5 • map9. If there are mapping objects of
S2
map5 and map9 that connect an object of G2 (e.g., T) to
map11 map11′ both S3 and S3′, then those mapping objects compose
and the corresponding objects of S3 and S3′ are related
map10 by map10. This should be an “inner” Compose, which
S1 S3 S 3′
only returns objects that connect to both S3 and S3′.
map1 map4 map5 map9 map7 9. <S2, map11, map11′> = Merge(S3, S3′, map10). This
merges the reverse engineered objects of S3′ (which
map6 came from the new objects introduced in G2) with S3,
G1 G2 G2′ producing the desired model S2 (cf. Figure 14).
map3 map8
Finally, we need to produce the desired mapping map2
(b) After Step 8 between G2 and S2. This is the union (i.e., merge) of
Figure 15 Result of round-trip engineering solution map11 • map5 and map11′• map9. To see why this is what
we want, recall that G2′ contains the objects of G2 that do
For example, suppose G2 and G2′ are SQL schemas, and not map to S3 via map5. Mapping map7 connects those ob-
G2′ introduced a new column C into table T. In the model jects to S3′, as does map9, except on the original objects in
management representation G2 of the schema, C is an
G2 rather than on the copies in G2′. Hence, every object in
object that is a child of object T. Since C is new, it is not
G2 connects to a mapping object in either map5 or map9.
connected via map3 to G1, so it is in the result of Diff.
So to start, we need to compute these compositions:
However, to keep G2′ connected, since C is a child of T, T
is also in the result of Diff as a support object, though it is 10. map2′ = map11 • map5
not connected to G2 via map6. 11. map2″ = map11′ • map9
5. <S3′, map7> = ModelGen(G2′). In this case, ModelGen Next, we need the union of map2′ and map2″. But there
is customized to reverse engineer each object of G2′ is a catch: an object of G2 could be connected to objects
into an object of the desired form for integration into in both map5 and map9. Continuing our example, table T
S2. For example, if G2′ is a SQL schema and the Si’s is such an object because it is mapped to S3 as well as re-
are ER models, then ModelGen maps each SQL verse engineered to S3′. Such objects have two mappings

to G2 via the union of the compositions, which is probably Model-driven generator of user interface – Much like
not what is desired. Getting rid of the duplicates is a bit of an advanced drawing tool, one can tag meta-model
effort. One way is to merge the mappings. To do this, we objects with descriptions of objects and their behavior
need to match map2′ and map2″ from steps 10 and 11 to (e.g., a table definition is a blue rectangle and a column
find the duplicates (which we can do because mappings definition is a line within its table’s rectangle).
are models), and then merge the mappings based on the
Generic tools over models and mappings – browser,
match result. Here are the steps (not shown in Figure 15):
editor, catalog, import/export, scripting.
12. map12 = Match(map2′, map2″). Objects m2′ in map2′
and m2″ in map2″ match if they connect to exactly the 6 Related Work
same objects of G2 and S2. To use this matching Although the model management approach is new, much
condition, one needs to regard the morphisms of map2′ of the existing literature on meta data management offers
and map2″ as parts of each map’s model; e.g., the either algorithms that can be generalized for use in model
morphisms could be available as relationships on each management or examples that can be studied as chal-
map’s model. Using this simple match criterion, lenges for the model management operators. This litera-
Elementary Match suffices. ture is too large to cite here, but we can highlight a few
13. map2 = Merge(map2′, map2″, map12). The morphisms areas where there is obvious synergy worth exploring.
of map2′ and map2″ should be merged like ordinary Some of them were mentioned earlier: schema matching
relationships. That is, if map12 connects m2′ in map2′ (see the survey in [23]); schema integration [1,8,15,25],
and m2″ in map2″, then Merge collapses m2′ and m2″ which is both an example and a source of algorithms for
into a single object m2. Object m2 should have only Match and Merge; and adding semantics to mappings
one copy of the mapping connections that m2′ and m2″ [7,17,21,27]. Others include:
had to G2 and S2. • Data translation [24];
We now have map2 and S2, so we’re done! Cf. Figure 14. • Differencing [11,19,26]; and
• EER-style representations and their expressive
5 Implementation power, which may help select the best representation
We envision an implementation of models, mappings, and for models and mappings [2,14,15,18,20].
model management operators on a persistent object-
oriented system. Given technology trends, an object- 7 Conclusion
relational system is likely to be the best choice, but an In this paper, we described model management — a new
XML database system might also be suitable. The system approach to manipulating models (e.g., schemas) and
consists of four layers: mappings as bulk objects using operators such as Match,
Models and mappings – This layer supports the model Merge, Diff, Compose, Apply, Copy, Enumerate, and
and mapping abstractions, each implemented as an object- ModelGen. We showed how to apply these operators to
oriented structure, both on disk and heavily cached for three classical meta data management problems: schema
fast navigation. The representation of models should be integration, schema evolution, and round-trip engineering.
extensible, so that the system can be specialized to more We believe these example solutions strongly suggest that
expressive meta-meta-models. And it should be semi- an implementation of model management would provide
structured, so that models can be imported from more major programming productivity gains for a wide variety
expressive representations without loss of information. of meta data management problems. Of course, to make
This layer supports: this claim compelling, an implementation is needed. If
• Models – We need the usual object-at-a-time opera- successful, such an implementation could be the
tions on objects in models, plus GetSubmodels (of a prototype for a new category of database system products.
given model) and DeleteSubmodel, where a submod- In addition to implementation, there are many other
el is a model rooted by an object in another model. areas where work is needed to fully realize the potential
Also Copy (deep and shallow) is supported here. of this approach. Some of the more pressing ones are:
• Mappings - CreateMapping returns a model and two • Choosing a representation that captures most of the
morphisms. GetSource and GetTarget return the constructs of models and mappings of interest, yet is
morphisms of a given mapping. tractable for model management operators.
• Morphisms – These are accessible and updatable like • More detailed semantics of model management opera-
normal relationships. tors. There is substantial work on Match. Merge,
Algebraic operators – This layer implements Match, Compose, and ModelGen are less well developed.
Merge, Diff, Compose, Apply, ModelGen, and Enumer- • A mathematical semantics of model management. The
ate. It should have an extension mechanism for handling beginnings of a category-theoretic approach appears
semantics, such as an expression manipulation engine as in [1], but there is much left to do. A less abstract
discussed in Section 3.10. analysis that can speak to the completeness of the set

of operators would help define the boundary of useful 11. Chawathe, Sudarshan S. and Hector Garcia-Molina:
model management computations. Meaningful Change Detection in Structured Data.
• Mechanisms are needed to fill the gap between SIGMOD 1997: 26-37.
models and mappings, which are syntactic structures, 12. Claypool, K. T., J. Jin, E. A. Rundensteiner: SERF:
and their semantics, which treat models as templates Schema Evolution through an Extensible, Re-usable
for instances and mappings as transformations of and Flexible Framework. CIKM 1998: 314-321.
instances. Various theories of conjunctive queries are 13. Claypool, K.T., E.A. Rundensteiner, X. Zhang, H.
likely to be helpful. Su, H.A. Kuno, W-C Lee, G. Mitchell: Gangam - A
• Trying to apply model management to especially chal- Solution to Support Multiple Data Models, their
lenging meta data management problems, to identify Mappings and Maintenance. SIGMOD 2001
limits to the approach and opportunities to extend it. 14. Hull, Richard and Roger King: Semantic Database
Modeling:Survey, Applications, and Research Issues.
This is a broad agenda that will take many years and ACM Computing Surveys 19(3): 201-260 (1987)
many research groups to develop. Although it will be a lot 15. Larson, James A., Shamkant B. Navathe, and Ramez
of work, we believe the potential benefits of the approach Elmasri. A theory of attribute equivalence in
make the agenda well worth pursuing. databases with application to schema integration.
Acknowledgments Trans. on Soft. Eng. 15(4):449-463 (April 1989).
The ideas in this paper have benefited greatly from my 16. Madhavan, J., P. A. Bernstein, P. Domingos, A.Y.
ongoing collaborations with Suad Alagi , Alon Halevy, Halevy: Representing and Reasoning About
Renée Miller, Rachel Pottinger, and Erhard Rahm. I also Mappings between Domain Models. 18th National
thank the many people whose discussions have stimulated Conference on Artificial Intelligence (AAAI 2002).
me to extend and sharpen these ideas, especially Kajal 17. Miller, R.J., L. M. Haas, M. A. Hernández: Schema
Claypool, Jayant Madhavan, Sergey Melnik, Peter Mork, Mapping as Query Discovery. VLDB 2000: 77-88.
John Mylopoulos, Arnie Rosenthal, Elke Rundensteiner, 18. Miller, R. J., Y. E. Ioannidis, Raghu Ramakrishnan:
Aamod Sane, and Val Tannen. Schema equivalence in heterogeneous systems:
bridging theory and practice. Information Systems
8 References 19(1): 3-31 (1994)
1. Alagic, S. and P.A. Bernstein, “A Model Theory for 19. Myers, E.: An O(ND) Difference Algorithm and its
Generic Schema Management,” Proc. DBPL 2001, Variations. Algorithmica 1(2): 251-266 (1986).
Springer Verlag LNCS. 20. Mylopoulos, John, Alexander Borgida, Matthias
2. Atzeni, Paolo and Riccardo Torlone: Management of Jarke, Manolis Koubarakis: Telos: Representing
Multiple Models in an Extensible Database Design Knowledge About Information Systems. TOIS 8(4):
Tool. EDBT 1996: 79-95 325-362 (1990).
3. Banerjee, Jay, Won Kim, Hyoung-Joo Kim, Henry F. 21. Popa, Lucian, Val Tannen: An Equational Chase for
Korth: Semantics and Implementation of Schema Path-Conjunctive Queries, Constraints, and Views.
Evolution in Object-Oriented Databases. SIGMOD ICDT 1999: 39-57.
Conference 1987: 311-322 22. Pottinger, Rachel A. and Philip A. Bernstein. Creat-
4. Beeri, C. and T. Milo: Schemas for integration and ing a Mediated Schema Based on Initial Correspon-
translation of structured and semi-structured data. dences. IEEE Data Engineering Bulletin, Sept. 2002.
ICDT, 1999: 296-313,. 23. Rahm, Erhard and Philip A. Bernstein. A survey of
5. Bernstein, P.A.: Generic Model Management − A approaches to automatic schema matching. VLDB J.
Database Infrastructure for Schema Manipulation. 10(4):334-350 (2001).
Springer Verlag LNCS 2172, CoopIS 2001: 1-6. 24. Shu, Nan C., Barron C. Housel, R. W. Taylor, Sakti
6. Bernstein, Philip A., Alon Y. Halevy, and Rachel A. P. Ghosh, Vincent Y. Lum: EXPRESS: A Data
Pottinger. A vision of management of complex mod- EXtraction, Processing, amd REStructuring System.
els. SIGMOD Record 29(4):55-63 (2000). TODS 2(2): 134-174 (1977).
7. Bernstein, Philip A., Erhard Rahm: Data Warehouse 25. Spaccapietra, Stefano and Christine Parent. View
Scenarios for Model Management. ER 2000: 1-15. integration: A step forward in solving structural
8. Biskup, J. and B. Convent. A formal view integration conflicts. TKDE 6(2): 258-274 (April 1994).
method. SIGMOD 1986: 398-407. 26. J. T-L. Wang, D. Shasha, G. J-S. Chang, L. Relihan,
9. Buneman, P., S.B. Davidson, A. Kosky. Theoretical K. Zhang, G. Patel: Structural Matching and
aspects of schema merging. EDBT 1992: 152-167. Discovery in Document Databases. SIGMOD 1997:
10. Cattell, R.G.G., D. K. Barry, M. Berler, J. Eastman, 560-563
D. Jordan, D. Russell, O. Schadow, T. Stanienda, and 27. Yan, Ling-Ling, Renée J. Miller, Laura M. Haas,
F. Velez, editors: The Object Database Standard: Ronald Fagin: Data-Driven Understanding and
ODMG 3.0. Morgan Kaufmann Publishers, 2000. Refinement of Schema Mappings. SIGMOD 2001.
Algorithms for Creating Indexes for Very Large Tables Without Quiescing Updates
Chapter 7
Data Warehousing

This section of the book deals with data warehousing techniques and practices. Because this
topic largely originated in industrial practice and was more recently studied by the research
community, it is widely misunderstood. Hence, we spend some of this introduction explaining
the issues involved and also include a survey paper by Chaudhuri and Dayal that goes into more
detail.

A typical large enterprise has a multitude of mission-critical operational systems, typically
numbering in the tens or even the hundreds. The reason for this number of important computer
systems is the decentralization practiced by most large enterprises. A (typically self-contained)
business unit is set up and charged with moving the enterprise into some new area. This business
unit often sets up its own computer system, so it can be in charge of its own destiny, rather than
be at the mercy of some other group’s computer system and priorities. Over a couple of decades,
this leads to a proliferation of systems. These days it is also leading to a serious initiative in
many enterprises aimed at server consolidation.

Each operational system is run by a system administrator (SA) who lives (and dies) by his ability
to keep the system up and functioning during appropriate business hours and by his ability to
provide good response time to the users of the system (who are typically entering transactions).
To accomplish his goals, the SA will invariably jealously guard his computing resources and
refuse to make modifications to his system, unless essential to the business unit. This creates a
self-contained “island of information”, and a typical enterprise has hundreds of such islands.
Each island refuses to provide access to others in the enterprise (since this would degrade
response time) and refuses to change the system to accommodate the needs of others (since this
will lead to instability and lower system availability).

In this sort of world, a given customer of the enterprise will appear in the operational system of
each business unit with which he has a relationship. There are obvious business opportunities if
the islands of information can be integrated. A typical scenario is called “cross-selling”: offering
a customer additional products and services that may be of interest based on their history. An
example of this nirvana would be to respond to a customer, who called his bank to discuss his
checking account, with a suggestion that he investigate a refinance of his home since his rate of
X% is enough above the current rate of Y% to make it worthwhile. This cross-selling scenario requires the checking account clerk to know that the customer on the phone also has a home loan with the same bank.

To perform this information integration without changing every operational system (a no-no), an
obvious strategy is to buy a large machine and then periodically copy relevant data from each
operational system (typically during “dead” time) to the big machine. On this machine, the
required data integration can be performed.

A second situation that arises frequently in retail applications is exemplified by the following example. An international retail chain has a few hundred stores around North America. Each in-store system records every item that passes through a checkout lane. The data from each store is sent to a central place and kept for (say) a year. Corporate buyers interrogate this system to see what is selling and what is not. For “hot” items, the buyer submits a large order to the manufacturer to tie up the manufacturer’s production capacity and deny merchandise to competitors. “Cold” items are put on sale or returned to the supplier, if possible. Hence, this integration of historical data can be used to help with stock rotation and purchasing.

Every large enterprise we know of has set up a central data warehouse to perform information
integration like our first example or historical data integration like our second example. Such
warehouses invariably have a collection of common characteristics to which we now turn.

Warehousing projects are typically way over budget and way behind schedule. The problem is
always semantic heterogeneity. The various operational systems store data elements in different
ways and with different semantics. For example, there are many different meanings for “two-day delivery”. Also, one of us is Mike Stonebraker in one system, and M. R. Stonebraker in another. Figuring out the exact semantics of data elements and then writing the conversion routines to change them to a common representation is tedious, time-consuming, and expensive. Furthermore, there are cases where this conversion is nearly impossible. If you call an item “rubber
hand protectors” and I call them “latex gloves”, who is to say whether they are the same or
different products?

Essentially all warehouses are loaded periodically and otherwise are read-only. The only
exception is cleaning operations to remove errors. For historical warehouses, which record transactions from past months, the correct way to organize such data is
invariably to have a large fact table that records each transaction (who bought what, where, when,
for how much, etc). The fact table is then joined to a collection of dimension tables, which
record information about each customer, store, product, time period, etc. The fact table usually
contains a huge number of rows, each of which is filled mostly with ID numbers (foreign keys to
dimension table tuples). The number of rows in the dimension tables is usually tiny in
comparison. Hence, the size of the warehouse is determined by the size of the fact table.

If one imagines the fact table in the center of a picture and the dimensions on the periphery, then
the reasonable joins between the dimension tables and the fact table form a star. Hence, the
name star schema is used to describe the logical data base design used in most warehouses.
Sometimes, dimensions have multiple levels. For example, time period can be composed of
quarters, made up of months, and then days. If there is a table for each level of a dimension, then
a snowflake schema results.
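To make the star schema concrete, here is a minimal SQL sketch; the table and column names are hypothetical and heavily simplified, and a real warehouse would have many more dimensions and attributes.

    CREATE TABLE store    (store_key INTEGER PRIMARY KEY, store_id VARCHAR(10),
                           city VARCHAR(40), state_code CHAR(2));
    CREATE TABLE product  (product_key INTEGER PRIMARY KEY, product_name VARCHAR(40),
                           department VARCHAR(20));
    CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, calendar_date DATE,
                           sale_month INTEGER, sale_quarter INTEGER, sale_year INTEGER);

    -- The fact table is mostly foreign keys (the dimension coordinates) plus measures.
    CREATE TABLE sales (
        store_key     INTEGER REFERENCES store,
        product_key   INTEGER REFERENCES product,
        time_key      INTEGER REFERENCES time_dim,
        dollar_amount DECIMAL(10,2),
        quantity      INTEGER
    );

A snowflake variant would further normalize time_dim into separate month, quarter, and year tables.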

Warehouse queries from business analysts are often very complex. They typically entail
computing some aggregate over the fact table after joining it to a couple of dimension tables,
filtered according to some other data elements in the dimension tables, and grouping the elements
in the fact table by yet other elements of the dimension tables. An example query would be to
compute the sales volume of the jewelry department for each store in the retail chain for each
month for the last year. The result of this query often inspires the business analyst to ask a
different query, which is some other aggregate grouping on different data elements in the same or
different dimensions.
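Using the hypothetical schema sketched above, the jewelry-department query might look as follows (the year literal is, of course, illustrative):

    SELECT s.store_id, t.sale_year, t.sale_month,
           SUM(f.dollar_amount) AS sales_volume
    FROM   sales f
    JOIN   store    s ON f.store_key   = s.store_key
    JOIN   product  p ON f.product_key = p.product_key
    JOIN   time_dim t ON f.time_key    = t.time_key
    WHERE  p.department = 'Jewelry'
      AND  t.sale_year  = 2004
    GROUP BY s.store_id, t.sale_year, t.sale_month;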

As a result, it is essentially impossible to choose a single good sort order (clustering key) for the fact table. One would like to have it sorted in the order of the filtering attribute(s) (which live in some other table), but these attributes change from query to query. Hence, the best tactic for improving warehouse performance is to materialize views, which present the fact table data in the order required by popular queries.
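One common form such a materialized view takes is a pre-joined, pre-aggregated summary keyed on the grouping columns of a popular query. The sketch below uses Oracle-style syntax and the hypothetical schema above; in systems without materialized-view DDL, the same summary can be kept as an ordinary table that the load job maintains.

    CREATE MATERIALIZED VIEW monthly_dept_sales AS
    SELECT s.store_id, p.department, t.sale_year, t.sale_month,
           SUM(f.dollar_amount) AS sales_volume
    FROM   sales f
    JOIN   store    s ON f.store_key   = s.store_key
    JOIN   product  p ON f.product_key = p.product_key
    JOIN   time_dim t ON f.time_key    = t.time_key
    GROUP BY s.store_id, p.department, t.sale_year, t.sale_month;

Physically ordering or indexing the view on (department, sale_year, sale_month) gives popular queries the clustering that the raw fact table cannot provide for every query at once.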

Warehouse queries are usually run on large multi-processors. It is sometimes important to utilize
intra-query parallelism in order to run especially hard queries in parallel using multiple disks and
multiple CPUs.

Many of the data elements have small cardinality. For example, there are only 50 states and fewer than 100,000 cities in the USA. Therefore, it is often desirable to code data elements in the warehouse. Instead of storing “state” as two ASCII characters consuming 16 bits, one can store it as a 6-bit code. Also, it is invariably true that bit map indexes outperform traditional B-tree indexes in this setting. For example, in an N-record table the data element “state” can be indexed by 50 bit strings of length N, one per state. Since each bit string is sparse, it can be run-length encoded to further reduce space by at least a factor of 2. In contrast, a B-tree requires more than 32 bits per record. Also, bit map indexes can be intersected and unioned very efficiently to deal with Boolean combinations of predicates; B-trees are much less efficient at this kind of processing. In many practical warehousing scenarios, bit map indexes take up far less space and are far more efficient than B-tree indexes. Since a warehouse is updated in batch, the maintenance cost of updating bitmap indexes is tolerable. Hence, bit map indexes are a key indexing technology in data warehouses.
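In vendor-specific syntax (Oracle-style shown here), bitmap indexes are declared like ordinary indexes. The sketch assumes, as in the discussion above, that the fact rows carry a coded low-cardinality state column directly; all names are hypothetical.

    CREATE BITMAP INDEX sales_state_bix ON sales (state_code);
    CREATE BITMAP INDEX sales_prod_bix  ON sales (product_key);

    -- A Boolean combination of predicates can be answered by ANDing the
    -- (run-length-encoded) bit vectors before touching the fact table.
    SELECT COUNT(*)
    FROM   sales
    WHERE  state_code = 'CA' AND product_key = 42;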

In the early 1990s the term on-line analytical processing (OLAP) was coined to loosely stand for the functionality in the Arbor product, Essbase. This system allowed one to define a collection of hierarchical dimensions similar to the ones discussed above. Then, for display purposes, any two dimensions could be chosen for the X and Y axes, and an aggregate could be shown in the cells for each pair of dimension values. This multidimensional structure was called a data cube, because two dimensions of an N-dimensional cube could be displayed at any one time. Also, for the hierarchical dimensions found in snowflake schemas, one could zoom in and out of a dimension that was displayed. Hence OLAP became synonymous with a data visualization system for N-dimensional data. Drill-down was the term used for getting more detail from a hierarchical dimension; roll-up was the term used for aggregating data up to a higher level of the hierarchy.
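In SQL terms (the ROLLUP extension is now standard and widely supported), rolling up and drilling down along the time hierarchy corresponds to moving between the grouping levels produced by a query such as this sketch over the hypothetical schema above:

    -- Produces subtotals at the (year, quarter, month), (year, quarter), and (year)
    -- levels plus a grand total; drill-down moves toward the detailed rows,
    -- roll-up toward the grand total.
    SELECT t.sale_year, t.sale_quarter, t.sale_month,
           SUM(f.dollar_amount) AS sales_volume
    FROM   sales f
    JOIN   time_dim t ON f.time_key = t.time_key
    GROUP BY ROLLUP (t.sale_year, t.sale_quarter, t.sale_month);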

Arbor’s product contained special (non-tabular) data structures to efficiently compute and
maintain data cubes. Hence, Essbase was a cube-oriented data visualization system built on specialized data storage structures, and this architecture became synonymous with OLAP. A data
cube interface can also be put on top of a relational product, and these were called relational
OLAP products (or ROLAP). The advantage of ROLAP products is that one could subset the
data, and then investigate data cubes for ad-hoc data sets. This capability was difficult in the
original OLAP products, which came to be known as Multidimensional OLAP (MOLAP) to
distinguish them from ROLAP solutions.

Most newer warehouse-oriented products have been called business intelligence (BI) tools. They
allow one to form ad-hoc queries and then visualize the result in a variety of ways. They are
invariably built on top of relational DBMSs. In this environment, data cubes are merely one of
several visualization techniques.

Business analysts interact with most warehouses using BI tools and submit ad-hoc queries for
visualization. Some business analysts also run data mining code against a warehouse.

The periodic loading required for all warehouses is a complex task. The data must be extracted
from various operational systems, transformed to a common schema, cleaned (if there are errors
present), and then loaded into the warehouse. Products which assist in this extract-transform-load
process have been called ETL tools. Transformations are often quite complex, and are typically
described in an ETL tool using some sort of workflow representation.
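A single transformation step in such a workflow often boils down to SQL of the following flavor, run against a staging area; the staging table, currency table, and cleansing predicates are hypothetical.

    -- Conform units and codes, drop obviously bad rows, and map source identifiers
    -- to warehouse keys before loading into the sales fact table.
    INSERT INTO sales (store_key, product_key, time_key, dollar_amount, quantity)
    SELECT st.store_key, p.product_key, t.time_key,
           r.amount_local * fx.usd_rate,          -- convert to a common currency
           r.units
    FROM   staging_sales r
    JOIN   store    st ON st.store_id     = r.source_store_id
    JOIN   product  p  ON p.product_name  = r.item_description
    JOIN   time_dim t  ON t.calendar_date = r.sale_date
    JOIN   fx_rates fx ON fx.currency     = r.currency
    WHERE  r.amount_local IS NOT NULL AND r.units > 0;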

We begin this section with a detailed survey by Chaudhuri and Dayal that describes the
warehouse environment in greater detail than what was presented above. Following this survey
article, we include a paper by O’Neil and Quass on bitmap indexes as well as a variety of extensions. This paper further explains why bit-oriented indexes are especially valuable in a
warehouse environment.

Because materialized views are so valuable in a warehouse environment, we have included two
papers on this topic. When the underlying tables used in the materialized view are updated, the
MV is rendered invalid. One can discard and then recreate the MV, a costly operation, or one can
attempt to update the MV in place. Research on updating MVs is represented by the next paper
of this chapter by Ceri and Widom. Our commercial experience with warehouse applications is
that most MVs are relatively easy to update in place because of their simple structure (e.g. all
joins between a dimension table and the fact table are 1-n). Hence, the updating of real world
MVs is well within the state of the art.
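For summaries with this simple structure, in-place maintenance is little more than adding the aggregated deltas. The sketch below assumes the summary is kept as an ordinary table (monthly_dept_sales, as sketched earlier) maintained by the load job, and that the newly loaded fact rows are staged in a hypothetical sales_delta table; MERGE is the SQL:2003 "upsert" statement.

    MERGE INTO monthly_dept_sales mv
    USING (SELECT s.store_id, p.department, t.sale_year, t.sale_month,
                  SUM(f.dollar_amount) AS delta_volume
           FROM   sales_delta f
           JOIN   store    s ON f.store_key   = s.store_key
           JOIN   product  p ON f.product_key = p.product_key
           JOIN   time_dim t ON f.time_key    = t.time_key
           GROUP BY s.store_id, p.department, t.sale_year, t.sale_month) d
    ON (mv.store_id = d.store_id AND mv.department = d.department
        AND mv.sale_year = d.sale_year AND mv.sale_month = d.sale_month)
    WHEN MATCHED THEN
        UPDATE SET mv.sales_volume = mv.sales_volume + d.delta_volume
    WHEN NOT MATCHED THEN
        INSERT (store_id, department, sale_year, sale_month, sales_volume)
        VALUES (d.store_id, d.department, d.sale_year, d.sale_month, d.delta_volume);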

However, the more difficult problem is to decide which MVs to keep in the first place.
Essentially all of the major DBMS vendors have patents on algorithms that examine a collection
of queries (the training set) and then create a carefully-chosen set of MVs that provide good
performance on the training queries subject to some space limitation. The interested reader is
directed to the on-line patent repository to explore these algorithms. From the open literature we
have chosen a representative, relatively practical paper on the choice of MVs, by Kotidis and
Roussopoulos.

Data cubes are one of the popular interfaces for BI users. Hence, we have included two papers on
cubes in this section. The first one deals with a collection of query language extensions that will
“slice and dice” data into cubes. This paper by Gray et al. brings the traditionally non-standard
data cube model into traditional SQL environments. The second paper deals with the
simultaneous computation of cube elements in both MOLAP and ROLAP systems and is written
by Zhao, Deshpande and Naughton.
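The CUBE operator from the Gray et al. paper (since adopted into the SQL standard and many products) computes all 2^n grouping combinations of the listed columns in one statement; again, the schema is the hypothetical one sketched earlier.

    SELECT p.department, s.store_id, t.sale_year,
           SUM(f.dollar_amount) AS sales_volume
    FROM   sales f
    JOIN   store    s ON f.store_key   = s.store_key
    JOIN   product  p ON f.product_key = p.product_key
    JOIN   time_dim t ON f.time_key    = t.time_key
    GROUP BY CUBE (p.department, s.store_id, t.sale_year);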

Warehouse queries are often long-running, and it is clear that parallel query processing is
desirable to lower response times. The obvious technique is to horizontally partition each table,
and then perform the query on each partition in parallel. This is a well known technique
exploited by the “software data base machines”, such as Gamma [DEWI86] and Bubba
[BORA90]. The survey paper on parallel query processing by DeWitt and Gray in Chapter 2
provides good coverage of this topic.
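Declaratively, this horizontal partitioning is usually just a storage clause on the fact table definition. The following is a purely illustrative, Oracle-style variant of the sales table sketched earlier, hash-partitioned so that scans and joins can proceed on all partitions in parallel.

    CREATE TABLE sales (
        store_key     INTEGER,
        product_key   INTEGER,
        time_key      INTEGER,
        dollar_amount DECIMAL(10,2),
        quantity      INTEGER
    )
    PARTITION BY HASH (store_key) PARTITIONS 16;  -- roughly one partition per disk/CPU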

Even with parallelism, some queries take a very long time. In such cases, it is sometimes sensible
to trade off result accuracy for better response time. Since most long-running warehouse queries
compute summary statistics (i.e. aggregates), statistical techniques can approximate the query
results very effectively. Various schemes have been proposed for approximate query answering,
including sampling (e.g. [HOT88, OLK93]), as well as precomputed “synopses” including
wavelets (e.g. [CHAK00]) and “sketches” based on random projections [AMS96] and
probabilistic counting [FM85]. There are two practical problems with this work. First, the
synopsis work typically ignores the systems issues in doing full-featured query processing; most
of the synopsis schemes are not arbitrarily composable with an algebra for multi-operator queries
(e.g. two or more joins with selection and projection). Second, there is a problem trying to sell
this kind of technology to customers. If you ask the typical BI customer whether they are willing
to use an approximate answer, they will typically say no, regardless of what you may tell them
about the statistical validity of the answer. This is an end-user issue that cannot be neglected.

The CONTROL project at Berkeley and IBM took a more user-centric approach to
approximation, starting with the work on online aggregation that is the focus of our last paper in
this section. With online or progressive approximations, the user is given a running, easily
visualized estimate of the query results during execution, and can “stop early” when they see fit,
or let the query run to completion if they prefer. Most users are quite comfortable with this
approach even if they are generally uncomfortable with approximation schemes, because it
provides intuitive feedback and gives the user control of the tradeoff between accuracy and time.
This work also focused on end-to-end systems solutions for sampling-based approximation of a
large class of SQL queries. While attractive, these ideas have been tough to translate to products.
To deliver online query processing functionality, one has to change not only the DBMS, but also
the BI applications that run over the engine – everything must become interactive. The
applications are often provided by a family of vendors different from the DBMS vendor, so any
changes to the engine require significant buy-in from an entire sub-industry in order to be worth
the DBMS vendor’s investment. Similarly, a lightweight startup company has a tough time
pushing this agenda, since it requires support from the core DBMS engine.

The state of the art today in commercial query approximation is still painfully crude. Despite
IBM’s expertise in the area, both IBM and Oracle only support “Bernoulli” (coin-toss) sampling
of base tables – for every tuple from a given table in the FROM clause, a weighted coin is flipped
to decide whether to look at that tuple in the query. Bernoulli sampling of base tables does not
provide any way to characterize the quality of the answers returned from join queries, for
example, nor any way to make the sampling progressive and interactive a la online aggregation.
In short, query approximation is a technique where the research is well ahead of the marketplace.
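For concreteness, the base-table sampling available today looks roughly like the following. TABLESAMPLE BERNOULLI is the SQL-standard form (supported, with varying syntax, by several engines; Oracle instead uses a SAMPLE clause), and the 1 percent rate is illustrative.

    -- Each sales row is included independently with probability 1%.
    -- AVG needs no correction; SUM and COUNT estimates must be scaled up by 100.
    SELECT AVG(dollar_amount) AS estimated_avg_sale,
           SUM(dollar_amount) * 100 AS estimated_total_sales
    FROM   sales TABLESAMPLE BERNOULLI (1);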

We close this introduction with a disturbing prediction. Warehouses have found near universal
acceptance in large enterprises. However, most warehouse administrators have quickly
discovered a serious flaw, namely that the warehouse is stale by half the refresh interval on
average. This staleness is not an issue for historical queries, but it becomes very problematic in
other circumstances. One warehouse user confided to us that he was frustrated because he could
not compare yesterday against today, because yesterday was in the warehouse but today was still
in the operational system. When one wants analyses closer to real time, staleness becomes an
issue. In addition, some users want “real time warehouses”. By this they often mean the ability
to use data warehouse data to decide what to do with an operational transaction. For example,
one might want to make a credit decision in a current transaction based on the transactions that
the customer had executed in the recent past.

Put differently, business intelligence applications can be performed on historical data or on a mix
of historical and current data. As enterprises strive to make decisions based on timely
information (the so-called real time enterprise), then warehouses will have to become more
current.

Although one can run the ETL process more often (say every hour) and thereby get the
warehouse to be less stale, this will call into question some of the performance tactics currently
used such as bitmap indexes and materialized views. Moreover, such a warehouse is still stale by
an average of 30 minutes. Getting more current than this will require a fundamental rethinking of
the way enterprises deal with operational systems and BI systems, and hence a rethinking of the
basic architecture of major enterprise systems.

References

[AMS96] N. Alon, Y. Matias, and M. Szegedy: The Space Complexity of Approximating the Frequency Moments. In Proceedings of the 28th Annual ACM Symposium on Theory of Computing (STOC), pp. 20-29, May 1996.

[BORA90] Boral, H., et al.: Prototyping Bubba: A Highly Parallel Database System. IEEE Transactions on Knowledge and Data Engineering 2(1), March 1990.

[CHAK00] Kaushik Chakrabarti, Minos Garofalakis, Rajeev Rastogi, and Kyuseok Shim: Approximate Query Processing Using Wavelets. In Proc. International Conference on Very Large Data Bases (VLDB), Cairo, Egypt, September 2000, pp. 111-122.

[DEWI86] DeWitt, D., et al.: Gamma – A High Performance Dataflow Machine. In Proc. International Conference on Very Large Data Bases (VLDB), Tokyo, Japan, Sept. 1986.

[FM85] Philippe Flajolet and G. Nigel Martin: Probabilistic Counting Algorithms for Data Base Applications. Journal of Computer and System Sciences 31(2): 182-209, 1985.

[HOT88] Wen-Chi Hou, Gultekin Özsoyoglu, and Baldeo K. Taneja: Statistical Estimators for Relational Algebra Expressions. In Proc. ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), 1988, pp. 276-287.

[OLK93] Olken, F.: Random Sampling from Databases. Ph.D. dissertation, Lawrence Berkeley Laboratory Tech. Report LBL-32883, April 1993.
An Overview of Data Warehousing and OLAP Technology

Surajit Chaudhuri, Microsoft Research, Redmond (surajitc@microsoft.com)
Umeshwar Dayal, Hewlett-Packard Labs, Palo Alto (dayal@hpl.hp.com)

Abstract

Data warehousing and on-line analytical processing (OLAP) are essential elements of decision support, which has increasingly become a focus of the database industry. Many commercial products and services are now available, and all of the principal database management system vendors now have offerings in these areas. Decision support places some rather different requirements on database technology compared to traditional on-line transaction processing applications. This paper provides an overview of data warehousing and OLAP technologies, with an emphasis on their new requirements. We describe back end tools for extracting, cleaning and loading data into a data warehouse; multidimensional data models typical of OLAP; front end client tools for querying and data analysis; server extensions for efficient query processing; and tools for metadata management and for managing the warehouse. In addition to surveying the state of the art, this paper also identifies some promising research issues, some of which are related to problems that the database research community has worked on for years, but others are only just beginning to be addressed. This overview is based on a tutorial that the authors presented at the VLDB Conference, 1996.

1. Introduction

Data warehousing is a collection of decision support technologies, aimed at enabling the knowledge worker (executive, manager, analyst) to make better and faster decisions. The past three years have seen explosive growth, both in the number of products and services offered, and in the adoption of these technologies by industry. According to the META Group, the data warehousing market, including hardware, database software, and tools, is projected to grow from $2 billion in 1995 to $8 billion in 1998. Data warehousing technologies have been successfully deployed in many industries: manufacturing (for order shipment and customer support), retail (for user profiling and inventory management), financial services (for claims analysis, risk analysis, credit card analysis, and fraud detection), transportation (for fleet management), telecommunications (for call analysis and fraud detection), utilities (for power usage analysis), and healthcare (for outcomes analysis). This paper presents a roadmap of data warehousing technologies, focusing on the special requirements that data warehouses place on database management systems (DBMSs).

A data warehouse is a “subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.”1 Typically, the data warehouse is maintained separately from the organization’s operational databases. There are many reasons for doing this. The data warehouse supports on-line analytical processing (OLAP), the functional and performance requirements of which are quite different from those of the on-line transaction processing (OLTP) applications traditionally supported by the operational databases.

OLTP applications typically automate clerical data processing tasks such as order entry and banking transactions that are the bread-and-butter day-to-day operations of an organization. These tasks are structured and repetitive, and consist of short, atomic, isolated transactions. The transactions require detailed, up-to-date data, and read or update a few (tens of) records accessed typically on their primary keys. Operational databases tend to be hundreds of megabytes to gigabytes in size. Consistency and recoverability of the database are critical, and maximizing transaction throughput is the key performance metric. Consequently, the database is designed to reflect the operational semantics of known applications, and, in particular, to minimize concurrency conflicts.

Data warehouses, in contrast, are targeted for decision support. Historical, summarized and consolidated data is more important than detailed, individual records. Since data warehouses contain consolidated data, perhaps from several operational databases, over potentially long periods of time, they tend to be orders of magnitude larger than operational databases; enterprise data warehouses are projected to be hundreds of gigabytes to terabytes in size. The workloads are query intensive with mostly ad hoc, complex queries that can access millions of records and perform a lot of scans, joins, and aggregates. Query throughput and response times are more important than transaction throughput.

To facilitate complex analyses and visualization, the data in a warehouse is typically modeled multidimensionally. For example, in a sales data warehouse, time of sale, sales district, salesperson, and product might be some of the dimensions of interest. Often, these dimensions are hierarchical; time of sale may be organized as a day-month-quarter-year hierarchy, product as a product-category-industry hierarchy. Typical
OLAP operations include rollup (increasing the level of aggregation) and drill-down (decreasing the level of aggregation or increasing detail) along one or more dimension hierarchies, slice_and_dice (selection and projection), and pivot (re-orienting the multidimensional view of data).

Given that operational databases are finely tuned to support known OLTP workloads, trying to execute complex OLAP queries against the operational databases would result in unacceptable performance. Furthermore, decision support requires data that might be missing from the operational databases; for instance, understanding trends or making predictions requires historical data, whereas operational databases store only current data. Decision support usually requires consolidating data from many heterogeneous sources: these might include external sources such as stock market feeds, in addition to several operational databases. The different sources might contain data of varying quality, or use inconsistent representations, codes and formats, which have to be reconciled. Finally, supporting the multidimensional data models and operations typical of OLAP requires special data organization, access methods, and implementation methods, not generally provided by commercial DBMSs targeted for OLTP. It is for all these reasons that data warehouses are implemented separately from operational databases.

Data warehouses might be implemented on standard or extended relational DBMSs, called Relational OLAP (ROLAP) servers. These servers assume that data is stored in relational databases, and they support extensions to SQL and special access and implementation methods to efficiently implement the multidimensional data model and operations. In contrast, multidimensional OLAP (MOLAP) servers are servers that directly store multidimensional data in special data structures (e.g., arrays) and implement the OLAP operations over these special data structures.

There is more to building and maintaining a data warehouse than selecting an OLAP server and defining a schema and some complex queries for the warehouse. Different architectural alternatives exist. Many organizations want to implement an integrated enterprise warehouse that collects information about all subjects (e.g., customers, products, sales, assets, personnel) spanning the whole organization. However, building an enterprise warehouse is a long and complex process, requiring extensive business modeling, and may take many years to succeed. Some organizations are settling for data marts instead, which are departmental subsets focused on selected subjects (e.g., a marketing data mart may include customer, product, and sales information). These data marts enable faster roll out, since they do not require enterprise-wide consensus, but they may lead to complex integration problems in the long run, if a complete business model is not developed.

In Section 2, we describe a typical data warehousing architecture, and the process of designing and operating a data warehouse. In Sections 3-7, we review relevant technologies for loading and refreshing data in a data warehouse, warehouse servers, front end tools, and warehouse management tools. In each case, we point out what is different from traditional database technology, and we mention representative products. In this paper, we do not intend to provide comprehensive descriptions of all products in every category. We encourage the interested reader to look at recent issues of trade magazines such as Databased Advisor, Database Programming and Design, Datamation, and DBMS Magazine, and vendors’ Web sites for more details of commercial products, white papers, and case studies. The OLAP Council2 is a good source of information on standardization efforts across the industry, and a paper by Codd, et al.3 defines twelve rules for OLAP products. Finally, a good source of references on data warehousing and OLAP is the Data Warehousing Information Center4.

Research in data warehousing is fairly recent, and has focused primarily on query processing and view maintenance issues. There still are many open research problems. We conclude in Section 8 with a brief mention of these issues.

2. Architecture and End-to-End Process

Figure 1 shows a typical data warehousing architecture.

[Figure 1. Data Warehousing Architecture: operational databases and external sources feed extract/transform/load/refresh tools that populate the data warehouse and data marts; a metadata repository and monitoring and administration tools sit alongside; OLAP servers serve the data to analysis, query/reporting, and data mining front end tools.]

It includes tools for extracting data from multiple operational databases and external sources; for cleaning, transforming and integrating this data; for loading data into the data warehouse; and for periodically refreshing the warehouse to reflect updates at the sources and to purge data from the warehouse, perhaps onto slower archival storage. In addition to the main warehouse, there may be several departmental data marts. Data in the warehouse and data marts is stored and managed by one or more warehouse servers, which present multidimensional views of data to a variety of front end tools: query tools, report writers, analysis tools, and data mining tools. Finally, there is a repository for storing and
managing metadata, and tools for monitoring and There are three related, but somewhat different, classes of
administering the warehousing system. data cleaning tools. Data migration tools allow simple
transformation rules to be specified; e.g., “replace the string
The warehouse may be distributed for load balancing, gender by sex”. Warehouse Manager from Prism is an
scalability, and higher availability. In such a distributed example of a popular tool of this kind. Data scrubbing tools
architecture, the metadata repository is usually replicated with use domain-specific knowledge (e.g., postal addresses) to do
each fragment of the warehouse, and the entire warehouse is the scrubbing of data. They often exploit parsing and fuzzy
administered centrally. An alternative architecture, matching techniques to accomplish cleaning from multiple
implemented for expediency when it may be too expensive to sources. Some tools make it possible to specify the “relative
construct a single logically integrated enterprise warehouse, is cleanliness” of sources. Tools such as Integrity and Trillum
a federation of warehouses or data marts, each with its own fall in this category. Data auditing tools make it possible to
repository and decentralized administration. discover rules and relationships (or to signal violation of
stated rules) by scanning data. Thus, such tools may be
Designing and rolling out a data warehouse is a complex
considered variants of data mining tools. For example, such a
process, consisting of the following activities5.
tool may discover a suspicious pattern (based on statistical
• Define the architecture, do capacity planning, and select analysis) that a certain car dealer has never received any
the storage servers, database and OLAP servers, and complaints.
tools.
• Integrate the servers, storage, and client tools. Load
• Design the warehouse schema and views. After extracting, cleaning and transforming, data must be
• Define the physical warehouse organization, data loaded into the warehouse. Additional preprocessing may still
placement, partitioning, and access methods. be required: checking integrity constraints; sorting;
summarization, aggregation and other computation to build
• Connect the sources using gateways, ODBC drivers, or the derived tables stored in the warehouse; building indices
other wrappers. and other access paths; and partitioning to multiple target
• Design and implement scripts for data extraction, storage areas. Typically, batch load utilities are used for this
cleaning, transformation, load, and refresh. purpose. In addition to populating the warehouse, a load
• Populate the repository with the schema and view utility must allow the system administrator to monitor status,
definitions, scripts, and other metadata. to cancel, suspend and resume a load, and to restart after
• Design and implement end-user applications. failure with no loss of data integrity.
• Roll out the warehouse and applications.
The load utilities for data warehouses have to deal with much
larger data volumes than for operational databases. There is
3. Back End Tools and Utilities only a small time window (usually at night) when the
Data warehousing systems use a variety of data extraction and warehouse can be taken offline to refresh it. Sequential loads
cleaning tools, and load and refresh utilities for populating can take a very long time, e.g., loading a terabyte of data can
warehouses. Data extraction from “foreign” sources is usually take weeks and months! Hence, pipelined and partitioned
implemented via gateways and standard interfaces (such as parallelism are typically exploited 6. Doing a full load has the
Information Builders EDA/SQL, ODBC, Oracle Open advantage that it can be treated as a long batch transaction
Connect, Sybase Enterprise Connect, Informix Enterprise that builds up a new database. While it is in progress, the
Gateway). current database can still support queries; when the load
transaction commits, the current database is replaced with the
Data Cleaning new one. Using periodic checkpoints ensures that if a failure
Since a data warehouse is used for decision making, it is occurs during the load, the process can restart from the last
important that the data in the warehouse be correct. However, checkpoint.
since large volumes of data from multiple sources are
involved, there is a high probability of errors and anomalies However, even using parallelism, a full load may still take too
in the data.. Therefore, tools that help to detect data long. Most commercial utilities (e.g., RedBrick Table
anomalies and correct them can have a high payoff. Some Management Utility) use incremental loading during refresh
examples where data cleaning becomes necessary are: to reduce the volume of data that has to be incorporated into
inconsistent field lengths, inconsistent descriptions, the warehouse. Only the updated tuples are inserted.
inconsistent value assignments, missing entries and violation However, the load process now is harder to manage. The
of integrity constraints. Not surprisingly, optional fields in incremental load conflicts with ongoing queries, so it is
data entry forms are significant sources of inconsistent data. treated as a sequence of shorter transactions (which commit
periodically, e.g., after every 1000 records or every few
seconds), but now this sequence of transactions has to be

coordinated to ensure consistency of derived data and indices correct updates for incrementally updating derived data
with the base data. (materialized views) has been the subject of much research 7 8
9 10
. For data warehousing, the most significant classes of
Refresh derived data are summary tables, single-table indices and
Refreshing a warehouse consists in propagating updates on join indices.
source data to correspondingly update the base data and
derived data stored in the warehouse. There are two sets of
issues to consider: when to refresh, and how to refresh. 4. Conceptual Model and Front End Tools
Usually, the warehouse is refreshed periodically (e.g., daily or
weekly). Only if some OLAP queries need current data (e.g.,
up to the minute stock quotes), is it necessary to propagate A popular conceptual model that influences the front-end
every update. The refresh policy is set by the warehouse tools, database design, and the query engines for OLAP is the
administrator, depending on user needs and traffic, and may multidimensional view of data in the warehouse. In a
be different for different sources. multidimensional data model, there is a set of numeric
measures that are the objects of analysis. Examples of such
Refresh techniques may also depend on the characteristics of measures are sales, budget, revenue, inventory, ROI (return
the source and the capabilities of the database servers. on investment). Each of the numeric measures depends on a
Extracting an entire source file or database is usually too set of dimensions, which provide the context for the measure.
expensive, but may be the only choice for legacy data For example, the dimensions associated with a sale amount
sources. Most contemporary database systems provide can be the city, product name, and the date when the sale was
replication servers that support incremental techniques for made. The dimensions together are assumed to uniquely
propagating updates from a primary database to one or more determine the measure. Thus, the multidimensional data
replicas. Such replication servers can be used to views a measure as a value in the multidimensional space of
incrementally refresh a warehouse when the sources change. dimensions. Each dimension is described by a set of
There are two basic replication techniques: data shipping and attributes. For example, the Product dimension may consist of
transaction shipping. four attributes: the category and the industry of the product,
year of its introduction, and the average profit margin. For
In data shipping (e.g., used in the Oracle Replication Server, example, the soda Surge belongs to the category beverage
Praxis OmniReplicator), a table in the warehouse is treated as and the food industry, was introduced in 1996, and may have
a remote snapshot of a table in the source database. After_row an average profit margin of 80%. The attributes of a
triggers are used to update a snapshot log table whenever the dimension may be related via a hierarchy of relationships. In
source table changes; and an automatic refresh schedule (or a the above example, the product name is related to its category
manual refresh procedure) is then set up to propagate the and the industry attribute through such a hierarchical
updated data to the remote snapshot. relationship.

In transaction shipping (e.g., used in the Sybase Replication Dimensions: Product, City, Date
W Hierarchical summarization paths
Server and Microsoft SQL Server), the regular transaction log S
ty
Ci

N Industry Country Year


is used, instead of triggers and a special snapshot log table. Juice
Product

10
At the source site, the transaction log is sniffed to detect Cola 50

Milk Category State Quarter


updates on replicated tables, and those log records are 20

Cream 12
transferred to a replication server, which packages up the Toothpaste 15 Product City Month Week
corresponding transactions to update the replicas. Transaction Soap 10

shipping has the advantage that it does not require triggers, 1 2 3 4 5 67 Date
which can increase the workload on the operational source Date
databases. However, it cannot always be used easily across Figure 2. Multidimensional data
DBMSs from different vendors, because there are no standard
APIs for accessing the transaction log. Another distinctive feature of the conceptual model for
OLAP is its stress on aggregation of measures by one or
Such replication servers have been used for refreshing data more dimensions as one of the key operations; e.g.,
warehouses. However, the refresh cycles have to be properly computing and ranking the total sales by each county (or by
chosen so that the volume of data does not overwhelm the each year). Other popular operations include comparing two
incremental load utility. measures (e.g., sales and budget) aggregated by the same
dimensions. Time is a dimension that is of particular
In addition to propagating changes to the base data in the significance to decision support (e.g., trend analysis). Often,
warehouse, the derived data also has to be updated it is desirable to have built-in knowledge of calendars and
correspondingly. The problem of constructing logically other aspects of the time dimension.

data. These applications often use raw data access tools and
optimize the access patterns depending on the back end
Front End Tools database server. In addition, there are query environments
(e.g., Microsoft Access) that help build ad hoc SQL queries
The multidimensional data model grew out of the view of
by “pointing-and-clicking”. Finally, there are a variety of
business data popularized by PC spreadsheet programs that
data mining tools that are often used as front end tools to data
were extensively used by business analysts. The spreadsheet
warehouses.
is still the most compelling front-end application for OLAP.
The challenge in supporting a query environment for OLAP
can be crudely summarized as that of supporting spreadsheet 5. Database Design Methodology
operations efficiently over large multi-gigabyte databases.
Indeed, the Essbase product of Arbor Corporation uses The multidimensional data model described above is
Microsoft Excel as the front-end tool for its multidimensional implemented directly by MOLAP servers. We will describe
engine. these briefly in the next section. However, when a relational
ROLAP server is used, the multidimensional model and its
We shall briefly discuss some of the popular operations that operations have to be mapped into relations and SQL queries.
are supported by the multidimensional spreadsheet In this section, we describe the design of relational database
applications. One such operation is pivoting. Consider the schemas that reflect the multidimensional views of data.
multidimensional schema of Figure 2 represented in a
spreadsheet where each row corresponds to a sale . Let there Entity Relationship diagrams and normalization techniques
be one column for each dimension and an extra column that are popularly used for database design in OLTP
represents the amount of sale. The simplest view of pivoting environments. However, the database designs recommended
is that it selects two dimensions that are used to aggregate a by ER diagrams are inappropriate for decision support
measure, e.g., sales in the above example. The aggregated systems where efficiency in querying and in loading data
values are often displayed in a grid where each value in the (including incremental loads) are important.
(x,y) coordinate corresponds to the aggregated value of the
Most data warehouses use a star schema to represent the
measure when the first dimension has the value x and the
multidimensional data model. The database consists of a
second dimension has the value y. Thus, in our example, if
single fact table and a single table for each dimension. Each
the selected dimensions are city and year, then the x-axis may
tuple in the fact table consists of a pointer (foreign key - often
represent all values of city and the y-axis may represent the
uses a generated key for efficiency) to each of the dimensions
years. The point (x,y) will represent the aggregated sales for
that provide its multidimensional coordinates, and stores the
city x in the year y. Thus, what were values in the original
numeric measures for those coordinates. Each dimension
spreadsheets have now become row and column headers in
table consists of columns that correspond to attributes of the
the pivoted spreadsheet.
dimension. Figure 3 shows an example of a star schema.

Other operators related to pivoting are rollup or drill-down.


Rollup corresponds to taking the current data object and
doing a further group-by on one of the dimensions. Thus, it is Order ProdNo
possible to roll-up the sales data, perhaps already aggregated OrderNo ProdName
OrderDate ProdDescr
on city, additionally by product. The drill-down operation is Fact table Category
the converse of rollup. Slice_and_dice corresponds to CategoryDescr
Customer OrderNo
reducing the dimensionality of the data, i.e., taking a UnitPrice
CustomerNo SalespersonID
projection of the data on a subset of dimensions for selected CustomerNo QOH
CustomerName
values of the other dimensions. For example, we can ProdNo Date
CustomerAddress
slice_and_dice sales data for a specific product to create a City DateKey DateKey
table that consists of the dimensions city and the day of sale. CityName Date
The other popular operators include ranking (sorting), Salesperson Quantity Month
TotalPrice Year
selections and defining computed attributes. SalespersonID
City
SalespesonName
City CityName
Although the multidimensional spreadsheet has attracted a lot Quota State
of interest since it empowers the end user to analyze business Country
data, this has not replaced traditional analysis by means of a Figure 3. A Star Schema.
managed query environment. These environments use stored
procedures and predefined complex queries to provide
packaged analysis tools. Such tools often make it possible for Star schemas do not explicitly provide support for attribute
the end-user to query in terms of domain-specific business hierarchies. Snowflake schemas provide a refinement of star

schemas where the dimensional hierarchy is explicitly


represented by normalizing the dimension tables, as shown in Data warehouses may contain large volumes of data. To
Figure 4. This leads to advantages in maintaining the answer queries efficiently, therefore, requires highly efficient
dimension tables. However, the denormalized structure of the access methods and query processing techniques. Several
dimensional tables in star schemas may be more appropriate issues arise. First, data warehouses use redundant structures
for browsing the dimensions. such as indices and materialized views. Choosing which
indices to build and which views to materialize is an
Fact constellations are examples of more complex structures important physical design problem. The next challenge is to
in which multiple fact tables share dimensional tables. For effectively use the existing indices and materialized views to
example, projected expense and the actual expense may form answer queries. Optimization of complex queries is another
a fact constellation since they share many dimensions. important problem. Also, while for data-selective queries,
efficient index scans may be very effective, data-intensive
Order Category
queries need the use of sequential scans. Thus, improving the
ProdNo efficiency of scans is important. Finally, parallelism needs to
OrderNo CategoryName
ProdName
OrderDate CategoryDescr be exploited to reduce query response times. In this short
ProdDescr
Fact table
Category paper, it is not possible to elaborate on each of these issues.
Customer OrderNo UnitPrice
CustomerNo SalespersonID QOH Therefore, we will only briefly touch upon the highlights.
CustomerName CustomerNo
CustomerAddress DateKey Date Month Year
City CityName
DateKey Month
Index Structures and their Usage
ProdNo
Date Year A number of query processing techniques that exploit indices
Salesperson Quantity
Month
TotalPrice are useful. For instance, the selectivities of multiple
SalespersonID
City State
SalespesonName conditions can be exploited through index intersection. Other
City CityName
Quota State useful index operations are union of indexes. These index
operations can be used to significantly reduce and in many
Figure 4. A Snowflake Schema.
cases eliminate the need to access the base tables.

In addition to the fact and dimension tables, data warehouses


Warehouse servers can use bit map indices, which support
store selected summary tables containing pre-aggregated data.
efficient index operations (e.g., union, intersection). Consider
In the simplest cases, the pre-aggregated data corresponds to
a leaf page in an index structure corresponding to a domain
aggregating the fact table on one or more selected
value d. Such a leaf page traditionally contains a list of the
dimensions. Such pre-aggregated summary data can be
record ids (RIDs) of records that contain the value d.
represented in the database in at least two ways. Let us
However, bit map indices use an alternative representation of
consider the example of a summary table that has total sales
the above RID list as a bit vector that has one bit for each
by product by year in the context of the star schema of Figure
record, which is set when the domain value for that record is
3. We can represent such a summary table by a separate fact
d. In a sense, the bit map index is not a new index structure,
table which shares the dimension Product and also a separate
but simply an alternative representation of the RID list. The
shrunken dimension table for time, which consists of only the
popularity of the bit map index is due to the fact that the bit
attributes of the dimension that make sense for the summary
vector representation of the RID lists can speed up index
table (i.e., year). Alternatively, we can represent the summary
intersection, union, join, and aggregation11. For example, if
table by encoding the aggregated tuples in the same fact table
we have a query of the form column1 = d & column2 = d’,
and the same dimension tables without adding new tables.
then we can identify the qualifying records by taking the
This may be accomplished by adding a new level field to each
AND of the two bit vectors. While such representations can
dimension and using nulls: We can encode a day, a month or
be very useful for low cardinality domains (e.g., gender), they
a year in the Date dimension table as follows: (id0, 0, 22, 01,
can also be effective for higher cardinality domains through
1960) represents a record for Jan 22, 1960, (id1, 1, NULL,
compression of bitmaps (e.g., run length encoding). Bitmap
01, 1960) represents the month Jan 1960 and (id2, 2, NULL,
indices were originally used in Model 204, but many products
NULL, 1960) represents the year 1960. The second attribute
support them today (e.g., Sybase IQ). An interesting question
represents the new attribute level: 0 for days, 1 for months, 2
is to decide on which attributes to index. In general, this is
for years. In the fact table, a record containing the foreign key
really a question that must be answered by the physical
id2 represents the aggregated sales for a Product in the year
database design process.
1960. The latter method, while reducing the number of tables,
is often a source of operational errors since the level field
needs be carefully interpreted. In addition to indices on single tables, the specialized nature
of star schemas makes join indices especially attractive for
decision support. While traditionally indices map the value in
6. Warehouse Servers a column to a list of rows with that value, a join index

In addition to indices on single tables, the specialized nature of star schemas makes join indices especially attractive for decision support. While traditionally indices map the value in a column to a list of rows with that value, a join index maintains the relationships between a foreign key with its matching primary keys. In the context of a star schema, a join index can relate the values of one or more attributes of a dimension table to matching rows in the fact table. For example, consider the schema of Figure 3. There can be a join index on City that maintains, for each city, a list of RIDs of the tuples in the fact table that correspond to sales in that city. Thus a join index essentially precomputes a binary join. Multikey join indices can represent precomputed n-way joins. For example, over the Sales database it is possible to construct a multidimensional join index from (Cityname, Productname) to the fact table. Thus, the index entry for (Seattle, jacket) points to RIDs of those tuples in the Sales table that have the above combination. Using such a multidimensional join index can sometimes provide savings over taking the intersection of separate indices on Cityname and Productname. Join indices can be used with bitmap representations for the RID lists for efficient join processing [12].
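A sketch of what such a multikey join index precomputes; the mapping table, the rid pseudo-column, and all join-column names are illustrative rather than a product-specific feature:

-- Conceptually, the (Cityname, Productname) join index is this precomputed mapping
-- from dimension attribute values to fact-table row identifiers (RIDs).
CREATE TABLE city_product_jix AS
SELECT c.Cityname, p.Productname, s.rid   -- rid: however the system exposes row ids
FROM   Sales   s
JOIN   City    c ON s.city_id    = c.city_id
JOIN   Product p ON s.product_id = p.product_id;

-- Probing the entry (Seattle, jacket) then yields the matching Sales RIDs directly,
-- instead of intersecting separate indices on Cityname and Productname.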
Finally, decision support databases contain a significant amount of descriptive text and so indices to support text search are useful as well.

Materialized Views and their Usage

Many queries over data warehouses require summary data, and, therefore, use aggregates. Hence, in addition to indices, materializing summary data can help to accelerate many common queries. For example, in an investment environment, a large majority of the queries may be based on the performance of the most recent quarter and the current fiscal year. Having summary data on these parameters can significantly speed up query processing.

The challenges in exploiting materialized views are not unlike those in using indices: (a) identify the views to materialize, (b) exploit the materialized views to answer queries, and (c) efficiently update the materialized views during load and refresh. The currently adopted industrial solutions to these problems consider materializing views that have a relatively simple structure. Such views consist of joins of the fact table with a subset of dimension tables (possibly after some selections on those dimensions), with the aggregation of one or more measures grouped by a set of attributes from the dimension tables. The structure of these views is a little more complex when the underlying schema is a snowflake.

Despite the restricted form, there is still a wide choice of views to materialize. The selection of views to materialize must take into account workload characteristics, the costs for incremental update, and upper bounds on storage requirements. Under simplifying assumptions, a greedy algorithm was shown to have good performance [13]. A related problem that underlies optimization as well as choice of materialized views is that of estimating the effect of aggregation on the cardinality of the relations.
A simple, but extremely useful, strategy for using a materialized view is to use selection on the materialized view, or rollup on the materialized view by grouping and aggregating on additional columns. For example, assume that a materialized view contains the total sales by quarter for each product. This materialized view can be used to answer a query that requests the total sales of Levi's jeans for the year by first applying the selection and then rolling up from quarters to years. It should be emphasized that the ability to do roll-up from a partially aggregated result relies on algebraic properties of the aggregating functions (e.g., Sum can be rolled up, but some other statistical function may not be).
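A sketch of this quarter-to-year roll-up, assuming a pre-aggregated view qtr_sales(product, year, quarter, total_sales) holding total sales by product and quarter; the view, its definition, and the product literal are illustrative:

-- Assumed pre-aggregated view (shown only for context):
--   CREATE VIEW qtr_sales AS
--   SELECT product, year, quarter, SUM(amount) AS total_sales
--   FROM   Sales
--   GROUP BY product, year, quarter;

-- Yearly total for Levi's jeans, answered entirely from the view by first
-- selecting and then rolling the quarterly totals up to the year:
SELECT year, SUM(total_sales) AS yearly_sales
FROM   qtr_sales
WHERE  product = 'Levi''s jeans'
GROUP BY year;
-- SUM rolls up correctly; a holistic statistic such as MEDIAN could not be
-- recomputed from the quarterly totals in this way.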
In general, there may be several candidate materialized views that can be used to answer a query. If a view V has the same set of dimensions as Q, if the selection clause in Q implies the selection clause in V, and if the group-by columns in V are a subset of the group-by columns in Q, then view V can act as a generator of Q. Given a set of materialized views M and a query Q, we can define a set of minimal generators M' for Q (i.e., the smallest set of generators such that all other generators generate some member of M'). There can be multiple minimal generators for a query. For example, given a query that asks for total sales of clothing in Washington State, the following two views are both generators: (a) total sales by each state for each product, and (b) total sales by each city for each category. The notion of minimal generators can be used by the optimizer to narrow the search for the appropriate materialized view to use. On the commercial side, HP Intelligent Warehouse pioneered the use of minimal generators to answer queries. While we have defined the notion of a generator in a restricted way, the general problem of optimizing queries in the presence of multiple materialized views is more complex. In the special case of Select-Project-Join queries, there has been some work in this area [14, 15, 16].
Transformation of Complex SQL Queries

The problem of finding efficient techniques for processing complex queries has been of keen interest in query optimization. In a way, decision support systems provide a testing ground for some of the ideas that have been studied before. We will only summarize some of the key contributions.

There has been substantial work on "unnesting" complex SQL queries containing nested subqueries by translating them into single block SQL queries when certain syntactic restrictions are satisfied [17, 18, 19, 20]. Another direction that has been pursued in optimizing nested subqueries is reducing the number of invocations and batching invocation of inner subqueries by semi-join like techniques [21, 22].
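A minimal sketch of such an unnesting, with hypothetical Customer and Orders tables; the rewrite shown is the textbook transformation rather than one taken from the cited papers:

-- Nested form: conceptually one inner-query evaluation per outer row.
SELECT c.cust_id, c.name
FROM   Customer c
WHERE  c.cust_id IN (SELECT o.cust_id
                     FROM   Orders o
                     WHERE  o.amount > 1000);

-- Unnested single-block form: a join followed by duplicate elimination on the
-- outer key (cust_id is assumed to be the key of Customer), which the optimizer
-- can now reorder and execute as a (semi-)join.
SELECT DISTINCT c.cust_id, c.name
FROM   Customer c
JOIN   Orders   o ON o.cust_id = c.cust_id
WHERE  o.amount > 1000;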
Likewise, the problem of flattening queries containing views has been a topic of interest. The case where participating views are SPJ queries is well understood. The problem is more complex when one or more of the views contain aggregation [23]. Naturally, this problem is closely related to the problem of commuting group-by and join operators. However, commuting group-by and join is applicable in the context of single block SQL queries as well [24, 25, 26]. An overview of the field appears in a recent paper [27].

Parallel Processing

Parallelism plays a significant role in processing massive databases. Teradata pioneered some of the key technology. All major vendors of database management systems now offer data partitioning and parallel query processing technology. The article by DeWitt and Gray provides an overview of this area [28]. One interesting technique relevant to the read-only environment of decision support systems is that of piggybacking scans requested by multiple queries (used in Redbrick). Piggybacking scans reduces the total work as well as response time by overlapping scans of multiple concurrent requests.

Server Architectures for Query Processing

Traditional relational servers were not geared towards the intelligent use of indices and other requirements for supporting multidimensional views of data. However, all relational DBMS vendors have now moved rapidly to support these additional requirements. In addition to the traditional relational servers, there are three other categories of servers that were developed specifically for decision support.

• Specialized SQL Servers: Redbrick is an example of this class of servers. The objective here is to provide advanced query language and query processing support for SQL queries over star and snowflake schemas in read-only environments.

• ROLAP Servers: These are intermediate servers that sit between a relational back end server (where the data in the warehouse is stored) and client front end tools. Microstrategy is an example of such servers. They extend traditional relational servers with specialized middleware to efficiently support multidimensional OLAP queries, and they typically optimize for specific back end relational servers. They identify the views that are to be materialized, rephrase given user queries in terms of the appropriate materialized views, and generate multi-statement SQL for the back end server. They also provide additional services such as scheduling of queries and resource assignment (e.g., to prevent runaway queries). There has also been a trend to tune the ROLAP servers for domain specific ROLAP tools. The main strength of ROLAP servers is that they exploit the scalability and the transactional features of relational systems. However, intrinsic mismatches between OLAP-style querying and SQL (e.g., lack of sequential processing, column aggregation) can cause performance bottlenecks for OLAP servers.

• MOLAP Servers: These servers directly support the multidimensional view of data through a multidimensional storage engine. This makes it possible to implement front-end multidimensional queries on the storage layer through direct mapping. An example of such a server is Essbase (Arbor). Such an approach has the advantage of excellent indexing properties, but provides poor storage utilization, especially when the data set is sparse. Many MOLAP servers adopt a two-level storage representation to adapt to sparse data sets and use compression extensively. In the two-level storage representation, a set of one or two dimensional subarrays that are likely to be dense are identified, through the use of design tools or by user input, and are represented in the array format. Then, the traditional indexing structure is used to index onto these "smaller" arrays. Many of the techniques that were devised for statistical databases appear to be relevant for MOLAP servers.

SQL Extensions

Several extensions to SQL that facilitate the expression and processing of OLAP queries have been proposed or implemented in extended relational servers. Some of these extensions are described below.

• Extended family of aggregate functions: These include support for rank and percentile (e.g., all products in the top 10 percentile or the top 10 products by total Sale) as well as support for a variety of functions used in financial analysis (mean, mode, median).

• Reporting Features: The reports produced for business analysis often require aggregate features evaluated on a time window, e.g., moving average. In addition, it is important to be able to provide breakpoints and running totals. Redbrick's SQL extensions provide such primitives.

• Multiple Group-By: Front end tools such as multidimensional spreadsheets require grouping by different sets of attributes. This can be simulated by a set of SQL statements that require scanning the same data set multiple times, but this can be inefficient. Recently, two new operators, Rollup and Cube, have been proposed to augment SQL to address this problem [29]. Thus, Rollup of the list of attributes (Product, Year, City) over a data set results in answer sets with the following applications of group by: (a) group by (Product, Year, City), (b) group by (Product, Year), and (c) group by Product. On the other hand, given a list of k columns, the Cube operator provides a group-by for each of the 2^k combinations of columns. Such multiple group-by operations can be executed efficiently by recognizing commonalities among them [30]. Microsoft SQL Server supports Cube and Rollup; a short sketch of the Rollup form appears after this list.
• Comparisons: An article by Ralph Kimball and Kevin Strehlo provides an excellent overview of the deficiencies of SQL in being able to do comparisons that are common in the business world, e.g., compare the difference between the total projected sale and total actual sale by each quarter, where projected sale and actual sale are columns of a table [31]. A straightforward execution of such queries may require multiple sequential scans. The article provides some alternatives to better support comparisons. A recent research paper also addresses the question of how to do comparisons among aggregated values by extending SQL [32].
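As promised in the Multiple Group-By item above, here is a sketch of Rollup using the syntax later standardized and supported by products such as Microsoft SQL Server; the Sales table and its columns are assumed:

SELECT Product, Year, City, SUM(amount) AS total_sales
FROM   Sales
GROUP BY ROLLUP (Product, Year, City);
-- One scan produces the rows of GROUP BY (Product, Year, City),
-- GROUP BY (Product, Year), and GROUP BY (Product); in this syntax the
-- result also includes a grand-total row over the whole table.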
7. Metadata and Warehouse Management

Since a data warehouse reflects the business model of an enterprise, an essential element of a warehousing architecture is metadata management. Many different kinds of metadata have to be managed. Administrative metadata includes all of the information necessary for setting up and using a warehouse: descriptions of the source databases, back-end and front-end tools; definitions of the warehouse schema, derived data, dimensions and hierarchies, predefined queries and reports; data mart locations and contents; physical organization such as data partitions; data extraction, cleaning, and transformation rules; data refresh and purging policies; and user profiles, user authorization and access control policies. Business metadata includes business terms and definitions, ownership of the data, and charging policies. Operational metadata includes information that is collected during the operation of the warehouse: the lineage of migrated and transformed data; the currency of data in the warehouse (active, archived or purged); and monitoring information such as usage statistics, error reports, and audit trails.

Often, a metadata repository is used to store and manage all the metadata associated with the warehouse. The repository enables the sharing of metadata among tools and processes for designing, setting up, using, operating, and administering a warehouse. Commercial examples include Platinum Repository and Prism Directory Manager.

Creating and managing a warehousing system is hard. Many different classes of tools are available to facilitate different aspects of the process described in Section 2. Development tools are used to design and edit schemas, views, scripts, rules, queries, and reports. Planning and analysis tools are used for what-if scenarios such as understanding the impact of schema changes or refresh rates, and for doing capacity planning. Warehouse management tools (e.g., HP Intelligent Warehouse Advisor, IBM Data Hub, Prism Warehouse Manager) are used for monitoring a warehouse, reporting statistics and making suggestions to the administrator: usage of partitions and summary tables, query execution times, types and frequencies of drill downs or rollups, which users or groups request which data, peak and average workloads over time, exception reporting, detecting runaway queries, and other quality of service metrics. System and network management tools (e.g., HP OpenView, IBM NetView, Tivoli) are used to measure traffic between clients and servers, between warehouse servers and operational databases, and so on. Finally, only recently have workflow management tools been considered for managing the extract-scrub-transform-load-refresh process. The steps of the process can invoke appropriate scripts stored in the repository, and can be launched periodically, on demand, or when specified events occur. The workflow engine ensures successful completion of the process, persistently records the success or failure of each step, and provides failure recovery with partial roll back, retry, or roll forward.

8. Research Issues

We have described the substantial technical challenges in developing and deploying decision support systems. While many commercial products and services exist, there are still several interesting avenues for research. We will only touch on a few of these here.

Data cleaning is a problem that is reminiscent of heterogeneous data integration, a problem that has been studied for many years. But here the emphasis is on data inconsistencies instead of schema inconsistencies. Data cleaning, as we indicated, is also closely related to data mining, with the objective of suggesting possible inconsistencies.

The problem of physical design of data warehouses should rekindle interest in the well-known problems of index selection, data partitioning and the selection of materialized views. However, while revisiting these problems, it is important to recognize the special role played by aggregation. Decision support systems already provide the field of query optimization with increasing challenges in the traditional questions of selectivity estimation and cost-based algorithms that can exploit transformations without exploding the search space (there are plenty of transformations, but few reliable cost estimation techniques and few smart cost-based algorithms/search strategies to exploit them). Partitioning the functionality of the query engine between the middleware (e.g., ROLAP layer) and the back end server is also an interesting problem.

The management of data warehouses also presents new challenges. Detecting runaway queries, and managing and scheduling resources are problems that are important but have not been well solved. Some work has been done on the
logical correctness of incrementally updating materialized views, but the performance, scalability, and recoverability properties of these techniques have not been investigated. In particular, failure and checkpointing issues in load and refresh in the presence of many indices and materialized views need further research. The adaptation and use of workflow technology might help, but this needs further investigation.

Some of these areas are being pursued by the research community [33, 34], but others have received only cursory attention, particularly in relationship to data warehousing.

Acknowledgement

We thank Goetz Graefe for his comments on the draft.
References

[1] Inmon, W.H. Building the Data Warehouse. John Wiley, 1992.
[2] http://www.olapcouncil.org
[3] Codd, E.F., S.B. Codd, C.T. Salley. "Providing OLAP (On-Line Analytical Processing) to User Analyst: An IT Mandate." Available from Arbor Software's web site http://www.arborsoft.com/OLAP.html.
[4] http://pwp.starnetinc.com/larryg/articles.html
[5] Kimball, R. The Data Warehouse Toolkit. John Wiley, 1996.
[6] Barclay, T., R. Barnes, J. Gray, P. Sundaresan. "Loading Databases using Dataflow Parallelism." SIGMOD Record, Vol. 23, No. 4, Dec. 1994.
[7] Blakeley, J.A., N. Coburn, P. Larson. "Updating Derived Relations: Detecting Irrelevant and Autonomously Computable Updates." ACM TODS, Vol. 4, No. 3, 1989.
[8] Gupta, A., I.S. Mumick. "Maintenance of Materialized Views: Problems, Techniques, and Applications." Data Eng. Bulletin, Vol. 18, No. 2, June 1995.
[9] Zhuge, Y., H. Garcia-Molina, J. Hammer, J. Widom. "View Maintenance in a Warehousing Environment." Proc. of SIGMOD Conf., 1995.
[10] Roussopoulos, N., et al. "The Maryland ADMS Project: Views R Us." Data Eng. Bulletin, Vol. 18, No. 2, June 1995.
[11] O'Neil, P., D. Quass. "Improved Query Performance with Variant Indices." To appear in Proc. of SIGMOD Conf., 1997.
[12] O'Neil, P., G. Graefe. "Multi-Table Joins through Bitmapped Join Indices." SIGMOD Record, Sep 1995.
[13] Harinarayan, V., A. Rajaraman, J.D. Ullman. "Implementing Data Cubes Efficiently." Proc. of SIGMOD Conf., 1996.
[14] Chaudhuri, S., R. Krishnamurthy, S. Potamianos, K. Shim. "Optimizing Queries with Materialized Views." Intl. Conference on Data Engineering, 1995.
[15] Levy, A., A. Mendelzon, Y. Sagiv. "Answering Queries Using Views." Proc. of PODS, 1995.
[16] Yang, H.Z., P.A. Larson. "Query Transformations for PSJ Queries." Proc. of VLDB, 1987.
[17] Kim, W. "On Optimizing an SQL-like Nested Query." ACM TODS, Sep 1982.
[18] Ganski, R., H.K.T. Wong. "Optimization of Nested SQL Queries Revisited." Proc. of SIGMOD Conf., 1987.
[19] Dayal, U. "Of Nests and Trees: A Unified Approach to Processing Queries that Contain Nested Subqueries, Aggregates and Quantifiers." Proc. VLDB Conf., 1987.
[20] Muralikrishna, M. "Improved Unnesting Algorithms for Join Aggregate SQL Queries." Proc. VLDB Conf., 1992.
[21] Seshadri, P., H. Pirahesh, T. Leung. "Complex Query Decorrelation." Intl. Conference on Data Engineering, 1996.
[22] Mumick, I.S., H. Pirahesh. "Implementation of Magic Sets in Starburst." Proc. of SIGMOD Conf., 1994.
[23] Chaudhuri, S., K. Shim. "Optimizing Queries with Aggregate Views." Proc. of EDBT, 1996.
[24] Chaudhuri, S., K. Shim. "Including Group-By in Query Optimization." Proc. of VLDB, 1994.
[25] Yan, P., P.A. Larson. "Eager Aggregation and Lazy Aggregation." Proc. of VLDB, 1995.
[26] Gupta, A., V. Harinarayan, D. Quass. "Aggregate-Query Processing in Data Warehouse Environments." Proc. of VLDB, 1995.
[27] Chaudhuri, S., K. Shim. "An Overview of Cost-based Optimization of Queries with Aggregates." IEEE Data Engineering Bulletin, Sep 1995.
[28] DeWitt, D.J., J. Gray. "Parallel Database Systems: The Future of High Performance Database Systems." CACM, June 1992.
[29] Gray, J., et al. "Data Cube: A Relational Aggregation Operator Generalizing Group-by, Cross-Tab and Sub Totals." Data Mining and Knowledge Discovery Journal, Vol. 1, No. 1, 1997.
[30] Agrawal, S., et al. "On the Computation of Multidimensional Aggregates." Proc. of VLDB Conf., 1996.
[31] Kimball, R., K. Strehlo. "Why Decision Support Fails and How to Fix It." Reprinted in SIGMOD Record, 24(3), 1995.
[32] Chatziantoniou, D., K. Ross. "Querying Multiple Features in Relational Databases." Proc. of VLDB Conf., 1996.
[33] Widom, J. "Research Problems in Data Warehousing." Proc. 4th Intl. CIKM Conf., 1995.
[34] Wu, M-C., A.P. Buchmann. "Research Issues in Data Warehousing." Submitted for publication.
Data Mining and Knowledge Discovery 1, 29–53 (1997)

c 1997 Kluwer Academic Publishers. Manufactured in The Netherlands.

Data Cube: A Relational Aggregation Operator


Generalizing Group-By, Cross-Tab, and Sub-Totals∗
JIM GRAY Gray@Microsoft.com
SURAJIT CHAUDHURI SurajitC@Microsoft.com
ADAM BOSWORTH AdamB@Microsoft.com
ANDREW LAYMAN AndrewL@Microsoft.com
DON REICHART DonRei@Microsoft.com
MURALI VENKATRAO MuraliV@Microsoft.com
Microsoft Research, Advanced Technology Division, Microsoft Corporation, One Microsoft Way, Redmond,
WA 98052

FRANK PELLOW Pellow@vnet.IBM.com


HAMID PIRAHESH Pirahesh@Almaden.IBM.com
IBM Research, 500 Harry Road, San Jose, CA 95120

Editor: Usama Fayyad

Received July 2, 1996; Revised November 5, 1996; Accepted November 6, 1996

Abstract. Data analysis applications typically aggregate data across many dimensions looking for anomalies
or unusual patterns. The SQL aggregate functions and the GROUP BY operator produce zero-dimensional or
one-dimensional aggregates. Applications need the N -dimensional generalization of these operators. This paper
defines that operator, called the data cube or simply cube. The cube operator generalizes the histogram, cross-
tabulation, roll-up, drill-down, and sub-total constructs found in most report writers. The novelty is that cubes
are relations. Consequently, the cube operator can be imbedded in more complex non-procedural data analysis
programs. The cube operator treats each of the N aggregation attributes as a dimension of N -space. The aggregate
of a particular set of attribute values is a point in this space. The set of points forms an N -dimensional cube.
Super-aggregates are computed by aggregating the N -cube to lower dimensional spaces. This paper (1) explains
the cube and roll-up operators, (2) shows how they fit in SQL, (3) explains how users can define new aggregate
functions for cubes, and (4) discusses efficient techniques to compute the cube. Many of these features are being
added to the SQL Standard.

Keywords: data cube, data mining, aggregation, summarization, database, analysis, query

1. Introduction

Data analysis applications look for unusual patterns in data. They categorize data values and
trends, extract statistical information, and then contrast one category with another. There
are four steps to such data analysis:
• formulating a query that extracts relevant data from a large database,
• extracting the aggregated data from the database into a file or table,
• visualizing the results in a graphical way, and
• analyzing the results and formulating a new query.

∗ An extended abstract of this paper appeared in Gray et al. (1996).

Visualization tools display data trends, clusters, and differences. Some of the most
exciting work in visualization focuses on presenting new graphical metaphors that allow
people to discover data trends and anomalies. Many of these visualization and data analysis
tools represent the dataset as an N -dimensional space. Visualization tools render two and
three-dimensional sub-slabs of this space as 2D or 3D objects.
Color and time (motion) add two more dimensions to the display giving the potential for
a 5D display. A spreadsheet application such as Excel is an example of a data visualiza-
tion/analysis tool that is used widely. Data analysis tools often try to identify a subspace of
the N -dimensional space which is “interesting” (e.g., discriminating attributes of the data
set).
Thus, visualization as well as data analysis tools do “dimensionality reduction”, often
by summarizing data along the dimensions that are left out. For example, in trying to
analyze car sales, we might focus on the role of model, year and color of the cars in sale.
Thus, we ignore the differences between two sales along the dimensions of date of sale or
dealership but analyze the total sales for cars by model, by year and by color only. Along
with summarization and dimensionality reduction, data analysis applications extensively
use constructs such as histogram, cross-tabulation, subtotals, roll-up and drill-down.
This paper examines how a relational engine can support efficient extraction of infor-
mation from a SQL database that matches the above requirements of the visualization and
data analysis. We begin by discussing the relevant features in Standard SQL and some
vendor-specific SQL extensions. Section 2 discusses why GROUP BY fails to adequately
address the requirements. The CUBE and the ROLLUP operators are introduced in Section 3
and we also discuss how these operators overcome some of the shortcomings of GROUP
BY. Sections 4 and 5 discuss how we can address and compute the Cube.

Figure 1. Data analysis tools facilitate the Extract-Visualize-Analyze loop. The cube and roll-up operators along
with system and user-defined aggregates are part of the extraction process.

Table 1. Weather

Time (UCT)     Latitude    Longitude     Altitude (m)   Temp. (c)   Pres. (mb)
96/6/1:1500    37:58:33N   122:45:28W    102            21          1009
(Many more rows like the ones above and below)
96/6/7:1500    34:16:18N   27:05:55W     10             23          1024

1.1. Relational and SQL data extraction

How do traditional relational databases fit into this multi-dimensional data analysis picture?
How can 2D flat files (SQL tables) model an N -dimensional problem? Furthermore, how
do the relational systems support operations over N -dimensional representations that are
central to visualization and data analysis programs?
We address two issues in this section. The answer to the first question is that relational
systems model N -dimensional data as a relation with N -attribute domains. For example,
4-dimensional (4D) earth temperature data is typically represented by a Weather table
(Table 1). The first four columns represent the four dimensions: latitude, longitude, altitude,
and time. Additional columns represent measurements at the 4D points such as temperature,
pressure, humidity, and wind velocity. Each individual weather measurement is recorded
as a new row of this table. Often these measured values are aggregates over time (the hour)
or space (a measurement area centered on the point).
As mentioned in the introduction, visualization and data analysis tools extensively use di-
mensionality reduction (aggregation) for better comprehensibility. Often data along the other
dimensions that are not included in a “2-D” representation are summarized via aggregation
in the form of histogram, cross-tabulation, subtotals etc. In the SQL Standard, we depend
on aggregate functions and the GROUP BY operator to support aggregation.
The SQL standard (IS 9075 International Standard for Database Language SQL, 1992)
provides five functions to aggregate the values in a table: COUNT(), SUM(), MIN(),
MAX(), and AVG( ). For example, the average of all measured temperatures is expressed as:

SELECT AVG(Temp)
FROM Weather;

In addition, SQL allows aggregation over distinct values. The following query counts
the distinct number of reporting times in the Weather table:

SELECT COUNT(DISTINCT Time)


FROM Weather;

Aggregate functions return a single value. Using the GROUP BY construct, SQL can also
create a table of many aggregate values indexed by a set of attributes. For example, the

Figure 2. The GROUP BY relational operator partitions a table into groups. Each group is then aggregated by a
function. The aggregation function summarizes some column of groups returning a value for each group.

following query reports the average temperature for each reporting time and altitude:
SELECT Time, Altitude, AVG(Temp)
FROM Weather
GROUP BY Time, Altitude;

GROUP BY is an unusual relational operator: It partitions the relation into disjoint tuple
sets and then aggregates over each set as illustrated in figure 2.
SQL’s aggregation functions are widely used in database applications. This popularity is
reflected in the presence of aggregates in a large number of queries in the decision-support
benchmark TPC-D (The Benchmark Handbook for Database and Transaction Processing
Systems, 1993). The TPC-D query set has one 6D GROUP BY and three 3D GROUP BYs.
One and two dimensional GROUP BYs are the most common. Surprisingly, aggregates ap-
pear in the TPC online-transaction processing benchmarks as well (TPC-A, B and C). Table 2
shows how frequently the database and transaction processing benchmarks use aggregation
and GROUP BY. A detailed description of these benchmarks is beyond the scope of the paper
(see (Gray, 1991) and (The Benchmark Handbook for Database and Transaction Processing
Systems, 1993).
Table 2. SQL aggregates in standard benchmarks.

Benchmark Queries Aggregates GROUP BYs

TPC-A, B 1 0 0
TPC-C 18 4 0
TPC-D 16 27 15
Wisconsin 18 3 2
AS3 AP 23 20 2
SetQuery 7 5 1

1.2. Extensions in some SQL systems

Beyond the five standard aggregate functions defined so far, many SQL systems add sta-
tistical functions (median, standard deviation, variance, etc.), physical functions (center of

mass, angular momentum, etc.), financial analysis (volatility, Alpha, Beta, etc.), and other
domain-specific functions.
Some systems allow users to add new aggregation functions. The Informix Illustra
system, for example, allows users to add aggregate functions by adding a program with the
following three callbacks to the database system (DataBlade Developer’s Kit):

Init(&handle): Allocates the handle and initializes the aggregate computation.


Iter(&handle, value): Aggregates the next value into the current aggregate.
value = Final(&handle): Computes and returns the resulting aggregate by using data
saved in the handle. This invocation deallocates the handle.

Consider implementing the Average() function. The handle stores the count and
the sum initialized to zero. When passed a new non-null value, Iter() increments the
count and adds the value to the sum. The Final() call deallocates the handle and returns
sum divided by count. IBM’s DB2 Common Server (Chamberlin, 1996) has a similar
mechanism. This design has been added to the Draft Proposed standard for SQL (1997).
Red Brick systems, one of the larger UNIX OLAP vendors, adds some interesting ag-
gregate functions that enhance the GROUP BY mechanism (RISQL Reference Guide, Red
Brick Warehouse VPT, 1994):

Rank(expression): Returns the expressions rank in the set of all values of this domain
of the table. If there are N values in the column, and this is the highest value, the rank
is N , if it is the lowest value the rank is 1.
N_tile(expression, n): The range of the expression (over all the input values of the
table) is computed and divided into n value ranges of approximately equal population. The
function returns the number of the range containing the expression’s value. If your bank
account was among the largest 10% then your N_tile(account.balance,10) would
return 10. Red Brick provides just N_tile(expression,3).
Ratio_To_Total(expression): Sums all the expressions. Then for each instance,
divides the expression instance by the total sum.

To give an example, the following SQL statement

SELECT Percentile, MIN(Temp), MAX(Temp)


FROM Weather
GROUP BY N_tile(Temp,10) as Percentile
HAVING Percentile = 5;

returns one row giving the minimum and maximum temperatures of the middle 10% of all
temperatures.
Red Brick also offers three cumulative aggregates that operate on ordered tables.

Cumulative(expression): Sums all values so far in an ordered list.


Running_Sum(expression,n): Sums the most recent n values in an ordered list. The
initial n-1 values are NULL.
Running_Average(expression,n): Averages the most recent n values in an ordered
list. The initial n-1 values are NULL.

These aggregate functions are optionally reset each time a grouping value changes in an
ordered selection.

2. Problems with GROUP BY

Certain common forms of data analysis are difficult with these SQL aggregation constructs.
As explained next, three common problems are: (1) Histograms, (2) Roll-up Totals and
Sub-Totals for drill-downs, (3) Cross Tabulations.
The standard SQL GROUP BY operator does not allow a direct construction of histograms
(aggregation over computed categories). For example, for queries based on the Weather
table, it would be nice to be able to group times into days, weeks, or months, and to group
locations into areas (e.g., US, Canada, Europe,...). If a Nation() function maps latitude
and longitude into the name of the country containing that location, then the following
query would give the daily maximum reported temperature for each nation.

SELECT day, nation, MAX(Temp)


FROM Weather
GROUP BY Day(Time) AS day,
Nation(Latitude, Longitude) AS nation;

Some SQL systems support histograms directly but the standard does not1 . In standard
SQL, histograms are computed indirectly from a table-valued expression which is then
aggregated. The following statement demonstrates this SQL92 construct using nested
queries.

SELECT day, nation, MAX(Temp)


FROM (SELECT Day(Time) AS day,
Nation(Latitude, Longitude) AS nation,
Temp
FROM Weather
) AS foo
GROUP BY day, nation;

A more serious problem, and the main focus of this paper, relates to roll-ups using totals
and sub-totals for drill-down reports. Reports commonly aggregate data at a coarse level,
and then at successively finer levels. The car sales report in Table 3 shows the idea (this
and other examples are based on the sales summary data in the table in figure 4). Data
is aggregated by Model, then by Year, then by Color. The report shows data aggregated
at three levels. Going up the levels is called rolling-up the data. Going down is called
drilling-down into the data. Data aggregated at each distinct level produces a sub-total.
Table 3a suggests creating 2^N aggregation columns for a roll-up of N elements. Indeed,
Chris Date recommends this approach (Date, 1996). His design gives rise to Table 3b.
The representation of Table 3a is not relational because the empty cells (presumably
NULL values), cannot form a key. Representation 3b is an elegant solution to this problem,
but we rejected it because it implies enormous numbers of domains in the resulting tables.

Table 3a. Sales Roll Up by Model by Year by Color.

Model   Year   Color   Sales by Model     Sales by Model   Sales
                       by Year by Color   by Year          by Model
Chevy   1994   Black   50
               White   40
                                          90
        1995   Black   85
               White   115
                                          200
                                                           290

Table 3b. Sales Roll-Up by Model by Year by Color as recommended by Chris Date (Date, 1996).

Model   Year   Color   Sales   Sales by Model   Sales
                               by Year          by Model
Chevy   1994   Black   50      90               290
Chevy   1994   White   40      90               290
Chevy   1995   Black   85      200              290
Chevy   1995   White   115     200              290

Table 4. An Excel pivot table representation of Table 3 with Ford sales data included.

Sum of sales   Year/Color
               1994                        1995
Model          Black   White   1994 total  Black   White   1995 total   Grand total
Chevy          50      40      90          85      115     200          290
Ford           50      10      60          85      75      160          220
Grand total    100     50      150         170     190     360          510

We were intimidated by the prospect of adding 64 columns to the answer set of a 6D TPC-D
query. The representation of Table 3b is also not convenient—the number of columns grows
as the power set of the number of aggregated attributes, creating difficult naming problems
and very long names. The approach recommended by Date is reminiscent of pivot tables
found in Excel (and now all other spreadsheets) (Microsoft Excel, 1995), a popular data
analysis feature of Excel2 .
Table 4 is an alternative representation of Table 3a (with Ford Sales data included) that
illustrates how a pivot table in Excel can present the Sales data by Model, by Year, and then
by Color. The pivot operator transposes a spreadsheet: typically aggregating cells based on
values in the cells. Rather than just creating columns based on subsets of column names,
pivot creates columns based on subsets of column values. This is a much larger set. If one

pivots on two columns containing N and M values, the resulting pivot table has N × M
values. We cringe at the prospect of so many columns and such obtuse column names.
Rather than extend the result table to have many new columns, a more conservative ap-
proach prevents the exponential growth of columns by overloading column values. The idea
is to introduce an ALL value. Table 5a demonstrates this relational and more convenient rep-
resentation. The dummy value “ALL” has been added to fill in the super-aggregation items:
Table 5a is not really a completely new representation or operation. Since Table 5a is a
relation, it is not surprising that it can be built using standard SQL. The SQL statement to
build this SalesSummary table from the raw Sales data is:

SELECT ‘ALL’, ‘ALL’, ‘ALL’, SUM(Sales)


FROM Sales
WHERE Model = ‘Chevy’
UNION
SELECT Model, ‘ALL’, ‘ALL’, SUM(Sales)
FROM Sales
WHERE Model = ‘Chevy’
GROUP BY Model
UNION
SELECT Model, Year, ‘ALL’, SUM(Sales)
FROM Sales
WHERE Model = ‘Chevy’
GROUP BY Model, Year
UNION
SELECT Model, Year, Color, SUM(Sales)
FROM Sales
WHERE Model = ‘Chevy’
GROUP BY Model, Year, Color;

This is a simple 3-dimensional roll-up. Aggregating over N dimensions requires N such


unions.

Table 5a. Sales summary.

Model Year Color Units

Chevy 1994 Black 50


Chevy 1994 White 40
Chevy 1994 ALL 90
Chevy 1995 Black 85
Chevy 1995 White 115
Chevy 1995 ALL 200
Chevy ALL ALL 290

Roll-up is asymmetric—notice that Table 5a aggregates sales by year but not by color.
These missing rows are shown in Table 5b.

Table 5b. Sales summary rows missing from Table 5a to convert the roll-up into a cube.

Model Year Color Units

Chevy ALL Black 135


Chevy ALL White 155

These additional rows could be captured by adding the following clause to the SQL
statement above:

UNION
SELECT Model, ‘ALL’, Color, SUM(Sales)
FROM Sales
WHERE Model = ‘Chevy’
GROUP BY Model, Color;

The symmetric aggregation result is a table called a cross-tabulation, or cross tab for
short. Tables 5a and 5b are the relational form of the crosstabs, but crosstab data is routinely
displayed in the more compact format of Table 6.
This cross tab is a two-dimensional aggregation. If other automobile models are added,
it becomes a 3D aggregation. For example, data for Ford products adds an additional cross
tab plane.
The cross-tab-array representation (Tables 6a and b) is equivalent to the relational repre-
sentation using the ALL value. Both generalize to an N -dimensional cross tab. Most report
writers build in a cross-tabs feature, building the report up from the underlying tabular
data such as Table 5. See for example the TRANSFORM-PIVOT operator of Microsoft Ac-
cess (Microsoft Access Relational Database Management System for Windows, Language
Reference, 1994).

Table 6a. Chevy sales cross tab.

Chevy 1994 1995 Total (ALL)

Black 50 85 135
White 40 115 155
Total (ALL) 90 200 290

Table 6b. Ford sales cross tab.

Ford 1994 1995 Total (ALL)

Black 50 85 135
White 10 75 85
Total (ALL) 60 160 220

The representation suggested by Table 5 and unioned GROUP BYs “solve” the problem of
representing aggregate data in a relational data model. The problem remains that expressing
roll-up, and cross-tab queries with conventional SQL is daunting. A six dimension cross-
tab requires a 64-way union of 64 different GROUP BY operators to build the underlying
representation.
There is another very important reason why it is inadequate to use GROUP BYs. The
resulting representation of aggregation is too complex to analyze for optimization. On most
SQL systems this will result in 64 scans of the data, 64 sorts or hashes, and a long wait.

3. CUBE and ROLLUP operators

The generalization of group by, roll-up and cross-tab ideas seems obvious: Figure 3 shows
the concept for aggregation up to 3-dimensions. The traditional GROUP BY generates the
N -dimensional data cube core. The N − 1 lower-dimensional aggregates appear as points,
lines, planes, cubes, or hyper-cubes hanging off the data cube core.
The data cube operator builds a table containing all these aggregate values. The total
aggregate using function f() is represented as the tuple:

ALL, ALL, ALL, . . . , ALL, f(*)

Points in higher dimensional planes or cubes have fewer ALL values.

Figure 3. The CUBE operator is the N -dimensional generalization of simple aggregate functions. The 0D data
cube is a point. The 1D data cube is a line with a point. The 2D data cube is a cross tabulation, a plane, two lines,
and a point. The 3D data cube is a cube with three intersecting 2D cross tabs.

Figure 4. A 3D data cube (right) built from the table at the left by the CUBE statement at the top of the figure.

Creating a data cube requires generating the power set (set of all subsets) of the aggrega-
tion columns. Since the CUBE is an aggregation operation, it makes sense to externalize it
by overloading the SQL GROUP BY operator. In fact, the cube is a relational operator, with
GROUP BY and ROLL UP as degenerate forms of the operator. This can be conveniently
specified by overloading the SQL GROUP BY3 .
Figure 4 has an example of the cube syntax. To give another, here follows a statement to
aggregate the set of temperature observations:

SELECT day, nation, MAX(Temp)


FROM Weather
GROUP BY CUBE
Day(Time) AS day,
Country(Latitude, Longitude)
AS nation;

The semantics of the CUBE operator are that it first aggregates over all the <select
list> attributes in the GROUP BY clause as in a standard GROUP BY. Then, it UNIONs
in each super-aggregate of the global cube—substituting ALL for the aggregation columns.
If there are N attributes in the <select list>, there will be 2^N − 1 super-aggregate
values. If the cardinalities of the N attributes are C1, C2, . . . , CN, then the cardinality of the

resulting cube relation is Π(Ci + 1). The extra value in each domain is ALL. For example,
the SALES table has 2 × 3 × 3 = 18 rows, while the derived data cube has 3 × 4 × 4 = 48
rows.
If the application wants only a roll-up or drill-down report, similar to the data in Table 3a,
the full cube is overkill. Indeed, some parts of the full cube may be meaningless. If the
answer set is not normalized, there may be functional dependencies among columns.
For example, a date functionally defines a week, month, and year. Roll-ups by year, week,
day are common, but a cube on these three attributes would be meaningless.
The solution is to offer ROLLUP in addition to CUBE. ROLLUP produces just the super-
aggregates:

(v1, v2, ..., vn,  f()),
(v1, v2, ..., ALL, f()),
...
(v1, ALL, ..., ALL, f()),
(ALL, ALL, ..., ALL, f()).
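For concreteness, here is a minimal sketch of a ROLLUP over the car sales data of figure 4, written in the syntax this paper proposes in Section 3.2 (products that later adopted ROLLUP require parentheses, e.g., GROUP BY ROLLUP(Model, Year, Color)):

SELECT Model, Year, Color, SUM(Sales) AS Sales
FROM   Sales
GROUP BY ROLLUP Model, Year, Color;
-- Besides the core GROUP BY rows, the answer contains only the super-aggregates
-- (Model, Year, ALL, ...), (Model, ALL, ALL, ...), and (ALL, ALL, ALL, ...);
-- restricted to Chevy these are exactly the ALL rows of Table 5a, plus the
-- grand total over all models.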

Cumulative aggregates, like running sum or running average, work especially well with
ROLLUP because the answer set is naturally sequential (linear) while the full data
cube is naturally non-linear (multi-dimensional). ROLLUP and CUBE must be ordered for
cumulative operators to apply.
We investigated letting the programmer specify the exact list of super-aggregates but
encountered complexities related to collation, correlation, and expressions. We believe
ROLLUP and CUBE will serve the needs of most applications.

3.1. The GROUP, CUBE, ROLLUP algebra

The GROUP BY, ROLLUP, and CUBE operators have an interesting algebra. The CUBE of a
ROLLUP or GROUP BY is a CUBE. The ROLLUP of a GROUP BY is a ROLLUP. Algebraically,
this operator algebra can be stated as:

CUBE(ROLLUP) = CUBE
ROLLUP(GROUP BY) = ROLLUP

So it makes sense to arrange the aggregation operators in the compound order where the
“most powerful” cube operator at the core, then a roll-up of the cubes and then a group by
of the roll-ups. Of course, one can use any subset of the three operators:

GROUP BY <select list>


ROLLUP <select list>
CUBE <select list>

The following SQL demonstrates a compound aggregate. The “shape” of the answer is
diagrammed in figure 5:

Figure 5. The combination of a GROUP BY on Manufacture, ROLLUP on year, month, day, and CUBE on some
attributes. The aggregate values are the contents of the cube.

SELECT Manufacturer, Year, Month, Day, Color, Model,


SUM(price) AS Revenue
FROM Sales
GROUP BY Manufacturer,
ROLLUP Year(Time) AS Year,
Month(Time) AS Month,
Day(Time) AS Day,
CUBE Color, Model;

3.2. A syntax proposal

With these concepts in place, the syntactic extension to SQL is fairly easily defined. The
current SQL GROUP BY syntax is:

GROUP BY
{<column name> [collate clause] ,...}

To support histograms and other function-valued aggregations, we first extend the GROUP
BY syntax to:

GROUP BY <aggregation list>


<aggregation list> ::=
{ ( <column name> | <expression> )
[ AS <correlation name> ]
[ <collate clause> ]
,...}

These extensions are independent of the CUBE operator. They remedy some pre-existing
problems with GROUP BY. Many systems already allow these extensions.

Now extend SQL’s GROUP BY operator:


GROUP BY [ <aggregation list> ]
[ ROLLUP <aggregation list> ]
[ CUBE <aggregation list> ]

3.3. A discussion of the ALL value

Is the ALL value really needed? Each ALL value really represents a set—the set over which
the aggregate was computed4 . In the Table 5 SalesSummary data cube, the respective
sets are:
Model.ALL = ALL(Model) = {Chevy, Ford}
Year.ALL = ALL(Year) = {1990,1991,1992}
Color.ALL = ALL(Color) = {red,white,blue}

In reality, we have stumbled in to the world of nested relations—relations can be values.


This is a major step for relational systems. There is much debate on how to proceed. In this
section, we briefly discuss the semantics of ALL in the context of SQL. This design may be
eased by SQL3’s support for set-valued variables and domains.
We can interpret each ALL value as a context-sensitive token representing the set it
represents. Thinking of the ALL value as the corresponding set defines the semantics of the
relational operators (e.g., equals and IN). A function ALL() generates the set associated
with this value as in the examples above. ALL() applied to any other value returns NULL.
The introduction of ALL creates substantial complexity. We do not add it lightly—adding
it touches many aspects of the SQL language. To name a few:
• ALL becomes a new keyword denoting the set value.
• ALL [NOT] ALLOWED is added to the column definition syntax and to the column
attributes in the system catalogs.
• The set interpretation guides the meaning of the relational operators {=, IN}.
There are more such rules, but this gives a hint of the added complexity. As an aside, to
be consistent, if ALL represents a set then the other values of that domain must be treated
as singleton sets in order to have uniform operators on the domain.
It is convenient to know when a column value is an aggregate. One way to test this is
to apply the ALL() function to the value and test for a non-NULL value. This is so useful
that we propose a Boolean function GROUPING() that, given a select list element, returns
TRUE if the element is an ALL value, and FALSE otherwise.

3.4. Avoiding the ALL value

Veteran SQL implementers will be terrified of the ALL value—like NULL, it will create
many special cases. Furthermore, the proposal in Section 3.3. requires understanding of
sets as values. If the goal is to help report writer and GUI visualization software, then it
may be simpler to adopt the following approach5 :

• Use the NULL value in place of the ALL value.


• Do not implement the ALL() function.
• Implement the GROUPING() function to discriminate between NULL and ALL.

In this minimalist design, tools and users can simulate the ALL value, for example, as follows:

SELECT Model,Year,Color,SUM(sales),
GROUPING(Model),
GROUPING(Year),
GROUPING(Color)
FROM Sales
GROUP BY CUBE Model, Year, Color;

Wherever the ALL value appeared before, now the corresponding value will be NULL in the
data field and TRUE in the corresponding grouping field. For example, the global sum of
figure 4 will be the tuple:

(NULL,NULL,NULL,941,TRUE,TRUE,TRUE)

rather than the tuple one would get with the “real” cube operator:

(ALL, ALL, ALL, 941).

Using the limited interpretation of ALL as above excludes expressing some meaningful
queries ( just as traditional relational model makes it hard to handle disjunctive information).
However, the proposal makes it possible to express results of CUBE as a single relation in
the current framework of SQL.

3.5. Decorations

The next step is to allow decorations, columns that do not appear in the GROUP BY but that
are functionally dependent on the grouping columns. Consider the example:

SELECT department.name, sum(sales)


FROM sales JOIN department USING (department_number)
GROUP BY sales.department_number;

The department.name column in the answer set is not allowed in current SQL, since
it is neither an aggregation column (appearing in the GROUP BY list) nor is it an aggregate.
It is just there to decorate the answer set with the name of the department. We recommend
the rule that if a decoration column (or column value) is functionally dependent on the
aggregation columns, then it may be included in the SELECT answer list.
Decoration’s interact with aggregate values. If the aggregate tuple functionally defines
the decoration value, then the value appears in the resulting tuple. Otherwise the decoration

field is NULL. For example, in the following query the continent is not specified unless
nation is.

SELECT day,nation,MAX(Temp),
continent(nation) AS continent
FROM Weather
GROUP BY CUBE
Day(Time) AS day,
Country(Latitude, Longitude)
AS nation

The query would produce the sample tuples:

Table 7. Demonstrating decorations and ALL.

day nation max(temp) continent

25/1/1995 USA 28 North America


ALL USA 37 North America
25/1/1995 ALL 41 NULL
ALL ALL 48 NULL

3.6. Dimensions star, and snowflake queries

While strictly not part of the CUBE and ROLLUP operator design, there is an important
database design concept that facilitates the use of aggregation operations. It is common to
record events and activities with a detailed record giving all the dimensions of the event.
For example, the sales item record in figure 6 gives the id of the buyer, seller, the product
purchased, the units purchased, the price, the date and the sales office that is credited with
the sale. There are probably many more dimensions about this sale, but this example gives
the idea.

Figure 6. A snowflake schema showing the core fact table and some of the many aggregation granularities of
the core dimensions.

There are side tables that for each dimension value give its attributes. For example,
the San Francisco sales office is in the Northern California District, the Western Region,
and the US Geography. This fact would be stored in a dimension table for the Office6 .
The dimension table may also have decorations describing other attributes of that Office.
These dimension tables define a spectrum of aggregation granularities for the dimension.
Analysts might want to cube various dimensions and then aggregate or roll up the cube
at any or all of these granularities.
The general schema of figure 6 is so common that it has been given a name: a snowflake
schema. Simpler schemas that have a single dimension table for each dimension are called a
star schema. Queries against these schemas are called snowflake queries and star queries
respectively.
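A minimal sketch of such a schema in SQL DDL; the tables and columns are illustrative and do not reproduce Figure 6 exactly:

-- Fact table: one row per sales item, keyed by its dimensions.
CREATE TABLE SalesItem (
    buyer_id   INTEGER,
    seller_id  INTEGER,
    product_id INTEGER,
    office_id  INTEGER,
    sale_date  DATE,
    units      INTEGER,
    price      DECIMAL(10,2)
);

-- One dimension table; its columns give the coarser aggregation granularities
-- for the Office dimension (city, district, region, geography).
CREATE TABLE Office (
    office_id  INTEGER PRIMARY KEY,
    city       VARCHAR(40),   -- e.g., San Francisco
    district   VARCHAR(40),   -- e.g., Northern California District
    region     VARCHAR(40),   -- e.g., Western Region
    geography  VARCHAR(40)    -- e.g., US
);

A star schema keeps a single such table per dimension; a snowflake schema would instead normalize the district, region, and geography levels into tables of their own.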
The diagram of figure 6 suggests that the granularities form a pure hierarchy. In reality,
the granularities typically form a lattice. To take just a very simple example, days nest in
weeks but weeks do not nest in months or quarters or years (some weeks are partly in two
years). Analysts often think of dates in terms of weekdays, weekends, sale days, various
holidays (e.g., Christmas and the time leading up to it). So a fuller granularity graph of
figure 6 would be quite complex. Fortunately, graphical tools like pivot tables with pull
down lists of categories hide much of this complexity from the analyst.

4. Addressing the data cube

Section 5 discusses how to compute data cubes and how users can add new aggregate
operators. This section considers extensions to SQL syntax to easily access the elements
of a data cube—making it recursive and allowing aggregates to reference sub-aggregates.
It is not clear where to draw the line between the reporting-visualization tool and the
query tool. Ideally, application designers should be able to decide how to split the function
between the query system and the visualization tool. Given that perspective, the SQL
system must be a Turing-complete programming environment.
SQL3 defines a Turing-complete procedural programming language. So, anything is
possible. But, many things are not easy. Our task is to make simple and common things easy.
The most common request is for percent-of-total as an aggregate function. In SQL this
is computed with a nested SELECT statement.

SELECT Model,Year,Color,SUM(Sales),
SUM(Sales)/
(SELECT SUM(Sales)
FROM Sales
WHERE Model IN {‘Ford’,‘Chevy’}
AND Year BETWEEN 1990 AND 1992
)
FROM Sales
WHERE Model IN { ‘Ford’, ‘Chevy’ }
AND Year BETWEEN 1990 AND 1992
GROUP BY CUBE Model, Year, Color;

It seems natural to allow the shorthand syntax to name the global aggregate:

SELECT Model, Year, Color


SUM(Sales) AS total,
SUM(Sales) / total(ALL,ALL,ALL)
FROM Sales
WHERE Model IN {‘Ford’, ‘Chevy’}
AND Year BETWEEN 1990 AND 1992
GROUP BY CUBE Model, Year, Color;

This leads into deeper water. The next step is a desire to compute the index of a value—an
indication of how far the value is from the expected value. In a set of N values, one expects
each item to contribute one N th to the sum. So the 1D index of a set of values is:

index(v_i) = v_i / (Σ_j v_j)

If the value set is two dimensional, this commonly used financial function is a nightmare
of indices. It is best described in a programming language. The current approach to
selecting a field value from a 2D cube would read as:

SELECT v
FROM cube
WHERE row = :i
AND column = :j

We recommend the simpler syntax:

cube.v(:i, :j)

as a shorthand for the above selection expression. With this notation added to the SQL
programming language, it should be fairly easy to compute super-super-aggregates from
the base cube.

5. Computing cubes and roll-ups

CUBE and ROLLUP generalize aggregates and GROUP BY, so all the technology for com-
puting those results also apply to computing the core of the cube (Graefe, 1993). The basic
technique for computing a ROLLUP is to sort the table on the aggregating attributes and
then compute the aggregate functions (there is a more detailed discussion of the kind of
aggregates in a moment.) If the ROLLUP result is small enough to fit in main memory,
it can be computed by scanning the input set and applying each record to the in-memory
ROLLUP. A cube is the union of many rollups, so the naive algorithm computes this union.
As Graefe (1993) points out, the basic techniques for computing aggregates are:

• To minimize data movement and consequent processing cost, compute aggregates at the
lowest possible system level.
• If possible, use arrays or hashing to organize the aggregation columns in memory, storing
one aggregate value for each array or hash entry.
• If the aggregation values are large strings, it may be wise to keep a hashed symbol table
that maps each string to an integer so that the aggregate values are small. When a new
value appears, it is assigned a new integer. With this organization, the values become
dense and the aggregates can be stored as an N -dimensional array.
• If the number of aggregates is too large to fit in memory, use sorting or hybrid hashing to
organize the data by value and then aggregate with a sequential scan of the sorted data.
• If the source data spans many disks or nodes, use parallelism to aggregate each partition
and then coalesce these aggregates.

Some innovation is needed to compute the ‘‘ALL’’ tuples of the cube and roll-up from
the GROUP BY core. The ALL value adds one extra value to each dimension in the CUBE.
So, an N-dimensional cube of N attributes, each with cardinality C_i, will have ∏_i(C_i + 1)
values. If each C_i = 4 then a 4D CUBE is 2.4 times larger than the base GROUP BY. We
expect the C_i to be large (tens or hundreds) so that the CUBE will be only a little larger than
the GROUP BY. By comparison, an N-dimensional roll-up will add only N records to the
answer set.
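
For concreteness, the 2.4 figure is just the ratio of cube cells to core GROUP BY cells when every C_i = 4:

$$\frac{\prod_{i=1}^{4}(C_i + 1)}{\prod_{i=1}^{4} C_i} = \left(\frac{5}{4}\right)^{4} \approx 2.44.$$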
The cube operator allows many aggregate functions in the aggregation list of the GROUP
BY clause. Assume in this discussion that there is a single aggregate function F() being
computed on an N -dimensional cube. The extension to computing a list of functions is a
simple generalization.
Figure 7 summarizes how aggregate functions are defined and implemented in many
systems. It defines how the database execution engine initializes the aggregate function,
calls the aggregate functions for each new value and then invokes the aggregate function to
get the final value. More sophisticated systems allow the aggregate function to declare a
computation cost so that the query optimizer knows to minimize calls to expensive functions.
This design (except for the cost functions) is now part of the proposed SQL standard.
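
The same start()/next()/end() pattern appears in today's object-relational systems. As a hedged illustration only (PostgreSQL's CREATE AGGREGATE syntax, not the interface of the systems discussed in this paper; the names my_avg, avg_next, and avg_end are invented for the sketch), a user-defined average whose scratchpad is a (sum, count) pair might be declared as:

CREATE FUNCTION avg_next(state numeric[], v numeric) RETURNS numeric[] AS $$
  -- the "next" call: fold one value into the (sum, count) scratchpad
  -- (NULL inputs are not skipped in this sketch)
  SELECT ARRAY[state[1] + v, state[2] + 1];
$$ LANGUAGE SQL;

CREATE FUNCTION avg_end(state numeric[]) RETURNS numeric AS $$
  -- the "end" call: produce the final aggregate from the scratchpad
  SELECT CASE WHEN state[2] = 0 THEN NULL ELSE state[1] / state[2] END;
$$ LANGUAGE SQL;

CREATE AGGREGATE my_avg(numeric) (
  SFUNC     = avg_next,    -- plays the role of next()
  STYPE     = numeric[],   -- the scratchpad type
  FINALFUNC = avg_end,     -- plays the role of end()
  INITCOND  = '{0,0}'      -- the start() state: sum = 0, count = 0
);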
The simplest algorithm to compute the cube is to allocate a handle for each cube cell.
When a new tuple (x_1, x_2, . . . , x_N, v) arrives, the Iter(handle, v) function is called
2^N times—once for each handle of each cell of the cube matching this value. The 2^N
comes from the fact that each coordinate can either be x_i or ALL. When all the input tuples
have been computed, the system invokes the final(&handle) function for each of the
∏_i(C_i + 1) nodes in the cube. Call this the 2^N-algorithm. There is a corresponding order-N
algorithm for roll-up.

Figure 7. System defined and user defined aggregate functions are initialized with a start() call that allocates and
initializes a scratchpad cell to compute the aggregate. Subsequently, the next() call is invoked for each value to be
aggregated. Finally, the end() call computes the aggregate from the scratchpad values, deallocates the scratchpad
and returns the result.

If the base table has cardinality T, the 2^N-algorithm invokes the Iter() function T × 2^N
times. It is often faster to compute the super-aggregates from the core GROUP BY, reducing
the number of calls by approximately a factor of T. It is often possible to compute the cube
from the core or from intermediate results only M times larger than the core. The following
trichotomy characterizes the options in computing super-aggregates.
Consider aggregating a two dimensional set of values {X_{i,j} | i = 1, . . . , I; j = 1, . . . , J}.
Aggregate functions can be classified into three categories:

Distributive: Aggregate function F() is distributive if there is a function G() such that
F({X_{i,j}}) = G({F({X_{i,j} | i = 1, . . . , I}) | j = 1, . . . , J}). COUNT(), MIN(), MAX(),
SUM() are all distributive. In fact, F = G for all but COUNT(). G = SUM() for the
COUNT() function. Once order is imposed, the cumulative aggregate functions also fit
in the distributive class.
Algebraic: Aggregate function F() is algebraic if there is an M-tuple valued function G()
and a function H() such that F({X_{i,j}}) = H({G({X_{i,j} | i = 1, . . . , I}) | j = 1, . . . , J}).
Average(), standard deviation, MaxN(), MinN(), center of mass() are all algebraic. For
Average, the function G() records the sum and count of the subset. The H() function
adds these two components and then divides to produce the global average. Similar
techniques apply to finding the N largest values, the center of mass of a group of objects,
and other algebraic functions. The key to algebraic functions is that a fixed size result
(an M-tuple) can summarize the sub-aggregation.
Holistic: Aggregate function F() is holistic if there is no constant bound on the size of
the storage needed to describe a sub-aggregate. That is, there is no constant M, such
that an M-tuple characterizes the computation F({X_{i,j} | i = 1, . . . , I}). Median(),
MostFrequent() (also called the Mode()), and Rank() are common examples of holistic
functions.

We know of no more efficient way of computing super-aggregates of holistic functions
than the 2^N-algorithm using the standard GROUP BY techniques. We will not say more
about cubes of holistic functions.
Cubes of distributive functions are relatively easy to compute. Given that the core is
represented as an N -dimensional array in memory, each dimension having size Ci + 1, the
N − 1 dimensional slabs can be computed by projecting (aggregating) one dimension of
the core. For example the following computation aggregates the first dimension.

CUBE(ALL, x_2, . . . , x_N) = F({CUBE(i, x_2, . . . , x_N) | i = 1, . . . , C_1}).

N such computations compute the N − 1 dimensional super-aggregates. The distributive
nature of the function F() allows aggregates to be aggregated. The next step is to compute
the next lower dimension—an (...ALL,..., ALL...) case. Thinking in terms of the cross tab,
one has a choice of computing the result by aggregating the lower row, or aggregating the
right column (aggregate (ALL, ∗) or (∗, ALL)). Either approach will give the same answer.
The algorithm will be most efficient if it aggregates the smaller of the two (pick the ∗ with
the smallest Ci ). In this way, the super-aggregates can be computed dropping one dimension
at a time.
Algebraic aggregates are more difficult to compute than distributive aggregates. Recall
that an algebraic aggregate saves its computation in a handle and produces a result in the
end—at the Final() call. Average() for example maintains the count and sum values
in its handle. The super-aggregate needs these intermediate results rather than just the raw
sub-aggregate. An algebraic aggregate must maintain a handle (M-tuple) for each element
of the cube (this is a standard part of the group-by operation). When the core GROUP
BY operation completes, the CUBE algorithm passes the set of handles to each N − 1
dimensional super-aggregate. When this is done the handles of these super-aggregates are
passed to the super-super aggregates, and so on until the (ALL, ALL, . . . , ALL) aggregate
has been computed. This approach requires a new call for distributive aggregates:

Iter_super( &handle, &handle)

which folds the sub-aggregate on the right into the super aggregate on the left. The same
ordering idea (aggregate on the smallest list) applies at each higher aggregation level.
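
As a hedged illustration of the handle idea in plain SQL (reusing the paper's Sales example; a real implementation keeps the scratchpads in memory rather than in a temporary table), the core can carry the (sum, count) pair, and each super-aggregate level is then derived from those pairs alone:

-- core at the finest granularity, carrying the algebraic scratchpad (s, c)
CREATE TEMP TABLE core AS
SELECT Model, Year, Color,
       SUM(Sales) AS s,    -- running sum
       COUNT(*)   AS c     -- running count
FROM Sales
GROUP BY Model, Year, Color;

-- the (Model, Year, ALL) super-aggregates, computed from the handles only:
-- SUM is distributive (a sum of sums); AVG is algebraic (sum of sums over sum of counts)
SELECT Model, Year,
       SUM(s) AS total_sales,
       SUM(s) * 1.0 / SUM(c) AS avg_sales
FROM core
GROUP BY Model, Year;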
Interestingly, the distributive, algebraic, and holistic taxonomy is very useful in comput-
ing aggregates for parallel database systems. In those systems, aggregates are computed for
each partition of a database in parallel. Then the results of these parallel computations are
combined. The combination step is very similar to the logic and mechanism used in figure 8.
If the data cube does not fit into memory, array techniques do not work. Rather one
must either partition the cube with a hash function or sort it. These are standard techniques
for computing the GROUP BY. The super-aggregates are likely to be orders of magnitude
smaller than the core, so they are very likely to fit in memory. Sorting is especially conve-
nient for ROLLUP since the user often wants the answer set in a sorted order—so the sort
must be done anyway.
It is possible that the core of the cube is sparse. In that case, only the non-null elements
of the core and of the super-aggregates should be represented. This suggests a hashing or a
B-tree be used as the indexing scheme for aggregation values (Method and Apparatus for
Storing and Retrieving Multi-Dimensional Data in Computer Memory, 1994).

6. Maintaining cubes and roll-ups

SQL Server 6.5 has supported the CUBE and ROLLUP operators for about a year now.
We have been surprised that some customers use these operators to compute and store the
cube. These customers then define triggers on the underlying tables so that when the tables
change, the cube is dynamically updated.
This of course raises the question: how can one incrementally compute (user-defined)
aggregate functions after the cube has been materialized? Harinarayan et al. (1996) have
interesting ideas on pre-computing sub-cubes of the cube assuming all functions are
holistic. Our view is that users avoid holistic functions by using approximation techniques.
Most functions we see in practice are distributive or algebraic. For example, medians
and quartiles are approximated using statistical techniques rather than being computed
exactly.

Figure 8. (Top) computing the cube with a minimal number of calls to aggregation functions. If the aggregation
operator is algebraic or distributive, then it is possible to compute the core of the cube as usual. (Middle) then,
the higher dimensions of the cube are computed by calling the super-iterator function passing the lower-level
scratch-pads. (Bottom) once an N -dimensional space has been computed, the operation repeats to compute the
N − 1 dimensional space. This repeats until N = 0.

The discussion of distributive, algebraic, and holistic functions in the previous section
was completely focused on SELECT statements, not on UPDATE, INSERT, or DELETE
statements.
Surprisingly, the issues of maintaining a cube are quite different from computing it in
the first place. To give a simple example: it is easy to compute the maximum value in
a cube—max is a distributive function. It is also easy to propagate inserts into a “max”
N-dimensional cube. When a record is inserted into the base table, just visit the 2^N super-
aggregates of this record in the cube and take the max of the current and new value. This
computation can be shortened—if the new value “loses” one competition, then it will lose
in all lower dimensions. Now suppose a delete or update changes the largest value in the
base table. Then 2^N elements of the cube must be recomputed. The recomputation needs
to find the global maximum. This seems to require a recomputation of the entire cube. So,
max is distributive for SELECT and INSERT, but it is holistic for DELETE.
This simple example suggests that there are orthogonal hierarchies for SELECT, INSERT,
and DELETE functions (update is just delete plus insert). If a function is algebraic for insert,
update, and delete (count() and sum() are such functions), then it is easy to maintain the
cube. If the function is distributive for insert, update, and delete, then by maintaining the
scratchpads for each cell of the cube, it is fairly inexpensive to maintain the cube. If the
function is delete-holistic (as max is) then it is expensive to maintain the cube. These ideas
deserve more study.
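
To make the insert case concrete, here is a hedged sketch for a materialized three-dimensional sum cube (the sales_cube table, the host variables, and the use of NULL to stand for ALL are assumptions of the sketch, not part of the paper). Inserting one base record touches exactly 2^3 = 8 cells:

-- propagate a newly inserted Sales row (:m, :y, :c, :v) into a materialized sum cube;
-- a NULL dimension value plays the role of ALL, so each dimension matches either
-- the record's value or ALL, giving 2 x 2 x 2 = 8 updated cells
UPDATE sales_cube
   SET total = total + :v
 WHERE (Model = :m OR Model IS NULL)
   AND (Year  = :y OR Year  IS NULL)
   AND (Color = :c OR Color IS NULL);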

7. Summary

The cube operator generalizes and unifies several common and popular concepts:

aggregates,
group by,
histograms,
roll-ups and drill-downs, and
cross tabs.

The cube operator is based on a relational representation of aggregate data using the ALL
value to denote the set over which each aggregation is computed. In certain cases it makes
sense to restrict the cube operator to just a roll-up aggregation for drill-down reports.
The data cube is easy to compute for a wide class of functions (distributive and algebraic
functions). SQL’s basic set of five aggregate functions needs careful extension to include
functions such as rank, N-tile, cumulative, and percent of total to ease typical data mining
operations. These are easily added to SQL by supporting user-defined aggregates. These ex-
tensions require a new super-aggregate mechanism to allow efficient computation of cubes.

Acknowledgments

Joe Hellerstein suggested interpreting the ALL value as a set. Tanj Bennett, David Maier
and Pat O’Neil made many helpful suggestions that improved the presentation.

Notes

1. These criticisms led to a proposal to include these features in the draft SQL standard (ISO/IEC DBL:MCI-006,
1996).
2. It seems likely that a relational pivot operator will appear in database systems in the near future.
3. An earlier version of this paper (Gray et al., 1996) and the Microsoft SQL Server 6.5 product implemented
a slightly different syntax. They suffix the GROUP BY clause with a ROLLUP or CUBE modifier. The SQL
Standards body chose an infix notation so that GROUP BY and ROLLUP and CUBE could be mixed in a single
statement. The improved syntax is described here.
4. This is distinct from saying that ALL represents one of the members of the set.
5. This is the syntax and approach used by Microsoft’s SQL Server (version 6.5).
6. Database normalization rules (Date, 1995) would recommend that the California District be stored once, rather
than storing it once for each Office. So there might be an office, district, and region tables, rather than one big
denormalized table. Query users find it convenient to use the denormalized table.

References

Agrawal, R., Deshpande, P., Gupta, A., Naughton, J.F., Ramakrishnan, R., and Sarawagi, S. 1996. On the
Computation of Multidimensional Aggregates. Proc. 22nd VLDB, Bombay.
Chamberlin, D. 1996. Using the New DB2—IBM’s Object-Relational Database System. San Francisco, CA:
Morgan Kaufmann.
DataBlade Developer’s Kit: Users Guide 2.0. Informix Software, Menlo Park, CA, 1996.
Date, C.J. 1995. An Introduction to Database Systems. 6th edition. Reading, MA: Addison-Wesley.
Date, C.J. 1996. Aggregate functions. Database Programming and Design, 9(4): 17–19.
Graefe, G. 1993. Query evaluation techniques for large databases. ACM Computing Surveys, 25(2): 73–170.
Gray, J. (Ed.) 1991. The Benchmark Handbook. San Francisco, CA: Morgan Kaufmann.
Gray, J., Bosworth, A., Layman, A., and Pirahesh, H. 1996. Data cube: A relational operator generalizing group-by,
cross-tab, and roll-up. Proc. International Conf. on Data Engineering. New Orleans: IEEE Press.
Harinarayan, V., Rajaraman, A., and Ullman, J.D. 1996. Implementing data cubes efficiently. Proc. ACM SIGMOD.
Montreal, pp. 205–216.
1992. IS 9075 International Standard for Database Language SQL, document ISO/IEC 9075:1992, J. Melton (Ed.).
1996. ISO/IEC DBL:MCI-006 (ISO Working Draft) Database Language SQL—Part 4: Persistent Stored Modules
(SQL/PSM), J. Melton (Ed.).
Melton, J. and Simon, A.R. 1993. Understanding the New SQL: A Complete Guide. San Francisco, CA: Morgan
Kaufmann.
1994. Method and Apparatus for Storing and Retrieving Multi-Dimensional Data in Computer Memory. Inventor:
Earle; Robert J., Assignee: Arbor Software Corporation, US Patent 05359724.
1994. Microsoft Access Relational Database Management System for Windows, Language Reference—Functions,
Statements, Methods, Properties, and Actions, DB26142, Microsoft, Redmond, WA.
1995. Microsoft Excel—User’s Guide. Microsoft. Redmond, WA.
1996. Microsoft SQL Server: Transact-SQL Reference, Document 63900. Microsoft Corp. Redmond, WA.
1994. RISQL Reference Guide, Red Brick Warehouse VPT Version 3, Part no.: 401530, Red Brick Systems, Los
Gatos. CA.
Shukla, A., Deshpande, P., Naughton, J.F., and Ramaswamy, K. 1996. Storage estimation for multidimensional
aggregates in the presence of hierarchies. Proc. 22nd VLDB, Bombay.
1993. The Benchmark Handbook for Database and Transaction Processing Systems—2nd edition, J. Gray (Ed.),
San Francisco, CA: Morgan Kaufmann. Or http://www.tpc.org/

Jim Gray is a specialist in database and transaction processing computer systems. At Microsoft his research
focuses on scaleable computing: building super-servers and workgroup systems from commodity software and
hardware. Prior to joining Microsoft, he worked at Digital, Tandem, IBM and AT&T on database and transaction
processing systems including Rdb, ACMS, NonStopSQL, Pathway, System R, SQL/DS, DB2, and IMS-Fast Path.
He is editor of the Performance Handbook for Database and Transaction Processing Systems, and coauthor of
Transaction Processing Concepts and Techniques. He holds doctorates from Berkeley and Stuttgart, is a Member of
the National Academy of Engineering, Fellow of the ACM, a member of the National Research Council’s Computer
Science and Telecommunications Board, Editor in Chief of the VLDB Journal, Trustee of the VLDB Foundation,
and Editor of the Morgan Kaufmann series on Data Management.

Surajit Chaudhuri is a researcher in the Database Research Group of Microsoft Research. From 1992 to 1995, he
was a Member of the Technical Staff at Hewlett-Packard Laboratories, Palo Alto. He did his B.Tech. from the Indian
Institute of Technology, Kharagpur, and his Ph.D. from Stanford University. Surajit has published in SIGMOD,
VLDB and PODS in the area of optimization of queries and multimedia systems. He served in the program
committees for VLDB 1996 and International Conference on Database Theory (ICDT), 1997. He is a vice-chair
of the Program Committee for the upcoming International Conference on Data Engineering (ICDE), 1997. In
addition to query processing and optimization, Surajit is interested in the areas of data mining, database design
and uses of databases for nontraditional applications.

Adam Bosworth is General Manager (co-manager actually) of Internet Explorer 4.0. Previously General Manager
of ODBC for Microsoft and Group Program Manager for Access for Microsoft; General Manager for Quattro for
Borland.

Andrew Layman has been a Senior Program Manager at Microsoft Corp. since 1992. He is currently working
on language integration for Internet Explorer. Before that, he designed and built a number of high-performance,
data-bound Active-X controls for use across several Microsoft products and worked on the original specs for
Active-X controls (nee “OLE Controls”). Formerly he was Vice-President of Symantec.

Don Reichart is currently a software design engineer at Microsoft working in the SQL Server query engine area.
He holds a B.Sc. degree in computer science from the University of Southern California.

Murali Venkatrao is a program manager at Microsoft Corp. Currently he is working on multi-dimensional
databases and the use of relational DBMS for OLAP type applications. During his 5 years at Microsoft, he has
mainly worked on designing interfaces for heterogeneous database access. Murali’s graduate work was in the area
of computational complexity theory and its applications to real time scheduling.

Frank Pellow is a senior development analyst at the IBM Laboratory in Toronto. As an external software architect,
Frank is part of the small team responsible for the SQL language in the DB2 family of products. Most recently,
he has focused on callable SQL (CLI, ODBC) as well as on object extensions to the relational model both within
IBM and within the SQL standards bodies. Frank wrote the ANSI and ISO proposals to have the SQL standards
extended with many of the capabilities outlined in this paper.

Hamid Pirahesh, Ph.D., has been a Research Staff Member at IBM Almaden Research Center in San Jose,
California since 1985. He has been involved in the research, design, and implementation of the Starburst extensible
database system. Dr. Pirahesh has close cooperation with the IBM Database Technology Institute and IBM product
division. He also has direct responsibilities in the development of the IBM DB2 CS product. He has been active in
several areas of database management systems, computer networks, and object oriented systems, and has served
on many program committees of major computer conferences. His recent research activities cover various aspects
of database management systems, including extensions for Object Oriented systems, complex query optimization,
deductive databases, concurrency control, and recovery. Before joining IBM, he worked at Citicorp/TTI in the areas
of distributed transaction processing systems and computer networks. Previous to that, he was active in the design
and implementation of computer applications and electronic hardware systems. Dr. Pirahesh is an associate editor
of the ACM Computing Surveys journal. He received M.S. and Ph.D. degrees in computer science from the University
of California at Los Angeles and a B.S. degree in Electrical Engineering from the Institute of Technology, Tehran.
Data Mining and Knowledge Discovery, 12, 281–314, 2000

© 2000 Kluwer Academic Publishers. Manufactured in The Netherlands.

Informix under CONTROL: Online Query Processing

JOSEPH M. HELLERSTEIN jmh@cs.berkeley.edu
RON AVNUR ronathan@cs.berkeley.edu
VIJAYSHANKAR RAMAN rshankar@cs.berkeley.edu
Computer Science Division, U.C. Berkeley, USA

Editors: Fayyad, Mannila, Ramakrishnan

Abstract. The goal of the CONTROL project at Berkeley is to develop systems for interactive analysis of large
data sets. We focus on systems that provide users with iteratively refining answers to requests and online control
of processing, thereby tightening the loop in the data analysis process. This paper presents the database-centric
subproject of CONTROL: a complete online query processing facility, implemented in a commercial Object-
Relational DBMS from Informix. We describe the algorithms at the core of the system, and detail the end-to-end
issues required to bring the algorithms together and deliver a complete system.

Keywords: online query processing, interactive, informix, control, data analysis, ripple joins, online reordering

1. Introduction

Of all men’s miseries, the bitterest is this: to know so much and have control over nothing.
– Herodotus
Data analysis is a complex task. Many tools can be brought to bear on the problem,
from user-driven SQL and OLAP systems, to machine-automated data mining algorithms,
with hybrid approaches in between. All the solutions on this spectrum share a basic prop-
erty: analyzing large amounts of data is a time-consuming task. Decision-support SQL
queries often run for hours or days before producing output; so do data mining algorithms
(Agrawal, 1997). It has recently been observed that user appetite for online data storage is
growing faster than what Moore’s Law predicts for the growth in hardware performance
(Papadopoulos, 1997; Winter and Auerbach, 1998), suggesting that the inherent sluggish-
ness of data analysis will only worsen over time.
In addition to slow performance, non-trivial data analysis techniques share a second
common property: they require thoughtful deployment by skilled users. It is well-known
that composing SQL queries requires sophistication, and it is not unusual today to see an
SQL query spanning dozens of pages (Walter, 1998). Even for users of graphical front-end
tools, generating the correct query for a task is very difficult. Perhaps less well-appreciated
is the end-user challenge of deploying the many data mining algorithms that have been
developed. While data mining algorithms are typically free of complex input languages,
using them effectively depends on a judicious choice of algorithm, and on the careful tuning
of various algorithm-specific parameters (Fayyad, 1996).

A third common property of data analysis is that it is a multi-step process. Users are
unlikely to be able to issue a single, perfectly chosen query that extracts the “desired
information” from a database; indeed the idea behind data analysis is to extract heretofore
unknown information. User studies have found that information seekers very naturally
work in an iterative fashion, starting by asking broad questions, and continually refining
them based on feedback and domain knowledge (O’day and Jeffries, 1993). This iteration
of analyses is a natural human mode of interaction, and not clearly an artifact of current
software, interfaces, or languages.
Taken together, these three properties result in a near-pessimal human-computer inter-
action: data analysis today is a complex process involving multiple time-consuming steps.
A poor choice or erroneous query at a given step is not caught until the end of the step
when results are available. The long delay and absolute lack of control between successive
queries disrupt the concentration of the user and hamper the process of data analysis.
Therefore many users eschew sophisticated techniques in favor of cookie-cutter reports,
significantly limiting the impact of new data analysis technologies. In short, the mode of
human-computer interaction during data analysis is fundamentally flawed.

1.1. CONTROL: Interactive data analysis

The CONTROL1 project attempts to improve the interaction between users and computers
during data analysis. Traditional tools present black box interfaces: users provide inputs,
the system processes silently for a significant period, and returns outputs. Because of the
long processing times, this interaction is reminiscent of the batch processing of the 1960’s
and ‘70’s. By contrast, CONTROL systems have an online mode of interaction: users can
control the system at all times, and the system continuously provides useful output in the
form of approximate or partial results. Rather than a black box, online systems are intended to
operate like a crystal ball: the user “sees into” the online processing, is given a glimpse of the
final results, and can use that information to change the results by changing the processing.
This significantly tightens the loop for asking multiple questions: users can quickly sense
if their question is a useful one, and can either refine or halt processing if the question was
not well-formed. We describe a variety of interactive online systems in Section 2.
Though the CONTROL project’s charter was to solve interface problems, we quickly
realized that the solutions would involve fundamental shifts in system performance goals
(Hellerstein, 1997). Traditional algorithms are optimized to complete as quickly as possi-
ble. By contrast, online data analysis techniques may never complete; users halt them when
answers are “good enough”. So instead of optimizing for completion time, CONTROL
systems must balance two typically conflicting performance goals: minimizing uneventful
“dead time” between updates for the user, while simultaneously maximizing the rate at which
partial or approximate answers approach a correct answer. Optimizing only one of these
goals is relatively easy: traditional systems optimize the second goal (they quickly achieve a
correct answer) by pessimizing the first goal (they provide no interactivity). Achieving both
goals simultaneously requires redesigning major portions of a data analysis system, employ-
ing a judicious mix of techniques from data delivery, query processing, statistical estimation
and user interfaces. As we will see, these techniques can interact in non-trivial ways.

1.2. Online query processing in Informix

In this paper we focus on systems issues for implementing online query processing—i.e.,
CONTROL for SQL queries. Online query processing enables a user to issue an SQL query,
see results immediately, and adjust the processing as the query runs. In the case of an online
aggregation query, the user sees refining estimates of the final aggregation results. In the
case of an online enumeration query (i.e., a query with no aggregation), the user receives an
ever-growing collection of result records, which are available for browsing via user interface
tools. In both cases, users should be able to provide feedback to the system while the query
is running, to control the flow of estimates or records. A common form of control is to
terminate delivery of certain classes of estimates or records; more generally, users might
express a preference for certain classes of estimates or records over others.
Online query processing algorithms are designed to produce a steady stream of output
records, which typically serve as input to statistical estimators and/or intelligent user inter-
faces. Our discussion here focuses mostly on the pipelined production of records and the
control of that data flow; statistical and interface issues are considered in this paper only
to the extent that they drive the performance goals, or interact with implementation issues.
The interested reader is referred to Section 1.4 for citations on the statistical and interface
aspects of online query processing.
As a concrete point of reference, we describe our experience implementing online query
processing in a commercial object-relational database management system (DBMS): In-
formix’s Dynamic Server with Universal Data Option (UDO) (Informix, 1998). The pedi-
gree of UDO is interesting: formerly known as Informix Universal Server, it represents
the integration of Informix’s original high-performance relational database engine with
the object-relational facilities of Illustra (Illustra, 1994), which in turn was the commer-
cialization of the Postgres research system (Stonebraker and Kemnitz, 1991). Informix
Corporation made its source code and development environment available to us for this
research, enabling us to test our ideas within a complete SQL database engine.
Working with UDO represented a significant challenge and opportunity. UDO is a large
and complex system developed (in its various ancestral projects) over more than 15 years.
As we describe in Hellerstein et al. (1997), online query processing cannot simply be
implemented as a “plug-in” module for an existing system. Most of the work described in
this paper involved adding significant new features to the UDO database engine itself. In a
few cases—particularly in crafting an API for standard client applications—we were able to
leverage the object-relational extensibility of UDO to our advantage, as we describe below.
In addition to adding new algorithms to the system, a significant amount of effort went
into architecting a complete end-to-end implementation. Our implementation allows the
various algorithms to be pipelined into complex query plans, and interacts effectively with
a wide variety of client tools. This paper describes both the core algorithms implemented
in UDO, as well as the architectural issues required to provide a usable system.

1.3. Structure of the paper

We discuss related work in Section 1.4. In Section 2 we describe a number of application
scenarios for CONTROL-based systems. Section 3 describes the core algorithms used in
online query processing including access methods, data delivery algorithms, and join algo-
rithms. Section 4 describes the end-to-end challenges of putting these algorithms together
in the context of a commercial object-relational database management system. Section 5
demonstrates the performance of the system, in terms both of interactivity and rate of
convergence to accurate answers. In Section 6 we conclude with a discussion of future
work.

1.4. Related work

The CONTROL project began by studying online aggregation, which was motivated in
Hellerstein (1997a) and Hellerstein et al. (1997). The idea of online processing has been
expanded upon within the project (Hellerstein, 1997b; Hellerstein, 1998a; Hellerstein et al.,
1999; Hidber, 1997); a synopsis of these thrusts is given in Section 2. Recently we have
presented the details of two of our core query processing algorithms: ripple joins (Haas and
Hellerstein, 1999) and online reordering (Raman et al., 1999). Estimation and confidence
interval techniques for online aggregation are presented in Haas (1996, 1997) and Haas and
Hellerstein (1999). To our knowledge, the earliest work on approximate answers to decision-
support queries appears in Morgenstein’s dissertation from Berkeley (Morgenstein, 1980),
in which he presents motivation quite similar to ours, along with proposed techniques for
sampling from relations and from join results.
Our work on online aggregation builds upon earlier work on estimation and confidence
intervals in the database context (Hou et al., 1988; Haas et al., 1996; Lipton et al., 1993).
The prior work has been concerned with methods for producing a confidence interval with
a width that is specified prior to the start of query processing (e.g. “get within 2% of the
actual answer with 95% probability”). The underlying idea in most of these methods is
to effectively maintain a running confidence interval (not displayed to the user) and stop
sampling as soon as the length of this interval is sufficiently small. Hou et al. (1989) consider
the related problem of producing a confidence interval of minimal length, given a real-time
stopping condition (e.g. “run for 5 minutes only”). The drawback with using sampling
to produce approximate answers is that the end-user needs to understand the statistics.
Moreover, making the user specify statistical stopping conditions at the beginning reduces
the execution time but does not make the execution interactive; for instance there is no
way to dynamically control the rate of processing—or the desired accuracy—for individual
groups of records.
More recent work has focused on maintaining precomputed summary statistics for ap-
proximately answering queries (Gibbons and Matias, 1998; Gibbons et al., 1998); Olken
also proposed the construction of sample views (Olken, 1993). In a similar though sim-
pler vein, Informix has included simple precomputed samples for approximate results to
ROLAP queries (Informix, 1998). These techniques are to online query processing what ma-
terialized views are to ad hoc queries: they enhance performance by precomputing results,
but are inapplicable when users ask queries that cannot exploit the precomputed results.
In the context of approximate query answers, ad hoc specification applies both to queries
and to the stopping criteria for sampling: a user may specify any query, and want to see
answers with differing accuracies. Unlike general materialized views, most precomputed
summaries are on single tables, so many of the advantages of precomputed samples can be
achieved in an online query processing system via simple buffer management techniques. In
short, work on precomputed summaries is complementary to the techniques of this paper; it
seems viable to automate the choice and construction of precomputed summaries as an aid
to online query processing, much as Hybrid OLAP chooses queries to precompute to aid
OLAP processing (Shukla et al., 1998; Harinarayan et al., 1996; Pilot Software, 1998; SQL,
1998).
A related but quite different notion of precomputation for online query processing involves
semantically modeling data at multiple resolutions (Silberschatz et al., 1992). A version of
this idea was implemented in a system called APPROXIMATE (Vrbsky and Liu, 1993). This
system defines an approximate relational algebra which it uses to process standard relational
queries in an iteratively refined manner. If a query is stopped before completion, a superset
of the exact answer is returned in a combined extensional/intensional format. This model
is different from the type of data browsing we address with online query processing: it is
dependent on carefully designed metadata and does not address aggregation or statistical
assessments of precision.
There has been some initial work on “fast-first” query processing, which attempts to
quickly return the first few tuples of a query. Antoshenkov and Ziauddin report on the
Oracle Rdb (formerly DEC Rdb/VMS) system, which addresses the issues of fast-first
processing by running multiple query plans simultaneously; this intriguing architecture re-
quires some unusual query processing support (Antoshenkov and Ziauddin, 1996). Bayardo
and Miranker propose optimization and execution techniques for fast-first processing us-
ing nested-loops joins (Bayardo and Miranker, 1996). Carey and Kossman (1997, 1998),
Chaudhuri and Gravano (1996, 1999), and Donjerkovic and Ramakrishnan (1999) discuss
techniques for processing ranking and “top-N ” queries, which have a “fast-first” flavor as
well. Much of this work seems applicable to online query optimization, though integration
with online query processing algorithms has yet to be considered. Fagin (1998) proposes
an interesting algorithm for the execution of ranking queries over multiple sources that
optimizes for early results. This algorithm has a similar flavor to the Ripple Join algorithm
we discuss in Section 3.3.

2. Application scenarios and performance requirements

The majority of data analysis solutions are architected to provide black-box, batch behavior
for large data sets: this includes software for the back-office (SQL decision-support systems),
the desktop (spreadsheets and OLAP tools), and statistical analysis techniques (statistics
packages and data mining). The result is that either the application is frustratingly slow
(discouraging its use), or the user interface prevents the application from entering batch
states (constraining its use). The applications in this section are being handled by current
tools with one or both of these approaches. In this section we describe online processing
scenarios, including online aggregation and enumeration, and online visualization. We also
briefly mention some ideas in online data mining.

2.1. Online aggregation

Aggregation queries in relational database systems often require scanning and analyzing a
significant portion of a database. In current relational systems such query execution has batch
behavior, requiring a long wait for the user. Online query processing can make aggregation
an interactive process.
Consider the following simple relational query:

SELECT college, AVG(grade)
FROM enroll
GROUP BY college;

This query requests that all records in the enroll table be partitioned into groups by college,
and then for each college its name and average grade should be returned. The output of this
query in an online aggregation system can be a set of interfaces, one per output group, as
in figure 1. For each output group, the user is given a current estimate of the final answer.
In addition, a graph is drawn showing these estimates along with a description of their
accuracy: each estimate is drawn with bars that depict a confidence interval, which says that

Figure 1. An online aggregation interface.

with X% probability, the current estimate is within an interval of ±ε from the final answer
(X is set to 95% in the figure). The “Confidence” slider on the lower left allows the user
to control the percentage probability, which in turn affects the 2·ε width of the bars. In
addition, controls on the upper left of the screen are provided to stop processing on a group,
or to speed up or slow down one group relative to others. These controls allow the user to
devote more processing to groups of particular interest. These interfaces require the support
of significant modifications to a DBMS, which we describe in this paper. We have developed
estimators and corresponding implementation techniques for the standard SQL aggregates
AVG, COUNT, and STDDEV2 (Hellerstein et al., 1997; Haas and Hellerstein, 1999).
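
The estimators behind these intervals are developed in the cited papers; as a rough, large-sample sketch only (a central-limit-style bound, not the exact formulas used by the system), after n randomly sampled rows with running mean and running standard deviation s_n, the half-width of a p-confidence interval behaves like

$$\varepsilon_n \approx z_p \, \frac{s_n}{\sqrt{n}}, \qquad \Pr\bigl(\,|\bar{X}_n - \mu| \le \varepsilon_n\,\bigr) \approx p,$$

where z_p is the normal quantile for coverage p; the bars therefore shrink at a rate of roughly 1/√n as processing proceeds.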
Online aggregation is particularly useful in “drill-down” scenarios: a user may ask for
aggregates over a coarse-grained grouping of records, as in the query above. Based on a
quick online estimate of the coarse-grained results, the user may choose to issue another
query to “drill down” into a set of particularly anomalous groups. Alternatively the user
may quickly find that their first query shows no interesting groups, and they may issue an
alternate query, perhaps grouping on different attributes, or requesting a different aggregate
computation. The interactivity of online aggregation enables users to explore their data in
a relatively painless fashion, encouraging data browsing.
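
For instance, if the coarse query above showed one anomalous college, a follow-up drill-down might group within that college. A hedged sketch (the major column and the college name are hypothetical, not part of the paper's schema):

SELECT major, AVG(grade)
FROM enroll
WHERE college = 'Engineering'
GROUP BY major;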
The obvious alternative to online aggregation is to precompute aggregation results be-
fore people use the system—this is the solution of choice in the multidimensional OLAP
(MOLAP) tools (e.g., Hyperion Essbase OLAP Server, 1999). Note that the name OLAP
(“OnLine Analytic Processing”) is something of a misnomer for these systems. The analytic
processing in many OLAP tools is in fact done “off line” in batch mode; the user merely
navigates the stored results on line. This solution, while viable in some contexts, is an exam-
ple of the constrained usage mentioned at the beginning of this section: the only interactive
queries are those that have been precomputed. This constraint is often disguised with a
graphical interface that allows only precomputed queries to be generated. A concomitant
and more severe constraint is that these OLAP systems have trouble scaling beyond a few
dozen gigabytes because of both the storage costs of precomputed answers, and the time
required to periodically “refresh” those answers. Hybrids of precomputation and online
aggregation are clearly possible, in the same way that newer systems provide hybrids of
precomputation and batch query processing (e.g., Shukla et al., 1998; Harinarayan et al.,
1996; Pilot Software, 1998; Maier and Stein, 1986).

2.2. Online enumeration: Scalable spreadsheets

Database systems are often criticized as being hard to use. Many data analysts are experts
in a domain other than computing, and hence prefer simple direct-manipulation interfaces
like those of spreadsheets (Shneiderman, 1982), in which the data is at least partially visible
at all times. Domain-specific data patterns are often more easily seen by “eyeballing” a
spreadsheet than by attempting to formulate a query. For example, consider analyzing a
table of student grades. By sorting the output by GPA and scrolling to the top, middle, and
bottom, an analyst may notice a difference in the ethnic mix of names in different GPA
quantiles; this may be evidence of discrimination. By contrast, imagine trying to write an
SQL aggregation query asking for the average grade per apparent ethnicity of the name
column—there is no way to specify the ethnicity of a name declaratively. The difficulty is
that the (rough) name-to-ethnicity mapping is domain knowledge in the analyst’s head, and
not captured in the database.
Unfortunately, spreadsheets do not scale gracefully to large datasets. An inherent problem
is that many spreadsheet behaviors are painfully slow on large datasets—if the spreadsheet
allows large data sets at all. Microsoft Excel, for example, restricts table size to 64K rows or
fewer, presumably to ensure interactive behavior. The difficulty of guaranteeing acceptable
spreadsheet performance on large datasets arises from the “speed of thought” response time
expected of spreadsheet operations such as scrolling, sorting on different columns, pivoting,
or jumping to particular cells in the table (by address or cell-content prefix). Thus traditional
spreadsheets are not useful for analyzing large amounts of data.
We are building A-B-C, a scalable spreadsheet that allows online interaction with individ-
ual records (Raman et al., 1999a). As records are enumerated from a large file or a database
query returning many rows, A-B-C allows the user to view example rows and perform typical
spreadsheet operations (scroll, sort, jump) at any time. A-B-C provides interactive (sub-
second) responses to all these operations via the access methods and reordering techniques
of Sections 3.1 and 3.2.
Hypotheses formed via online enumeration can be made concrete in A-B-C by grouping
records “by example”: the user can highlight example rows, and use them to interactively
develop a regular expression or other group identity function. Groups are then “rolled up” in
a separate panel of the spreadsheet, and users can interactively specify aggregation functions
to compute on the groups online. In this case, the online enumeration features of A-B-C are
a first step in driving subsequent online aggregation.

2.3. Aggregation + enumeration: Online data visualization

Data visualization is an increasingly active research area, with rather mature prototypes in
the research community (e.g. Tioga Datasplash (Aiken et al., 1996), DEVise (Livny et al.,
1997), Pad (Perlin and Fox, 1993)), and products emerging from vendors (Ohno, 1998).
These systems are interactive data exploration tools, allowing users to “pan” and “zoom”
over visual “canvases” representing a data set, and derive and view new visualizations
quickly.
An inherent challenge in architecting a data visualization system is that it must present
large volumes of information efficiently. This involves scanning, aggregating and rendering
large datasets at point-and-click speeds. Typically these visualization systems do not draw
a new screen until its image has been fully computed. Once again, this means batch-style
performance for large datasets. This is particularly egregious for visualization systems that
are expressly intended to support browsing of large datasets.
Related work in the CONTROL project involves the development of online visualization
techniques we call CLOUDS (Hellerstein et al., 1999), which can be thought of as visual
aggregations and enumerations for an online query processing system. CLOUDS performs
both enumeration and aggregation simultaneously: it renders records as they are fetched,
and also uses those records to generate an overlay of shaded rectangular regions of color

Figure 2. Snapshots of an online visualization of cities in the United States, with and without CLOUDS.

(“clouds”), corresponding to nodes in a carefully constructed quad tree. The combination of
the clouds and the rendered sample is intended to approximate the final image. This means
that the clouds are not themselves an approximation of the image, but rather a compensatory
shading that accounts for the difference between the rendered records and the projected final
outcome. During processing, the user sees the picture improve much the way that images
become refined during network transmission. This can be particularly useful when a user
pans or zooms over the results of an ad hoc query: in such scenarios the accuracy of what is
seen is not as important as the rough sense of the moving picture. Figure 2 shows a snapshot
of an online visualization of cities in the United States, with and without CLOUDS. Note
how the CLOUDS version contains shading that approximates the final density of areas
better than the non-CLOUDS version; note also how the CLOUDS visualization renders
both data points and shading.
As with the Scalable Spreadsheet, our data visualization techniques tie into data delivery,
and benefit directly from the access methods and reordering techniques described in Sections
3.1 and 3.2. For DBMS-centric visualization tools like DEVise and Tioga, the full power of
an online query processing system—including joins, aggregations, and so on—is needed
in the back end.

2.4. Online data mining

Many data mining algorithms make at least one complete pass over a database before
producing answers. In addition, most mining algorithms have a number of parameters to
tune, which are not adjustable while the algorithm is running. While we do not focus on
data mining algorithms in this paper, we briefly consider them here to highlight analogies
to online query processing.
As a well-known example, consider the oft-cited apriori algorithm for finding “associa-
tion rules” in market-basket data (Agrawal and Srikant, 1994). To use an association rule
application, a user specifies values for two variables: one that sets a minimum threshold
on the amount of evidence required for a set of items to be produced (minsupport) and
another which sets a minimum threshold on the correlation between the items in the set
(minconfidence). These algorithms can run for hours without output, before producing as-
sociation rules that passed the minimum support and confidence thresholds. Users who set
those thresholds incorrectly typically have to start over. Setting thresholds too high means
that few rules are returned. Setting them too low means that the system (a) runs even more
slowly, and (b) returns an overwhelming amount of information, most of which is useless.
Domain experts may also want to explicitly prune irrelevant correlations during processing.
The traditional algorithm for association rules is a sequence of aggregation queries,
and can be implemented in an online fashion using techniques for online query processing
described in this paper. An alternative association rule algorithm called CARMA was devel-
oped in the CONTROL project (Hidber, 1997). While not clearly applicable to SQL query
processing, CARMA is worthy of mention here for two reasons. First, it very efficiently pro-
vides online interaction and early answers. Second—and somewhat surprisingly—CARMA
often produces a final, accurate answer faster than the traditional “batch” algorithms, both
because it makes fewer passes of the dataset and because it manages less memory-resident
state. So in at least one scenario, inventing an algorithm for online processing resulted in a
solution that is also better for batch processing!
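
To see why the traditional approach reduces to aggregation, here is a hedged sketch of one pass of support counting for item pairs (the basket(tid, item) table and the numeric threshold are assumptions of the sketch, not taken from the paper):

SELECT b1.item AS item1, b2.item AS item2, COUNT(*) AS support
FROM basket b1
JOIN basket b2 ON b1.tid = b2.tid AND b1.item < b2.item
GROUP BY b1.item, b2.item
HAVING COUNT(*) >= 100;   -- minsupport, written here as an absolute count

In an online rendition, running counts for these groups refine as the scan proceeds rather than appearing only after the HAVING filter at the end.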
Most other data mining algorithms (clustering, classification, pattern-matching) are simi-
larly time-consuming. CONTROL techniques seem worth considering for these algorithms,
and the development of such techniques seems to be a tractable research challenge. Note
that CONTROL techniques tighten loops in the knowledge-discovery process (Fayyad et al.,
1996), bringing mining algorithms closer in spirit to data visualization and browsing. Such
synergies between user-driven and automated techniques for data analysis seem like a
promising direction for cross-pollination between research areas.

3. Algorithms for online query processing

Relational systems implement a relatively small set of highly tuned query processing oper-
ators. Online query processing is driven by online analogs of the standard relational query
processing operators, along with a few new operators. In this section we discuss our im-
plementation in Informix of online query processing operators, including randomized data
access, preferential data delivery, relational joins, and grouping of result records. Of these,
randomized data access was the simplest to address, and our solution required no additions
to the Informix SQL engine. It does, however, have an impact on the physical design of a
database, i.e., the layout of tables and indexes on disk.

3.1. Randomized data access and physical database design

In most scenarios, it is helpful if the output of a partially-completed online query can be
treated as a random sample. In online aggregation queries, the estimators for aggregates like
AVG and SUM require random sampling in order to allow for confidence intervals or other
statements about accuracy. This requirement is less stringent for online enumeration, but
still beneficial: in most scenarios a user would prefer to see a representative sample of the
data at any given time. Hence we begin our discussion of query operators by considering
techniques for randomized data access.
In order to guarantee random data delivery, we need access methods—algorithms for data
access—that produce ever-larger random samples of tables. This can be accomplished in a
number of ways. Random sampling is the same as simply scanning a table in certain sce-
narios: particularly, when the rows have been randomly permuted prior to query processing
or when, as verified by statistical testing, the storage order of the rows on disk is indepen-
dent of the values of the attributes involved in the aggregation query. Alternatively, it may
be desirable to actually sample a table during query processing, or to materialize a small
random sample of each base relation during an initialization step, and then subsequently
scan the sample base relations during online processing. Olken (1993) surveys techniques
for sampling from databases and for maintaining materialized sample views.
We chose to guarantee random delivery in Informix by storing tables in random order.
This approach is already available in any DBMS that supports user-defined functions: we
cluster tables randomly by clustering records on a user-defined function f() that generates
pseudo-random numbers. Scans of the table produce ever-larger random samples at full
disk bandwidth. To make this scheme work in the face of updates, new tuples should be
inserted at random positions in the table, with the tuples formerly in those positions being
appended to the end of the table. While not difficult, we have not added random insertion
functionality to our Informix prototype.
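
A minimal sketch of the same idea outside Informix (PostgreSQL-flavored SQL; the enroll table is reused from Section 2.1, and relying on CREATE TABLE ... AS to lay rows down in the order produced is an implementation behavior rather than a guarantee of the SQL standard):

-- materialize the table in pseudo-random order so that a plain sequential scan
-- returns an ever-larger random sample at full disk bandwidth
CREATE TABLE enroll_random AS
SELECT *
FROM enroll
ORDER BY random();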
There are two potential drawbacks to this approach. The first is that every scan of the table
generates the same random sample; over time the random but static properties of the order
could be misinterpreted as representative of all possible orderings. This can be alleviated
somewhat by starting scans at an arbitrary point in the file (as is done in the “shared
scans” of some DBMSs (Red Brick Systems, Inc., 1998)), though of course the properties
of the fixed order are still reflected. A better solution is to periodically force some random
shuffling in the table; this is analogous to the “reorganization” steps common in database
administration, and it easily could be made automatic and incremental.
The second problem with this approach is that a relation stored in random order is by
definition not stored in some other order. This has ramifications for database design, since it
is typical for a database administrator to cluster a table on an attribute frequently referenced
in range queries, rather than on a random function. This can be solved in a manner analogous
to that of traditional database design: if one has a clustering on some column, and a secondary
random clustering is desired, one can generate a secondary random ordering via an index on
a random-valued attribute. This can be done without modification in any object-relational
system like Informix that supports functional indexes (Maier and Stein, 1986; Lynch and
Stonebraker, 1988). One simply constructs a functional index on f (R · x), where f ( ) is
a random-number generator, and x is any column of R. The resulting index on R serves
as a secondary random ordering structure. Note that scanning a secondary index requires
a random I/O per record; this is a performance drawback of secondary random indexes as
well. As with most physical database design decisions, there are no clear rules of thumb
here: the choice of which clustering to use depends on whether online queries or traditional
range queries are more significant to the workload performance. These kinds of decisions
can be aided or even automated by workload analysis tools (e.g., Chaudhuri and Narasayya,
1998).
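
To make the connection between random clustering and online estimation concrete, the following Python sketch (purely illustrative, and not part of the Informix implementation) permutes a toy table once using a seeded pseudo-random key, so that every prefix of a subsequent scan is a uniform random sample and the running mean is an unbiased estimate of AVG:

import random

def recluster(table, seed=42):
    # Simulate clustering on a pseudo-random key f(): one fixed ("NOT VARIANT") shuffle.
    rng = random.Random(seed)
    permuted = list(table)
    rng.shuffle(permuted)
    return permuted

def running_avg(table):
    # Scanning the randomly clustered table: every prefix is a uniform random sample,
    # so the running mean is an unbiased estimate of AVG over the whole table.
    total = 0.0
    for n, value in enumerate(table, start=1):
        total += value
        yield n, total / n

if __name__ == "__main__":
    column = [float(x % 100) for x in range(10000)]   # toy column values
    stored = recluster(column)
    for n, estimate in running_avg(stored):
        if n in (10, 100, 1000, 10000):
            print("after", n, "tuples: AVG estimate =", round(estimate, 2))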

3.2. Preferential data delivery: Online reordering

It is not sufficient for data delivery in an online query processing system to be random. It also
must be user-controllable. This requirement was not present in traditional batch systems,
and hence the techniques we discuss in this section do not have direct analogs in traditional
systems.
A key aspect of an online query processing system is that users perceive data being
processed over time. Hence an important performance goal for these systems is to present
data of interest early on in the processing, so that users can get satisfactory results quickly,
halt processing early, and move on to their next request. The “speed” buttons shown in
the Online Aggregation interface in figure 1 are one interface for specifying preferences:
they allow users to request preferential delivery for particular groups. The scrollbar of
a spreadsheet is another interface: items that can be displayed at the current scrollbar
position are of greatest interest, and the likelihood of navigation to other scrollbar positions
determines the relative preference of other items.
To support preferential data delivery, we developed an online reordering operator that
reorders data on the fly based on user preferences—it attempts to ensure that interesting
items get processed first (Raman et al., 1999b). We allow users to dynamically change their
definition of “interesting” during the course of a query; the reordering operator alters the
data delivery to try and meet the specification at any time.
In order to provide user control, a data processing system must accept user preferences
for different items and use them to guide the processing. These preferences are specified in
a value-based, application-specific manner, usually based on values in the data items. The
mapping from user preferences to the rates of data delivery depends on the performance
goals of the application. This is derived for some typical applications in Raman et al.
(1999b). Given a statement of preferences, the reorder operator should permute the data
items at the source so as to make an application-specific quality of feedback function rise
as fast as possible.
Since our goal is interactivity, the reordering must not involve pre-processing or other
overheads that will increase runtime. Instead, we do a “best effort” reordering without
slowing down the processing, by making the reordering concurrent with the processing.
Figure 3 illustrates our scheme of inserting a reorder operator into a data flow.

Figure 3. The reordering operator in context.

We can
divide any data flow into four stages: the produce stage, the reorder stage, the pro-
cess stage, and the consume stage. In the context of query processing, Produce represents
an access method generating records. Reorder reorders the items according to the dy-
namically changing preferences of the consumer. Process is the set of operations that
are applied “downstream” to the records—this could involve query plan operators like
joins, shipping data across a slow network, rendering data onto the screen in data visual-
ization, etc. Consume captures the user think-time, if any—this is mainly for interactive
interfaces such as spreadsheets or data visualization. Since all these operations can go on
concurrently, we exploit the difference in throughput between the produce stage and the
process or consume stages to permute the items. For disk-based data sources, Produce
can run as fast as the sequential read bandwidth, whereas process may involve several ran-
dom I/Os which are much slower (Gray and Graefe, 1997). While the items sent out so far
are being processed/consumed, reorder can take more items from produce and permute
them.
The reorder operator tries to put as many interesting items as possible onto a main-
memory buffer, and the process operator issues requests to get items from the buffer.
Process decides which item to get based on its performance goals. Reorder uses the
time gap between successive gets from the buffer (which may arise due to processing or
consumption time) to populate the buffer with more interesting items. It does this either
by using an index to fetch interesting items, or by aggressively prefetching from the input,
spooling uninteresting items onto an auxiliary disk. Policies for management of the buffer
and organization of the auxiliary disk are described in more detail in Raman et al. (1999b);
the basic idea is to evict least-interesting items from the buffer, and place them into chunks
on disk of records from the same group.
If the reorder operator can get records much faster than they can be processed, then the
reordering has two phases. In the first phase, reorder continually gets data, and tries to keep
the buffer full of interesting items in the appropriate ratios, carefully spooling uninteresting
items to chunks on the side disk. The second phase occurs when there is no more data to
get; at this point, reorder simply enriches the buffer by fetching chunks of interesting tuples
from the side disk.
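
The following Python sketch conveys the prefetch-and-spool structure of the reorder operator. It is a simplification: the buffer and side store are plain in-memory structures standing in for a bounded buffer and the side disk, the preference dictionary and the budget parameter are illustrative, and the buffer-management policies of Raman et al. (1999b) are reduced to a simple evict-least-interesting rule.

import collections
import random

class Reorder:
    # Best-effort reordering: keep the in-memory buffer rich in preferred groups,
    # spool less interesting tuples to a side store, and enrich from that store
    # once the input is exhausted (phase two).

    def __init__(self, produce, group_of, buffer_size=1000, rng=None):
        self.produce = iter(produce)       # the Produce stage (e.g., a table scan)
        self.group_of = group_of           # maps a tuple to its group value
        self.buffer = collections.defaultdict(collections.deque)
        self.side = collections.defaultdict(collections.deque)
        self.buffer_size = buffer_size
        self.buffered = 0
        self.rng = rng or random.Random(0)

    def _victim(self, prefs):
        groups = [g for g in self.buffer if self.buffer[g]]
        return min(groups, key=lambda g: prefs.get(g, 1)) if groups else None

    def _fill(self, prefs, budget=64):
        # Pull up to `budget` tuples per call; this stands in for the time gap
        # between successive gets by the Process stage.
        for _ in range(budget):
            t = next(self.produce, None)
            if t is None:
                self._enrich(prefs)
                return
            g = self.group_of(t)
            if self.buffered < self.buffer_size:
                self.buffer[g].append(t)
                self.buffered += 1
            else:
                victim = self._victim(prefs)
                if victim is not None and prefs.get(g, 1) > prefs.get(victim, 1):
                    # Evict a least-interesting tuple to the side store, keep the new one.
                    self.side[victim].append(self.buffer[victim].popleft())
                    self.buffer[g].append(t)
                else:
                    self.side[g].append(t)

    def _enrich(self, prefs):
        # Phase two: no more input; refill the buffer from the side store, preferred first.
        for g in sorted(self.side, key=lambda g: -prefs.get(g, 1)):
            while self.side[g] and self.buffered < self.buffer_size:
                self.buffer[g].append(self.side[g].popleft())
                self.buffered += 1

    def get(self, prefs):
        # Called by the Process stage; delivers one tuple, biased by current preferences.
        self._fill(prefs)
        groups = [g for g in self.buffer if self.buffer[g]]
        if not groups:
            return None
        weights = [max(prefs.get(g, 1), 0.001) for g in groups]
        g = self.rng.choices(groups, weights=weights)[0]
        self.buffered -= 1
        return self.buffer[g].popleft()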

3.2.1. Index stride and database design issues. The Index Stride access method was first
presented in Hellerstein et al. (1997); it works as follows. Given a B-tree index on the
grouping columns,5 on the first request for a tuple we open a scan on the leftmost edge
of the index, where we find a key value k1 . We assign this scan a search key of the form
[=k1 ]. After fetching the first tuple with key value k1 , on a subsequent request for a tuple
we open a second index scan with search key [>k1 ], in order to quickly find the next group
in the table. When we find this value, k2 , we change the second scan’s search key to be
[=k2 ], and return the tuple that was found. We repeat this procedure for subsequent requests
until we have a value kn such that a search key [>kn ] returns no tuples. At this point, we
satisfy requests for tuples by fetching from the scans [=k1 ], . . . , [=kn ] in a round-robin
fashion. In order to capture user preference for groups, we do not actually use round-robin
scheduling among the groups; rather we use lottery scheduling (Waldspurger and Weihl,
1995), assigning more “tickets” to groups of greater interest.
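
The control flow of Index Stride with lottery scheduling can be sketched as follows (Python; a sorted list and bisect stand in for the B-tree and its search keys, and for brevity this sketch discovers all groups up front rather than lazily as in the description above):

import bisect
import random

def index_stride(rows, key, tickets=None, seed=0):
    # Index Stride over `rows` sorted on the grouping column.
    rng = random.Random(seed)
    rows = sorted(rows, key=key)
    keys = [key(r) for r in rows]
    cursors = {}
    lo = 0
    while lo < len(rows):
        k = keys[lo]                   # next distinct group value (search key [>k])
        cursors[k] = lo                # a scan positioned at the first tuple of group k
        lo = bisect.bisect_right(keys, k)
    tickets = tickets or {}
    while cursors:
        # Lottery scheduling: pick the next group in proportion to its tickets.
        groups = list(cursors)
        g = rng.choices(groups, weights=[tickets.get(x, 1) for x in groups])[0]
        pos = cursors[g]
        yield rows[pos]
        if pos + 1 < len(rows) and keys[pos + 1] == g:
            cursors[g] = pos + 1
        else:
            del cursors[g]             # this group is exhausted

# e.g. for t in index_stride(data, key=lambda r: r["priority"], tickets={"1-URGENT": 5}): ...
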
Index Stride can be used to support online reordering, because the reorder operator can
get tuples in the appropriate ratios by simply passing its weighting of groups to the Index
Stride access method. A drawback of using an index is that it involves many random I/Os.
For many scenarios, Index Stride is significantly less efficient than simply running an online
reordering operator over a table-scan. However for groups of very low cardinality Index
Stride can be extremely beneficial. A hybrid of Index Stride and table-scan can be achieved
via partial indexes (Seshadri and Swami, 1995; Stonebraker, 1989): an index is built over
the small groups, and the reorder operator is run over a union of the small groups (in the
index) and the larger groups (in the heap file). In a system like UDO that lacks partial
indexes, the hybrid scheme can be effected by explicitly partitioning the table into its rare and
common groups, storing the rare groups in a separate table. Performance tradeoffs between
Index Stride and online reordering of table-scans are presented in Raman et al. (1999b).

3.3. Ripple join algorithms

Up to this point, our discussion has focused on algorithms for delivering data from individual
tables. These techniques are appropriate for SQL queries on single tables; they are also
appropriate for simple spreadsheet-like systems built on files, as sketched in Section 2.2.
In general, however, SQL queries often combine data from multiple tables—this requires
relational join operators.
The fastest classical join algorithms (see, e.g., Graefe (1993)) are inappropriate for online
query processing. Sort-merge join is blocking: it generates no output until it has consumed
its entire input. Hybrid hash join (DeWitt et al., 1984) does produce output from the begin-
ning of processing, but at a fraction of the rate at which it consumes its input. Moreover
most commercial systems use Grace hash join (Fushimi et al., 1986) which is a blocking
algorithm.
The only classical algorithm that is completely pipelined is nested-loops join, but it is
typically quite slow unless an index is present to speed up the inner loop. Using nested-
loops join in an online fashion is often more attractive than using a blocking algorithm
like sort-merge join. But the absolute performance of the online nested-loops join is often
unacceptably slow even for producing partial results—this is particularly true for estimating
aggregates.
To see this, consider an online aggregation query over two tables, R and S. When a sample
of R is joined with a sample of S, an estimate of the aggregate function can be produced,
along with a confidence interval (Haas, 1997). We call this scenario the end of a sampling
step—at the end of each sampling step in a join, the running estimate can be updated and the
confidence interval tightened. In nested loops join, a sampling step completes only at the
end of each inner loop. If S is the relation in the outer loop of the join in our example, then a
sampling step completes after each full scan of R. But if R is of non-trivial size (as is often
the case for decision-support queries), then the amount of time between successive sampling
steps—and hence successive updates to the running estimate and confidence interval—can
be excessive.
In addition to having large pauses between estimation updates, nested-loops join has an
additional problem: it “samples” more quickly from one relation (the inner loop) than from
the other.

Figure 4. The elements of R × S that have been seen after n steps of a “square” ripple join.

If the relation in the outer loop contributes significantly to the variance of inputs to
the aggregation function, it can be beneficial to more carefully balance the rates of reading
from the two relations.
To address these problems, we developed a new join algorithm for online query pro-
cessing, called the ripple join (Haas and Hellerstein, 1999). In the simplest version of the
two-table ripple join (figure 6), one previously-unseen random tuple is retrieved from each
of R and S at each sampling step; these new tuples are joined with the previously-seen
tuples and with each other. Thus, the Cartesian product R × S is swept out as depicted in the
“animation” of figure 4. In each matrix in the figure, the R axis represents tuples of R, the
S axis represents tuples of S, each position (r, s) in each matrix represents a corresponding
tuple in R × S, and each “x” inside the matrix corresponds to an element of R × S that
has been seen so far. In the figure, the tuples in each of R and S are displayed in the order
provided by the access methods; this order is assumed to be random.
The “square” version of the ripple join described above draws samples from R and S at
the same rate. As discussed in Haas and Hellerstein (1999), it is often beneficial to sample
one relation (the “more variable” one) at a higher rate in order to provide the shortest
possible confidence intervals for a given aggregation query. This requirement leads to the
general “rectangular” version of the ripple join6 depicted in figure 5. The general algorithm
with K (≥2) base relations R1 , R2 , . . . , R K retrieves βk previously-unseen random tuples
from Rk at each sampling step for 1 ≤ k ≤ K . (figure 5 corresponds to the special case in
which K = 2, β1 = 3, and β2 = 2.) Note the tradeoff between interactivity of the display
and estimation accuracy. When, for example, β1 = 1 and β2 = 2, more I/O’s are required
per sampling step than when β1 = 1 and β2 = 1, so that the time between updates to the
confidence intervals is longer; on the other hand, after each sampling step the confidence
interval typically is shorter when β1 = 1 and β2 = 2.
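
A minimal Python sketch of the rectangular ripple join is shown below (without the blocking, hashing, or index variants, and without the estimation machinery); the square version is the special case β1 = β2 = 1. The function and parameter names are illustrative.

def ripple_join(R, S, predicate, beta1=1, beta2=1):
    # Naive two-table ripple join: at each sampling step pull beta1 new tuples from R
    # and beta2 from S, and join the new tuples against everything seen so far.
    # R and S are assumed to arrive in random order (e.g., from a random clustering).
    seen_r, seen_s = [], []
    it_r, it_s = iter(R), iter(S)
    while True:
        new_r = [t for t in (next(it_r, None) for _ in range(beta1)) if t is not None]
        new_s = [t for t in (next(it_s, None) for _ in range(beta2)) if t is not None]
        if not new_r and not new_s:
            return                      # both inputs exhausted
        # New R tuples join with all previously seen S tuples and the new S tuples...
        for r in new_r:
            for s in seen_s + new_s:
                if predicate(r, s):
                    yield (r, s)
        # ...and new S tuples join with previously seen R tuples.
        for s in new_s:
            for r in seen_r:
                if predicate(r, s):
                    yield (r, s)
        seen_r.extend(new_r)
        seen_s.extend(new_s)
        # End of a sampling step: a "corner" of the rectangle; estimators can refresh here.

A hash-ripple variant would replace the inner loops above with hash lookups on the equality columns, in the spirit of the pipelined hash join of Wilschut and Apers (1991).
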
Figure 5. The elements of R × S that have been seen after n sampling steps of a “rectangular” ripple join with aspect ratio 3 × 2.

Figure 6. A simple square ripple join. The tuples within each relation are referred to in array notation.

In Haas and Hellerstein (1999) we provide full details of ripple joins, and their interaction
with aggregation estimators. We also include detailed discussion on the statistics behind
tuning the aspect ratios to shrink the confidence intervals as fast as possible, while still
attempting to meet a user’s goal for interactivity. In addition, we demonstrate that ripple
join generalizes pipelined nested-loops and hash joins. That leads to descriptions of higher-
performance algorithmic variants on figure 6 based on blocking (analogous to blocked nested
loops join), hashing (generalizing the pipelined hash join of Wilschut and Apers (1991)),
and indexes (identical to index nested loops).
The benefits of ripple join for enumeration (non-aggregation) queries are akin to the
benefits of sampling for these queries: the running result of a ripple join arguably represents
a “representative” subset of the final result, since it is made from sizable random samples
from each input relation. In an enumerative query, the ripple join aspect ratio can be tuned
to maximize the size of the output, rather than a statistical property.

3.4. Hash-based grouping

SQL aggregation queries can contain a GROUP BY clause, which partitions tuples into
groups. This process can be done via sorting or hashing; in either case, the stream of tuples
resulting from selections, projections and joins are grouped into partitions such that within
a partition all tuples match on the GROUP BY columns. Sorting is a blocking operation
and hashing is not (at least not for a reasonable number of groups), so it is important that
an online aggregation system implement GROUP BY via hashing. Further discussion of
unary hashing and sorting appears in Hellerstein and Naughton (1996).
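
A sketch of such a non-blocking, hash-based GROUP BY in Python, which also emits a refreshed snapshot of running AVG estimates every skip_factor tuples, is shown below; the function and parameter names are illustrative, not the Informix operator.

import collections

def online_group_avg(tuples, group_of, value_of, skip_factor=1000):
    # Non-blocking hash-based GROUP BY: maintain running sums per group and
    # emit a refreshed snapshot of per-group AVG estimates every `skip_factor` tuples.
    sums = collections.defaultdict(float)
    counts = collections.defaultdict(int)
    for i, t in enumerate(tuples, start=1):
        g = group_of(t)
        sums[g] += value_of(t)
        counts[g] += 1
        if i % skip_factor == 0:
            yield {g: sums[g] / counts[g] for g in sums}   # running estimates
    yield {g: sums[g] / counts[g] for g in sums}           # final answer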

4. End-to-end issues in online query processing

A collection of algorithms does not make a system; the various building blocks must be
made to interact correctly, and consideration must be given to the end-to-end issues required
to deliver useful behavior to users. In this section we discuss our experience implementing
online query processing in Informix UDO. This implementation is more than a simple “plug-
in”: the various algorithms must be incorporated inside the server alongside their traditional
counterparts, and must be made to work together correctly. In addition, interfaces must be
added to the system to handle interactive behaviors that are not offered by a traditional
system. In this section we describe interfaces to the system, and implementation details in
composing query execution plans. We also discuss some of the remaining challenges in realizing
a completely online query processing system. There were some issues in implementing the
estimators in a DBMS that are not clear from an algorithmic standpoint. We discuss these
in Appendix A.

4.1. Client-server interfaces

A traditional database Application Programming Interface (API) accepts SQL queries as
input, and returns a cursor as output; the cursor is a handle with which the client can request
output tuples one at a time. A batch system may process for a long time before the cursor is
available for fetching output, while an online system will make the cursor available almost
immediately. Beyond this minor change in interaction lie more significant changes in the
client-server API: online query processing allows for ongoing interaction between a client
application and database server. Two basic issues need to be handled. First, the system needs
to provide output beyond relational result records. Second, users can provide many kinds
of input to the system while a query is running.
Before actually deciding on functions to add, we had to decide how applications would
invoke our functions. We took two approaches to this problem. Our first solution was to
extend Informix’s Embedded SQL (ESQL) API, which is a library of routines that can be
called from C programs. This was problematic, however, because implementing clients in
C is an unpleasant low-level task.
We preferred to use a combination of Windows-based tools to build our application:
the interface of figure 1 consists of a few hundred lines of Visual Basic code that invoke
Informix’s MetaCube client tool, and also utilize a Microsoft Excel spreadsheet that provides
the table and graph widgets. These Windows-based tools do not work over ESQL; instead
they communicate with the DBMS via the standard Open Database Connectivity (ODBC)
protocol. Unfortunately ODBC is part of the MS Windows operating systems, and its API is
not extensible by third parties. Hence we developed a second architecture that conforms to
the traditional ODBC paradigm: all input is done via SQL queries, and output via cursors.
Sticking to this restricted interface has a number of advantages: it enables the use of ODBC
and similar connectivity protocols (such as Java’s JDBC), and it also allows standard client
tools to be deployed over an online query processing system.
Given this architecture, we proceed to discuss our CONTROL API for output and input
in turn.

4.1.1. Output API. Online enumeration queries provide the same output as traditional
database queries; the only distinction is that they make a cursor available for fetching almost
instantly. Online aggregation queries provide a form of output not present in traditional
systems: running estimates and confidence intervals for results. We chose to have these
running outputs appear as regular tuples, with the same attributes as the actual output
tuples. This means that a single SQL query is issued, which returns a stream of tuples that
includes both estimates and results. For example, the query of Section 2.1 can be issued as:

SELECT ONLINE AVG(grade), CONFIDENCE_AVG(grade, 95)
FROM enroll
GROUP BY college;

The addition of the ONLINE keyword and the CONFIDENCE_AVG function to the query
can be done automatically by the client application, or explicitly by a user. In this example,
CONFIDENCE_AVG is a user-defined aggregate function (UDA) that returns a confidence
interval “half-width” ε corresponding to the probability given by the second argument to
the aggregate (i.e., the estimate is within ±ε of the correct average with 95% probability). The
interface of figure 1 shows the ε values displayed as “error bars” in a graph. Immediately
after issuing this query, a cursor is available for fetching results. In a traditional system, this
query would produce one tuple per college. In our system, the query produces multiple
tuples per college, each representing an estimate of the average grade along with a
confidence interval half-width. If the query is run to completion, the last tuple fetched per
college represents the accurate final answer.
Any ODBC client application can handle this sequence of tuples. In particular, a standard
text-based client works acceptably for online aggregation—estimates stream across the
screen, and the final results remain on the last lines of the output. This proved to be our
client of choice while developing the system. Our new interfaces (e.g., figure 1) overwrite
old estimates with updated ones by assigning groups to positions in a tabular display.
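
A client-side consumption loop can therefore be written against any standard cursor interface. The Python sketch below assumes a DB-API-style cursor whose rows are (college, avg, half_width) triples for the query above; render() is a hypothetical stand-in for repainting the tabular display of figure 1.

def display_online_aggregation(cursor):
    # Consume the stream of estimate tuples; later rows for a group overwrite earlier ones.
    display = {}                                  # group -> latest (estimate, half_width)
    while True:
        row = cursor.fetchone()
        if row is None:                           # query finished; display holds final answers
            break
        college, avg_grade, half_width = row
        display[college] = (avg_grade, half_width)
        render(display)                           # hypothetical: repaint the table or graph
    return display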

4.1.2. Input API. Users should be able to control a running query by interacting with a
client, which passes user input on to the server. As a concrete example from our imple-
mentation, the user should be able to speed up, slow down, or stop groups in an online
aggregation query. Similarly, the spreadsheet widget should be able to express a preference
for records that are in the range currently being shown on screen.
For clients implemented in C over our enhanced ESQL, input to the server is direct: the
client simply invokes API functions that we have added to ESQL, such as pause_group(x),
speed_up_group(x), etc. Such invocations can be interleaved arbitrarily with requests to
fetch data from a cursor. For clients communicating via ODBC rather than ESQL, direct
invocation of API functions is not an option. For this scenario, we were able to leverage
the Object-Relational features of UDO in a non-standard manner. In an object-relational
system, an SQL query can invoke user-defined functions (UDFs) in the SELECT statement.
We added UDFs to our database that invoke each of our API functions. Hence the client
can issue a query of the form

SELECT PAUSE_GROUP(x);

and get the same effect as a client using ESQL. Clients can interleave such “control” queries
with requests to fetch tuples from the main query running on the database. Originally
we considered this an inelegant design (Hellerstein et al., 1997), because it significantly
increases the code path for each gesture a user makes. In practice, however, the performance
of this scheme has been acceptable, and the ease of use afforded by ODBC and Windows-
based 4GLs has been beneficial.
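
The following Python sketch illustrates this pattern under a DB-API-style connection (qmark parameter style assumed); update_display() and poll_user_input() are hypothetical client hooks, and the control UDFs invoked are of the kind described above.

def run_with_control(conn, query):
    # Interleave fetches on the main online query with "control" queries that
    # invoke server-side UDFs to pause or speed up groups.
    main = conn.cursor()
    main.execute(query)                           # e.g., the SELECT ONLINE query of Section 2.1
    control = conn.cursor()
    while True:
        row = main.fetchone()
        if row is None:
            break
        update_display(row)                       # hypothetical UI hook
        gesture = poll_user_input()               # hypothetical, e.g., ("speed_up", "5-LOW")
        if gesture is not None:
            action, group = gesture
            if action == "pause":
                control.execute("SELECT PAUSE_GROUP(?)", (group,))
            elif action == "speed_up":
                control.execute("SELECT SPEED_UP_GROUP(?)", (group,))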

4.1.3. Pacing. Online aggregation queries can return updated estimates quite frequently—
for a single-table aggregation query like that of Section 2.1, a new estimate is available for
every tuple scanned from the table. However, many clients use complex graphical interfaces
or slow network connections, and are unable to fetch and display updates as fast as the
database can generate them. In these scenarios it is preferable for the system to produce a
new estimate once for every k ≫ 1 tuples. This prevents overburdening the client. There
is a second reason to avoid producing too many updated estimates. Typically, SQL query
output is synchronized between the server and client: after the client fetches a tuple, the
server is idle until the next request for a tuple from the client. This architecture is an artifact
of batch system design, in which most of the server processing is done before the first tuple
can be fetched. In an online aggregation query, time spent handling updated estimates at the
client is time lost at the server. Hence it pays to amortize this lost time across a significant
amount of processing at the server—i.e., one does not want to incur such a loss too often.
A solution to this problem would be to reimplement the threading model for online
aggregation so that the query processing could proceed in parallel with the delivery of
estimates. To achieve this, the aggregation query could run as one thread, and a separate
request-handling thread could process fetch requests by “peeking” at the state of the query
thread and returning an estimate. Rather than rearchitecting Informix’s structure in this way,
we chose to maintain Informix’s single-thread architecture, and simply return an output tuple
for every k tuples that are aggregated. We refer to the number k as the skip factor. The skip
factor is set by the user in terms of time units (seconds), and we translate those time units into
a skip factor via a dynamic feedback technique that automatically calibrates the throughput
of client fetch-handling.
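
One simple realization of such a feedback loop is sketched below in Python; the damping constant and the use of wall-clock time between deliveries are our illustration, not the Informix code.

import time

class SkipFactor:
    # Translate a user-specified update interval (seconds) into a skip factor k,
    # recalibrated from the observed time the client spends between fetches.

    def __init__(self, target_seconds, initial_k=100):
        self.target = target_seconds
        self.k = initial_k
        self.last = time.monotonic()

    def on_estimate_delivered(self):
        now = time.monotonic()
        elapsed = now - self.last
        self.last = now
        if elapsed > 0:
            # Scale k so that k tuples take roughly `target` seconds; damp to avoid oscillation.
            ideal = self.k * self.target / elapsed
            self.k = max(1, int(0.5 * self.k + 0.5 * ideal))
        return self.k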

4.2. Implementing online query operators

The algorithms of Section 3 were designed in isolation, but a number of subtleties arise when
integrating them into a complete system. In this section, we discuss issues in implementing
online query processing operators in Informix.
4.2.1. Online reordering. Like the early System R DBMS prototype (Astrahan et al.,
1976), Informix UDO is divided into two parts: a storage manager called RSAM, and
a query optimization/execution engine implemented on top of RSAM. In our prototype,
we decided to implement all our algorithms entirely above RSAM. This decision was
motivated by considerations of programming complexity, and by the fact that RSAM has a
well-defined, published API that we did not want to modify (Informix, 1998).
Implementing Index Stride above RSAM resulted in some compromises in performance.
RSAM presents interfaces for scanning relations, and for fetching tuples from indexes.
The index-based interface to RSAM takes as arguments a relation R, an index I , and a
predicate p, and returns tuples from R that match p. To amortize I/O costs, an efficient
implementation of Index Stride would fetch tuples a disk block at a time: one block of tuples
from the first group, then one from the second, and so on. Since the RSAM interface is
tuple-at-a-time, there is no way to explicitly request a block of tuples. In principle the buffer
manager could solve this problem by allocating a buffer per group: then when fetching the
first tuple from a block it would put the block in the buffer pool, and it would be available
on the next fetch. Unfortunately this buffering strategy does not correspond naturally to the
replacement policies in use in a traditional DBMS. The performance of Index Stride could
be tuned up by either implementing a page-based Index Stride in RSAM, or enhancing the
buffer manager to recognize and optimize Index Stride-style access.

4.2.2. Ripple join. Performance is not the only issue that requires attention in a complete
implementation; we also encountered correctness problems resulting from the interaction
of ripple join and our data delivery algorithms. As described in figure 6, ripple join rescans
each relation multiple times. Upon each scan, it expects tuples to be fetched in the same
order. Traditional scans provide this behavior, delivering tuples in the same order every time.
However, sampling-based access methods may not return the same order when restarted.
Moreover, the online reordering operator does not guarantee that its output is the same when
it is restarted, particularly because users can change its behavior dynamically.
To address this problem, we introduce a cache above non-deterministic operators like
the reorder operator, which keeps a copy of incoming tuples so that downstream ripple join
operators can “replay” their inputs accurately. This cache also prevents ripple join from
actually repeating I/Os when it rescans a table—in this sense the cache is much like the
“building” hash table of a hash join.
In ordinary usage, we expect online queries to be terminated quickly. However, if a query
is allowed to run for a long period of time, the ripple caches can consume more than the
available memory. At that point two options are available. The first alternative is to allow the
cache to become larger than available memory, spilling some tuples to disk, and recovering
them as necessary—much as is done in hash join algorithms. The second alternative is to
stop caching at some point, but ensure that the data delivery algorithms output the same
tuples each time they are restarted. For randomized access methods this can be achieved
by storing the seed of the random number generator. For user-controllable reordering, this
requires remembering previous user interactions and “replaying” them internally when
restarting the delivery. We implemented the latter technique in Informix.
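
A purely in-memory version of such a cache can be sketched as follows (Python); the real operator must additionally bound its memory use, either by spilling or by falling back to the replay techniques just described.

class ReplayCache:
    # Wrap a non-deterministic tuple source so that re-scans replay exactly
    # the prefix already delivered, then continue pulling new tuples.

    def __init__(self, source):
        self.source = iter(source)
        self.cache = []                 # tuples delivered so far, in delivery order

    def scan(self):
        i = 0
        while True:
            if i < len(self.cache):
                yield self.cache[i]     # replay from the cache: no repeated I/O
            else:
                t = next(self.source, None)
                if t is None:
                    return
                self.cache.append(t)
                yield t
            i += 1
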
An additional issue arises when trying to provide aggregation estimates over ripple joins.
As noted in Section 3.3, estimates and confidence intervals over ripple joins can be refreshed
when a sample of R has been joined with sample of S—this corresponds to a “corner” in the
pictures of figure 4. It therefore makes sense for the estimates in a join query to be updated
at every k’th corner for skip-factor k. However it is possible that the tuple corresponding to
a corner may not satisfy the WHERE clause; in this case, the system does not pass a tuple
up the plan tree, and the estimators are not given the opportunity to produce a refined output
tuple. To avoid this loss of interactivity, when a ripple join operator reaches a “corner”,
it returns a dummy, “null tuple” to the aggregation code. These null tuples are handled
specially: they are not counted as part of the aggregation result, but serve as a trigger for
the aggregation code to update estimates and confidence intervals. Null tuples were easy to
implement in UDO as generalizations of tuples in outer join operators: an outer join tuple
has null columns from a proper subset of its input relations, whereas a null tuple has null
columns from all of its input relations.

4.3. Constructing online query plans

Like any relational database system, an online query processing system must map declara-
tive SQL queries to explicit query plans: this process is typically called query optimization.
Online processing changes the optimization task in two ways: the performance goals of an
online system are non-traditional, and there are interactive aspects to online query processing
that are not present in traditional databases. For example, for any given query we want to
minimize not the time to completion but rather the time to “reasonable” accuracy. Online
query optimization is an open research area we have only begun to address in the CON-
TROL project. In our first prototype we modified Informix UDO’s optimizer and query
executor to produce reasonable online query plans, without worrying about optimality. In
this section we describe our implementation, and raise issues for future research. As mo-
tivation, figure 7 shows a complete online query plan, including a number of online query
processing operators.

Figure 7. An online query plan tree.

4.3.1. Access method selection. The choice of access methods dictates the extent to which
a user can control data delivery. The choices of access methods lie on a spectrum of con-
trollability. At one extreme, sequential scan provides no control, but it is fast and does not
require indexes. At the other extreme, Index Stride provides complete control over the pro-
cessing rates from different groups in a GROUP BY query, but it is slow because it performs
many random I/Os. Online reordering lies between these two extremes. It runs almost as
fast as a sequential scan (except for a small CPU overhead), but only does a best-effort
reordering: if the user has a high preference for extremely rare groups, online reordering
can deliver tuples only as fast as it scans them. In our implementation, we modified the
UDO optimizer to handle SELECT ONLINE queries with a simple heuristic: if there is
a clustered index on the GROUP BY column we use an Index Stride access method, and
otherwise we use a sequential scan access method and follow it with online reordering.
An open issue is to allow for reordering when there are GROUP BY clauses over columns
from multiple tables: in this case the reordering operators must work in concert with ripple
join to produce groups in the appropriate ratio. Currently, we add a reordering operator to
only the leftmost table in the query plan that has a GROUP BY. We intend to address this
issue in future work.

4.3.2. Join ordering. Unlike most traditional join algorithms, ripple joins are essentially
symmetric: there is no distinction between the “outer” and “inner” relations. Hence for a
tree of ripple joins, the join ordering problem is quite simple: all that is required is to tune
the aspect ratios described in Section 3.3; this is done dynamically as described in Haas
and Hellerstein (1999). This view of the problem is overly simplistic, since there is a choice
of ripple join variants (block, index or hash), and the join order can affect this choice. For
example, in a query over relations R, S and T , there may be equality join clauses con-
necting R to S, and S to T , but no join clause connecting R to T . In this case the join
order (R ⋈ T) ⋈ S cannot use a hash- or index-ripple join since we need to form the
cross-product of R and T; by contrast, the join order (R ⋈ S) ⋈ T can use two hash joins.
In our current prototype, we let UDO decide on the join ordering and join algorithms in its
usual fashion, optimizing for batch performance. We then post-process the plan, inserting
reorder operators above access methods, converting hash joins to hash-ripple joins, index
nested-loops joins to index-ripple joins, and nested-loops joins to block-ripple joins.
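
The post-processing pass can be pictured as a small tree rewrite. The Python sketch below assumes a toy dictionary-based plan representation; the operator names and the grouping_table flag are illustrative, not Informix internals.

def onlineize(plan):
    # Rewrite a batch plan for online execution: put reorder above grouping-table
    # access methods and replace batch joins with their ripple counterparts.
    mapping = {
        "hash_join": "hash_ripple_join",
        "index_nested_loops_join": "index_ripple_join",
        "nested_loops_join": "block_ripple_join",
    }
    plan["children"] = [onlineize(c) for c in plan.get("children", [])]
    if plan["op"] in mapping:
        plan["op"] = mapping[plan["op"]]
    elif plan["op"] == "seq_scan" and plan.get("grouping_table"):
        return {"op": "reorder", "children": [plan]}
    return plan
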
In ongoing related work, we are studying query optimization in the online context. We
have developed a prototype of a continuous optimization scheme based on the notion of an
eddy (Avnur and Hellerstein, 2000). An eddy is an n-ary query operator interposed between
n input tables and n − 1 joins, which adaptively routes tuples through the joins. The output
of each join is sent back to the eddy for further processing by other joins. By interposing
an eddy in the data flow, we change the join order of the query for every tuple as it passes
through the query plan. Eddies hold promise not only for online query processing, but for
any pipelined set of operations that operate in an uncertain environment—any scenario
where user preferences, selectivities or costs are unpredictable and dynamic.

4.4. Beyond select-project-join

The majority of our discussion up to this point has concentrated on simple SQL query
blocks—consisting of selection, projection and join—followed by grouping and aggrega-
tion. A complete online SQL system must handle other variations of SQL queries as well.
We have not implemented all of SQL in an online fashion: some of it is best done by the
user interface, and some is future research. We detail these issues here.

4.4.1. Order by, Having. The ORDER BY clause in SQL sorts output tuples based on the
values in some column(s). Implementing an online ORDER BY at the server is somewhat
pointless: at any time the server could construct an ordered version of the output so far,
but re-shipping it to the client for each update would be wasteful. Instead, we believe that
ORDER BY is best implemented at the client, using an interface like those described in
Section 2.2. Note that the ORDER BY clause is often used to get a “ranking”, where a few
representative rows near the top will suffice (Carey and Kossmann, 1997; Chaudhuri and
Gravano, 1996). In this case an online spreadsheet interface may be more appropriate to
get a typical smattering of the top few rows; the strict total ordering of SQL’s ORDER BY
may over-constrain the request at the expense of performance.
SQL’s HAVING clause filters out groups in a GROUP BY query, based on per-group
values—either aggregation functions or grouping columns. Consider a version of the query
of Section 2.1 that contains an additional clause at the end, HAVING AVG(grade) > 3.0;
this is an example of a HAVING clause with an aggregate in it. Over time, some college’s
estimated average grade may go from a value greater than 3.0 to one less than 3.0. In that
case, the client must handle the deletion of the group from the display; the server does not
control how the client displays previously-delivered tuples. Since the HAVING clause has
to be handled appropriately by clients, we did not implement a HAVING-clause update
scheme at the server.

4.4.2. Subqueries and other expensive predicates. SQL allows queries to be nested in the
FROM, WHERE and HAVING clauses. Queries containing such subqueries can sometimes
be rewritten into single-level queries (see Leung et al. (1998) or Cherniack (1998) for an
overview), but there are cases where such subqueries are unavoidable. The problem with
such subqueries is that they force batch-style processing: the outer query cannot produce
output until the subquery is fully processed. For correlated subqueries, the problem is
exacerbated: the outer query must complete processing (at least!) one subquery before each
tuple of output. A similar problem arises with expensive user-defined functions in an object-
relational system (Hellerstein, 1998b; Chaudhuri and Shim, 1996): at least one expensive
computation must be completed before each tuple is passed to the output.
To date, we have not addressed the online processing of SQL queries with subqueries.
One perspective on subqueries in an online system is to view them as the composition of
two online systems: the subquery Q′ is an input to the outer query Q, and the goal is to
produce running estimates for Q(Q′( )). The optimization of such compositions has been
studied in the AI community as anytime algorithms (Zilberstein and Russell, 1996), but it
is not immediately clear whether that body of work is applicable to multiset-based operators
like SQL queries. Recently (Tan et al., 1999) have suggested that queries with subqueries
be executed as two threads, one each for the outer and inner query blocks, with the outer
block executing based on estimated results from the inner block. The user controls the
rate of execution of these two threads, and thereby the accuracy of the answers. While this
approach is promising, it works only when the subquery simply computes an aggregate, and
the predicate linking the outer block to the inner block is a comparison predicate; it will be
interesting to see if this approach can be extended to other predicates. Various techniques
are available for efficiently executing correlated subqueries and expensive functions during
processing (Rao and Ross, 1998; Hellerstein and Naughton, 1996); these ameliorate but do
not solve the batch-style performance.

5. A Study of CONTROL in action

In this section, we illustrate the performance of interactive data analysis in Informix UDO via
a sample usage scenario. Our goal is to demonstrate the benefits of online query processing:
continually improving feedback on the final result, and interactive control of the processing.
We also give an example which combines a ripple join and online reordering to show the
completeness of our system against all “select-project-join” queries. Note that this is not
intended as a robust analytic study of performance; comparisons of online algorithms
under different distributions and parameter settings are given in Hellerstein et al. (1997),
Haas and Hellerstein (1999) and Raman et al. (1999).

Figure 8. Queries used in our data analysis session.

We scale up all numbers by an undisclosed factor to honor privacy commitments to
Informix Corporation while still allowing comparative analysis (hence time is expressed
in abstract “chronons”). We give rough figures for the actual wall-clock times in order to
show the interactivity of CONTROL.7 We run each of our experiments only until reasonable
accuracy is reached and the analyst switches to the next query.
Our data analysis session uses three queries (see figure 8) against the TPC-D database
with a scale factor of 1 (about 1 GB of data) (Transaction Processing Council). Since online
reordering is more difficult for skewed distributions of group values, we use a Zipfian
distribution (1 : 1/2 : 1/3 : 1/4 : 1/5) of tuples across different order priorities in the Order table. In
Queries 2 and 3 we group by o_orderpriority, to demonstrate the effectiveness of online
reordering even in the presence of the skewed distributions commonly found in real data.
Our scenario begins with an analyst issuing Query 1, to find the average price of various
orders. Although the base table takes a while to scan (it is about 276MB), the analyst
immediately starts getting estimates of the final average. Figure 9 shows the decrease in the
confidence interval for the average as a function of time. We see that very accurate results
are obtained within 12 chronons. On the other hand, a non-online version of the same query
took 1490 chronons to complete, over two orders of magnitude slower! To give an idea
of the interactivity, the online version took about a second to reach reasonable (confidence
interval <2%) accuracy whereas the non-online version took on the order of a few minutes
to complete.

Figure 9. Decrease in the confidence interval for Query 1.

Figure 10. Number of tuples processed for different groups for Query 2.

After 12 chronons, the analyst looks at the estimates and feels (based on domain knowl-
edge) that they “are not quite right”. Wondering what is happening, the analyst hypothesizes
that the prices may be skewed by the presence of some expensive air shipments. Hence the
next query issued is Query 2, which finds the average price of orders that do not have any
item shipped by air. Notice that with a traditional DBMS, the analyst would not be able to
try out this alternative until the first query had completed and returned a result.
Since Query 2 involves a non-flattenable subquery, Informix cannot pick a ripple join. It
instead scans Order, using the index on Lineitem to evaluate the subquery condition. It
chooses online reordering to permit user control over the processing from different groups
of Order. Almost immediately after Query 2 starts running, the analyst decides, from
the estimates, that the group with o orderpriority as 5-LOW is more interesting than
others, and presses the “speed up” button (see figure 1) to give 5-LOW a preference of 5,
compared to 1 for the rest of the groups. Figure 10 shows the number of tuples processed
for different groups as a function of time. After some time (point T1 in the figure), 5-
LOW’s confidence interval has narrowed sufficiently, and the analyst shifts interest to group
1-URGENT, giving it a preference of 5, and reducing that of all other groups to 1.8 We see
that online reordering is able to meet user preferences quite well, despite the interesting
groups being rare. In contrast, a sequential scan will actually process the interesting group
5-LOW more slowly than others, because it is uncommon.

Figure 11. Decrease in confidence interval for different groups for Query 2 with online reordering. Note that the confidence interval for 5-LOW decreases sharply at the beginning, and that the preference for 1-URGENT drops sharply at T1.

Figure 11 shows the rate at which the confidence intervals for the estimates decrease for
the same query. We see that the reorder operator automatically adjusts the rates of processing
so as to decrease confidence intervals based on user preferences. Also note that at T1, when
the preference for 1-URGENT is increased, the other groups stop for a while (until T2)
(in fact 5-LOW plateaus for so long after T2 that the analyst sees no further updates for
this group before quitting at 800 chronons; this same behavior is also seen in Figure 10).
Likewise, there is a sudden spurt in the processing of tuples from 1-URGENT from T1
until T2. This happens because reorder realizes that it has processed very few tuples from
1-URGENT with respect to the new preferences, and starts to compensate for it. After T2,
the different groups get processed according to their preferences. This compensation arises
because of the performance goal that we use: we view the preferences as a weight on the
importance of a group, and try to make the weighted average confidence interval across all
groups decrease as fast as possible. The motivation for this goal, as well as performance for
several other goals, is given in greater detail in Raman et al. (1999).
The mechanism for doing the reordering to decrease the confidence intervals in an optimal
manner is explained in Raman et al. (1999). In that paper we also present results that show
that online reordering performs well against a variety of data distributions, processing costs,
and user preference change scenarios. We also show its efficacy in reordering for an online
enumeration application, and for speeding up traditional batch query processing.
After seeing the estimates for 800 chronons, the query does not seem to exhibit any
interesting trends. Hence the analyst decides to proceed in a different direction, and submits
a new query (Query 3), which drills down on the average lineitem prices, grouping by the
o_orderpriority. Online reordering has helped the analyst to narrow down the estimates
for the group of interest and drill down much before Query 2 completes; the entire query
takes 58730 chronons to run.
Query 3 needs to access the Order and Lineitem tuples, and since the join predicate
is equality it can be evaluated with a hash ripple join. Again, we use online reordering on
Order to allow user control over the processing of tuples from different order priorities.
This query involves running a hash-ripple join over a reorder operator.
Figure 12. Decrease in confidence interval for different groups for Query 3 with online reordering. Note that the
confidence interval for 3-MEDIUM decreases much faster than the others.

Figure 12 shows the decrease in confidence intervals for different groups over time.
The analyst increases the preference for group 3-MEDIUM to 20 at the beginning of the
processing, and this is reflected in the faster decrease in its confidence interval. We see that
within 100 chronons, a reasonably accurate picture emerges, especially for 3-MEDIUM. In
contrast, a regular hash join took 68210 chronons to execute this query, which is about
three orders of magnitude slower! Notice that this query involves running a hash ripple join
over a reorder operator. Hence the join replays tuples from Lineitem using a cache of the
previous results from reorder as explained in Section 4.2.2. Further comparisons of ripple
joins are presented in Haas and Hellerstein (1999) including a comparison of hash, index,
and block ripple joins with batch algorithms, and a study of the importance of tuning join
aspect ratios for shrinking confidence intervals.

6. Conclusions and future work

We have implemented a prototype online query processing system in Informix UDO, which
supports ad-hoc select-project-join queries with grouping and aggregation. Our prototype
cannot be considered “shippable” software, but it is fairly complete. The set of algorithms
we developed provides interactive processing for a large family of SQL queries, and we
were able to address most of the details required to implement a usable general-purpose
system. We have been able to interface our system with 4GL-based client tools with little
difficulty, resulting in quite serviceable online data analysis applications.
A number of lessons emerged in our work. Most significantly, we found that extending an
SQL engine to provide online query processing is neither trivial nor overly daunting. A full
implementation does require extending the database engine from within, and extensions
are required from the access methods through the query operators, optimizer, and client
API. But most of the changes are analogous to facilities already present in a complete SQL
engine, and are therefore relatively simple to add.
We believe that many of our lessons in extending the system are pertinent outside the spe-
cific domain of online query processing. For example, any extension of a database system
to handle new interactions will require some of the same API considerations mentioned in
this paper. Similarly, additions of new access methods entail new considerations in database
design. In addition, the issues raised by composing algorithms into large query plans are gen-
erally important, and perhaps too often neglected outside the context of relational database
systems. Because these kinds of issues are often ignored in initial publications, new data
analysis techniques are often passed over by practitioners. On occasion these pragmatic
issues can also become rather interesting research problems, which would not naturally
arise outside the context of a full implementation.
The most significant lesson for us resulted from the leverage we got from cross-
disciplinary synergy: in our case, the mixture of statistics, computer systems and user
interfaces. Our work was motivated by user interface goals; these mapped to system per-
formance requirements, which were in turn clarified by optimizing for statistical metrics.
Ripple join is perhaps the clearest example of this synergy: in order to achieve a balance
between interactivity and output quality for multi-table queries, we developed a new query
processing algorithm, which led to a very non-standard dynamic optimization paradigm
(adjusting aspect ratios) based on variance of input to statistical estimators. We believe that
this kind of cross-disciplinary synthesis is a fertile approach for developing future computer
systems.
A number of issues remain as future work in online query processing. We have barely
begun to study the user interface issues involved in online data analysis; much work remains
to be done in understanding and constructing usable applications, especially for online enu-
meration and visualization. In the systems arena, we are extending our results to work in a
parallel context, which entails algorithmic work as well as statistical work—e.g., parallel
ripple joins are akin to stratified sampling techniques, which affects the estimators and con-
fidence intervals for online aggregation. It would also be helpful to find general techniques
for handling subqueries—many typical decision-support queries contain subqueries (e.g.
the TPC-D benchmark queries (Transaction Processing Council)), and some of those cannot
be rewritten away.
More generally, we are interested in the way that interactive techniques change the process
of data analysis and knowledge discovery: when long-running operations become more
interactive, the distinction between so-called “automated” techniques (e.g., data mining) and
user-driven techniques (data visualization, SQL) are significantly blurred. The implications
are likely to go beyond the obvious issues of faster interaction, to suggest new and more
natural human-computer interactions for data analysis.

Appendix: A. Issues in implementing aggregate estimation

Online aggregation queries raise additional complications, since query processing must be
integrated with estimation techniques. While this paper does not dwell on the statistics
involved in our estimators, we do describe some of the system implementation issues that
they raise.

A.1. Choice of confidence intervals

Three different classes of confidence intervals for online aggregation are detailed in (Haas,
1997). Deterministic confidence intervals give strict bounds on the error of an estimator, us-
ing bounds on the cardinalities of the tables and on the values of the input to the aggregation
function. Conservative estimators are based on Hoeffding’s inequality (Hoeffding, 1963),
and can be used for samples of any size, but appear inapplicable to certain aggregation
functions like STDDEV. Large sample estimators are based on Central Limit Theorems,
and can be used on large samples (where large can be as small as 20–30 tuples) to provide
quite tight confidence intervals. All three of these estimators can be extended to work on
the cross-product samples provided by ripple joins (Haas and Hellerstein, 1999). We prefer
to use large-sample confidence intervals because they are typically much tighter than de-
terministic or conservative estimates (Haas, 1997) and apply to a wider range of aggregate
functions. However, there are two scenarios in which large-sample confidence intervals are
inappropriate.
First, large sample confidence intervals fluctuate wildly at the beginning of a query,
until a “large enough” sample has been processed. In order to avoid misleading the user,
conservative confidence intervals should be used at first, until a large sample has been
obtained. A reasonable choice might be to switch to large-sample confidence intervals only
after 40 or 50 tuples have passed the query’s WHERE clause. For those aggregates that
cannot use conservative confidence intervals, it makes sense to simply postpone producing
confidence intervals for that period. In general it is difficult to robustly decide when the
Central Limit Theorem “kicks in” and one can switch estimators; in practice the sample
sizes resulting from “big picture” database queries become large enough very quickly.
Second, towards the end of processing a query large-sample confidence intervals are too
conservative. Even after all tuples have been processed, a large-sample confidence interval
will have a finite, although small, width. Thus towards the end of a query it is appropriate
to switch to using deterministic confidence intervals. This switch can be done as soon as
the deterministic interval is tighter than the large-sample interval.
In most situations, we expect users to stop processing online queries well before they
complete, so the issue of deterministic confidence intervals may seem irrelevant. However
deterministic confidence intervals can get used quite often because of reordering in GROUP
BY queries. If a user is especially interested in a particular group, they will speed up that
group’s delivery to the point where all tuples from the group will get fetched quickly; this is
analogous to the “end-of-query” scenario for that group. If large-sample confidence intervals
were used, that group could have a wide confidence interval even though it was finished.
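
The switching policy can be sketched for AVG as follows (Python). The Hoeffding and CLT formulas are standard; the threshold of 40 tuples, the neglect of finite-population corrections, and the exact switch rule are illustrative simplifications of the policy described above.

import math
import statistics

def avg_estimate(sample, lo, hi, N, p=0.95, small=40):
    # Running AVG estimate plus a confidence-interval half-width, switching between
    # interval types: a Hoeffding-based conservative interval while the sample is small,
    # a large-sample (CLT) interval afterwards, and the deterministic interval once it
    # becomes tighter. Values are assumed bounded in [lo, hi]; N is the table cardinality.
    n = len(sample)
    est = sum(sample) / n
    delta = 1.0 - p
    hoeffding = (hi - lo) * math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    z = statistics.NormalDist().inv_cdf(1.0 - delta / 2.0)
    clt = z * statistics.stdev(sample) / math.sqrt(n) if n > 1 else float("inf")
    probabilistic = hoeffding if n < small else clt
    # Deterministic bound: each of the N - n unseen values lies in [lo, hi].
    det_lo = (sum(sample) + (N - n) * lo) / N
    det_hi = (sum(sample) + (N - n) * hi) / N
    if (det_hi - det_lo) / 2.0 < probabilistic:
        return (det_lo + det_hi) / 2.0, (det_hi - det_lo) / 2.0   # end-of-query regime
    return est, probabilistic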

A.2. Interaction of reordering and aggregate estimation

Reordering the data delivery biases the order in which tuples are read from different groups;
in this sense it does not produce unbiased random samples. However, a key property of
our reordering methods is that, within a group, reordering does not affect the order of
tuples produced, and hence unbiased random samples are guaranteed per group. Since the
aggregate function for each group in a GROUP BY query is estimated independently, no
bias is involved in the estimation.

A.3. Calculation of quantities for estimation

A number of quantities are required as inputs to our estimators and confidence-interval
computations. We briefly describe how we derive these in our implementation.
First, the estimators we use require knowledge of the actual number of tuples read from
each table before any selections are applied. Obtaining this information required us to modify
the RSAM layer slightly, since RSAM only returns tuples that satisfy relevant selections in
the query’s WHERE clause. This was the only situation where we added code in the RSAM
layer.
Second, a few of our estimators and confidence interval computations require knowledge
of the cardinalities of individual groups in a table (Haas, 1997; Haas and Hellerstein, 1999).
To obtain this information for an Index Stride, we require a histogram on the GROUP BY
column; a more natural solution would be to use a ranked index (Knuth, 1973; Antoshenkov,
1992; Aoki, 1998), but this is not available in most commercial databases. In the first phase
of the reordering operator described at the end of Section 3.2, we can estimate a group’s cardinality
as the fraction of tuples from that group in the currently scanned sample. The statistics in
the confidence intervals need to take the error of this estimation into account (Haas, 1997;
Haas and Hellerstein, 1999); the statistical details remain as future work. In the second
phase of the reordering operator, the exact cardinalities of all groups are known, so our
usual estimation techniques apply.
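A minimal sketch of that phase-one estimate (hypothetical code; the estimation error that the confidence-interval statistics must account for is simply left out here, as in the text):

    def estimate_group_cardinalities(group_counts, tuples_scanned, table_cardinality):
        # group_counts maps a GROUP BY key to the number of its tuples seen so
        # far; scale each group's sample fraction up to the table cardinality.
        return {key: (seen / tuples_scanned) * table_cardinality
                for key, seen in group_counts.items()}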
Third, deterministic confidence intervals also require upper and lower bounds on the in-
puts to the aggregation functions. If the input to the function is a column (e.g. AVG(enroll.grade)), these bounds are typically available from the system catalogs. If the input is more complex (e.g. AVG(f(enroll.grade)) for some UDF f), the only bounds available are those of the data type of the input (e.g. MAXINT is an upper bound for integers).
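A small sketch of that fallback, assuming a hypothetical catalog interface with per-column min/max statistics:

    import sys

    def input_bounds(catalog, column, is_plain_column, dtype):
        # Bare column: use the min/max kept in the (hypothetical) catalog.
        if is_plain_column:
            return catalog.min(column), catalog.max(column)
        # Otherwise fall back to the bounds of the input's data type.
        if dtype == 'int':
            return -sys.maxsize - 1, sys.maxsize   # stand-ins for MININT/MAXINT
        raise ValueError('no usable bounds for type ' + dtype)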

Acknowledgments

Bruce Lo implemented our user interface for online aggregation, and was involved in
early development of our Informix extensions. Chris Olston and Andy Chou developed the
CLOUDS system we describe in Section 2.3, and Christian Hidber developed the CARMA
data mining algorithm described in Section 2.4. We are indebted to the CONTROL group at
Berkeley for many interesting discussions: Andy Chou, Christian Hidber, Bruce Lo, Chris
Olston, Tali Roth and Kirk Wylie. We are also grateful to Peter Haas of IBM Almaden
Research for his many contributions to this work. Our relationship with Informix was made
possible by Mike Stonebraker and Cristi Garvey; this kind of close academic/industrial
collaboration is all too rare. At Informix, we received great help from Chih-Po Wen, Robyn
Chan, Paul Friedman, Kathey Marsden, and Satheesh Bandaram. Helen Wang and Andrew
MacBride contributed to an earlier Postgres-based prototype of online query processing.
This work was supported by a grant from Informix Corporation, a California MICRO
grant, NSF grant IIS-9802051, and a Sloan Foundation Fellowship. Computing and network
resources for this research were provided through NSF RI grant CDA-9401156.

Notes

1. Continuous Output and Navigation Technology with Refinement On Line.


2. The other SQL aggregates, MIN and MAX, are “needle-in-a-haystack” scenarios that cannot be satisfied via
sampling schemes like online aggregation: the minimum or maximum value may not appear until the last tuple
is fetched. However, related user-defined aggregates are amenable to online aggregation: e.g., 99-percentile, which displays a tuple that can be said to be within ε of the 99'th percentile with some confidence.
3. The term clustering in relational databases is distinct from the statistical notion of clustering. A table in a
database is said to be clustered on a column (or list of columns) if it is stored on disk in ascending order of that
column (or ascending lexicographic order of the list of columns).
4. In order for this scheme to work, one must declare the function f ( ) to SQL as being “NOT VARIANT”.
This informs the system that the value of f ( ) for a tuple will not change over time, and the index can serve
as a cache of the f ( ) value for each tuple. This might seem counterintuitive since f ( ) generates random
numbers, but the point is that the use of the random numbers is static: once they define an initial random
ordering of the table, that ordering is static until the table is reclustered.
5. Index Stride is naturally applicable to other types of indices as well, but we omit discussion here.
6. The name “ripple join” has two sources. One is shown in the pictures in figures 4 and 5—the algorithm sweeps
out the plane like ripples spreading in a pond. The other source is the rectangular version of the algorithm,
which produces “Rectangles of Increasing Perimeter Length”.
7. We ran all experiments on a lightly-loaded dual-processor 200 MHz UltraSPARC machine running SunOS 5.5.1
with 256MB RAM. We used the INFORMIX Dynamic Server with Universal Data Option version 9.14 which
we enhanced with online aggregation and reordering features. We used a separate disk for the side-disk for
online reordering. We used a statistical confidence parameter (Haas, 1997) of 95% for our large sample
confidence intervals. Note that we did not bother to tune our Informix installation carefully, since our online
results were already performing sufficiently well. The performance of the online and batch results presented here is not necessarily indicative of the peak performance available in a well-tuned installation, and readers should
not use this study to extrapolate about the performance of Informix UDO.
8. To ensure that preferences are changed at a fixed, repeatable point in our experiments, we modified the reorder
operator to read in the preference change points from a configuration file instead of from the GUI of figure 1.

References

Agrawal, R. 1997. Personal communication.


Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules. In Proc. 20th International Con-
ference on Very Large Data Bases, Santiago de Chile, September 1994.
Aiken, A., Chen, J., Stonebraker, M., and Woodruff, A. 1996. Tioga-2: A direct-manipulation database visualization
environment. In Proc. 12th IEEE International Conference on Data Engineering, New Orleans, February 1996.
Antoshenkov, G. 1992. Random sampling from pseudo-ranked B+ trees. In Proc. 18th International Conference
on Very Large Data Bases, Vancouver, August 1992.
Antoshenkov, G. and Ziauddin, M. 1996. Query processing and optimization in Oracle Rdb. VLDB Journal,
5(4):229–237.
Aoki, P.M. 1998. Generalizing “search” in generalized search trees. In IEEE International Conference on Data
Engineering, Orlando, February 1998.
Astrahan, M., Blasgen, M., Chamberlin, D., Eswaran, K., Gray, J., Griffiths, P., King, W., Lorie, R., McJones, P.,
Mehl, J., Putzolu, G., Traiger, I., Wade, B., and Watson, V. 1976. System R: Relational approach to database
management. ACM Transactions on Database Systems, 1(2):97–137.
Avnur, R. and Hellerstein, J.M. 2000. Eddies: Continuously adaptive query processing. In Proc. ACM-SIGMOD
International Conference on Management of Data, Dallas, May 2000.
Bayardo Jr., R.J. and Miranker, D.P. 1996. Processing queries for first-few answers. In Fifth Intl. Conf. Information
and Knowledge Management, Rockville, MD.
Carey, M.J. and Kossmann, D. 1997. On saying “Enough Already!” in SQL. In Proc. ACM-SIGMOD International
Conference on Management of Data, Tucson, May 1997.
Carey, M.J. and Kossmann, D. 1998. Reducing the braking distance of an SQL query engine. In Proc. 24th
International Conference on Very Large Data Bases, New York City.
Chaudhuri, S. and Gravano, L. 1996. Optimizing queries over multimedia repositories. In Proc. ACM-SIGMOD
International Conference on Management of Data, Montreal, June 1996.
Chaudhuri, S. and Gravano, L. 1999. Evaluating top-k selection queries. In Proc. International Conference on Very
Large Data Bases, Edinburgh.
Chaudhuri, S. and Narasayya, V. 1998. AutoAdmin “What-If” index analysis utility. In Proc. ACM-SIGMOD
International Conference on Management of Data, Seattle, June 1998.
Chaudhuri, S. and Shim, K. 1996. Optimization of queries with user-defined predicates. In Proc. 22nd International Conference on Very Large Data Bases, Bombay (Mumbai), September 1996.
Cherniack, M. 1998. Building query optimizers with combinators. PhD Thesis, Brown University.
DeWitt, D.J., Katz, R.H., Olken, Frank, Shapiro, L.D., Stonebraker, R.M., and Wood, D. 1984. Implementation
techniques for main memory database systems. In Proc. ACM-SIGMOD International Conference on Manage-
ment of Data, Boston, June 1984.
Donjerkovic, D. and Ramakrishnan, R. 1999. Probabilistic optimization of Top N queries. In Proc. International
Conference on Very Large Data Bases, Edinburgh.
Fagin, R. 1998. Fuzzy queries in multimedia database systems. In Proc. ACM SIGACT-SIGMOD-SIGART
Symposium on Principles of Database Systems, Seattle, June 1998.
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. 1996. The KDD process for extracting useful knowledge from
volumes of data. Communications of the ACM, 39(11).
Fushimi, S., Kitsuregawa, M., and Tanaka, H. 1986. An overview of the system software of a parallel relational
database machine GRACE. In Proc. 12th International Conference on Very Large Data Bases, Kyoto, August
1986.
Gibbons, P.B. and Matias, Y. 1998. New sampling-based summary statistics for improving approximate query
answers. In Proc. ACM-SIGMOD International Conference on Management of Data, Seattle.
Gibbons, P.B., Poosala, V., Acharya, S., Bartal, Y., Matias, Y., Muthukrishnan, S., Ramaswamy, S., and Suel, T.
1998. Aqua: System and techniques for approximate query answering. Technical Report, Bell Laboratories.
Graefe, G. 1993. Query evaluation techniques for large databases. ACM Computing Surveys, 25(2):73–170.
Gray, J. and Graefe, G. 1997. The five-minute rule ten years later, and other computer storage rules of thumb.
SIGMOD Record, 26(4).
Haas, P.J. 1996. Hoeffding inequalities for join-selectivity estimation and online aggregation. IBM Research
Report RJ 10040, IBM Almaden Research Center.
Haas, P.J. 1997. Large-sample and deterministic confidence intervals for online aggregation. In Proc. 9th Interna-
tional Conference on Scientific and Statistical Database Management, Olympia, WA, August 1997.
Haas, P.J. and Hellerstein, J.M. 1999. Ripple algorithms for online aggregation. In Proc. ACM-SIGMOD Interna-
tional Conference on Management of Data, Philadelphia, May 1999.
Haas, P.J., Naughton, J.F., Seshadri, S., and Swami, A.N. 1996. Selectivity and cost estimation for joins based on
random sampling. Journal of Computer and System Sciences, 52:550–569.
Harinarayan, V., Rajaraman, A., and Ullman, J.D. 1996. Implementing data cubes efficiently. In Proc. ACM-
SIGMOD International Conference on Management of Data, Montreal, June 1996.
Hellerstein, J.M. 1997a. The case for online aggregation. Computer Science Technical Report CSD-97-958,
University of California, Berkeley.
Hellerstein, J.M. 1997b. Online processing redux. IEEE Data Engineering Bulletin, 20(3).
Hellerstein, J.M. 1998a. Looking forward to interactive queries. Database Programming and Design, 11(8):28–
33.
Hellerstein, J.M. 1998b. Optimization techniques for queries with expensive predicates. ACM Transactions on
Database Systems, 23(2).
Hellerstein, J.M., Avnur, R., Chou, A., Hidber, C., Olston, C., Raman, V., and Roth, T. 1999. Interactive Data
Analysis with CONTROL. IEEE Computer 32(9):51–59.
Hellerstein, J.M., Haas, P.J., and Wang, H.J. 1997. Online aggregation. In Proc. ACM-SIGMOD International
Conference on Management of Data, Tucson, May 1997.
Hellerstein, J.M. and Naughton, J.F. 1996. Query execution techniques for caching expensive methods. In Proc.
ACM-SIGMOD International Conference on Management of Data, Montreal, June 1996.
Hidber, C. 1997. Online association rule mining. In Proc. ACM-SIGMOD International Conference on Manage-
ment of Data, Tucson, May 1997.
Hoeffding, W. 1963. Probability inequalities for sums of bounded random variables. Journal of the American
Statistical Association, 58.
Hou, W.C., Ozsoyoglu, G., and Taneja, B.K. 1988. Statistical estimators for relational algebra expressions. In
Proc. 7th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Austin, March
1988.
Hou, W.C., Ozsoyoglu, G., and Taneja, B.K. 1989. Processing aggregate relational queries with hard time con-
straints. In Proc. ACM-SIGMOD International Conference on Management of Data, Portland, May–June 1989.
Hyperion Essbase OLAP Server, 1998. URL http://www.hyperion.com/essbaseolap.cfm.
Illustra Information Technologies, Inc. 1994. Illustra User’s Guide, Illustra Server Release 2.1.
Informix Corp. 1998a. Sampling: The latest breakthrough in decision support technology. Informix White Paper
000-21681-70.
Informix Corp. 1998b. C-ISAM Version 7.24 for the UNIX Operating System.
Informix Corp. 1998c. Informix Dynamic Server with Universal Data Option 9.1x.
Knuth, D.E. 1973. The Art of Computer Programming: Vol. 3, Sorting and Searching. Addison-Wesley.
Leung, T.Y.C., Pirahesh, H., Seshadri, P., and Hellerstein, J.M. 1998. Query rewrite optimization rules in IBM
DB/2 universal database. In Readings in Database Systems, 3rd ed., M. Stonebraker and J.M. Hellerstein (Eds.).
San Francisco: Morgan-Kaufmann.
Lipton, R.J., Naughton, J.F., Schneider, D.A., and Seshadri, S. 1993. Efficient sampling strategies for relational
database operations. Theoretical Computer Science, 116:195–226.
Livny, M., Ramakrishnan, R., Beyer, K.S., Chen, G., Donjerkovic, D., Lawande, S., and Myllymaki, J. 1997.
DEVise: Integrated querying and visualization of large datasets. In Proc. ACM-SIGMOD International Confer-
ence on Management of Data, Tucson, May 1997.
Lynch, C. and Stonebraker, M. 1988. Extended user-defined indexing with application to textual databases. In
Proc. 14th International Conference on Very Large Data Bases, Los Angeles, August–September 1988.
Maier, D. and Stein, J. 1986. Indexing in an object-oriented DBMS. In Proc. 1st Workshop on Object-Oriented
Database Systems, Asilomar, September 1986.
Morgenstein, J.P. 1980. Computer based management information systems embodying answer accuracy as a user
parameter. PhD Thesis, U.C. Berkeley.
O’Day, V. and Jeffries, R. 1993. Orienteering in an information landscape: How information seekers get from here
to there. In INTERCHI.
Ohno, P. 1998. Visionary. Informix Magazine.
Olken, F. 1993. Random sampling from databases. PhD Thesis, University of California, Berkeley.
Papadopoulos, G. (Chief Technology Officer, Sun Microsystems). 1997. Untitled talk. Berkeley NOW Retreat, July 1997.
Perlin, K. and Fox, D. 1993. Pad: An alternative approach to the computer interface. In Proc. ACM SIGGRAPH,
Anaheim, pp. 57–64.
Pilot Software 1998. Announces release of PDSS 6.0. URL http://www.pilotsw.com/about/pressrel/pr72998.htm.
Raman, V., Chou, A., and Hellerstein, J.M. 1999a. Scalable spreadsheets for interactive data analysis. In DMKD
Workshop.
Raman, V., Raman, B., and Hellerstein, J.M. 1999b. Online dynamic reordering for interactive data processing. In
Proc. International Conference on Very Large Data Bases, Edinburgh.
Rao, J. and Ross, K.A. 1998. Reusing invariants: A new strategy for correlated queries. In Proc. ACM-SIGMOD
International Conference on Management of Data, Seattle, June 1998.
Red Brick Systems, Inc. 1998. Red brick warehouse. URL http://www.redbrick.com/products/rbw/rbw.html.
Seshadri, P. and Swami, A. 1995. Generalized partial indexes. In Proc. 11th IEEE International Conference on
Data Engineering, Taipei, March 1995.
Shneiderman, B. 1982. The future of interactive systems and the emergence of direct manipulation. Behavior and
Information Technology, 1(3):237–256.
Shukla, A., Deshpande, P., and Naughton, J.F. 1998. Materialized view selection for multidimensional datasets.
In Proc. 24th International Conference on Very Large Data Bases, New York City.
Silberschatz, A., Read, R.L., and Fussell, D.S. 1992. A multi-resolution relational data model. In Proc. 18th
International Conference on Very Large Data Bases, Vancouver, August 1992.
SQL Server 7.0 OLAP Services. 1998. URL http://www.microsoft.com/backoffice/sql/70/whpprs/olapoverview.htm.
Stonebraker, M. 1989. The case for partial indexes. SIGMOD Record, 18(4):4–11.
Stonebraker, M. and Kemnitz, G. 1991. The POSTGRES Next-Generation database management system. Com-
munications of the ACM, 34(10):78–92.
Tan, K., Goh, C.H., and Ooi, B.C. 1999. Online feedback for nested aggregate queries with multi-threading. In
Proc. International Conference on Very Large Data Bases, Edinburgh.
Transaction Processing Council. TPC-D Rev. 1.2.3 Benchmark Specification. URL http://www.tpc.org/dspec.html.
Vrbsky, S.V. and Liu, J.W.S. 1993. APPROXIMATE—A query processor that produces monotonically improving
approximate answers. IEEE Transactions on Knowledge and Data Engineering, 5(6):1056–1068.
Waldspurger, C.A. and Weihl, W.E. 1995. Lottery scheduling: Flexible proportional-share resource management.
In First Symposium on Operating Systems Design and Implementation (OSDI).
Walter, T., Chief Technical Officer. 1998. NCR parallel systems. Complex queries. NSF Database Systems Indus-
trial/Academic Workshop, October 1998.
Wilschut, A.N. and Apers, P.M.G. 1991. Dataflow query execution in a parallel main-memory environment. In
Proc. First Intl. Conf. Parallel and Distributed Info. Sys. (PDIS), pages 68–77, Miami Beach, December 1991.
Winter, R. and Auerbach, K. 1998. The big time: 1998 winter VLDB survey. Database Programming and Design.
Zilberstein, S. and Russell, S.J. 1996. Optimal composition of real-time systems. Artificial Intelligence,
82(1/2):181–213.
DynaMat: A Dynamic View Management System for Data Warehouses

Yannis Kotidis
Department of Computer Science, University of Maryland
kotidis@cs.umd.edu

Nick Roussopoulos
Department of Computer Science, University of Maryland
nick@cs.umd.edu

Abstract

Pre-computation and materialization of views with aggregate functions is a common technique in Data Warehouses. Due to the complex structure of the warehouse and the different profiles of the users who submit queries, there is need for tools that will automate the selection and management of the materialized data. In this paper we present DynaMat, a system that dynamically materializes information at multiple levels of granularity in order to match the demand (workload) but also takes into account the maintenance restrictions for the warehouse, such as down time to update the views and space availability. DynaMat unifies the view selection and the view maintenance problems under a single framework using a novel "goodness" measure for the materialized views. DynaMat constantly monitors incoming queries and materializes the best set of views subject to the space constraints. During updates, DynaMat reconciles the current materialized view selection and refreshes the most beneficial subset of it within a given maintenance window. We compare DynaMat against a system that is given all queries in advance and the pre-computed optimal static view selection. The comparison is made based on a new metric, the Detailed Cost Savings Ratio, introduced for quantifying the benefits of view materialization against incoming queries. These experiments show that DynaMat's dynamic view selection outperforms the optimal static view selection and thus, any sub-optimal static algorithm that has appeared in the literature.

* This research was sponsored partially by NASA under grant NAG 5-2926, by NSA/Lucite under contract CG9815, by a gift from Advanced Communication Technology, Inc., and by the University of Maryland Institute for Advanced Computer Studies (UMIACS).

1 Introduction

Materialized views represent a set of redundant entities in a data warehouse that are used to accelerate On-Line Analytical Processing (OLAP). A substantial effort of the academic community in the last years [HRU96, GHRU97, Gup97, BPT97, SDN98] has been, for a given workload, to select an appropriate set of views that would provide the best performance benefits. The amount of redundancy added is controlled by the data warehouse administrator, who specifies the space that they are willing to allocate for the materialized data. Given this space restriction and, if available, some description of the workload, these algorithms return a suggested set of views that can be materialized for better performance.

This static selection of views, however, contradicts the dynamic nature of decision support analysis. Especially for ad-hoc queries, where the expert user is looking for interesting trends in the data repository, the query pattern is difficult to predict. In addition, as the data and these trends are changing over time, a static selection of views might very quickly become outdated. This means that the administrator should monitor the query pattern and periodically "re-calibrate" the materialized views by re-running these algorithms. This task, for a large warehouse where many users with different profiles submit their queries, is rather complicated and time consuming. Microsoft's [Aut] is a step towards automated management of system resources and shows that vendors have realized the need to simplify the life of the data warehouse administrator.

Another inherent drawback of the static view selection is that the system has no way of tuning a wrong selection, i.e., use results of queries that couldn't be answered by the materialized set. Notice that although OLAP queries take an enormous amount of disk I/O and CPU processing time to be completed, their output is, in many cases, relatively small. "Find the total volume of sales for the last 10 years" is a fine example of that. Processing this query might take hours of scanning vast tables and aggregating, while the result is just an 8-byte float value that can be easily "cached" for future use. Moreover, during roll-up operations, when we access data at a progressively coarser granularity, future queries are likely to be totally computable out of the results of previous operations, without accessing the base tables at all. Thus, we expect a great amount of inter-dependency among a set of OLAP queries.

Furthermore, selecting a view set to materialize is just the tip of the iceberg. Clearly, query performance is
tremendously improved as more views are materialized. With the ratio $$/disk-volume constantly dropping, disk storage constraint is no longer the limiting factor in the view selection but the window to refresh the materialized set during updates. More materialization implies a larger maintenance window. This update window is the major data warehouse parameter, constraining over-materialization. Some view selection algorithms [Gup97, BPT97] take into account the maintenance cost of the views and try to minimize both query-response time and the maintenance overhead under a given space restriction. In [TS97] the authors define the Data Warehouse configuration problem as a state-space optimization problem where the maintenance cost of the views needs to be minimized, while all the queries can be answered by the selected views. The trade-off between space of pre-computed results and maintenance time is also discussed in [DDJ+98]. However, none of these publications considers the dynamic nature of the view selection problem, nor do they propose a solution that can adapt on the fly to changes in the workload.

Our philosophy starts with the premise that a result is a terrible thing to waste and that its generation cost should be amortized over multiple uses of the result. This philosophy goes back to our earlier work on caching of query results on the client's database ADMS± architecture [RK86, DR92], the work on prolonging their useful life through incremental updates [Rou91] and their re-use in the ADMS optimizer [CR94]. This philosophy is a major departure from the static paradigm of pre-selecting a set of views to be materialized and run all queries against this static set.

In this paper we present DynaMat, a system that dynamically materializes information at multiple levels of granularity in order to match the demand (workload) but also takes into account the maintenance restrictions for the warehouse, such as down time to update the views and space availability. DynaMat unifies the view selection and the view maintenance problems under a single framework using a novel "goodness" measure for the materialized views. DynaMat constantly monitors incoming queries and materializes the best set of views subject to the space constraints. During updates, DynaMat reconciles the current materialized view selection and refreshes the most beneficial subset of it within a given maintenance window. The critical performance issue is how fast we can incorporate the updates to the warehouse. Clearly if naive re-computation is assumed for refreshing materialized views, then the number of views will be minimum and this will lessen the value of DynaMat. On the other hand, efficient computation of these views using techniques like [AAD+96, HRU96, ZDN97, GMS93, GL95, JMS95, MQM97] and/or bulk incremental updates [RKR97] tremendously enhances the overall performance of the system. In DynaMat any of these techniques can be applied. In section 2.4.2 we propose a novel algorithm that, based on the goodness measure, computes an update plan for the data stored in the system.

The main benefit of DynaMat is that it represents a complete self-tunable solution that relieves the warehouse administrator from having to monitor and calibrate the system constantly. In our experiments, we compare DynaMat against a system that is given all queries in advance and the pre-computed optimal static view selection. These experiments show that the dynamic view selection outperforms the optimal static view selection and thus, any sub-optimal static algorithm proposed in the literature [HRU96, GHRU97, Gup97, BPT97].

The rest of the paper is organized as follows: Section 2 gives an overview of the system's architecture. Subsections 2.2 and 2.3 discuss how stored results are being reused for answering a new query, whereas in section 2.4 we address the maintenance problem for the stored data. Section 3 contains the experiments and in section 4 we draw the conclusions.

2 System overview

DynaMat is designed to operate as a complete view management system, tightly coupled with the rest of the data warehouse architecture. This means that DynaMat can co-exist and co-operate with caching architectures that operate at the client site like [DFJ+96, KB96]. Figure 1 depicts the architecture of the system. View Pool V is the information repository that is used for storing materialized results. We distinguish two operational phases of the system. The first one is the "on-line" phase, during which DynaMat answers queries posed to the warehouse using the Fragment Locator to determine whether or not already materialized results can be efficiently used to answer the query. This decision is based upon a cost model that compares the cost of answering a query through the repository with the cost of running the same query against the warehouse. A Directory Index is maintained in order to support sub-linear search in V for finding candidate materialized results. This structure will be described in detail in the following sections. If the search fails to reveal an efficient way to use data stored in V for answering the query then the system follows the conventional approach where the warehouse infrastructure (fact table + indices) is queried. Either way, after the result is computed and given to the user, it is tested by the Admission Control Entity which decides whether or not it is beneficial to store it in the Pool.

During the on-line phase, the goal of the system is to answer as many queries as possible from the pool, because most of them will be answered a lot faster from V than from the conventional methods. At the same time DynaMat will quickly adapt to new query patterns and efficiently utilize the system resources.
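As a rough sketch of this on-line decision (the interfaces used below, such as directory_index.lookup, cost and is_beneficial, are assumed for illustration and are not DynaMat's actual code):

    def answer_online(query, pool, warehouse, admission_control):
        # On-line phase: answer from the View Pool when some stored fragment
        # is cheaper than going to the warehouse, then let the admission
        # control policy decide whether the new result is kept.
        candidates = pool.directory_index.lookup(query)        # sub-linear search
        best = min(candidates, key=lambda f: f.cost(query), default=None)
        if best is not None and best.cost(query) < warehouse.cost(query):
            result = best.compute(query)                       # answer from the pool
        else:
            result = warehouse.execute(query)                  # conventional path
        if admission_control.is_beneficial(result, pool):
            pool.store(result)                                 # materialize for reuse
        return result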
The second phase of DynaMat is the update phase, during which updates received from the data sources get stored in the warehouse and materialized results in the Pool get refreshed. In this paper we assume, but we are not restricted to, that the update phase is "off-line" and queries are not
permitted during this phase. The maximum length of the update window W is specified by the administrator and would probably lead us to evict some of the data stored in the pool as not update-able within this time constraint.

[Figure 1: DynaMat's architecture (Query Interface, Fragment Locator, Directory Index, Admission Control Entity, View Pool, Data Warehouse).]

[Figure 2: The time bound case (pool size vs. time, updates u1 and u2, update window W, MAX_POOL_SIZE).]

[Figure 3: The space bound case (pool size vs. time, updates u1 and u2, MAX_POOL_SIZE, replacement).]

2.1 View Pool organization


The View Pool utilizes a dedicated disk storage for managing
materialized data. An important design parameter is the type
of secondary storage organization that will be used. DynaMat The space bound case is when the size of the pool is the
can support any underling storage structure, as long as we can constraining factor and not W . In this case, when the pool
provide a cost model for querying and updating the views. becomes full, we have to use some replacement policy. This
Traditionally summary data are stored as relational tables can vary from simply not admitting more materialized results
in most ROLAP implementations, e.g [BDD+ 98]. However, to the pool, to known techniques like LRU, FIFO etc, or to
tables alone are not enough to guarantee reasonable query using heuristics for deciding whether or not a new result is
performance. Scanning a large summary table to locate more beneficial for the system than an older one. Figure 3
an interesting subset of tuples can be wasteful and in shows the variations in the pool size in this case. Since we
some cases slower than running the query against the assumed a sufficiently large update window W , the stored
warehouse itself, if there are no additional indices to support results are always update-able and the actual content of the
random access to the data. Moreover, relational tables pool is now controlled by the replacement policy.
and traditional indexing schemes, are in most cases space Depending on the workload, the disk space and the update
wasteful and inadequate for efficiently supporting bulk window, the system will in some cases act as in time bound
incremental update operations. More eligible candidate and in others as in space bound, or both. In such cases views
structures include multidimensional arrays like chunked files are evicted from the pool, either because there is no more
[SS94, DRSN98] and also Cubetrees [RKR97]. Cubetrees space or they can not be updated within the update window.
are multidimensional data structures that provide both
storage and indexing in a single organization. In [KR98] we 2.2 Using MRFs as the basic logical unit of the pool
have shown that Cubetrees, when used for storing summary A multidimensional data warehouse (MDW) is a data
data, provide extremely fast update rates, better overall query repository in which data is organized along a set of
performance and better disk space utilization compared to dimensions D = fd1 ; d2 ; : : : ; dng. A possible way to
relational tables and conventional indexes. design a MDW is the star-schema [Kim96] which, for each
During the “on-line” phase of the warehouse, results from dimension it stores a dimension table Di that has di as its
incoming queries are being added in the Pool. If the pool had primary key and also uses a fact table F that correlates
unlimited disk space, the size of the materialized data would the information stored in these tables through the keys
grow monotonically overtime. During an update phase ui , d1 ; : : : ; dn . The Data Cube operator [GBLP96] performs

some of the materialized results may not be update-able the computation of one or more aggregate functions for
within the time constraint of W and thus, will be evicted all possible combinations of grouping attributes (which are
from the pool. This is the update time bound case shown actually attributes selected from the dimension tables Di ).
in Figure 2 with the size of the pool increasing between the The lattice [HRU96] representation of the Data Cube in
two update phases u1 and u2 . The two local minimums Figure 4 shows an example for three dimensions, namely
correspond to the amount of materialized data that can be a, b and c. Each node in the lattice represents a view that

updated within W and the local maximums to the pool size aggregates data over the attributes present in that node. For
at the time of the updates. example (ab) in an aggregate view over the a and b grouping
attributes.¹

[Figure 4: The Data Cube lattice for dimensions a, b and c (nodes abc; ab, ac, bc; a, b, c; none).]

[Figure 5: Querying stored MRFs (the query q=(50,(1,200)) over the product and store dimensions, and fragments of the form (50,s_id) and ((1,1000),s_id)).]

¹ For simplicity in the notation, in this paper we do not consider the case where the grouping is done over attributes other than the dimension keys di. However our framework is still applicable in the presence of more grouping attributes and hierarchies, using the extensions of [HRU96] for the lattice.

The lattice is frequently used by view selection algorithms [HRU96, GHRU97, SDN98] because it captures the computational dependencies among the elements of the Data Cube. Such dependencies are shown in Figure 4 as directed edges that connect two views, if the pointed view computes the other one. In Figure 4 we show only dependencies between adjacent views and not those in the transitive closure of this lattice. For example, view (a) can be computed from view (ab), while view (abc) can be used to derive any other view. In this context, we assume that the warehouse workload is a collection of Multidimensional Range queries (MR-queries), each of which can be visualized as a hyper-plane in the Data Cube space using an n-dimensional "vector" ~q:

    ~q = {R1, R2, ..., Rn}    (1)

where Ri is a range in the dimension's di domain. We restrict each range to be one of the following:

- a full range: Ri = (min_di, max_di), where min_di and max_di are the minimum and maximum values for key di.
- a single value for di.
- an empty range, which denotes a dimension that is not present in the query.

For instance, suppose that D = {product, store} is the set of dimensions in the MDW, with values 1 ≤ product ≤ 1000 and 1 ≤ store ≤ 200 respectively. The hyper-plane ~q = {50, (1, 200)} corresponds to the SQL query:

    select product, store, <aggregate list>
    from F
    where product=50
    group by product, store

where <aggregate list> is a list of aggregate functions (e.g. sum, count). If the grouping was done on attributes different than the dimension keys then the actual SQL description would include joins between some dimension tables and the fact table. This type of query is called a slice query [GHRU97, BPT97, KR98]. We prefer the MR notation over the SQL description because it describes the workload in the Data Cube space independent of the actual schema of the MDW.

The same notation permits us to represent the materialized results of MR queries, which we call Multidimensional Range Fragments (MRFs). DynaMat maps each SQL query to one, or more, MR queries. Given such an MR-query and a cost model for accessing the stored MRFs, we want to find the "best" subset of them in V to answer q. Based on the definition of MRFs, we argue that it doesn't pay to check for combinations of materialized results for answering q. With extremely high probability, q is best computable out of a single fragment f or not computable at all. We will try to demonstrate this with the following example: Suppose that the previous query ~q = {50, (1, 200)} is given. If no single MRF in the pool computes q, then a stored MRF that partially computes q is of the form {50, s_id} or {(1, 1000), s_id}, where s_id is some store value, see Figure 5. In order to answer q there should be at least one such fragment for all values of s_id between 1 and 200. Even if such a combination exists, it is highly unlikely that querying 200 different fragments to get the complete result provides a cost-effective way to answer the query.
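To make the notation concrete, the following hypothetical sketch represents MR-queries and MRFs as tuples of ranges and tests whether a single fragment can answer a query, following the covering rule stated in Section 2.3 (exact match on every non-empty query range; an empty query range may be covered by an empty or a full fragment range):

    # Each MR-query / MRF is a tuple of ranges, one per dimension:
    # None = dimension absent, FULL = the whole domain, ('val', v) = one value.
    FULL = ('full',)

    def covers(fragment, query):
        # A fragment answers a query if it stores exactly the same range for
        # every non-empty query range; where the query range is empty, the
        # fragment's range must be empty or span the whole dimension, in
        # which case an extra aggregation collapses that dimension.
        for f_range, q_range in zip(fragment, query):
            if q_range is None:
                if f_range is not None and f_range != FULL:
                    return False
            elif f_range != q_range:
                return False
        return True

    # The example above: ~q = {50, (1, 200)} over (product, store);
    # (1, 200) is the full store domain, so it is written as FULL here.
    q = (('val', 50), FULL)
    exact_match = (('val', 50), FULL)
    partial = (('val', 50), ('val', 7))        # a {50, s_id} fragment
    print(covers(exact_match, q), covers(partial, q))   # True False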
if we compare them with a system that materializes views
~q = fR1 ;R2; ::: ;Rn g (1) with arbitrary ranges for the attributes. However, if we allow
where Ri is a range in the dimension’s di domain. We restrict fragments with arbitrary ranges to be stored in the pool, then
each range to be one of the followings: the probability that a single stored fragment can solely be
used to answer a new query is rather low, especially if most
 a full range: Ri = (mindi ; maxdi ), where mindi and of the materialized results are small, i.e they correspond to
maxdi are the minimum and maximum values for key small areas in the n-dimensional space. This means that
di . we will need to use combinations of stored fragments and
 a single value for di perform costly duplicate eliminations to compute an answer
for a given query. In the general case that k fragments
 an empty range which denotes a dimension that is not compute some portion of the query there might be up to 2k
present in the query. combinations that need to be checked for finding the most
1 For simplicity in the notation, in this paper we do not consider the case
efficient way to answer the query. Having too many small
where the grouping is done over attributes other than the dimension keys di . fragments with possible overlapping sections which require
However our framework is still applicable in the presence of more grouping additional filtering in the pool, results in poor performance
attributes and hierarchies, using the extensions of [HRU96] for the lattice. not only during query execution but also during updates. In
642 Chapter 7: Data Warehousing

most cases, updating fewer, larger fragments of views (as


in a MRF-pool) is preferable. We denote the number of
fragments in the pool as jVj. In section 2.4.2 we show that

customer
the overhead of computing an update plan for the stored data
grows linearly with jVj2 , making the MRF approach more query

scalable. Smith

2.3 Answering queries using the Directory Index


As we described, when a MR-query q is posted to the data
warehouse, we scan V for candidate fragments that answer
q. Given a MRF f and a query q, f answers q iff for 1
product
1000

every non-empty range Ri of the query, the fragment stores


exactly the same range and for every empty range Ri = () Figure 6: Directory for view (product; customer)
the fragment’s corresponding range is either empty or spans
the whole domain of dimension i2 . We say in this case that
hyper-plane f~ covers ~
and returned to the user. If no exact match exists, assuming
q. we are given a cost model for querying the fragments, we
Instead of testing all stored fragments against the query, select the best candidate from the pool, to compute q. If view
DynaMat uses a directory, the Directory Index (see Figure 1),
f is the materialized result of q, the fragment that was used
to compute f is called the father of f and is denoted as f´. If
to further prune the search space. This is actually a set of
however no fragment in V can answer q, the query is handled
indices connected through the lattice shown in Figure 4. Each
node has a dedicated index that is used to keep track of all by the warehouse. In both cases the result is passed to the
fragments of the corresponding view that are stored in the Admission Control Entity that checks if it can be stored in
pool. For each fragment f there is exactly one entry that
the pool.
contains the following info:
As the number of MRFs stored in the pool is typically in
 Hyper-plane f~ of the fragment
the order of thousands, we can safely assume that in most
cases the Directory Index will be memory resident. Our
 Statistics (e.g number of accesses, time of creation, last experiments validate this assumption and indicate that the
access) look-up cost in this case is negligible. In cases where the
index can not fit in memory, we can take advantage of the
 The father of f (explained below). fact that the pool is reorganized with every update phase and
use a packing algorithm [RL85] to keep the R-trees compact
For our implementation we used R-trees based on the f~ and optimized at all times.
hyper-planes to implement these indices. When a query q ar-
rives, we scan using ~q all views in the lattice, that might con- 2.4 Pool maintenance
tain materialized results f whose hyper-planes f~ cover ~ q. For For maintaining the MRF-pool, we need to derive a goodness
example if ~ q = f(1; 1000); (); Smithg is the query hyper- measure for choosing which of the stored fragments we
plane for dimensions product, store and customer, then prefer. This measure is used in both the on-line and the
we first scan the R-tree index for view (product;customer) update phases. Each time DynaMat reaches the space or time
using rectangle f(1; 1000); (Smith; Smith)g. Figure 6 bounds we use the goodness for replacing MRFs. There can
depicts a snapshot of the corresponding R-tree for view be many criteria to define such a goodness. Among those we
(product;customer ) and the search rectangle. The shaded tested, the following four showed the best results:
areas denote MRFs of that view that are materialized in
the pool. Since no fragment is found, based on the  The time that the fragment was last accessed by the
dependencies defined in the lattice, we also check view system to handle a query:
(product;store;customer ) for candidate fragments. For goodness(f ) = tlast access (f )
this view, we “expand” the undefined in q store dimen-
sion and search the corresponding R-tree using rectangle This information is kept in the Directory Index. Using
f(1; 1000); (minstore; maxstore ); (Smith; Smith)g. If a this time-stamp as a goodness measure, results in an Least
fragment is found, we “collapse” the store column and ag- Recently Used (LRU) type of replacement in both cases.
gregate the measure(s) to compute the answer for q.  The frequency of access freq(f ) for the fragment:
Based on the content of the pool V , there are three
possibilities. The first is that a stored fragment f matches goodness(f ) = freq(f )
exactly the definition of the query. In this case, f is retrieved
The frequency is computed using the statistics kept in the
2 In
the latter case we have to perform an additional aggregation to Directory Index and results in a Least Frequently used
compute the result, as will be explained. (LFU) replacement policy.

 The size size(f ) of the result, measured in disk pages: [GMS93, GL95, JMS95, MQM97, RKR97] that handle
grouping and aggregation queries.
goodness (f ) = size( f ) For our framework, we assume that the sources provide
the differentials of the base data, or at least the log files are
The intuition behind this approach is that larger fragments available. If this is the case, then an incremental update
are more likely to be hit by a query. An additional policy can be used to refresh the pool. In this scenario we
benefit of keeping larger results in the pool is that jVj gets also assume that all interesting aggregate functions that are
smaller, resulting in faster look-ups using the Fragment computed are self-maintainable [MQM97] with respect to
Locator and less complexity while updating the pool. We the updates that we have. This means that a new value for
refer to this case as the Smaller-Fragment-First (SFF) each function can be computed solely from the old value and
replacement policy. from the changes to the base data.
 The expected penalty rate of recomputing the fragment, Computing an initial update plan
if it is evicted, normalized by its actual size: Given a pool with jVj being in the order of thousands, our
goal is to derive an update plan that allows us to refresh as
f req (f )  c(f )
goodness (f ) = many fragments as possible within a given update window
size( f )
W . Computing the deltas for each materialized result is

unrealistic, especially if the deltas are not indexed somehow.


c( f ) is the cost of re-computing f for a future query. We
In our initial experiments we found out that the time spent
used as an estimate of c(f ) the cost of re-computing the
on querying the sources to get the correct deltas for each
fragment from its father, which is computable in constant
fragment is the dominant factor. For that reason our pool
time. This metric is similar to the one used in [SSV96]
maintenance algorithm extracts, in a preprocessing step, all
for their cache replacement and admission policy. We
the necessary deltas and stores them in a separate view
refer to this case as the Smaller Penalty First (SPF).
dV materialized as a Cubetree. This provides a efficient

indexing structure for the deltas against multidimensional


In the remaining of this section we describe how the
range queries. The overhead of loading a Cubetree with the
goodness measure is used to control the content of the pool.
deltas is practically negligible3 compared to the benefit of
2.4.1 Pool maintenance during queries having the deltas fully indexed. Assume that lowdi and hidi
are the minimum and maximum values for dimension di that
As long as there is enough space in the pool, results from are stored in all fragments in the pool. These statistics are
incoming queries are always stored in V . In cases where we easy to maintain in the Directory Index. View dV includes
hit the space constraint, we have to enforce a replacement all deltas within the hyper-plane:
policy. This decision is made by our replace algorithm
using the goodness measure of the fragments. The algorithm ~
dV = f(lowd1 ; hid1 ); : : : ; (lowdn ; hidn )g
takes as input the current state of the pool V , the new
For each fragment f in V we consider two alternative ways
computed result f and the space restriction S . A stored
of doing the updates:
fragment is considered for eviction only if its goodness is less
than that of the new result. At a first step a set Fevicted of such  We can query dV to get the updates that are necessary for
fragments with the smaller goodness values is constructed. refreshing f and then update the fragment incrementally.
If during this process we can not find candidate victims the We denote the cost of this operation as U CI (f ). It
search is aborted and the new result is denied storage in consists of the cost of running the MR-query f~ against dV
the pool. When a fragment fvictim is evicted the algorithm to get the deltas and the cost of updating f incrementally
updates the f ather pointer for all other fragments that point from the result.
to fvictim . In section 2.4.2 we discuss the maintenance of
the f ather pointers.
 If the fragment was originally computed out of another
result f´ we estimate the cost of recomputing f from its
2.4.2 Pool maintenance during updates father f´, after f´ has been updated. The cost of computing
f from its father is denoted as U CR (f ) and includes the
When the base relations (sources) are updated, the data
cost of running MR-query f~ against the fragment f´, plus
stored in the MDW, and therefore the fragments in the
the cost of materializing the result.
pool, have to be updated too. Different update policies
can be implemented, depending on the types of updates, the The system computes the costs for the two4 alternatives
properties of the data sources and the aggregate functions and picks the minimum one, denoted as U C (f ) for each
that are being computed by the views. Several methods have
been proposed [AAD+ 96, HRU96, ZDN97] for fast (re)-
3 Cubetree’s loading rate is about 12GB/hour in a Ultra 60 with a single

SCSI drive.
computation of Data Cube aggregates. On the other hand, 4 A third alternative, is to recompute each fragment from the sources.

incremental maintenance algorithms have been presented This case is not considered here, because the incremental approach is

fragment. Obviously, this plan is not always the best one. cost for k evictions is O(kjVj). In the extreme case where
There is always the possibility that another result f1 has W is too small that only a few fragments can be updated this
been added in the pool after f was materialized. Since leads to an O(jVj2 ) total cost for computing a feasible update
the selection of the father of f was done before f1 was plan. However, in many cases just a small fraction of the
around, as explained in section 2.3, the above plan does stored results will be discarded resulting in close to O(jVj)
not consider recomputing f from f1 . An eager maintenance complexity.
policy of the father pointers would be to refine them whenever
necessary, e.g set father(f ) = f1 , if it is more cost effective
3 Experiments
to compute f from f1 than from its current father f´. We have
decided to be sloppy and not refine the father pointers based The comparison and analysis of the different aspects of the
on experiments that showed negligible differences between system made in this section is based on a prototype that we
the lazy and the eager policy. The noticeable benefit is have developed for DynaMat. This prototype implements
that the lazy approach reduces the worst case complexity of the algorithms and different policies that we present in this
the replace and the makeFeasible algorithm that is paper as well as the Fragment Locator and the Directory
discussed in the next section from O(jVj3 ) down-to O(jVj2 ), Index, but not the pool architecture. For the latter we used
thus making the system able to scale for large number of the estimator of the Cubetree Datablade [ACT97] developed
fragments. By the end of this phase, the system has computed for the Informix Universal Server for computing the cost of
the initial update plan, which directs the most cost-effective querying and updating the fragments.
way to update each one of the fragments using one of the two We have created a random MR-query generator that
alternatives, i.e incrementally from dV or by re-computation is tuned to provide different statistical properties for the
from another fragment. generated query sets. A important issue for establishing
a reasonable set of experiments was to derive the measures
Computing a feasible update plan for a given window
P
The total update cost of the pool is UC (V ) = f 2V UC (f ).
to base the comparisons upon. The Cost Saving Ratio (CSR)
was defined in [SSV96] as a measure of the percentage of
If this cost is greater than the given update window W we the total cost of the queries saved due to hits in their cache
have to select a portion of V that will not be materialized system. This measure is defined as:
in the new updated version of the pool. Suppose that we P cihi
choose to evict some fragment f . If f is the father of another CSR = Pi c r
fragment fchild that is to be recomputed from f , then the i i i
real reduction in the update cost of the pool is less than where ci is the cost of execution of query qi without using
UC (f ), since the update cost of fchild increases. For the lazy their cache, hi is the number of times that the query was
satisfied in the cache and ri is the total number of references
approach for maintaining the father pointer we forward the
father pointer for fchild : set father(fchild ) = father(f ). to that query. This metric is also used in [DRSN98] for
We now have to check if recomputing fchild from father(f )
is still a better choice than incrementally updating fchild from
their experiments. Because query costs vary widely, CSR is Ph
dV . If UC new (fchild ) is the new update cost for fchild then more appropriate metric than the common hit ratio: Pr. i
i

the potential update delta, i.e the reduction in UC (V ), if we


i
i
However, a drawback in the above definition for our case,
evict fragment f is: is that it doesn’t capture the different ways that a query qi
X might “hit” the Pool. In the best scenario, qi exactly matches
Udelta (f ) = UC (f ), (UC new (fchild ),UC old (fchild )) a fragment in V . In this case the savings is defined as ci ,
fchild 2V :father (fchild )=f where ci is the cost of answering the query at the MDW.
However, in cases where another result is used for answering
If the initial plan is not feasible, we discard at a first step qi the actual savings depend on how “close” this materialized
all fragments whose update cost UC (f ) is greater than the result is to the answer that we want to produce. If cf is cost
window W . If we still hit the time constraint, we evict of querying the best such fragment f for answering qi , the
more fragments from the pool. In this process, there is no savings in this case is ci , cf .5 To capture all cases we define
point in evicting fragments whose Udelta value is less or the savings provided by the pool V for a query instance qi as:
equal to zero. Having such fragments in the pool reduces
the total update cost because all their children are efficiently
8< 0 if qi can not be answered by V
updated from them. For the remaining fragments we use the si = : ci if there is an exact match for qi in V
goodness measure to select candidates for eviction until the c i , cf if f from V was used to answer qi
remaining set is update-able within the given window W . If
the goodness function is computable in constant time, the using the above formula we define the Detailed Cost Saving
5 c and c do not include the cost to fetch the result which is payable
expected to be faster. However, for sources that do not provide their i f
differentials during updates, we can consider using this option. even if an exact match is found.

[Figure 7: The time bound case, first 15x1500 queries (DCSR (%) vs. updates for the SPF, LFU, LRU and SSF policies).]

[Figure 8: The time bound case, remaining 35x1500 queries (DCSR (%) vs. updates for the SPF, LFU, LRU and SSF policies).]

[Figure 9: The space bound case (DCSR (%) vs. updates for the SPF, LFU, LRU and SSF policies).]

Ratio as:

    DCSR = (Σi si) / (Σi ci)

DCSR provides a more accurate measure than CSR for OLAP queries. CSR uses a "binary" definition of a hit: a query hits the pool or not. For instance if a query is computed at the MDW with cost ci = 10,000 and from some fragment f with cost cf = 9,500, CSR will return a savings of 10,000 for the "hit", while DCSR will credit the system with only 500 units based on the previous formula. DCSR captures the different levels of effectiveness of the materialized data against the incoming queries and describes better the performance of
The rest of this section is organized as follows: Subsection 3.1 makes a direct comparison of the different ways to define the goodness as described in Section 2.4. Subsection 3.2 compares the performance of DynaMat against a system that uses the optimal static view selection policy. All experiments were run using an Ultra SPARC 60 with 128MB of main memory.

3.1 Comparison of different goodness policies

In this set of experiments we compare the DCSR under the four different goodness policies LRU, LFU, SFF and SPF. We used a synthetically generated dataset that models supermarket transactions, organized by the star schema. The MDW had 10 dimensions and a fact table containing 20 million tuples. We assumed 50 update phases during the measured life of the system. During each update phase we generated 250,000 new tuples for the fact table that had to be propagated to the stored fragments. The size of the full Data Cube for this base data after all updates were applied was estimated to be about 708GB. We generated 50 query sets with 1,500 MR-queries each, that were run between the updates. These queries were selected uniformly from all 2^10 = 1,024 different views in the Data Cube lattice. In order to simulate hot spots in the query pattern, the values asked by the queries for each dimension follow the 80-20 law: 80% of the times a query was accessing data from 20% of the dimension's domain. We also ran experiments for uniform and Gaussian distributions for the query values, but they are not presented here as the results were similar to those for the 80-20 distribution.

For the first experiment we tested the time-bound case. The size of the pool was chosen large enough to guarantee no replacement during queries and the time allowed for updating the fragments was set to 2% of W_DataCube, where W_DataCube is the estimated time to update the full Data Cube. For a clearer view we plot in Figure 7 the DCSR over time for the first 15 sets of queries, starting with an empty pool. In the graph we plot the cumulative value of DCSR at the beginning of each update phase, for all queries that happened up to that phase. The DCSR value reaches 41.4% at the end of the first query period of 1,500 queries that were executed against the initially empty pool. This shows that simply by storing and reusing computed results from previous queries, we cut down the cost of accessing the MDW to 58.6%. Figure 8 shows how DCSR changes for the remaining queries. All four policies quickly increase their savings, by refining the content of the pool while doing updates, up to a point where all curves flatten out. At all times, the SPF policy is the winner with 60.71% savings for the whole run. The average I/O per query was 94.84, 100.08, 106.18 and 109.09 MB/query for the SPF, LFU, LRU and SFF policies respectively. The average write-back I/O cost due to the on-the-fly materialization was about the same in all cases (approximately 19.8MB/query). For the winner SPF policy the average time spent on searching the Directory Index was negligible (about 0.4msecs/query). Computing a feasible update plan took on the average 37msecs, and 51msecs in the worst case. The number of MRFs stored in the pool by the end of the last update phase was 206.
Figure 9: The space bound case

Figure 10: The space & time bound case

Figure 11: DCSR per view for uniform queries on the views

Figure 9 depicts DCSR over time in the space-bound case for the last 35 sets of queries, calculated at the beginning of each update phase. In this experiment there was no time restriction for doing the updates, and the space that was allocated for the pool was set to 14GB, i.e., 2% of the full Data Cube size. In this case, the content of the pool is managed by the replace algorithm, as the limited size of the pool results in frequent evictions during the on-line mode. Again the SPF policy showed the best performance with a DCSR of 59.58%. For this policy, the average time spent on the replace algorithm, including any modifications on the Directory Index, was less than 3msecs per query. Computing the initial update plan for the updates, as explained in Section 2.4.2, took 10msecs on the average. Since there was no time restriction and thus the plan was always feasible, there was no additional overhead for refining this plan. The final number of fragments in the pool was 692.

In a final experiment we tested the four policies for the general case, where the system is both space and time bound. We varied the time window for the updates from 0.2% up to 5% of W_DataCube and the size of the pool from 0.2% up to 5% of the full Data Cube size, both in 0.2% intervals. Figure 10 shows the DCSR for each pair of time and space settings for the SPF policy, which outperformed the other three. We can see that even with limited resources DynaMat provides substantial savings. For example, with just 1.2% of disk space and a 0.8% time window for the updates, we get over 50% savings compared to accessing the MDW.

3.2 Comparison with the optimal static view selection

In the experiments in the previous section we saw that the SPF policy provides the best goodness definition for a dynamic view (fragment) selection during updates (the time bound case), during queries (the space bound case), or both. An important question however is how the system compares with a static view selection algorithm [HRU96, GHRU97, Gup97, BPT97] that considers only fully materialized views. Instead of comparing each one of these algorithms with our approach, we implemented SOLVE, a module that, given a set of queries and the space and time restrictions, searches exhaustively all feasible view selections and returns the optimal one for these queries. For a Data Cube lattice with n dimensions and no hierarchies there are 2^n different views. A static view selection, depending on the space and time bounds, contains some combination of these views. For n = 6, the search space contains 2^(2^6) = 18,446,744,073,709,551,616 possible combinations of the 64 views of the lattice. Obviously some pruning can be applied. For example, if a set of views is found feasible there is no need to check any of its subsets. Additional pruning of large views is possible depending on the space and time restrictions that are specified; however, for non-trivial cases this exhaustive search is not feasible even for small values of n.

We used SOLVE to compute the optimal static view selection for a six-dimensional subset of our supermarket dataset, with 20 million tuples in the fact table. There were 40 update phases, with 100 thousand new tuples being added in the fact table each time. The time window for the updates was set to the estimated 2% of that of the full Data Cube (W_DataCube). We created 40 sets of 500 MR-queries each, that were executed between the updates. These queries targeted uniformly the 64 different views in the 6-dimensional Data Cube lattice. This lack of locality of the queries represents the worst-case scenario for the dynamic case that needs to adapt on-the-fly to the incoming query pattern. For the static view selection this was not an issue, because SOLVE was given all queries in advance. The optimal set returned, after 3 days of computations in an Ultra SPARC 60, includes 23 out of the 64 full-views in the 6-dimensional Data Cube. The combined size of these views when stored as Cubetrees in the disk is 281MB (1.6% of the full Data Cube). For the most strict and unfavorable comparison for the dynamic case, we set the size of the pool to the same number. Since the dynamic system started with an empty pool, we used the first 10% of the queries as a training set and measured the system's performance for the remaining 90%. We used the SPF policy to measure the goodness of the MRFs for the dynamic approach.

The measured cumulative DCSR for the two systems was about the same: 64.04% for the dynamic and 62.06% for the optimal static.
Figure 12: Dynamic vs Optimal-Static selection varying the average number of grouping attributes per query

Figure 13: DCSR per view for space = 10%

The average I/O per query for the dynamic system was 108.11MB and the average write-back I/O cost 2.18MB. For the optimal static selection the average I/O per query is 112.94MB with no write-back, without counting the overhead of materializing the statically selected views for the first time.

For a clearer view of the performance differences between the static and the dynamic approach, we computed the DCSR per view and plotted them in decreasing order of savings in Figure 11. Notice that the x-axis labeling does not correspond to the same views for the two lines. The plot shows that the static view selection performs well for the 23 materialized views; however, for the remaining 41 views its savings drop to zero. DynaMat on the other hand provides substantial savings for almost all the views. On the right hand side of the graph are the larger views of the Data Cube. Since most results from queries on these views are too big to fit in the pool, even DynaMat's performance decreases because they can not be materialized in the shared disk space.

Figure 12 depicts the performance of both systems for a non-uniform set of queries where the access to the views is skewed. The skewness is controlled by the number of grouping attributes in each query (having three grouping attributes per query, on the average, corresponds to the previous uniform view selection). As this number increases, it favors accesses on views from the upper levels of the Data Cube lattice, which are bigger in size and need a larger update window. These views, because of the space and time constraints, are not in the static optimal selection. On the other hand, the dynamic approach materializes results whenever possible and for this reason it is more robust than the static selection, as the workload shifts to the larger views of the lattice. As the average number of grouping attributes per query reaches 6, almost all queries in the workload access the single top-level six-dimensional view of the lattice. DynaMat adapts nicely to such a workload and allocates most of the pool space to MRFs of that view. That explains the performance of DynaMat going up at the right hand side of the graph.

The pool size in the above experiments was set to 1.6% of the full Data Cube as this was the actual size of the views used by the optimal static selection. This number however is rather small for today's standards. We ran two more experiments with pool size 5% (878MB) and 10% (1.7GB) of the full Data Cube size. The optimal static selection does not refine the selected views because of the update window constraint (2%). DynaMat, on the other hand, capitalizes on the extra disk space and increases the DCSR from 64.04% to 68.34% and 78.22% for the 5% and 10% storage. Figure 13 depicts the computed DCSR per view for this case. As more disk space is available, DynaMat achieves even more savings by materializing more fragments from the larger views of the Data Cube.

Figure 14: Dynamic vs Optimal-Static selection for drill-down/roll-up queries

In the previous experiment the queries that we ran were selected uniformly from all 64 views in the Data Cube lattice. This is the worst case scenario for DynaMat, which gains a lot more from locality of follow-up queries. Often in OLAP, users do drill-downs or roll-ups, where starting from a computed result, they refine their queries and ask for a more or less detailed view of the data respectively. DynaMat can enormously benefit from the roll-up queries because these queries are always computable from results that were previously added in the pool. To simulate such a workload we tuned our query-generator to provide 40 sets of 500 queries each with the following properties: 40% of the times a user asks a query for a randomly selected view from the Cube, 30% of the times the user performs a roll-up
operation on the last reported result and 30% of the times the user performs a drill-down.

For this experiment, we used the previous set up with the 2% time and 10% space bounds and we re-computed the optimal static selection for the new queries. Figure 14 depicts DCSR for this workload. Compared to the previous example, DynaMat further increases its savings (83.84%) by taking advantage of the locality of the roll-up queries.

4 Conclusions

In this paper we presented DynaMat, a view management system that dynamically materializes results from incoming queries and exploits them for future reuse. DynaMat unifies view selection and view maintenance under a single framework that takes into account both the time and space constraints of the system. We have defined and used the Multidimensional Range Fragments (MRFs) as the basic logical unit of materialization. Our experiments show that compared to the conventional static paradigm that considers only full views for materialization, MRFs provide a finer and more appropriate granularity of materialization. The operational and maintenance cost of the MRFs, which includes any directory look-up operations during the online mode and the derivation of a feasible update plan during updates, remains practically negligible, on the order of milliseconds.

We compared DynaMat against a system that is given all queries in advance and the pre-computed optimal static view selection. These experiments indicate that DynaMat outperforms the optimal static selection and thus any sub-optimal view selection algorithm that has appeared in the literature. Another important result that validates the importance of DynaMat is that just 1-2% of the Data Cube space and 1-2% of the update window for the full Data Cube are sufficient for substantial performance improvements. However, the most important feature of DynaMat is that it represents a complete self-tunable system that dynamically adjusts to new patterns in the workload. DynaMat relieves the warehouse administrator from having to monitor and calibrate the system constantly, regardless of the skewness of the data and/or of the queries. Even for cases where there is no specific pattern in the workload, like the uniform queries used for some of our experiments, DynaMat manages to pick a set of MRFs that outperforms the optimal static view selection. For more skewed query distributions, especially for workloads that include a lot of roll-up queries, the performance of DynaMat is even better.

5 Acknowledgments

We would like to thank Kostas Stathatos and Alexandros Labrinidis for their helpful comments and suggestions.

References

[AAD+96] S. Agrawal, R. Agrawal, P. Deshpande, A. Gupta, J. Naughton, R. Ramakrishnan, and S. Sarawagi. On the Computation of Multidimensional Aggregates. In Proc. of VLDB, pages 506-521, Bombay, India, August 1996.

[ACT97] ACT Inc. The Cubetree Datablade. http://www.act-us.com, August 1997.

[Aut] AutoAdmin Project, Database Group, Microsoft Research.

[BDD+98] R. G. Bello, K. Dias, A. Downing, J. Feenan, J. Finnerty, W. D. Norcott, H. Sun, A. Witkowski, and M. Ziauddin. Materialized Views In Oracle. In Proc. of VLDB, pages 659-664, New York City, New York, August 1998.

[BPT97] E. Baralis, S. Paraboschi, and E. Teniente. Materialized View Selection in a Multidimensional Database. In Proc. of VLDB, pages 156-165, Athens, Greece, August 1997.

[CR94] C.M. Chen and N. Roussopoulos. The Implementation and Performance Evaluation of the ADMS Query Optimizer: Integrating Query Result Caching and Matching. In Proc. of the 4th Intl. Conf. on Extending Database Technology, pages 323-336, 1994.

[DDJ+98] L. Do, P. Drew, W. Jin, V. Junami, and D. V. Rossum. Issues in Developing Very Large Data Warehouses. In Proceedings of the 24th VLDB Conference, pages 633-636, New York City, New York, August 1998.

[DFJ+96] S. Dar, M.J. Franklin, B. Jonsson, D. Srivastava, and M. Tan. Semantic Data Caching and Replacement. In Proc. of the 22nd International Conference on VLDB, pages 330-341, Bombay, India, September 1996.

[DR92] A. Delis and N. Roussopoulos. Performance and Scalability of Client-Server Database Architectures. In Proc. of the 18th VLDB, pages 610-623, Vancouver, Canada, 1992.

[DRSN98] P. M. Deshpande, K. Ramasamy, A. Shukla, and J.F. Naughton. Caching Multidimensional Queries Using Chunks. In Proceedings of the ACM SIGMOD, pages 259-270, Seattle, Washington, June 1998.
[GBLP96] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. In Proc. of the 12th ICDE, pages 152-159, New Orleans, February 1996. IEEE.

[GHRU97] H. Gupta, V. Harinarayan, A. Rajaraman, and J. Ullman. Index Selection for OLAP. In Proceedings of ICDE, pages 208-219, Birmingham, UK, April 1997.

[GL95] T. Griffin and L. Libkin. Incremental Maintenance of Views with Duplicates. In Proceedings of the ACM SIGMOD, pages 328-339, San Jose, CA, May 1995.

[GMS93] A. Gupta, I.S. Mumick, and V.S. Subrahmanian. Maintaining Views Incrementally. In Proceedings of the ACM SIGMOD Conference, pages 157-166, Washington, D.C., May 1993.

[Gup97] H. Gupta. Selection of Views to Materialize in a Data Warehouse. In Proceedings of ICDT, pages 98-112, Delphi, January 1997.

[HRU96] V. Harinarayan, A. Rajaraman, and J. Ullman. Implementing Data Cubes Efficiently. In Proc. of ACM SIGMOD, pages 205-216, Montreal, Canada, June 1996.

[JMS95] H. Jagadish, I. Mumick, and A. Silberschatz. View Maintenance Issues in the Chronicle Data Model. In Proceedings of PODS, pages 113-124, San Jose, CA, 1995.

[KB96] A.M. Keller and J. Basu. A Predicate-based Caching Scheme for Client-Server Database Architectures. VLDB Journal, 5(1), 1996.

[Kim96] R. Kimball. The Data Warehouse Toolkit. John Wiley & Sons, 1996.

[KR98] Y. Kotidis and N. Roussopoulos. An Alternative Storage Organization for ROLAP Aggregate Views Based on Cubetrees. In Proceedings of the ACM SIGMOD Conference, pages 249-258, Seattle, Washington, June 1998.

[MQM97] I. S. Mumick, D. Quass, and B. S. Mumick. Maintenance of Data Cubes and Summary Tables in a Warehouse. In Proceedings of the ACM SIGMOD Conference, pages 100-111, Tucson, Arizona, May 1997.

[RK86] N. Roussopoulos and H. Kang. Preliminary Design of ADMS±: A Workstation-Mainframe Integrated Architecture for Database Management Systems. In Proc. of VLDB, pages 355-364, Kyoto, Japan, August 1986.

[RKR97] N. Roussopoulos, Y. Kotidis, and M. Roussopoulos. Cubetree: Organization of and Bulk Incremental Updates on the Data Cube. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 89-99, Tucson, Arizona, May 1997.

[RL85] N. Roussopoulos and D. Leifker. Direct Spatial Search on Pictorial Databases Using Packed R-trees. In Procs. of 1985 ACM SIGMOD, pages 17-31, Austin, 1985.

[Rou91] N. Roussopoulos. The Incremental Access Method of View Cache: Concept, Algorithms, and Cost Analysis. ACM Transactions on Database Systems, 16(3):535-563, September 1991.

[SDN98] A. Shukla, P.M. Deshpande, and J.F. Naughton. Materialized View Selection for Multidimensional Datasets. In Proceedings of the 24th VLDB Conference, pages 488-499, New York City, New York, August 1998.

[SS94] S. Sarawagi and M. Stonebraker. Efficient Organization of Large Multidimensional Arrays. In Proceedings of ICDE, pages 328-336, Houston, Texas, 1994.

[SSV96] P. Scheuermann, J. Shim, and R. Vingralek. WATCHMAN: A Data Warehouse Intelligent Cache Manager. In Proceedings of the 22nd VLDB Conference, pages 51-62, Bombay, India, September 1996.

[TS97] D. Theodoratos and T. Sellis. Data Warehouse Configuration. In Proc. of the 23rd International Conference on VLDB, pages 126-135, Athens, Greece, August 1997.

[ZDN97] Y. Zhao, P.M. Deshpande, and J.F. Naughton. An Array-Based Algorithm for Simultaneous Multidimensional Aggregates. In Proceedings of the ACM SIGMOD Conference, pages 159-170, Tucson, Arizona, May 1997.
Chapter 8
Data Mining

At the beginning of the 1990’s, there was increasing interest in pushing large-scale data analysis
beyond the traditional “query-response” model used in database systems. This was particularly
desired for decision-support applications, which try to help analysts use an organization’s data to
aid in strategic decision-making. One decision-support challenge we discussed earlier (Section 7)
is to efficiently answer complex queries over enormous databases. But very often the most
difficult problem is to figure out what questions to ask over a large, complex database. This is the
grand challenge problem for data mining: the database system should be able to efficiently and
effectively tell the analyst “what is interesting in my data”?

Data mining has roots in a number of fields, most significantly in database systems and a branch
of AI known as Machine Learning. Perhaps because of the AI roots of Data Mining, there is a
fog of quasi-anthropomorphic terminology surrounding the field. Data “mining” itself is a phrase
meant to evoke the computer as a prospector searching for gold “nuggets” in “mountains” of data.
The data mining process is often referred to as “Knowledge Discovery”, with the emphasis on
“Knowledge” rather than “data” implying an evolution from machines that just store bits into
intelligent beings that “know” things.

The leading data mining conference is called “Knowledge Discovery and Data Mining”
(SIGKDD). At the time of its founding, it was intended to bridge a gap between the database and
AI communities. From the database side, there was a desire to break away from the querying
metaphors that had driven the field, and look at algorithms for data analysis that didn’t have
simple declarative representations. From the AI side, there was a desire to deal more directly
with real-world issues that hadn’t received emphasis in that community: large-scale data
collections, end-user interactivity and dirty data. SIGKDD caused concern in each community
about unhealthy fracturing of the research landscape, and there was initial discussion about trying
to avoid yet another research area. But it has proven to be a popular breakaway organization, and
one that is reasonably healthy.

The name Data Mining seems quite general, but in fact is usually used in the research literature to
describe a handful of simply-stated algorithm specifications for extracting patterns from large
tables. By far the three most common of these are Classification, Clustering, and Association
Rules. The first two problems originated in the statistics and machine learning communities, the
third in the database community. We present one example of each in this chapter; for each
example we present a paper that describes an intuitive, scalable, disk-oriented approach to the
problem – one that is appropriate for large databases. We also present a fourth paper on emerging
approaches toward integrating these techniques into the framework of a traditional database
system.

Classification

The problem statement for classification is quite simple. You are given a table with n columns,
and you are asked to predict the value of one additional “missing” column based on the data that
you have. The missing column can take on one of k pre-specified values, or class labels. As one
toy example, consider a table of credit card applicant data, including attributes like age, zip code,
salary, home value, etc. The missing column could be called risk, taking on one of three possible
classes: low, medium, or high.

Classification is typically an example of a supervised learning task, in which a human first
“trains” the algorithm on some example data, by manually providing the value of the missing
field for a number of example rows. Based on this training set, the classification algorithm
should be able to predict the missing value for any future data that arrives. This predictive
approach is often based on some statistical model, and in the context of the model provides a
robust mathematical interpretation for the algorithm’s choice of class labels. The “gotcha” with
this worldview is that one has to trust that the statistical models being used are appropriate. We
will return to this point at the end of this introduction.

The classification paper we present is an algorithm called SPRINT. It is not the most statistically
sophisticated algorithm in the literature. However, it is approachable to a student of database
systems, and it has the features that distinguish data mining from traditional machine learning: it
scales up nicely to big, disk-based data sets, and it is naturally parallelizable. SPRINT builds a
decision tree, which is a classic machine learning technique. Decision trees have various
limitations. One typical pitfall is that they can overfit the training data, i.e., too carefully capture
random distinctions among members of different classes in the training set. The result of
overfitting is a classifier that cannot accommodate normal variations within members of a single
class; various techniques can be parameterized to balance overfitting against over-generalizing.
Another problem is that a decision tree breaks down the data hierarchically in a way that forbids
“pivoting” along different attributes – e.g. if it first breaks down the credit card applicants by age,
it may not do a good job describing how all the applicants should break down by salary. On the
positive side, though, decision trees are very intuitive, and can be presented to end-users as output
to explain the classification process. As a result, they are often used both in the research
literature and in commercial data analysis tools such as CHAID, CART, and C4.5.
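
As a concrete illustration of the training/prediction workflow and of the parameters that trade off overfitting against over-generalizing, here is a minimal decision-tree sketch using the scikit-learn library (the library choice, the toy data, and the parameter settings are our own illustration, not part of SPRINT or the papers in this chapter):

    # Minimal decision-tree classification sketch (illustrative only).
    from sklearn.tree import DecisionTreeClassifier

    # Toy training set: [age, salary_in_thousands, home_value_in_thousands]
    X_train = [[25, 30, 0], [40, 85, 250], [55, 120, 400], [23, 20, 0],
               [35, 60, 180], [60, 45, 90], [48, 200, 600], [30, 38, 0]]
    y_train = ["high", "medium", "low", "high",
               "medium", "medium", "low", "high"]   # manually supplied class labels

    # max_depth and min_samples_leaf are the kind of knobs used to balance
    # overfitting (tree too specific to the training set) against over-generalizing.
    clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=2)
    clf.fit(X_train, y_train)

    # Predict the "missing" risk column for new applicants.
    print(clf.predict([[28, 33, 0], [50, 150, 500]]))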

Clustering

Students often confuse clustering and classification on first exposure. In clustering, we again
have an input table of n columns, and the challenge of generating a label for each row in a
missing n+1’st column. However, there are two key differences from classification. First,
clustering is typically an unsupervised task, with no training phase. Second, there are no pre-
determined class labels involved in clustering. Instead, each row is to be assigned a cluster ID;
intuitively, rows that are similar should have the same cluster ID, and rows that are different
should have different cluster IDs. The cluster IDs themselves have no meaning; the description
of a cluster is simply the members of the cluster itself (or some aggregate summary of the
members). In SQL terms, clustering is a kind of generalized “GROUP BY” clause: rather than
grouping together tuples with the same values on certain attributes, we group together tuples with
similar values. Given these groups, we can compute aggregates (summaries) for each group to
try and characterize its properties. As an example, it may be useful to cluster customers in a sales
database into market segments, and try to summarize their buying patterns to decide how best to
target ads.

Clustering hinges on a number of parameters. First, a fixed definition of “similarity” (or “distance”) is typically defined to make the problem well-stated. Usually this is some combination of column-by-column distances. For a table with only numeric columns there are many natural distance metrics; three common ones include (each is computed in the short sketch after this list):

• The traditional Euclidean distance from geometry, i.e. the square root of the sum of
squared columnwise distances (sometimes called the L2 norm).

• The so-called “Manhattan distance”, which is analogous to moving on a fixed grid like
the streets of Manhattan, measuring distance along the sides of rectangles rather than
allowing diagonals. This is simply the sum of the columnwise distances (sometimes
called the L1 norm).

• The maximum distance on any column (sometimes called the L-infinity norm).
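
The following small sketch (illustrative; not from any of the papers) computes the three metrics for a pair of numeric rows:

    import math

    def l2(a, b):       # Euclidean distance (L2 norm)
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def l1(a, b):       # "Manhattan" distance (L1 norm)
        return sum(abs(x - y) for x, y in zip(a, b))

    def linf(a, b):     # maximum distance on any column (L-infinity norm)
        return max(abs(x - y) for x, y in zip(a, b))

    row1, row2 = [34, 52000, 3], [39, 48000, 5]
    print(l2(row1, row2), l1(row1, row2), linf(row1, row2))
    # Note how the salary column dominates all three metrics here -- the
    # scaling issue discussed in the next paragraph.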

In many cases one needs to scale each column in some way before computing distances, to
account (for example) for the use of different units in each column. Also, some data types like
“color” don’t obviously map to a number line. Attributes from these unordered types (sometimes
called categorical attributes) add complexity to the problem, since the distance between any two
different values of a categorical attribute is not well defined.

Another parameter in clustering is to decide upon the “right” number of clusters to request. This
is tightly tied to two other parameters, namely how wide each cluster is allowed to get (its
“diameter”), and how close two separate clusters are allowed to get. Obviously if the algorithm
is told to generate only a few clusters, they are likely to have big diameters. Similarly, with only
a few clusters it may be impossible to cleanly separate all the nearby points.

We chose the BIRCH clustering algorithm for this collection of papers. Like SPRINT, it is not
the most statistically sophisticated algorithm that has been proposed, but it has a number of
attractive features. First, it should be approachable for a student of database systems, since it does
not require a great deal of math or statistics to understand. Second, it has strong echoes of the
Generalized Search Trees (GiSTs) of Chapter 5: it generates a hierarchy of partitions of the data,
labeling each partition with a descriptor. Unlike the original GiST paper, these labels are not
predicates but statistical distribution information – i.e., aggregate functions of the data below.
These kinds of extended GiST subtree descriptors were also proposed for near-neighbor searching
and selectivity estimation by Aoki [Aoki98]. Hence BIRCH can be seen as a fairly natural
extension to GiST that “shaves off” the bottom of the tree to ensure that it fits in memory. This
tie-in between search structures (indexes) and data summarizations (mining models) is discussed
further in [Bar97], and is a potential research direction we discuss below.

For the somewhat more statistically ambitious reader, a paper by Bradley et al. [BFR98] is a good
alternative introduction to clustering for large tables. This work extends a traditional technique
from statistics called k-means clustering, and shows how to make it efficient over large data sets.
The k-means algorithm uses a Euclidean distance metric for distance. A follow-on paper by the
same authors [BRF00] generalizes that work to richer probabilistic models using the Expectation
Maximization (E-M) approach from statistics. E-M has two advantages over the prior approaches
we have mentioned: it does not require a Euclidean distance metric, and – perhaps more
significantly – it allows each row to be assigned to multiple clusters with varying probabilities
(“20% likely to be in cluster 1, 30% likely to be in cluster 2”, etc.) In essence, it changes the
problem statement for clustering to one of assigning k new columns to the table – one for each
cluster – and a probability value for each row in each of these columns. This shift in the problem
statement again makes clustering a predictive approach; one that can be reasoned about
probabilistically, under certain statistical models (a common one is a “mixture of Gaussians”, i.e.
a bunch of superimposed “bell curves”). By contrast, the other approaches we discussed are
strictly about minimizing distances, and do not have a clear probabilistic interpretation.
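
For readers who want to see the distance-minimizing flavor of clustering in code, here is a tiny k-means sketch (plain Python, Euclidean distance, illustrative only; it is not the algorithm of [BFR98] or BIRCH):

    import math, random

    def kmeans(rows, k, iterations=20):
        """Toy k-means: assign each row to the nearest center, then recompute
        each center as the mean of its members."""
        centers = random.sample(rows, k)
        for _ in range(iterations):
            clusters = [[] for _ in range(k)]
            for row in rows:
                # nearest center under Euclidean distance
                i = min(range(k), key=lambda c: math.dist(row, centers[c]))
                clusters[i].append(row)
            for i, members in enumerate(clusters):
                if members:
                    centers[i] = [sum(col) / len(members) for col in zip(*members)]
        return centers, clusters

    rows = [[1.0, 1.2], [0.8, 1.1], [5.0, 5.2], [5.3, 4.9], [9.0, 0.5], [8.7, 0.4]]
    centers, clusters = kmeans(rows, k=3)
    print(centers)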

Association Rules

The association rules problem comes out of the database community, and is the most recent of the
popular data mining problems we present. The traditional example of association rules is to take
as input a set of cash-register sales transactions – sometimes called “market baskets” – and from
them try to compute associations of the form “X ⇒ Y (c%)” where X and Y are sets of items, and
the rule states that c% of transactions that contain X also contain Y. An example rule might be
“when people buy chips and salsa, they also buy avocados 5% of the time”. The value c is
termed the confidence of the rule (we will have more to say about this terminology later.) If very
few people bought the set of items X ∪ Y in the rule (the set {chips, salsa, avocados} in our case),
then the above rule is probably not very significant. Hence these rules should only be generated
when they are “supported” by a large number of sales transactions; the percentage of transactions
that contain the set of items in the rule is called the support of the rule.
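
In code, support and confidence are simple counting; the sketch below (illustrative, with made-up baskets) computes them for the chips/salsa/avocados rule:

    baskets = [
        {"chips", "salsa", "avocados"},
        {"chips", "salsa"},
        {"chips", "salsa", "beer"},
        {"bread", "milk"},
        {"chips", "salsa", "milk"},
    ]

    X = {"chips", "salsa"}
    Y = {"avocados"}

    n_X  = sum(1 for b in baskets if X <= b)            # baskets containing X
    n_XY = sum(1 for b in baskets if (X | Y) <= b)      # baskets containing X and Y

    support    = n_XY / len(baskets)    # fraction of all transactions containing X ∪ Y
    confidence = n_XY / n_X             # fraction of X-baskets that also contain Y
    print(support, confidence)          # 0.2 and 0.25 for this toy data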

The paper we present here by Agrawal and Srikant is the best-known paper on association rule
mining (though it is not the first and certainly not the last.) It breaks the problem down neatly into
that of (a) first filtering down to only those sets with sufficient support, and (b) among those sets,
finding the association rules. The solution to part (a) hinges on the insight that a set can only
have sufficient support if all its subsets have sufficient support; hence sets of bigger and bigger
numbers of items can be built up incrementally. The solution to part (b) uses a hashing scheme to
find the associations.
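
A minimal sketch of the levelwise idea in part (a): grow candidate itemsets one item at a time, keeping only those all of whose subsets already had sufficient support. This captures the spirit of the Apriori approach, not the paper's exact pseudocode:

    from itertools import combinations

    def frequent_itemsets(baskets, min_support):
        """Levelwise search: an itemset can be frequent only if all its subsets are."""
        n = len(baskets)
        def support(itemset):
            return sum(1 for b in baskets if itemset <= b) / n

        # Level 1: frequent single items.
        items = {i for b in baskets for i in b}
        frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]
        k = 1
        while frequent[-1]:
            candidates = set()
            for a in frequent[-1]:
                for b in frequent[-1]:
                    union = a | b
                    if len(union) == k + 1 and all(
                            frozenset(sub) in frequent[-1] for sub in combinations(union, k)):
                        candidates.add(union)
            frequent.append({c for c in candidates if support(c) >= min_support})
            k += 1
        return [s for level in frequent for s in level]

    baskets = [frozenset(b) for b in
               [{"chips", "salsa", "avocados"}, {"chips", "salsa"},
                {"chips", "salsa", "beer"}, {"chips", "beer"}, {"salsa", "avocados"}]]
    print(frequent_itemsets(baskets, min_support=0.4))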

There has been a huge number of follow-on papers to this one, including techniques to deal with
scenarios where very many transactions have sufficient support (see [AY98] for a survey),
techniques for doing the rule mining in an online and interactive fashion (e.g. [Hidber98]),
techniques for finding associations over quantities (e.g. age-brackets) not just item names (e.g.
[MY97,SA96]), and myriad other variations and generalizations.

One reason for the popularity of association rule mining research is that it does not require a great
deal of statistical sophistication; it is quite close to traditional relational query processing. As a
result, the database community has invested more energy into this mining problem than into
others; by contrast, the machine learning community has invested relatively little time on this
technique, and members of that community have been known to question the utility of association
rules.

A problem that the statistically-minded researchers have with association rules is that they are not
predictive in any statistical sense. They simply count combinations of items in an existing
database; they do not construct any statistical model with which to make predictions about future
states of the database. In fact, the use of the term confidence for association rule mining is rather
misleading, because it inaccurately echoes the notion of confidence intervals used in statistical
estimation. In statistics, confidence intervals bound the value of a quantity (e.g. “the average
value is 100, plus or minus 2”), and say how often the value should stay within those bounds for
all possible database states containing the sample (“with 95% probability”). In this sense they
are predictive measures over possible database states, while association rules are simply counts of
the current database state.

Discussion

Briefly put, association rule mining is an example of a combinatorial (descriptive) approach to data mining, while classification and E-M clustering are examples of a probabilistic (predictive)
approach.

At some level, it is a matter of religion whether descriptive or predictive approaches are better, or
whether such a comparison is meaningful. Descriptive approaches don’t provide any notion of
whether the outcome is likely to recur in future data sets; they may only describe some
happenstance in the current state of the database. On the other hand, predictive approaches
typically make their predictions based on some model: i.e., some assumptions about underlying
data distributions, which may or may not be a good match to the real-world phenomenon
generating the data. In the end, either approach must be justified by its success in solving real
problems.

This brings up another thorny issue in the current state of data mining. Despite the energy and
hype behind the field, it has had limited broad-based practical application to date. The success
stories mostly arise in narrow, custom-written applications; probably the best-known success is in
credit-card fraud detection, which is a classification problem. General-purpose data mining tools
like those sold by the SAS Institute have had modest success in the marketplace. Part of the
problem is that many mining tools remain difficult to use: the choice of mining models and
techniques to apply to a problem is often ambiguous, setting the various parameters of the various
techniques remains a black art, and in the end there is always uncertainty about the significance
of the results.

Another problem with current mining approaches is their limited scope: they typically operate on
a single table, with fairly simple, full-scan access. Our last paper addresses this problem to some
extent, by describing architectural and linguistic approaches to integrate mining algorithms into a
richer database query environment. This is an area of research that took a surprisingly long time
to emerge, perhaps because of the gap between the skill sets of database systems experts and
statistical data analysis experts. This gap has narrowed significantly in the last decade, to the
mutual benefit of both communities. We expect to see more integration of these approaches to
querying and analyzing information in future; some interesting work has emerged melding data
mining techniques with OLAP (e.g., [Sara01]). But much remains to be done, even in the first-
order problem of efficiently integrating the favorite mining models into SQL systems.

Before parting, we note briefly that there are a number of other problems under investigation in
the data mining community. These include the mining of sequential patterns (e.g. trends in stock
prices over time), identifying outliers and “dirty” data, mining multiple tables for approximate
relational dependencies (e.g. keys and foreign keys), etc. There are also twists on the traditional
clustering and classification problems for specific settings, e.g. for text documents, for graphs
(e.g. of hyperlinked documents), for XML, for unending data streams, etc. As with the earlier
discussion, there are many algorithm variants for each of these tasks, and the success stories,
when they exist, tend to be from carefully tuned solutions for specific domains.

In sum, data mining has come a long way in the past decade, and should now be part of any
database expert’s vocabulary and toolkit. On the other hand, there is still a huge distance between
mining’s current reality and the dream of an unsophisticated user asking the computer to “tell me
what is important”.

References

[Aoki98] P.M. Aoki. “Generalizing ‘Search’ in Generalized Search Trees”. In Proc. 14th IEEE
Int'l Conf. on Data Engineering (ICDE '98), Orlando, FL, Feb. 1998, 380-389.

[AY98] Charu C. Aggarwal and Philip S. Yu. “Mining Large Itemsets for Association Rules”.
IEEE Data Eng. Bull. 21(1): 23-31, 1998.

[Bar97] D. Barbará, et al. “The New Jersey Data Reduction Report”. IEEE Bulletin of the
Technical Committee on Data Engineering, Dec. 1997.

[BFR98] Paul S. Bradley, Usama M. Fayyad and Cory Reina. “Scaling Clustering Algorithms to
Large Databases”. In Proceedings of the Fourth International Conference on Knowledge
Discovery and Data Mining (KDD), August 27-31, 1998, New York City 9-15.

[BRF00] Paul S. Bradley, Cory Reina, Usama M. Fayyad. “Clustering Very Large Databases
Using EM Mixture Models.” In International Conference on Pattern Recognition (ICPR),
September 3-8, 2000, Barcelona, Spain, Volume II: 2076-2080.

[Hidber98] Christian Hidber. “Online Association Rule Mining.” In Proc. ACM SIGMOD
International Conference on Management of Data, June 1-3, 1999, Philadelphia, Pennsylvania,
pp. 145-156.

[MY97] R. J. Miller and Y. Yang. “Association Rules over Interval Data”, Proc. of the ACM
SIGMOD Int'l Conf. on the Management of Data, Tucson, AZ, May 1997.

[SA96] Ramakrishnan Srikant and Rakesh Agrawal. “Mining quantitative association rules in
large relational tables.” In ACM-SIGMOD International Conference on Management of Data,
pp. 1–12, 1996.

[Sara01] S. Sarawagi. “User-cognizant multidimensional analysis”. The VLDB Journal, 10(2-3):224-239, 2001.

BIRCH: An Efficient Data Clustering Method for Very Large Databases

SPRINT: A Scalable Parallel Classifier for Data Mining

Fast Algorithms for Mining Association Rules

Efficient Evaluation of Queries with Mining Predicates

Surajit Chaudhuri (Microsoft Corp.), surajitc@microsoft.com
Vivek Narasayya (Microsoft Corp.), viveknar@microsoft.com
Sunita Sarawagi (IIT Bombay), sunita@it.iitb.ac.in

Abstract

Modern relational database systems are beginning to support ad hoc queries on mining models. In this paper, we explore novel techniques for optimizing queries that apply mining models to relational data. For such queries, we use the internal structure of the mining model to automatically derive traditional database predicates. We present algorithms for deriving such predicates for some popular discrete mining models: decision trees, naive Bayes, and clustering. Our experiments on Microsoft SQL Server 2000 demonstrate that these derived predicates can significantly reduce the cost of evaluating such queries.

1. Introduction

Progress in database technology has made massive warehouses of business data ubiquitous [10]. There is increasing commercial interest in mining the information in such warehouses. Data mining is used to extract predictive models from data that can be used for a variety of business tasks. For example, based on a customer's profile information, a model can be used for predicting if a customer is likely to buy sports items. The result of such a prediction can be leveraged in the context of many applications, e.g., a mail campaign or an on-line targeted advertisement.

Recently, several database vendors have made it possible to apply predictive models on relational data using SQL extensions. The predictive models can either be built natively or imported, using PMML or another interchange format. This enables us to express queries containing mining predicates such as: "Find customers who visited the MSNBC site last week and who are predicted to belong to the category of baseball fans". The focus of this paper is to optimize queries containing such mining predicates. To the best of our knowledge, this is the first study of its kind. The techniques described in this paper are general and do not depend on the specific nature of the integration of databases and data mining.

We propose a technique that exploits knowledge of the mining model's content to optimize queries with mining predicates. Today's systems would evaluate the above query by first selecting the customers who visited the MSNBC site, then applying the mining model (treated as a black box) on the selected rows, and filtering the subset that are predicted to be "baseball fans". In contrast, we wish to exploit the mining predicate for better access path selection, particularly if "baseball fans" represent a very small fraction of MSNBC visitors. The main challenge in exploiting mining predicates is that each mining model has its own specific method of predicting classes as a function of the input attributes, and some of these methods are too complex to be directly usable by traditional database engines.

We present a general framework in which, given a mining predicate, a model-specific algorithm can be used to infer a simpler derived predicate expression. The derived predicate expression is constrained to be a propositional expression consisting of simple selection predicates on attribute values. Such a derived predicate, which we call an upper envelope of the mining predicate, can then be exploited for access path selection like any other traditional database predicate.

We concentrate on predictive mining models that when applied to a tuple x predict one of K discrete classes c_1, ..., c_K. Most classification and clustering models fall in this category. For every possible class c that the model M predicts, its upper envelope is a predicate of the form M_c(x) such that the tuple x has class c only if it satisfies the predicate M_c(x), but not necessarily vice-versa. We require that M_c(x) is a propositional predicate expression consisting of simple selection conditions on attributes of x. Such upper envelopes can be added to the query to generate a semantically equivalent query that would result in the same set of answers over any database. Since M_c(x) is a predicate on the attributes of x, it has the potential of better exploiting index structures and improving the efficiency of the query.

The effectiveness of such semantic optimization depends on two criteria. First, we must demonstrate that the upper-envelope predicates can be derived for a wide set of commonly used mining models. Second, we need to show that the addition of these upper-envelope predicates can have a significant impact on the execution time for queries with
mining predicates. In turn, this requires that our derivation of upper envelopes be "tight" and the original mining predicate to be selective, so that they are effective in influencing the access path selection. Our extensive experiments on Microsoft SQL Server provide strong evidence of the promise of such semantic optimization. Moreover, our experiments demonstrate that little overhead is incurred during optimization for using such upper envelopes.

Outline: The rest of the paper is organized as follows. In Section 2 we review existing support for mining predicates in SQL queries in two commercially available relational database engines: Microsoft SQL Server's Analysis Server and IBM DB2's Intelligent Miner Scoring facility. In Section 3 we present algorithms for deriving such predicates for three popular discrete mining models: decision trees, naive Bayes classifiers and clustering. In Section 4 we discuss the operational issues of using upper envelopes to optimize queries with mining predicates. In Section 5 we report the results of our experimental study to evaluate the effectiveness of our technique in improving the efficiency of queries with mining predicates. We discuss related work in Section 6.

2. Expressing Mining Queries in Existing Systems

In this section, we describe some of the possible approaches to expressing database queries with mining predicates. We emphasize that our techniques are general in the sense that they do not depend on the specific nature of such integration of databases and data mining.

2.1. Extract and Mine

The traditional way of integrating mining with querying is to pose a traditional database query to a relational backend. The mining model is subsequently applied in the client/middleware on the result of the database query. Thus, for the example in the introduction, the mining query will be evaluated in the following phases: (a) execute a SQL query at the database server to obtain all the customers who visited MSNBC last week; (b) for each customer fetched into the client/middleware, apply the mining model to determine if the customer is predicted to be a "baseball fan".

2.2. Microsoft Analysis Server

In the Microsoft Analysis Server product (part of SQL Server 2000) mining models are explicitly recognized as first-class table-like objects. Creation of a mining model corresponds to schematic definition of a mining model. The following example shows creation of a mining model that predicts the risk level of customers based on source columns gender, purchases and age using decision trees.

    CREATE MINING MODEL [Risk Class]          // Name of Model
    (
        Customer_ID LONG KEY,                 // source column
        Gender      TEXT DISCRETE,            // source column
        Risk        TEXT DISCRETE PREDICT,    // prediction column
        Purchases   DOUBLE DISCRETIZED(),     // source column
        Age         DOUBLE DISCRETIZED,       // source column
    )
    USING [Decision Trees 101]                // Mining Algorithm

The model is trained using the INSERT INTO statement that inserts training data into the model (not discussed due to lack of space). Predictions are obtained from a model M on a dataset D using a prediction join [15] between D and M. A prediction join is different from a traditional equi-join on tables since the model does not actually contain data details. The following example illustrates prediction join.

    SELECT D.Customer_ID, M.Risk
    FROM [Risk Class] M
    PREDICTION JOIN
      (SELECT Customer_ID, Gender, Age, sum(Purchases) as SP
       FROM Customers D GROUP BY Customer_ID, Gender, Age) as D
    ON M.Gender = D.Gender
       and M.Age = D.Age
       and M.Purchases = D.SP
    WHERE M.Risk = 'low'

In this example, the value of "Risk" for each customer is not known. Joining rows in the Customers table to the model M returns a predicted "Risk" for each customer. The WHERE clause specifies which predicted values should be extracted and returned in the result set of the query. Specifically, the above example has the mining predicate Risk = "low".

2.3. IBM DB2

IBM's Intelligent Miner (IM) Scoring product integrates the model application functionality of IBM Intelligent Miner for Data with the DB2 Universal Database [21] [1]. Trained mining models in flat file, XML or PMML format can be imported into the database. We show an example of importing a classification model for predicting the risk level of a customer into a database using a UDF called IDMMX.DM_impClasFile().

    INSERT INTO IDMMX.ClassifModels values ('Risk Class',
        IDMMX.DM_impClasFile('/tmp/myclassifier.x'))

Once the model is loaded, it can be applied to compatible records in the database by invoking another set of User Defined Functions (UDFs). An example of applying the above classification mining model ("Risk Class") on a data table called Customers is shown below.

    SELECT Customer_ID, Risk
    FROM (
      SELECT Customer_ID, IDMMX.DM_getPredClass(
        IDMMX.DM_applyClasModel(c.model,
          IDMMX.DM_applData(IDMMX.DM_applData('AGE', s.age),
            'PURCHASE', s.purchase))) as Risk
      FROM ClassifModels c, Customer_list s
      WHERE c.modelname = 'Risk Class' and s.salary < 40000
    ) WHERE Risk = 'low'

The UDF IDMMX.DM_applData is used to map the fields s.salary and s.age of the Customer_list table into the corresponding fields of the model for use during prediction. The UDF applyClasModel() applies the model on the mapped data and returns a composite result object that has, along with the predicted class, other associated statistics like the confidence of the prediction. A second UDF IDMMX.DM_getPredClass extracts the predicted class from this result object. The mining predicate in this query is: Risk = 'low'.

3. Deriving Upper Envelopes for Mining Predicates

We present algorithms for deriving upper envelopes for three popular mining models. We focus on mining models that produce a discrete class as output. The class of models whose prediction is real-valued is a topic of our future work. For some models like decision trees and rule-based classifiers, derivation of such predicates is straightforward, as we show in Section 3.1. The process is more involved for naive Bayes classifiers and clustering, as we show in Sections 3.2 and 3.3 respectively.

In deriving these upper envelopes, two conflicting issues that arise are the tightness and the complexity of the upper envelope predicate. An upper envelope of a class c is said to be exact if it includes all points belonging to c and no point belonging to any other class. In most cases, where the model is complex, we need to settle for looser bounds because both the complexity of the enveloping predicate and the running time for deriving the upper envelope might get intolerable. Complex predicates are also ineffective in improving the efficiency of the query because the DBMS might spend a lot of time in evaluating these otherwise redundant predicates. We revisit these issues in Section 4.2.

3.1. Decision trees

In a decision tree [29] the internal nodes define a simple test on one of the attributes and the leaf-level nodes define a class label. An example of a decision tree is shown in Figure 1. The class label of a new instance is determined by evaluating the test conditions at the nodes and, based on the outcome, following one of the branches until a leaf node is reached. The label of the leaf is the predicted class of the instance.

Figure 1. Example of a decision tree

We extract the upper envelope for a class c by ANDing the test conditions on the path from the root to each leaf of the class and ORing them together. Clearly, this envelope is exact. For the example in Figure 1 the upper envelope of class c1 is "((lower BP > 91) AND (age > 63) AND (overweight)) OR ((lower BP ≤ 91) AND (upper BP > 130))". Similarly, the upper envelope of class c2 is "((lower BP > 91) AND (age ≤ 63)) OR ((lower BP > 91) AND (age > 63) AND (not overweight)) OR ((lower BP ≤ 91) AND (upper BP ≤ 130))".
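
The path-collection procedure described above is mechanical enough to state in a few lines of code. The following is an illustrative sketch (not from the paper; the Node representation is an assumption made for this example), which walks a tree of yes/no tests and returns, for each class, the OR-of-ANDs upper envelope:

    # Illustrative sketch of upper-envelope extraction for a decision tree.
    class Node:
        def __init__(self, test=None, yes=None, no=None, label=None):
            self.test = test      # e.g. "lower BP > 91" for an internal node
            self.yes = yes        # subtree followed when the test is true
            self.no = no          # subtree followed when the test is false
            self.label = label    # class label for a leaf node

    def upper_envelopes(root):
        """Return {class label: list of conjunctions}, one conjunction per leaf."""
        envelopes = {}
        def walk(node, path):
            if node.label is not None:               # leaf: record the AND of tests on the path
                envelopes.setdefault(node.label, []).append(list(path))
                return
            walk(node.yes, path + [node.test])                    # condition holds on "yes" branch
            walk(node.no, path + ["NOT (%s)" % node.test])        # negated on "no" branch
        walk(root, [])
        return envelopes    # the envelope of class c is the OR of its conjunctions

    # The tree of Figure 1, reconstructed from the envelopes quoted in the text:
    tree = Node("lower BP > 91",
                yes=Node("age > 63",
                         yes=Node("overweight", yes=Node(label="c1"), no=Node(label="c2")),
                         no=Node(label="c2")),
                no=Node("upper BP > 130", yes=Node(label="c1"), no=Node(label="c2")))

    for cls, conjs in upper_envelopes(tree).items():
        print(cls, " OR ".join("(" + " AND ".join(c) + ")" for c in conjs))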
Extraction of upper envelopes for rule-based classifiers [27, 14] is similarly straightforward. A rule-based learner consists of a set of if-then rules where the body of the rule consists of conditions on the data attributes and the head (the part after "then") is one of the k class-labels. The upper envelope of each class c is just the disjunction of the bodies of all rules where c is the head. Unlike for decision trees, the envelope may not be exact because some rule learners allow rules of different classes to overlap. Therefore, an input instance might fire off two rules, each of which predicts a different class. Typically, a resolution procedure based on the weights or sequential order of rules is used to resolve conflict in such cases. It may be possible to tighten the envelope in such cases by exploiting the knowledge of the resolution procedure.

3.2. naive Bayes Classifiers

Extracting the upper envelopes for naive Bayes classifiers is considerably more difficult than for decision trees. We first present a primer on naive Bayes classifiers in Section 3.2.1. Then we present two algorithms for finding upper envelopes in Section 3.2.2. Finally, we present a proof of correctness in Section 3.2.3.

3.2.1. Primer on naive Bayes classifiers

Bayesian classifiers [27] perform a probabilistic modeling of each class. Let x be an instance for which the classifier needs to predict one of K classes c_1, c_2, ..., c_K. The predicted class C(x) of x is calculated as

    C(x) = \arg\max_k \Pr(c_k \mid x) = \arg\max_k \frac{\Pr(x \mid c_k)\,\Pr(c_k)}{\Pr(x)}

where Pr(c_k) is the probability of class c_k and Pr(x | c_k) is the probability of x in class c_k. The denominator Pr(x) is the same for all classes and can be ignored in the selection of the winning class.

Let n be the number of attributes in the input data. Naive Bayes classifiers assume that the attributes x_1, ..., x_n of x are independent of each other given the class. Thus, the above formula becomes:

    C(x) = \arg\max_k \left( \prod_{d=1}^{n} \Pr(x_d \mid c_k)\,\Pr(c_k) \right)                      (1)
         = \arg\max_k \left( \sum_{d=1}^{n} \log \Pr(x_d \mid c_k) + \log \Pr(c_k) \right)            (2)

Ties are resolved by choosing the class which has the higher prior probability Pr(c_k).

The probabilities Pr(x_d | c_k) and Pr(c_k) are estimated using training data. For a discrete attribute d, let m_{1d}, ..., m_{n_d d} denote the n_d members of the domain of d. For each member m_{ld}, during the training phase we learn a set of K values corresponding to the probability Pr(x_d = m_{ld} | c_k). Continuous attributes are either discretized using a preprocessing step (see [17] for a discussion of various discretization methods) or modeled using a single continuous probability density function, the most common being the Gaussian distribution. In this paper we will describe the algorithm assuming that all attributes are discretized.

Example. An example of a naive Bayes classifier is shown in Table 1 for K = 3 classes, n = 2 dimensions, the first dimension d_0 having n_0 = 4 members and the second dimension d_1 having n_1 = 3 members. The triplets along the column margin show the trained Pr(m_{j1} | c_k) values for each of the three classes for dimension d_1. The row margin shows the corresponding values for dimension d_0. For example, the first triplet in the column margin (.01, .7, .05) stands for (Pr(m_01 | c_1), Pr(m_01 | c_2), Pr(m_01 | c_3)) respectively. The top margin shows the class priors. Given these parameters, the predicted class for each of the 12 possible distinct instances x (found using Equation 1) is shown in the internal cells. For example, the value 0.001 for the top-leftmost cell denotes Pr(x | c_1) Pr(c_1) where x = (m_00, m_01).

3.2.2. Finding the upper envelope of a class

We next present algorithms for finding the upper envelope to cover all regions in the n-dimensional attribute space where the naive Bayes classifier will predict a given class c_k. For example, the upper envelope for class c_2 in the example of Table 1 is (d_0 ∈ {m_20, m_30} AND d_1 ∈ {m_01, m_11}) OR (d_1 = m_01). We will express this envelope as two regions described by their boundaries as (d_0 : [2..3], d_1 : [0..1]) ∨ (d_1 : [0..0]).

A simple way to find such envelopes is to enumerate for each combination in this n-dimensional space the predicted class, as we have done for the example above. We can then cover all combinations where class c_k is the winner with a collection of contiguous regions using any of the known multidimensional covering algorithms [2, 30]. Each region will contribute one disjunct to the upper envelope. This is in fact a generic algorithm applicable to any classification algorithm, not simply naive Bayes. Unfortunately, it is impractically slow to enumerate all \prod_{d=1}^{n} n_d member combinations (n_d is the size of the domain of dimension d). A medium sized data set in our experiments took more than 24 hours for just enumerating the combinations. We next present a top-down algorithm that avoids this exponential enumeration.

A top-down algorithm. The algorithm proceeds in a top-down manner, recursively narrowing down the region belonging to the given class c_k for which we want to find the upper envelope. The main intuition behind this algorithm is to exploit efficiently computable upper bounds and lower bounds on the probabilities of classes to quickly establish the winning and losing classes in a region consisting of several combinations.

The algorithm starts by assuming that the entire region belongs to class c_k. It then estimates an upper bound maxProb(c_j) and a lower bound minProb(c_j) on the probabilities of each class c_j as follows:

    \mathrm{maxProb}(c_j) = \Pr(c_j) \prod_{d=1}^{n} \max_{l \in 1..n_d} \Pr(m_{ld} \mid c_j)
    \mathrm{minProb}(c_j) = \Pr(c_j) \prod_{d=1}^{n} \min_{l \in 1..n_d} \Pr(m_{ld} \mid c_j)

Computation of these bounds requires time only linear in the number of members along each dimension. In Figure 2(a) we show the minProb (second row) and maxProb (third row) values for the starting region of the classifier in Table 1. For example, in the figure the minProb value of 0.0005 for class c_2 is obtained by multiplying the three values Pr(c_2) = 0.5, min_{l ∈ 0..3} Pr(m_{l0} | c_2) = min(0.1, 0.1, 0.4, 0.4) = 0.1, and min_{l ∈ 0..2} Pr(m_{l1} | c_2) = min(0.7, 0.29, 0.01) = 0.01.

Using these bounds we partially reason about the class of the region to distinguish amongst one of these three outcomes:

1. MUST-WIN: All points in the region belong to class c_k. This is true if the minimum probability of class c_k (minProb(c_k)) is greater than the maximum probability (maxProb(c_j)) values of all classes c_j.

2. MUST-LOSE: No points in the region belong to class c_k. This is true if there exists a class c_j for which maxProb(c_k) < minProb(c_j). In this case class c_j will win over class c_k at all points in this region.

3. AMBIGUOUS: Neither of the previous two conditions apply, i.e., possibly a subset of points in the region belong to the class.
cover all combinations where class c k is the winner with In Section 3.2.3 we sketch a proof of why these bounds are
a collection of contiguous regions using any of the known correct and also show how to improve them further.
multidimensional covering algorithms [2, 30]. Each region When the status of a region is AMBIGUOUS, we need to
will contribute one disjunct to the upper envelope. This is first shrink the region and then split it into smaller regions,
in fact a generic algorithm applicable to any classification re-evaluate the upper and lower bounds in each region and
n Q
algorithm, not simply naive Bayes. Unfortunately, it is im-
practically slow to enumerate all d=1 nd (nd is the size
recursively apply the above tests until all regions either sat-
isfy one of the first two terminating conditions or the al-
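The scoring rule and the region bounds can be made concrete with a short sketch. The following Python fragment is not from the paper; it is a minimal illustration assuming the trained model is held as a dictionary of class priors plus a per-class, per-dimension table of member probabilities, and that a region is one (lo, hi) member-index range per dimension. The region_status test applied to the full attribute space is exactly the check performed at the root of the tree in the algorithm below.

# Minimal sketch (not from the paper) of naive Bayes scoring and of the
# maxProb/minProb bounds used by the top-down upper-envelope algorithm.
# Assumed layout: priors[k] = Pr(c_k); probs[k][d][l] = Pr(m_ld | c_k)
# for member l of dimension d. Ties between classes are ignored for brevity.

def predict(priors, probs, x):
    """Return the class maximizing Pr(c_k) * prod_d Pr(x_d | c_k) (Equation 1)."""
    def score(k):
        s = priors[k]
        for d, member in enumerate(x):
            s *= probs[k][d][member]
        return s
    return max(priors, key=score)

def bounds(priors, probs, region, k):
    """maxProb and minProb of class k over a region given as [(lo, hi), ...]."""
    max_p = min_p = priors[k]
    for d, (lo, hi) in enumerate(region):
        vals = probs[k][d][lo:hi + 1]
        max_p *= max(vals)
        min_p *= min(vals)
    return max_p, min_p

def region_status(priors, probs, region, k):
    """MUST-WIN, MUST-LOSE, or AMBIGUOUS status of a region for class k."""
    max_k, min_k = bounds(priors, probs, region, k)
    others = [bounds(priors, probs, region, j) for j in priors if j != k]
    if all(min_k > max_j for max_j, _ in others):
        return "MUST-WIN"
    if any(max_k < min_j for _, min_j in others):
        return "MUST-LOSE"
    return "AMBIGUOUS"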
Class priors: p(c1) = 0.33, p(c2) = 0.5, p(c3) = 0.17

  d1 \ d0     (margin)        m00                     m10                     m20                     m30
  m01       .01, .7, .05    .001, .03, .0005 (c2)   .001, .03, .0005 (c2)   .0002, .1, .004 (c2)    .0002, .1, .004 (c2)
  m11       .5, .29, .05    .07, .01, .0005 (c1)    .07, .01, .0005 (c1)    .009, .06, .004 (c2)    .009, .06, .004 (c2)
  m21       .49, .1, .9     .07, .0005, .009 (c1)   .07, .0005, .009 (c1)   .009, .002, .07 (c3)    .009, .002, .07 (c3)
  (margin)                  .4, .1, .05             .4, .1, .05             .05, .4, .4             .05, .4, .4

Table 1. Example of a naive-Bayes classifier. Refer to the Example paragraph of Section 3.2.1 for a description.
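As a quick arithmetic check of the table (an illustration added here, not part of the paper), the top-left internal cell can be reproduced from the margins and the class priors:

# Reproduce the top-left cell of Table 1: instance x = (m00, m01).
# Margins: Pr(m00|c) = (.4, .1, .05), Pr(m01|c) = (.01, .7, .05); priors = (.33, .5, .17).
priors = {"c1": 0.33, "c2": 0.5, "c3": 0.17}
p_m00 = {"c1": 0.4, "c2": 0.1, "c3": 0.05}
p_m01 = {"c1": 0.01, "c2": 0.7, "c3": 0.05}

scores = {c: priors[c] * p_m00[c] * p_m01[c] for c in priors}
print(scores)                       # ~ {'c1': 0.00132, 'c2': 0.035, 'c3': 0.000425}
print(max(scores, key=scores.get))  # c2, matching the "(c2)" annotation in that cell

The three products round to the .001, .03, .0005 shown in the cell, and c2 wins.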

                    (a) Starting region    (b) Tighter bounds with  (c) Shrinking d1     (d) 1st child on splitting    (e) 2nd child
                                           member m21 of d1         to [0..1]            d0 into [0..1] and [2..3]
  Region (d0; d1):  [0..3]; [0..2]         [0..3]; [2..2]           [0..3]; [0..1]       [0..1]; [0..1]                [2..3]; [0..1]
  MinProb:          .0002, .0005, .0005    .0002, .03, .0005        .009, .0005, .0005   .07, .0005, .0005             .009, .002, .004
  MaxProb:          .07, .1, .07           .0014, .1, .004          .07, .06, .07        .07, .01, .009                .009, .06, .07
  Status:           AMBIGUOUS              MUST-LOSE                AMBIGUOUS            MUST-WIN                      AMBIGUOUS

Figure 2. First three steps of finding predicates for class c1 of the classifier in Figure 1, showing a shrinkage step along dimension 1 followed by a split along dimension 0. In each box, the first line identifies the boundary of the region, the second and third lines show respectively the minProb and maxProb values of each of the three classes. The fourth line is the status of the region with respect to class c1.

A sketch of the algorithm appears below.

Algorithm 1 UpperEnvelope(c_k)
1: T: Tree initialized with the entire region as root;
2: while number of tree nodes expanded < Threshold do
3:   r = an unvisited leaf of T;
4:   r.status = Compute using c_k and maxProb, minProb values of r;
5:   if r.status = MUST-WIN then mark r as visited;
6:   if r.status = MUST-LOSE then remove r from T;
7:   if r.status = AMBIGUOUS then
8:     Shrink r along all possible dimensions;
9:     Split r into r1 and r2;
10:    Add r1 and r2 to T as children of r;
11:  end if
12: end while
13: Sweep T bottom-up merging all contiguous leaves;
14: Upper Envelope(c_k) = disjunct over all leaves of T.

Shrink: We cycle through all dimensions and for each dimension d evaluate for each of its members m_{ld} the maxProb(c_j, d, m_{ld}) and minProb(c_j, d, m_{ld}) values as

maxProb(c_j, d, m_{ld}) = \Pr(c_j) \, \Pr(m_{ld} \mid c_j) \prod_{e \neq d} \max_{r} \Pr(m_{re} \mid c_j)

minProb(c_j, d, m_{ld}) = \Pr(c_j) \, \Pr(m_{ld} \mid c_j) \prod_{e \neq d} \min_{r} \Pr(m_{re} \mid c_j)

We use these revised tighter bounds to further shrink the region where possible. We test the MUST-LOSE condition above on the revised bounds and remove any members of an unordered dimension that satisfy this condition. For ordered dimensions, we only remove members from the two ends to maintain contiguity.

In Figure 2(a), from the minProb and maxProb values of the starting region [0..3]; [0..2] we find that for class c_1 neither of the MUST-WIN or MUST-LOSE situations hold. Hence the situation is AMBIGUOUS for c_1 and we attempt to shrink this region. In Figure 2(b) we show the revised bounds for the last member m_{21} of dimension 1. This leads to a MUST-LOSE situation for class c_1 because in the region maxProb for class c_1 is smaller than minProb for class c_2. The new maxProb and minProb values in the shrunk region are shown in Figure 2(c). The shrunk region is again in an AMBIGUOUS state and we attempt to split it next.

Split: Regions are split by partitioning the values along a dimension. In evaluating the best split, we want to avoid methods that require explicit enumeration of the class of each combination. In performing the split our goal is to separate out (as best as possible) the regions which belong to class c_k from the ones which do not belong to c_k. For this, we rely on the well-known entropy function [27] for quantifying the skewness in the probability distribution of class c_k along each dimension. The details of the split are exactly as in the case of binary splits during decision tree construction. We evaluate the entropy function for a split along each member of each dimension and choose the split which has the lowest average entropy in the two sub-regions. The only difference is that we do not have explicit counts of each class; instead we rely on the probability values of the members on each side of the splitting dimension.
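The control flow of Algorithm 1 together with the shrink and split steps can be sketched compactly as follows. This is an assumed illustration, not the authors' implementation: the region representation matches the earlier sketch, and the status, shrink, and split routines (the last one chosen by the entropy heuristic described above) are passed in as functions.

# Sketch (assumed, not the paper's code) of the top-down recursion in Algorithm 1.
# A region is a list of (lo, hi) member-index ranges, one per dimension.
# status_fn(region, k) -> "MUST-WIN" | "MUST-LOSE" | "AMBIGUOUS"
# shrink_fn(region, k) -> possibly smaller region, or None if it becomes empty
# split_fn(region)     -> (left_region, right_region), e.g. lowest-entropy split

def upper_envelope(root_region, k, status_fn, shrink_fn, split_fn, max_splits=64):
    leaves, splits = [], 0
    stack = [root_region]
    while stack:
        r = stack.pop()
        s = status_fn(r, k)
        if s == "MUST-WIN":
            leaves.append(r)              # whole region predicts c_k
        elif s == "MUST-LOSE":
            continue                      # discard: c_k never wins here
        else:                             # AMBIGUOUS: shrink, then split
            r = shrink_fn(r, k)
            if r is None:
                continue
            if splits >= max_splits:
                leaves.append(r)          # keep conservatively: still an upper envelope
                continue
            splits += 1
            stack.extend(split_fn(r))
    # A final pass would merge contiguous leaves; each remaining leaf contributes
    # one conjunctive disjunct of per-dimension range predicates.
    return leaves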
Continuing with our example, in Figure 2(d) and (e) we show the two regions obtained by splitting dimension d_0 into [0..1] and [2..3]. The first sub-region shown in Figure 2(d) leads to a MUST-WIN situation and gives one disjunct for the upper envelope of class c_1. The second region is still in an AMBIGUOUS situation – however a second round of shrinkage along dimension d_1 on the region leads to an empty region and the top-down process terminates.

Merging regions: Once the above top-down split process terminates, we merge all regions that do not satisfy the MUST-LOSE condition. During the course of the above partitioning algorithm we maintain the tree structure of the split so that whenever all children of a node belong to the same class, they can be trivially merged together. This is followed by another iterative search for pairs of non-sibling regions that can be merged. The output is a set of non-overlapping regions that totally subsume all combinations belonging to a class.

Complexity  The above top-down algorithm has a complexity of O(tnmK) where t is the threshold that controls the depth of the tree to which we expand and m = \max_{d=1..n}(n_d) is the maximum length of a dimension. Contrast this with the exponential complexity K \prod_{d=1}^{n} n_d of just the enumeration step of the naive algorithm.

3.2.3. Formal Results

This section contains a sketch of the proof of correctness of the top-down algorithm and can be skipped on first reading.

The main concern about the correctness of the above algorithm arises from the use of the maxProb and minProb bounds in determining the two MUST-WIN and MUST-LOSE conditions. We sketch a proof of why these bounds are correct and also present a set of improved bounds for the special case of two classes. In this proof we do not explicitly discuss the case where there is a tie in the Pr(c_k | \vec{x}) values of two classes.

Lemma 3.1  If a region satisfies the MUST-WIN condition minProb(c_k) > \max_{j \neq k} maxProb(c_j) then for every possible cell v in the region the probability of class c_k is greater than the probability of every other class. Let p_j(m_{ld}) denote \Pr(m_{ld} \mid c_j). We wish to prove that

\Pr(c_k) \prod_{d=1}^{n} \min_{l} p_k(m_{ld}) > \max_{j \neq k} \Pr(c_j) \prod_{d=1}^{n} \max_{l} p_j(m_{ld})    (3)

implies

\forall v \quad \Pr(c_k) \prod_{d=1}^{n} p_k(v_d) > \max_{j \neq k} \Big( \Pr(c_j) \prod_{d=1}^{n} p_j(v_d) \Big)    (4)

That is, (3) implies (4). Similar results hold for the MUST-LOSE condition.

PROOF. Let f(v, j) denote \Pr(c_j) \prod_{d=1}^{n} \Pr(v_d \mid c_j). If \min_v f(v, k) > \max_{j \neq k} \max_v(f(v, j)) then f(v, k) > f(v', j) for all values v, v' and all classes j \neq k. Also \min_v \big( \Pr(c_k) \prod_{d=1}^{n} \Pr(v_d \mid c_k) \big) = \Pr(c_k) \prod_{d=1}^{n} \min_{v_d} \Pr(v_d \mid c_k) because all the terms within the product are non-negative. Similarly, moving the max() beyond the product leaves the result unchanged. Thus, (3) implies (4).

We next present a lemma that will help us get exact bounds for the case when the number of classes K = 2.

Lemma 3.2  When the number of classes K = 2, the MUST-WIN and the MUST-LOSE bounds are exact when the probability values \Pr(v_d \mid c_j) in condition (3) of Lemma 3.1 are replaced with \Pr'(v_d \mid c_j) = \Pr(v_d \mid c_j) / \max_{i \neq k} \Pr(v_d \mid c_i). Let p'_j(m_{ld}) denote \Pr'(m_{ld} \mid c_j). We wish to prove that, when K = 2, condition (4) is equivalent to

\Pr(c_k) \prod_{d=1}^{n} \min_{l} p'_k(m_{ld}) > \max_{j \neq k} \Pr(c_j) \prod_{d=1}^{n} \max_{l} p'_j(m_{ld})    (5)

Similar results hold for the MUST-LOSE condition.

PROOF. Omitted due to lack of space.

3.3. Clustering

Clustering models [22] are of three broad kinds: partitional, hierarchical and fuzzy. We concentrate on partitional clusters where the output is a set of k clusters and each point is assigned to exactly one of these k clusters. Hierarchical and fuzzy clusters are a subject of our on-going work. Partitional clustering methods can be further subdivided based on the membership criteria used for assigning new instances to clusters. We consider three variants: centroid-based, model-based and boundary-based (commonly arising in density-based clusters).

In the popular centroid-based method each cluster is associated with a single point called the centroid that is most representative of the cluster. An appropriate distance measure on the input attributes is used to measure the distance between the cluster centroid and the instance. A common distance function is Euclidean or weighted Euclidean. The instance is assigned to the cluster with the closest centroid. This partitions the data space into K disjoint partitions where the i-th partition contains all points that are closer to the i-th centroid than to any other centroid. A cluster's partition could take arbitrary shapes depending on the distance function, the number of clusters and the number of dimensions. Our goal is to provide an upper envelope on the boundary of each partition using a small number of hyper-rectangles.

A second class of clustering methods is model-based [25]. Model-based clustering assumes that data is generated from a mixture of underlying distributions in which each distribution represents a group or a cluster.

We show that both distance based and model-based clus- on M.Prediction column. An example of such a query is
ters can be expressed exactly as naive Bayes classifiers to identify customers who a data mining model predicts
for the purposes of finding the upper envelopes. Consider to be either baseball fans or football fans. For such a
distance-based clustering first. Let c 1 ; c2 : : : cK be the K mining predicate, the upper envelope is a disjunction of
clusters, n be the number of attributes or dimensions of an the upper envelopes corresponding to each of the atomic
instance ~x and (c1k : : : cnk ) be the centroid of the k -th clus- mining predicates. Thus, if M ci denotes the predicate
ter. Assume a weighted Euclidean distance measure. Let
(w1k : : : wnk ) denote the weight values. Then, a point ~x is lW
(M:P rediction column = ci ), we can express the over-
all disjunct as: i=1 Mci
assigned to a cluster as follows:
Xn w 2
Join predicates between two predicted columns: An-
cluster of ~x = argmax k dk (xd cdk )
other form of join predicates is M1.Prediction column1 =
d=1 M2.Prediction column2. Such predicates select instances
This is similar in structure to Equation 2 with the prior term on which two models M1 and M2 concur in their predicted
missing. In both cases, for each component of ~x, we have class labels. An example of such a query is “Find all mi-
a set of K values corresponding to the K different clus- crosoft.com visitors who are predicted to be web devel-
ters/classes. We sum over these n values along each dimen- opers by two mining models SAS customer model and
sion and choose of these K sums the class with the largest SP SS customer model”. In order to optimize this query
sum. using upper envelopes, we assume that the class labels
For several model-based clusters the situation is similar. for each of the mining models can be enumerated during

P
Each group k is associated with a mixing parameter called
K
k ( k=1 k = 1) in addition to the parameters  k of the
optimization by examining the metadata associated with
the mining models. In typical mining models we expect
distribution function of that group. Thus, an instance will the number of classes to be quite small. Let the class
labels that are common to these two mining models be
be assigned to the cluster with the largest value of
cluster of ~x = argmax k (k fk (~xjk ))
fc1 ; c2 ; ::; ck g. Then, the above join predicate, is equiv-
W
k
alent to this disjunction: i=1 (M1.Prediction column1 =
When the distribution function f k treats each dimension in-
dependently, for example, mixtures of Gaussians with the
W
M2.Prediction column2 = c i ). Adopting the notation of the
previous paragraph, this can be expressed as: i (M 1ci ^
covariance entries zero, we can again express the above ex- M 2ci ). Note that if M1 and M2 are identical models, then
pression in the same form as Equation 2. the resulting upper envelope results in a tautology. Con-
Boundary-based clusters [18] explicitly define the versely, if M1 and M2 are contradictory, then the upper en-
boundary of a region within which a point needs to lie in velope evaluates to false and the query is guaranteed to re-
order to belong to a cluster. Deriving upper envelopes is turn no answers. These observations can be leveraged dur-
equivalent to covering a geometric region with a small num- ing the optimization process to improve efficiency.
ber of rectangles. This is a classical problem in computa-
tion geometry for which several approximate algorithms ex- Join predicates between a predicted column and
ist [30, 2]. Further investigation of this problem is part of a data column: Consider predicates of the form
our future work. M1.Prediction column = T.Data column that check if the
prediction of a mining model matches that of a database
4. Optimizing Mining Queries column. An example of this type of predicate is: “Find all
customers for whom predicted age is of the same category
So far we have considered examples of mining predi-
as the actual age”1 . Such queries can occur, for example, in
cates of the form “Prediction column = class label”. In Sec-
cross-validation tasks. Evaluation of the above query seems
tion 4.1, we show a wider class of mining predicates that
to require scanning the entire table. Fortunately, like in the
may be optimized using upper envelopes for mining predi-
previous paragraph, we can use the approach of enumerat-
cates of the above form. Then in Section 4.2 we discuss the
ing the set of possible class labels. Once again, such an
key steps needed in enabling such optimization in a tradi-
approach is feasible since in most mining models we ex-
tional relational database engine.
pect the number of class to be small. If the set of classes
4.1. Types of mining predicates
We discuss three additional types of mining predicates
W
are fc1 ; c2 ; ::; ck g, then, we can derive an implied predi-
cate i (M 1ci ^ T:Data column = ci ). This transforms
that can be optimized using the derived per-class upper en- the query to a disjunct or a union of queries. More impor-
velopes. tantly, we now have the option of leveraging the content
of the mining model for access path selection. For exam-
IN predicates: A simple generalization is mining pred-
icates of the form: M.Prediction column IN (c 1 ; : : : ; cl ), 1 In this example, we consider age as a discretized attribute with the

where c1 ; : : : ; cl are a subset of the possible class labels domain consisting of three categories: ‘’young”, “middle-aged’, “senior”.
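To illustrate the kind of rewriting this enables, the following hedged sketch (the table, columns and envelope predicates are invented for illustration; the paper does not give this code) expands a predicted-column/data-column join into the implied disjunction, assuming the per-class upper envelopes have already been derived as SQL boolean expressions:

# Illustrative only: expand M1.Prediction_column = T.Data_column into the implied
# disjunction OR_i (envelope(c_i) AND T.Data_column = c_i), given precomputed
# per-class upper envelopes over T's columns (hypothetical names throughout).
envelopes = {
    "young":       "(T.age_bucket IN (0, 1))",
    "middle-aged": "(T.age_bucket IN (2, 3) AND T.income_bucket <= 4)",
    "senior":      "(T.age_bucket >= 4)",
}

def rewrite_data_join(data_column, class_labels):
    disjuncts = [
        f"({envelopes[c]} AND {data_column} = '{c}')" for c in class_labels
    ]
    return " OR ".join(disjuncts)

where_clause = rewrite_data_join("T.actual_age", envelopes.keys())
print(f"SELECT * FROM T WHERE {where_clause}")

Each disjunct exposes both an ordinary data-column equality and the envelope's column predicates, which is what gives the optimizer the additional access-path choices discussed next.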
For example, for the i-th disjunct, the optimizer can potentially consider either the predicate T.Data column = c_i or a predicate in M1_{c_i} for access path selection. Of course, the final plan depends on other alternatives considered by the optimizer (including sequential scan) but our rewriting opens the door for additional alternatives. In addition to the above technique, the traditional approach of exploiting transitivity of predicates in the WHERE clause can also be effective. For example, if the query contains additional predicates on T.Data columns that indirectly limit the possible domain values M1.Prediction column can assume, then we can apply the optimization of the IN predicates discussed earlier in this section. For example, if the query were "Find all customers for which predicted age is the same as the actual age and the actual age is either old or middle-aged" then, via transitivity of the predicate, we get a predicate M.Prediction column IN ('old', 'middle-aged') for which we can add the upper-enveloping predicates as discussed in the earlier paragraphs.

4.2. Key Steps in Optimization of Mining Predicates

The framework for optimizing queries with mining predicates has two key parts. First, during training of the mining models, upper envelopes for mining predicates of the form Model.Prediction column = class label have to be precomputed using the algorithms described in Section 3. Precomputation of such "atomic" upper envelopes reduces overhead during query optimization. Second, during query optimization we optimize queries with mining predicates using the following key steps:

1. Apply traditional normalization and transitivity rules to the given query to derive an equivalent query to be used for the following steps.

2. For each mining predicate f in the query, do the following. Assume that the mining predicate f references a mining model m_f:

   (a) Look up the information on class labels of m_f from the database, if needed.
   (b) Depending on the type of the mining predicate, derive an additional upper envelope u_f using the techniques described in Section 4.1. Computation of such an upper envelope requires looking up "atomic" upper envelopes computed during training (see earlier in this subsection).
   (c) Replace m_f with m_f AND u_f.

3. Apply normalization and transitivity rules to derive an equivalent query. If new mining predicates are inferred, return to step 2, else return.

Our experiments demonstrate that the additional work during training to derive "atomic" upper envelopes as well as step 2(b) during query optimization add little additional overhead in themselves. However, our strategy for optimization relies on the following assumptions about query optimization and evaluation.

Complexity of upper envelopes does not impact execution cost: We assume the following: (a) The evaluation of upper envelopes does not add to the cost of the query. This is consistent with traditional assumptions made in database optimization since every upper envelope consists of an AND/OR expression of simple predicates. (b) The optimizer is well-behaved and is not misguided by the introduction of additional complex boolean predicates due to upper envelopes. We rely on optimizers whose selectivity computations and access path selections are robust for complex boolean expressions. Although we make the above two assumptions for simplicity, they rarely hold in all situations. Failure to satisfy condition (a) can be dealt with by more careful rewriting. For example, if none of the predicates in the upper envelope is chosen for the access path, the upper envelope can be removed at the end of the optimization. In general, we need to retain only a subset of relevant upper envelopes for evaluation as filter conditions. We omit these details due to lack of space. Unfortunately, handling violation of condition (b) is more challenging, yet happens routinely. Today's query optimizers often degenerate to sequential scan when presented with a complex AND/OR expression. This would negate any benefits of upper envelopes as the latter typically consist of several disjuncts over conjuncts of atomic predicates on the data columns. Despite past work (e.g., [28]), handling complex filter conditions remains a core challenge for SQL query optimizers. This remains an area of our active research in the context of query optimization. However, for the time being, we rely on thresholding of the number of disjuncts (see Section 3.2) and simplification based on selectivity estimates to limit the complexity so that commercial optimizers are able to exploit upper envelopes. We omit detailed discussion due to lack of space.

Accessing content of mining models during query optimization should be enabled: Our strategies for deriving upper envelopes (as described in Section 4.1) require access to the content of the mining models (e.g., class labels) during optimization. Such information is different from the traditional statistical information about tables because the correctness of our optimization is impacted if the mining model is changed. In such cases, we need to invalidate an execution plan (if cached or persisted) in case it had exploited upper envelopes. Nonetheless, our approach of leveraging the content of mining models is justified because mining models evolve slowly and the size of a typical mining model is relatively small compared to data size. Therefore, optimization time is not severely impacted by accessing the content of a mining model.
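The per-predicate rewriting in step 2 can be sketched as follows. This is an assumed illustration (the predicate and catalog representations are invented), not the system's actual interface:

# Assumed sketch of step 2: for each mining predicate, look up the precomputed
# "atomic" upper envelopes and conjoin the derived envelope with the predicate.
def add_upper_envelopes(mining_predicates, atomic_envelopes):
    """mining_predicates: list of (model, column, class_labels) for IN-style predicates.
    atomic_envelopes[(model, label)]: SQL boolean expression precomputed at training time.
    Returns each original predicate text ANDed with its derived envelope."""
    rewritten = []
    for model, column, labels in mining_predicates:
        envelope = " OR ".join(atomic_envelopes[(model, c)] for c in labels)
        original = f"{model}.{column} IN ({', '.join(repr(c) for c in labels)})"
        rewritten.append(f"({original}) AND ({envelope})")
    return rewritten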
Data Set        Test size (in millions)   Training size   # of classes   # of clusters
Anneal-U        1.83                      598             6              6
Balance-Scale   1.28                      416             3              5
Chess           1.63                      2130            2              5
Diabetes        1.57                      512             2              5
Hypothyroid     1.78                      1339            2              5
Letter          1.28                      15000           26             26
Parity5+5       1.04                      100             2              5
Shuttle         1.85                      43500           7              7
Vehicle         1.73                      564             4              5
Kdd-cup-99      4.72                      100000          23             23

Table 2. Summary of Data Sets used in experiments

5. Experiments

In this section, we present results of experiments to evaluate the effectiveness of upper envelope predicates generated by the algorithms presented in Section 3. Our experiments focussed on three important aspects: (i) Impact of upper envelope predicates on the running time and physical plan of queries. We study this in Section 5.2.1. (ii) Degree of tightness of the approximation, studied in Section 5.2.2. (iii) Time taken to generate upper envelope predicates. The significant outcome of the last experiment was that in almost all data sets the time to precompute the upper envelope predicate for each class (see Section 4.2) was a negligible fraction of the model training time. Likewise, the time to look up "atomic" upper envelope predicates was insignificant compared to the time for optimizing the query. We do not present further details of this experiment due to lack of space.

5.1. Experimental Setup

Mining Models: We have implemented the algorithms presented in Section 3 for the decision tree, naive Bayes and clustering mining models. We generated decision tree and clustering mining models using Microsoft Analysis Server that ships with Microsoft SQL Server 2000. For generating naive Bayes mining models we used the discrete naive Bayes inducer packaged with the MLC++ machine learning library [23].

Data Sets: We report numbers on 10 data sets consisting of 9 UCI [7] data sets and the 1999 KDDcup data set available at [5]. Table 2 summarizes various characteristics of each data set. We generated the test data set (for the UCI data sets) by repeatedly doubling all available data until the total number of rows in the data set exceeded 1 million rows. This way, the data distribution of each column (and hence selectivity of predicates on the column) in the test data set is the same as in the training data set. All data sets were stored in Microsoft SQL Server databases.

Implementation: When executing a mining query, we first identify the mining model object(s) referenced in the query and identify mining predicates for which generation of upper envelopes may be possible. In our current implementation, generation of upper envelope predicates is not integrated with the database engine; rather we rewrite the mining query externally to include the upper envelope predicates, and submit the rewritten query to the database engine. The upper envelopes are generated during training time by referring to the MINING MODEL CONTENT schema rowset defined in the OLE DB for Data Mining [15] interface.

Evaluation Methodology: For each class (or cluster), we first generate the query with the upper envelope predicate for that class. Thus, if T is the table containing the test data, and <p> is the upper envelope predicate, we generate the query "SELECT * FROM T WHERE <p>". We create a workload file containing all queries for the (data set, mining model) combination. Thus, the number of queries in the workload file is equal to the number of classes (or clusters) for that (data set, mining model) combination. To generate an appropriate physical design for this workload, we invoke the Index Tuning Wizard tool [12, 4] that ships with Microsoft SQL Server 2000 by passing it the above workload file as input, and implement the index recommendations proposed by the tool. We then execute the workload on the database and record the plan and running time of each query in the workload. We compare this with a query that performs a full scan of the table, i.e., "SELECT * FROM T".

Although in practice mining queries may also contain other predicates, the above comparison with a "SELECT *" query is reasonable since our goal is to determine if addition of upper envelopes can reduce running time in a significant number of cases (due to indexed access path selection). Whether upper envelope predicates are indeed chosen over other predicates for indexing will of course depend on other predicates and their relative selectivity. Finally, a design that stores the class label with each tuple (e.g., as an additional column) in the base relation is not acceptable since (a) it does not scale well with the number of mining models and (b) in many cases, mining queries are issued not over the base relations but on queries (or views) over possibly multiple base relations. Note that such precomputation of the class label may however be appropriate in limited cases (e.g., in materialized views).

5.2. Results

5.2.1. Impact of Upper Envelope Predicates on Running Time and Plan

We first evaluate the impact of upper envelope predicates on the running time of all queries for all mining models. The following table shows the average reduction in running time over all queries for each type of mining model, compared to a full scan of the data.
[Figure 3. Impact of upper envelope predicates on physical plan for decision tree model. Bar chart titled "Decision Tree Mining Model: Impact on Plans": percentage of plans changed (0-100%) for each data set.]

[Figure 4. Impact of upper envelope predicates on physical plan for naive Bayes model. Bar chart titled "Naive Bayes Mining Model: Impact on Plans": percentage of plans changed (0-100%) for each data set.]

[Figure 5. Impact of upper envelope predicates on physical plan for clustering model. Bar chart titled "Clustering Mining Model: Impact on Plans": percentage of plans changed (0-100%) for each data set.]
We note that the reduction in running time we report here is in comparison to a "SELECT *", which does not include the time for actually invoking the mining model on the columns. If the application of mining models is time consuming, then we can expect to see an even greater percentage reduction.

    Decision Tree   Naive Bayes   Clustering
    73.7%           63.5%         79.0%

To further analyze the reason for the reduced running time, we measured the impact of the upper envelope predicates on the physical plan chosen by the query optimizer. For a given data set and mining model, we recorded for each query whether the plan chosen by the query optimizer changed compared to the query without upper envelope predicates. A plan is said to have changed if either: (a) The query optimizer chose one or more indexes to answer the query. (b) The query optimizer decided to use a "Constant Scan" operator since the upper envelope predicate was NULL (i.e., it does not need to reference the data at all to answer the query). The table below shows the percentage of queries for which the plan changed over all data sets and mining models.

    Decision Tree   Naive Bayes   Clustering
    72.7%           75.3%         76.6%

As we can see from this table, for all types of mining models, a significant fraction of the queries had their physical plans altered as a result of introducing upper envelope predicates.

We now analyze these results further by drilling down into the results for each data set. Figures 3, 4 and 5 show these numbers for the decision tree, naive Bayes and clustering mining models respectively. We observe that upper envelope predicates have greater impact on the plan for data sets where the number of classes is relatively large (e.g., kddcup, letter, shuttle etc.), and less impact for data sets where the number of classes is small (e.g., Diabetes, Parity etc.). This is due to the fact that when the number of classes is large, there are typically more classes with small selectivity for which the query optimizer picks an index to answer the query. In fact, in some cases, the selectivity is 0, i.e., the upper envelope predicate is NULL. In such cases, the optimizer does not need to access any data to answer the query. A more detailed analysis of the average reduction in running time as a function of the selectivity (both original and upper envelope) of the class/cluster over all classes and clusters of all mining models and data sets is shown in Figure 6. We see that the reduction in running time is most significant when the selectivity is below 10%. Also, a comparison of the bars for original and upper envelope selectivities shows that the low reduction in running time for higher selectivities is not a reflection of the effectiveness of our algorithm. Rather, when a predicate's selectivity is high (e.g., above 10%) the optimizer rarely selects indexes, particularly non-clustered indexes. Thus, for high selectivity classes, adding upper envelope predicates is rarely useful, even if we could find exact predicates.

Finally, we noticed that in many cases, the upper envelope predicates generated by our algorithms for these data sets are relatively simple, i.e., consisting of few disjuncts. This increases the likelihood that the query optimizer can use an index lookup to answer the query. Overall, this experiment confirms our intuition that inclusion of upper envelope predicates significantly impacts the plan, and hence running times, of queries with mining predicates.
[Figure 6. Running Time improvement vs. Original Selectivity: All mining models and data sets. Bar chart titled "Reduction in Running Time vs. Selectivity": average reduction in running time (0-100%) for selectivity buckets 0.0-0.01, 0.01-0.1, 0.1-0.2 and 0.2-1.0, with separate bars for the Original and Upper-Envelope selectivities.]

[Figure 7. Tightness of approximation: naive Bayes and clustering. Scatter plot of Original Selectivity vs. Upper Envelope Selectivity, both on log scales from 0.00001 to 1.]

5.2.2. Tightness of Approximation

In this experiment, we compare the tightness of approximation of the upper envelopes for naive Bayes and clustering. For decision trees, since the upper envelopes are exact, this comparison is not necessary. Figure 7 shows, for all classes in all data sets for the naive Bayes and clustering mining models, a scatter plot of the original selectivity of each class vs. the selectivity of the corresponding upper envelope predicate (on a log scale). Each point in the scatter plot corresponds to one class of a data set.

As we see from the figure, a significant fraction of the upper envelope predicates either have selectivities close to the original selectivity or have selectivity small enough that use of indexes for answering the predicate is attractive. Most cases where the algorithm failed to find a tight upper envelope correspond to cases where the original selectivity is large to start with. In such cases, the upper envelope predicates are unlikely to be useful for improving access paths even if they were exact.

6. Related Work

Our work falls in the broader area of integration of data mining and database systems and there are several pieces of related work in that area. The case for building the infrastructure for supporting mining on not only stored results but also on the result of an arbitrary database query was made in [9]. Agrawal et al. [3] looked at the problem of generating decision tree classifiers such that the predicate could easily be pushed into the SQL query. However, they do not discuss how the method will work for other mining models. Other complementary areas of work include construction of mining models using SQL [31] and defining language extensions and application programming interfaces for integrating mining with relational DBMSs [26], [1] and [15]. More recently, database systems such as Microsoft Analysis Server or IBM DB2 have enabled specification of such queries. However, none of these systems exploit mining predicates for optimization in a general setting. Our paper represents the first work in that direction.

Our work can be viewed as part of the broader field of semantic query optimization. Early work in database systems recognized the value of query modification (e.g., in INGRES) whereby a semantically implied predicate, perhaps derived from integrity constraints, is added to make evaluation of the query more efficient. Our technique follows the same approach but our novelty is in the specific information we exploit - the internal structure of the mining model to derive upper envelopes. To the best of our knowledge, this has not been attempted before. There has been past study of upper envelopes that represent approximations to given recursive queries [11, 8] but these do not apply to mining predicates.

Recently, there has been work on optimization of user-defined functions and predicates [19, 13, 32]. Mining predicates can certainly be viewed as user-defined predicates. Thus, it is an interesting research question whether our idea of deriving implied database predicates based on the content of mining models can be effectively applied to other examples of user-defined predicates as well.

The problem of rule extraction from hard-to-interpret models like neural networks [24, 16] bears resemblance, but differs from our problem in that the extracted rules need to approximate the classification function but are not required to be implied predicates (upper envelopes). Moreover, the algorithm for rule learning as proposed in [24] requires an enumeration of the discretized input space, similar to our first-cut bottom-up algorithm. Such an approach has been shown to be infeasible in our case.

The coverage problem has been addressed in several different contexts, including covering a set of points with the smallest number of rectangles [30, 2], covering a collection of clauses with simpler terms in logic minimization problems [20] and constructing clusters with rectilinear boundaries [6]. Despite the apparent similarities, the coverage problems differ in a few important aspects. First, they assume that the points are already enumerated in the n-dimensional space. This is not a feasible option in our case. Next, the first two problems require an exact cover of
smallest size whereas we only need an upper envelope. Finally, most of these approaches assume a small number of dimensions (two or three) and do not scale to higher dimensions.

References

[1] SQL multimedia and application packages part 6: Data mining, ISO draft recommendations, 1999.
[2] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. ACM SIGMOD International Conf. on Management of Data, Seattle, USA, June 1998.
[3] R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, and A. Swami. An interval classifier for database mining applications. In Proc. of the VLDB Conference, pages 560–573, Vancouver, British Columbia, Canada, August 1992.
[4] S. Agrawal, S. Chaudhuri, L. Kollar, and V. Narasayya. Index tuning wizard for Microsoft SQL Server 2000. White paper. http://msdn.microsoft.com/library/techart/itwforsql.htm, 2000.
[5] S. D. Bay. The UCI KDD archive [http://kdd.ics.uci.edu]. Irvine, CA: University of California, Department of Information and Computer Science, 1999.
[6] M. Berger and I. Rigoutsos. An algorithm for point clustering and grid generation. IEEE Transactions on Systems, Man and Cybernetics, 21(5):1278–86, 1991.
[7] C. Blake and C. Merz. UCI repository of machine learning databases, 1998.
[8] S. Chaudhuri. Finding nonrecursive envelopes for datalog predicates. In Proceedings of the Twelfth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 25-28, 1993, Washington, DC, pages 135–146, 1993.
[9] S. Chaudhuri. Data mining and database systems: Where is the intersection? In Bulletin of the Technical Committee on Data Engineering, volume 21, Mar 1998.
[10] S. Chaudhuri and U. Dayal. An overview of data warehouse and OLAP technology. ACM SIGMOD Record, March 1997.
[11] S. Chaudhuri and P. G. Kolaitis. Can datalog be approximated? In Proceedings of the Thirteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 24-26, 1994, Minneapolis, Minnesota, pages 86–96, 1994.
[12] S. Chaudhuri and V. R. Narasayya. An efficient cost-driven index selection tool for Microsoft SQL Server. In VLDB'97, Proceedings of 23rd International Conference on Very Large Data Bases, August 25-29, 1997, Athens, Greece, pages 146–155, 1997.
[13] S. Chaudhuri and K. Shim. Optimization of queries with user-defined predicates. In VLDB'96, Proceedings of 22nd International Conference on Very Large Data Bases, September 3-6, 1996, Mumbai (Bombay), India, pages 87–98, 1996.
[14] W. W. Cohen. Fast effective rule induction. In Proc. 12th International Conference on Machine Learning, pages 115–123. Morgan Kaufmann, 1995.
[15] Microsoft Corporation. OLE DB for data mining. http://www.microsoft.com/data/oledb.
[16] M. W. Craven and J. W. Shavlik. Using neural networks for data mining. In Future Generation Computer Systems, 1997.
[17] J. Dougherty, R. Kohavi, and M. Sahami. Supervised and unsupervised discretization of continuous features. In Proc. 12th International Conference on Machine Learning, pages 194–202. Morgan Kaufmann, 1995.
[18] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. of the 2nd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oregon, August 1996.
[19] J. M. Hellerstein and M. Stonebraker. Predicate migration: Optimizing queries with expensive predicates. In SIGMOD Conference, pages 267–276, 1993.
[20] S. J. Hong. MINI: A heuristic algorithm for two-level logic minimization. In R. Newton, editor, Selected Papers on Logic Synthesis for Integrated Circuit Design. IEEE Press, 1987.
[21] IBM. IBM Intelligent Miner Scoring, Administration and Programming for DB2 Version 7.1, March 2001.
[22] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, 1990.
[23] R. Kohavi, D. Sommerfield, and J. Dougherty. Data mining using MLC++: A machine learning library in C++. In Tools with Artificial Intelligence, pages 234–245. IEEE Computer Society Press, available from http://www.sgi.com/tech/mlc/, 1996.
[24] H. Lu, R. Setiono, and H. Liu. NeuroRule: A connectionist approach to data mining. In Proc. of the Twenty-first Int'l Conf. on Very Large Databases (VLDB), Zurich, Switzerland, Sep 1995.
[25] G. McLachlan and K. Basford. Mixture Models: Inference and Applications to Clustering, 1988.
[26] R. Meo, G. Psaila, and S. Ceri. An extension to SQL for mining association rules. Data Mining and Knowledge Discovery, 2(2):195–224, 1998.
[27] T. Mitchell. Machine Learning. McGraw-Hill, 1997.
[28] C. Mohan, D. Haderle, Y. Wang, and J. Cheng. Single table access using multiple indexes: optimization, execution, and concurrency control techniques. In Proc. International Conference on Extending Database Technology, pages 29–43, 1990.
[29] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[30] R. A. Reckhow and J. Culberson. Covering a simple orthogonal polygon with a minimum number of orthogonally convex polygons. In Proc. of the ACM 3rd Annual Computational Geometry Conference, pages 268–277, 1987.
[31] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with databases: alternatives and implications. In Proc. ACM SIGMOD International Conf. on Management of Data, Seattle, USA, June 1998.
[32] W. Scheufele and G. Moerkotte. Efficient dynamic programming algorithms for ordering expensive joins and selections. In Proc. of the 6th Int'l Conference on Extending Database Technology (EDBT), Valencia, Spain, 1998.
Chapter 9
Web Services and Data Bases

In this section we include a collection of papers motivated by the emergence of the World Wide
Web (WWW) as the major delivery mechanism for application services as well as information
from both data bases and text files. There are three thrusts that we focus on. First, the web
requires a different application architecture for DBMS services than commonly utilized in the
past. Hence, we first discuss architectural issues. Next, the web has prominently showcased the
retrieval and delivery of textual information. Hence, our second thrust is on querying textual
data. We then close with a discussion of “nirvana”, in the form of integrating textual data and
conventional structured data.

When the web was envisioned by Tim Berners-Lee, he had in mind a hypertext-oriented system
whereby users could link documents together with hyperlinks (URLs). In this way, a web of
inter-related documents could be constructed. Hypertext is obviously not a new idea, and linking
textual objects together has been a common theme for a long time. The web would have been yet
another hypertext proposal without the contribution of Marc Andreessen, who added an easy-to-
use GUI for the web called Mosaic. With Andreessen's browser, a user could easily navigate
Berners-Lee’s web of documents.

Web protocols (HTTP, HTML) were designed with documents in mind. However, the web
quickly morphed into a delivery mechanism for conventional applications, especially ones that
interact with data bases. Major web sites such as Amazon.com, Ebay.com, and United.com are
front ends for large DBMS applications. As such, the web is now largely about delivery of
application services and not textual documents. This use of the web for something that was never
intended by the original developers has caused major headaches. Specifically, the web is
“stateless”, i.e. each HTTP request is independent of the previous one. In contrast, the interface
between a GUI and a typical application as well as the interface between an application and the
DBMS are very “stateful”. For example, an application will often maintain state about a user
session, such as the following:

• The user name and password for the session


• The application operations that the user is authorized to perform
• The maximum resource level that the user can consume

When an application is interacting with a DBMS, the state of this session includes:

• The user name and password on whose behalf the session is running
• The connection (socket) on which the user is communicating
• The current transaction
• The cursor position in a result set of records

As a result, web protocols present a big challenge to any system architect, namely how to
simulate “state” on top of a protocol that is fundamentally stateless. The typical resolution of this
dilemma is shown in Figure 9.1. On the user’s computer, the only program that is run is a web
browser. Hence, no part of the application exists on the client’s machine, and this architecture is
the ultimate in “thin client”. Although Java applets can be added to most web browsers to run
part of the client's application, they have never become popular. The reasons are varied, but
revolve around universality (you have to run on everybody’s browser), maintainability (what
about site specific applet bugs), and load times (it takes forever to upload a large applet,
especially over a modem to a home user).
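As a concrete illustration of the state problem (not taken from the papers in this chapter), the usual workaround is to hand the browser an opaque session identifier, typically in a cookie, and to keep the real session state on the server side, in the app server or the DBMS. A minimal sketch, with invented names and an in-memory dictionary standing in for wherever the state actually lives:

# Minimal sketch (invented example) of simulating session state over stateless HTTP:
# the browser carries only an opaque session id in a cookie; the server keeps the
# real state (user, authorization, cursor position, ...) keyed by that id.
import uuid
from http.server import BaseHTTPRequestHandler, HTTPServer

SESSIONS = {}  # in practice this lives in the app server tier or the DBMS

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Pull the session id out of the Cookie header, if any.
        sid = None
        for part in self.headers.get("Cookie", "").split(";"):
            name, _, value = part.strip().partition("=")
            if name == "sid":
                sid = value
        if sid not in SESSIONS:               # new or expired session
            sid = uuid.uuid4().hex
            SESSIONS[sid] = {"user": None, "hits": 0}
        SESSIONS[sid]["hits"] += 1            # per-session state survives across requests
        self.send_response(200)
        self.send_header("Set-Cookie", f"sid={sid}")
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(f"request #{SESSIONS[sid]['hits']} in this session\n".encode())

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), Handler).serve_forever()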

The web browser specifies a URL to initiate communication via HTTP with a foreign web server.
If the web server is part of a large application site such as Amazon, then it is really one of a
hundred or more individual machines, often called “blade” servers because of their
interchangeable nature – they are typically commodity “jelly bean” hardware running Linux and
the Apache web server. Such computing “blades” can be cheaply expanded as traffic increases.
In front of this “web server farm” is a “packet-spraying” router, which load-balances incoming
messages among the farm. The only job of the web servers is to process HTTP requests and
responses; no part of the application runs on this farm. Each web server has a simple “plugin”
program (e.g. a script interpreter like PHP, PERL or Microsoft’s ASP) to communicate with the
next level, which is invariably an application server.

One might ask why the web server tier is needed in this architecture. It is certainly possible to
bypass the web server level and communicate directly from the client level to the app server
level. To accomplish this, one must run a Java applet on the client that can open up a non-HTTP
connection directly to the app server. Besides the applet disadvantage mentioned previously,
there is one additional killer problem such an architecture faces: most firewalls are configured to
admit HTTP traffic but no other protocols. Because there is invariably a firewall at the front of
the site, guarding against unauthorized accesses, proprietary protocols are essentially never used
at this level. Finally, there are usually certain very common simple tasks (e.g. delivering static
content like the site’s home page) that the web server can handle, thus offloading the application
servers.

Each web server communicates with an application server, and there are typically a dozen or
more “blades” performing this function. Again, one wants to easily expand the number of
machines as traffic grows. Moreover, most app servers are capable of performing load balancing
across a collection of machines, so there is no need for a router in front of the “app server farm”.
The job of the app server is to run the application with which the user is communicating.
WebLogic from BEA and WebSphere from IBM are the dominant app server products, with the
remainder of the market spread over dozens of vendors.

The app server communicates with the underlying DBMS using native DBMS session-oriented
protocols. There is often only a single large machine (say a Sun E10000 or E15000) running the
DBMS. Architects often utilize a single large machine, rather than many small ones at the DBMS
level, because of the difficulty of ensuring that each transaction touches only one machine.
Otherwise, only two choices remain: the database and workload can be strictly partitioned (e.g.
by having separate virtual “stores” for different product categories), or expensive two-phase
commits are required to commit transactions across DBMS servers. Neither of these is attractive
at the high end. If the capacity of a large shared memory multiprocessor (SMP) machine is
exceeded, then the decision is often to move to a shared-disk parallel (cluster) architecture, again
to minimize the difficulty of transaction commits.

The observant reader can note that the DBMS is essentially unaffected by the web. The major
difference between Figure 9.1 and a traditional client-server architecture is that an app server is the
object making DBMS requests, rather than a client machine. Moreover, on mainframe
configurations where a large number of clients need application service, architects often use a
transaction processing (TP) monitor such as CICS or IMS/DC. A TP monitor performs
essentially the same function as the application servers do in Figure 9.1. In summary, the web is
isolated above the DBMS by an app server, which functions very much like TP monitors have
behaved on mainframes for a quarter of a century. This is a case where the statement that “the
web changed everything” does not hold.

On the other hand, there are huge demands put on application servers by the web. Because DBMS
experts must often get involved with application server issues, we have included a paper by Dean
Jacobs of BEA as our first selection in this section. It highlights some of the distinctions between
traditional TP monitors and modern app servers, and also focuses on the different kinds of state
that have to be managed in web applications, and the way these kinds of state are managed, either
in the app server or in the DBMS.

The second important aspect of web services is document retrieval. When the web is used for its
original purpose as a document repository, a user has the problem of finding any particular
document for which he is searching. Fairly early in the web’s evolution, a collection of sites
sprung up which “crawl” the web looking for publicly accessible documents. Then, they provide
keyword indexing for these documents in a web-wide index. Subsequently, a user can request the
documents that match a collection of keywords, and the site will return a collection of documents,
sorted in perceived order of relevance. Example sites of this sort include AltaVista, Google, and
Lycos, as well as subservices of portals like Yahoo!, Netscape and MSN. Obviously, these sites
have a massive data management problem, yet none use packaged database systems as part of
their solution. The next paper in this section by Eric Brewer, the architect of Inktomi (now
owned by Yahoo!), explains how document indexing sites work in database system terms, why
packaged DBMS solutions are not utilized in their design, and why the search engine design
diverges from traditional DBMSs.

If you have ever typed an important but common query – e.g. the name of a frequently occurring
medical condition that concerns you or a family member – you have experienced the frustration
of receiving 10,000 or so “hits”, and the resulting difficulty of finding the documents you are
interested in and trust amid all the “clutter”. The job of professional librarians is to organize large
collections, so users can find documents of interest. However, it is clearly impossible to hire
enough librarians to organize the web; hence the web will always be a place full of “clutter”.

As such, it is important to develop better algorithms to determine relevance of a document to a
particular user search. Traditional keyword indexing has its roots in the 1950’s [Luhn59], and
many of today’s most common commercial techniques (e.g. inverted indexes and the well-known
TFxIDF ranking metrics) have been around since the 1960’s [SL68]. The interested reader is
referred to an Information Retrieval textbook (e.g. [WMB99]) for more information on these.
Unfortunately, these basic techniques alone do poorly in the presence of large amounts of clutter.
To perform better a variety of add-on improvements have been suggested. One of the most
interesting, proposed by the Google developers, was to leverage the web’s additional hyperlink
information to help determine relevance; a related scheme was proposed by researchers at Cornell
and IBM [Klei99]. This tactic seems to work very well in practice, and for a while allowed
Google to give better answers than its competitors. According to scuttlebutt, all of the major
vendors now incorporate the Google technology, but that technology alone is still insufficient.
The icing on the cake apparently comes from considering the visual aspects of web documents –
essentially all search engines utilize the prominence of the word in the document (i.e. header,
title, author, boldface, etc.) to improve relevance. Hopefully, the technology will gradually
improve, leading to better facilities in the future. As an example of this class of system, we have
included one of the original Google architecture papers as our next selection in this section. It
reviews the Google ranking scheme, along with the architecture of Google’s server farm, which is
now fairly standard and was largely taken from Brewer and co.’s design at Inktomi.
The web has obviously catapulted information retrieval and document management into the
forefront of Computer Science. However, word-oriented technology has fundamental limits of
how well it can possibly perform. To do better, there are at least four directions being pursued.

First, one can develop sites that are specialized to a particular class of documents. In a
specialized domain, one should be able to do much better than general purpose sites. For
example, Charles Schwab is providing retrieval in the constrained universe of financial
documents. Such a constrained universe has two desirable features. First, the set of reasonable
queries is limited (you don’t ask Charles Schwab about the meaning of God). Second, and
perhaps more important, an expert can take the time to teach the system about the idioms and
slang that are typically used in the limited world (Any Schwab system must know about insider
trading, straddles, 401Ks, 403Bs, etc.). The next paper in this chapter is on BINGO!, a system for
efficiently doing “focused” crawls for specialized search sites.1

Second, one can exploit natural language (NL) understanding techniques to parse documents and
decipher their meaning. In addition, instead of accepting a collection of keywords from the user,
one could accept a question, and then parse it using NL techniques. NL has been around since the
1950’s, and slow steady improvements have been made. In our opinion, a general purpose NL
system that works well is still considerably beyond the state of the art. In the meantime,
techniques that use simpler notions of “understanding” taken from information retrieval – e.g. the
frequencies and co-occurrences of words – tend to perform better than any sophisticated attempts
at “understanding” language in a deep sense.

However, in constrained vertical markets NL systems are finding acceptance. Moreover, in
constrained universes, an NL system can be front-ended by a speech understanding facility. This
allows spoken input, rather than typed input, a definite advantage. Speaker-independent speech-
understanding systems are currently in use for stock quotations, telephone number lookup, airport
flight status, etc. We expect speech understanding and NL to make slow steady progress off into
the future, thereby enlarging the collection of constrained universes for which they work well.

A third area of exploration is site description facilities. Often a user is looking for a site, rather
than a specific document. For example, he might want a site that compares the safety records of
various car models. This will not be answered by a specific document, but by a site like
Consumer Reports. To find such "services", one can use the keyword search available in a public
indexing service and try to find the “home page” of a service using traditional technology.
However, much better results would be obtained if all services entered a stylized description of
what they do into a global repository. The user could then search through these descriptions for
services of interest. This is the goal of UDDI, pioneered by Microsoft. Over time, we expect
XML-based repositories of service capabilities to be widely available, and provide much better
“site finding” capabilities than traditional techniques.

However, it is the fourth area in which we have the most interest. Obviously, there are a large
number of unstructured documents available on the web. Such objects have no structure, other
than what can be deciphered using NL techniques. As noted in the first paper of this book, there
is a certain amount of “structured text” available on the web, such as want ads and resumes.
However, we do not expect this kind of data to increase radically in the future, for the reasons that
were also noted in the first paper. Additionally, there is a considerable amount of "structured data
with text fields”. However, there is a truly massive amount of structured data available on the
web. Not only is there a large amount of "facts and figures", such as telephone directories,
demographic data, and weather data, but there is also an even larger amount of transactional data.
This data includes the status of your shipment, your credit card transactions, etc. The latter is
usually available only through programmatic interfaces from the web. Hence, one must fill out a
form and be authorized to receive the data in question.

Based on these observations, there is a huge amount of leverage that would result from being able
to simultaneously query both structured data and text. For example, a user might want to know of
any news reports that mentioned a stock that changed in value by more than 1%. Answering such a
query requires a "join" between financial ticker data and a news feed, which in turn requires a
federated data base system that "wraps" disparate data sources with "gateways" to construct a
common structured data model. The Mariposa system discussed in Chapter 2 of this book was an early example of this
sort of architecture. This approach bases its underlying processing mechanism on SQL and an
Object-relational data model.
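As a toy illustration of the structured-plus-text query in the example above, the sketch below joins an
in-memory stand-in for a ticker table with a small list of news stories. The data, the 1% threshold check,
and the keyword-containment test are all invented for the example; a real federated system would reach
the sources through wrappers and would use a proper text index rather than substring matching.

    ticker = [
        {"symbol": "ACME", "pct_change": 2.3},
        {"symbol": "GLOBEX", "pct_change": 0.4},
        {"symbol": "INITECH", "pct_change": -1.7},
    ]
    news = [
        {"headline": "ACME beats earnings estimates", "body": "Shares of ACME rallied today."},
        {"headline": "GLOBEX announces merger talks", "body": "GLOBEX confirmed discussions."},
        {"headline": "INITECH recalls flagship product", "body": "INITECH cited safety concerns."},
    ]

    def big_movers(rows, threshold_pct):
        # Structured predicate over the "ticker" source.
        return [r for r in rows if abs(r["pct_change"]) > threshold_pct]

    def mentions(story, symbol):
        # Text predicate over the "news" source: crude keyword containment.
        return symbol.lower() in (story["headline"] + " " + story["body"]).lower()

    # The cross-source "join": news stories mentioning stocks that moved more than 1%.
    hits = [(r["symbol"], s["headline"])
            for r in big_movers(ticker, 1.0)
            for s in news if mentions(s, r["symbol"])]
    print(hits)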

Another approach would be to use XML and semi-structured data as the underlying model. In
fact, BEA has recently introduced an XML-based federation system, called Liquid Data. Since
XML is fundamentally semi-structured, this requires substantial changes to the federation
optimizer in order to function on this kind of data. The remaining papers in this section deal with
XML processing. The first, by Abiteboul, presents a well-written summary of the arguments for
semi-structured data. For a counterpoint, the reader is advised to read the XML section of the
first paper in this collection. The final two papers in this section discuss ways to perform query
processing on semi-structured data.

We close this section with a final comment on XML. Many people now believe that future
DBMSs will have to process both SQL-oriented data and XML data in the same system. Of
course, it is a fairly traditional schema design problem to “shred” XML objects into a collection
of tables, and then base all internal processing on SQL. A second approach is to build a native
XML-based DBMS, which views SQL tables as a special case of XML objects. Of course, there
is a lot of work to be done to figure out a sensible update model and view model for an XML-
based world. This may be more difficult than it would appear, given the historical lesson from
IMS logical data bases. A third option is to build a single DBMS that can switch modes, and
process either kind of data. Essentially all vendors are taking the first approach in the short run.
However, there is at least one large vendor that is actively pursuing each of the latter options. It
will be interesting to see which approach is the long term winner.
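To illustrate the first ("shredding") approach, here is a minimal sketch that flattens a small XML fragment
into rows of a relational table using only the Python standard library; the element names, columns, and
data are made up for the example.

    import sqlite3
    import xml.etree.ElementTree as ET

    # A hypothetical XML fragment to be shredded into a table.
    doc = """
    <ads>
      <ad id="1"><title>Sailboat for sale</title><price>4500</price></ad>
      <ad id="2"><title>Vintage bicycle</title><price>120</price></ad>
    </ads>
    """

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ad (id INTEGER PRIMARY KEY, title TEXT, price REAL)")

    # Shred: one row per <ad> element, one column per attribute or child element.
    for ad in ET.fromstring(doc).findall("ad"):
        conn.execute("INSERT INTO ad VALUES (?, ?, ?)",
                     (int(ad.get("id")), ad.findtext("title"), float(ad.findtext("price"))))

    # All further processing is ordinary SQL over the shredded tables.
    print(conn.execute("SELECT title FROM ad WHERE price < 1000").fetchall())

A native XML DBMS would instead store and index the tree directly, which is exactly where the open
questions about update and view models mentioned above arise.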

References

[Luhn59] Hans Peter Luhn. "Auto-encoding of Documents for Information Retrieval Systems." In M. Boaz (ed.),
Modern Trends in Documentation, pp. 45-58. London: Pergamon Press, 1959.

[SL68] Gerard Salton and Michael Lesk. "Computer Evaluation of Indexing and Text Processing." J. ACM 15(1):
8-36, 1968.

[WMB99] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes (2nd ed.). Morgan Kaufmann,
San Francisco, CA, 1999.

[Klei99] Jon M. Kleinberg. "Authoritative Sources in a Hyperlinked Environment." J. ACM 46(5): 604-632, 1999.

Figure 9.1: A Typical Web Application Architecture


Combining Systems and Databases: A Search Engine Retrospective
Eric A. Brewer
University of California at Berkeley

Although a search engine manages a great deal of ing the index up to date with the data, nearly all of
data and responds to queries, it is not accurately which is remote and relatively awkward to access. Most
described as a “database” or DMBS. We believe that it data is “crawled” using the HTTP protocol and some
represents the first of many application-specific data automation, although there is also some data exchange
systems built by the systems community that must exploit via XML.
Despite the size and complexity of these systems,
the principles of databases without necessarily using the
they make almost no use of DBMS systems. There are
(current) database implementations. In this paper, we
many reasons for this, which we cover at the end, but
present how a search engine should have been designed the core hypothesis here is that, looking back, search
in hindsight. Although much of the material has not engines should have used the principles of databases,
been presented before, the contribution is not in the spe- but not the artifacts, and that other novel data-intensive
cific design, but rather in the combination of principles systems should do the same (covered in Section 8).
from the largely independent disciplines of “systems” These principles include:
and “databases.” Thus we present the design using the Top-Down Design: The traditional systems
ideas and vocabulary of the database community as a methodology is “bottom up” in order to deliver
model of how to design data-intensive systems. We then capabilities to unknown applications. However,
draw some conclusions about the application of data- DBMSs are designed “top down”, starting with the
base principles to other “out of the box” data-intensive desired semantics (e.g. ACID) and developing the
systems. mechanisms to implement those semantics. SEs are
also “whole” designs in this way; the semantics are
different (covered below), but the mechanisms
1 Introduction should follow from the semantics.
Search engines (SEs) are arguably the largest data Data Independence: Data exists in sets without
management systems in the world; although there are pointers. This allows evolution of representation
larger databases in total storage there is nothing close in and storage, and simplifies recovery and fault
query volume. A modern search engine handles over 3 tolerance.
billion documents, involving on the order of 10TB of
Declarative Query Language: the use of language to
data, and handles upwards of 150 million queries per define queries that says “what” to return not “how”
day, with peaks of several thousand queries per second. to compute it. The absence of “how” is the freedom
This retrospective is based primarily on almost nine
that enables powerful query optimizations. We do
years of work on the Inktomi search engine, from the not however use SQL (a DBMS artifact), but we do
summer of 1994 through the spring of 2003. It also
use the structure of a DBMS, with a query parser
reflects some the general issues and approaches of other and rewriter, a query optimizer, and a query
major search engines — in particular, those of Alta executor. We also define a logical query plan
Vista, Infoseek and Google — although their actual spe-
separate from the physical query plan.
cifics might differ greatly from the examples here.
Although queries tend to be short, there are more The fundamental problem with using a DBMS for a
than ten million different words in nearly all languages. search engine is that there is a semantic mismatch. The
This is a challenge for two reasons. First, implies track- practical problem was that they were remarkably slow:
ing and ranking ten million distinct words in three bil- experiments we performed in 1996 on Informix, which
lion documents including the position and relative had cluster support, were an order of magnitude slower
importance (e.g. title words) of every word. Second, than the hand-built prototype, primarily due to the
with so few words per query, most queries returns thou- amount of specialization that we could apply (see Sec-
sands of hits and ranking these hits becomes the primary tion 8). Most modern databases now directly support
challenge. text search, which is sufficient for most search applica-
Finally, search engines must be highly available and tions, although probably not for Yahoo! or Google.1
fresh, two complex and challenging data management The semantics for a DBMS start with the goals of
issues. Downtime contributes directly to lost revenue consistent, durable data, codified in the ideas of ACID
and customer churn. Freshness is the challenge of keep- transactions [GR97]. However, ACID transactions are not
the right semantics for search engines.
Copyright (c) 2004 by Eric A. Brewer.

First, as with other online services, there is a prefer- simple snapshot of the data, while most work, such as
ence for high availability over consistency. The CAP The- indexing, can be done offline without concern for avail-
orem [FB99,GL02] shows that a shared-data system must ability. For example, any work done offline can be
choose at most two of the following three properties: con- started and stopped at will, has a simple “start over”
sistency, availability, and tolerance to partitions. This model for recovery, and in general is very low stress to
implies that for a wide-area system you have to choose modify and operate since these efforts are not visible to
between consistency and availability (in the presence of end users.
faults), and SEs choose availability, while DBMSs choose
2.1 Crawl, Index, Serve
consistency.2 In addition, the index is always stale to some
degree, since updates to sites do not immediately affect the The first step is to “crawl” the documents, which
index. The explicit goal of freshness is to reduce the amounts to visiting pages in essentially the same way as
degree of inconsistency. an end user. The crawler outputs collections of docu-
Second, SEs can avoid a general-purpose update ments, typically a single file with a map at the beginning
mechanism, which makes isolation trivial. In particular, and thousands of concatenated documents. The use of
queries never cause updates, they are all read only. This large files improves system throughput, amortizes seek
implies query handling (almost) never deals with atomic- and directory operations, and simplifies management.3
ity, isolation, or durability. Instead, updates are limited to The crawler must keep track of which pages have been
atomic replacement of tables (covered in Section 5.2), and crawled or are in progress, how often to recrawl, and
only that code deals with atomicity and isolation. Durabil- must have some understanding of mirrors, dynamic
ity is even easier, since the SE is never the master copy: pages, and MIME types.
any lost data can generally be rebuilt from local copies or The indexer parses and interprets collections of doc-
even recrawled (which is how it is refreshed anyway). uments. Its output is a portion of the static database,
We start with an overview of the top-down design, fol- called a chunk, that reflects all of the scoring and nor-
lowed by coverage of the query plan and implementation malization for those documents. In general, the goal is
in Sections 3 and 4. Section 5 looks at updates, Section 6 to move work from the (online) web servers to the
at fault tolerance and availability, and Section 7 at a range indexer, so that the servers have the absolute minimum
of other issues. Finally, we take a broader look at data- amount of work to do per query. For example, the
intensive systems in Section 8. indexer does all of the work of scoring, generating typi-
cally a single normalized score for every word in every
document. The indexer does many other document anal-
2 Overview yses as well: determining the primary language and geo-
In a traditional database, the focus is on a general-pur- graphical region, checking for spam, and tracking
pose framework for a wide variety of queries with much of incoming and outgoing links (used for scoring). One of
the effort expended on data consistency in the presence of the more interesting and challenging tasks is to track all
concurrent updates. Here we focus on supporting many of the anchor text for a document, which is the hyper-
concurrent read-only queries, with very little variation in link text in all other (!) documents that point to this doc-
the range of queries, and we focus on availability more ument.
than consistency. Finally, the server simply executes queries against a
These constraints lead to an architecture that uses an collection of chunks. It performs query parsing and
essentially static database to serve all of the read-only que- rewriting, query optimization, and query execution.
ries, and a large degree of offline work to build and rebuild Since the only update operation is the atomic replace-
the static databases. The primary advantage of moving ment of a chunk (covered in Section 5), there are no
nearly everything offline is that it greatly simplifies the locks, no isolation issues, and no need for concurrency
online server and thus improves availability, scalability control for queries.
and cost.
We believe that most highly available servers should 2.2 Queries
follow this “snapshot” architecture — the server uses a Conceptually, a query defines some words and prop-
erties that a matching document should or should not
1: As an aside, the databases were also very expensive. However, as contain. A document is normally a web page, but could
we were among the first to build large web-database systems, we were also be a news article or an e-mail message. Each docu-
charged per “seat”, which in the fine print came down to distinct UNIX
ment is presumed unique, has a unique ID (DocID), a
user IDs. But all of the end users were multiplexed onto one user ID, so
URL and some summary information.
this was quite reasonable! Later the database companies changes the def-
inition of “user” and this trick was no longer valid. Documents contain words and have properties. We
2: Wide-area databases vary in their choice between availability and distinguish words from properties in that words have a
consistency. Those that choose availability operate some locations with
stale data in the presence of partitions and generally have a small window 3: In theory, a DBMS could be used for document storage, but it
that is stale (inconsistent) during normal operation (typically 30 seconds) would be a poor fit. Documents have a single writer, are only dealt
[SAS+96]. Those that choose consistency must make one side of a parti- with in large groups, and have essentially no concurrent access. See
tion unavailable until the partition is repaired. the Google File System [GGL03] for more on these issues.
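As a rough sketch of the indexing step described above (Section 2.1), the following code turns a small
batch of crawled documents into per-word posting lists sorted by DocID, with one precomputed score per
(word, document) pair. The scoring rule (term frequency over document length) and the data layout are
simplifications invented for this example, not Inktomi's actual chunk format.

    from collections import defaultdict

    def build_chunk(documents):
        """documents: dict mapping DocID -> list of parsed, normalized words.

        Returns a toy "chunk": for each word, a posting list of (DocID, score)
        pairs sorted by DocID, so query-time joins can be simple sorted merges.
        """
        postings = defaultdict(lambda: defaultdict(float))
        for doc_id, words in documents.items():
            length = len(words) or 1
            for w in words:
                # Toy normalized score: term frequency divided by document length.
                postings[w][doc_id] += 1.0 / length
        return {w: sorted(by_doc.items()) for w, by_doc in postings.items()}

    chunk = build_chunk({
        101: "berkeley database systems".split(),
        102: "berkeley restaurants near campus".split(),
    })
    print(chunk["berkeley"])   # [(101, 0.333...), (102, 0.25)]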

Property            Meaning
lang:english        doc is in english
cont:java           contains java applet
cont:image          contains an image
at:berkeley.edu     domain suffix is berkeley.edu
is:news             is a news article

Table 1: Example Properties

Document table, D, about 3B rows:    DocId  URL  Date  Size  Abstract
Word table, about 1T rows:           WordID  DocId  Score  Position Info
Property table, about 100B rows:     WordID  DocId
Term table, T, about 10M rows:       String  WordID  Stats

Figure 1: Basic Schema

score (for this document) and properties are boolean (present or absent in the document). Table 1
lists some examples. A query term is a word or a property.
Simple queries are just a list of terms that matching
documents must contain. Property matching is absolute:
a matching document must meet all properties. Word
matching is relative: documents receive relative scores based on how well they match the words.
Complex queries include boolean expressions of terms based on AND, OR and NOT. Boolean
expressions for properties are straightforward, but those for words are not. In particular, the
expression (NOT word) should not affect the scoring; it is really a property.
We cover scoring in more detail in the appendix, but for now we will use simple definitions. A
query is just a set of terms:

    Q ≡ { w_1, w_2, ..., w_k }                                            (1)

The score of a document d for query Q is the sum of an overall score for the document and a score
for each term in the query:

    Score(Q, d) ≡ Quality(d) + Σ_i Score(w_i, d)                          (2)

The quality term is independent of the query words and reflects things like length (shorter is
generally better),

3 Logical Query Plan
Given this simplified scoring, we turn to how to map a query into a query plan. This section looks
at the logical query plan and the next section looks at the physical operators and plan
implementation.
In the original development of this work, we were not cognizant that we were defining a
declarative query language and that it should have a query plan, an optimizer, and a rewriter, and
that we should cleanly separate the logical and physical query operators and plans. We did know
we needed a parser. The absence of this view led to a very complicated parser that did ad-hoc
versions of query rewriting and planning, and some optimization. The use of an abstract logical
query plan is one of the important principles to take from database systems, and hence we
retrospectively present the work based on a clean logical query plan.
For simplicity, we will limit the schema to three (large) tables: document info, word data and
property data. Figure 1 shows the schema. Tables that we ignore include those for logging (one
row per search), advertising, and users (for personalization); we talk about some of these in
Section 7.
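A minimal sketch of the scoring model in equations (1) and (2) above; the quality value and the per-word
scores are invented placeholders for what the indexer would precompute.

    def score(query_words, quality, word_scores):
        """Score(Q, d) = Quality(d) + sum of Score(w_i, d) over the query words.

        quality: query-independent quality of document d.
        word_scores: dict mapping word -> precomputed Score(w, d) for document d;
        words absent from the document contribute nothing.
        """
        return quality + sum(word_scores.get(w, 0.0) for w in query_words)

    # Example with made-up per-document data for the query {san, francisco}.
    print(score(["san", "francisco"], quality=0.8,
                word_scores={"san": 0.4, "francisco": 0.6}))   # 1.8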
popularity, incoming links, quality of the containing To simplify dealing with words and properties, we
site, and external reviews. The score for each word is a conceptually use an integer key for each word, the
determined at index time and depends on frequency and WordID. The term table, T, maps from the string of the
location (such as in the title or headings, or bold). term to the WordID for that word, and also keeps statis-
There are some important non-obvious uses for tics about the word (or property). The stats are used for
words. In general, any property of a document that is not both scoring and to compute the selectivity for query
boolean is represented by a metaword. Metawords are optimization.4 The simplest useful stat is the number of
artificial words that we add to a document to encode an rows in the table, which tells you how common the term
affine property. For example, to encode how frequently is in the corpus; high counts imply high selectivity and
a document contains images (rather than just yes or no), lower scores (since the word is common). Note that the
we add a metaword whose score reflects the frequency.
You can use this trick to encode many other properties,
such as overall document quality, number of incoming
or outgoing links, freshness, complexity, reading level, 4: Selectivity is the fraction of the input that ends up in the output,
and is thus a real number in the interval [0,1]. Ideally, a query plan
etc. Implicitly, these metrics are all on the same scale,
should apply joins with low selectivity first, since they reduce the data
but we can change the weighting at query time to con- for future joins. With multi-way equijoins (and semijoins), this is less
trol how to mix them. important since we aim to do them all at once. Of great confusion to
many is that high selectivity numbers imply the operation is not very
“selective” in the normal English usage of the word.

Figure 2: The General Query Plan. After finding the set of matching documents and their scores,
the Top operator passes up the top k results (in order) to an equijoin that adds in the document
information. [The figure shows: Result Set = [DocId, Score, URL, Date, Size, Abstract], produced
by a DocId equijoin of the top results [DocId, Score] with the document table D [DocId, URL,
Date, Size, Abstract]; Top(k, Score) is applied to score = Quality(d) + Σ_i Score(w_i, d), itself a
DocId equijoin of the matching documents (score = Σ_i Score(w_i, d)) with document quality
(score = Quality(d)).]

Table 2: Logical Operators

Operator        Meaning
e AND e         Equijoin with scoring
e OR e          Full outer join with scoring
e FILTER p      Semijoin: filter e by p
p AND p         Equijoin without scoring
p OR p          Full outer join without scoring
NOT p           s antijoin p (invert the set s)
NOT e           s antijoin e (invert, omit score)

prop:   prop AND prop
    |   prop OR prop
    |   NOT prop
    |   NOT expr
    |   property

There are seven corresponding logical operators as shown in Table 2. In this particular grammar,
expr nodes have scores and prop nodes do not. The only way to join an expr and a prop is through
the FILTER operator, which filters the word list on the left with the prop-
erty list on the right. Note that the logical negation for
term table is only used during query planning and is never an expr, NOT e, is a prop and not an expr. This is
referenced in the query itself. because there is no score for the documents not in the
Figure 2 shows the general plan for all queries. The set. This implies that it is not possible to ask for “-foo”
fact that all queries have essentially the same plan is a sig- as a top-level query (i.e. the set of all documents that do
nificant simplification over general-purpose databases, not contain “foo”). In practice, we actually do allow
and a key design advantage. The equijoins ( DocId ) join properties and negated expressions as top-level queries,
rows from each side that match on DocId, resulting in an which are useful for debugging. Note that the Top opera-
output row that has the union of the columns. We refer to tor (with query optimization) saves us from having to
the top equijoin as the “document join,” since it joins the return (nearly) the whole database.
top results with their document data. Normal queries are just a sequence of words with the
The Top(k, column) operator returns the top k items implicit operator AND between them. For example, the
from the input set ordered on column; it is not a common query:
database operator, although it appears in some extensions san francisco
to SQL and in the literature [CG99,CK98]. The input to Top
is the set of fully scored documents, which combines the maps to (san AND francisco). For properties:
document quality score with the sum of the word scores bay area lang:english
from the matching documents. Top can be implemented in
maps to ((bay AND area) FILTER
O(n) time using insertion sort into a k-element array.5 lang:english), which is the set of english-language
The next task is to produce the set of matching docu- documents that contain both words. A minus sign pre-
ments and their scores. Applying the top-down principle, ceding a word normally means negation, so that:
we design a small query language for this application,
rather than using SQL. Here is the BNF for one possible bay area -hudson
query language: maps to ((bay AND area) FILTER NOT hudson).
More complex queries usually come from an
expr:   expr AND expr
    |   expr OR expr
    |   expr FILTER prop
    |   word

"Advanced Search" page with a form-based UI, or from a test interface that is amenable to
scripting. We will use the parenthesized representation directly for these que-
ries.

4 Query Implementation
5: Insertion sort is normally O(n lg n), but since we only keep a con-
Given the scoring functions and the logical opera-
stant number of results, k, we have a constant amount of work for each of
the n insertions.
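Footnote 5's point, that keeping only k results makes the Top operator linear in the number of inputs, can
be sketched as follows; this is an illustrative implementation, not the engine's code.

    def top_k(scored_docs, k):
        """Return the k highest-scoring (doc_id, score) pairs, best first.

        A k-element array is maintained by insertion, so each of the n inputs
        costs O(k) work and the whole pass is O(n * k): linear in n for fixed k.
        """
        best = []                      # kept sorted, highest score first
        for doc_id, s in scored_docs:
            if len(best) < k or s > best[-1][1]:
                i = len(best)
                while i > 0 and best[i - 1][1] < s:
                    i -= 1             # scan the short array for the slot
                best.insert(i, (doc_id, s))
                if len(best) > k:
                    best.pop()
        return best

    print(top_k([(1, 0.2), (2, 0.9), (3, 0.5), (4, 0.7)], k=2))   # [(2, 0.9), (4, 0.7)]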
tors, we next look at query optimization and the map-

ping of logical operators onto the physical operators. We start by defining the physical operators
and then show how to map the logical operators and some possible optimizations. We finish with
parallelization of the plan for execution on a cluster.

FILTER(e1, e2, ... ek)(p1 ... pn) → expr
    Multiway inner join with scoring for the expressions and no scoring for the properties.

Most queries map onto a one-deep plan using FIL-
TER. It is essentially an AND of all of its inputs, with only
4.1 Access Methods and Physical Operators the expressions used to compute the score. It imple-
There is really only one kind of access method: ments (e AND e) if there are no properties. Although it
sequential scan of a sorted inverted index, which is just a could also subsume (p AND p), it is better to use ANDp,
sorted list of all of the documents that contain a given since the latter returns a property list rather than an
term. For properties, this is just the sorted list of docu- expression list which avoids space for unused scores.
ments; for expressions we add the score for each docu- Figure 3 shows some example queries with their logical
ment. A useful invariant is to make expressions a and physical plans.
subclass of properties, so that an expression list can be One nice property of using multiway joins is that it
used for any argument expecting a property list. For mitigates the need for estimating selectivity. Selectivity
example, this means we do not need a separate negation estimation is normally needed to compute the size of an
operator for expressions and properties. We cover the input to another operator; increasing the fan in of an
physical layout of the tables in Section 4.3, when we (inner) join limits the work to the actual size of the
discuss implementation on a cluster. smallest input and thus decreases the need for estimates.
An unusual aspect of the physical plan is that we For example, for FILTER and ANDp, the output is lim-
cache all of the intermediate values (for use by other ited by the size of the smallest input (lowest selectivity).
queries), and do not pipeline the plan. Caching works Thus selectivity only matters when we cannot flatten a
particularly well, since there are no updates in normal subgraph to use a multiway join.
operation (updates are covered in Section 5). Given that 4.2 Query Optimizer
we keep all intermediate results, there is no space sav-
ings for pipelining. Pipelining could still be used to The optimizer has three primary tasks: map the logi-
reduce query latency, but we care more about through- cal query (including negations), exploit cached results,
put than latency, and throughput is higher without pipe- and minimize the number of joins by using large multi-
lining, due to lower per-tuple overhead and better way joins. As expected these optimizations often con-
(memory) cache behavior. Thus, we increase the latency flict, leading us to either heuristics or simple models (as
of a single query that is not cached, but reduce the aver- done in traditional optimizers). The basic heuristic is to
focus first on caching, second on flattening (using larger
age latency (with caching) and increase throughput. 6 multiway joins), and third on everything else.
Because the lists are sorted, binary operators become The focus on caching all subexpressions leads to the
merging operations: every join is a simple (presorted) atypical decision of using a top-down optimizer [Gra95],
merge join. In fact, there is no reason to do binary oper- rather than the bottom-up style that is standard for tradi-
ators: every join is a multiway merge join. The use of tional databases [Sel+79]. Although either could be made
multiway joins is a win because it reduces the depth of to work, the top-down approach makes it easy to find
the plan and thus the number of caching and scan steps the highest cached subexpression: we simply check the
(remember that intermediate results are not pipelined). cache as we expand downward. The bottom-up
In addition, it is useful to move negation into the multi-
way join as well, since the antijoin is a simple variation
of a merge join. A consequence of this is that for every
input to the multiway join, we add a boolean argument to indicate the positive or negative version
of the input. This leads us to have only four physical operators:

OR(e1, e2, ... ek) → expr
    Compute the full outer multiway join, with scoring. We have left out the boolean flags; we
    will use "¬e" as the input when we mean the negation.
ORp(p1, p2, ... pk) → prop
    Multiway full outer join without scoring.
ANDp(p1, p2, ... pk) → prop
    Multiway inner join without scoring.

Figure 3: Two example queries and their logical and physical plans

    "bay area lang:english"
        ⇒ (bay AND area) FILTER lang:english
        ⇒ FILTER(bay, area)(lang:english)

    "san francisco contains:image -contains:flash"
        ⇒ ((san AND francisco) FILTER contains:image) FILTER (NOT contains:flash)
        ⇒ FILTER(san, francisco)(contains:image, ¬contains:flash)

6: In theory, we could choose to not cache some intermediate results that we believe unlikely to
be used again, and pipeline the results, but this is not worth the extra complexity.
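The FILTER operator above is essentially a multiway merge over presorted posting lists. The sketch
below assumes each input is a list of (DocID, score) pairs sorted by DocID (property inputs carry a score
of zero) and ignores the negation flags; it illustrates the idea rather than the production operator.

    def multiway_filter(expr_lists, prop_lists):
        """Intersect presorted posting lists on DocID, summing only expression scores.

        Each input is a list of (doc_id, score) pairs sorted by doc_id. Returns
        (doc_id, total_score) for documents present in every input, in DocID order.
        """
        inputs = expr_lists + prop_lists
        n_expr = len(expr_lists)
        cursors = [0] * len(inputs)
        out = []
        while all(c < len(lst) for c, lst in zip(cursors, inputs)):
            heads = [inputs[i][cursors[i]][0] for i in range(len(inputs))]
            high = max(heads)
            if all(d == high for d in heads):
                # Present in every list: emit, scoring only the expression inputs.
                out.append((high, sum(inputs[i][cursors[i]][1] for i in range(n_expr))))
                cursors = [c + 1 for c in cursors]
            else:
                # Advance every cursor that is behind the current largest DocID.
                for i, d in enumerate(heads):
                    if d < high:
                        cursors[i] += 1
        return out

    word_a = [(1, 0.3), (4, 0.9), (7, 0.2)]
    word_b = [(4, 0.5), (7, 0.1)]
    prop_p = [(2, 0.0), (4, 0.0)]
    print(multiway_filter([word_a, word_b], [prop_p]))   # [(4, 1.4)]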

approach implies building many partial solutions as part of vidual terms. Thus, if FILTER(a,b,e) and FIL-
dynamic programming that are unnecessary if they are part TER(d,f) were in cache (but larger subsets were not),
of a cached subexpression: that is, you build up best sub- we might map:
trees before you realize that they are cached. Both FILTER(a, b, c, d, e, f) Ö
approaches typically require building partial solutions and
FILTER(FILTER(a,b,e), c, FILTER(d,f))
exploring parts of the space that are not used for the final
plan. One useful aspect of the cache is that we can keep
The basic mapping of logical operators is straightfor- the size of the cached set as part of its metadata, which
ward, with the only subtlety being how to map negations: allows us to exploit selectivity. In particular, given two
overlapping sets in cache that each represent k terms, we
a AND ¬b Ö FILTER(a)(¬b) choose the one with the smaller set size, since it is prob-
¬a AND ¬b ÖANDp(¬a, ¬b) ably more selective. In the above example, if FIL-
TER(a,c,d) was also in cache, we would select
a OR ¬b Ö OR(a, ¬b) between it and FILTER(a,b,e) based either on selec-
¬a OR ¬b ÖORp(¬a, ¬b) tivity, or the degree of caching of the remaining terms,
or both. It is in the exploration of the remaining terms
Note the use of ANDp and ORp when the output is a prop- that the top-down approach may explore parts of the
erty and not an expression. plan space that are not used.
Given a correct tree of physical operators, the next step As with traditional databases there are many other
is to optimize it, which consists mostly of flattening the possible optimizations with increasing complexity,
tree to use fewer but wider joins. The first step is to flatten which we ignore here. To give the flavor of these, con-
all chains of pure AND or pure OR, making liberal use of sider in the example above if c was not in cache, but
the commutative and associative properties. For example: FILTER(c)(e) was, then the latter could be used
(((a AND b) AND c) OR d) instead, since e is part of the larger conjunctive join, and
Ö OR(AND(a, b, c), d) ANDing it twice won’t affect the final set. However,
FILTER(c,e)() is not an acceptable replacement,
FILTER(a, FILTER(b)(c))(d)
since the score for e would be counted twice.7
ÖFILTER(a, b)(c, d)
4.3 Implementation on a Cluster
We can do more complicated forms of flattening by
using DeMorgan’s Law, which allows us to convert Once we have the optimized tree, we must map the
between and AND and OR operations. The basic conver- query onto the cluster. The approach we take is to
sion is: exploit symmetry, which simplifies design and adminis-
tration of the cluster. In particular, the bulk of every
a AND b Ö ¬(¬a OR ¬b) query goes to every node and executes the same code on
a OR b Ö ¬(¬a AND ¬b) different data, as in the SPMD model from parallel com-
puting [DGNP88].
An example use for flattening: From the database perspective, this means a mixture
a AND b AND NOT (c OR d) Ö of replication for small tables and horizontal fragmenta-
AND(a, b, AND(¬c, ¬d)) Ö tion (also known as “range partitioning”) for large
FILTER(a, b)(¬c, ¬d)
tables. In particular, the document, word and property
tables are all horizontally fragmented by DocID, so that
However, we have to be careful to keep score informa- a self-contained set of documents (with a contiguous
tion when applying DeMorgan’s Law. For example, is (a ranges of DocIDs) resides on each node. This structure
OR ¬b) an expression or a property? If it is an expression, simplifies updates to documents, and also makes it easy
not all of the elements of the set have scores and we must to mix nodes of different power, since we can give more
make some up (typically zero). If it is a property, then we powerful nodes more documents. The term table, which
forfeit the scoring information (from a) for later expres- maps term strings to WordIDs, is replicated on each
sions. Either policy can be made to work, although more node, and we use global values for the WordIDs, so that
flattening is possible when treating this case as a property, WordIDs can be used in physical queries instead of the
since otherwise (a OR ¬b) z ¬ANDp(¬a,b). strings.
The current heuristic is to flatten completely before An important point is that the DocID is essentially
looking for cached subexpressions, which is part of the random relative to the URL (such as a 128-bit MD5 or
general philosophy of using canonical representations for CRC), which means that documents are randomly
trees to ensure that only one form of a subexpression could spread across the cluster, which is important for load
appear in the cache. As a consequence, for a large multi- balancing and caching.
way join, we must look for subsets of the terms in the
cache. We first look for the whole k-way join, then for 7: Under overload conditions, this might be an acceptable replace-
each k-1 subset, then each k-2 subset, until we get to indi- ment, since it will be a small reranking of the same set of documents,
and the scoring function is always magic to some degree anyway.
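The cached-subexpression search described in Section 4.2 (try the whole k-way join, then each k-1
subset, and so on, preferring the smaller and therefore likely more selective cached set) can be sketched
as below. The cache representation, frozensets of terms mapped to cached result sizes, is an assumption
made for the example.

    from itertools import combinations

    def best_cached_subset(terms, cache):
        """Return a largest cached subset of `terms`, preferring smaller result sets.

        cache maps frozenset-of-terms -> size of the cached result; returns None
        if no subset of two or more terms is cached.
        """
        terms = frozenset(terms)
        for size in range(len(terms), 1, -1):          # whole set, then k-1, k-2, ...
            hits = [frozenset(c) for c in combinations(terms, size)
                    if frozenset(c) in cache]
            if hits:
                return min(hits, key=lambda s: cache[s])   # smaller set = more selective
        return None

    cache = {frozenset({"a", "b", "e"}): 120, frozenset({"d", "f"}): 40}
    print(sorted(best_cached_subset("abcdef", cache)))     # ['a', 'b', 'e']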

The word and property tables are (pre)sorted by it would require to store the whole DocID. Similarly, it
WordID for that node’s set of DocIDs, so that they are is important to make good use of all of the scoring bits,
ready for sorted merge joins. We maintain a hash index which can be done by a transformation of the scoring
on WordID for these tables to locate the beginning of the function.
inverted index for each term. It is somewhat easier to It turns out that the compression not only increases
think of the word and property tables as sets of “sub- the effective disk bandwidth, but also the cache size. By
tables”, one for each WordID. This is because each sub- keeping the in-memory representation compressed, we
table is independently compressed and cached, as increase the cache hit rate at the expense of having to
described below; the hash index on WordID is thus really decompress the table on every use. This turns out to be
in index of the sub-tables, and keeps track of whether or an excellent tradeoff, since modern processors can eas-
not they are in memory. ily do the decompression on the fly without limiting the
Initially, using a load balancer, a query is routed to off-chip memory bandwidth, and the cost of a cache
exactly one node, called the master for that query. Most miss is millions of cycles (since it goes to disk).
nodes will be the master for some queries and a follower A related optimization is preloading the cache on
for the others. The master node computes and optimizes startup. This turns out to be pretty simple to do and
the query plan, issues the query to all other nodes (the greatly reduces the mean-time-to-repair for a node that
followers), and collates the results. Each follower com- goes down (for whatever reason). In the case of a grace-
putes its top k results, and the master then computes the ful shutdown of the process, the node can write out its
top k overall. Finally, the last equijoin with the docu- cache contents, and even reuse the memory via the file
ment table, D, is done via a distributed hash join with cache when the new process starts. For an unexpected
one lookup for each of the k results (which may also be shut down, the process can use an older snapshot of the
cached locally). This is really a “fetch matches” join in cache, but will have to page it in, which is still faster
the style of Mackert and Lohman [ML86], which means that recomputing it (which requires reading all of the
that you simply fetch the matching tuples on demand constituent tables). The primary limitation is that the
rather than doing any kind of movement or repartition- snapshot must match the current version of the database;
ing of the table. both are marked with version numbers for this reason.
Using the master to compute the query plan has Finally, the use of a master node enables a powerful
some subtle issues. The primary advantage is that the kind of optimization based on the classic A* search
plan is computed only once, and all followers execute algorithm (from AI) [RN02], which employs a conserva-
only physical queries. However, the cache contents are tive heuristic to prune the search space. In particular,
not guaranteed to be the same, since different nodes may instead of simply sending out the query, the master exe-
have different amounts of cache space. In practice, the cutes the query locally first (which adds some latency),
cache size is always proportional to the size of the frag- and computes its top k local results, which it will have
ment, so the contents usually agree, but not always. For compute at some point anyway. The score of the kth local
example, a cache entry would be larger than usual if that result is a conservative lower bound on the scores of the
node has more occurrences of a particular word, which overall top k results. In particular, followers need not
may force something else out; however, since docu- pursue any subquery that cannot beat this lower bound.
ments are spread randomly this effect tends to even out. For example, for a term in a multiway join, there is typi-
Nonetheless, a follower may have to recompute some- cally some score below which you need not perform the
thing that the master expected to be in cache. join, since even with the best values for the other terms
the end score will not make the top k. Similarly, by
4.4 Other Optimizations
keeping track of the best score for the whole table, we
In addition to traditional database optimizations, may be able to eliminate whole terms.
search engines exploit some unusual tricks that merit
discussion. We cover three of them here.
One of the most important optimizations is compres- 5 Updates
sion of inverted indices. Although compression has been Although search engines are clearly read mostly, at
covered in the literature [CGK01], it is not widely used in some point we actually need to update the data. One
any major DBMS. It makes more sense for a search huge benefit of the top-down strategy is that we can
engine for a few reasons. First, there is no random exploit our complete control over the timing and scope
access for these sub-tables, they are always scanned in of updates.
their entirety as part of a sorted merge join. (The excep- We follow a few basic principles for updates. First,
tion is the document table, which is random access and nodes are independent, so we can update one node with-
is not compressed.) Second, there are no updates to the out concern for the impact on other nodes. Replicas are
tables, only whole replacement, so there is no issue of clearly an exception to this, since they must be updated
how to update a compressed table. The simplest good together, but their group is independent of other groups.
compression scheme is to use relative numbering for Second, we only update whole tables and not individual
DocIDs, since they are in sorted order and the density rows. This means that we never insert, update or delete a
may be high. This requires many fewer bits than the 32+ row, and that we need at most one lock for the whole

table. Third, updates should be atomic with respect to que- Each chunk has a version number that is unique and
ries; that is, updates always occur between queries. monotonically increasing, typically a sequence number.
To simplify updates, we define a chunk to be the unit The version number is used for cache invalidation, con-
for atomic updates. Earlier we mentioned that the tables tent debugging, and data rollback.
are partitioned by DocID among the nodes, but it is more
accurate to say that the databases are partitioned into 5.2 Atomic Updates
chunks, and that a node contains a contiguous range of Once we have a new chunk, we need to install it
chunks. Each chunk is a self-contained collection of docu- atomically. Conceptually, this is done by updating a ver-
ments with their word and property tables. sion vector [Cha+81], with one element for each chunk.
In practice, it is useful to split the cluster into multiple In the absence of caching this is trivial: it is suffi-
databases, called partitions. This allows each partition to cient to close and reopen the corresponding files. It is
have its own policies for replication and freshness. A slightly better to open the new files first, which allows
query still goes to all nodes (of all partitions), and the the existing queries to finish on the old version, while
DocIDs and WordIDs are still globally unique. For repli- new queries go to the new version. When the last pre-
cated partitions, which normally have two replicas, each update query completes, the old version files can be
node has the same chunks as it replica(s), and only one closed (and later deleted).
member of the replica group receives any given query in With caching, we must also invalidate the cache
the normal case. entries for the old version. The simplest implementation
The next two sections looks at the creation and instal- of caching uses a separate cache for each chunk, in
lation of chunks, and the following two look at more com- which case we can just invalidate the whole cache for
plex types of updates. that chunk. This works pretty well; other chunks keep
their caches intact and the overall performance impact is
5.1 Crawling and Indexing thus limited. Alternatively, caches can be unified for all
The first step for an update is to get the new content, of the chunks on a node, which improves performance,
which is usually done by crawling: visiting every docu- but chunk replacement invalidates the whole cache.
ment to verify that we have the current version, and The replacement of a specific chunk does not require
retrieving a new version if we do not. Indexing is the pro- the node to be stopped. Rather by using UNIX signals,
cess of converting a collection of documents into a chunk, we can use a management process to install chunks
and includes parsing and scoring, and the management of remotely. We also use signals to initiate a rollback to the
metadata, such as tracking incoming and outgoing links. previous version of a chunk. With some automation, the
It is easiest to think of a chunk as a range of DocID management process can update all of the chunks in a
values, which means that a chunk does not have a specific smooth rolling upgrade, and likewise update all of the
size per se, but rather an average size. This definition sim- nodes. Updating chunks incrementally limits the impact
plifies the addition and removal of documents from a of rebuilding the cache, since most of the cache remains
chunk, since there is no effect on neighboring chunks. As a intact; this makes it feasible to keep up with the ongoing
database grows, the average chunk size will grow until it load during an update.
reaches some threshold at which point it may be split into
two or more chunks. 5.3 Real-time Deletion and Updates
The simplest kind of crawl simply refreshes all of the So far, we have said that we do not do updates to
documents in one chunk, and then reindexes them. Some individual records. This is not strictly true, but is the
documents may have to be recrawled multiple times if right overall view, since the mechanism described here
their site is down, or they can be left out for this version is relatively heavyweight. There are some occasions
and recrawled next time, although eventually they are per- where it is useful to update a specific document immedi-
manently removed. ately. For example, a document known to be illegal may
The refresh rate is a property of the partition, and thus need to be removed immediately upon discovery. For
a property of all of its chunks. News partitions may be this purpose, we add a mechanism for real-time dele-
updated every fifteen minutes, while slow-changing con- tion, which also enables real-time updates.
tent, such as home pages, may be refreshed every two The general approach to deletion is to add a row that
weeks (or longer). means “item deleted” that we can then use as the right-
Document discovery, which is the process of finding hand side of an antijoin to cull the document from a set.
new documents for the database, is primarily a separate For real-time deletion, we add a very small table (usu-
process, although outgoing links are the main source of ally empty) to every chunk, which contains the list of
new documents. A separate database tracks metadata deleted documents. It is a property table, where the
about all of the sites, including new links and global prop- property it represents is “has been deleted”, and we
erties about spam sites, mirrors, paid content, etc. New apply it as a filter to every query. Since we add this filter
documents can be added to existing chunks when they are before optimization, it will be optimized as well. In the
next refreshed, or may be added to a new chunk in a sepa- normal case the top-most operator is already a FILTER,
rate partition, called the “new” partition. and the optimizer can just add the inverse of this table as
an extra property. Thus to delete a document in real

time, we simply add a row to this special table, and then 6.1 Disk Faults
atomically update the whole table (as with regular The most common fault is a disk failure, either of a
updates). block or a whole disk. A block fault only affects one
Given this mechanism, we can also do real-time chunk, but a disk failure might affect more than one. In
updates. An update involves inserting the new version both cases, new copies of the chunks can be loaded onto
into a different chunk (usually in the “new” partition), other blocks or disks in the background, and then atomi-
and deleting the old version. Just doing the insert is not cally switched in. Note that chunks are never updated in
sufficient, since the master will see both versions, and place even in normal operation, so the replacement
may return both or the even just the old one (if it thinks chunk is really just an atomic update to the same ver-
they are duplicates). sion. Nodes are limited by disk seeks, not space, so there
5.4 System-Wide Updates is always plenty of free space for staging. In fact, given
that space is cheap and staging areas are useful, it is
Occasionally, we perform updates that affect all of worthwhile to cluster the active chunks onto contiguous
the nodes. The most common example is a change to the tracks, which reduces the seek time during normal oper-
scoring algorithm, which makes the old scores incompa- ation; other parts of the disk are used for staging.
rable with the new scores. Similarly, we may change the Failed disks are left in active nodes until some con-
schema or the global ID mechanism. In such cases, we venient time, typically the scheduled maintenance win-
need to ensure that masters only use compatible follow- dow for that node. We replace whole nodes only, and
ers. then sort out the failed disks offline. This simplifies the
The approaches to this are covered better elsewhere repair process, as we always have spare nodes ready to
[Bre01], but the easiest solution is to update all of the
swap in, which are then loaded with the proper chunks
nodes at once. By staging the updated versions ahead of and put back online. Originally, we used RAID to hide
time (i.e. loading them onto the disks in the background disk faults, as most DBMSs do, but found this to be
before the update), and using some automation, it is pos- expensive and unnecessary, and those disks still needed
sible to update all of the nodes at once with less than a some process for replacement.
minute of downtime. The cold caches will perform For replicated chunks, if this node is the secondary,
poorly until they warm up, but since this kind of update nothing special happens during recovery. If it is the pri-
is only done when the load is low, this is not a problem mary, than the other replica becomes the primary and
in practice. handles the queries until the local copy is restored. For
caching purposes, it is best to have only one replica han-
6 Fault Tolerance dle queries in the normal case (the primary), with the
The primary goal of fault tolerance for search other replica idle. For load balancing, each member of a
engines is high availability. We use a variety of tech- replica group will be the primary for some chunks and
niques and optimizations to achieve this, few of which the secondary for others. The are lots of ways to deter-
are novel, but together form a consistent strategy for mine which node should be the primary by default, but
availability. any simple (uniform) function of the chunk ID suffices.
The first task is to decide exactly what needs to be 6.2 Follower Faults
highly available, since there is always a significant cost
For node failures, we separate the case of followers
to provide it. First, the snapshot approach means that all
of the indexing and crawling process is independent to from that of masters. A failed follower takes down all of
the server and thus need not be highly available. The its chunks. A master will detect this failure, if it doesn’t
already know, via a timeout. It will then either continue
only fault tolerance requirement for these elements is
idempotency, to ensure that we can simply restart failed without the data in the unreplicated case, or contact the
processes. secondary in the replicated case. An important optimiza-
tion is to spread the secondary copies across the parti-
In addition, most documents are not worth replicat-
ing for high availability. In fact, most documents will tion, so that we spread out the redirected load that
never appear in a search result at all, but alas we cannot occurs during a fault [Bre01]. This can be done by
“chained declustering” [HD90], but there are many suit-
reliably predict which these are (or we would keep zero
copies). Thus some partitions are replicated and some able placements. For example, a typical partition might
are not, and faults in non-replicated chunks or nodes have ten nodes, 2-way replication, and nine primary
chunks per node. Ideally, the nine secondaries that
simply reduce the database size temporarily. However,
the use of pseudo-random DocIDs means that we lose a match the nine primaries for a given node, should be on
nine different nodes, so that after a failure we have
random subset of the documents in a partition, rather
evenly spread out the load for the secondaries. Thus a
than, say, all the documents from one site. A typical pol-
replicated partition should have more nodes than the
icy might replicate popular sites and paid content.
degree of replication, and a enough chunks per node to
enable fine-grain load balancing after a failure.

Failed nodes are typically replaced later the same day, but they can be replaced at any time. The risk is that the secondary might fail before then.

6.3 Master Faults

Since masters are interchangeable, the basic strategy is to reissue the query on a different master. Originally, the master was also the web server, which meant that its failure was externally visible. A layer-7 switch [Fou01] can hide failed nodes for new queries, but it typically cannot reissue the outstanding queries at the time of the failure. For that, we depended on the end-user to hit reload, which they are remarkably happy to do.

The current approach separates the web server from the master, and the web server detects the failure and reissues failed queries to a new master (much like the relationship between masters and followers). This “smart client” approach [C+97] is strictly better for two reasons. First, the retry is transparent to the end user, much like a transactional queue [BHM90]. Second, it allows us to reissue the query to a different data center, which facilitates global load balancing and disaster recovery (covered below). The web servers are often owned by partners and are thus located in other data centers anyway. They use a client-side library within the web server to execute search queries, and the recovery and redirection code is part of this library.

6.4 Graceful Degradation

An important challenge for Internet servers that is not typically present for DBMSs is that of overload. There are many documented cases of huge load spikes due to human-scale events such as earthquakes or marketing successes [Mov99,WS00]. These spikes are too large for over-provisioning, which means we must assume that we will be overloaded and must degrade the quality of service gracefully. Overload detection is based on queue lengths: when queues become too long, the system enters overload mode until they drop below some low-water mark.

The details are beyond the scope of this paper, but there are two basic strategies that we use for graceful degradation (see [Bre01]). The first and simplest is to make the database smaller dynamically, which we can do by leaving out some chunks. This both reduces processing time per query and increases the effective cache size for the remaining data. Each chunk we take out increases our effective capacity by some amount, and we can continue this process until we are no longer saturated.

Second, we can decline to execute some queries based on their cost, which is a form of admission control. The naive policy simply denies expensive queries, such as those with many search terms. A more sophisticated version denies queries probabilistically, so that repeated queries will eventually get through, even if they are expensive.
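To make the second strategy concrete, here is a minimal sketch of probabilistic admission control in Python; the cost model, budget, and drop-probability formula are illustrative assumptions rather than the system's actual policy.

    import random

    # Hypothetical cost model: more terms means more doclist work. The weights
    # and thresholds below are illustrative assumptions, not the real values.
    def query_cost(terms, phrase=False):
        return len(terms) * (3 if phrase else 1)

    def admit(terms, overload_level, phrase=False, budget=8):
        """Return True if the query should be executed.

        overload_level is 0.0 when queues are below the low-water mark and
        approaches 1.0 as queues grow; cheap queries are always admitted.
        """
        cost = query_cost(terms, phrase)
        if cost <= budget or overload_level <= 0.0:
            return True
        # Drop probability rises with overload and with how far the query
        # exceeds the budget, so repeated retries will eventually succeed.
        p_drop = min(0.95, overload_level * (cost - budget) / cost)
        return random.random() >= p_drop

    if __name__ == "__main__":
        q = ["new", "york", "city", "subway", "map", "pdf"]
        admitted = sum(admit(q, overload_level=0.8) for _ in range(1000))
        print(f"admitted {admitted}/1000 expensive queries under heavy load")

Because the denial is probabilistic rather than absolute, an expensive query that is retried during overload still has a chance of getting through, which matches the behavior described above.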
6.5 Disaster Recovery

Disaster recovery is the process of recovering a whole data center, which might take considerable time, but should be very rare. So far we have not had any disasters, although we have moved data centers on multiple occasions, while keeping the system up, and thus know that our approach works.

The basic strategy is to combine master redirection and graceful degradation. When a data center fails or becomes unreachable, the client-side library in the web server will detect that the master has failed and will retry another master, probably in the same data center. At some point it will give up on that data center and try an alternate. The number of data centers varies, but the range is 2-10. Important partitions must be replicated at multiple data centers, in addition to local replication.

Although redirection is sufficient for a single query, it would not work in aggregate without automatic graceful degradation. If we simply redirect all queries from one data center to another, the new target will likely be overloaded. (At low load times, it would probably be fine.) Thus, we depend on graceful degradation to increase the capacity of the new data center to handle the load of both centers. Unlike a traditional load spike, which is relatively short lived, this state may persist for a while. Although it is possible to add some real capacity on short notice, full capacity may require major repairs or even the provisioning and setup of new space.

7 Other Topics

In this section we briefly visit a range of search engine challenges that differ from traditional database systems.

7.1 Personalization

Although personalization has become an important part of the web experience, e.g. “My Yahoo!”, there is no equivalent in other media and thus search engines were the first systems to run into the problems of large-scale personalization. The first such site was the HotBot search engine, which (originally) allowed users to customize the search interface.

There are two general approaches: cookies and databases. In the cookie approach, user data is stored in a “cookie” and parsed as part of each visit, while the database approach stores only the user ID in the cookie, which it then uses to retrieve the appropriate row from a table. Although the cookie approach appears simpler, it suffers from two serious problems: the data is distributed and generally unreachable, which hinders analysis, and it is difficult to evolve the schema.

Essentially the cookie approach requires that all current and previous schema overlap in time, since there is no way to update the schema for a user until they next visit. For example, if the schema has gone through six versions, the current system must be able to handle cookies that use all six schemas, since which version a user follows depends only on the time of their last visit (from a given browser), which can be any time in the past. Given the large population of users, every schema will have some number of representatives. This can be
addressed with version numbers (stored in the cookie), but remains awkward.

Although we used a DBMS to manage user data, it is actually a mediocre approach, primarily due to cost, complexity and availability. Indeed, there has been substantial work on how to solve this problem more directly, including some support in Enterprise Java Beans (backed by a database), the use of a highly available cluster hash table [GBH+00,Gri00], and a new framework specifically for session-state management [LKF04]. Like the search engine itself, this component requires only a single query plan, in this case just a highly available hash table lookup (no joins, ranges, or projections).
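To make the database approach concrete, the following is a minimal sketch; the table layout, the upgrade functions, and the use of SQLite are assumptions for illustration, not the original system. The cookie carries only a user ID, the profile lives in a table keyed by that ID, and a schema-version column lets old rows be upgraded lazily on the user's next visit.

    import json
    import sqlite3

    CURRENT_VERSION = 3

    # Hypothetical lazy upgraders from each old schema version to the next.
    UPGRADERS = {
        1: lambda p: {**p, "language": p.get("lang", "en")},                 # v1 -> v2
        2: lambda p: {**p, "collections": p.get("collections", ["news"])},   # v2 -> v3
    }

    def init(db):
        db.execute("CREATE TABLE IF NOT EXISTS profiles "
                   "(user_id TEXT PRIMARY KEY, version INTEGER, data TEXT)")

    def load_profile(db, user_id):
        """Single query plan: a primary-key lookup, no joins or ranges."""
        row = db.execute("SELECT version, data FROM profiles WHERE user_id = ?",
                         (user_id,)).fetchone()
        if row is None:
            return {}
        version, profile = row[0], json.loads(row[1])
        while version < CURRENT_VERSION:        # upgrade lazily, in one place
            profile = UPGRADERS[version](profile)
            version += 1
        db.execute("UPDATE profiles SET version = ?, data = ? WHERE user_id = ?",
                   (CURRENT_VERSION, json.dumps(profile), user_id))
        return profile

    if __name__ == "__main__":
        db = sqlite3.connect(":memory:")
        init(db)
        db.execute("INSERT INTO profiles VALUES (?, ?, ?)",
                   ("u42", 1, json.dumps({"lang": "de"})))
        print(load_profile(db, "u42"))   # upgraded to the current schema

With cookies, by contrast, the parsing code for every previous schema version would have to remain in the request path of every front end indefinitely.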
7.2 Logging

Search engines, like other large-scale Internet sites, create enormous logs, often over 100GB per day. These logs are used primarily for billing advertisers, but also for improving the quality of the search engine, and debugging. (These are not the kind of logs used for durability in a DBMS.) Log management systems have become their own class of data-intensive systems, and they also do not fit well on top of existing databases. Although this material is covered much better by Adam Sah [Sah02], who worked on the original Inktomi log manager, it is worth some discussion here.

The two primary issues are 1) DBMSs traditionally do not handle large-scale real-time loading of data, and 2) the query language really needs to support regular expressions, relative timestamps, and partial string matches, none of which fit well within SQL. In addition, log records have a different and far simpler update model: logs are append only and log records are (generally) immutable. The concurrency control and fault tolerance decisions are thus quite different from a DBMS. However, database principles and the top-down approach still apply, and in fact are the right approach. The log system has its own query language and its own optimizations, including compression, caching (of reverse DNS lookups), and parallelization.

7.3 Query Rewriting

As in DBMSs, query rewriting is a powerful and useful tool [PHH92,SJGP90]. In our case, there are two primary values. First and most important, it provides the easiest way to customize a query for a given user or population. For example, for users known to speak a certain language (based on their ISP for example), a rewritten query might increase the ranking of documents in that language or even filter the results for only that language. Similarly, personalization can be used to customize queries for a given user based on collections (e.g. more emphasis on news), topic, complexity, geographical location, etc.

Second, query rewriting is a clean way to encode the context of the query. An important direction for search engines is to provide different results based on the context of the query. For example, a query issued from a page about semiconductors that contains the word “chip” probably refers to semiconductors rather than corn chips or the TV show Chips. Rewriting the query to include a few terms about the context (with low weight) is one easy way to disambiguate an otherwise ambiguous query.
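A minimal sketch of such a context rewrite follows; the weighted-query representation, the weights, and the term-selection heuristic are assumptions for illustration. The user's terms keep full weight, while a few terms drawn from the referring page are appended with low weight.

    from collections import Counter

    STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "on", "for"}

    def context_terms(page_text, k=3):
        """Pick a few frequent, non-stopword terms from the referring page."""
        words = [w.lower().strip(".,") for w in page_text.split()]
        counts = Counter(w for w in words if w.isalpha() and w not in STOPWORDS)
        return [w for w, _ in counts.most_common(k)]

    def rewrite(query_terms, page_text, context_weight=0.2):
        """Return a weighted query: user terms at weight 1.0, context terms low."""
        weighted = {t: 1.0 for t in query_terms}
        for t in context_terms(page_text):
            weighted.setdefault(t, context_weight)
        return weighted

    if __name__ == "__main__":
        page = ("Semiconductor fabrication plants etch transistors onto silicon "
                "wafers; each wafer yields hundreds of semiconductor devices.")
        print(rewrite(["chip"], page))
        # e.g. {'chip': 1.0, 'semiconductor': 0.2, ...}

The low weight on the added terms nudges the ranking toward the right sense of “chip” without letting the context dominate the user's own words, consistent with the weighted-sum scoring described in the Appendix.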
7.4 Phrase Queries

So far, we have only covered the simplest kinds of queries, those based on words and properties. However, the relative positions of words within a document are of great value for improved ranking. For example, searching for “New York” really should give much higher scores to documents in which the two words are adjacent and in the correct order. There are two general approaches to this problem: tracking proximity and tracking exact word positions.

Proximity techniques boost the scores of documents that have the words “near” each other, but not necessarily adjacent. This is a long-standing technique in information retrieval [Sal89], and there are many approaches. One typical one is to break a document into “pages” of some size and use one bit per page to track which pages contain a given word. “Nearness” is then defined by how many pages contain both words (which is just a bitwise AND). This requires building the bitmaps for every word/document pair, and then matching bitmaps once you know that document contains multiple words from the query.

The second approach is “phrase searching” in which the engine actually tracks every position of every word in every document. Remarkably, current search engines actually do this! Phrase queries are significantly more complex, as you need to do what amounts to a nested merge join for every word in the query. For example, given the sorted lists of positions for the words “New” and “York”, you join them using an “off by one” equijoin: output a tuple exactly if the position of “New” is one less than the position of “York”. The multiway join for phrases is analogous. Overall, the best ranking occurs by mixing the results of regular scoring, proximity boosts, and phrases.
8 Discussion and Conclusions

Up to now, the focus has been covering the design of a search engine from the perspective of a database system. In this section, we argue that this is the right approach for other top-down data-intensive systems, and that such systems should employ the principles of databases if not the artifacts. We cover a few other example systems, each of which is a poor fit for existing databases, and yet a good fit for the principles.

First, it is worth summarizing why Informix did 10x worse than the hand-built search engine in 1996. Informix was among the best choices for a search engine at the time, and we in fact used it for other parts of the system, particularly personalization. It had cluster support and seemed to do a reasonable job with caching; it was also viewed as the best “toolbox” database, which is
what we needed. The basic issue was over-generalization, which presumably might limit modern DBMSs as well. Here is a partial list of the optimizations that account for the 10x difference: no locking, a single hand-optimized query plan, multiway joins, extensive compression, aggressive caching, careful data representation, hand-written access methods, single address space, and no security or access control (handled by the firewall). The representation of indexed text in mid-90’s databases was typically 3x larger than the raw text; Inktomi and Alta Vista drove this number to well below one, which accounts for a significant fraction of the overall performance gain, since this directly affects the number and size of I/O operations, and the hit rate of the cache. Finally, even if modern databases solved all of these problems, which they do not, the designers of the next big data-intensive system will surely find some mismatches, and will also have to apply the principles rather than the artifacts.

For the first example of such a system, we return to the logging system, discussed in Section 7.2. The best solution [Sah02] is a top-down design with data independence and a declarative language. Although based on Postgres, it is a large deviation from a traditional DBMS, as it includes Perl in the query language for string handling, strong support for loading data in real time, and changes for high availability. Predecessors, in fact, were not based on Postgres at all and used the file system for storage.

Another search-related example is the Google File System [GGL03], which is a distributed file system optimized for large files, constrained sharing, and atomic append operations. It is a top-down design driven by the need to handle more than one billion documents and millions of files; in particular, it handles all of the files used by the crawling and indexing systems. It has a relatively clean semantics for its important operations (concurrent append in particular), and support for high availability and replication. Although “navigational” rather than query based, it fits the top-down model prescribed here.

A more remote example is the Batch-Aware Distributed File System (BAD-FS) [BT+03]. This is a file system for large wide-area I/O intensive workloads such as cluster-based scientific applications. It is a top-down design with a simple declarative query language, which allows the scheduler to optimize communication, caching and replication by controlling both the placement and scheduling of jobs. Although not described this way, it has the usual phases: a parser, query planner and optimizer, and an execution engine. It also provides a variation of views. As with SQL, the declarative nature is critical for enabling optimizations. This project exhibits the proposed methodology in part because it has members from both the database and systems communities.

Although harder to show, many other systems fit this model of applying the principles without the artifacts. These include workflow systems, which have a query language and data independence, XML databases, and the emerging field of bioinformatics. All of these systems have top-down designs that do not map well on SQL and existing database semantics. The most common approach is to “make” them fit, however awkward that may be. A clean top-down design, as in the case of logging above, would lead to a different implementation that is simpler, cleaner, and presumably more reliable and a better fit.

In the end, the hope is that projects on the “systems” side will benefit from top-down thinking, well-defined semantics, and declarative languages that leave room for optimization. Conversely, the hope on the “database” side would be for more modular and layered designs that are more flexible than current (monolithic) designs, and thus more useful for new kinds of systems. It is not clear that such layering is possible, but there is some evidence in the form of Berkeley DB and some of the novel uses of Postgres, such as the logging system.

Acknowledgments: We would like to thank Joe Hellerstein, Adam Sah, Mike Stonebraker, Remzi Arpaci-Dusseau, and many great Inktomi employees including Brian Totty, Paul Gauthier, Kevin Brown, Doug Cook, Eric Baldeschweiler, and Ken Lutz.

[Apa01] The Apache Web Server. http://www.apache.org.
[BEA01] The BEA WebLogic Server Datasheet. http://www.bea.com
[BHM90] P. A. Bernstein, M. Hsu and B. Mann. “Implementing Recoverable Requests Using Queues.” Proc. of ACM SIGMOD. Atlantic City, NJ. 1990.
[Bre01] E. Brewer. “Lessons from Giant-Scale Services.” IEEE Internet Computing 5(4): 46-55, April 2001. http://www.cs.berkeley.edu/~brewer/papers/GiantScale.pdf
[BT+03] J. Bent, D. Thain, A. Arpaci-Dusseau, R. Arpaci-Dusseau, and M. Livny. “Explicit Control in the Batch-Aware Distributed File System.” Proc. of SOSP 2003. October 2003.
[C+97] C. Yoshikawa et al. “Using Smart Clients to Build Scalable Services.” Proc. of the Usenix Annual Technical Conference. Berkeley, CA, Jan. 1997.
[CG99] S. Chaudhuri and L. Gravano. “Evaluating Top-k Selection Queries.” Proc. VLDB Conference, 1999. http://citeseer.nj.nec.com/chaudhuri99evaluating.html
[CGK01] Z. Chen, J. Gehrke, and F. Korn. “Query optimization in compressed database systems.” Proc. ACM SIGMOD 2001. http://citeseer.nj.nec.com/chen01query.html
[Cha+81] D. Chamberlin et al. “A history and evaluation of System R.” Communications of the ACM, 24(10), pp. 632–646, October 1981.
[CK98] M. J. Carey and D. Kossmann. “Reducing the braking distance of an SQL query engine.” In Proceedings of the 24th VLDB Conference, pp. 158–169, New York, NY, August 1998. http://citeseer.nj.nec.com/carey98reducing.html
[DGNP88] F. Darema, D. A. George, V. A. Norton, and G. F. Pfister. “A single-program-multiple-data computational model for epex/fortran.” Parallel Computing, 5(7), 1988.
[FB99] A. Fox and E. A. Brewer. “Harvest, Yield, and Scalable Tolerant Systems.” Proc. of HotOS-VII. March 1999.
[FGCB97] A. Fox, S. D. Gribble, Y. Chawathe and E. Brewer. “Scalable Network Services.” Proc. of the 16th SOSP, St. Malo, France, October 1997.
[Fou01] Foundry Networks ServerIron Switch. http://www.foundrynet.com/
[GBH+00] S. Gribble, E. Brewer, J. M. Hellerstein, and D. Culler. “Scalable, Distributed Data Structures for Internet Service Construction.” Proc. of OSDI 2000, October 2000.
[GGL03] S. Ghemawat, H. Gobioff, and S.-T. Leung. “The Google File System.” Proc. of SOSP 2003. October 2003.
[GL02] S. Gilbert and N. Lynch. “Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services.” Sigact News, 33(2), June 2002.
[GR97] J. Gray and A. Reuter. Transaction Processing. Morgan-Kaufman, 1997.
[Gra95] G. Graefe. “The Cascades framework for query optimization.” Data Engineering Bulletin, 18(3):19–29, September 1995.
[Gri00] S. Gribble. A Design Framework and a Scalable Storage Platform to Simplify Internet Service Construction. Ph.D. Dissertation, UC Berkeley, September 2000.
[GWv+01] S. Gribble, M. Welsh, R. von Behren, E. Brewer, D. Culler, N. Borisov, S. Czerwinski, R. Gummadi, J. Hill, A. Joseph, R. H. Katz, Z. M. Mao, S. Ross, and B. Zhao. “The Ninja Architecture for Robust Internet-Scale Systems and Services.” Journal of Computer Networks, March 2001.
[HD90] H. I. Hsiao and D. DeWitt. “Chained Declustering: A New Availability Strategy for Multiprocessor Database Machines.” Proc. of the 6th International Data Engineering Conference. February 1990.
[Her91] Maurice Herlihy. A Methodology for Implementing Highly Concurrent Data Objects. Technical Report CRL 91/10. Digital Equipment Corporation, October 1991.
[LAC+96] B. Liskov, A. Adya, M. Castro, S. Ghemawat, R. Gruber, U. Maheshwari, A. C. Myers, M. Day and L. Shrira. “Safe and efficient sharing of persistent objects in Thor.” Proc. of ACM SIGMOD, pp. 318–329, 1996.
[LKF04] B. C. Ling, E. Kiciman, and A. Fox. “Session State: Beyond Soft State.” Proceedings of Networked Systems Design and Implementation (NSDI '04), San Francisco, CA, March 2004.
[LML+96] R. Larson, J. McDonough, P. O'Leary, L. Kuntz and R. Moon. “Cheshire II: Designing a Next-Generation Online Catalog.” Journal for the American Society for Information Science, 47(7), pp. 555–567. July 1996.
[ML86] L. F. Mackert and G. M. Lohman. “R* optimizer validation and performance evaluation for local queries.” Proceedings of SIGMOD 1986, pp. 84–95, 1986.
[Mov99] MovieFone Corporation. “MovieFone Announces Preliminary Results From First Day of Star Wars Advance Ticket Sales.” Company Press Release, Business Wire, May 13, 1999.
[PAB+98] V. S. Pai, M. Aron, G. Banga, M. Svendsen, P. Druschel, W. Zwaenepoel, and E. Nahum. “Locality-Aware Request Distribution in Cluster-based Network Servers.” Proc. of ASPLOS 1998. San Jose, CA, October 1998.
[PDZ99] V. S. Pai, P. Druschel, and W. Zwaenepoel. “Flash: An efficient and portable Web server.” Proc. of the 1999 Annual USENIX Technical Conference, June 1999.
[PHH92] H. Pirahesh, J. M. Hellerstein, and W. Hasan. “Extensible/Rule Based Query Rewrite Optimization in Starburst.” Proc. of SIGMOD 1992, pp. 39-48. June 1992.
[RN02] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, 2002.
[Sah02] A. Sah. “A New Architecture for Managing Enterprise Log Data.” Proc. of LISA 2002. November 2002.
[Sal89] G. Salton. Automatic Text Processing: The transformation, analysis, and retrieval of information by computer. Addison-Wesley, 1989.
[SAS+96] J. Sidell, P. M. Aoki, A. Sah, C. Staelin, M. Stonebraker, and A. Yu. “Data Replication in Mariposa.” Proc. of the 12th International Conference on Data Engineering. February 1996.
[SBL99] Y. Saito, B. Bershad and H. Levy. “Manageability, Availability and Performance in Porcupine: A Highly Scalable, Cluster-based Mail Service.” Proc. of the 17th SOSP. October 1999.
[Sel+79] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie and G. T. Price. “Access path selection in a relational database system.” Proc. of SIGMOD 1979. Boston, MA. pp. 22–34. June 1979.
[SJGP90] M. Stonebraker, A. Jhingran, J. Goh, and S. Potamianos. “On rules, procedure, caching and views in data base systems.” Proc. of the 1990 ACM SIGMOD International Conference on Management of Data. June 1990.
[WCB01] M. Welsh, D. Culler and E. Brewer. “SEDA: An Architecture for Well-Conditioned, Scalable Internet Services.” Proc. of the 18th SOSP. October 2001.
[WS00] L. A. Wald and S. Schwarz. “The 1999 Southern California Seismic Network Bulletin.” Seismological Research Letters, 71(4), July/August 2000.
[ZBCS99] X. Zhang, M. Barrientos, J. B. Chen, and M. Seltzer. “HACC: An Architecture for Cluster-Based Web Servers.” Proc. of the 3rd USENIX Windows NT Symposium, Seattle, WA, July 1999.

Appendix: Scoring

In this section we present a simple but representative scoring algorithm. Most of the research for current search engines is on improving the scoring algorithms or adding new components to the scoring systems, such as popularity metrics or incoming link counts.

We define a query as a set of words and their corresponding weights (W_i):
Query Q ≡ { [w_1, w_2, ..., w_k], W_i }    (3)

The score of a document for query Q is the weighted sum of an overall score for the document and a score for each word in the query:

Score(Q, d) ≡ c_1 · Quality(d) + c_2 · Σ_i W_i · Score(w_i, d)    (4)

The document quality term is independent of the query words and reflects things like length (shorter is better), popularity, incoming links, quality of the containing site, and external reviews.

The use of weighted sums for scoring is very common in information retrieval [Sal89] and this one is loosely based on Cheshire II [LML+96]. It has several advantages over more complex formulas: it is easy to compute, it can represent multiplication by using logarithms within components (commonly done), and the weights can be found using statistical regression (typically from human judgements on relevance). To simplify query execution, we define:^8

Score(w_i, d) ≡ 0 if w_i ∉ d    (5)

We don't actually require that Σ_i W_i = 1 and it is useful to modify the weights individually at query time. Since we only care about the relative scoring within one query, there is no particular meaning to the sum of the weights. Nor do the words need to be unique; in fact, entering the same word twice usually gives it twice the weight.

The word score can be further broken down:

Score(w_i, d) ≡ c_3 · f(w_i, d) + c_4 · g(w_i) + c_5    (6)

where f captures the relevance of the word in this document, and g captures the properties of the word in the overall corpus. For example, the specific version from Cheshire II is essentially [LML+96]:

Score(Q, d) ≡ -0.0674 · length(d) + (1/M) · Σ_{i=1..M} ( 0.679 · log Freq(w_i, d) + 0.223 · log IDF(w_i) )

The first term is Quality(d) and the second term is the weighted sum, with even weights, of equation (6), where f ≡ log Freq(w_i, d) is the log of the count of w_i in d, and g ≡ log IDF(w_i) is the log of the inverse document frequency of w_i, which is one divided by the fraction of documents in which this word appears.

The scoring for AND and OR is trivial: just sum up the scores for the matching words. For example, (a AND b) has the same score as (a OR b), although the AND will usually return fewer documents.

8: Words that are in “anchor text” that point to the document are considered part of the document.
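As a minimal executable sketch of this appendix, the following computes the score as the weighted sum of a document-quality term and per-word terms, with absent words contributing zero as in equation (5). The constants c_1 and c_2 and the per-document statistics are illustrative assumptions; only the Cheshire II coefficients are taken from the formula above.

    import math

    def word_score(freq_in_doc, doc_freq_fraction):
        """Equation (6) instantiated with the Cheshire II-style components:
        f = log Freq(w, d), g = log IDF(w) = log(1 / fraction of docs with w)."""
        if freq_in_doc == 0:
            return 0.0                      # equation (5): word not in document
        return 0.679 * math.log(freq_in_doc) + 0.223 * math.log(1.0 / doc_freq_fraction)

    def score(query_weights, doc):
        """Equation (4): c1*Quality(d) + c2 * sum_i W_i * Score(w_i, d).
        c1, c2 and the document fields are assumptions for illustration."""
        c1, c2 = 1.0, 1.0
        quality = -0.0674 * doc["length"]   # shorter is better, as in the appendix
        s = sum(w * word_score(doc["tf"].get(term, 0), doc["df"].get(term, 1.0))
                for term, w in query_weights.items())
        return c1 * quality + c2 * s

    if __name__ == "__main__":
        doc = {"length": 120,
               "tf": {"new": 3, "york": 2},          # term counts in this document
               "df": {"new": 0.10, "york": 0.02}}    # fraction of corpus containing term
        print(score({"new": 1.0, "york": 1.0}, doc))

Since only relative scores within one query matter, the absolute value returned here is meaningless on its own; it is only used to order documents.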
The Anatomy of a Large-Scale Hypertextual
Web Search Engine
Sergey Brin and Lawrence Page

Computer Science Department,


Stanford University, Stanford, CA 94305, USA
sergey@cs.stanford.edu and page@cs.stanford.edu

Abstract
In this paper, we present Google, a prototype of a large-scale search engine which makes heavy
use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently
and produce much more satisfying search results than existing systems. The prototype with a full
text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/
To engineer a search engine is a challenging task. Search engines index tens to hundreds of
millions of web pages involving a comparable number of distinct terms. They answer tens of
millions of queries every day. Despite the importance of large-scale search engines on the web,
very little academic research has been done on them. Furthermore, due to rapid advance in
technology and web proliferation, creating a web search engine today is very different from three
years ago. This paper provides an in-depth description of our large-scale web search engine -- the
first such detailed public description we know of to date. Apart from the problems of scaling
traditional search techniques to data of this magnitude, there are new technical challenges involved
with using the additional information present in hypertext to produce better search results. This
paper addresses this question of how to build a practical large-scale system which can exploit the
additional information present in hypertext. Also we look at the problem of how to effectively deal
with uncontrolled hypertext collections where anyone can publish anything they want.

Keywords
World Wide Web, Search Engines, Information Retrieval, PageRank, Google

1. Introduction
(Note: There are two versions of this paper -- a longer full version and a shorter printed version. The
full version is available on the web and the conference CD-ROM.)
The web creates new challenges for information retrieval. The amount of information on the web is
growing rapidly, as well as the number of new users inexperienced in the art of web research. People are
likely to surf the web using its link graph, often starting with high quality human maintained indices
such as Yahoo! or with search engines. Human maintained lists cover popular topics effectively but are
subjective, expensive to build and maintain, slow to improve, and cannot cover all esoteric topics.
Automated search engines that rely on keyword matching usually return too many low quality matches.
To make matters worse, some advertisers attempt to gain people’s attention by taking measures meant to
mislead automated search engines. We have built a large-scale search engine which addresses many of
the problems of existing systems. It makes especially heavy use of the additional structure present in
hypertext to provide much higher quality search results. We chose our system name, Google, because it
is a common spelling of googol, or 10^100, and fits well with our goal of building very large-scale search
engines.

1.1 Web Search Engines -- Scaling Up: 1994 - 2000


Search engine technology has had to scale dramatically to keep up with the growth of the web. In 1994,
one of the first web search engines, the World Wide Web Worm (WWWW) [McBryan 94] had an index
of 110,000 web pages and web accessible documents. As of November, 1997, the top search engines
claim to index from 2 million (WebCrawler) to 100 million web documents (from Search Engine
Watch). It is foreseeable that by the year 2000, a comprehensive index of the Web will contain over a
billion documents. At the same time, the number of queries search engines handle has grown incredibly
too. In March and April 1994, the World Wide Web Worm received an average of about 1500 queries
per day. In November 1997, Altavista claimed it handled roughly 20 million queries per day. With the
increasing number of users on the web, and automated systems which query search engines, it is likely
that top search engines will handle hundreds of millions of queries per day by the year 2000. The goal of
our system is to address many of the problems, both in quality and scalability, introduced by scaling
search engine technology to such extraordinary numbers.

1.2. Google: Scaling with the Web


Creating a search engine which scales even to today’s web presents many challenges. Fast crawling
technology is needed to gather the web documents and keep them up to date. Storage space must be used
efficiently to store indices and, optionally, the documents themselves. The indexing system must process
hundreds of gigabytes of data efficiently. Queries must be handled quickly, at a rate of hundreds to
thousands per second.

These tasks are becoming increasingly difficult as the Web grows. However, hardware performance and
cost have improved dramatically to partially offset the difficulty. There are, however, several notable
exceptions to this progress such as disk seek time and operating system robustness. In designing Google,
we have considered both the rate of growth of the Web and technological changes. Google is designed to
scale well to extremely large data sets. It makes efficient use of storage space to store the index. Its data
structures are optimized for fast and efficient access (see section 4.2). Further, we expect that the cost to
index and store text or HTML will eventually decline relative to the amount that will be available (see
Appendix B). This will result in favorable scaling properties for centralized systems like Google.

1.3 Design Goals


1.3.1 Improved Search Quality

Our main goal is to improve the quality of web search engines. In 1994, some people believed that a
complete search index would make it possible to find anything easily. According to Best of the Web
1994 -- Navigators, "The best navigation service should make it easy to find almost anything on the
Web (once all the data is entered)." However, the Web of 1997 is quite different. Anyone who has used
a search engine recently, can readily testify that the completeness of the index is not the only factor in
the quality of search results. "Junk results" often wash out any results that a user is interested in. In fact,
as of November 1997, only one of the top four commercial search engines finds itself (returns its own
search page in response to its name in the top ten results). One of the main causes of this problem is that
the number of documents in the indices has been increasing by many orders of magnitude, but the user’s
ability to look at documents has not. People are still only willing to look at the first few tens of results.
Because of this, as the collection size grows, we need tools that have very high precision (number of
relevant documents returned, say in the top tens of results). Indeed, we want our notion of "relevant" to
only include the very best documents since there may be tens of thousands of slightly relevant
documents. This very high precision is important even at the expense of recall (the total number of
relevant documents the system is able to return). There is quite a bit of recent optimism that the use of
more hypertextual information can help improve search and other applications [Marchiori 97] [Spertus
97] [Weiss 96] [Kleinberg 98]. In particular, link structure [Page 98] and link text provide a lot of
information for making relevance judgments and quality filtering. Google makes use of both link
structure and anchor text (see Sections 2.1 and 2.2).

1.3.2 Academic Search Engine Research

Aside from tremendous growth, the Web has also become increasingly commercial over time. In 1993,
1.5% of web servers were on .com domains. This number grew to over 60% in 1997. At the same time,
search engines have migrated from the academic domain to the commercial. Up until now most search
engine development has gone on at companies with little publication of technical details. This causes
search engine technology to remain largely a black art and to be advertising oriented (see Appendix A).
With Google, we have a strong goal to push more development and understanding into the academic
realm.

Another important design goal was to build systems that reasonable numbers of people can actually use.
Usage was important to us because we think some of the most interesting research will involve
leveraging the vast amount of usage data that is available from modern web systems. For example, there
are many tens of millions of searches performed every day. However, it is very difficult to get this data,
mainly because it is considered commercially valuable.

Our final design goal was to build an architecture that can support novel research activities on
large-scale web data. To support novel research uses, Google stores all of the actual documents it crawls
in compressed form. One of our main goals in designing Google was to set up an environment where
other researchers can come in quickly, process large chunks of the web, and produce interesting results
that would have been very difficult to produce otherwise. In the short time the system has been up, there
have already been several papers using databases generated by Google, and many others are underway.
Another goal we have is to set up a Spacelab-like environment where researchers or even students can
propose and do interesting experiments on our large-scale web data.

2. System Features
The Google search engine has two important features that help it produce high precision results. First, it
makes use of the link structure of the Web to calculate a quality ranking for each web page. This ranking
is called PageRank and is described in detail in [Page 98]. Second, Google utilizes link text to improve
search results.

2.1 PageRank: Bringing Order to the Web


The citation (link) graph of the web is an important resource that has largely gone unused in existing
web search engines. We have created maps containing as many as 518 million of these hyperlinks, a
significant sample of the total. These maps allow rapid calculation of a web page’s "PageRank", an
objective measure of its citation importance that corresponds well with people’s subjective idea of
importance. Because of this correspondence, PageRank is an excellent way to prioritize the results of
web keyword searches. For most popular subjects, a simple text matching search that is restricted to web
page titles performs admirably when PageRank prioritizes the results (demo available at
google.stanford.edu). For the type of full text searches in the main Google system, PageRank also helps
a great deal.

2.1.1 Description of PageRank Calculation

Academic citation literature has been applied to the web, largely by counting citations or backlinks to a
given page. This gives some approximation of a page’s importance or quality. PageRank extends this
idea by not counting links from all pages equally, and by normalizing by the number of links on a page.
PageRank is defined as follows:

We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d
is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are
more details about d in the next section. Also C(A) is defined as the number of links going
out of page A. The PageRank of a page A is given as follows:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the sum of all
web pages’ PageRanks will be one.

PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the
principal eigenvector of the normalized link matrix of the web. Also, a PageRank for 26 million web
pages can be computed in a few hours on a medium size workstation. There are many other details
which are beyond the scope of this paper.
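The iterative computation mentioned above is straightforward. The following is a minimal sketch of the recurrence PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)); the toy graph, iteration bound, and convergence threshold are assumptions for illustration.

    def pagerank(links, d=0.85, iterations=50, tol=1e-9):
        """links maps each page to the list of pages it links to.
        Iterates PR(A) = (1-d) + d * sum(PR(T)/C(T) over pages T linking to A)
        to a fixed point, as described in Section 2.1.1."""
        pages = set(links) | {p for targets in links.values() for p in targets}
        pr = {p: 1.0 for p in pages}
        for _ in range(iterations):
            new = {p: 1.0 - d for p in pages}
            for src, targets in links.items():
                if targets:                      # C(src) = number of outgoing links
                    share = d * pr[src] / len(targets)
                    for t in targets:
                        new[t] += share
            if max(abs(new[p] - pr[p]) for p in pages) < tol:
                pr = new
                break
            pr = new
        return pr

    if __name__ == "__main__":
        # Tiny hypothetical link graph: A and C both point to B, B points to A.
        graph = {"A": ["B"], "B": ["A"], "C": ["B"]}
        for page, rank in sorted(pagerank(graph).items()):
            print(page, round(rank, 3))

On this toy graph the ranks settle within a few dozen iterations; the paper reports that 26 million pages take a few hours on a medium size workstation.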

2.1.2 Intuitive Justification

PageRank can be thought of as a model of user behavior. We assume there is a "random surfer" who is
given a web page at random and keeps clicking on links, never hitting "back" but eventually gets bored
and starts on another random page. The probability that the random surfer visits a page is its PageRank.
And, the d damping factor is the probability at each page the "random surfer" will get bored and request
another random page. One important variation is to only add the damping factor d to a single page, or a
group of pages. This allows for personalization and can make it nearly impossible to deliberately
mislead the system in order to get a higher ranking. We have several other extensions to PageRank,
again see [Page 98].

Another intuitive justification is that a page can have a high PageRank if there are many pages that point
to it, or if there are some pages that point to it and have a high PageRank. Intuitively, pages that are well
cited from many places around the web are worth looking at. Also, pages that have perhaps only one
citation from something like the Yahoo! homepage are also generally worth looking at. If a page was not
high quality, or was a broken link, it is quite likely that Yahoo’s homepage would not link to it.
PageRank handles both these cases and everything in between by recursively propagating weights
through the link structure of the web.
2.2 Anchor Text


The text of links is treated in a special way in our search engine. Most search engines associate the text
of a link with the page that the link is on. In addition, we associate it with the page the link points to.
This has several advantages. First, anchors often provide more accurate descriptions of web pages than
the pages themselves. Second, anchors may exist for documents which cannot be indexed by a
text-based search engine, such as images, programs, and databases. This makes it possible to return web
pages which have not actually been crawled. Note that pages that have not been crawled can cause
problems, since they are never checked for validity before being returned to the user. In this case, the
search engine can even return a page that never actually existed, but had hyperlinks pointing to it.
However, it is possible to sort the results, so that this particular problem rarely happens.

This idea of propagating anchor text to the page it refers to was implemented in the World Wide Web
Worm [McBryan 94] especially because it helps search non-text information, and expands the search
coverage with fewer downloaded documents. We use anchor propagation mostly because anchor text
can help provide better quality results. Using anchor text efficiently is technically difficult because of
the large amounts of data which must be processed. In our current crawl of 24 million pages, we had
over 259 million anchors which we indexed.

2.3 Other Features


Aside from PageRank and the use of anchor text, Google has several other features. First, it has location
information for all hits and so it makes extensive use of proximity in search. Second, Google keeps track
of some visual presentation details such as font size of words. Words in a larger or bolder font are
weighted higher than other words. Third, full raw HTML of pages is available in a repository.

3 Related Work
Search research on the web has a short and concise history. The World Wide Web Worm (WWWW)
[McBryan 94] was one of the first web search engines. It was subsequently followed by several other
academic search engines, many of which are now public companies. Compared to the growth of the
Web and the importance of search engines there are precious few documents about recent search engines
[Pinkerton 94]. According to Michael Mauldin (chief scientist, Lycos Inc) [Mauldin], "the various
services (including Lycos) closely guard the details of these databases". However, there has been a fair
amount of work on specific features of search engines. Especially well represented is work which can
get results by post-processing the results of existing commercial search engines, or produce small scale
"individualized" search engines. Finally, there has been a lot of research on information retrieval
systems, especially on well controlled collections. In the next two sections, we discuss some areas where
this research needs to be extended to work better on the web.

3.1 Information Retrieval


Work in information retrieval systems goes back many years and is well developed [Witten 94].
However, most of the research on information retrieval systems is on small well controlled
homogeneous collections such as collections of scientific papers or news stories on a related topic.
Indeed, the primary benchmark for information retrieval, the Text Retrieval Conference [TREC 96],
uses a fairly small, well controlled collection for their benchmarks. The "Very Large Corpus"
benchmark is only 20GB compared to the 147GB from our crawl of 24 million web pages. Things that
work well on TREC often do not produce good results on the web. For example, the standard vector
space model tries to return the document that most closely approximates the query, given that both query
and document are vectors defined by their word occurrence. On the web, this strategy often returns very
short documents that are the query plus a few words. For example, we have seen a major search engine
return a page containing only "Bill Clinton Sucks" and a picture from a "Bill Clinton" query. Some argue
that on the web, users should specify more accurately what they want and add more words to their
query. We disagree vehemently with this position. If a user issues a query like "Bill Clinton" they should
get reasonable results since there is an enormous amount of high quality information available on this
topic. Given examples like these, we believe that the standard information retrieval work needs to be
extended to deal effectively with the web.

3.2 Differences Between the Web and Well Controlled Collections


The web is a vast collection of completely uncontrolled heterogeneous documents. Documents on the
web have extreme variation internal to the documents, and also in the external meta information that
might be available. For example, documents differ internally in their language (both human and
programming), vocabulary (email addresses, links, zip codes, phone numbers, product numbers), type or
format (text, HTML, PDF, images, sounds), and may even be machine generated (log files or output
from a database). On the other hand, we define external meta information as information that can be
inferred about a document, but is not contained within it. Examples of external meta information include
things like reputation of the source, update frequency, quality, popularity or usage, and citations. Not
only are the possible sources of external meta information varied, but the things that are being measured
vary many orders of magnitude as well. For example, compare the usage information from a major
homepage, like Yahoo’s which currently receives millions of page views every day with an obscure
historical article which might receive one view every ten years. Clearly, these two items must be treated
very differently by a search engine.

Another big difference between the web and traditional well controlled collections is that there is
virtually no control over what people can put on the web. Couple this flexibility to publish anything with
the enormous influence of search engines to route traffic, and companies which deliberately
manipulate search engines for profit become a serious problem. This problem has not been
addressed in traditional closed information retrieval systems. Also, it is interesting to note that metadata
efforts have largely failed with web search engines, because any text on the page which is not directly
represented to the user is abused to manipulate search engines. There are even numerous companies
which specialize in manipulating search engines for profit.

4 System Anatomy
First, we will provide a high level discussion of the architecture. Then, there is some in-depth
descriptions of important data structures. Finally, the major applications: crawling, indexing, and
searching will be examined in depth.
4.1 Google Architecture Overview


In this section, we will give a high level overview of how
the whole system works as pictured in Figure 1. Further
sections will discuss the applications and data structures
not mentioned in this section. Most of Google is
implemented in C or C++ for efficiency and can run in
either Solaris or Linux.

In Google, the web crawling (downloading of web pages)


is done by several distributed crawlers. There is a
URLserver that sends lists of URLs to be fetched to the
crawlers. The web pages that are fetched are then sent to
the storeserver. The storeserver then compresses and stores
the web pages into a repository. Every web page has an
associated ID number called a docID which is assigned
whenever a new URL is parsed out of a web page. The
indexing function is performed by the indexer and the
sorter. The indexer performs a number of functions. It reads
the repository, uncompresses the documents, and parses them. Each document is converted into a set of
word occurrences called hits. The hits record the word, position in document, an approximation of font
size, and capitalization. The indexer distributes these hits into a set of "barrels", creating a partially
sorted forward index. The indexer performs another important function. It parses out all the links in
every web page and stores important information about them in an anchors file. This file contains
enough information to determine where each link points from and to, and the text of the link.

Figure 1. High Level Google Architecture

The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into
docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points
to. It also generates a database of links which are pairs of docIDs. The links database is used to compute
PageRanks for all the documents.

The sorter takes the barrels, which are sorted by docID (this is a simplification, see Section 4.2.5), and
resorts them by wordID to generate the inverted index. This is done in place so that little temporary
space is needed for this operation. The sorter also produces a list of wordIDs and offsets into the
inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the
indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and
uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer
queries.

4.2 Major Data Structures


Google’s data structures are optimized so that a large document collection can be crawled, indexed, and
searched with little cost. Although CPUs and bulk input output rates have improved dramatically over
the years, a disk seek still requires about 10 ms to complete. Google is designed to avoid disk seeks
whenever possible, and this has had a considerable influence on the design of the data structures.
4.2.1 BigFiles

BigFiles are virtual files spanning multiple file systems and are addressable by 64 bit integers. The
allocation among multiple file systems is handled automatically. The BigFiles package also handles
allocation and deallocation of file descriptors, since the operating systems do not provide enough for our
needs. BigFiles also support rudimentary compression options.

4.2.2 Repository

The repository contains the full HTML of every web page.


Each page is compressed using zlib (see RFC1950). The
choice of compression technique is a tradeoff between speed
and compression ratio. We chose zlib’s speed over a
significant improvement in compression offered by bzip. The
compression rate of bzip was approximately 4 to 1 on the
repository as compared to zlib’s 3 to 1 compression. In the
repository, the documents are stored one after the other and
are prefixed by docID, length, and URL as can be seen in
Figure 2. The repository requires no other data structures to be used in order to access it. This helps with
data consistency and makes development much easier; we can rebuild all the other data structures from
only the repository and a file which lists crawler errors.

Figure 2. Repository Data Structure
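A minimal sketch of such a repository file follows; the field widths and byte order are assumptions, since the paper specifies only that records are zlib-compressed and prefixed by docID, length, and URL. Records are appended one after the other and can be re-scanned sequentially to rebuild other data structures.

    import struct
    import zlib

    def append_doc(f, doc_id, url, html):
        """Append one record: docID, compressed length, URL length, URL, zlib(HTML)."""
        url_bytes = url.encode()
        payload = zlib.compress(html.encode())
        f.write(struct.pack("<QII", doc_id, len(payload), len(url_bytes)))
        f.write(url_bytes)
        f.write(payload)

    def scan(f):
        """Sequentially yield (docID, url, html); enough to rebuild other structures."""
        header = struct.Struct("<QII")
        while True:
            head = f.read(header.size)
            if len(head) < header.size:
                return
            doc_id, payload_len, url_len = header.unpack(head)
            url = f.read(url_len).decode()
            html = zlib.decompress(f.read(payload_len)).decode()
            yield doc_id, url, html

    if __name__ == "__main__":
        import io
        buf = io.BytesIO()
        append_doc(buf, 1, "http://example.com/", "<html>hello</html>")
        buf.seek(0)
        print(list(scan(buf)))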

4.2.3 Document Index

The document index keeps information about each document. It is a fixed width ISAM (Index sequential
access mode) index, ordered by docID. The information stored in each entry includes the current
document status, a pointer into the repository, a document checksum, and various statistics. If the
document has been crawled, it also contains a pointer into a variable width file called docinfo which
contains its URL and title. Otherwise the pointer points into the URLlist which contains just the URL.
This design decision was driven by the desire to have a reasonably compact data structure, and the
ability to fetch a record in one disk seek during a search.

Additionally, there is a file which is used to convert URLs into docIDs. It is a list of URL checksums
with their corresponding docIDs and is sorted by checksum. In order to find the docID of a particular
URL, the URL’s checksum is computed and a binary search is performed on the checksums file to find
its docID. URLs may be converted into docIDs in batch by doing a merge with this file. This is the
technique the URLresolver uses to turn URLs into docIDs. This batch mode of update is crucial because
otherwise we must perform one seek for every link which assuming one disk would take more than a
month for our 322 million link dataset.
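A minimal sketch of that lookup structure follows; the checksum function and the in-memory sorted list are assumptions, since the real file lives on disk. A single URL is resolved by binary search over the sorted checksums, while a batch of URLs is sorted and resolved in one merge pass, as the URLresolver does.

    import bisect
    import zlib

    def checksum(url):
        # Stand-in checksum; the paper does not specify the function used.
        return zlib.crc32(url.encode())

    def build_index(url_to_docid):
        """Sorted list of (checksum, docID), analogous to the checksums file."""
        return sorted((checksum(u), d) for u, d in url_to_docid.items())

    def lookup(index, url):
        """Resolve one URL with a binary search on the sorted checksums."""
        c = checksum(url)
        i = bisect.bisect_left(index, (c, -1))
        return index[i][1] if i < len(index) and index[i][0] == c else None

    def batch_lookup(index, urls):
        """Resolve many URLs with a single merge pass over the sorted file."""
        wanted = sorted((checksum(u), u) for u in urls)
        out, i = {}, 0
        for c, u in wanted:
            while i < len(index) and index[i][0] < c:
                i += 1
            out[u] = index[i][1] if i < len(index) and index[i][0] == c else None
        return out

    if __name__ == "__main__":
        idx = build_index({"http://a.example/": 1, "http://b.example/": 2})
        print(lookup(idx, "http://b.example/"))
        print(batch_lookup(idx, ["http://a.example/", "http://c.example/"]))

The merge pass is what makes batch conversion cheap: both the request list and the checksums file are in sorted order, so the whole batch costs one sequential scan instead of one seek per link.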

4.2.4 Lexicon

The lexicon has several different forms. One important change from earlier systems is that the lexicon
can fit in memory for a reasonable price. In the current implementation we can keep the lexicon in
memory on a machine with 256 MB of main memory. The current lexicon contains 14 million words
(though some rare words were not added to the lexicon). It is implemented in two parts -- a list of the
words (concatenated together but separated by nulls) and a hash table of pointers. For various functions,
the list of words has some auxiliary information which is beyond the scope of this paper to explain fully.
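A minimal sketch of that layout follows; the hash structure is an assumption. The words are concatenated with null separators into one buffer, and a table maps each word to its wordID and byte offset, so lookups never leave memory.

    class Lexicon:
        """Null-separated word list plus a hash table of offsets (here a dict).
        The wordID is simply the word's ordinal position in the list."""
        def __init__(self, words):
            self.blob = bytearray()
            self.offsets = {}            # word -> (wordID, byte offset into blob)
            for word_id, w in enumerate(words):
                self.offsets[w] = (word_id, len(self.blob))
                self.blob += w.encode() + b"\x00"

        def word_id(self, word):
            entry = self.offsets.get(word)
            return entry[0] if entry else None

        def word_at(self, offset):
            end = self.blob.index(b"\x00", offset)
            return self.blob[offset:end].decode()

    if __name__ == "__main__":
        lex = Lexicon(["new", "york", "database"])
        wid, off = lex.offsets["york"]
        print(wid, lex.word_at(off))     # -> 1 york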

4.2.5 Hit Lists

A hit list corresponds to a list of occurrences of a particular word in a particular document including
position, font, and capitalization information. Hit lists account for most of the space used in both the
forward and the inverted indices. Because of this, it is important to represent them as efficiently as
possible. We considered several alternatives for encoding position, font, and capitalization -- simple
encoding (a triple of integers), a compact encoding (a hand optimized allocation of bits), and Huffman
coding. In the end we chose a hand optimized compact encoding since it required far less space than the
simple encoding and far less bit manipulation than Huffman coding. The details of the hits are shown in
Figure 3.

Our compact encoding uses two bytes for every hit. There are two types of hits: fancy hits and plain hits.
Fancy hits include hits occurring in a URL, title, anchor text, or meta tag. Plain hits include everything
else. A plain hit consists of a capitalization bit, font size, and 12 bits of word position in a document (all
positions higher than 4095 are labeled 4096). Font size is represented relative to the rest of the document
using three bits (only 7 values are actually used because 111 is the flag that signals a fancy hit). A fancy
hit consists of a capitalization bit, the font size set to 7 to indicate it is a fancy hit, 4 bits to encode the
type of fancy hit, and 8 bits of position. For anchor hits, the 8 bits of position are split into 4 bits for
position in anchor and 4 bits for a hash of the docID the anchor occurs in. This gives us some limited
phrase searching as long as there are not that many anchors for a particular word. We expect to update
the way that anchor hits are stored to allow for greater resolution in the position and docIDhash fields.
We use font size relative to the rest of the document because when searching, you do not want to rank
otherwise identical documents differently just because one of the documents is in a larger font.
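A minimal sketch of the two-byte plain-hit encoding follows; the bit ordering within the 16-bit word is an assumption, since the paper fixes only the field widths: one capitalization bit, three bits of relative font size with the value 7 reserved as the fancy-hit flag, and twelve bits of word position.

    FANCY_FONT = 7          # font-size value 111 signals a fancy hit

    def pack_plain_hit(capitalized, font_size, position):
        """Pack a plain hit into 16 bits: 1 cap bit, 3 font bits, 12 position bits."""
        assert 0 <= font_size < FANCY_FONT, "font sizes 0-6; 7 is the fancy flag"
        position = min(position, 4095)   # positions past 4095 collapse to one sentinel
        return (int(capitalized) << 15) | (font_size << 12) | position

    def unpack_plain_hit(hit):
        return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF

    if __name__ == "__main__":
        h = pack_plain_hit(True, 3, 57)
        print(hex(h), unpack_plain_hit(h))   # -> (True, 3, 57)

A fancy hit would set the font field to 7 and reinterpret the remaining bits as a 4-bit type and 8-bit position, as described above.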

The length of a hit list is stored before the hits themselves.


To save space, the length of the hit list is combined with the
wordID in the forward index and the docID in the inverted
index. This limits it to 8 and 5 bits respectively (there are
some tricks which allow 8 bits to be borrowed from the
wordID). If the length is longer than would fit in that many
bits, an escape code is used in those bits, and the next two
bytes contain the actual length.

4.2.6 Forward Index

The forward index is actually already partially sorted. It is


stored in a number of barrels (we used 64). Each barrel
holds a range of wordID’s. If a document contains words
that fall into a particular barrel, the docID is recorded into
the barrel, followed by a list of wordID’s with hitlists which
correspond to those words. This scheme requires slightly
more storage because of duplicated docIDs but the
difference is very small for a reasonable number of buckets and saves considerable time and coding
complexity in the final indexing phase done by the sorter. Furthermore, instead of storing actual
wordID’s, we store each wordID as a relative difference from the minimum wordID that falls into the
barrel the wordID is in. This way, we can use just 24 bits for the wordID’s in the unsorted barrels,
leaving 8 bits for the hit list length.

Figure 3. Forward and Reverse Indexes and the Lexicon
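A minimal sketch of that packing follows; the field order and the handling of overlong hit lists are assumptions. Each forward-barrel entry stores the wordID as a 24-bit difference from the barrel's minimum wordID, leaving 8 bits for the hit-list length, with an escape value for lists that do not fit.

    ESCAPE = 0xFF          # hit counts that do not fit in 8 bits use an escape code

    def pack_entry(word_id, barrel_min_word_id, hit_count):
        """Pack (relative wordID, hit count) into 32 bits: 24 + 8."""
        delta = word_id - barrel_min_word_id
        assert 0 <= delta < (1 << 24), "wordID must fall within this barrel's range"
        short_count = hit_count if hit_count < ESCAPE else ESCAPE
        packed = (delta << 8) | short_count
        overflow = hit_count if short_count == ESCAPE else None   # stored separately
        return packed, overflow

    def unpack_entry(packed, barrel_min_word_id, overflow=None):
        delta, short_count = packed >> 8, packed & 0xFF
        count = overflow if short_count == ESCAPE else short_count
        return barrel_min_word_id + delta, count

    if __name__ == "__main__":
        p, extra = pack_entry(word_id=1_000_123, barrel_min_word_id=1_000_000, hit_count=5)
        print(unpack_entry(p, 1_000_000, extra))    # -> (1000123, 5)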

4.2.7 Inverted Index

The inverted index consists of the same barrels as the forward index, except that they have been
processed by the sorter. For every valid wordID, the lexicon contains a pointer into the barrel that
wordID falls into. It points to a doclist of docID’s together with their corresponding hit lists. This doclist
represents all the occurrences of that word in all documents.

An important issue is in what order the docID’s should appear in the doclist. One simple solution is to
store them sorted by docID. This allows for quick merging of different doclists for multiple word
queries. Another option is to store them sorted by a ranking of the occurrence of the word in each
document. This makes answering one word queries trivial and makes it likely that the answers to
multiple word queries are near the start. However, merging is much more difficult. Also, this makes
development much more difficult in that a change to the ranking function requires a rebuild of the index.
We chose a compromise between these options, keeping two sets of inverted barrels -- one set for hit
lists which include title or anchor hits and another set for all hit lists. This way, we check the first set of
barrels first and if there are not enough matches within those barrels we check the larger ones.
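A minimal sketch of multi-word matching over docID-sorted doclists follows; the dictionaries standing in for barrels, and the reuse of the 40,000-result threshold from Section 4.5, are assumptions for illustration. The title/anchor barrels are intersected first, and the full barrels are consulted only if too few documents match.

    def intersect(doclists):
        """Merge docID-sorted doclists; return docIDs present in every list."""
        iters = [iter(dl) for dl in doclists]
        current = [next(it, None) for it in iters]
        out = []
        while all(d is not None for d in current):
            hi = max(current)
            if all(d == hi for d in current):
                out.append(hi)
                current = [next(it, None) for it in iters]
            else:
                current = [d if d == hi else next(it, None)
                           for d, it in zip(current, iters)]
        return out

    def search(words, short_barrels, full_barrels, enough=40_000):
        """Check the short (title/anchor) barrels first, then the full barrels."""
        docs = intersect([short_barrels.get(w, []) for w in words])
        if len(docs) < enough:
            full = intersect([full_barrels.get(w, []) for w in words])
            docs = sorted(set(docs) | set(full))
        return docs

    if __name__ == "__main__":
        short = {"new": [2, 9], "york": [9]}
        full = {"new": [2, 5, 9, 11], "york": [5, 9, 11]}
        print(search(["new", "york"], short, full))   # -> [5, 9, 11]

Keeping the doclists sorted by docID is what makes this intersection a single forward scan, which is the tradeoff discussed above against ordering by per-document rank.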

4.3 Crawling the Web


Running a web crawler is a challenging task. There are tricky performance and reliability issues and
even more importantly, there are social issues. Crawling is the most fragile application since it involves
interacting with hundreds of thousands of web servers and various name servers which are all beyond
the control of the system.

In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A
single URLserver serves lists of URLs to a number of crawlers (we typically ran about 3). Both the
URLserver and the crawlers are implemented in Python. Each crawler keeps roughly 300 connections
open at once. This is necessary to retrieve web pages at a fast enough pace. At peak speeds, the system
can crawl over 100 web pages per second using four crawlers. This amounts to roughly 600K per second
of data. A major performance stress is DNS lookup. Each crawler maintains its own DNS cache so it
does not need to do a DNS lookup before crawling each document. Each of the hundreds of connections
can be in a number of different states: looking up DNS, connecting to host, sending request, and
receiving response. These factors make the crawler a complex component of the system. It uses
asynchronous IO to manage events, and a number of queues to move page fetches from state to state.

It turns out that running a crawler which connects to more than half a million servers, and generates tens
of millions of log entries generates a fair amount of email and phone calls. Because of the vast number
of people coming on line, there are always those who do not know what a crawler is, because this is the
first one they have seen. Almost daily, we receive an email something like, "Wow, you looked at a lot of
pages from my web site. How did you like it?" There are also some people who do not know about the
robots exclusion protocol, and think their page should be protected from indexing by a statement like,
"This page is copyrighted and should not be indexed", which needless to say is difficult for web crawlers
to understand. Also, because of the huge amount of data involved, unexpected things will happen. For
example, our system tried to crawl an online game. This resulted in lots of garbage messages in the
middle of their game! It turns out this was an easy problem to fix. But this problem had not come up
until we had downloaded tens of millions of pages. Because of the immense variation in web pages and
servers, it is virtually impossible to test a crawler without running it on large part of the Internet.
Invariably, there are hundreds of obscure problems which may only occur on one page out of the whole
web and cause the crawler to crash, or worse, cause unpredictable or incorrect behavior. Systems which
access large parts of the Internet need to be designed to be very robust and carefully tested. Since large
complex systems such as crawlers will invariably cause problems, there needs to be significant resources
devoted to reading the email and solving these problems as they come up.

4.4 Indexing the Web


Parsing -- Any parser which is designed to run on the entire Web must handle a huge array of
possible errors. These range from typos in HTML tags to kilobytes of zeros in the middle of a tag,
non-ASCII characters, HTML tags nested hundreds deep, and a great variety of other errors that
challenge anyone’s imagination to come up with equally creative ones. For maximum speed,
instead of using YACC to generate a CFG parser, we use flex to generate a lexical analyzer which
we outfit with its own stack. Developing this parser which runs at a reasonable speed and is very
robust involved a fair amount of work.
Indexing Documents into Barrels -- After each document is parsed, it is encoded into a number
of barrels. Every word is converted into a wordID by using an in-memory hash table -- the lexicon.
New additions to the lexicon hash table are logged to a file. Once the words are converted into
wordID’s, their occurrences in the current document are translated into hit lists and are written into
the forward barrels. The main difficulty with parallelization of the indexing phase is that the
lexicon needs to be shared. Instead of sharing the lexicon, we took the approach of writing a log of
all the extra words that were not in a base lexicon, which we fixed at 14 million words. That way
multiple indexers can run in parallel and then the small log file of extra words can be processed by
one final indexer.
Sorting -- In order to generate the inverted index, the sorter takes each of the forward barrels and
sorts it by wordID to produce an inverted barrel for title and anchor hits and a full text inverted
barrel. This process happens one barrel at a time, thus requiring little temporary storage. Also, we
parallelize the sorting phase to use as many machines as we have simply by running multiple
sorters, which can process different buckets at the same time. Since the barrels don’t fit into main
memory, the sorter further subdivides them into baskets which do fit into memory based on
wordID and docID. Then the sorter loads each basket into memory, sorts it, and writes its contents
into the short inverted barrel and the full inverted barrel (a sketch of this indexing and sorting pipeline follows the list).
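The indexing and sorting steps above can be summarized as a rough, hypothetical sketch. The real barrels and lexicon are compact binary structures and the names here are illustrative only: words are resolved against a fixed base lexicon, unknown words are appended to a per-indexer log for a final pass, and each forward barrel is later range-partitioned into memory-sized baskets that are sorted one at a time.

    def index_document(doc_id, words, base_lexicon, extra_word_log, forward_barrel):
        # base_lexicon: fixed word -> wordID table (about 14 million entries)
        for pos, word in enumerate(words):
            wid = base_lexicon.get(word)
            if wid is None:
                extra_word_log.write(word + "\n")        # merged later by one final indexer
            else:
                forward_barrel.append((doc_id, wid, pos))  # simplified "hit"

    def sort_barrel(forward_barrel, num_baskets=16):
        # range-partition by wordID so each basket fits in memory, sort basket by basket;
        # concatenating the sorted baskets yields the barrel ordered by (wordID, docID)
        max_wid = max(wid for _, wid, _ in forward_barrel) + 1
        baskets = [[] for _ in range(num_baskets)]
        for doc_id, wid, pos in forward_barrel:
            baskets[wid * num_baskets // max_wid].append((wid, doc_id, pos))
        inverted = []
        for basket in baskets:
            inverted.extend(sorted(basket))
        return inverted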

4.5 Searching
The goal of searching is to provide quality search results efficiently. Many of the large commercial
search engines seemed to have made great progress in terms of efficiency. Therefore, we have focused
more on quality of search in our research, although we believe our solutions are scalable to commercial
volumes with a bit more effort. The Google query evaluation process is shown in Figure 4.

1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist go to step 4.
8. Sort the documents that have matched by rank and return the top k.

Figure 4. Google Query Evaluation

To put a limit on response time, once a certain number (currently 40,000) of matching documents are found, the searcher automatically goes to step 8 in Figure 4. This means that it is possible that sub-optimal results would be returned. We are currently investigating other ways to solve this problem. In the past, we sorted the hits according to PageRank, which seemed to improve the situation.

4.5.1 The Ranking System

Google maintains much more information about web documents than typical search engines. Every hitlist includes position, font, and capitalization information. Additionally, we factor in hits from anchor text and the PageRank of the document. Combining all of this information into a rank is difficult. We designed our ranking function so that no particular factor can have too much influence. First, consider the simplest case -- a single word query. In order to rank a document with a single word query, Google looks at that document's hit list for that word.
Google considers each hit to be one of several different types (title, anchor, URL, plain text large font,
plain text small font, ...), each of which has its own type-weight. The type-weights make up a vector
indexed by type. Google counts the number of hits of each type in the hit list. Then every count is
converted into a count-weight. Count-weights increase linearly with counts at first but quickly taper off
so that more than a certain count will not help. We take the dot product of the vector of count-weights
with the vector of type-weights to compute an IR score for the document. Finally, the IR score is
combined with PageRank to give a final rank to the document.
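As a hedged illustration only: the actual type-weights, count-weight taper, and PageRank mixing function are not published, so the numbers and the linear-cap taper below are placeholders. The single-word score reduces to a dot product of count-weights with type-weights, folded together with PageRank.

    TYPE_WEIGHTS = {"title": 10, "anchor": 8, "url": 6, "large_font": 3, "small_font": 1}

    def count_weight(n, cap=8):
        return min(n, cap)              # grows with the count at first, then stops helping

    def ir_score(hit_types):
        # hit_types: list of hit types from the document's hit list for the query word
        counts = {t: hit_types.count(t) for t in TYPE_WEIGHTS}
        return sum(count_weight(counts[t]) * TYPE_WEIGHTS[t] for t in TYPE_WEIGHTS)

    def final_rank(hit_types, pagerank, mix=0.5):
        # one possible combination; the real mixing function is unspecified
        return mix * ir_score(hit_types) + (1 - mix) * pagerank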

For a multi-word search, the situation is more complicated. Now multiple hit lists must be scanned
through at once so that hits occurring close together in a document are weighted higher than hits
occurring far apart. The hits from the multiple hit lists are matched up so that nearby hits are matched
together. For every matched set of hits, a proximity is computed. The proximity is based on how far
apart the hits are in the document (or anchor) but is classified into 10 different value "bins" ranging from
a phrase match to "not even close". Counts are computed not only for every type of hit but for every type
and proximity. Every type and proximity pair has a type-prox-weight. The counts are converted into
count-weights and we take the dot product of the count-weights and the type-prox-weights to compute
an IR score. All of these numbers and matrices can be displayed with the search results using a
special debug mode. These displays have been very helpful in developing the ranking system.
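The multi-word case extends the same idea to (type, proximity-bin) pairs. The sketch below assumes ten bins and hypothetical weights, since the paper does not publish the actual tables or the binning boundaries.

    def proximity_bin(distance, num_bins=10):
        # bin 0 ~ phrase match (adjacent words), bin 9 ~ "not even close"
        return min(distance, num_bins - 1)

    def multiword_ir_score(matched_hits, type_prox_weights):
        # matched_hits: list of (hit_type, distance_in_words) for matched-up sets of hits
        counts = {}
        for hit_type, dist in matched_hits:
            key = (hit_type, proximity_bin(dist))
            counts[key] = counts.get(key, 0) + 1
        # count-weights taper as before, then dot product with the type-prox-weights
        return sum(min(c, 8) * type_prox_weights.get(k, 1) for k, c in counts.items())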

4.5.2 Feedback

The ranking function has many parameters like the type-weights and the type-prox-weights. Figuring out
the right values for these parameters is something of a black art. In order to do this, we have a user
feedback mechanism in the search engine. A trusted user may optionally evaluate all of the results that
are returned. This feedback is saved. Then when we modify the ranking function, we can see the impact
of this change on all previous searches which were ranked. Although far from perfect, this gives us some
idea of how a change in the ranking function affects the search results.

5 Results and Performance

The most important measure of a search engine is the quality of its search results. While a complete user evaluation is beyond the scope of this paper, our own experience with Google has shown it to produce better results than the major commercial search engines for most searches. As an example which illustrates the use of PageRank, anchor text, and proximity, Figure 4 shows Google's results for a search on "bill clinton". These results demonstrate some of Google's features. The results are clustered by server. This helps considerably when sifting through result sets. A number of results are from the whitehouse.gov domain which is what one may reasonably expect from such a search. Currently, most major commercial search engines do not return any results from whitehouse.gov, much less the right ones. Notice that there is no title for the first result. This is because it was not crawled. Instead, Google relied on anchor text to determine this was a good answer to the query. Similarly, the fifth result is an email address which, of course, is not crawlable. It is also a result of anchor text.

All of the results are reasonably high quality pages and, at last check, none were broken links. This is largely because they all have high PageRank. The PageRanks are the percentages in red along with bar graphs. Finally, there are no results about a Bill other than Clinton or about a Clinton other than Bill. This is because we place heavy importance on the proximity of word occurrences. Of course a true test of the quality of a search engine would involve an extensive user study or results analysis which we do not have room for here. Instead, we invite the reader to try Google for themselves at http://google.stanford.edu.

Query: bill clinton

  http://www.whitehouse.gov/
    100.00% (no date) (0K)   http://www.whitehouse.gov/
  Office of the President
    99.67% (Dec 23 1996) (2K)   http://www.whitehouse.gov/WH/EOP/OP/html/OP_Home.html
  Welcome To The White House
    99.98% (Nov 09 1997) (5K)   http://www.whitehouse.gov/WH/Welcome.html
  Send Electronic Mail to the President
    99.86% (Jul 14 1997) (5K)   http://www.whitehouse.gov/WH/Mail/html/Mail_President.html
  mailto:president@whitehouse.gov
    99.98%
  mailto:President@whitehouse.gov
    99.27%
  The "Unofficial" Bill Clinton
    94.06% (Nov 11 1997) (14K)   http://zpub.com/un/un-bc.html
  Bill Clinton Meets The Shrinks
    86.27% (Jun 29 1997) (63K)   http://zpub.com/un/un-bc9.html
  President Bill Clinton - The Dark Side
    97.27% (Nov 10 1997) (15K)   http://www.realchange.org/clinton.htm
  $3 Bill Clinton
    94.73% (no date) (4K)   http://www.gatewy.net/~tjohnson/clinton1.html

Figure 4. Sample Results from Google

5.1 Storage Requirements



Aside from search quality, Google is designed to scale cost effectively to the size of the Web as it
grows. One aspect of this is to use storage efficiently. Table 1 has a breakdown of some statistics and
storage requirements of Google. Due to compression the total size of the repository is about 53 GB, just
over one third of the total data it stores. At current disk prices this makes the repository a relatively
cheap source of useful data. More importantly, the total of all the data used by the search engine requires
a comparable amount of storage, about 55 GB. Furthermore, most queries can be answered using just the
short inverted index. With better encoding and compression of the Document Index, a high quality web
search engine may fit onto a 7GB drive of a new PC.

Storage Statistics
  Total Size of Fetched Pages                 147.8 GB
  Compressed Repository                        53.5 GB
  Short Inverted Index                          4.1 GB
  Full Inverted Index                          37.2 GB
  Lexicon                                       293 MB
  Temporary Anchor Data (not in total)          6.6 GB
  Document Index Incl. Variable Width Data      9.7 GB
  Links Database                                3.9 GB
  Total Without Repository                     55.2 GB
  Total With Repository                       108.7 GB

Web Page Statistics
  Number of Web Pages Fetched        24 million
  Number of Urls Seen                76.5 million
  Number of Email Addresses           1.7 million
  Number of 404's                     1.6 million

Table 1. Statistics

5.2 System Performance

It is important for a search engine to crawl and index efficiently. This way information can be kept up to date and major changes to the system can be tested relatively quickly. For Google, the major operations are Crawling, Indexing, and Sorting. It is difficult to measure how long crawling took overall because disks filled up, name servers crashed, or any number of other problems which stopped the system. In total it took roughly 9 days to download the 26 million pages (including errors). However, once the system was running smoothly, it ran much faster, downloading the last 11 million pages in just 63 hours, averaging just over 4 million pages per day or 48.5 pages per second. We ran the indexer and the crawler simultaneously. The indexer ran just faster than the crawlers. This is largely because we spent just enough time optimizing the indexer so that it would not be a bottleneck. These optimizations included bulk updates to the document index and placement of critical data structures on the local disk. The indexer runs at roughly 54 pages per second. The sorters can be run completely in parallel; using four machines, the whole process of sorting takes about 24 hours.

5.3 Search Performance
Improving the performance of search was not the major focus of our research up to this point. The
current version of Google answers most queries in between 1 and 10 seconds. This time is mostly
dominated by disk IO over NFS (since disks are spread over a number of machines). Furthermore,
Google does not have any optimizations such as query caching, subindices on common terms, and other
common optimizations. We intend to speed up Google considerably through distribution and hardware,
software, and algorithmic improvements. Our target is to be able to handle several hundred queries per
second. Table 2 has some sample query times from the current version of Google. They are repeated to
show the speedups resulting from cached IO.

  Query              Initial Query               Same Query Repeated (IO mostly cached)
                     CPU Time(s)  Total Time(s)   CPU Time(s)  Total Time(s)
  al gore               0.09         2.13            0.06         0.06
  vice president        1.77         3.84            1.66         1.80
  hard disks            0.25         4.86            0.20         0.24
  search engines        1.31         9.63            1.16         1.16

Table 2. Search Times

6 Conclusions

Google is designed to be a scalable search engine. The primary goal is to provide high quality search results over a rapidly growing World Wide Web. Google employs a number of techniques to improve search quality including page rank, anchor text, and proximity information. Furthermore, Google is a complete architecture for gathering web pages, indexing them, and performing search queries over them.

6.1 Future Work
A large-scale web search engine is a complex system and much remains to be done. Our immediate
goals are to improve search efficiency and to scale to approximately 100 million web pages. Some
simple improvements to efficiency include query caching, smart disk allocation, and subindices.
Another area which requires much research is updates. We must have smart algorithms to decide what
old web pages should be recrawled and what new ones should be crawled. Work toward this goal has
been done in [Cho 98]. One promising area of research is using proxy caches to build search databases,
since they are demand driven. We are planning to add simple features supported by commercial search
engines like boolean operators, negation, and stemming. However, other features are just starting to be
explored such as relevance feedback and clustering (Google currently supports a simple hostname based
clustering). We also plan to support user context (like the user’s location), and result summarization. We
are also working to extend the use of link structure and link text. Simple experiments indicate PageRank
can be personalized by increasing the weight of a user’s home page or bookmarks. As for link text, we
are experimenting with using text surrounding links in addition to the link text itself. A Web search
engine is a very rich environment for research ideas. We have far too many to list here so we do not
expect this Future Work section to become much shorter in the near future.

6.2 High Quality Search


The biggest problem facing users of web search engines today is the quality of the results they get back.
While the results are often amusing and expand users’ horizons, they are often frustrating and consume
precious time. For example, the top result for a search for "Bill Clinton" on one of the most popular
commercial search engines was the Bill Clinton Joke of the Day: April 14, 1997. Google is designed to
provide higher quality search so as the Web continues to grow rapidly, information can be found easily.
In order to accomplish this Google makes heavy use of hypertextual information consisting of link
structure and link (anchor) text. Google also uses proximity and font information. While evaluation of a
search engine is difficult, we have subjectively found that Google returns higher quality search results
than current commercial search engines. The analysis of link structure via PageRank allows Google to
evaluate the quality of web pages. The use of link text as a description of what the link points to helps
the search engine return relevant (and to some degree high quality) results. Finally, the use of proximity
information helps increase relevance a great deal for many queries.

6.3 Scalable Architecture


Aside from the quality of search, Google is designed to scale. It must be efficient in both space and time,
and constant factors are very important when dealing with the entire Web. In implementing Google, we
have seen bottlenecks in CPU, memory access, memory capacity, disk seeks, disk throughput, disk
capacity, and network IO. Google has evolved to overcome a number of these bottlenecks during
various operations. Google’s major data structures make efficient use of available storage space.
Furthermore, the crawling, indexing, and sorting operations are efficient enough to be able to build an
index of a substantial portion of the web -- 24 million pages, in less than one week. We expect to be able
to build an index of 100 million pages in less than a month.

6.4 A Research Tool


In addition to being a high quality search engine, Google is a research tool. The data Google has
collected has already resulted in many other papers submitted to conferences and many more on the
way. Recent research such as [Abiteboul 97] has shown a number of limitations to queries about the
Web that may be answered without having the Web available locally. This means that Google (or a
similar system) is not only a valuable research tool but a necessary one for a wide range of applications.
We hope Google will be a resource for searchers and researchers all around the world and will spark the
next generation of search engine technology.

7 Acknowledgments
Scott Hassan and Alan Steremberg have been critical to the development of Google. Their talented
contributions are irreplaceable, and the authors owe them much gratitude. We would also like to thank
Hector Garcia-Molina, Rajeev Motwani, Jeff Ullman, and Terry Winograd and the whole WebBase
group for their support and insightful discussions. Finally we would like to recognize the generous
support of our equipment donors IBM, Intel, and Sun and our funders. The research described here was
conducted as part of the Stanford Integrated Digital Library Project, supported by the National Science
Foundation under Cooperative Agreement IRI-9411306. Funding for this cooperative agreement is also
provided by DARPA and NASA, and by Interval Research, and the industrial partners of the Stanford
Digital Libraries Project.

References
Best of the Web 1994 -- Navigators http://botw.org/1994/awards/navigators.html
Bill Clinton Joke of the Day: April 14, 1997. http://www.io.com/~cjburke/clinton/970414.html.
Bzip2 Homepage http://www.muraroa.demon.co.uk/
Google Search Engine http://google.stanford.edu/
Harvest http://harvest.transarc.com/
Mauldin, Michael L. Lycos Design Choices in an Internet Search Service, IEEE Expert Interview
http://www.computer.org/pubs/expert/1997/trends/x1008/mauldin.htm
The Effect of Cellular Phone Use Upon Driver Attention
http://www.webfirst.com/aaa/text/cell/cell0toc.htm
Search Engine Watch http://www.searchenginewatch.com/
RFC 1950 (zlib) ftp://ftp.uu.net/graphics/png/documents/zlib/zdoc-index.html
Robots Exclusion Protocol: http://info.webcrawler.com/mak/projects/robots/exclusion.htm

Web Growth Summary: http://www.mit.edu/people/mkgray/net/web-growth-summary.html


Yahoo! http://www.yahoo.com/

[Abiteboul 97] Serge Abiteboul and Victor Vianu, Queries and Computation on the Web.
Proceedings of the International Conference on Database Theory. Delphi, Greece 1997.
[Bagdikian 97] Ben H. Bagdikian. The Media Monopoly. 5th Edition. Publisher: Beacon, ISBN:
0807061557
[Cho 98] Junghoo Cho, Hector Garcia-Molina, Lawrence Page. Efficient Crawling Through URL
Ordering. Seventh International Web Conference (WWW 98). Brisbane, Australia, April 14-18,
1998.
[Gravano 94] Luis Gravano, Hector Garcia-Molina, and A. Tomasic. The Effectiveness of GlOSS
for the Text-Database Discovery Problem. Proc. of the 1994 ACM SIGMOD International
Conference On Management Of Data, 1994.
[Kleinberg 98] Jon Kleinberg, Authoritative Sources in a Hyperlinked Environment, Proc.
ACM-SIAM Symposium on Discrete Algorithms, 1998.
[Marchiori 97] Massimo Marchiori. The Quest for Correct Information on the Web: Hyper Search
Engines. The Sixth International WWW Conference (WWW 97). Santa Clara, USA, April 7-11,
1997.
[McBryan 94] Oliver A. McBryan. GENVL and WWWW: Tools for Taming the Web. First
International Conference on the World Wide Web. CERN, Geneva (Switzerland), May 25-26-27
1994. http://www.cs.colorado.edu/home/mcbryan/mypapers/www94.ps
[Page 98] Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. The PageRank Citation
Ranking: Bringing Order to the Web. Manuscript in progress.
http://google.stanford.edu/~backrub/pageranksub.ps
[Pinkerton 94] Brian Pinkerton, Finding What People Want: Experiences with the WebCrawler.
The Second International WWW Conference Chicago, USA, October 17-20, 1994.
http://info.webcrawler.com/bp/WWW94.html
[Spertus 97] Ellen Spertus. ParaSite: Mining Structural Information on the Web. The Sixth
International WWW Conference (WWW 97). Santa Clara, USA, April 7-11, 1997.
[TREC 96] Proceedings of the fifth Text REtrieval Conference (TREC-5). Gaithersburg, Maryland,
November 20-22, 1996. Publisher: Department of Commerce, National Institute of Standards and
Technology. Editors: D. K. Harman and E. M. Voorhees. Full text at: http://trec.nist.gov/
[Witten 94] Ian H Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes:
Compressing and Indexing Documents and Images. New York: Van Nostrand Reinhold, 1994.
[Weiss 96] Ron Weiss, Bienvenido Velez, Mark A. Sheldon, Chanathip Manprempre, Peter
Szilagyi, Andrzej Duda, and David K. Gifford. HyPursuit: A Hierarchical Network Search Engine
that Exploits Content-Link Hypertext Clustering. Proceedings of the 7th ACM Conference on
Hypertext. New York, 1996.

Vitae

Sergey Brin received his B.S. degree in mathematics and computer science
from the University of Maryland at College Park in 1993. Currently, he is a
Ph.D. candidate in computer science at Stanford University where he received
his M.S. in 1995. He is a recipient of a National Science Foundation Graduate
Fellowship. His research interests include search engines, information
extraction from unstructured sources, and data mining of large text collections
and scientific data.

Lawrence Page was born in East Lansing, Michigan, and received a B.S.E.
in Computer Engineering at the University of Michigan Ann Arbor in 1995.
He is currently a Ph.D. candidate in Computer Science at Stanford University.
Some of his research interests include the link structure of the web, human
computer interaction, search engines, scalability of information access
interfaces, and personal data mining.

8 Appendix A: Advertising and Mixed Motives


Currently, the predominant business model for commercial search engines is advertising. The goals of
the advertising business model do not always correspond to providing quality search to users. For
example, in our prototype search engine one of the top results for cellular phone is "The Effect of
Cellular Phone Use Upon Driver Attention", a study which explains in great detail the distractions and
risk associated with conversing on a cell phone while driving. This search result came up first because
of its high importance as judged by the PageRank algorithm, an approximation of citation importance on
the web [Page, 98]. It is clear that a search engine which was taking money for showing cellular phone
ads would have difficulty justifying the page that our system returned to its paying advertisers. For this
type of reason and historical experience with other media [Bagdikian 97], we expect that advertising
funded search engines will be inherently biased towards the advertisers and away from the needs of the
consumers.

Since it is very difficult even for experts to evaluate search engines, search engine bias is particularly
insidious. A good example was OpenText, which was reported to be selling companies the right to be
listed at the top of the search results for particular queries [Marchiori 97]. This type of bias is much
more insidious than advertising, because it is not clear who "deserves" to be there, and who is willing to
pay money to be listed. This business model resulted in an uproar, and OpenText has ceased to be a
viable search engine. But less blatant biases are likely to be tolerated by the market. For example, a search
engine could add a small factor to search results from "friendly" companies, and subtract a factor from
results from competitors. This type of bias is very difficult to detect but could still have a significant
effect on the market. Furthermore, advertising income often provides an incentive to provide poor
quality search results. For example, we noticed a major search engine would not return a large airline’s
homepage when the airline’s name was given as a query. It so happened that the airline had placed an
expensive ad, linked to the query that was its name. A better search engine would not have required this
ad, and possibly resulted in the loss of the revenue from the airline to the search engine. In general, it
could be argued from the consumer point of view that the better the search engine is, the fewer
advertisements will be needed for the consumer to find what they want. This of course erodes the
advertising supported business model of the existing search engines. However, there will always be
money from advertisers who want a customer to switch products, or have something that is genuinely
new. But we believe the issue of advertising causes enough mixed incentives that it is crucial to have a
competitive search engine that is transparent and in the academic realm.

9 Appendix B: Scalability
9.1 Scalability of Google
We have designed Google to be scalable in the near term to a goal of 100 million web pages. We have
just received disks and machines to handle roughly that amount. All of the time-consuming parts of the
system are parallelizable and roughly linear in time. These include things like the crawlers, indexers, and
sorters. We also think that most of the data structures will deal gracefully with the expansion. However,
at 100 million web pages we will be pushing up against all sorts of operating system limits in the
common operating systems (currently we run on both Solaris and Linux). These include things like
addressable memory, number of open file descriptors, network sockets and bandwidth, and many others.
We believe expanding to a lot more than 100 million pages would greatly increase the complexity of our
system.

9.2 Scalability of Centralized Indexing Architectures


As the capabilities of computers increase, it becomes possible to index a very large amount of text for a
reasonable cost. Of course, other more bandwidth-intensive media such as video are likely to become
more pervasive. But, because the cost of production of text is low compared to media like video, text is
likely to remain very pervasive. Also, it is likely that soon we will have speech recognition that does a
reasonable job converting speech into text, expanding the amount of text available. All of this provides
amazing possibilities for centralized indexing. Here is an illustrative example. We assume we want to
index everything everyone in the US has written for a year. We assume that there are 250 million people
in the US and they write an average of 10k per day. That works out to be about 850 terabytes. Also
assume that indexing a terabyte can be done now for a reasonable cost. We also assume that the
indexing methods used over the text are linear, or nearly linear in their complexity. Given all these
assumptions we can compute how long it would take before we could index our 850 terabytes for a
reasonable cost assuming certain growth factors. Moore’s Law was defined in 1965 as a doubling every
18 months in processor power. It has held remarkably true, not just for processors, but for other
important system parameters such as disk as well. If we assume that Moore’s law holds for the future,
we need only 10 more doublings, or 15 years to reach our goal of indexing everything everyone in the
US has written for a year for a price that a small company could afford. Of course, hardware experts are
somewhat concerned Moore’s Law may not continue to hold for the next 15 years, but there are
certainly a lot of interesting centralized applications even if we only get part of the way to our
hypothetical example.
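The arithmetic behind this thought experiment can be checked in a few lines. The 250 million people and 10 KB per person per day are the paper's assumptions, and the 18-month doubling is the popular reading of Moore's Law used here.

    people = 250e6
    bytes_per_day = 10e3
    total_bytes = people * bytes_per_day * 365      # ~9.1e14 bytes
    terabytes = total_bytes / 2**40                  # ~830 TB, i.e. roughly the 850 TB cited

    doublings = 10                                   # 2**10 = 1024-fold growth in capacity
    years = doublings * 1.5                          # 18 months per doubling -> 15 years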

Of course distributed systems like GlOSS [Gravano 94] or Harvest will often be the most efficient and
elegant technical solution for indexing, but it seems difficult to convince the world to use these systems
because of the high administration costs of setting up large numbers of installations. Of course, it is
quite likely that reducing the administration cost drastically is possible. If that happens, and everyone
starts running a distributed indexing system, searching would certainly improve drastically.

Because humans can only type or speak a finite amount, and as computers continue improving, text
indexing will scale even better than it does now. Of course there could be an infinite amount of machine
generated content, but just indexing huge amounts of human generated content seems tremendously
useful. So we are optimistic that our centralized web search engine architecture will improve in its
ability to cover the pertinent text information over time and that there is a bright future for search.
The BINGO! System for Information Portal
Generation and Expert Web Search
Sergej Sizov, Michael Biwer, Jens Graupmann,
Stefan Siersdorfer, Martin Theobald, Gerhard Weikum, Patrick Zimmer
University of the Saarland
Department of Computer Science
Im Stadtwald, 66123 Saarbruecken
Germany

Abstract

This paper presents the BINGO! focused crawler, an advanced tool for information portal generation and expert Web search. In contrast to standard search engines such as Google which are solely based on precomputed index structures, a focused crawler interleaves crawling, automatic classification, link analysis and assessment, and text filtering. A crawl is started from a user-provided set of training data and aims to collect comprehensive results for the given topics.

The focused crawling paradigm has been around for a few years and many of our techniques are adopted from the information retrieval and machine learning literature. BINGO! is a system-oriented effort to integrate a suite of techniques into a comprehensive and versatile tool. The paper discusses its overall architecture and main components, important lessons from early experimentation and the resulting improvements on effectiveness and efficiency, and experimental results that demonstrate the usefulness of BINGO! as a next-generation tool for information organization and search.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 2003 CIDR Conference

1 Introduction

1.1 The Problem of Web and Intranet Information Search

Web search engines mostly build on the vector space model that views text documents (including HTML or XML documents) as vectors of term relevance scores [3, 19]. These terms, also known as features, represent word occurrence frequencies in documents after stemming and other normalizations. Queries are vectors too, so that similarity metrics between vectors, for example, the Euclidean distance or the cosine metric, can be used to produce a ranked list of search results, in descending order of (estimated) relevance. The quality of a search result is assessed a posteriori by the empirical metrics precision and recall: precision is the fraction of truly relevant documents among the top N matches in the result ranking (N typically being 10), and recall is the fraction of found documents out of the relevant documents that exist somewhere in the underlying corpus (e.g., the entire Web). More recently, the above basic model has been enhanced by analyzing the link structure between documents, viewing the Web as a graph, and defining the authority of Web sites or documents as an additional metric for search result ranking [5, 14]. These approaches have been very successful in improving the precision (i.e., "sorting out the junk" in more colloquial terms) for typical mass queries such as "Madonna tour" (i.e., everything or anything about the concert tour of pop star Madonna). However, link analysis techniques do not help much for expert queries where recall is the key problem (i.e., finding a few useful results at all).

Two important observations can be made about the above class of advanced information demands. First, the best results are often obtained from Yahoo-style portals that maintain a hierarchical directory of topics, also known as an ontology; the problem with this approach is, however, that it requires intellectual work for classifying new documents into the ontology and thus does not scale with the Web. Second, fully automated Web search engines such as Google, Altavista, etc. sometimes yield search results from which the user could possibly reach the actually desired information by following a small number of hyperlinks; here the problem is that exhaustively surfing the vicinity of a Web document may often take hours and is thus infeasible in practice. These two observations have motivated a novel approach known as focused crawling or thematic crawling [7], which can be viewed as an
attempt to automate the above kinds of intellectual preprocessing and postprocessing.

1.2 The Potential of Focused Crawling

In contrast to a search engine's generic crawler (which serves to build and maintain the engine's index), a focused crawler is interested only in a specific, typically small, set of topics such as 19th century Russian literature, backcountry desert hiking and canyoneering, or programming with (the Web server scripting language) PHP. The topics of interest may be organized into a user- or community-specific hierarchy. The crawl is started from a given set of seed documents, typically taken from an intellectually built ontology, and aims to proceed along the most promising paths that stay "on topic" while also accepting some detours along digressing subjects with a certain "tunnelling" probability. Each of the visited documents is classified into the crawler's hierarchy of topics to test whether it is of interest at all and where it belongs in the ontology; this step must be automated using classification techniques from machine learning such as Naive Bayes, Maximum Entropy, Support Vector Machines (SVM), or other supervised learning methods [15, 17, 23]. The outcome of the focused crawl can be viewed as the index of a personalized information service or a thematically specialized search engine.

A focused crawler can be used for at least two major problems in information organization and search:

1. Starting with a reasonable set of seed documents that also serve as training data for the classifier, a focused crawl can populate a topic directory and thus serves as a largely automated information portal generator.

2. Starting with a set of keywords or an initial result set from a search engine (e.g., from a Google query), a focused crawl can improve the recall for an advanced expert query, a query that would take a human expert to identify matches and for which current Web search engines would typically return either no or only irrelevant documents (at least in the top ten ranks).

In either case the key challenge is to minimize the time that a human needs for setting up the crawl (e.g., provide training data, calibrate crawl parameters, etc.) and for interpreting or analyzing its results. For example, we would expect the human to spend a few minutes for carefully specifying her information demand and setting up an overnight crawl, and another few minutes for looking at the results the next morning. In addition, the focused crawler may get back to the user for feedback after some "learning" phase of say twenty minutes.

This mode of operation is in significant contrast to today's Web search engines which rely solely on precomputed results in their index structures and strictly limit the computer resource consumption per query in the interest of maximizing the throughput of "mass user" queries. With human cycles being much more expensive than computer and network cycles, the above kind of paradigm shift seems to be overdue for advanced information demands (e.g., of scientists).

1.3 Contribution and Outline of the Paper

This paper presents the BINGO! system that we have developed in the last two years at the University of the Saarland.1 Our approach has been inspired by and has adopted concepts from the seminal work of Chakrabarti et al. [7], but we believe it is fair to call our system a second-generation focused crawler. While most mathematical and algorithmic ingredients that we use in BINGO! (e.g., the classifier, cross-entropy-based feature selection, link analysis for prioritizing URLs in the crawl queue, etc.) are state-of-the-art, the overall system architecture is relatively unique (in the sense that most concepts have been around in the machine learning and information retrieval literature, but have not been considered in an integrated system context). The following are salient features of the BINGO! system:

- As human expert time is scarce and expensive, building the classifier on extensive, high-quality training data is a rare luxury. To overcome the potential deficiencies of the initial training documents, BINGO! uses a simple form of unsupervised, dynamic learning: during a crawl the system periodically identifies the most characteristic documents that have been automatically classified into a topic of interest and considers promoting these class "archetypes" to become new training data.

- The crawl is structured into two phases: a learning phase and a harvesting phase. The first phase performs a limited (mostly depth-first) crawl and uses a conservative tuning of the classifier in order to obtain a richer feature set (i.e., topic-specific terminology) and to find good candidates for archetypes. The second phase then switches to a much more aggressive breadth-first strategy with URL prioritization. Learning aims to calibrate the precision of the classifier, whereas harvesting aims at a high recall.

- BINGO! is designed as a comprehensive and flexible workbench for assisting a portal administrator or a human expert with certain information demands. To this end it includes a local search engine for querying the result documents of a crawl and various other data analysis techniques for postprocessing.

1 BINGO! stands for Bookmark-Induced Gathering of Information

The paper describes the BINGO! architecture and its components, and it demonstrates the system's effectiveness by two kinds of experiments for information portal generation and expert search. The rest
of the paper is organized as follows. Section 2 gives an overview of the system's main concepts, the corresponding software components, and their interplay. When we had built the first prototype based on these concepts and started experimenting, we realized a number of shortcomings regarding the search effectiveness, i.e., the quality of the crawl results, and also efficiency, i.e., speed and resource consumption. These observations led to substantial improvements that are discussed in Sections 3 and 4 on effectiveness and efficiency. Section 5 presents our recent experimental results. We conclude with an outlook on ongoing and future work.

2 Overview of the BINGO! System

The BINGO! crawling toolkit consists of six main components that are depicted in Figure 1: the focused crawler itself, an HTML document analyzer that produces a feature vector for each document, the SVM classifier with its training data, the feature selection as a "noise-reduction" filter for the classifier, the link analysis module as a distiller for topic-specific authorities and hubs, and the training module for the classifier that is invoked for periodic re-training.

Figure 1: The BINGO! architecture and its main components

The crawler starts from a user's bookmark file or some other form of personalized or community-specific topic directory. These intellectually classified documents serve two purposes: 1) they provide the initial seeds for the crawl (i.e., documents whose outgoing hyperlinks are traversed by the crawler), and 2) they provide the initial contents for the user's topic tree and the initial training data for the classifier. Figure 2 shows an example of such a tree. Note that a single-node tree is a special case for generating an information portal with a single topic and no subclass structure or for answering a specific expert query. In the latter case the training data is a virtual document derived from the user query, and this training basis can be extended by prompting the user for relevance feedback after a short initial crawl and adding appropriate documents.

Figure 2: Example of a topic tree

All crawled documents, including the initial data, are stored in an Oracle9i database which serves as a cache. The role of the database system as a storage manager to BINGO! is further discussed in Section 4.

2.1 Crawler

The crawler processes the links in the URL queue using multiple threads. For each retrieved document the crawler initiates some analysis steps that depend on the document's MIME type (e.g., HTML, PDF, etc.) and then invokes the classifier on the resulting feature vector. Once a crawled document has been successfully classified, BINGO! extracts all hyperlinks from the document and adds them to the URL queue for further crawling; the links are ordered by their priority, i.e., SVM confidence in our case.

2.2 Document Analyzer

BINGO! computes document vectors according to the standard bag-of-words model, using stopword elimination, Porter stemming, and tf ∗ idf based term weighting [3, 16]. This is a standard IR approach where term weights capture the term frequency (tf) of the corresponding word stems in the document and the, logarithmically dampened, inverse document frequency (idf) which is the reciprocal of the number of documents in the entire corpus that contain the term. We consider our local document database as an approximation of the corpus for idf computation and recompute it lazily upon each retraining.

The document analyzer can handle a wide range of content handlers for different document formats (in particular, PDF, MS Word, MS PowerPoint etc.) as well as common archive files (zip, gz) and converts the recognized contents into HTML. So these formats can be processed by BINGO! like usual web pages. Many useful kinds of documents (like scientific publications, whitepapers, or commercial product specifications) are published as PDF; incorporating this material improves the crawling recall and the quality of the classifier's training set by a substantial margin.
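A compact sketch of this weighting scheme follows. The names are illustrative and the logarithmic dampening shown is one common variant; BINGO!'s actual pipeline also applies stopword removal and Porter stemming before this step.

    import math

    def tf_idf_vector(term_counts, doc_freq, num_docs):
        # term_counts: {stem: occurrences in this document}
        # doc_freq:    {stem: number of documents in the local corpus containing it}
        vec = {}
        for term, tf in term_counts.items():
            idf = math.log(num_docs / doc_freq.get(term, 1))   # dampened inverse document frequency
            vec[term] = tf * idf
        return vec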

2.3 Feature Selection

The feature selection algorithm provides the BINGO! engine with the most characteristic features for a given topic; these are the features that are used by the classifier for testing new documents. A good feature for this purpose discriminates competing topics from each other, i.e., those topics that are at the same level of the topic tree. Therefore, feature selection has to be topic-specific; it is invoked for every topic in the tree individually. As an example, consider a directory with topics mathematics, agriculture, and arts, where mathematics has subclasses algebra and stochastics. Obviously, the term theorem is very characteristic for math documents and thus an excellent discriminator between mathematics, agriculture, and arts. However, it is of no use at all to discriminate algebra versus stochastics. A term such as field, on the other hand, is a good indicator for the topic algebra when the only competing topic is stochastics; however, it is useless for a classifier that tests mathematics versus agriculture.

We use the Mutual Information (MI) measure for topic-specific feature selection. This technique, which is a specialized case of the notions of cross-entropy or Kullback-Leibler divergence [16], is known as one of the most effective methods [24]. The MI weight of the term Xi in the topic Vj is defined as:

    MI(Xi, Vj) = P[Xi ∧ Vj] * log( P[Xi ∧ Vj] / (P[Xi] * P[Vj]) )    (1)

Mutual information can be interpreted as a measure of how much the joint distribution of features Xi and topics Vj deviates from a hypothetical distribution in which features and topics are independent of each other (hence the remark about MI being a special case of the Kullback-Leibler divergence which measures the differences between multivariate probability distributions in general).

The result of feature selection for a given topic is a ranking of the features, with the most discriminative features listed first. Our experiments achieved good results with the top 2000 features for each topic as the input to the classifier (compared to tens of thousands of different terms in the original documents). For efficiency BINGO! pre-selects candidates for the best features based on tf values and evaluates MI weights only for the 5000 most frequently occurring terms within each topic. As an example consider the class "Data Mining" in the topic tree of Figure 2 with home pages and DBLP pages of 5 to 10 leading researchers for each topic. Our feature selection finds the following word stems with the highest MI values: mine, knowledg, olap, frame, pattern, genet, discov, cluster, dataset.
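Equation (1) can be evaluated directly from document counts. The sketch below estimates the probabilities from the local document database; the function and argument names are illustrative, not BINGO! code.

    import math

    def mutual_information(n_term_and_topic, n_term, n_topic, n_total):
        # probabilities estimated from document counts in the local corpus
        p_xy = n_term_and_topic / n_total
        p_x = n_term / n_total
        p_y = n_topic / n_total
        if p_xy == 0:
            return 0.0
        return p_xy * math.log(p_xy / (p_x * p_y))

    # e.g., keep the top 2000 features of a topic:
    # features = sorted(terms, key=lambda t: mutual_information(...), reverse=True)[:2000]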
2.4 Classifier

Document classification consists of a training phase for building a mathematical decision model based on intellectually pre-classified documents, and a decision phase for classifying new, previously unseen documents fetched by the crawler. For training BINGO! builds a topic-specific classifier for each node of the topic tree.

New documents are classified against all topics of the ontology tree in a top-down manner. Starting with the root, which corresponds to the union of the user's topics of interest, we feed the document's features into each of the node-specific decision models (including node-specific feature selection) and invoke the binary classifiers for all topics with the same parent. We refer to these as "competing" topics as the document will eventually be placed in at most one of them. Each of the topic-specific classifiers returns a yes-or-no decision and also a measure of confidence for this decision (see below). We assign the document to the tree node with the highest confidence in a positive decision. Then the classification proceeds with the children of this node, until eventually a leaf node is reached. If none of the topics with the same parent returns yes, the document is assigned to an artificial node labeled 'OTHERS' under the same parent.

Our engine uses support vector machines (SVM) [6, 23] as topic-specific classifiers. We use the linear form of SVM where training amounts to finding a hyperplane in the m-dimensional feature vector space that separates a set of positive training examples (document collection Di+ of the topic Vi) from a set of negative examples (document collection Di− of all competing topics V with the same parent as Vi) with maximum margin. The hyperplane can be described in the form w · x + b = 0, as illustrated in Figure 3. Computing the hyperplane is equivalent to solving a quadratic optimization problem [23]. The current BINGO! version uses an existing open-source SVM implementation [1].

Figure 3: The separating hyperplane of the linear SVM classifier

Note that in the decision phase the SVM classifier is very efficient. For a new, previously unseen, document in the m-dimensional feature space d ∈ Rm it merely needs to test whether the document lies on the "left" side or the "right" side of the separating hyperplane. The decision simply requires computing an
m-dimensional scalar product of two vectors.

We interpret the distance of a newly classified document from the separating hyperplane as a measure of the classifier's confidence. This computation is an inexpensive byproduct of the classifier. We use this kind of SVM confidence for identifying the most characteristic "archetypes" of a given topic. Note that training documents have a confidence score associated with them, too, by simply running them through the classifier's decision model after completed training. Further note that the initial training data does not necessarily yield the best archetypes, in particular, for expert Web search where the initial training data is merely a representation of the user's query terms. Therefore, BINGO! periodically considers promoting the best archetypes to become training documents for retraining the classifier. To estimate the precision of the new classifier, we use the computationally efficient ξα-method [13]. This estimator has approximately the same variance as leave-one-out estimation and slightly underestimates the true precision of the classifier (i.e., is a bit pessimistic). The prediction of the classifier's performance during a crawl is valuable for tuning the feature space construction; we will further discuss this issue in Section 3.
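In sketch form, the decision and its confidence are a single dot product with the trained hyperplane, and documents descend the topic tree node by node. This is not the BINGO! source; the node structure (children, per-node weights w and b, and an OTHERS child) is assumed for illustration.

    def svm_confidence(w, b, d):
        # signed distance of sparse feature vector d from the hyperplane w.x + b = 0
        s = sum(w.get(t, 0.0) * v for t, v in d.items()) + b
        return s                      # sign = yes/no decision, magnitude = confidence

    def classify_top_down(node, d):
        while node.children:
            scored = [(svm_confidence(c.w, c.b, d), c) for c in node.children]
            conf, best = max(scored, key=lambda t: t[0])
            if conf <= 0:             # no competing topic says yes
                return node.others    # artificial 'OTHERS' child
            node = best
        return node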
2.5 Link Analysis

The link structure between documents in each topic is an additional source of information about how well they capture the topic [4, 5, 7, 14]. Upon each re-training, we apply the method of [4], a variation of Kleinberg's HITS algorithm, to each topic of the directory. This method aims to identify a set SA of authorities, which should be Web pages with most significant and/or comprehensive information on the topic, and a set SH of hubs, which should be the best link collections with pointers to good authorities. The algorithm considers a small part of the hyperlink-induced Web graph G = (S, E) with a node set S in the order of a few hundred or a few thousand documents and a set of edges E with an edge from node p to node q if the document that corresponds to p contains a hyperlink that points to document q. The node set S is constructed in two steps: 1) We include all documents that have been positively classified into the topic under consideration, which form the "base set" in Kleinberg's terminology. 2) We add all successors of these documents (i.e., documents that can be reached along one outgoing edge) and a reasonably sized subset of predecessors (i.e., documents that have a direct hyperlink to a document in the base set). The predecessors can be determined by querying a large unfocused Web database that internally maintains a large fraction of the full Web graph.

The actual computation of hub and authority scores is essentially an iterative approximation of the principal Eigenvectors for two matrices derived from the adjacency matrix of the graph G. Its outcome are two vectors with authority scores and hub scores. We are interested in the top ranked authorities and hubs. The former are perceived as topic-specific archetypes and considered for promotion to training data, and the latter are the best candidates for being crawled next and therefore added to the high-priority end of the crawler's URL queue. These steps are performed with each retraining.
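The iteration itself is the usual HITS-style power iteration. The sketch below runs on a simple adjacency list with illustrative names; the cited method of [4] adds refinements beyond this basic scheme.

    def hits(graph, iterations=50):
        # graph: {page: set of pages it links to}
        auth = {p: 1.0 for p in graph}
        hub = {p: 1.0 for p in graph}
        for _ in range(iterations):
            for p in graph:                                   # authority <- sum of incoming hub scores
                auth[p] = sum(hub[q] for q in graph if p in graph[q])
            for p in graph:                                   # hub <- sum of outgoing authority scores
                hub[p] = sum(auth[q] for q in graph[p] if q in auth)
            na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
            nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
            auth = {p: v / na for p, v in auth.items()}       # normalize each iteration
            hub = {p: v / nh for p, v in hub.items()}
        return auth, hub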
2.6 Learning Phase vs. Harvesting Phase

Building a reasonably precise classifier from a very small set of training data is a very challenging task. Effective learning algorithms for highly heterogeneous environments like the Web would require a much larger training basis, yet human users would rarely be willing to invest hours of intellectual work for putting together a rich document collection that is truly representative of their interest profiles. To address this problem, we distinguish two basic crawl strategies:

- The learning phase serves to identify new archetypes and expand the classifier's knowledge base.
- The harvesting phase serves to effectively process the user's information demands with improved crawling precision and recall.

Depending on the phase, different focusing rules come into play to tell the crawler when to accept or reject Web pages for addition to the URL queue (see Section 4.2).

In the learning phase we are exclusively interested in gaining a broad knowledge base for the classifier by identifying archetypes for each topic. In many cases such documents can be obtained from the direct neighborhood of the initial training data, assuming that these have been chosen carefully. For example, suppose the user provides us with home pages of researchers from her bookmarks on a specific topic, say data mining; then chances are good that we find a rich source of topic-specific terminology in the vicinity of these home pages, say a conference paper on some data mining issue, i.e., a scientist's homepage with links to her topic-specific publications. Following this rationale, BINGO! uses a depth-first crawl strategy during the learning phase, and initially restricts itself to Web pages from the domains that the initial training documents come from.

BINGO! repeatedly initiates re-training of the classifier, when a certain number of documents have been crawled and successfully classified with confidence above a certain threshold. At such points, a new set of training documents is determined for each node of the topic tree. For this purpose, the most characteristic documents of a topic, coined archetypes, are determined in two, complementary, ways. First, the link analysis is initiated with the current documents of a topic as its base set. The best authorities of a tree node are regarded as potential archetypes of the node. The second source of topic-specific archetypes builds on the
confidence of the classifier's yes-or-no decision for a given node of the ontology tree. Among the automatically classified documents of a topic those documents whose yes decision had the highest confidence measure are selected as potential archetypes. The union of both top authorities and documents with high SVM confidence form a new set of candidates for promotion to training data.

After successfully extending the training basis with additional archetypes, BINGO! retrains all topic-specific classifiers and switches to the harvesting phase now putting emphasis on recall (i.e., collecting as many documents as possible). The crawler is resumed with the best hubs from the link analysis, using a breadth-first strategy that aims to visit as many different sites as possible that are related to the crawl's topics. When the learning phase cannot find sufficient archetypes or when the user wants to confirm archetypes before initiating a long and resource-intensive harvesting crawl, BINGO! includes a user feedback step between learning and harvesting. Here the user can intellectually identify archetypes among the documents found so far and may even trim individual HTML pages to remove irrelevant and potentially diluting parts (e.g., when a senior researcher's home page is heterogeneous in the sense that it reflects different research topics and only some of them are within the intended focus of the crawl).

3 Making BINGO! Effective

The BINGO! system described so far is a complete focused crawler with a suite of flexible options. When we started experimenting with the system, we observed fairly mixed success, however. In particular, some of the crawls lost their focus and were led astray by inappropriate training data or a bad choice of automatically added archetypes. Based on these lessons we improved the system in a number of ways that are described in this section.

3.1 Classifier Training on Negative Examples

An SVM classifier needs both positive and negative training examples for computing a separating hyperplane. As negative examples we used the positive training data from a topic's competing classes, which are the topic's siblings in the topic tree. For topics without proper siblings, e.g., for a single-topic crawl, we added a virtual child "OTHERS" to all tree nodes which was populated with some arbitrarily chosen documents that were "semantically far away" from all topics of the directory. This approach worked, but in some situations it was not sufficient to cope with the extreme diversity of Web data. In some sense, saying what the crawl should not return is as important as specifying what kind of information we are interested in.

As a consequence of this observation we now populate the virtual "OTHERS" class in a much more systematic manner. As the positive training examples for the various topics all contain ample common-sense vocabulary and not just the specific terms that we are interested in, we included training documents in the "OTHERS" classes that capture as much of the common-sense terminology as possible. In most of our experiments we use about 50 documents from the top-level categories of Yahoo (i.e., sports, entertainment, etc.) for this purpose. Since our focused crawls were mostly interested in scientific topics, this choice of negative examples turned out to be a proper complement to improve the classifier's learning.

3.2 Archetype Selection

The addition of inappropriate archetypes for retraining the classifier was a source of potential diffusion. To avoid the "topic drift" phenomenon, where a few out-of-focus training documents may lead the entire crawl into a wrong thematic direction, we now require that the classification confidence of an archetype must be higher than the mean confidence of the previous training documents. So each iteration effectively adds x new archetypes (0 ≤ x ≤ min{Nauth; Nconf}, where Nauth is the number of high-authority candidates from the link analysis and Nconf is the number of candidates with top ranks regarding SVM confidence), and it may also remove documents from the training data as the mean confidence of the training data changes.

Once the up to min{Nauth; Nconf} archetypes of a topic have been selected, the classifier is re-trained. This step in turn requires invoking the feature selection first. So the effect of re-training is twofold: 1) if the archetypes capture the terminology of the topic better than the original training data (which is our basic premise) then the feature selection procedure can extract better, more discriminative, features for driving the classifier, and 2) the accuracy of the classifier's test whether a new, previously unseen, document belongs to a topic or not is improved using richer (e.g., longer but concise) and more characteristic training documents for building its decision model. In the case of an SVM classifier, the first point means transforming all documents into a clearer feature space, and the second point can be interpreted as constructing a "sharper" (i.e., less blurred) separating hyperplane in the feature space (with more slack on either side of the hyperplane to the accepted or rejected documents).
3.3 Focus Adjustment and Tunnelling
which was populated with some arbitrarily chosen doc-
uments that were “semantically far away” from all top-
Learning Phase with Sharp Focus
ics of the directory. This approach worked, but in some
situations it was not sufficient to cope with the extreme During the learning phase BINGO! runs with a very
diversity of Web data. In some sense, saying what the strict focusing rule. As the system starts only with
crawl should not return is as important as specifying a relatively small set of seeds, we can expect only
what kind of information we are interested in. low classification confidence with this initial classifier.
As a consequence of this observation we now pop- Therefore, our top priority in this phase is to find new
ulate the virtual “OTHERS” class in a much more archetypes to augment the training basis. The crawler
accepts only documents that are reachable via hyperlinks from the original seeds and are classified into the same topic as the corresponding seeds. We call this strategy sharp focusing: for all documents p, q in E and links (p, q) in V, accept only those links where class(p) = class(q).

The above strategy requires that at least some of the crawled documents are successfully classified into the topic hierarchy; otherwise, the crawler would quickly run out of links to be visited. This negative situation did indeed occur in some of our early experiments when the training data contained no useful links to related Web sources. Therefore, BINGO! also considers links from rejected documents (i.e., documents that do not pass the classification test for a given topic) for further crawling. However, we restrict the depth of traversing links from such documents to a threshold value, typically set to one or two. The rationale behind this threshold is that one often has to "tunnel" through topic-unspecific "welcome" or "table-of-contents" pages before again reaching a thematically relevant document.

Harvesting Phase with Soft Focus

Once the training set has reached min{N_auth; N_conf} documents per topic, BINGO! performs retraining and the harvesting phase is started. The now improved crawling precision allows us to relax the hard focusing rule and to accept all documents that can successfully be classified into any one of the topics of interest, regardless of whether this is the same class as that of its hyperlink predecessor. We call this strategy soft focusing: for all documents p, q in E and links (p, q) in V, accept all links where class(p) != ROOT/OTHERS. The harvesting usually has tunnelling activated.

3.4 Feature Space Construction

Single terms alone and the resulting tf*idf-based document vectors are a very crude characterization of document contents. In addition to this traditional IR approach we are also investigating various richer feature spaces:

- Term pairs: The co-occurrence of certain terms in the same document adds to the content characterization and may sometimes even contribute to disambiguating polysems (i.e., words with multiple meanings). The extraction of all possible term pairs in a document is computationally expensive. We use a sliding window technique and determine only pairs within a limited word distance.

- Neighbor documents: Sometimes a document's neighborhood, i.e., its predecessors and successors in the hyperlink graph, can help identifying the topic of the document. We consider constructing feature vectors that contain both the current document's terms and the most significant terms of its neighbor documents. This approach is somewhat risky as it may as well dilute the feature space (as reported in [8]); so it is crucial to combine it with conservative (MI-based) feature selection.

- Anchor texts: The short texts in hyperlink tags of the HTML pages that point to the current document may provide concise descriptions of the target document. However, it is very crucial to use an extended form of stopword elimination on anchor texts (to remove standard phrases such as "click here").

The way we are using the above feature options in BINGO! is by constructing combined feature spaces or by creating multiple alternative classifiers (see next subsection). For example, BINGO! can construct feature vectors that have single-term frequencies, term-pair frequencies, and anchor terms of predecessors as components. For all components feature selection is applied beforehand to capture only the most significant of these features. The classifier can handle the various options that BINGO! supports in a uniform manner: it does not have to know how feature vectors are constructed and what they actually mean. Vectors with up to several thousand components can be handled with acceptable performance.

3.5 Meta Classification

BINGO! can construct a separate classifier (i.e., trained decision model) for each of the various feature space options outlined in the previous section (including combination spaces). Right after training it uses the ξα estimator [13] for predicting the quality (i.e., classification precision) of each alternative and then selects the one that has the best estimated "generalization performance" for classifying new, previously unseen documents. The same estimation technique can be used, with some extra computational effort, for choosing an appropriate value for the number of most significant terms or other features that are used to construct the classifier's input vectors after feature selection.

In addition, BINGO! can combine multiple classifiers at run-time using a meta classifier approach. Consider the set V = {v_1, ..., v_h} of classifiers. Let res(v_i, D, C) in {-1, 1} be the decision of the i-th method for the classification of document D into class C, w(v_i) in R be weights, and t_1, t_2 in R be thresholds. Then we can define a meta decision function as follows:

    Meta(V, D, C) =
        +1   when  Σ_{i=1}^{h} w(v_i) · res(v_i, D, C) > t_1          (2)
        -1   when  Σ_{i=1}^{h} w(v_i) · res(v_i, D, C) < t_2
         0   otherwise

The zero decision means that the meta classifier is unable to make a clear decision and thus abstains.
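The meta decision function (2) translates directly into a small weighted-voting combiner. The following Java fragment is only an illustrative sketch, not BINGO! source code; the Classifier interface and the method names are assumptions made for this example.

    import java.util.List;

    // Generic weighted-voting combiner in the spirit of the meta decision function (2).
    // Classifier and classify() are assumptions made for this sketch.
    class MetaClassifierSketch {
        interface Classifier {
            int classify(String document, String topic);   // returns +1 (accept) or -1 (reject)
        }

        // Returns +1, -1, or 0 (abstention) for document d and class c.
        static int meta(List<Classifier> v, double[] w, double t1, double t2,
                        String d, String c) {
            double sum = 0.0;
            for (int i = 0; i < v.size(); i++)
                sum += w[i] * v.get(i).classify(d, c);
            if (sum > t1) return +1;
            if (sum < t2) return -1;
            return 0;                                       // no clear decision: abstain
        }
    }

The three special instances discussed next (unanimous decision, majority decision, ξα-weighted averaging) differ only in the choice of the weights w and the thresholds t_1 and t_2.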
Three special instances of the above meta classifier are of particular importance (one of them using the ξα estimators [13]):

1. unanimous decision: for definitively positive classification the results of all classifiers must be equal, as follows: w(v_i) = 1 for all v_i in V, t_1 = h - 0.5 = -t_2

2. majority decision: the meta result is the result of the majority of the classifiers: w(v_i) = 1 for all v_i in V, t_1 = t_2 = 0.

3. weighted average according to the ξα estimators: w(v_i) = precision_ξα(v_i) for all v_i in V, t_1 = t_2 = 0

Such model combination and averaging techniques are well known in the machine learning literature [17]. They typically make learning-based decision functions more robust and can indeed improve the overall classification precision. This observation was also made in some of our experiments, where unanimous and weighted average decisions improved precision from values around 80 percent to values above 90 percent. By default, BINGO! uses multiple alternative classifiers in parallel and applies the unanimous-decision meta function in the crawl's learning phase and the weighted average in the harvesting phase. Each of these parallel classifiers requires computing a scalar product between vectors with a few thousand components for each visited Web page that needs to be classified. When the crawler's run-time is critical, we therefore switch to a single feature space and a single classifier, namely, the one with the best ξα estimator for its precision. This still requires training multiple classifiers, but in this run-time-critical case this is done only once before the harvesting phase is started. For the learning phase we always use the meta classifier.

3.6 Result Postprocessing

The result of a BINGO! crawl may be a database with several million documents. Obviously, the human user needs additional assistance for filtering and analyzing such result sets in order to find the best answers to her information demands. To this end BINGO! includes a local search engine that employs IR and data mining techniques for this kind of postprocessing.

The search engine supports both exact and vague filtering at user-selectable classes of the topic hierarchy, with relevance ranking based on the usual IR metrics such as cosine similarity [3] of term-based document vectors. In addition, it can rank filtered document sets based on the classifier's confidence in the assignment to the corresponding classes, and it can perform the HITS link analysis [14] to compute authority scores and produce a ranking according to these scores. Different ranking schemes can be combined into a linear sum with appropriate weights; this provides flexibility for trial-and-error experimentation by a human expert.
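The linear combination of ranking schemes just mentioned can be sketched in a few lines. The following Java illustration assumes that the individual scores (cosine similarity, classifier confidence, authority score) have already been computed and normalized; the names and weights are our own, not taken from the BINGO! code.

    import java.util.*;

    // Illustrative weighted linear combination of ranking scores (Section 3.6).
    class RankCombinationSketch {
        record Scored(String url, double cosine, double svmConfidence, double authority) {}

        static List<Scored> rank(List<Scored> docs,
                                 double wCosine, double wConfidence, double wAuthority) {
            Comparator<Scored> byCombinedScore = Comparator.comparingDouble(
                    (Scored s) -> wCosine * s.cosine()
                                + wConfidence * s.svmConfidence()
                                + wAuthority * s.authority())
                    .reversed();                     // highest combined score first
            List<Scored> ranked = new ArrayList<>(docs);
            ranked.sort(byCombinedScore);
            return ranked;
        }
    }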
Filtering and ranking alone cannot guarantee that the user finds the requested information. Therefore, when BINGO! is used for expert Web search, our local search engine supports additional interactive feedback. In particular, the user may select additional training documents among the top ranked results that he sees and possibly drops previous training data; then the filtered documents are classified again under the retrained model to improve precision. For information portal generation, a typical problem is that the results in a given class are heterogeneous in the sense that they actually cover multiple topics that are not necessarily closely related. This may result from the diversity and insufficient quality of the original training data.

To help the portal administrator better organize the data, BINGO! can perform a cluster analysis on the results of one class and suggest creating new subclasses with tentative labels automatically drawn from the most characteristic terms of these subclasses. The user can experiment with different numbers of clusters, or BINGO! can choose the number of clusters such that an entropy-based cluster impurity measure is minimized [9]. Our current implementation uses the simple K-means algorithm [16, 17] for clustering, but we plan to add more sophisticated algorithms.

4 Making BINGO! Efficient

Our main attention in building BINGO! was on search result quality and the effectiveness of the crawler. When we started with larger-scale experimentation, we realized that we had underestimated the importance of performance and that effectiveness and efficiency are intertwined: the recall of our crawls was severely limited by the poor speed of the crawler. In the last months we focused our efforts on performance improvement and reimplemented the most performance-critical function components.

BINGO! is implemented completely in Java and uses Oracle9i as a storage engine. The database-related components (document analysis, feature selection, etc.) are implemented as stored procedures; the crawler itself runs as a multi-threaded application under the Java VM. As crawl results are stored in the database, we implemented our local search engine as a set of servlets under Apache and the Jserv engine. Our rationale for Java was easy portability, in particular, our student's desire to be independent of the "religious wars" about Windows vs. Linux as the underlying platform.

This section discusses some of the Java- and database-related performance problems and also some of the key techniques for accelerating our crawler. We adopted some useful tips on crawl performance problems from the literature [10, 11, 20] and also developed various additional enhancements.
4.1 Lessons on Database Design and Usage

The initial version of BINGO! used object-relational features of Oracle9i (actually Oracle8i when we started), in particular, nested tables for hierarchically organized data. This seemed to be the perfect match for storing documents, as the top-level table, and the corresponding sets of terms and associated statistics as a subordinate table (document texts were stored in a LOB attribute of the top-level table). It turned out, however, that the query optimizer had to compute Cartesian products between the top-level and the subordinate table for certain kinds of queries with selections and projections on both tables. Although this may be a problem of only a specific version of the database system, we decided to drastically simplify the database design and now have a schema with 24 flat relations, and also simplified the SQL queries accordingly.

Crawler threads use separate database connections associated with dedicated database server processes. Each thread batches the storing of new documents and avoids SQL insert commands by first collecting a certain number of documents in workspaces and then invoking the database system's bulk loader for moving the documents into the database. This way the crawler can sustain a throughput of up to ten thousand documents per minute.

4.2 Lessons on Crawl Management

Networking aspects

A key point for an efficient crawler in Java is control over blocking I/O operations. Java provides the convenient HTTPUrlConnection class, but the underlying socket connection is hidden from the programmer. Unfortunately, it is impossible to change the default timeout setting; thus, a successfully established but very slow connection cannot be cancelled. The recommended way to overcome this limitation of the Java core libraries is to control the blocking connection using a parallel "watcher thread". To avoid this overhead, BINGO! implements its own socket-based HTTP connections following RFC 822 [18].

The Java core class InetAddress, used for the representation of network addresses and resolving of host names, is another potential bottleneck for the crawler [11]. It was observed that the caching algorithm of InetAddress is not sufficiently fast for thousands of DNS lookups per minute. To speed up name resolution, we implemented our own asynchronous DNS resolver. This resolver can operate with multiple DNS servers in parallel and resends requests to alternative servers upon timeouts. To reduce the number of DNS server requests, the resolver caches all obtained information (hostnames, IP addresses, and additional hostname aliases) using a limited amount of memory with LRU replacement and TTL-based invalidation.
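A bounded lookup cache with LRU replacement and TTL-based invalidation of the kind just described can be sketched with an access-ordered LinkedHashMap. This is an illustration under our own assumptions (names, capacity, and TTL handling), not the BINGO! resolver itself.

    import java.util.*;

    // Sketch of a bounded DNS result cache with LRU replacement and TTL expiry.
    class DnsCacheSketch {
        record CacheEntry(String ipAddress, long expiresAtMillis) {}

        private final int capacity;
        private final long ttlMillis;
        private final LinkedHashMap<String, CacheEntry> cache;

        DnsCacheSketch(int capacity, long ttlMillis) {
            this.capacity = capacity;
            this.ttlMillis = ttlMillis;
            // Access-order map: the least recently used entry is evicted first.
            this.cache = new LinkedHashMap<>(16, 0.75f, true) {
                protected boolean removeEldestEntry(Map.Entry<String, CacheEntry> eldest) {
                    return size() > DnsCacheSketch.this.capacity;
                }
            };
        }

        synchronized String lookup(String host) {
            CacheEntry e = cache.get(host);
            if (e == null) return null;                       // miss: resolve asynchronously
            if (System.currentTimeMillis() > e.expiresAtMillis()) {
                cache.remove(host);                           // TTL expired: invalidate
                return null;
            }
            return e.ipAddress();
        }

        synchronized void store(String host, String ipAddress) {
            cache.put(host, new CacheEntry(ipAddress, System.currentTimeMillis() + ttlMillis));
        }
    }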
Since a document may be accessed through different path aliases on the same host (this holds especially for well referenced authorities, for compatibility with outdated user bookmarks), the crawler uses several fingerprints to recognize duplicates. The initial step consists of simple URL matching (however, URLs have an average length of more than 50 bytes [2]; our implementation merely compares the hashcode representation of the visited URL, with a small risk of falsely dismissing a new document). In the next step, the crawler checks the combination of returned IP address and path of the resource. Finally, the crawler starts the download and controls the size of the incoming data. We assume that the filesize is a unique value within the same host and consider candidates with previously seen IP/filesize combinations as duplicates. A similar procedure is applied to handle redirects. The redirection information is stored in the database for use in the link analysis (see Section 2.5). We allow multiple redirects up to a pre-defined depth (set to 25 by default).

Document type management

To avoid common crawler traps and incorrect server responses, the maximum length of hostnames is restricted to 255 (RFC 1738 [22] standard), and the maximum URL length is restricted to 1000. This reflects the common distribution of URL lengths on the Web [2], disregarding URLs that have GET parameters encoded in them.

To recognize and reject data types that the crawler cannot handle (e.g., video and sound files), the BINGO! engine checks all incoming documents against a list of MIME types [12]. For each MIME type we specify a maximum size allowed by the crawler; these sizes are based on large-scale Google evaluations [2]. The crawler controls both the HTTP response and the real size of the retrieved data and aborts the connection when the size limit is exceeded.

Crawl queue management

The proper URL ordering on the crawl frontier is a key point for a focused crawler. Since the absolute priorities may vary for different topics of interest, the queue manager maintains several queues, one (large) incoming and one (small) outgoing queue for each topic, implemented as red-black trees. The engine controls the sizes of the queues and starts the asynchronous DNS resolution for a small number of the best incoming links when the outgoing queue is not sufficiently filled. So expensive DNS lookups are initiated only for promising crawl candidates. Incoming URL queues are limited to 25,000 links and outgoing URL queues to 1000 links, to avoid uncontrolled memory usage.
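A per-topic frontier queue of this kind can be sketched as a bounded ordered set keyed by the (decayed) SVM confidence described in the following paragraph. The Java fragment below is purely illustrative; the class names, the capacity handling, and the decay constant are assumptions, not the BINGO! implementation.

    import java.util.*;

    // Sketch of a bounded per-topic URL queue ordered by decayed SVM confidence.
    class CrawlQueueSketch {
        record CrawlCandidate(String url, double svmConfidence, int tunnellingSteps) {
            double priority() {
                // Exponential decay: each tunnelling step halves the priority (factor 0.5).
                return svmConfidence * Math.pow(0.5, tunnellingSteps);
            }
        }

        private final int capacity;
        private final TreeSet<CrawlCandidate> queue;

        CrawlQueueSketch(int capacity) {
            this.capacity = capacity;
            Comparator<CrawlCandidate> byPriorityDesc = (a, b) -> {
                int cmp = Double.compare(b.priority(), a.priority());   // higher priority first
                return cmp != 0 ? cmp : a.url().compareTo(b.url());     // tie-break on URL
            };
            this.queue = new TreeSet<>(byPriorityDesc);
        }

        synchronized void offer(CrawlCandidate c) {
            queue.add(c);
            if (queue.size() > capacity) queue.pollLast();   // drop the lowest-priority link
        }

        synchronized CrawlCandidate poll() {
            return queue.pollFirst();                        // best candidate first
        }
    }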
In all queues, URLs are prioritized based on their SVM confidence scores (see Section 2). The priority of tunnelled links (see 3.3) is reduced by a constant factor for each tunnelling step (i.e., with exponential decay), set to 0.5 in our experiments.

We also learned that a good focused crawler needs to handle crawl failures. If the DNS resolution or page download causes a timeout or error, we tag the corresponding host as "slow". For slow hosts the number of retrials is restricted to 3; if the third attempt fails, the host is tagged as "bad" and excluded for the rest of the current crawl.

5 Experiments

5.1 Testbed

In the experiments presented here, BINGO! was running on a dual Intel 2GHz server with 4 GB main memory under Win2k, connected to an Oracle9i database server on the same computer. The number of crawler threads was initially restricted to 15; the number of parallel accesses per host was set to 2 and per recognized domain to 5. The engine used 5 DNS servers located on different nodes of our local domain. The maximum number of retrials after timeouts was set to 3. The maximum allowed tunnelling distance was set to 2. The allowed size of the URL queues for the crawl frontier was set to 30,000 for each class. To eliminate "meta search capabilities", the domains of major Web search engines (e.g., Google) were explicitly locked for crawling. The feature selection, using the MI criterion, selected the best 2000 features for each topic.

In the following subsections we present two kinds of experiments: 1) the generation of an information portal from a small seed of training documents, and 2) an expert query that does not yield satisfactory results on any of the popular standard search engines such as Google.

5.2 Portal Generation for a Single Topic

To challenge the learning capabilities of our focused crawler, we aimed to gather a large collection of Web pages about database research. This single-topic directory was initially populated with only two authoritative sources, the home pages of David DeWitt and Jim Gray (actually 3 pages, as Gray's page has two frames, which are handled by our crawler as separate documents).

The initial SVM classification model was built using these 2 positive and about 400 negative examples randomly chosen from Yahoo top-level categories such as sports and entertainment (see Section 3). In the learning phase, BINGO! explored the vicinity of the initial seeds and added newly found archetypes to the topic. To this end the maximum crawl depth was set to 4 and the maximum tunnelling distance to 2, and we restricted the crawl of this phase to the domains of the training data (i.e., the CS department of the University of Wisconsin and Microsoft Research, and also additional Yahoo categories for further negative examples). Since we started with extremely small training data, we did not enforce the thresholding scheme (3.2) (requirement that the SVM confidence for new archetypes would have to be higher than the average confidence of the initial seeds). Instead, we rather admitted all positively classified documents (including the ones that were positively classified into the complementary class "OTHERS", i.e., the Yahoo documents). Altogether we obtained 1002 archetypes, many of them being papers (in Word or PDF), talk slides (in Powerpoint or PDF), or project overview pages of the two researchers, and then retrained the classifier with this basis.

The harvesting phase then performed prioritized breadth-first search with the above training basis and seed URLs, now without any domain limitations (other than excluding popular Web search engines). We paused the crawl after 90 minutes to assess the intermediate results at this point, and then resumed it for a total crawl time of 12 hours. Table 1 shows some summary data for this crawl.

    Property                  90 minutes      12 hours
    Visited URLs                 100,209     3,001,982
    Stored pages                  38,176       992,663
    Extracted links            1,029,553    38,393,351
    Positively classified         21,432       518,191
    Visited hosts                  3,857        34,647
    Max crawling depth                22           236

    Table 1: Crawl summary data

To assess the quality of our results we used the DBLP portal (http://dblp.uni-trier.de/) as a comparison yardstick. The idea was that we could automatically construct a crude approximation of DBLP's collection of pointers to database researcher homepages. DBLP contains 31,582 authors with explicit homepage URLs (discounting those that have only a URL suggested by an automatic homepage finder). We sorted these authors in descending order of their number of publications (ranging from 258 to 2), and were particularly interested in finding a good fraction of the top ranked authors with BINGO!. To prevent giving BINGO! any conceivably unfair advantage, we locked the DBLP domain and the domains of its 7 official mirrors for our crawler. In evaluating the results, we considered a homepage as "found" if the crawl result contained a Web page "underneath" the home page, i.e., whose URL had the homepage path as a prefix; these were typically publication lists, papers, or CVs. The rationale for this success measure was that it would now be trivial and fast for a human user to navigate upwards to the actual homepage.

We evaluated the recall, i.e., the total number of found DBLP authors, and the precision of the crawl
result. For the latter we considered the number of pages found out of the 1000 DBLP-top-ranked researchers, i.e., the ones with the most publications, namely, between 258 and 45 papers. The crawl result was sorted by descending classification confidence for the class "database research", and we compared the top 1000 results to the top 1000 DBLP authors.

Tables 2 and 3 show the most important measures on crawl result quality. Most noteworthy is the good recall: we found 712 of the top 1000 DBLP authors (without ever going through any DBLP page). The precision is not yet as good as we wished it to be: 267 of these top-ranked authors can be found in the 1000 documents with highest classification confidence. So a human user would have to use the local search engine and other data analysis tools to further explore the crawl result, but given that the goal was to automatically build a rich information portal we consider the overall results as very encouraging. Note that our crawler is not intended to be a homepage finder and thus does not use specific heuristics for recognizing homepages (e.g., URL pattern matching, typical HTML annotations in homepages, etc.). This could be easily added for postprocessing the crawl result and would most probably improve precision.

    Best crawl results    Top 1000 DBLP    All authors
    1,000                            27             91
    5,000                            79            498
    all (21,432)                    218          1,396

    Table 2: BINGO! precision (90 minutes)

    Best crawl results    Top 1000 DBLP    All authors
    1,000                           267            342
    5,000                           401          1,325
    all (518,191)                   712          7,101

    Table 3: BINGO! precision (12 hours)

5.3 Expert Web Search

To investigate the abilities of the focused crawler for expert Web search, we studied an example of a "needle-in-a-haystack" type search problem. We used BINGO! to search for public domain open source implementations of the ARIES recovery algorithm.

A direct search for "public domain open source ARIES recovery" on a large-scale Web search engine (e.g., Google) or a portal for open source software (e.g., sourceforge.net) does not return anything useful in the top 10 ranks; it would be a nightmare to manually navigate through the numerous links that are contained in these poor matches for further surfing. As an anecdotal remark, the open source software portal even returned lots of results about binaries and libraries.

Our procedure for finding better results was as follows. In a first step, we issued a Google query for "aries recovery method" and "aries recovery algorithm" to retrieve useful starting points for a focused crawl. The top 10 matches from Google were intellectually inspected by us, and we selected 7 reasonable documents for training; these are listed in Figure 4.

    1  http://www.bell-labs.com/topic/books/db-book/slide-dir/Aries.pdf
    2  http://www-2.cs.cmu.edu/afs/cs/academic/class/15721-f01/www/lectures/recovery with aries.pdf
    3  http://icg.harvard.edu/ cs265/lectures/readings/mohan-1992.pdf
    4  http://www.cs.brandeis.edu/ liuba/abstracts/mohan.html
    5  http://www.almaden.ibm.com/u/mohan/ARIES Impact.html
    6  http://www-db.stanford.edu/dbseminar/Archive/FallY99/mohan-1203.html
    7  http://www.vldb.org/conf/1989/P337.PDF

    Figure 4: Initial training documents

Note that Mohan's ARIES page (the 5th URL in Figure 4) does not provide an easy answer to the query; of course, it contains many references to ARIES-related papers, systems, and teaching material, but it would take hours to manually surf and inspect a large fraction of them in order to get to the source code of a public domain implementation.

These pages were used to build the initial SVM classification model. As negative examples we again used a set of randomly chosen pages from Yahoo top-level categories such as "sports". The focused crawler was then run for a short period of 10 minutes. It visited about 17,000 URLs with crawling depth between 1 and 7; 2,167 documents were positively classified into the topic "ARIES".

Finally, we used the result postprocessing component (see 3.6) and performed a keyword search filtering with relevance ranking based on cosine similarity. The top-10 result set for the query "source code release" contains links to the open-source projects Shore and Minibase, which implement the ARIES media recovery algorithm (Figure 5). Additionally, the third open source system, Exodus, is directly referenced by the Shore homepage. A MiniBase page (further up in the directory) was also among the top 10 crawl results according to SVM classification confidence; so even without further filtering the immediate result of the focused crawl would provide a human user with a very good reference.

We emphasize that the expert Web search supported by our focused crawler required merely a minimum amount of human supervision. The human expert had to evaluate only 30 to 40 links (20 for training set selection, and 10 to 20 for result postprocessing), collected into prepared lists with content previews. Including crawling time and evaluation, the overall search cost was about 14 minutes. This overhead is significantly lower than the typical time for manual surfing in the hyperlink vicinity of some initial authorities (such as IBM Almaden).
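The relevance ranking used in this postprocessing step is the standard cosine similarity between term-weight vectors. The following small Java illustration (sparse vectors as maps from term to weight) is generic and not taken from the BINGO! code.

    import java.util.Map;

    // Cosine similarity between two sparse term-weight vectors (e.g., tf*idf weights).
    class CosineSimilaritySketch {
        static double cosine(Map<String, Double> a, Map<String, Double> b) {
            double dot = 0.0, normA = 0.0, normB = 0.0;
            for (Map.Entry<String, Double> e : a.entrySet()) {
                normA += e.getValue() * e.getValue();
                Double w = b.get(e.getKey());
                if (w != null) dot += e.getValue() * w;   // only shared terms contribute
            }
            for (double w : b.values()) normB += w * w;
            if (normA == 0.0 || normB == 0.0) return 0.0;
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }
    }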
    0.025  http://www.cs.wisc.edu/shore/doc/overview/node5.html
    0.023  http://www.almaden.ibm.com/cs/jcentral press.html
    0.022  http://www.almaden.ibm.com/cs/garlic.html
    0.021  http://www.cs.brandeis.edu/~liuba/abstracts/greenlaw.html
    0.020  http://www.db.fmi.uni-passau.de/~kossmann/papers/garlic.html
    0.018  http://www.tivoli.com/products/index/storage-mgr/platforms.html
    0.015  http://www.cs.wisc.edu/shore/doc/overview/footnode.html
    0.014  http://www.almaden.ibm.com/cs/clio/
    0.011  http://www.cs.wisc.edu/coral/minibase/logmgr/report/node22.html
    0.011  http://www.ceid.upatras.gr/courses/minibase/minibase-1.0/documentation/html/minibase/logmgr/report/node22.html

    Figure 5: Top 10 results for query "source code release"

6 Conclusion

In this paper we have presented the BINGO! system for focused crawling and its applications to information portal generation and expert Web search. Many concepts in BINGO! have been adopted from prior work on Web IR and statistical learning, but we believe that the integration of these techniques into a comprehensive and versatile system like BINGO! is a major step towards a new generation of advanced Web search and information mining tools. The experiments that we presented in this paper have shown the great potential of the focused crawling paradigm but also some remaining difficulties of properly calibrating crawl setups for good recall and high precision.

Our future work aims to integrate the BINGO! engine with a Web-service-based portal explorer and a semantically richer set of ontology services. On the other hand, we plan to pursue approaches to generating "semantically" tagged XML documents from the HTML pages that BINGO! crawls and investigate ways of incorporating ranked retrieval of XML data [21] in the result postprocessing or even as a structure- and context-aware filter during a focused crawl.

References

[1] The open-source biojava project. http://www.biojava.org.
[2] Google research project. WebmasterWorld Pub Conference, 2002.
[3] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999.
[4] K. Bharat and M. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. ACM SIGIR Conference, 1998.
[5] S. Brin and L. Page. The anatomy of a large scale hyper-textual Web search engine. 7th WWW Conference, 1998.
[6] C.J.C. Burges. A tutorial on Support Vector Machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 1998.
[7] S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: A new approach to topic-specific Web resource discovery. 8th WWW Conference, 1999.
[8] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. SIGMOD Conference, 1998.
[9] R. Duda, P. Hart, and D. Stork. Pattern Classification. Wiley, 2000.
[10] A. Heydon and M. Najork. Mercator: A scalable, extensible Web crawler. WWW Conference, 1999.
[11] A. Heydon and M. Najork. Performance limitations of the Java core libraries. ACM Java Grande Conference, 1999.
[12] The Internet Assigned Numbers Authority (IANA). http://www.iana.org.
[13] T. Joachims. Estimating the generalization performance of an SVM efficiently. European Conference on Machine Learning (ECML), 2000.
[14] J.M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 1999.
[15] D. Lewis. Naive (Bayes) at forty: The independence assumption in information retrieval. European Conference on Machine Learning (ECML), 1998.
[16] C.D. Manning and H. Schuetze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[17] T. Mitchell. Machine Learning. McGraw Hill, 1996.
[18] Hypertext Transfer Protocol. http://www.w3.org/protocols/http/.
[19] G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw Hill, 1983.
[20] V. Shkapenyuk and T. Suel. Design and implementation of a high-performance distributed Web crawler. International Conference on Data Engineering (ICDE), 2002.
[21] A. Theobald and G. Weikum. Adding relevance to XML. 3rd International Workshop on the Web and Databases (WebDB), 2000.
[22] RFC 1738: Uniform Resource Locators (URL). http://www.w3.org/addressing/rfc1738.txt.
[23] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[24] Y. Yang and O. Pedersen. A comparative study on feature selection in text categorization. International Conference on Machine Learning (ICML), 1997.
Data Management in Application Servers
Dean Jacobs
BEA Systems
235 Montgomery St
San Francisco, CA 94104, USA
dean@bea.com

Abstract

This paper surveys data management techniques used in Application Servers. It begins with an overview of Application Servers and the way they have evolved from earlier transaction processing systems. It then presents a taxonomy of clustered services that differ in the way they manage data in memory and on disk. The treatment of conversational state is discussed in depth. Finally, it describes a persistence layer that is specifically designed for and tightly integrated with the Application Server. Throughout this paper, examples are drawn from experiences implementing BEA WebLogic Server™.

1. Introduction

Transaction processing applications maintain data representing real-world concepts and field associated requests from client devices [1]. Transaction processing applications play an essential role in many industries, including:

• Airlines  Managing flight schedules and passenger reservations.
• Banking  Accessing customer accounts through tellers and ATMs.
• Manufacturing  Tracking orders; Managing inventory; Planning and scheduling jobs
• Telephony  Allocating resources during call setup and teardown; Billing across multiple companies.

The typical transaction processing workload consists of many short-running requests that include both queries and updates. In contrast, typical workloads for non-transactional applications, such as those for scientific computing or analytical processing, consist of smaller numbers of compute-intensive queries.

Transaction Processing Monitors provide a software infrastructure for building transaction processing applications. A key feature of TP Monitors is support for ACID properties of transactions to handle failures and other exceptional conditions. Other important features include support for security, administration, scalability, and high availability.

TP Monitors support both synchronous (two-way) and asynchronous (one-way) communication between distributed processes. In a synchronous remote procedure call (RPC), the sender of a request is blocked until a response is obtained from the receiver. In asynchronous messaging, no response is expected and the sender is free to continue as soon as the request has been queued by the infrastructure. In either case, reliable communication can be provided using distributed transactions. This technique is problematic in an administratively-heterogeneous environment however because a transaction started in one jurisdiction may end up holding resources, such as database locks, in another. An alternative is to use store-and-forward messaging, where messages are queued on the sender before being forwarded to queues on the receiver. Message forwarding can be reliably implemented using simple ACKing protocols. In addition, it provides a natural way of buffering work when a remote system is temporarily unavailable.

Early TP Monitors, in particular IBM CICS [2], were developed in the 1970s to run on monolithic mainframe systems. Distributed TP Monitors, such as BEA Tuxedo™ [3], were developed in the 1980s to run on collections of mid-sized computers. Distributed TP Monitors use software-level clustering to provide scalability and availability. Application Servers, such as BEA WebLogic Server™ [4], evolved from Distributed TP Monitors in the 1990s to meet new demands imposed by the Internet. Significant among these demands is support for loosely-coupled clients.

Tightly-coupled clients contain code from the Application Server and communicate with it using proprietary protocols. As a result, they can offer higher functionality and better performance. Loosely-coupled clients do not contain code from the Application Server and communicate with it using vendor-neutral, industry-standard protocols such as HTTP [5] and SOAP [6]. Loosely-coupled clients tolerate a wider variety of evolutionary changes to the server-side of an application and are easier to maintain. Prominent loosely-coupled clients include Web Browsers, for human-to-machine communication, and Web Services clients, for machine-to-machine communication. Web Services protocols for asynchronous communication usually employ store-and-forward messaging.
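Store-and-forward messaging of the kind referred to here can be sketched as an outbound queue that a forwarder drains and trims only when the receiver acknowledges delivery. The following Java fragment is a deliberately simplified illustration of that idea; the interface and names are our own and not part of any product, and a real implementation would persist the queue and run the forwarder asynchronously.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Simplified sketch of store-and-forward messaging with acknowledgements.
    class StoreAndForwardSketch {
        interface Receiver {
            boolean deliver(String message);   // returns true iff the receiver ACKs
        }

        private final Deque<String> outbound = new ArrayDeque<>();

        // Called by the local sender; returns as soon as the message is queued.
        synchronized void send(String message) {
            outbound.addLast(message);
        }

        // Called periodically by a forwarder; a message is removed only after the
        // remote side has acknowledged it, so work is buffered while the remote
        // system is unavailable.
        synchronized void forward(Receiver remote) {
            while (!outbound.isEmpty()) {
                String next = outbound.peekFirst();
                if (!remote.deliver(next)) break;   // no ACK: keep the message, retry later
                outbound.removeFirst();
            }
        }
    }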
The choice of whether to use tightly- or loosely-coupled clients is affected not only by the degree of distribution of the system, but also by the extent to which it has a centralized administrative authority. For example, a chief architect might create a collection of tightly-coupled systems across widely-distributed outlets of a retail company or branches of a post office. On the other hand, highly-autonomous departments within the same enterprise might choose to use loosely-coupled clients so they can evolve their applications more independently.

As part of their evolution from TP Monitors, Application Servers have become increasingly dynamic in nature. Along with "systematic" applications, which are carefully planned and rolled out over a long period of time, Application Servers must handle "opportunistic" applications, which are rolled out quickly and modified often during their lifetimes. As a result, Application Servers greatly benefit by the ability to tune themselves, e.g., by automatically setting cache and thread pool sizes. In addition, Application Servers must handle traffic from unknown numbers of loosely-coupled clients across the Internet. As a result, Application Servers greatly benefit from the ability to dynamically enlist resources to handle peak loads [7]. Application Servers can be integrated into an enterprise-wide Grid Computing infrastructure that facilitates sharing of resources [8].

Application Servers provide most of the features of TP Monitors and thus embody the state of the art in transaction processing systems. In addition to traditional transaction processing applications, Application Servers are used for the following.

• E-commerce  Catalog browsing and purchasing of consumer goods such as books
• News Portals  Personalized consolidation of news from multiple sources
• Financial Management  Control of financial holdings such as bank accounts or stocks
• Packaged Applications  Single-purpose, vertical applications such as accounting, expense reporting, ERP, or CRM
• Business Workflows  Automation of business processes such as bidding or ordering parts
• Message Broker  Infrastructure for transforming and routing asynchronous messages

Application Servers offer a set of Application Programming Interfaces (APIs) to developers. Common APIs include the following.

• Servlets compute dynamic Web pages based on arguments to requests from Web Browsers. In contrast, static Web pages are always the same and can be hosted by a simple Web Server.
• Web Services are services that can be remotely invoked using industry-standard machine-to-machine protocols.
• Components are general-purpose objects with built-in support for features such as remote access, lifecycle management, and persistence. Components generally support object-relational mapping to allow persistence to RDBMSs.
• Connectors provide access to external back-end systems such as databases and mainframes.
• Messaging provides support for asynchronous communication.
• Naming services allow the externally visible parts of an application to be accessed by clients.

The Java™ 2 Enterprise Edition (J2EE™) [9] is a well-known set of industry-standard Application Server APIs that includes all of the above.
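As a concrete illustration of the first API in the list above, a minimal servlet computes a dynamic page from a request parameter. This is a generic example of the standard Servlet API, not code from WebLogic Server; the class name and the "customer" parameter are assumptions for the example.

    import java.io.IOException;
    import java.io.PrintWriter;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Minimal servlet: computes a dynamic page from an argument sent by the browser.
    public class GreetingServlet extends HttpServlet {
        protected void doGet(HttpServletRequest request, HttpServletResponse response)
                throws IOException {
            String customer = request.getParameter("customer");   // request argument
            response.setContentType("text/html");
            PrintWriter out = response.getWriter();
            out.println("<html><body>");
            out.println("<h1>Welcome" + (customer != null ? ", " + customer : "") + "</h1>");
            out.println("</body></html>");
        }
    }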
Application Servers are becoming increasingly distributed within the enterprise data center, both horizontally, from data-centric applications in the back-end to presentation-oriented applications in the front-end, and vertically, from stand-alone "stovepipe" applications to integrative applications that mediate access to many stovepipes. To provide desired qualities of service in the face of such distribution, Application Servers are increasingly maintaining data outside of centralized databases in the back-end. An Application Server may maintain the primary copy of data, for which it is then wholly responsible, or a secondary copy, which is drawn from a database. In either case, the net effect is often to relax ACID properties of transactions that manipulate the data [10].

This paper surveys data management techniques used in Application Servers. It identifies three basic types of clustered services - stateless, cached, and singleton - that differ in the way they manage data in memory and on disk. These service types provide different ways of relaxing ACID properties of transactions. The treatment of conversational state is discussed in depth. This paper also describes a persistence layer that is specifically designed for and tightly integrated with the Application Server. This persistence layer is lighter in weight than a conventional database and thus is better suited for distribution across a cluster. Throughout this paper, examples are drawn from experiences implementing WebLogic Server.

Two applications, which demonstrate the prominent loosely-coupled clients, will be used as running examples throughout this paper. The first is an e-commerce application, where consumers use Web Browsers to find and purchase goods. Items are selected one-by-one from a catalog and placed in a shopping cart, after which they may be purchased all together. The second is a workflow application, where a manufacturing business orders parts from a supplier using Web Services. Documents such as purchase orders and receipts flow back and forth as asynchronous messages according to an agreed-upon higher-level protocol.

This paper is organized as follows. Section 2 presents an overview of Application Server architectures. Section 3 presents the taxonomy of clustered services. Section 4
discusses the treatment of conversational state. Section 5 describes the Application Server persistence layer.

2. Application Server Architectures

Application Server systems are organized into logical tiers, each of which may contain multiple servers or other processes, as illustrated in Figure 1.

Figure 1  Multi-Tier Cluster Architecture
(figure: a stack of four tiers labelled Client, Presentation, Application, and Persistence)

In this context, the example e-commerce application might be organized as follows. The client tier contains Web Browsers running on personal computers. The presentation tier contains Web Servers serving static Web pages. The application tier contains Application Servers running servlets, to generate personalized dynamic pages, and components, to access catalog and purchasing data. The persistence tier contains a database that maintains the catalog and purchasing data.

The example workflow application might be organized as follows. The client tier contains servers in the manufacturing business that place orders for parts. The presentation tier contains message routing processes. The application tier contains Application Servers running Web Services, to orchestrate the steps of the supplier's workflow, and components, to access inventory and purchasing data. The persistence tier contains a database that maintains the inventory and purchasing data.

More generally, the client tier may contain personal devices, such as workstations or handheld mobile units, embedded devices, such as network appliances or office machines, or servers in other enterprise systems. The presentation tier manages basic interactions with these clients over whatever protocols they require. Processes in the presentation tier do not run application code. The application tier contains Application Servers running all of the application code. The application tier may itself be divided, for example, into servlet and component tiers. The persistence tier provides durable storage in the form of databases and file systems. The persistence tier may also contain mainframes and other back-end systems.

The typical transaction processing workload consists of many short-running requests. In this setting, overall utilization of the system is improved by processing each request on as few servers as possible, since there is sufficient work to go around and the overhead for communication is relatively large. Consequently, all other factors being equal, it is preferable to minimize the number of physical tiers in the system.

The most common reason to segregate servers into actual physical tiers is for security, in particular, to support the placement of firewalls that filter traffic according to ports, protocols, and machines. Typically, firewalls are used to protect the application tier from the outside world, but they may also be used to restrict internal access to the persistence tier. Another common reason for segregating servers into physical tiers is to improve scalability by providing session concentration. The idea here is to place many smaller machines in the front end and multiplex socket connections to fewer, larger machines in the back end. In practice, a single machine can support thousands of connections and thus session concentration is required only to support tens of thousands of clients.

A cluster is a group of servers¹ that coordinate their actions to provide scalable, highly-available services. Scalability is provided by allowing servers to be dynamically added and removed and by balancing the load of requests across the system. Availability is provided by ensuring that there is no single point of failure and by migrating work off of failed servers. Ideally, a cluster should offer a single system image so that clients remain unaware of whether they are communicating with one or many servers [11]. The servers in a cluster may be contained within a single tier or they may span several tiers.

¹ Throughout this paper, the term "server" refers to a software process rather than a piece of hardware. The latter is referred to as a "machine".

Simple round robin or data-dependent load balancing schemes generally suffice for the typical transaction processing workload. In particular, it is rarely worth the effort either to take actual server load into account or to redistribute on-going work when it occasionally becomes unbalanced. This is in contrast to practices commonly employed for compute-intensive applications [12].

Load balancing and failover for tightly-coupled clients is built into the Application Server infrastructure. For example, WebLogic Server integrates this functionality into its implementation of Remote Method Invocation (RMI), the basic Java API for invoking methods of a remote object. The caller-side RMI stub for a service obtains information about the instances of the service and makes load balancing and failover decisions.

Load balancing and failover for loosely-coupled clients must be implemented given the fixed vendor-neutral front-end of the cluster. One common approach relies on the fact that DNS allows multiple IP addresses to be listed under the same name and cycles through those addresses for each lookup. Using this feature, the front-end servers
in a cluster can be co-listed under one name and clients can choose when to do lookups. This approach provides only coarse control over load balancing and failover. Moreover, it exposes details of the system so it is both less secure and harder to reconfigure. An alternative approach is to use a load balancing appliance (also known as a "packet sprayer") that exposes a single IP address and routes each request to one of the front-end servers behind it [13].

3. Clustered Services

This section describes three basic types of clustered services - stateless, cached, and singleton - that differ in the way they manage application data in memory and on disk.

Stateless services do not maintain application data between invocations, but rather load it into memory from shared back-end systems as needed for each request. A stateless service can be made scalable and highly available in a cluster by offering many instances of it, any one of which is as good as any other. Clients of the service are then free to switch between the instances as needed for load balancing and failover. While this approach is very simple, the large number of accesses to shared back-end systems can lead to long response times and can become a bottleneck to throughput.

To mitigate these problems, application data must be maintained privately by each server, either in memory, on disk, or both. Cached services maintain application data between invocations but keep it only loosely synchronized with the primary copy in a shared back-end system. This weak form of consistency allows there to be multiple instances of the service, for scalability and high availability, without incurring a high overhead for synchronizing copies of the data. Singleton services maintain application data between invocations and guarantee its strict transactional consistency. To achieve such guarantees without incurring excessive overhead, each individual data item is exclusively owned by a single instance of the service. If an instance of a singleton service fails, ownership of its data must be migrated to another server.

3.1 Stateless Services

Stateless services do not maintain application data between invocations. The simplest example of a stateless service is a component that computes a pure mathematical function of its arguments. A stateless service may maintain data internally as long as it does not directly affect results returned to clients. For example, database connection pools, which allow sharing and reuse of database connections by many clients, are stateless but keep track internally of which connections are in use. A stateless service may load application data into memory from a shared back-end system but only for the duration of an individual invocation. In the example e-commerce application, a stateless component might retrieve items from a large catalog that is stored in a database and pass them to a servlet for presentation in a browser. In the example workflow application, a stateless Web Service might retrieve the current workflow state from a database as each request arrives.

A stateless service can be made scalable and highly available by offering many instances of it in a cluster. WebLogic Server integrates this functionality into its implementation of RMI as follows. Each member of the cluster advertises the instances of stateless services it offers using a light-weight multicast protocol. This information is obtained by the caller-side stub for a service and used to make load balancing and failover decisions. The default load balancing algorithm uses a modified round-robin scheme that favors certain servers in order to minimize the spread of a transaction. The default failover algorithm retries a failed operation only if it can be guaranteed that there were no side-effects.

3.2 Cached Services

Cached services maintain application data between invocations but keep it only loosely synchronized with the primary copy in a shared back-end system. A cached service can be made scalable and highly available by offering many instances of it in a cluster. Cached data may be kept in memory and/or written out to a server's private disk. Data may be written out to avoid reacquiring it after a restart or to free up memory. Ideally, an Application Server should integrate caching into the implementation of its various APIs and, in addition, offer an explicit caching API.

Cached data may be the result of application-level processing of back-end data. In the example e-commerce application, HTML page fragments describing popular catalog items or special offers might be computed from relational data and cached by a servlet. In the example workflow application, aggregate historical data that is used for generating price quotes might be computed from relational data and cached by a Web Service. Cached data may also be a direct copy of back-end data. In the example applications, user or customer profiles that are retrieved from the database might be directly cached by a database connector or a component. Note that database connectors will cache the tabular results of relational queries while components will cache the fields of objects generated by object-relational mapping.

Cached data that is directly copied from the back-end may be updated by the service and written back. In the example applications, a component representing a user or customer profile might allow the data to be updated at the object level, producing a write to the backing relational data. Update anomalies can occur here because the read of the cached data occurs in a different transaction than the write. To prevent such anomalies from occurring, the data must be protected by optimistic concurrency control from within the cache.
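The version-column technique described in the next paragraphs can be illustrated with plain JDBC. The table and column names below are assumptions made for the example; the point is that the UPDATE is predicated on the previously read version, and a concurrency failure is detected when no row matches.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    // Illustrative optimistic update of a cached customer profile (assumed schema:
    // PROFILE(ID, EMAIL, VERSION)). The WHERE clause compares the version read
    // earlier with the current database value; zero updated rows means a conflict.
    class OptimisticUpdateSketch {
        static void updateEmail(Connection con, long id, String newEmail, int readVersion)
                throws SQLException {
            String sql = "UPDATE PROFILE SET EMAIL = ?, VERSION = VERSION + 1 " +
                         "WHERE ID = ? AND VERSION = ?";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setString(1, newEmail);
                ps.setLong(2, id);
                ps.setInt(3, readVersion);
                if (ps.executeUpdate() == 0) {
                    // Someone else changed the row since it was cached:
                    // surface a concurrency exception so the application can retry.
                    throw new SQLException("optimistic concurrency conflict on PROFILE " + id);
                }
            }
        }
    }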
WebLogic Server provides an option to use optimistic concurrency control to keep the cached fields of a component consistent with a database. At the beginning of a transaction, the server records the initial value of certain cached fields, either application-level version fields or actual data fields. At commit time, the generated UPDATE statement is predicated by a WHERE clause in which these values are compared with those in the database and a concurrency exception is thrown if they don't match. When such an exception occurs, it generally suffices for an application to retry the transaction. Overall, although this approach does not ensure serializability, its behavior may be desirable in that it increases concurrency in acceptable ways.

In demand caching, values are loaded as they are needed and evicted as they become out of date, as discussed in more detail below. Values may also be evicted to recover memory, a process that should be integrated with server-wide memory management. Demand caching is appropriate for large data collections with small working sets, such as user or customer profiles. In materialized caching, which is a form of replication, values are pre-loaded during initialization and refreshed when they become out of date. Values may not be evicted to recover memory. Materialized caching is appropriate for moderately-sized data collections that are frequently used, such as product catalogs. Since the set of data in memory is known at all times, this technique facilitates querying through the cache.

Cached values may be assigned a time-to-live until eviction or refresh. This approach does not require any communication between servers, so it scales well, but requires that the application tolerate a given window of staleness and inconsistency. This approach is attractive when the data is frequently updated, e.g., from a real-time data stream, in which case keeping up with the changes can be less efficient than not caching at all. Alternatively or in addition, values may be evicted or refreshed when updates occur. This approach is attractive when the data is infrequently updated, in which case the signalling overhead will be tolerable. Update signals may be sent with varying degrees of reliability, e.g., from best effort multicast to durable messaging.

Sending update signals requires identifying when relevant back-end data has changed. Doing so is straight-forward if updates go through the Application Server itself, since it can then capture them in the course of its normal operations. For example, WebLogic Server can be configured to automatically evict all instances of a cached component in the event that any one of them is updated. After a transaction commits, the server multicasts a cluster-wide cache eviction signal containing the keys of any updated components. Identifying when relevant back-end data has changed is more difficult if updates go through other applications that share the data. In this case, the Application Server must rely on mechanisms such as database triggers or log-sniffing. Alternatively, the application can be made responsible for explicitly triggering cache evictions through a direct API.

For cached data that has been computed, sending update signals also requires identifying which pieces of back-end data are relevant in each case. There is a trade-off here associated with the granularity of tracking of the data: finer granularity results in longer caching but is harder to implement efficiently. If the associated queries are known in advance, then it is possible to use database view maintenance techniques [14] for materialized caching and view invalidation techniques [15] for demand caching. This problem is compounded in the presence of ad-hoc queries, particularly if application-level processing of the back-end data makes it unclear which queries are relevant.

In a partitioned caching scheme [16], responsibility for subsets of the data is striped across subsets of the servers in the cluster. Partitioning makes it possible to scale up the effective memory size of the cluster so it can manage larger data collections. Partitioning requires data-dependent routing to forward requests to the appropriate servers. In contrast, without partitioning, data accesses always occur on the local server.

In a two-tier caching scheme, a first-tier cached service draws its data from a second-tier cached service, which draws its data from a shared back-end system as usual. Note that both tiers are contained within the application tier as illustrated in Figure 1. In the example e-commerce application, the first tier might contain servlets with small, demand caches of page fragments while the second tier contains components with large, materialized caches of catalog data. This architecture reduces the load on the back-end, since it allows a single lookup in the second tier to be shared by many members of the first tier. Moreover, it separates heavy-duty garbage collection, which is generated by application logic running in the first tier, from heavy-duty caching, which occurs in the second tier. Such a separation is advantageous because it allows the garbage collector to avoid needless scanning of cache elements.
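Before turning to singleton services, here is a minimal sketch of the demand-caching ideas above: entries carry a time-to-live and can also be evicted explicitly when an update signal arrives. The class and its loader interface are hypothetical, invented for illustration rather than taken from any product.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Function;

    /** Hypothetical demand cache with a time-to-live and explicit eviction on update signals. */
    class DemandCache<K, V> {
        private record Entry<T>(T value, long expiresAtMillis) {}

        private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
        private final Function<K, V> loader;   // loads a value from the shared back-end
        private final long ttlMillis;          // staleness window the application tolerates

        DemandCache(Function<K, V> loader, long ttlMillis) {
            this.loader = loader;
            this.ttlMillis = ttlMillis;
        }

        /** Returns the cached value, reloading it if it is missing or past its time-to-live. */
        V get(K key) {
            long now = System.currentTimeMillis();
            Entry<V> e = entries.get(key);
            if (e == null || e.expiresAtMillis() < now) {
                e = new Entry<>(loader.apply(key), now + ttlMillis);
                entries.put(key, e);
            }
            return e.value();
        }

        /** Called when an update signal (e.g., a cluster-wide eviction message) names this key. */
        void evict(K key) {
            entries.remove(key);
        }
    }

Whether to push refreshed values proactively (materialized caching) or only evict and reload on demand is exactly the policy choice discussed above.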
3.3 Singleton Services

Singleton services maintain application data between invocations and guarantee its strict transactional consistency. To achieve such guarantees without limiting scalability of the cluster, each individual data item is exclusively owned by a single instance of the service. A data item may be accessed only by its owner; it may not be accessed by other members of the cluster nor shared with other applications.

The primary copy of a data item may be kept in a shared back-end system and cached in memory on the owner. In this case, performance is improved only for reads, since writes have to go through to the shared back-end. A second alternative is to keep the primary copy on a private disk of the owner as well as caching the data in memory. In addition to improving the performance of reads, this approach reduces the load on the shared back-end. A third alternative, which maximizes performance, is to keep the primary copy in memory on the owner rather than on a disk.

Examples of singleton services and their associated data items include the following.
• A servlet container (the Application Server module that executes servlets) and the data associated with browser sessions, such as the contents of shopping carts.
• A messaging queue and the messages on it that are awaiting delivery.
• A transaction manager and the data associated with its on-going transactions.
• A distributed lock manager and the status of locks in the system.
• An in-memory database and its current state.
• An event correlation engine, such as an alarm generator, and the current state of the system.

A large singleton service may be made scalable by partitioning it into multiple instances, each of which handles a different slice of the data. For example, WebLogic Server allows a logical messaging queue to be partitioned into many physical instances, each of which is responsible for certain messaging consumers. In the example workflow application, a logical queue holding requests to purchase parts might be partitioned into multiple physical queues, each associated with a group of customers. In the case of queues, partitioning also improves availability in that messages can continue to flow through the system even though an instance has failed, although certain messages or users may be stalled until the failed instance is recovered.

A request for a singleton service must be routed to the appropriate server. Routing must take into account partitioning if it occurs. Routing is straight-forward for tightly-coupled clients, since it is built into the Application Server infrastructure. For loosely-coupled clients, routing must be implemented given the fixed vendor-neutral front-end of the cluster, as discussed in the next section on conversational state. Routing introduces an extra network hop in processing a request. This overhead is acceptable if it is used to trade off server-to-database communication with server-to-server communication. Ideally, all of the singleton service instances needed to process a request should be co-located on the same server so that routing occurs only once per request.

If an instance of a singleton service fails, ownership of its data items must be migrated to a new server. If the primary copy of the data is kept in a shared back-end system, then it can be directly accessed by the new server. Alternatively, the primary copy can be kept on a private disk whose ownership can be transferred, such as a dual-ported disk. A third alternative is to replicate the data to a secondary server in the cluster. Note that in the latter case, the secondary is used only to recover the data; it does not process requests while the primary is active. There are three levels at which migration can occur: server, data, and service.

In server migration, a singleton service instance is pinned to a particular server and that server is migrated to a new machine as failures occur. The IP addresses of the server are usually migrated along with it, using routing protocols such as ARP, so external references do not need to be adjusted. One advantage of server migration is that it does not require any special effort on the part of the singleton service implementer; even pre-existing services can be made highly available without modification. In addition, it is compatible with the design of most High-Availability (HA) Frameworks [17], which offer whole-process migration.

A disadvantage of server migration is that it requires the administrator to manually distribute the set of singleton service instances across a fixed set of servers. Moreover, in order to provide headroom for the cluster to perform automatic load balancing in the event that machines are added, an unnecessarily large number of servers must be defined. As a result, some machines may be required to host several servers at the same time. Another disadvantage of server migration is that it may take a long time to start a server and initialize the application, increasing the downtime of the service. This problem can be mitigated using hot-standby techniques, which entail implementing some kind of server/service lifecycle API.

In data migration, ownership of data elements is distributed and migrated among existing singleton service instances, as illustrated by the following examples.
• Each server has one servlet container and failure of a server entails migrating its browser sessions to other servers.
• Each server has one physical instance of a logical messaging queue and failure of a server entails migrating its outstanding messages to other servers.
• Each server has one Transaction Manager and failure of a server entails migrating its outstanding transactions to other servers.
• Each server has one instance of an in-memory database that is in charge of some slices of the data and failure of a server entails migrating its slices to other servers.

Note that in the first three cases, any instance of the service is as good as any other until some kind of session or connection is created, after which a particular instance must be used. The advantage of data migration is that work distribution can be performed automatically without intervention on the part of the administrator. The disadvantage is that it complicates the task of writing a singleton service because ownership of data items may be assigned on-the-fly.
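A hypothetical sketch of the bookkeeping behind data migration (the class and method names are invented for illustration, not part of any product API): an ownership map assigns each data slice to a live server and reassigns the slices of a failed server to the survivors.

    import java.util.*;

    /** Hypothetical ownership map for data migration among singleton service instances. */
    class OwnershipMap {
        private final Map<String, String> ownerBySlice = new HashMap<>(); // slice id -> server id
        private final List<String> liveServers = new ArrayList<>();

        synchronized void addServer(String serverId) {
            liveServers.add(serverId);
        }

        /** Assigns a slice to the least-loaded live server. */
        synchronized void assign(String sliceId) {
            liveServers.stream()
                    .min(Comparator.comparingLong(this::load))
                    .ifPresent(server -> ownerBySlice.put(sliceId, server));
        }

        /** On failure, migrate every slice owned by the failed server to the survivors. */
        synchronized void migrateFrom(String failedServer) {
            liveServers.remove(failedServer);
            ownerBySlice.entrySet().stream()
                    .filter(e -> e.getValue().equals(failedServer))
                    .forEach(e -> liveServers.stream()
                            .min(Comparator.comparingLong(this::load))
                            .ifPresent(e::setValue));
        }

        synchronized String ownerOf(String sliceId) {
            return ownerBySlice.get(sliceId);
        }

        private long load(String serverId) {
            return ownerBySlice.values().stream().filter(serverId::equals).count();
        }
    }

In a real cluster this map would itself have to be agreed upon by all members, which is precisely the cluster-membership problem taken up below.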
In service migration, complete singleton service instances are distributed and migrated among existing servers. As in server migration, this approach requires the administrator to manually distribute the work. A more serious problem is that this approach allows there to be multiple instances of the same service on the same server. This behavior is not acceptable for services, such as transaction managers, whose identity is uniquely associated with an individual server.

Regardless of the level at which it occurs (server, data, or service), migration usually requires sophisticated machinery to establish cluster membership. The problem is that, in an asynchronous network, there is no way to distinguish actual process failure from temporary process freezing or network partitioning [18]. Thus, a seemingly-failed process may reappear after migration has occurred, resulting in disagreement about the ownership of data. The standard solution is to have processes engage in a distributed agreement protocol [19] to establish cluster membership. Processes periodically prove their health to the rest of the system and, if that is not possible for any reason, shut themselves down. The decision to migrate is always postponed for at least one health-check period to give a seemingly-failed server the chance to shut itself down.

Among other options, WebLogic Server offers a novel technique for establishing cluster membership that uses a back-end database in place of server-to-server communication. Each server periodically writes to the database to prove its health and, within the same transaction, ensures that it has not been ejected from the cluster by another server. Other servers inspect the database to identify a timed-out server and, in the same transaction, eject it from the cluster. All timing data is taken from the database clock. This technique is particularly attractive for TP applications, which generally require a database and have one configured to the desired levels of scalability and availability. In contrast, distributed agreement protocols based on server-to-server communication introduce additional overhead that limits the size of a cluster. Moreover, such protocols often introduce a shared disk anyway in order to avoid "split-brain syndrome", where two sub-clusters function independently, which can result from network partitioning.

4. Managing Conversational State

A conversation between a client and the server-side of an application consists of a sequence of related requests intended to accomplish some goal. In the example e-commerce application, a conversation occurs when a browser client puts a series of items in a shopping cart and purchases them. In the example workflow application, a conversation occurs when a manufacturing client negotiates a bid for parts with a supplier. The participants of a conversation maintain conversational state to keep track of progress that has been made towards the desired goal.

Conversational state is ideal for management by a singleton service because a) requests that use it often benefit from short response times, b) it is not usually shared, and c) it can often tolerate reduced durability. This section discusses the use of singleton services to manage conversational state for the two prominent types of loosely-coupled clients, Web Browsers and Web Services clients.

To make use of singleton services, selection of the host for a conversation should occur when it is initially created and subsequent requests should be routed to the chosen server. Implementing such session affinity is straight-forward for tightly-coupled clients, since load balancing is built into the Application Server infrastructure. For loosely-coupled clients, which are discussed in this section, session affinity must be implemented given the fixed vendor-neutral front-end of the cluster.

4.1 Web Browser Clients

Web Browser conversations are called servlet sessions and the associated conversational state is called servlet session state. Web Browsers, Web Servers, and load balancing appliances provide mechanisms for implementing session affinity so servlet session state can be maintained in memory. When a servlet session is first created, the hosting server embeds its identity in a cookie that is returned to the client. The client then includes this cookie in each subsequent request, where it can be used to implement session affinity. One approach is for the Application Server vendor to provide a Web Server plug-in that inspects the cookie and routes requests from the presentation tier to the application tier. This approach introduces an extra network hop in processing each request. Alternatively, load balancing appliances can be configured to key session affinity to data such as cookies or client IP addresses.

If servlet session state is maintained only in memory on a single server, then it will be lost when that server fails. Availability can be improved by placing servlet session state under the control of a singleton service and migrating it in the event of failure. The sophisticated machinery to establish cluster membership is unnecessary in this case because a given servlet session is accessed by only a single client, which can unambiguously drive ownership and migration.

WebLogic Server supports replication of servlet session state to a secondary server in the cluster and migration of the data between servlet containers in the event of failure. All requests are handled by the primary server, which synchronously transmits a delta for any updates to the secondary before returning the response to the client. To support migration, the identities of both the primary and the secondary are embedded in the cookie. In the event that either server fails, a new primary/secondary pair is established lazily when the next request arrives, since that is the first opportunity to rewrite the cookie.
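A minimal sketch of the database-based membership technique described earlier in this section, assuming a hypothetical members table with server_id and last_beat columns (the schema, SQL dialect, and timeout are invented for illustration; this is not the actual WebLogic implementation):

    import java.sql.*;

    /** Hypothetical heartbeat/ejection logic for database-based cluster membership. */
    class DbMembership {

        /** Proves this server's health; fails if another member has already ejected it. */
        static void heartbeat(Connection con, String serverId) throws SQLException {
            con.setAutoCommit(false);
            try (PreparedStatement ps = con.prepareStatement(
                    "UPDATE members SET last_beat = CURRENT_TIMESTAMP WHERE server_id = ?")) {
                ps.setString(1, serverId);
                if (ps.executeUpdate() == 0) {
                    con.rollback();
                    // Our row is gone: we have been ejected and must shut ourselves down.
                    throw new IllegalStateException(serverId + " ejected from cluster");
                }
                con.commit();
            }
        }

        /** Ejects any member whose heartbeat has timed out, judged by the database clock. */
        static void ejectTimedOut(Connection con) throws SQLException {
            con.setAutoCommit(false);
            try (Statement st = con.createStatement()) {
                // Interval syntax is database-specific; this form is illustrative only.
                st.executeUpdate(
                    "DELETE FROM members WHERE last_beat < CURRENT_TIMESTAMP - INTERVAL '30' SECOND");
                con.commit();
            }
        }
    }

Because the update and the ejection check run in the same transaction, and all timing comes from the database clock, a frozen server cannot mistakenly believe it is still a member after it has been ejected.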
Figure 2 illustrates the case where a Web Server plug-in inspects the cookie and routes to the primary. If the primary is not reachable, it routes to the secondary, which then becomes the primary, creates a new secondary, and rewrites the cookie.
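A sketch of the routing decision such a plug-in might make, assuming purely for illustration that the cookie value encodes the primary and secondary server identities separated by "!"; the cookie format and the helper interface here are hypothetical:

    /** Hypothetical plug-in routing based on primary/secondary identities in a session cookie. */
    class SessionRouter {
        /** Picks the target server for a request, given the session cookie value (or null). */
        String chooseServer(String cookieValue, ClusterView cluster) {
            if (cookieValue == null) {
                return cluster.pickAnyServer();          // new session: any server may host it
            }
            String[] parts = cookieValue.split("!");     // assumed format: primary!secondary
            String primary = parts[0];
            String secondary = parts.length > 1 ? parts[1] : null;
            if (cluster.isReachable(primary)) {
                return primary;                          // normal case: session affinity
            }
            if (secondary != null && cluster.isReachable(secondary)) {
                return secondary;                        // failover: secondary takes over as primary
            }
            return cluster.pickAnyServer();              // both lost: state is rebuilt or gone
        }

        /** Minimal view of the cluster used by the router; implementations are deployment-specific. */
        interface ClusterView {
            boolean isReachable(String serverId);
            String pickAnyServer();
        }
    }

The server chosen on failover then rewrites the cookie so that later requests carry the new primary/secondary pair.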
[Figure 2 Replication with Routing in the Web Server: diagram of a Browser, Web Servers, and Servlet Engines, with the cookie naming the primary and secondary copies of the session state, shown after a failure.]

Figure 3 illustrates the case where a load balancing appliance performs routing. The primary is created on the server that was initially selected by the appliance and for which it set up session affinity. If the primary becomes unreachable, the appliance switches to some arbitrary member of the cluster. When the next request arrives there, the servlet engine inspects the cookie, contacts the secondary to get a copy of the state, becomes the primary, and rewrites the cookie leaving the secondary unchanged.

[Figure 3 Replication with External Routing: diagram of a Browser, a Web Server, and Servlet Engines with primary and secondary copies of the session state, shown before and after a failure.]

4.2 Web Services Clients

Web Service conversations may in general involve multiple participants, each of which maintains conversational state. In the example workflow application, suppose manufacturer A negotiates a bid for parts with broker B which then negotiates the bid with supplier C, as illustrated in Figure 4. Broker B should maintain a single piece of state for both conversations and thus it effectively has a single multi-party conversation. Note that in this example, A, B, and C each have their own cluster.

[Figure 4 Subordinate Web Service Conversations: diagram of parties A, B, and C, each with client-side and server-side state.]

Web Service conversational state should nominally be kept in a shared database and accessed through a stateless service each time a request arrives. However, conversations that are short-lived or that have bursty access patterns are attractive to maintain in memory under the management of a singleton service. Depending on the durability requirements, the data can be written through to a shared database, replicated to a passive secondary, or lost on failure. The latter two alternatives might be acceptable for read-only applications, shopping-cart-style applications where only the last fulfilment step is crucial, and forwarding applications where reliability is provided by the external end-points. It is not possible to ensure that session affinity is set up in all cases for all transport protocols, thus routing is in general required.

If a Web Service conversation is maintained in memory, then any associated in-bound or out-bound asynchronous messages should be co-located with it. Assuming it is prohibitive to maintain a queue per conversation, this behavior can be accomplished by partitioning a single logical queue into one physical instance per server and assigning a set of conversations to each instance. Again, depending on durability requirements, messages can be written through to a shared database, replicated to a passive secondary, or lost on failure. Treating a conversation and its messages in the same way provides a consistent unit of failure.

The difficulty of migrating multi-party conversations is illustrated in Figure 4. If A or C were to drive migration decisions about B in isolation, in the same way that a browser drives decisions about servlet sessions, then there could be disagreement as to which server owns the data. Instead, there must be cluster-wide agreement about migration and servers that are deemed to have failed must be shut down. Thus multi-party conversations require the sophisticated machinery to establish cluster membership.
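As a small illustration of the co-location point above (hypothetical names, not an actual product mechanism), conversations can be assigned to the per-server physical instances of a logical queue by hashing the conversation identifier, so that a conversation and its pending messages live on the same server:

    /** Hypothetical assignment of conversations to per-server queue partitions. */
    class ConversationPartitioner {
        private final String[] servers;  // one physical queue instance per server

        ConversationPartitioner(String[] servers) {
            this.servers = servers;
        }

        /** All messages for a conversation go to the partition of its owning server. */
        String serverFor(String conversationId) {
            return servers[Math.floorMod(conversationId.hashCode(), servers.length)];
        }
    }

A fixed hash like this covers only the steady state; when membership changes, reassigning the affected conversations together with their messages, as a single unit of failure, is exactly the migration problem discussed here.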
WebLogic Server allows Web Service conversations to be managed by singleton services. The implementation makes use of a Conversation Manager that is also a singleton service. The Conversation Manager mediates between clients to choose the initial host of a conversation at the point it is created. A client uses this mechanism to attempt to accommodate session affinity that has been set up for it by the vendor-neutral front-end of the cluster. After this point, a client has no say as to which server hosts a conversation; migration decisions are entirely under the control of the clustering infrastructure. The Conversation Manager also provides a service for finding the locations of conversations to facilitate routing. The locations of conversations are cached on each server to reduce the load on the Conversation Manager.

5. An Application Server Persistence Layer

As Application Servers become increasingly distributed within the enterprise, they increasingly maintain middle-tier data outside of centralized databases in the back-end. This paper has characterized this practice in terms of cached services, which maintain loosely-synchronized secondary copies of the data, and singleton services, which maintain strict transactional consistency by granting exclusive ownership of each data item to a single service instance.

An Application Server instance may write middle-tier data to a private disk, either to avoid reacquiring it from the back-end (for cached services) or because it is the primary copy and must be made durable (for singleton services). Conventional databases are less than ideal for this purpose for several reasons. First, since the data is accessed by only one server at a time, conventional distributed concurrency control is unnecessary. Second, since the data is often accessed only in limited ways, e.g., by key or through a sequential scan, conventional database access mechanisms are overbuilt. Third, conventional databases are relatively expensive and hard to maintain, making them inappropriate to install on every member of a cluster.

Middle-tier data is better handled by a persistence layer that is specifically designed for the Application Server. This persistence layer should be tightly integrated with Application Server instances to decrease communication costs and simplify administration. It should also be lighter weight than a conventional database so it is more appropriate to install on every member of a cluster.

Messages, both in-bound and out-bound, are the most significant category of middle-tier data. Gray argues that databases should be enhanced with TP-monitor-like features to handle messaging; for example, triggers and stored procedures should evolve into worker thread/process pools for servicing queue entries [20]. The counter-argument is that Application Servers should be enhanced with persistence, since they also provide much of the required infrastructure, including security, configuration, monitoring, recovery, and logging. Specialized file-based message stores are in fact common, for all of the reasons described above, and can be used as a starting point for building an Application Server persistence layer.

WebLogic Server has a file-based message store and it is being generalized to handle other kinds of data. Of particular importance is the state associated with Web Service conversations. In addition to the benefits described above, having the same resource manager handle this data along with its messages eliminates the need for two-phase commit between the messaging system and a database.

Another significant category of middle-tier data is the meta-data needed to configure the server and its applications. Meta-data can include business rules, user profiles, and security policies. The primary copy of such data is generally maintained in a shared back-end system. The data is pushed out to each server instance in the cluster where it is cached on a local disk. The local copy of the data can significantly reduce server and application start-up times as well as making restarts more autonomous.

It may also be useful to cache a copy of back-end application data on disk in the middle tier. This technique can isolate operational systems in the back-end from the distribution, load-handling, and error-handling requirements of presentation-oriented applications in the front-end. And the extraction, transformation, and loading process can optimize the data for the needs of these applications. For example, relational data might be pre-digested into object or XML form to avoid runtime mapping.

Acknowledgements

This paper reports on the work of many talented people at BEA, including Juan Andrade, Adam Bosworth, Ed Felt, Steve Felts, Eric Halpern, Anno Langen, Adam Messinger, Prasad Peddada, Sam Pullara, Seth White, Rob Woollen, and Stephan Zachwieja. Special thanks to Adam Messinger for helping to organize the ideas in this paper. This paper is dedicated to the memory of Ed Felt.

References

[1] J. Gray, A. Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, 1993.

[2] CICS/VS Version 1.6, General Information Manual, GC33-1055, IBM Corp., Armonk, N.Y., 1983.

[3] J. Andrade, M. Carges, T. Dwyer, and S. Felts. The Tuxedo System - Software for Constructing and Managing Distributed Business Applications. Addison-Wesley Publishing, 1996.

[4] BEA Systems. The WebLogic Application Server. http://www.bea.com/products/weblogic/server/index.shtml.

[5] Hypertext Transfer Protocol -- HTTP/1.1. http://www.ietf.org/rfc/rfc2616.txt.
[6] Simple Object Access Protocol (SOAP) 1.1. http://www.w3.org/TR/SOAP.

[7] A. Fox, S. Gribble, Y. Chawathe, E. Brewer, and P. Gauthier. Cluster-Based Scalable Network Services. Proceedings of the ACM Symposium on Operating Systems Principles, Vol. 31, October 1997.

[8] I. Foster, C. Kesselman, and S. Tuecke. The Anatomy of the Grid: Enabling Scalable Virtual Organizations. International Journal of Supercomputer Applications, 2001.

[9] Sun Microsystems. Java 2 Platform, Enterprise Edition (J2EE). http://java.sun.com/j2ee.

[10] J. Gray. The Transaction Concept: Virtues and Limitations. Proceedings of VLDB, Cannes, France, September 1981.

[11] G. F. Pfister. In Search of Clusters, 2nd Edition. Prentice Hall, 1998.

[12] D. L. Eager, E. D. Lazowska, and J. Zahorjan. Adaptive Load Sharing in Homogeneous Distributed Systems. IEEE Transactions on Software Engineering, Vol. 12, 1986.

[13] T. Bourke. Server Load Balancing. O'Reilly & Associates, August 2001.

[14] A. Gupta, I. S. Mumick (Editors). Materialized Views: Techniques, Implementations, and Applications. The MIT Press, 1999.

[15] K. S. Candan, D. Agrawal, W. S. Li, O. Po, W. P. Hsiung. View Invalidation for Dynamic Content Caching in Multitiered Architectures. Proceedings of the 28th Very Large Data Bases Conference, August 2002.

[16] B. Devlin, J. Gray, B. Laing, G. Spix. Scalability Terminology: Farms, Clones, Partitions, and Packs: RACS and RAPS. Microsoft Technical Report MS-TR-99-85, December 1999.

[17] E. Marcus, H. Stern. Blueprints for High Availability: Designing Resilient Distributed Systems. Wiley, January 2000.

[18] N. A. Lynch. Distributed Algorithms. Morgan Kaufmann, San Francisco, 1996.

[19] B. Lampson. How to Build a Highly Available System Using Consensus. In Distributed Algorithms, Lecture Notes in Computer Science 1151 (ed. Babaoglu and Marzullo), Springer, 1996.

[20] J. Gray. Queues are Databases. Proceedings of the 7th High Performance Transaction Processing Workshop, Asilomar, CA, September 1995.
NiagaraCQ: A Scalable Continuous Query System for Internet Databases

Jianjun Chen, David J. DeWitt, Feng Tian, Yuan Wang
Computer Sciences Department
University of Wisconsin-Madison
{jchen, dewitt, ftian, yuanwang}@cs.wisc.edu
ABSTRACT

Continuous queries are persistent queries that allow users to receive new results when they become available. While continuous query systems can transform a passive web into an active environment, they need to be able to support millions of queries due to the scale of the Internet. No existing systems have achieved this level of scalability. NiagaraCQ addresses this problem by grouping continuous queries based on the observation that many web queries share similar structures. Grouped queries can share the common computation, tend to fit in memory and can reduce the I/O cost significantly. Furthermore, grouping on selection predicates can eliminate a large number of unnecessary query invocations. Our grouping technique is distinguished from previous group optimization approaches in the following ways. First, we use an incremental group optimization strategy with dynamic re-grouping. New queries are added to existing query groups, without having to regroup already installed queries. Second, we use a query-split scheme that requires minimal changes to a general-purpose query engine. Third, NiagaraCQ groups both change-based and timer-based queries in a uniform way. To ensure that NiagaraCQ is scalable, we have also employed other techniques including incremental evaluation of continuous queries, use of both pull and push models for detecting heterogeneous data source changes, and memory caching. This paper presents the design of the NiagaraCQ system and gives some experimental results on the system's performance and scalability.

1. INTRODUCTION

Continuous queries [TGNO92][LPT99][LPBZ96] allow users to obtain new results from a database without having to issue the same query repeatedly. Continuous queries are especially useful in an environment like the Internet comprised of large amounts of frequently changing information. For example, users might want to issue continuous queries of the form:

Notify me whenever the price of Dell or Micron stock drops by more than 5% and the price of Intel stock remains unchanged over the next three months.

In order to handle a large number of users with diverse interests, a continuous query system must be capable of supporting a large number of triggers expressed as complex queries against web-resident data sets.

The goal of the Niagara project is to develop a distributed database system for querying distributed XML data sets using a query language like XML-QL [DFF+98]. As part of this effort, our goal is to allow a very large number of users to be able to register continuous queries in a high-level query language such as XML-QL. We hypothesize that many queries will tend to be similar to one another and hope to be able to handle millions of continuous queries by grouping similar queries together. Group optimization has the following benefits. First, grouped queries can share computation. Second, the common execution plans of grouped queries can reside in memory, significantly saving on I/O costs compared to executing each query separately. Third, grouping makes it possible to test the "firing" conditions of many continuous queries together, avoiding unnecessary invocations.

Previous group optimization efforts [CM86] [RC88] [Sel86] have focused on finding an optimal plan for a small number of similar queries. This approach is not applicable to a continuous query system for the following reasons. First, it is computationally too expensive to handle a large number of queries. Second, it was not designed for an environment like the web, in which continuous queries are dynamically added and removed. Our approach uses a novel incremental group optimization approach in which queries are grouped according to their signatures. When a new query arrives, the existing groups are considered as possible optimization choices instead of re-grouping all the queries in the system. The new query is merged into existing groups whose signatures match that of the query.

Our incremental group optimization scheme employs a query-split scheme. After the signature of a new query is matched, the sub-plan corresponding to the signature is replaced with a scan of the output file produced by the matching group. This optimization process then continues with the remainder of the query tree in a bottom-up fashion until the entire query has been analyzed. In the case that no group "matches" a signature of the new query, a new query group for this signature is created in the
system. Thus, each continuous query is split into several smaller queries such that inputs of each of these queries are monitored using the same techniques that are used for the inputs of user-defined continuous queries. The main advantage of this approach is that it can be implemented using a general query engine with only minor modifications. Another advantage is that the approach is easy to implement and, as we will demonstrate in Section 4, very scalable.

Since queries are continuously being added and removed from groups, over time the quality of the group can deteriorate, leading to a reduction in the overall performance of the system. In this case, one or more groups may require "dynamic re-grouping" to re-establish their effectiveness.

Continuous queries can be classified into two categories depending on the criteria used to trigger their execution. Change-based continuous queries are fired as soon as new relevant data becomes available. Timer-based continuous queries are executed only at time intervals specified by the submitting user. In our previous example, day traders would probably want to know the desired price information immediately, while longer-term investors may be satisfied being notified every hour. Although change-based continuous queries obviously provide better response time, they waste system resources when instantaneous answers are not really required. Since timer-based continuous queries can be supported more efficiently, query systems that support timer-based continuous queries should be much more scalable. However, since users can specify various overlapping time intervals for their continuous queries, grouping timer-based queries is much more difficult than grouping purely change-based queries. Our approach handles both types of queries uniformly.

NiagaraCQ is the continuous query sub-system of the Niagara project, which is a net data management system being developed at the University of Wisconsin and Oregon Graduate Institute. NiagaraCQ supports scalable continuous query processing over multiple, distributed XML files by deploying the incremental group optimization ideas introduced above. A number of other techniques are used to make NiagaraCQ scalable and efficient. 1) NiagaraCQ supports the incremental evaluation of continuous queries by considering only the changed portion of each updated XML file and not the entire file. Since frequently only a small portion of each file gets updated, this strategy can save significant amounts of computation. Another advantage of incremental evaluation is that repetitive evaluation is avoided and only new results are returned to users. 2) NiagaraCQ can monitor and detect data source changes using both push and poll models on heterogeneous sources. 3) Due to the scale of the system, all the information of the continuous queries and temporary results cannot be held in memory. A caching mechanism is used to obtain good performance with limited amounts of memory.

The rest of the paper is organized as follows. In Section 2 the NiagaraCQ command language is briefly described. Our new group optimization approach is presented in Section 3 and its implementation is described in Section 4. Section 5 examines the performance of the incremental continuous query optimization scheme. Related work is described in Section 6. We conclude our paper in Section 7.

2. NIAGARACQ COMMAND LANGUAGE

NiagaraCQ defines a simple command language for creating and dropping continuous queries. The command to create a continuous query has the following form:

    CREATE CQ_name
    XML-QL query
    DO action
    {START start_time} {EVERY time_interval} {EXPIRE expiration_time}

To delete a continuous query, the following command is used:

    DELETE CQ_name

Users can write continuous queries in NiagaraCQ by combining an ordinary XML-QL query with additional time information. The query will become effective at the start_time. The time_interval indicates how often the query is to be executed. A query is timer-based if its time_interval is not zero; otherwise, it is change-based. Continuous queries will be deleted from the system automatically after their expiration_time. If not provided, default values for the time are used. (These values can be set by the database administrator.) The action is performed upon the XML-QL query results. For example, it could be "MailTo dewitt@cs.wisc.edu" or a complex stored procedure to further process the results of the query. Users can delete installed queries explicitly using the delete command.

3. OUR INCREMENTAL GROUP OPTIMIZATION APPROACH

In Section 3.1, we present a novel incremental group optimization strategy that scales to a large number of queries. This strategy can be applied to a wide range of group optimization methods. A specific group optimization method based on this approach is described in Section 3.2. Section 3.3 introduces our query-split scheme that requires minimal changes to a general-purpose query engine. Sections 3.4 and 3.5 apply our group optimization method to selection and join operators. We discuss how our system supports timer-based queries in Section 3.6. Section 3.7 contains a brief discussion of the caching mechanisms in NiagaraCQ that make the system more scalable.

3.1 General Strategy of Incremental Group Optimization

Previous group optimization strategies [CM86] [RC88] [Sel86] focused on finding an optimal global plan for a small number of queries. These techniques are useful in a query environment where a small number of similar queries either enter the system within a short time interval or are given in advance. A naive approach for grouping continuous queries would be to apply these methods directly by reoptimizing all queries whenever a new query is added. We contend that such an approach is not acceptable for large dynamic environments because of the associated performance overhead.

We propose an incremental group optimization strategy for continuous queries in this paper. Groups are created for existing queries according to their signatures, which represent similar structures among the queries. Groups allow the common parts of
two or more queries to be shared. Each individual query in a query group shares the results from the execution of the group plan. When a new query is submitted, the group optimizer considers existing groups as potential optimization choices. The new query is merged into those existing groups that match its signatures. Existing queries are not, however, re-grouped in our approach. While this strategy is likely to result in sub-optimal groups, it reduces the cost of group optimization significantly. More importantly, it is very scalable in a dynamic environment.

Since continuous queries are frequently added and removed, it is possible that current groups may become inefficient. "Dynamic re-grouping" would be helpful to re-group part or all of the queries either periodically or when the system performance degrades below some threshold. This is left as future work.

3.2 Incremental Group Optimization using Expression Signatures

Based on our incremental grouping strategy, we designed a scalable group optimization method using expression signatures. Expression signatures [HCH+99] represent the same syntax structure, but possibly different constant values, in different queries. It is a specific implementation of the signature concept.

3.2.1 Expression Signature

For purposes of illustration, we use XML-QL queries on a database of stock quotes.

    Where <Quotes> <Quote>
        <Symbol>INTC</>
    </> </> element_as $g
    in "http://www.cs.wisc.edu/db/quotes.xml"
    construct $g

    Where <Quotes> <Quote>
        <Symbol>MSFT</>
    </> </> element_as $g
    in "http://www.cs.wisc.edu/db/quotes.xml"
    construct $g

Figure 3.1 XML-QL query examples

The two XML-QL queries in Figure 3.1 retrieve stock information on either Intel (symbol INTC) or Microsoft (symbol MSFT). Many users are likely to submit similar queries for different stock symbols. An expression signature is created for the selection predicates by replacing the constants appearing in the predicates with a placeholder. The expression signature for the two queries in Figure 3.1 is shown in Figure 3.2.

    Quotes.Quote.Symbol = constant
    in quotes.xml

Figure 3.2 Expression signature of the queries in Figure 3.1

A query plan is generated by the Niagara query parser. Figure 3.3 shows the query plans of the queries in Figure 3.1. The lower part of each query plan corresponds to the expression signature of the queries. A new operator TriggerAction is added on the top of the XML-QL query plan after the query is parsed. Expression signatures allow queries with the same syntactic structure to be grouped together to share computation [HCH+99]. Expression signatures for different queries will be discussed later. Note that in NiagaraCQ, users can specify an XML-QL query without specifying the destination data sources by using a "*" in the file name position and giving a DTD name. This allows users to specify continuous queries without naming the data sources. Our group query optimizer is easily extended to support this capability by using a mapping mechanism offered by the Niagara Search Engine. Without losing generality for our incremental grouping algorithm, we assume continuous queries are defined on a specific data source in this paper.

[Figure 3.3 Query plans of the queries in Figure 3.1: each plan is a File Scan of quotes.xml feeding a Select (Symbol = "INTC" or Symbol = "MSFT") feeding a Trigger Action.]

3.2.2 Group

Groups are created for queries based on their expression signatures. For example, a group is generated for the queries in Figure 3.1 because they have the same expression signature. We use this group in the following discussion. A group consists of three parts.

1. Group signature
The group signature is the common expression signature of all queries in the group. For the example above, the expression signature is given in Figure 3.2.

    Constant_value   Destination_buffer
    ....             ....
    INTC             Dest. i
    MSFT             Dest. j
    ....             ....

Figure 3.4 An example of a group constant table

2. Group constant table
The group constant table contains the signature constants of all queries in the group. The constant table is stored as an XML file. For the example above, "INTC" and "MSFT" are stored in this table (Figure 3.4). Since the tuples produced by the shared computation need to be directed to the correct individual query for further processing, the destination information is also stored with the constant.

3. Group plan
The group plan is the query plan shared by all queries in the group. It is derived from the common part of all single query plans in the group. Figure 3.5 shows the group plan for the queries in Figure 3.1.

[Figure 3.5 Group plan for the queries in Figure 3.1: a File Scan of quotes.xml and a File Scan of the constant table feed a Join on Symbol = Constant_value; a Split operator routes the joined tuples to Trigger Action I and Trigger Action J.]
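A rough sketch of the bookkeeping this implies (hypothetical classes; NiagaraCQ itself operates on XML-QL query plans rather than the simplified predicates shown here): the signature strips the constant out of a selection predicate, the group table is keyed by signature, and each group records its constants together with their destination buffers.

    import java.util.*;

    /** Hypothetical signature-based grouping of selection predicates. */
    class GroupTable {
        /** Signature: data source, attribute path, and operator, with the constant removed. */
        record Signature(String source, String attributePath, String op) {}

        /** A group keeps the constants of its member queries and their destination buffers. */
        static class Group {
            final Map<String, String> destinationByConstant = new LinkedHashMap<>();
        }

        private final Map<Signature, Group> groups = new HashMap<>();

        /** Adds a query such as "quotes.xml: Quotes.Quote.Symbol = 'INTC'" to its group. */
        String addQuery(String source, String attributePath, String op,
                        String constant, String destinationBuffer) {
            Signature sig = new Signature(source, attributePath, op);
            Group group = groups.computeIfAbsent(sig, s -> new Group());
            // Queries with the same constant share one destination buffer.
            return group.destinationByConstant.computeIfAbsent(constant, c -> destinationBuffer);
        }

        Optional<Group> lookup(String source, String attributePath, String op) {
            return Optional.ofNullable(groups.get(new Signature(source, attributePath, op)));
        }
    }

Under this sketch, adding the two queries of Figure 3.1 creates one group whose constant table holds INTC and MSFT; a later query on a new symbol only adds one more row to that table.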
An expression signature allows queries in a group to have different constants. Since the result of the shared computation contains results for all the queries in the group, the results must be filtered and sent to the correct destination operator for further processing. NiagaraCQ performs filtering by combining a special Split operator with a Join operator based on the constant values stored in the constant table. Tuples from the data source (e.g., quotes.xml) are joined with the constant table. The Split operator distributes each result tuple of the Join operator to its correct destination based on the destination buffer name in the tuple (obtained from the constant table). The Split operator removes the name of the destination buffer from the tuple before it is put into the output stream, so that subsequent operators in the query do not need to be modified. In addition, queries with the same constant value also share the same output stream. This feature can significantly reduce the number of output buffers.

Since generally the number of active groups is likely to be on the order of thousands or tens of thousands, group plans can be stored in a memory-resident hash table (termed a group table) with the group signature as the hash key. Group constant tables are likely to be large and are stored on disk.

3.2.3 Incremental Grouping Algorithm

In this section we briefly describe how the NiagaraCQ group optimizer performs incremental group optimization.

When a new query (Figure 3.6) is submitted, the group optimizer traverses its query plan bottom up and tries to match its expression signature with the signatures of existing groups. The expression signature of the new query, which is the same as the signature in Figure 3.2, matches the signature of the group in Figure 3.5. The group optimizer breaks the query plan (Figure 3.7) into two parts. The lower part of the query is removed. The upper part of the query is added onto the group plan. If the constant table does not have an entry "AOL", it will be added and a new destination buffer allocated.

    Where <Quotes>
        <Quote>
            <Symbol>AOL</>
        </></>
    element_as $g
    in "http://www.cs.wisc.edu/db/quotes.xml"
    construct $g

Figure 3.6 XML-QL query example

[Figure 3.7 Query plan for the query in Figure 3.6: a File Scan of quotes.xml feeding a Select with Symbol = "AOL" feeding a Trigger Action.]

In the case that the signature of the query does not match any group signature, a new group will be generated for this signature and added to the group table.

In general, a query may have several signatures and may be merged into several groups in the system. This matching process will continue on the remainder of the query plan until the top of the plan is reached. Our incremental grouping is very efficient because it only requires one traversal of the query plan.

In the following sections, we first discuss our query-split scheme and then describe how incremental group optimization is performed on selection and join operators.

3.3 Query Split with Materialized Intermediate Files

The destination buffer for the split operator can be implemented either in a pipelined scheme or as an intermediate file. Our initial design of the split operator used a pipeline scheme in which tuples are pipelined from the output of one operator into the input of the next operator. However, such a pipeline scheme does not work for grouping timer-based continuous queries. Since timer-based queries will only be fired at specified times, output tuples must be retained until the next firing time. It is difficult for a split operator to determine which tuples should be stored and how long they should be stored for.

In addition, in the pipelined approach, the ungrouped parts of all query plans in a group are combined with the group plan, resulting in a single execution plan for all queries in the group. This single plan has several disadvantages. First, its structure is a directed graph, and not a tree. Thus, the plan may be too complicated for a general-purpose XML-QL query engine to execute. Second, the combined plan may be very large and require resources beyond the limits of some systems. Finally, a large portion of the query plan may not need to be executed at each query invocation. For example, in Figure 3.5, suppose only the price of Intel stock changes. Although the destination buffer for Microsoft is empty, the upper part of the Microsoft query (Trigger Action J) is also executed. This problem can be avoided only if the execution engine has the ability to selectively
load part of a query plan in a bottom-up manner. Such a capability would require a special implementation of the XML-QL query engine.

[Figure 3.8 Query-split scheme using intermediate files: the group plan joins a File Scan of quotes.xml with the constant table on Symbol = Constant_value, where each constant table entry now maps a constant value to an intermediate file name (INTC to file_i, MSFT to file_j); the Split operator writes each output stream to its intermediate file, and separate upper queries (a File Scan of file_i feeding Trig. Act. I, and a File Scan of file_j feeding Trig. Act. J) read those files.]

Since a split operator has one input stream and multiple (possibly tens of thousands of) output streams, split operators may become a bottleneck when the ungrouped parts of queries consume output tuples from the split stream at widely varying rates. For example, suppose 100 queries are grouped together, 99 of which are very simple selection queries, and one is a very expensive query involving multiple joins. Since this expensive query may process the input from the split operator very slowly, it may block all the other simple queries.

The pipeline scheme can be used in systems that support only a small number of change-based continuous queries. Since our goal is to support millions of both change-based and timer-based continuous queries, we adopt an approach that is more scalable and easier to implement. We also try to use a general query engine to the maximal extent possible.

In our new design (Figure 3.8), the split operator writes each output stream into an intermediate file. A query plan is cut into two parts at the split operator and a file scan operator is added to the upper part of the plan to read the intermediate file. NiagaraCQ treats the two new queries like normal user queries. In particular, changes to the intermediate files are monitored in the same way as those to ordinary data sources! Since a new continuous query may overlap with multiple query groups, one query may be split into several queries. However, the total number of queries in the system will not exceed the number of groups plus the number of original user queries. Since we assume that no more than thousands of groups will be generated for millions of user queries, the overall number of queries in the system will increase only slightly. Intermediate file names are stored in the constant table and grouped continuous queries with the same constant share the same intermediate file.

The advantages of this new design include:
1. Each query is scheduled independently, thus only the necessary queries are executed. For example, in Figure 3.8, if only the price of Intel stock changes, queries on intermediate files other than "file_i" will not be scheduled. Since usually only a small amount of data is changed, only a few of the installed continuous queries will be fired. Thus, computation time and system resource usage are significantly reduced.
2. Queries after a split operator will be in a standard, tree-structured query format and thus can be scheduled and executed by a general query engine.
3. Each query in the system is about the size of a common user query, so that it can be executed without consuming an unusual amount of system resources.
4. This approach handles intermediate files and original data source files uniformly. Changes to materialized intermediate files will be processed and monitored just like changes to the original data files.
5. The potential bottleneck problem of the pipelined approach is avoided.

There are some potential disadvantages. First, the split operator becomes a blocking operator since the execution of the upper part of the query must wait for the intermediate files to be completely materialized. Since continuous queries run over data changes that are usually not very large, we do not believe that the impact of this blocking will be significant. Second, reading and writing the intermediate files incurs extra disk I/Os. Since most data changes will be relatively small, we anticipate that they will be buffered in memory before the upper-part queries consume them. There will be disk I/Os in the case of timer-based queries that have long time intervals because data changes may be accumulated. In this situation, data changes need to be written to disk no matter what strategy is used. As discussed in Section 3.7, NiagaraCQ uses special caching mechanisms to reduce this cost.

3.4 Incremental Grouping of General Selection Predicates

Our primary focus is on predicates that are in the format of "Attribute op Constant." Attribute is a path expression without wildcards in it. Op includes "=", "<", ">". Such formats dominate in selection queries. Other predicate formats could also be handled in our approach, but we do not discuss them further in this paper.

Figure 3.9 shows an example of a range selection query that returns every stock whose price has risen more than 5%. Figure 3.9 also gives its expression signature. The group plan for queries with this signature is the same as in Figure 3.5, except that the join condition is Change_Ratio > constant.
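A rough sketch of what the Split operator of Figure 3.8 might do with its input (hypothetical tuple and file handling, invented for illustration; the real operator runs inside the Niagara engine): each tuple arriving from the join carries the destination intermediate-file name taken from the constant table, and the operator appends the tuple to that file after stripping the name so that upper-level queries see ordinary tuples.

    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /** Hypothetical split operator that routes tuples to per-group intermediate files. */
    class SplitOperator {
        private final Map<String, FileWriter> openFiles = new HashMap<>();

        /**
         * Each input row is the join result plus, in its last field, the destination
         * intermediate-file name obtained from the constant table.
         */
        void consume(List<String[]> joinedTuples) throws IOException {
            for (String[] tuple : joinedTuples) {
                String fileName = tuple[tuple.length - 1];
                FileWriter out = openFiles.computeIfAbsent(fileName, this::open);
                // Strip the destination field so downstream operators need no changes.
                for (int i = 0; i < tuple.length - 1; i++) {
                    out.write(tuple[i]);
                    out.write(i < tuple.length - 2 ? "\t" : "\n");
                }
            }
            for (FileWriter out : openFiles.values()) {
                out.flush();   // changes to intermediate files are then detected like any data source
            }
        }

        private FileWriter open(String fileName) {
            try {
                return new FileWriter(fileName, true);  // append: tuples are retained until fired
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }

Appending rather than streaming is what lets timer-based queries pick up accumulated changes whenever they next fire.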
A general range-query has both lower_bound and upper_bound values. Two columns are needed to represent both bounds in the constant table. Thus each entry of the constant table will be [lower_bound, upper_bound, intermediate_file_name]. The join condition is Change_Ratio < upper_bound and Change_Ratio > lower_bound. A special index would be helpful to evaluate this predicate. For example, an interval skip list [HJ94] could be used for this purpose when all the intervals fit in memory. We are considering developing a new index method that handles this case more efficiently.

    Where <Quotes><Quote>
        <Change_Ratio>$c</></> element_as $g </>
    in "quotes.xml", $c > 0.05
    construct $g

    Quotes.Quote.Change_Ratio > constant
    in "quotes.xml"

Figure 3.9 Range selection query example and its expression signature

One potential problem for range-query groups is that the intermediate files may contain a large number of duplicate tuples because range predicates of the different queries might overlap. "Virtual intermediate files" are used to handle this case. Each virtual intermediate file stores a value range instead of actual result tuples. All outputs from the split operator are stored in one real intermediate file, which has a clustered index on the range attribute. Modifications on virtual intermediate files can trigger upper-level queries in the same way as ordinary intermediate files. The value range of a virtual intermediate file is used to retrieve data from the real intermediate file. Our query-split scheme need not be changed to handle virtual intermediate files.

In general, a query may have multiple selection predicates, i.e., multiple expression signatures. Predicates on the same data source can be represented in conjunctive normal form. The group optimizer chooses the most selective conjunct, which does not contain "or", to do incremental grouping. Other predicates are evaluated in the upper levels of the continuous query after the split operator.

    Where <Quotes><Quote><Symbol>"INTC"</>
        <Current_Price>$p</></> element_as $g </>
    in "quotes.xml", $p < 100
    construct $g

Figure 3.10 An example query with two selection predicates

Figure 3.10 shows a query with two selection predicates, which retrieves Intel stock whenever its price falls below $100. This query has two expression signatures: one is an equal selection predicate on Symbol and the other is a range selection predicate on Current_Price. The expression signature on the equal selection predicate (i.e., on Symbol) is used for grouping because it is more selective. In addition, a new select operator with the second selection predicate (i.e., the range select on Current_Price) will be added above the file scan operator.

3.5 Incremental Grouping of Join Operators

Since join operators are usually expensive, sharing common join operations can significantly reduce the amount of computation. Figure 3.11 shows a query with a join operator that, for each company, retrieves the price of its stock and the company's profile. The signature for the join operation is shown on the right side of the figure. A join signature in our approach contains the names of the two data sources and the predicate for the join. The group optimizer groups join queries with the same join signatures. A constant table is not needed in this case because there is only one output intermediate file, whose name is stored in the split operator. This file is used to hold the results of the shared join operation.

    Where <Quotes><Quote><Symbol>$s</></>
        element_as $g </> in "quotes.xml",
        <Companies><Company><Symbol>$s</></>
        element_as $t</> in "companies.xml"
    construct $g, $t

    Signature: Symbol = Symbol over quotes.xml and companies.xml

Figure 3.11 An example query with a join operator and its signature

There are two ways to group queries that contain both join operators and selection operators. Figure 3.12 shows such an example, which retrieves all stocks in the computer service industry and the related company profiles. The group optimizer can place the selection either below or above the join, so that two different grouping sequences can be used during the incremental group optimization process. The group optimizer chooses the better one based on a cost model. We discuss these alternatives below using the query example in Figure 3.12.

    Where <Quotes><Quote><Symbol>$s</>
        <Industry>"Computer Service"</></>
        element_as $g </> in "quotes.xml",
        <Companies><Company><Symbol>$s</></>
        element_as $t</> in "companies.xml"
    construct $g, $t

Figure 3.12 An example query with both join and selection operators

If the selection operator (e.g., on Industry) is pulled above the join operator, the group optimizer first groups the query by the join signature. The selection signature, which contains the intermediate file, is grouped next. The advantage of this method is that it allows the same join operator to be shared by queries with different selection operators. The disadvantage is that the join, which will be performed before the selection, may be very expensive and may generate a large intermediate file. If there are only a small number of queries in the join group and each of them has a highly selective selection predicate, then this grouping method may be even more expensive than evaluating the queries individually.

Alternatively, the group optimizer can push down the selection operator (e.g., on Industry) to avoid computing an expensive join. First, the signature for the selection operator is matched with an existing group. Then a file scan operator on the
intermediate file produced by the selection group is added and the join operator is rewritten to use the intermediate file as one of its inputs. Finally, the group optimizer incrementally groups the join operation using its signature. Compared to the first approach, this approach may create many join groups with significant overlap between them. Note, however, that this same overlap exists in the non-grouping approach. Thus, in general, this method always outperforms the non-grouping approach. The group optimizer will select one of these two strategies based on a cost model. To date we have implemented the second approach in NiagaraCQ. In the future we plan to implement the first strategy and compare the performance of the two approaches.

3.6 Grouping Timer-based Continuous Queries
Since timer-based queries are only periodically executed, re-use can significantly reduce computation time and make the system more scalable. Timer-based queries are grouped in the same way as change-based queries except that the time information needs to be recorded at installation time. Grouping a large number of timer-based queries poses two significant challenges. First, it is hard to monitor the timer events of those queries. Second, sharing the common computation becomes difficult due to the various time intervals. For example, two users may both request the query in Figure 3.1 with different time intervals, e.g. weekly and monthly. The query with the monthly interval should not repeat the weekly query’s work. In general, queries with various time intervals should be able to share the results that have already been produced.

3.6.1 Event Detection
Two types of events in NiagaraCQ can trigger continuous queries. They are data-source change events and timer events. Data sources can be classified into push-based and pull-based. Push-based data sources will inform NiagaraCQ whenever interesting data is changed. On the other hand, changes on pull-based data sources must be checked periodically by NiagaraCQ.

Timer-based continuous queries are fired only at specified times. However, queries will not be executed if the corresponding input files have not been modified. Timer events are stored in an event list, which is sorted in time order. Each entry in the list corresponds to a time instant at which there exists a continuous query to be scheduled. Each query in NiagaraCQ has a unique id. Those query ids are also stored in the entry. Whenever a timer event occurs, all related files will be checked. Each query in the entry will be fired if its data source has been modified since its last firing time. The next firing times for all queries in the entry are calculated and the queries are added into the corresponding entries on the list.

3.6.2 Incremental Evaluation
Incremental evaluation allows queries to be invoked only on the changed data. It reduces the amount of computation significantly because typically the amount of changed data is smaller than the original data file. For each file on which continuous queries are defined, NiagaraCQ keeps a “delta file” that contains recent changes. Queries are run over the delta files whenever possible instead of their original files. However, in some cases the complete data files must be used, e.g., incremental evaluation of join operators. NiagaraCQ uses different techniques for handling delta files of ordinary data sources and those of intermediate files used to store the output of the split operator. NiagaraCQ calculates the changes to a source XML file and merges the changes into its delta file. For intermediate files, outputs from the split operators are directly appended to the delta file.

In order to support timer-based queries, a time stamp is added to each tuple in the delta file. Since timer-based queries with different firing times can be defined on one file, the delta file must keep data for the longest time interval among those queries that use the file as an input. At query execution time, NiagaraCQ fetches only tuples that were added to the delta file since the query's last firing time.

Whenever a grouped plan is invoked, the results of its execution are stored in an intermediate file regardless of whether or not the queries defined on these intermediate files should be fired immediately. Subsequent invocations of this group query do not need to repeat previous computation. Upper-level queries defined on intermediate files will still be fired at their scheduled execution time. Thus, the shared computation is totally transparent to these subsequent operators.

3.7 Memory Caching
Due to the desired scale of the system, we do not assume that all the information required by the continuous queries and intermediate results will fit in memory. Caching is used to obtain good performance with a limited amount of memory. NiagaraCQ caches query plans, system data structures, and data files for better performance.

1. Grouped query plans tend to be memory resident since we assume that the number of query groups is relatively small. Non-grouped change-based queries may be cached using an LRU policy that favors frequently fired queries. Timer-based queries with shorter firing intervals will have priority over those with longer intervals.

2. NiagaraCQ caches recently accessed files. Small delta files generated by split operators tend to be consumed and discarded. A caching policy that favors these small files saves lots of disk I/Os.

3. The event list for monitoring the timer-based events can be large if there are millions of timer-based continuous queries. To avoid maintaining the whole list in memory, we keep only a “time window” of this list. The window contains the front part of the list that should be kept in memory, e.g. within 24 hours.

4. IMPLEMENTATION
NiagaraCQ is being developed as a sub-system of the Niagara project. The initial version of the system was implemented in Java (JDK1.2). A validating XML parser (IBM XML4J) from IBM is used to parse XML documents. We describe the system architecture of NiagaraCQ in Section 4.1 and how continuous queries are processed in Section 4.2.

4.1 System Architecture
Figure 4.1 shows the architecture of the Niagara system. NiagaraCQ is a sub-system of Niagara that handles continuous queries. NiagaraCQ consists of

1. A continuous query manager, which is the core module of the NiagaraCQ system. It provides a continuous query interface to
users and invokes the Niagara query engine to execute fired queries.

2. A group optimizer that performs incremental group optimization.

3. An event detector that detects timer events and changes of data sources.

In addition, the Niagara data manager was enhanced to support the incremental evaluation of continuous queries.

[Figure 4.1 NiagaraCQ system architecture. The figure shows the Niagara GUI, Query Parser, Niagara Query Engine, Continuous Query Processor, CQ Manager, Query Optimizer, Group Optimizer, Execution Engine, Event Detector, Data Manager, Niagara Search Engine, and data sources on the Internet.]

4.2 Processing Continuous Queries
Figure 4.2 shows the interactions among the Continuous Query Manager, the Event Detector and the Data Manager as continuous queries are installed, detected, and executed. Continuous query processing is discussed in the following sections.

4.2.1 Continuous Query Installation
When a new continuous query enters the system, the query is parsed and the query plan is fed into the group optimizer for incremental grouping. The group optimizer may split this query into several queries using the query-split scheme described in Section 3. The continuous query manager then invokes the Niagara query optimizer to perform common query optimization for these queries and the optimized plans are stored for future execution. Timer information and data source names of these queries are given to the Event Detector (Step 1 in Figure 4.2). The Event Detector then asks the Data Manager to monitor the related source files and intermediate files (Step 2 in Figure 4.2), which in turn caches a local copy of each source file. This step is necessary in order to detect subsequent changes to the file.

The Event Detector monitors two types of events: timer events and file-modification events. Whenever such events occur, the Event Detector notifies the Continuous Query Manager about which queries need to be fired and on which data sources.

The Data Manager in Niagara monitors web XML sources and intermediate files on its local disk. It handles the disk I/O for both ordinary queries and continuous queries and supports both push-based and pull-based data sources. For push-based data sources, the Data Manager is informed of a file change and notifies the Event Detector actively. Otherwise, the Event Detector periodically asks the Data Manager to check the last modified time.

[Figure 4.2 diagram: interactions among the Continuous Query Manager (CQM), Event Detector (ED), Query Engine (QE), and Data Manager (DM), with the numbered steps below.]
1. CQM adds continuous queries with file and timer information to enable ED to monitor the events.
2. ED asks DM to monitor changes to files.
3. When a timer event happens, ED asks DM the last modified time of files.
4. DM informs ED of changes to push-based data sources.
5. If file changes and timer events are satisfied, ED provides CQM with a list of firing CQs.
6. CQM invokes QE to execute firing CQs.
7. File scan operator calls DM to retrieve selected documents.
8. DM only returns data changes between last fire time and current fire time.
Figure 4.2 Continuous Query processing in NiagaraCQ

4.2.2 Continuous Query Deletion
A system unique name is generated for every user-defined continuous query. A user can use this name to retrieve the query status or to delete the query. Queries are automatically removed from the system when they expire.
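To make this timer-driven control flow (Sections 3.6.1 and 4.2) concrete, here is a minimal sketch in Java, NiagaraCQ's implementation language. The sketch is hypothetical: class names such as EventDetector and ContinuousQuery are invented for illustration and do not correspond to the actual NiagaraCQ code. The detector keeps an event list sorted by firing time and, when a timer event occurs, fires only those queries whose data source has changed since their last firing, then reschedules them:

import java.util.*;

// Hypothetical sketch of NiagaraCQ-style timer-event detection; not actual NiagaraCQ code.
class ContinuousQuery {
  final String id;            // unique query id stored in event-list entries
  final String dataSource;    // file the query is defined on
  final long intervalMillis;  // firing interval of the timer-based query
  long lastFired;
  ContinuousQuery(String id, String dataSource, long intervalMillis) {
    this.id = id; this.dataSource = dataSource; this.intervalMillis = intervalMillis;
  }
}

class EventDetector {
  // Event list sorted by time; each entry maps a time instant to the queries scheduled then.
  private final TreeMap<Long, List<ContinuousQuery>> eventList = new TreeMap<>();
  // Last-modified times reported by the Data Manager (pushed or polled).
  private final Map<String, Long> lastModified = new HashMap<>();

  void schedule(ContinuousQuery q, long fireAt) {
    eventList.computeIfAbsent(fireAt, t -> new ArrayList<>()).add(q);
  }

  void fileChanged(String source, long when) {   // push-based sources notify directly
    lastModified.put(source, when);
  }

  // Called on a timer event: collect queries to fire and hand them to the CQ Manager.
  List<ContinuousQuery> onTimer(long now) {
    List<ContinuousQuery> firing = new ArrayList<>();
    while (!eventList.isEmpty() && eventList.firstKey() <= now) {
      for (ContinuousQuery q : eventList.pollFirstEntry().getValue()) {
        long modified = lastModified.getOrDefault(q.dataSource, 0L);
        if (modified > q.lastFired) {            // skip queries whose input did not change
          firing.add(q);
          q.lastFired = now;
        }
        schedule(q, now + q.intervalMillis);     // compute next firing time and reinsert
      }
    }
    return firing;
  }
}

The "time window" policy of Section 3.7 would keep only the front portion of this map in memory and spill later entries to disk.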
4.2.3 Execution of Continuous Queries
The invocation of a continuous query requires a series of interactions among the Continuous Query Manager, Event Detector and Data Manager.

When a timer event happens, the Event Detector first asks the Data Manager if any of the relevant data sources have been modified (Step 3 in Figure 4.2). The Data Manager returns a list of names of modified source files. The Data Manager also notifies the Event Detector when push-based data sources have been changed (Step 4 in Figure 4.2). If a continuous query needs to be executed, its query id and the names of the modified files are sent to the Continuous Query Manager (Step 5 in Figure 4.2). The Continuous Query Manager invokes the Niagara query engine to execute the triggered queries (Step 6 in Figure 4.2). At execution time, the Query Engine requests data from the Data Manager (Step 7 in Figure 4.2). The Data Manager recognizes that it is a request for a continuous query and returns only the delta file (Step 8 in Figure 4.2). Delta files for source files are computed by performing an XML-specific “diff” operation using the original file and the new version of the file.

5. EXPERIMENTAL RESULTS
We expect that for a continuous query system over the Internet, incremental group optimization will provide substantial improvement to system performance and scalability. In the following experiments, we compare our incremental grouping approach with a non-grouping approach to show benefits from sharing computation and avoiding unnecessary query invocations.

5.1 Experiment Setting
The following experiments were conducted on a Sun Ultra 6000 with 1GB of RAM, running JDK1.2 on Solaris 2.6.

<!ELEMENT Quotes ( Quote )*>
<!ELEMENT Quote ( Symbol, Sector, Industry, Current_Price, Open, PrevCls, Volume, Day’s_range, 52_week_range?, Change_Ratio)>
<!ELEMENT Day’s_range (low, high)>
<!ELEMENT 52_week_change (low, high)>
Figure 5.1 DTD of quotes.xml

<!ELEMENT Companies ( Company )*>
<!ELEMENT Company ( Symbol, Name, Sector, Industry, Company_profiles?)>
<!ELEMENT Company_profiles (Capital, Employees, Address, Description)>
<!ELEMENT Address (City, State)>
Figure 5.2 DTD of companies.xml

Data Sets
Our experiments were run against a database of stock information consisting of two XML files, “quotes.xml” and “companies.xml”. “Quotes.xml” contains stock information on about 5000 NASDAQ companies. The size of “quotes.xml” is about 2 MB. Related company information is stored in “companies.xml”, whose size is about 1MB. The DTDs of these two XML files are given in Figures 5.1 and 5.2, respectively.

Data changes on “quotes.xml” are generated artificially to simulate the real stock market and continuous queries are triggered by these changes. The “companies.xml” file was not changed during our experiments.

We give a brief description of the assumptions that we made to generate “quotes.xml”. Each stock has a unique Symbol value. The Industry attribute takes a value randomly from a set with about 100 values. The Change_Ratio represents the change percentage of the current price relative to the closing price of the previous session. It follows a normal distribution with a mean value of 0 and standard deviation of 1.0.

Since the time spent calculating changes in source files is the same for both the grouped and non-grouped approaches, we run our experiments directly against the data changes. Unless specified, the number of “tuples” modified is 1000, which is about 400K bytes.

Queries
Although users may submit many different queries, we hypothesize that many queries will contain similar expression signatures. In our experiments, we use four types of queries to represent the effect of grouping queries in a stock environment by their expression signatures.

Where <Quotes><Quote><Symbol>”INTC”</></>
element_as $g </> in “quotes.xml”, construct $g
Query Type-1 Example: Notify me when Intel stocks change.

Where <Quotes><Quote><Change_Ratio>$c</></>
element_as $g </> in “quotes.xml”, $c > 0.05
construct $g
Query Type-2 Example: Notify me of all stocks whose prices rise more than 5 percent.

Where <Quotes><Quote><Symbol>”INTC”</>
<Current_Price>$p</></> element_as $g </>
in “quotes.xml”, $p < 100, construct $g
Query Type-3 Example: Notify me when Intel stock trades below 100 dollars.

Where <Quotes><Quote><Symbol>$s</><Industry>
”Computer Service”</></> element_as $g </>
in “quotes.xml”,
<Companies><Company><Symbol>$s</></>
element_as $t</> in “companies.xml”
construct $g, $t
Query Type-4 Example: Notify me of all changes to stocks in the computer service industry and related company information.

• Type-1 queries have the same expression signature on the equal selection predicate on Symbol.

• Type-2 queries have the same expression signature on the range selection predicate on Change_ratio.
• Type-3 queries have two common expression signatures: one is on the equal selection predicate on Symbol, and the other is on the range selection predicate on Current_price. The expression signature of the equal selection predicate is used for grouping Type-3 queries because it is more selective than that of the range predicate.

• Type-4 queries contain expression signatures for both selection and join operators. Selection operators are pushed down under join operators. The incremental group optimizer first groups selection signatures and then join signatures.

Queries of Type-3 are generated following a normal distribution with a mean value of 3 and a standard deviation of 1.0. Queries of the other types are generated using different constants following a uniform distribution on the range of values in the data unless specified.

5.2 Interpretation of Experimental Results
The parameters in our experiments are:

1. N, the number of installed queries, is an important measure of system scalability.

2. F, the number of fired queries in the grouping case. The number of fired queries may vary depending on triggering conditions in the grouping case. For example, in a Type-1 query, if Intel stock does not change, queries defined on “INTC” are not scheduled for execution after the common computation of the group. This parameter does not affect non-grouping queries.

3. C, the number of tuples modified.

In our grouping approach, a user-defined query consists of a grouped part and a non-grouped part. Tg and Tng represent the execution time of each part. The execution time T for evaluating N queries is the sum of Tg and the Tng of each of the F fired queries, T = Tg + ∑_F Tng, because the grouped portion is executed only once.

Since the non-grouping strategy needs to scan each XML data source file multiple times, we cache parsed XML files in memory so that both approaches scan and parse XML files only once. This ensures that the comparison between the two approaches is fair. However, in a production system, parsed XML files probably could not be retained in memory for long periods of time. Thus, many non-grouped queries may each have to scan and parse the same XML files multiple times.

5.2.1 Experimental results on single type queries
We studied how effectively incremental group optimization works for each type of query. We measured and compared execution time for queries of each type for both the grouping and non-grouping approaches.

Experiment results on Type-1 queries
Experiment 1. (Figure 5.3) C = 1000 tuples.

• Case 1: F = N, i.e. all queries are fired in both approaches.
The execution time of the non-grouping approach grows dramatically as N increases. It cannot be applied to a highly loaded system. On the other hand, the grouping approach consumes significantly less execution time by sharing the computation of the selection operator. It also grows more slowly because in a single Type-1 query Tng is much smaller than Tg.

• Case 2: F = 100, i.e., 100 queries are invoked in the grouping approach.
In the grouping approach, the execution time of Case 2 is almost constant when F is fixed. The execution time of the grouping approach depends on the number of fired queries F, not on the total number of installed queries N. The reason is that, although Tg increases as N grows, this shared computation is executed only once and is a very small portion of total execution time. The execution time for the upper queries, which is proportional to the number of fired queries F, dominates the total execution time. On the other hand, the execution time for the non-grouping approach is proportional to N because all queries are scheduled for execution.

Experiment 2. (Figure 5.4) F = N = 2000 queries
In this experiment we explore the impact of C, the number of modified tuples, on the performance of the two approaches. C is varied from 100 tuples (about 40K bytes) to 2000 tuples (about 800K bytes). Increasing C will increase the query execution time. For the non-grouping approach, the total execution time is proportional to C because the selection operator of every installed query needs to be executed. For the grouping approach, the execution time is not sensitive to the change of C because the increase of Tg accounts for only a small percentage of the total execution time and the sum of the Tng of all fired queries does not change because of the predicate’s selectivity.

Experiment results for Type-2, 3, 4 queries (Figures 5.5, 5.6, 5.7) C = 1000 tuples, F = N
We discuss the influence of different expression signatures in this set of experiments.

Figure 5.5 and Figure 5.6 show that our group optimization works well for various selection predicates. Type-2 queries are grouped according to their range selection signature. Type-3 queries have two signatures. The group optimizer chooses an equal predicate to group queries since it is more selective.

Figure 5.7 shows the results for Type-4 queries. Type-4 queries have one selection signature and one join signature. The selection operator is pushed below the join operator. Queries are first grouped by their selection signature. There are 100 different industries in our test data set. The output of the selection group is written to 100 intermediate files and one hundred join groups are created. Each join group consumes one of the intermediate files as its input. The difference between the execution time with and without grouping is much larger than in the previous experiments because a join operator is more expensive than a selection operator.

5.2.2 Experiment results on mixed queries of Type-1 and Type-3 (Figure 5.8) C = 1000 tuples, F = N (N/2 Type-1 queries and N/2 Type-3 queries)
Previous experiments studied each type of query separately for the purpose of showing the effectiveness of different kinds of expression signatures. Our incremental group optimizer is not limited to grouping only one type of queries. Different types of queries can also be grouped together if they have common
signatures. In this experiment, Type-1 queries and Type-3 queries are grouped together because they have the same selection signature. Figure 5.8 shows the performance difference between the grouped and non-grouped cases.

[Figures 5.3–5.8: measured execution time in seconds for the grouped and non-grouped approaches. Figure 5.3 plots execution time against the number of queries for Case 1 and Case 2; Figure 5.4 plots execution time against data size (number of tuples); Figures 5.5–5.8 plot execution time against the number of queries.]

5.3 System Status and Future Work
A prototype version of NiagaraCQ has been developed, which includes a Group Optimizer, Continuous Query Manager, Event Detector, and Data Manager. As the core of our incremental group optimization, the Group Optimizer currently can incrementally group selection and join operators. Our incremental group optimizer is still at a preliminary stage. However, incremental group optimization has been shown to be a promising way to achieve good performance and scalability. We intend to extend incremental group optimization to queries containing operators other than selection and join. For example, sharing computation for expensive operators, such as aggregation, may be very effective. “Dynamic regrouping” is another interesting future direction that we intend to explore.

6. RELATED WORK AND DISCUSSION
Terry et al. first proposed the notion of "continuous queries" [TGNO92] as queries that are issued once and run continuously. They used an incremental evaluation approach to avoid repetitive computation and return only new results to users. Their approach was restricted to append-only systems, which is not suitable for our target environment. NiagaraCQ uses an incremental query evaluation method but is not limited to append-only data sources. We also include action and timer events in Niagara continuous queries.

Continuous queries are similar to triggers in traditional database systems. Triggers have been widely studied and implemented [WF89][MD89][SJGP90][SPAM91][SK95]. Most trigger systems use an Event-Condition-Action (ECA) model [MD89]. General issues of implementing triggers can be found in [WF89].

NiagaraCQ is different from traditional trigger systems in the following ways.

1. The main purpose of NiagaraCQ is to support continuous query processing rather than to maintain data integrity.

2. NiagaraCQ is intended to support millions of continuous queries defined on a large number of data sources. In a traditional DBMS, a very limited number of triggers can be installed on each table and a trigger can usually only be defined on a single table.

3. NiagaraCQ needs to monitor autonomous and heterogeneous data sources over the Internet. Traditional trigger systems only handle local tables.

4. Timer-based events are supported in NiagaraCQ.
Open-CQ [LPT99][LPBZ96] also supports continuous queries on web data sources and has functionality similar to NiagaraCQ. NiagaraCQ differs from Open-CQ in that we explore the similarity among a large number of queries and use group optimization to achieve system scalability.

The TriggerMan [HCH+99] project proposes a method for implementing a scalable trigger system based on the assumption that many triggers may have common structure. It uses a special selection predicate index and an in-memory trigger cache to achieve scalability. We share the same assumption in our work and borrow the concept of an expression signature from their work. We mainly focus on the incremental grouping of a subset of the most frequently used expression signatures, which are in the format “Attribute op Constant”, where op is one of “<”, “=” and “>”. The major differences between NiagaraCQ and TriggerMan are:

1. NiagaraCQ uses an incremental group optimization strategy.

2. NiagaraCQ uses a query-split scheme to allow the shared computation to become an individual query that can be monitored and executed using a slightly modified query engine. TriggerMan uses a special in-memory predicate index to evaluate the expression signature.

3. NiagaraCQ supports grouping of timer-based queries, a capability not considered in [HCH+99].

Sellis's work [Sel86] focused on finding an optimal plan for a small group of queries (usually fewer than ten) by recognizing a containment relationship among the selection predicates of queries with both selection and join operators. This approach to group optimization was very expensive and not extendable to a large number of queries.

Recent work [ZDNS98] on group optimization mainly focuses on applying group optimization to solve a specific problem. Our approach also falls into this category. Alert [SPAM91] was among the earliest active database systems. It tried to reuse most parts of a passive DBMS to implement an active database.

7. CONCLUSION
Our goal is to develop an Internet-scale continuous query system using group optimization based on the assumption that many continuous queries on the Internet will have some similarities. Previous group optimization approaches consider grouping only a small number of queries at the same time and are not scalable to millions of queries. We propose a new “incremental grouping” methodology that makes group optimization more scalable than the previous approaches. This idea can be applied to very general group optimization methods. We also propose a grouping method using a query-split scheme that requires minimal changes to a general-purpose query engine. In our system, both timer-based and change-based continuous queries can be grouped together for event detection and group execution, a capability not found in other systems. Other techniques to make our system scalable include incremental evaluation of continuous queries, use of both pull and push models for detecting heterogeneous data source changes, and a caching mechanism. Preliminary experiments demonstrate that our incremental group optimization significantly improves execution time compared to the non-grouping approach. The results of the experiments also show that the system can be scaled to support a very large number of queries.

8. ACKNOWLEDGEMENT
We thank Zhichen Xu for his discussion with the first author during the initial writing of the paper. We are particularly grateful to Ashraf Aboulnaga, Navin Kabra and David Maier for their careful review and helpful comments on the paper. We also thank the anonymous referees for their comments. Funding for this work was provided by DARPA through NAVY/SPAWAR Contract No. N66001-99-1-8908 and NSF award CDA-9623632.

9. REFERENCES
[CM86] U. S. Chakravarthy and J. Minker. Multiple Query Processing in Deductive Databases using Query Graphs. VLDB Conference 1986: 384-391.
[DFF+98] A. Deutsch, M. Fernandez, D. Florescu, A. Levy, D. Suciu. XML-QL: A Query Language for XML. http://www.w3.org/TR/NOTE-xml-ql.
[HCH+99] E. N. Hanson, C. Carnes, L. Huang, M. Konyala, L. Noronha, S. Parthasarathy, J. B. Park and A. Vernon. Scalable Trigger Processing. In Proceedings of the 15th ICDE, pages 266-275, Sydney, Australia, 1999.
[HJ94] E. N. Hanson and T. Johnson. Selection Predicate Indexing for Active Databases Using Interval Skip List. TR94-017, CIS Department, University of Florida, 1994.
[LPBZ96] L. Liu, C. Pu, R. Barga, T. Zhou. Differential Evaluation of Continual Queries. ICDCS 1996: 458-465.
[LPT99] L. Liu, C. Pu, W. Tang. Continual Queries for Internet Scale Event-Driven Information Delivery. TKDE 11(4): 610-628 (1999).
[MD89] D. McCarthy and U. Dayal. The architecture of an active database management system. SIGMOD 1989: 215-224.
[RC88] A. Rosenthal and U. S. Chakravarthy. Anatomy of a Modular Multiple Query Optimizer. VLDB 1988: 230-239.
[Sel86] T. Sellis. Multiple query optimization. ACM Transactions on Database Systems, 10(3), 1986.
[SJGP90] M. Stonebraker, A. Jhingran, J. Goh and S. Potamianos. On Rules, Procedures, Caching and Views in Data Base Systems. SIGMOD Conference 1990: 281-290.
[SK95] E. Simon, A. Kotz-Dittrich. Promises and Realities of Active Database Systems. VLDB 1995: 642-653.
[SPAM91] U. Schreier, H. Pirahesh, R. Agrawal, and C. Mohan. Alert: An architecture for transforming a passive DBMS into an active DBMS. VLDB 1991: 469-478.
[TGNO92] D. Terry, D. Goldberg, D. Nichols, and B. Oki. Continuous Queries over Append-Only Databases. SIGMOD 1992: 321-330.
[WF89] J. Widom and S. J. Finkelstein. Set-Oriented Production Rules in Relational Database Systems. SIGMOD Conference 1990: 259-270.
[ZDNS98] Y. Zhao, P. Deshpande, J. F. Naughton, A. Shukla. Simultaneous Optimization and Evaluation of Multiple Dimensional Queries. SIGMOD 1998: 271-282.
Chapter 10
Stream-Based Data Management

Introduction

There are two driving forces behind the research presented in this section. The first concerns a
collection of applications that are poorly served by conventional DBMS technology, while the
second deals with the emergence of microsensors as an economically viable technology.

In financial services, there are many applications that deal with “data feeds”. These are typically
streams of stock market “tick” data, foreign exchange transactions, etc. Commercial feeds are
available from a variety of vendors, and all the major brokerage houses perform processing on
these feeds. Example applications include deciding where to send a trade request, determining
whether a given feed is damaged or late, deciding whether a given fund is in compliance with
SEC or brokerage house rules, and performing automated trading strategies. Essentially all of
these applications are currently written with “roll your own” technology. The brokerage houses
have looked for commercial solutions and come up empty-handed. There are two market
requirements that are not currently being met; namely scalable time series operations and real
time response. Real time means that the stream must be processed before it is stored in a DBMS.
Direct stream processing should be contrasted with store-and-query processing, where the data is
stored, indexed and then queried. Store-and-query processing has little chance of meeting the real
time requirements of Wall Street. Also, most of the Wall Street applications entail time-series
operations, which have historically been difficult for traditional DBMSs to deal with.

A similar state of affairs exists in industrial process control (IPC) applications. Continuous feed
factories, such as oil refineries, glass factories, chemical plants and food processing operations
generate substantial streams of real time data from individual processing steps. When the bottles
start breaking in the glass factory, automated software is desired that will quickly identify the
problem and adjust the machines in the factory to correct it. Again, we see time series operations
correlating multiple streams of data that must be performed in real time.

A third area with similar issues is network and system monitoring. Virtually all enterprises want
to perform real time network monitoring. Intrusion detection is the foremost application,
including denial of service attacks. Additionally, worm and virus detection costs enterprises
millions of dollars annually. The current solution is to run detection programs that look for
offensive code in incoming packets. However, this is only successful after the new threat has
been identified, a signature bit string in the threat discovered, and then loaded into anti-virus
software. By then, it is way too late. System administrators want to find threats in real time,
before they have an opportunity to wreak havoc. One possible scenario is to quarantine incoming
messages for a short while and look for patterns (for example, identical payloads from the same
set of sources addressed to many different people in the enterprise). Clearly, this is high volume
stream processing. Lastly, real time spam detection may be amenable to the same sort of
architecture.

In addition to network monitoring, similar applications can be built to monitor the health of large
computer systems. The individual hardware components (CPU, SAN, disk system, etc.) all
generate messages concerning their health and status; similar messages are generated by server
software (webservers, appservers, mailservers, DBMSs, etc.) Putting these disparate events
together into a coherent whole for delivery to a system administrator is a stream-based application
with high data rates.

It is worth mentioning a cautionary tale from “click stream” analysis. In the late 1990’s there
were startups that focused on specialized systems to watch the clicks from a typical shopping site,
such as Amazon, with the idea of detecting interesting patterns in the sequence of clicks.
However, the commercial enterprises that tried this application reported that it was not
worthwhile, and click stream analysis as a standalone application fell out of favor. This is an
example where a single-app engine was not worth building. However, there may well be more
traction for more general-purpose stream analysis systems that monitor all kinds of system logs,
spanning a variety of hardware and software.

One of the main challenges in any of the above application settings is keeping up with the rate of
data production. Financial services feeds are typically running several thousand messages a
second, while network monitoring applications are a couple of orders of magnitude higher.
Database researchers in search of a high-volume, homogeneous-schema, structured data source
need look no further than the output of a packet sniffer on their local network. Running “tail -f”
on system logs adds a number of additional sources, leading to a fairly rich, high-volume
streaming database. An example of an interesting database-style query processor custom-
designed for high-volume network monitoring is Gigascope [CJSS03].
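For instance, a handful of lines can turn a growing log into a crude tuple stream. The sketch below (in Java, used for all examples in this chapter) is illustrative only; the file path and the per-line schema are invented, and a real monitor would parse each line into typed fields rather than printing it.

import java.io.IOException;
import java.io.RandomAccessFile;

// Illustrative "tail -f" in Java: follow a growing log file and emit each new line as a tuple.
class LogTail {
  public static void main(String[] args) throws IOException, InterruptedException {
    try (RandomAccessFile log = new RandomAccessFile("/var/log/syslog", "r")) {  // hypothetical path
      long offset = log.length();                  // start at the end, like tail -f
      while (true) {
        if (log.length() > offset) {
          log.seek(offset);
          String line;
          while ((line = log.readLine()) != null) {
            System.out.println("tuple: " + line);  // parse into (time, host, message) fields here
          }
          offset = log.getFilePointer();
        }
        Thread.sleep(1000);                        // poll once per second
      }
    }
  }
}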

The second driving force behind streaming research is the emergence of low-cost microsensor
technology. This comes in a variety of forms.

At the low end are RFID tags. These are small, coin-sized devices that cost pennies per unit, and
are capable of transmitting a value (e.g. an ID) and perhaps computing a handful of instructions
(e.g. decrement a counter) when brought in proximity to an RFID reader. RFID readers are much
more expensive (currently from hundreds to thousands of dollars, though prices are falling) and
they require non-trivial power to run – hence they typically are connected to a fixed power source
and immobile. Over time, RFID will replace bar codes on most individual items in retail
applications. Currently, most retail stores use bar codes to track items as they are “swiped” at the
cash register; RFID can in principle allow retailers to know when an item leaves the shelf,
allowing perfect supply chain optimization. Similar benefits are available in warehouses, where
tracking pallets or cases by RFID can minimize misplaced merchandise. We expect that in the
next decade, RFID and similar tagging technologies will be cheap enough that every material object
could be tagged and tracked.

Wireless sensor networks (sensornets) are a more intriguing technology, some years further away
from widespread deployment than RFID. Sensornets are made up of devices that combine
inexpensive sensors (e.g. temperature, pressure, acceleration, humidity, magnetic field, etc.) with
a low-function microprocessor (think of a 1980’s PC, or a PDP-11), a radio for communication,
and a battery for power. A set of such devices can autoconfigure themselves into a
communication network, and do modest computation while routing sensor readings toward some
base station. Sensornets can actively monitor their environment, and band together to do
distributed sensing and computation tasks.

Current generations of these devices are the size of a coin, but there are working, programmable
prototypes the size of a grain of salt. The vision is for these to become small and cheap enough in
the next decade to realize the science fiction idea of “Smart Dust” – disposable clouds of sensing
and computing infrastructure that could be easily deployed without careful installation or
configuration. A major challenge in these environments is to minimize battery drain by keeping
data acquisition and communication to a minimum. The design of “in-network” distributed data
acquisition and query systems for sensornets is in its infancy; the earliest proposal appeared in
2000 in Cornell’s Cougar project [BS00]. However, database-style querying has become a hot
topic in the sensornet research community, and the TinyDB system [MFHH03] is in steady use
and being supported by a leading commercial sensornet vendor.

There are (at least!) two major social issues that stand in the way of widespread deployment of
sensing technologies. The first is privacy. People are understandably wary of the possibility of
being monitored by third parties who do not have their interests at heart. A recent high-profile
story concerned Benetton, which backed down on plans to deploy RFID in each garment, when
public outcry developed about privacy considerations after the sale. At the very least, RFID
technology must be able to be permanently disabled in order to overcome end-customer concerns
about personal tracking. A second issue concerns the environmental impact of billions of
“disposable” computing devices. Some researchers are investigating biodegradable materials for
microsensing, which would certainly be a big change for the computer hardware industry.

Impact on DBMS Technology


The various instantiations of sensor devices will cause a whole new collection of applications to
emerge, such as the ones mentioned above. We will call these monitoring applications, and the
big question is “What impact will these applications have on data management systems?” Some
say current DBMSs can adequately deal with monitoring apps, while others claim new
technology is required.

There are several reasons why new DBMS technology might be required. First, monitoring
applications often require time-series operations. In a military monitoring application, one wants
to know the route of a particular vehicle over the last hour. In stock monitoring apps, one wants
to compute moving averages of particular securities. Although time series data was included in
the Informix Universal Server as a “blade”, efficient utilization of the blade required considerable
server changes. Also, the Informix blade was only able to handle “regular” time series data,
where events happened at specified intervals. Irregular time series, such as those resulting from
very thinly traded securities, required additional technology. It is fair to say that current servers
are not very good even at traditional time series applications like those from Wall Street.
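As a small, purely illustrative example of the kind of time-series operation these applications need (not drawn from any of the systems discussed here), the sketch below maintains a moving average over a time-based sliding window. Because expiry is driven by timestamps rather than by a fixed count of ticks, it copes with irregular series such as thinly traded securities.

import java.util.ArrayDeque;
import java.util.Deque;

// Time-based moving average over a (possibly irregular) tick stream; window length in milliseconds.
class MovingAverage {
  private final long windowMillis;
  private final Deque<long[]> window = new ArrayDeque<>();  // each element: {timestamp, price in cents}
  private long sumCents = 0;

  MovingAverage(long windowMillis) { this.windowMillis = windowMillis; }

  // Feed one tick; returns the average price (in dollars) over the trailing window.
  double onTick(long timestamp, long priceCents) {
    window.addLast(new long[] { timestamp, priceCents });
    sumCents += priceCents;
    // Expire ticks that have fallen out of the window; an irregular series simply expires more or fewer.
    while (!window.isEmpty() && window.peekFirst()[0] <= timestamp - windowMillis) {
      sumCents -= window.pollFirst()[1];
    }
    return sumCents / (double) window.size() / 100.0;
  }
}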

Second, if a monitoring application becomes overloaded, it is sometimes acceptable to drop
observations on the floor or coalesce multiple observations into one synopsis, on the presumption
that more pressing fresh information will arrive shortly. Hence, in a condition of overload it can
be useful to trade off precision for unimportant objects to get good response time on important
matters. Priority scheduling and non-ACID behavior have long been tactics in networks and real-
time systems, but this sort of technology has not found its way into commercial DBMSs.
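A minimal sketch of that trade-off, assuming an invented schema and threshold and not modeled on any particular engine: when the input queue grows too deep, low-priority observations are coalesced into a per-key count/sum synopsis instead of being queued individually, so important fresh items still see low latency.

import java.util.*;

// Illustrative load shedding: under overload, coalesce low-priority observations into a synopsis.
class SheddingBuffer {
  static final int OVERLOAD_THRESHOLD = 10_000;                    // invented for the example
  private final Deque<Observation> queue = new ArrayDeque<>();
  private final Map<String, double[]> synopses = new HashMap<>();  // key -> {count, sum}

  record Observation(String key, double value, boolean highPriority) {}

  void offer(Observation o) {
    if (queue.size() > OVERLOAD_THRESHOLD && !o.highPriority()) {
      double[] s = synopses.computeIfAbsent(o.key(), k -> new double[2]);
      s[0] += 1;                 // keep only a count/sum summary for the key...
      s[1] += o.value();         // ...and drop the detailed observation
    } else {
      queue.addLast(o);          // important or uncongested: keep the full tuple
    }
  }
}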

Third, monitoring applications have a big component that is “event-driven”. By this we mean
that a large volume of incoming messages must be processed to see which of a (perhaps large) set
of monitoring conditions becomes true. In effect the monitoring conditions are predicates and act
like queries. Hence, the “queries” are stored and the data is “pushed” through them. This is the
reverse of commercial DBMSs where the data is stored and the queries act against the stored data.

Put differently, monitoring applications are mostly events with a little querying on the side. In
contrast, OLTP applications are mostly querying. Current DBMSs have been optimized for
querying, and triggers (which act like monitoring conditions) have been added as an afterthought.
It is possible that there is a better architecture than that of commercial DBMSs for a mix of events
and queries.
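The inversion is easy to sketch (a toy example, not any particular engine): monitoring conditions are registered once as standing predicates, and every arriving event is pushed through all of them. Note that this naive version checks each condition individually; the discrimination-network work discussed below exists precisely to share that work across large numbers of standing queries.

import java.util.*;
import java.util.function.Predicate;

// Toy "queries are stored, data is pushed" engine: the mirror image of store-then-query.
class MonitorEngine {
  private final Map<String, Predicate<Map<String, Object>>> standingQueries = new HashMap<>();

  void register(String name, Predicate<Map<String, Object>> condition) {
    standingQueries.put(name, condition);
  }

  // Each incoming event is routed through every standing condition; matches fire notifications.
  void push(Map<String, Object> event) {
    standingQueries.forEach((name, cond) -> {
      if (cond.test(event)) System.out.println(name + " fired on " + event);
    });
  }

  public static void main(String[] args) {
    MonitorEngine engine = new MonitorEngine();
    engine.register("intel-below-100",
        e -> "INTC".equals(e.get("symbol")) && ((Double) e.get("price")) < 100.0);
    engine.push(Map.of("symbol", "INTC", "price", 97.5));  // fires
    engine.push(Map.of("symbol", "IBM", "price", 80.0));   // does not fire
  }
}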

Lastly, incoming events typically have a fair amount of processing done on them. For example,
they are often noisy and must be cleaned. They often must be correlated with other events (so-
called sensor fusion). In addition, they must be converted to a common clock and/or a common
co-ordinate system. One possible architecture to perform this processing is to run such programs
in an application server in middleware, and then perform the DBMS functions in a back-end
DBMS. Such a multi-tier architecture will lead to many boundary crossings between middleware
and the DBMS, perhaps one per event processed. This could lead to serious performance
degradation. An alternative is to collapse the DBMS and middleware functions into a single
system specialized for stream processing.

Research Issues

It remains to be seen whether stream processing will have a long term impact on commercial
systems. However, the topic is being widely researched at the present time, not only in the
database systems community, but also in the theory community. There is a large body of
literature, for example [GGR02] that focuses on limited-memory, single-pass algorithms for
computing aggregates and mining results on stream-based information. This research has
similarities to the papers presented in the data mining section of this book. However, this class of
work assumes that only single-pass algorithms are acceptable on streams, and that limitations on
the size of the “state” that can be kept during the mining process may be present. We will not
discuss this stream mining in this section, but focus instead on the execution of explicit
monitoring queries.
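To give the flavor of such one-pass computation (a standard streaming-statistics example, not taken from [GGR02]), a running mean and variance can be maintained in constant space with Welford's update, without ever revisiting earlier items:

// One-pass, constant-memory running mean and variance (Welford's algorithm).
class RunningStats {
  private long n = 0;
  private double mean = 0.0;
  private double m2 = 0.0;   // sum of squared deviations from the current mean

  void add(double x) {
    n++;
    double delta = x - mean;
    mean += delta / n;
    m2 += delta * (x - mean);
  }

  double getMean()     { return mean; }
  double getVariance() { return n > 1 ? m2 / (n - 1) : 0.0; }
}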

The first issue in any stream processing system is the application program interface, i.e. how to
extend or change SQL to express the tasks of stream-oriented applications. Our first paper in this
chapter is the pioneering work of Seshadri et al. on a stream-oriented query algebra from 1994; a
follow-on SQL-like language appeared a couple of years later [SLR96]. In our opinion, much
of the recent query language work in this area borrows heavily from these papers.

Then, we turn our attention to the challenges of handling a large number of standing queries over
streaming events. The query conditions may entail checking a predicate against each message,
or they may be more complex and involve joins with other message streams. For efficiency, it is
important to handle the batch of standing queries together, sharing work when possible. Early
research in this area focused on Rete networks [Forg82], and some of the later work has been
variants on this kind of discrimination network technology. The current best practice in this area
appears to be the work of Eric Hanson on Ariel, and we include his most recent paper as our
second selection in this chapter. It is instructive to view recent stream query work through the
lens of more traditional discrimination network ideas.

Current commercial systems support database triggers; however we know of no major vendor that
can support more than a few triggers per table. The basic problem is that these vendors merely
check each triggering condition individually on each update. Hence, supporting a large number
of triggering conditions will lead to very bad performance. To provide scalable performance on
large numbers of triggering conditions, an Ariel-style data structure must be used.

An alternate approach to discrimination networks borrows from conventional cost-based query
optimization. The result of query optimization is a query plan, which is a graph of operators,
through which tuples are either pulled or pushed. If tuples arrive at random times, they look like
messages, and the nodes look like monitoring operations. In conventional query optimization, the
sizes and data distributions of all the tables are known in advance, and the best query plan can be
obtained. However, in a message processing framework, messages may arrive according to
unknown distributions in arrival times and in data values, and an optimal plan cannot be
generated in advance. In this scenario, it is important that a query plan be able to adapt to
changes in the arrival rate and contents of messages. A survey of adaptive query optimization
schemes appears in [HFC+00]. Our third paper in this section presents the Berkeley work on
eddies, which is the first paper from the Telegraph project. Eddies are the most aggressive
proposal for allowing query plans to monitor and adapt to changes in the input distributions. This
first eddy paper does not particularly discuss streaming data sources, but it does discuss the way
in which various join operators allow for adapting a query plan mid-stream. The eddy
mechanism has been extended in a number of directions: [MSH02] and [CF03] extend eddies to
share work among multiple continuous queries over streams, and [RDH03] extends eddies to
consider multiple access and join methods.

One natural question concerning eddies is whether a system would need to adapt on every tuple.
In effect, the cost of adaptation must be balanced against the benefit of fine granularity alteration.
It turns out that the overhead of eddies can be largely masked by adapting at a slightly coarser
grain; say every 100 tuples or so [DESH04].
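A heavily simplified sketch of the idea (hypothetical code, not the Telegraph implementation): an eddy-style router keeps per-operator statistics and re-ranks the filter order every batch of tuples, so the effective plan drifts toward applying the most selective operators first as the input distribution changes. The batch size of 100 echoes the coarse-grained adaptation just mentioned.

import java.util.*;
import java.util.function.Predicate;

// Eddy-flavored adaptive routing: filter order is re-ranked from observed pass rates
// every BATCH tuples instead of being fixed by an optimizer up front.
class AdaptiveRouter<T> {
  private static final int BATCH = 100;   // adapt at a coarse grain to mask routing overhead

  private static final class Op<T> {
    final Predicate<T> filter;
    long seen = 0, passed = 0;
    Op(Predicate<T> f) { filter = f; }
    double passRate() { return seen == 0 ? 1.0 : passed / (double) seen; }
  }

  private final List<Op<T>> ops = new ArrayList<>();
  private long processed = 0;

  AdaptiveRouter(List<Predicate<T>> filters) {
    for (Predicate<T> f : filters) ops.add(new Op<>(f));
  }

  // Returns true if the tuple survives every filter in the current (adaptive) order.
  boolean route(T tuple) {
    if (++processed % BATCH == 0) {
      ops.sort(Comparator.comparingDouble(op -> op.passRate()));  // most selective first
    }
    for (Op<T> op : ops) {
      op.seen++;
      if (!op.filter.test(tuple)) return false;   // tuple eliminated; stop routing it
      op.passed++;
    }
    return true;
  }
}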

The last paper in this chapter, a description of the Aurora prototype, contains four features of
note. First, it represents an example of a complete specialized stream-processing system. Other
prototypes include Telegraph [CCD+03], and STREAM [MOTW03], and the interested reader is
encouraged to compare the architectures. As a second contribution, Aurora focuses on the issue
that stream processing engines are fundamentally real-time systems. As such, they must be aware
of latency and other quality-of-service issues, and Aurora has explored the benefits of building
quality-of-service deeply into the engine. Third, like the Berkeley work, Aurora recognizes the
need for adaptability in the query processing strategy. However, it stakes out a different point in
the adaptivity spectrum, trading fine-grained adaptivity for reduced overhead. A final feature of
Aurora is its focus on an algebra-style “boxes-and-arrows” approach to query composition, rather
than an extension to SQL. It is an open question whether dataflow diagrams or SQL queries will
be more natural for users of streaming systems.

The curious reader is also encouraged to consider the multi-query sharing focus of TelegraphCQ
[CCD+03], and the memory-minimization approach of STREAM [MOTW03, BBDM03, etc.]. In
sum, these three systems focus on a mixture of adaptivity, controlled quality of service in the face
of overload, and multi-query sharing. In the absence of realistic applications, it is very hard to
tell what balance of these features is most important. Hopefully in the next few years these
projects will tackle some real problems and these lessons will come clear. We expect to see
continued research activity in this area, and attempts to move it into commercial systems. It will
be interesting to see the long term impact of this activity.

References

[BBDM03] Brian Babcock, Shivnath Babu, Mayur Datar and Rajeev Motwani. “Chain: Operator
Scheduling for Memory Minimization in Data Stream Systems”. In Proc. of the ACM-SIGMOD
International Conference on Management of Data, June 2003.

[BS00] Philippe Bonnet and Praveen Seshadri. “Device Database Systems.” In Proc. 16th
International Conference on Data Engineering (ICDE). San Diego, CA, February-March, 2000.

[CCD+03] Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Michael J. Franklin, Joseph
M. Hellerstein, Wei Hong, Sailesh Krishnamurthy, Samuel Madden, Vijayshankar Raman, Fred
Reiss and Mehul A. Shah. “TelegraphCQ: Continuous Dataflow Processing for an Uncertain
World”. In Proc. First Biennial Conference on Innovative Data Systems Research (CIDR),
Asilomar, Ca., January 2003.

[CJSS03] Charles D. Cranor, Theodore Johnson, Oliver Spataschek and Vladislav Shkapenyuk.
“The Gigascope Stream Database”. ACM SIGMOD International Conference on Management of
Data, June, 2003.

[Forg82] Charles Forgy: Rete: A Fast Algorithm for the Many Patterns/Many Objects Match
Problem. Artificial Intelligence 19(1): 17-37 (1982).

[GGR02] Minos Garofalakis, Johannes Gehrke and Rajeev Rastogi. "Querying and Mining Data
Streams: You Only Get One Look" .Tutorial, ACM SIGMOD International Conference on
Management of Data, Madison, Wisconsin, June 2002. http://www.bell-
labs.com/user/minos/Talks/streams-tutorial02.ppt.

[HFC+00] Joseph M. Hellerstein, Michael Franklin, Sirish Chandrasekaran, Amol Deshpande,
Kris Hildrum, Sam Madden, Vijayshankar Raman and Mehul A. Shah. “Adaptive Query
Processing: Technology in Evolution”. IEEE Data Engineering Bulletin, June 2000.

[MFHH03] Samuel R. Madden, et al. “The Design of an Acquisitional Query Processor for
Sensor Networks”. In Proceedings ACM-SIGMOD International Conference on Management of
Data, June 2003.

[MOTW03] Rajeev Motwani, Jennifer Widom, Arvind Arasu, Brian Babcock, Shivnath Babu,
Mayur Datar, Gurmeet Manku, Chris Olston, Justin Rosenstein, and Rohit Varma. “Query
Processing, Approximation, and Resource Management in a Data Stream Management System.”
In Proc. First Biennial Conference on Innovative Data Systems Research (CIDR), Asilomar, Ca.,
January 2003.

[MSH02] Samuel R. Madden, Mehul A. Shah and Joseph M. Hellerstein. “Continuously
Adaptive Continuous Queries over Streams”. In Proceedings ACM-SIGMOD International
Conference on Management of Data, Madison, WI, June 2002.

[RDH03] Vijayshankar Raman, Amol Deshpande and Joseph M. Hellerstein. Using State
Modules for Adaptive Query Processing. In Proc. International Conference on Data Engineering
(ICDE), 2003.

[SLR96] Praveen Seshadri, Miron Livny, Raghu Ramakrishnan: The Design and Implementation
of a Sequence Database System. In Proc. International Conference on Very Large Data Bases
(VLDB), 1996, pp. 99-110
Scalable Trigger Processing

Eric N. Hanson, Chris Carnes, Lan Huang, Mohan Konyala, Lloyd Noronha,
Sashi Parthasarathy, J. B. Park and Albert Vernon
301 CSE, CISE Department, University of Florida
Gainesville, FL 32611-6120
hanson@cise.ufl.edu, http://www.cise.ufl.edu/~hanson

Abstract
Current database trigger systems have extremely limited scalability. This paper proposes a way to develop a truly scalable trigger system. Scalability to large numbers of triggers is achieved with a trigger cache to use main memory effectively, and a memory-conserving selection predicate index based on the use of unique expression formats called expression signatures. A key observation is that if a very large number of triggers are created, many will have the same structure, except for the appearance of different constant values. When a trigger is created, tuples are added to special relations created for expression signatures to hold the trigger’s constants. These tables can be augmented with a database index or main-memory index structure to serve as a predicate index. The design presented also uses a number of types of concurrency to achieve scalability, including token (tuple)-level, condition-level, rule action-level, and data-level concurrency.

† This research was supported by the Defense Advanced Research Projects Agency, NCR Teradata Corporation, and Informix Corporation.

1. Introduction
Trigger features in commercial database products are quite popular with application developers since they allow integrity constraint checking, alerting, and other operations to be performed uniformly across all applications. Unfortunately, effective use of triggers is hampered by the fact that current trigger systems in commercial database products do not scale. Numerous database products only allow one trigger for each type of update event (insert, delete and update) on each table. More advanced commercial trigger systems have effective limits of a few hundred triggers per table.

Application designers could effectively use large numbers of triggers (thousands or even millions) in a single database if it were feasible. The advent of the Internet and the World Wide Web makes it even more important that it be possible to support large numbers of triggers. A web interface could allow users to interactively create triggers over the Internet. This type of architecture could lead to large numbers of triggers created in a single database.

This paper presents strategies for developing a highly scalable trigger system. The concepts introduced here are being implemented in a system we are developing called TriggerMan, which consists of an extension module for an object-relational DBMS (a DataBlade for Informix with Universal Data Option, hereafter simply called Informix [Info99]), plus some additional programs to be described later. The approach we propose for implementing a scalable trigger system uses asynchronous trigger processing and a sophisticated predicate index. This can give good response time for updates, while still allowing processing of large numbers of potentially expensive triggers. The scalability concepts outlined in this paper could also be used in a trigger system inside a DBMS server.

A key concept that can be exploited to develop a scalable trigger system is that if a large number of triggers are created, it is almost certainly the case that many of them have almost the same format. Many triggers may have identical structure except that one constant has been substituted for another, for example. Based on this observation, a trigger system can identify unique expression signatures, and group predicates taken from trigger conditions into equivalence classes based on these signatures.

The number of distinct expression signatures is fairly small, small enough that main memory data structures can be created for all of them. In what follows, we discuss the TriggerMan command language and architecture, and then turn to a discussion of how large numbers of triggers can be handled effectively using expression signature equivalence classes and a novel selection predicate indexing technique.
Informix Corporation.
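To make the notion of expression signatures concrete before turning to the details, here is a minimal illustrative sketch (plain Python, not TriggerMan code; the regular-expression treatment of constants is an assumption made only for this example) showing how triggers that differ only in their constants fall into the same equivalence class:

import re
from collections import defaultdict

def signature(predicate):
    # Replace each literal constant (number or quoted string) with a
    # numbered placeholder, yielding the expression signature.
    counter = [0]
    def placeholder(match):
        counter[0] += 1
        return "CONSTANT%d" % counter[0]
    return re.sub(r"\d+(?:\.\d+)?|'[^']*'", placeholder, predicate)

triggers = {
    "T1": "emp.salary > 80000",
    "T2": "emp.salary > 50000",
    "T3": "emp.name = 'Bob'",
}

# signature -> list of (trigger ID, constants): the equivalence classes.
classes = defaultdict(list)
for tid, pred in triggers.items():
    consts = re.findall(r"\d+(?:\.\d+)?|'[^']*'", pred)
    classes[signature(pred)].append((tid, consts))

for sig, members in classes.items():
    print(sig, "->", members)
# emp.salary > CONSTANT1 -> [('T1', ['80000']), ('T2', ['50000'])]
# emp.name = CONSTANT1 -> [('T3', ["'Bob'"])]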

2. The TriggerMan Command Language

Commands in TriggerMan have a keyword-delimited, SQL-like syntax. TriggerMan supports the notion of a connection to a local Informix database, a remote database, or a generic data source program. A connection description for a database contains information about the host name where the database resides, the type of database system running (e.g. Informix, Oracle, Sybase, DB2 etc.), the name of the database server, a user ID, and a password. A single connection is designated as the default connection. There can be multiple data sources defined for a single connection. Data sources normally correspond to tables, but this is not essential.

Triggers can be defined using this command:

create trigger <triggerName> [in setName]
[optionalFlags]
from fromList
[on eventSpec]
[when condition]
[group by attributeList]
[having groupCondition]
do action

Triggers can be added to a specific trigger set. Otherwise they belong to a default trigger set. The from, on, and when clauses are normally present to specify the trigger condition. Optionally, group by and having clauses, similar to those available in SQL [Date93], can be used to specify trigger conditions involving aggregates or temporal functions. Multiple data sources can be referenced in the from clause. This allows multiple-table triggers to be defined.

An example of a rule, based on an emp table from a database for which a connection has been defined, is given below. This rule sets the salary of Fred to the salary of Bob:

create trigger updateFred
from emp
on update(emp.salary)
when emp.name = 'Bob'
do execSQL 'update emp set
    salary=:NEW.emp.salary where emp.name= ''Fred'' '

This rule illustrates the use of an execSQL command that allows SQL statements to be run against a database. The :NEW notation in the rule action (the do clause) allows reference to new updated data values, the new emp.salary value in this case. Similarly, :OLD allows access to data values that were current just before an update. Values matching the trigger condition are substituted into the trigger action using macro substitution. After substitution, the trigger action is evaluated. This procedure binds the rule condition to the rule action.

An example of a more sophisticated rule (one whose condition involves joins) is as follows. Consider the following schema for part of a real-estate database, which would be imported by TriggerMan using define data source commands:

house(hno,address,price,nno,spno)
salesperson(spno,name,phone)
represents(spno,nno)
neighborhood(nno,name,location)

A rule on this schema might be "if a new house is added which is in a neighborhood that salesperson Iris represents then notify her," i.e.:

create trigger IrisHouseAlert
on insert to house
from salesperson s, house h, represents r
when s.name = 'Iris' and s.spno=r.spno and r.nno=h.nno
do raise event NewHouseInIrisNeighborhood(h.hno, h.address)

This command refers to three tables. The raise event command used in the rule action is a special command that allows rule actions to communicate with the outside world [Hans98].

Figure 1. The architecture of the TriggerMan trigger processor.

3. System Architecture

The TriggerMan architecture is made up of the following components:
1. the TriggerMan DataBlade which lives inside of Informix,
2. data sources, which normally correspond to local or remote tables. Most commonly, a data source will be a local table. In that case, standard Informix triggers are created automatically by TriggerMan to capture updates to the table. We use the one trigger per table per update event available in Informix to capture updates and transmit them to TriggerMan by inserting

them in an update descriptor table. For remote data networks when used for trigger condition testing. The
sources, data source applications transmit update results could also be adapted to other trigger systems.
descriptors to TriggerMan through the data source
API (defined below). 4. General Trigger Condition Structure
3. TriggerMan client applications, which create Trigger conditions have the following general
triggers, drop triggers, register for events, receive structure. The from clause refers to one or more data
event notifications when triggers fire, etc., sources. The on clause may contain an event condition
4. one or more instances of the TriggerMan driver DataSource: emp
program, each of which periodically invokes a special Event: insert
TmanTest() function in the TriggerMan DataBlade, SyntaxTree:
>
allowing trigger condition testing and action
execution to be performed,
5. the TriggerMan console, a special application emp.sal CONSTANT
program that lets a user directly interact with the
system to create triggers, drop triggers, start the Figure 2. Example expression signature syntax
system, shut it down, etc. tree.
The general architecture of the TriggerMan system is for at most one of the data sources referred to in the from
illustrated in Figure 1. Two libraries that come with list. The when clause of a trigger is a Boolean-valued
TriggerMan allow writing of client applications and data expression. For a combination of one or more tuples from
source programs. These libraries define the TriggerMan data sources in the from list, the when clause evaluates to
client application programming interface (API) and the true or false.
TriggerMan data source API. The console program and A canonical representation of the when clause can be
other application programs use client API functions to formed in the following way:
connect to TriggerMan, issue commands, register for 1. Translate it to conjunctive normal form (CNF,
events, and so forth. Data source programs can be written i.e. and-of-ors notation).
using the data source API. Updates received from update 2. Each conjunct refers to zero, one, two, or
capture triggers or data source programs are consumed on possibly more data sources. Group the conjuncts
the next call to TmanTest(). by the set of data sources they refer to.
As Figure 1 shows, data source programs or triggers If a group of conjuncts refers to one data source, the
can place update descriptors in a table acting as a queue. logical AND of these conjuncts is a selection predicate. If
This works in the current implementation. We plan to it refers to two data sources, the AND of its conjuncts is a
allow updates to be delivered into a main-memory queue join predicate. If it refers to zero conjuncts, it is a trivial
as well in the future. This will deliver updates faster, but predicate. If it refers to three or more data sources, we
the safety of persistent update queuing will be lost. call it a hyper-join predicate.
Trigger processing in the current system is asynchronous. These predicates may or may not contain constants.
If simple Informix triggers are used to capture updates, The general premise of this paper is that very large
TriggerMan could process triggers synchronously as well. numbers of triggers will only be created if predicates in
We plan to add this feature in a later implementation. different triggers contain distinct constant values. Below,
TriggerMan is based on an object-relational data we will examine how to handle selection and join
model. The current implementation supports char, predicates that contain constants, so that scalability to
varchar, integer, and float data types. Support for user- large numbers of triggers can be achieved.
defined types is being added.
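As a small illustration of the canonical form just described, the following sketch (illustrative only; it assumes the when clause has already been parsed into CNF conjuncts tagged with the tuple variables they reference) groups the conjuncts of the IrisHouseAlert trigger by the set of data sources they refer to:

from collections import defaultdict

# Conjuncts of the IrisHouseAlert when clause, tagged with the tuple
# variables they mention.
conjuncts = [
    ("s.name = 'Iris'", {"s"}),
    ("s.spno = r.spno", {"s", "r"}),
    ("r.nno = h.nno",   {"r", "h"}),
]

groups = defaultdict(list)
for text, tuple_vars in conjuncts:
    groups[frozenset(tuple_vars)].append(text)

def kind(tuple_vars):
    # One source -> selection predicate, two -> join predicate,
    # zero -> trivial, three or more -> hyper-join predicate.
    return {0: "trivial", 1: "selection", 2: "join"}.get(len(tuple_vars), "hyper-join")

for tuple_vars, preds in groups.items():
    print(sorted(tuple_vars), kind(tuple_vars), " AND ".join(preds))
# ['s'] selection s.name = 'Iris'
# ['r', 's'] join s.spno = r.spno
# ['h', 'r'] join r.nno = h.nno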
5. Scalable Predicate Indexing Using
Trigger Condition Testing Algorithm Expression Signatures
TriggerMan uses a discrimination network called an A- In what follows, we treat the event (on) condition
TREAT network [Hans96], a variation of the TREAT separately from the when condition as a convenience.
network [Mira87], for trigger condition testing. In the However, event conditions and when clause conditions
future, we plan to implement an optimized type of are both logically selection conditions [Hans96] that can
discrimination network called a Gator network in be applied to update descriptors submitted to the system.
TriggerMan [Hans97b]. A tuple variable is a symbol, defined in the from
This paper focuses primarily on efficient and scalable clause of a trigger, which corresponds to a usage of a
selection condition testing and rule action execution. The particular data source in that trigger. The general form of
results are applicable to TREAT, Rete [Forg82] and Gator a selection predicate is:

(C11 OR C 22 OR ... OR C1 N1 ) AND ... AND (C K 1 OR C K 2 OR ... OR C KN K ) Expression signatures represent the logical structure or
where all clauses C ij appearing in the predicate refer to schema of a part of a trigger condition. We assert that in a
real application of a trigger system like TriggerMan, even
the same tuple variable. Furthermore, each such clause is if very large numbers of triggers are defined, only a
an atomic expression that does not contain Boolean relatively small number of unique expression signatures
operators, other than possibly the NOT operator. A single will ever be observed - perhaps a few hundred or a few
clause may contain constants.
For convenience, we assume that every data source has data source
predicate index
a data source ID. A data source corresponds to a single predicate indexes
root
table in a remote or local database, or even a single stream
of tuples sent in messages from an application program. … …
An expression signature for a general selection or join
… … …
predicate expression is a triple consisting of a data source
ID, an operation code (insert, delete, update, or …
insertOrUpdate), and a generalized expression. If a tuple
variable appearing in the from clause of a trigger does not expression
have any event specified in the on clause, then the event is signature list
implicitly insert or update for that tuple variable. The
format of the generalized expression is: Figure 3. Predicate Index Structure.
(C ’11 OR C ’22 OR ... OR C ’1N1 ) AND ... AND (C ’K 1 OR C ’K 2 OR ... OR C ’KN K )

where clause C’ij is the same as Cij except that all thousand at most. Based on this observation, it is feasible
constants inCij are substituted with placeholder symbols. to keep a set of data structures in main memory to
represent all the distinct expression signatures appearing
If the entire expression has m constants, they are in all triggers. Since many triggers may have the same
numbered 1 to m from left to right. If the constant signature but contain different constants, tables will be
number x, 1 ≤ x ≤ m, appears in the clause C ij in the created to store these constants, along with information
linking them to their expression signature. When these
original expression, then it is substituted with placeholder
tables are small, low-overhead main-memory lists or
CONSTANTx in Cij in the expression signature. indexes can be used to cache information from them.
As a practical matter, most selection predicates will not When they are large, they can be stored as standard tables
contain OR’s, and most will have only a single clause. (with an index when appropriate) and queried as needed,
Consider this example trigger condition: using the SQL query processor, to perform trigger
condition testing. We will elaborate further on
on insert to emp implementation issues below.
when emp.salary > 80000
In an implementation, the generalized expression in an 5.1. Processing a Trigger Definition
expression signature can be a syntax tree with When a create trigger statement is processed, a
placeholders at some leaf nodes representing the location number of steps must be performed to update the trigger
where a constant must appear. For example, the signature system catalogs and main memory data structures, and to
of the trigger condition just given can be represented as “prime” the trigger to make it ready to run. The primary
shown in tables that form the trigger catalogs are these:
Figure 2. The condition: trigger_set(tsID, name, comments, creation_date,
on insert to emp isEnabled)
when emp.salary > 50000 trigger(triggerID, tsID, name, comments, trigger_text,
has a different constant than the earlier condition, but it creation_date, isEnabled, …)
has the same signature. In general, an expression signature The purpose of the isEnabled field is to indicate
defines an equivalence class of all instantiations of that whether a trigger or trigger set is currently enabled and
expression with different constant values. eligible to fire if matched by some update. The other
If an expression is in the equivalence class defined by fields are self-explanatory. A data structure called the
an expression signature, we say the expression matches trigger cache is maintained in main memory. This
the expression signature. contains complete descriptions of a set of recently
accessed triggers, including the trigger ID and name,

references to data sources relevant to the trigger, and the Here, N is the identification number of the expression
syntax tree and Gator network skeleton for the trigger. signature. The fields of const_tableN have the following
Given current main memory sizes, thousands of trigger meaning:
descriptions can be loaded in the trigger cache 1. exprID is the unique ID of a selection predicate E,
simultaneously. E.g. if a trigger description takes 4K 2. triggerID is the unique ID number of the trigger
bytes (a realistic number), and 64Mbytes are allocated to containing E,
the trigger cache, 16,384 trigger descriptions can be 3. nextNetworkNode identifies the next A-TREAT
loaded simultaneously. network node of trigger triggerID to pass a token to
Another main memory data structure called a predicate after it matches E (an alpha node or a P-node),
index is maintained. A diagram of the predicate index is 4. const1 … constK are constants found in the indexable
shown in Figure 3. The predicate index can take an update portion of E, and
descriptor and identify all predicates that match it. 5. restOfPredicate is a description of the non-indexable
Expression signatures may contain more than one part of E. The value of restOfPredicate is NULL if
conjunct. If a predicate has more than one conjunct, a the entire predicate is indexable.
single conjunct is identified as the most selective one. If the table is large, and the signature of the indexable
Only this one is indexed directly. If a token matches a part of the predicate is of the form
conjunct, any remaining conjuncts of the predicate are attribute1=CONSTANT1 AND …
located and tested against the token. If the remaining attributeK=CONSTANTK, the table will have a clustered
clauses match, then the token has completely matched the index on [const1, … constK] as a composite key. If the
predicate clause. See [Hans90] for more details on this predicate has a different type of signature based on an
technique. operator other than “=”, it may still be possible to use an
The root of the predicate index is linked to a set of index on the constant fields. As future work, we propose
data source predicate indexes using a hash table on data to develop ways to index for non-equality operators and
source ID. Each data source predicate index contains an constants whose types are user-defined [Kony98].
expression signature list with one entry for each unique Putting a clustered index on the constant attributes will
expression signature that has been used by one or more allow the triggerIDs of triggers relevant to a new update
triggers as a predicate on that data source. For each descriptor matching a particular set of constant values to
expression signature that contains one or more constant be retrieved together quickly without doing random I/O.
placeholders, there will be a constant table. This is an Notice that const_tableN is not in third normal form. This
ordinary database table containing one row for each was done purposely to eliminate the need to perform joins
expression occurring in some trigger that matches the when querying the information represented in the table.
expression signature. Referring back to the definition of the
When triggers are created, any new expression expression_signature table, we can now define the
signatures detected are added to the following table in the remaining attributes:
trigger system catalogs: 1. constTableName is a string giving the name of the
expression_signature(sigID, dataSrcID, signatureDesc, constant table for an expression signature,
constTableName, constantSetSize, 2. constantSetSize is the number of distinct constants
constantSetOrganization) appearing in expressions with a given signature, and
3. constantSetOrganization describes how the set of
The sigID field is a unique ID for a signature. The
constants will be organized in either a main-memory
dataSrcID field identifies the data source on which the
or disk-based structure to allow efficient trigger
signature is defined. The signatureDesc field is a text
condition testing. The issue of constant set
field with a description of the signature. We will define
organization will be covered more fully later in the
the other fields later.
paper.
When an expression signature E is encountered at
Given the disk- and memory-based data structures just
trigger creation time, it is broken into two parts: the
described, the steps to process a create trigger statement
indexable part, E_I, and the non-indexable part, E_NI, as
are:
follows:
1. Parse the trigger and validate it (check that it is a
E = E_I AND E_NI
legal statement).
The non-indexable portion may be NULL. The format of
2. Convert the when clause to conjunctive normal form
the constant table for an expression signature containing
and group the conjuncts by the distinct sets of tuple
K distinct constants in its indexable portion is:
variables they refer to, as described in section 4.
const_tableN(exprID, triggerID, nextNetworkNode, 3. Based on the analysis in the previous step, form a
const1, … constK, restOfPredicate) trigger condition graph. This is an undirected graph

with a node for each tuple variable, and an edge for data source predicate index root
predicate indexes
each join predicate identified. The nodes contain a
reference to the selection predicate for that node, … …
represented as a CNF expression. The edges each expression
… … … signature list
contain a reference to a CNF expression for the join
condition associated with that edge. Groups of … …
conjuncts that refer to zero tuple variables or three or
more tuple variables are attached to a special “catch constant set (set of

all” list associated with the query graph. These will unique constants)
be handled as special cases. Fortunately, they will …
rarely occur. We will ignore them here to simplify
the discussion. triggerID set (set of IDs
4. Build the A-TREAT network for the rule. of different triggers
having same set of
5. For each selection predicate above an alpha node in constants)
the network, do the following:
Check to see if its signature has been seen before by Figure 4. Expanded View of Normalized
comparing its signature to the signatures in the Predicate Index Structure.
expression signature list for the data source on which
the predicate is defined (see Figure 3). If no but different constants -- they are mandatory in a scalable
predicate with the same signature has been seen trigger system. Strategies 1 and 2 are also required in
before, order to make the common case (a few thousand triggers
• add the signature of the predicate to the list and or less) fast. A cost model that illustrates the tradeoffs is
update the expression_signature catalog table. implemented in TriggerMan and strategies 3 and 4 are
• If the signature has at least one constant implemented in TriggerMan and strategies 3 and 4 are
placeholder in it, create a constant table for the under construction.
expression signature.
If the predicate has one or more constants in it, add 5.3. Common Sub-expression Elimination for
one row to the constant table for the expression Selection Predicates
signature of the predicate. An important performance enhancement to reduce the
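To picture the trigger condition graph formed in step 3, here is a toy sketch (the dictionary-based graph representation and variable names are illustrative, not the system's actual structures), again using the grouped conjuncts of the IrisHouseAlert example:

# Nodes are tuple variables carrying their selection conjuncts; edges
# carry join conjuncts; anything over zero or three-plus tuple variables
# goes on the "catch all" list.
grouped = {
    frozenset({"s"}):      ["s.name = 'Iris'"],
    frozenset({"s", "r"}): ["s.spno = r.spno"],
    frozenset({"r", "h"}): ["r.nno = h.nno"],
}

nodes, edges, catch_all = {}, {}, []
for tuple_vars, conjuncts in grouped.items():
    if len(tuple_vars) == 1:
        nodes.setdefault(next(iter(tuple_vars)), []).extend(conjuncts)
    elif len(tuple_vars) == 2:
        edges.setdefault(tuple_vars, []).extend(conjuncts)
    else:
        catch_all.extend(conjuncts)

# Every tuple variable becomes a node even without its own selection.
for tuple_vars in grouped:
    for tv in tuple_vars:
        nodes.setdefault(tv, [])

print("nodes:", nodes)
print("edges:", {tuple(sorted(e)): c for e, c in edges.items()})
print("catch all:", catch_all)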
total time needed to determine which selection predicates
5.2. Alternative Organization Strategies for
match a token is common sub-expression elimination.
Expression Equivalence Classes This can be achieved by normalizing the predicate index
For a particular expression signature that contains at structure. Figure 4 shows an expanded view of the
least one constant placeholder, there may be one or more predicate index given in Figure 3. The constant set of an
expressions in its equivalence class that belong to expression signature contains one element for each
different triggers. This number could be small or large. constant (or tuple of constants [const1, … ,constK])
To get optimal performance over a wide range of sizes of occurring in some selection predicate that matches the
the equivalence classes of expressions for a particular signature. Each constant is linked to a triggerID set,
expression signature, alternative indexing strategies are which is a set of the ID numbers of triggers containing a
needed. Main-memory data structures with low overhead particular selection predicate. For example, if there are
are needed when the size of an equivalence class is small. rules of the form:
Disk-based structures, including indexed or non-indexed
tables, are needed when the size of an equivalence class is create trigger T_I from R when R.a = 100 do …
large. for I=1 to N, then there will be an expression signature
The following four ways can be considered to organize R.a=CONSTANT, the constant set for this signature will
the predicates in an expression signature’s equivalence contain an entry 100, and the triggerID set for 100 will
class: contain the ID numbers of T_1 … T_N.
1. main memory list We will implement constant sets and triggerID sets in a
2. main memory index fully normalized form, as shown in Figure 4, when these
3. non-indexed database table sets are stored as either main memory lists or indexes
4. indexed database table (organizations 1 and 2). This normalized main-memory
Strategies 3 and 4 must be implemented to make it data structure will be built using the data retrieved from
feasible to process very large numbers of triggers the constant table for the expression signature.
containing predicate expressions with the same signature
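The normalized organization of Figure 4 can be pictured with a tiny sketch (the nested dictionaries below are an illustrative stand-in for the main-memory list or index organizations discussed above):

from collections import defaultdict

# data source -> expression signature -> constant -> triggerID set
predicate_index = {"R": {"R.a = CONSTANT1": defaultdict(set)}}

# The triggers "create trigger T_i from R when R.a = 100 do ..." for
# i = 1..5 all share one signature and one constant-set entry.
for i in range(1, 6):
    predicate_index["R"]["R.a = CONSTANT1"][100].add("T_%d" % i)
predicate_index["R"]["R.a = CONSTANT1"][200].add("T_other")

# A token with R.a = 100 is tested against the constant set once; its
# triggerID set then names every trigger that shares that predicate.
token = {"a": 100}
matches = predicate_index["R"]["R.a = CONSTANT1"].get(token["a"], set())
print(sorted(matches))   # ['T_1', 'T_2', 'T_3', 'T_4', 'T_5']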

(from top part of


5.4. Processing Update Descriptors Using the predicate index)
Predicate Index …
Recall that an update descriptor (token) consists of a
data source ID, an operation code, and an old tuple, new
… … expression
tuple, or old/new tuple pair. When a new token arrives, signature list
the system passes it to the root of the predicate index, …
… 1
which locates its data source predicate index. For each constant set (set of … 2
unique constants) …
expression signature in the data source predicate index, a … N
… … …
specific type of predicate testing data structure (in-
memory list, in-memory lightweight index, non-indexed triggerID set (set of IDs
database table, or indexed database table) is in use for that of different triggers
1 2 N having same constant
expression signature. The predicate testing data structure appearing for a
of each of these expression signatures is searched to find particular signature)
matches against the current token.
When a matching constant is found, the triggerID set Figure 5. Illustration of partitioned constant
for the constant contains one or more elements. Each of sets and triggerID sets to facilitate concurrent
these elements contains zero or more additional selection processing.
predicate clauses. For each element of the triggerID set 2. Condition-level concurrency: multiple selection
currently being visited, the additional predicate clause(s) conditions can be tested against a single token
are tested against the token, if there are any. concurrently.
When a token is found to have matched a complete 3. Rule action concurrency: multiple rule actions that
selection predicate expression that belongs to a trigger, have been fired can be processed at the same time.
that trigger is pinned in the trigger cache. This pin 4. Data-level concurrency: a set of data values in an
operation is analogous to the pin operation in a traditional alpha or beta memory node of an A-TREAT or Gator
buffer pool; it checks to see if the trigger is in memory, network [Hans97] can be processed by a query that
and if it is not, it brings it in from the disk-based trigger can run in parallel.
catalog. The pin operation ensures that the A-TREAT For ideal scalability, a trigger system must be able to
network and the syntax tree of the trigger are in main- capitalize on all four of these types of concurrency. The
memory. After the trigger is pinned, ensuring that its A- current implementation supports token level concurrency
TREAT network is in main memory, the token is passed only. We plan to support the other types of concurrency
to the node of the network identified by the in future versions of the system. Such a future version will
nextNetworkNode field of the expression that just make use of a task queue kept in shared memory to store
matched the token. incoming or internally generated work. An explicit task
Processing of join and temporal conditions is then queue must be maintained because it is not possible to
performed if any are present. Finally, if the trigger spawn native operating system threads or processes to
condition is satisfied, the trigger action is executed. carry out tasks due to the process architecture of Informix
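Section 5.4's token-processing path can be approximated by a short sketch (the index layout, the lambda standing in for the non-indexable remainder of a predicate, and the pin function are all illustrative assumptions, not TriggerMan internals):

trigger_cache = {}   # triggerID -> loaded trigger description

def pin_trigger(trigger_id):
    # Analogous to pinning in a buffer pool: load the trigger's syntax
    # tree and A-TREAT network into the cache if it is not resident.
    return trigger_cache.setdefault(trigger_id, "network(%s)" % trigger_id)

# Entries for the signature "emp.salary > CONSTANT1":
# constant -> [(triggerID, rest-of-predicate test or None, next node)]
salary_signature = {
    80000.0: [("T1", None, "alpha_T1")],
    50000.0: [("T2", lambda t: t["dept"] == "toys", "alpha_T2")],
}

def process_token(token):
    fired = []
    # A real system consults a list or index organization per signature;
    # a linear scan over the constants stands in for that here.
    for const, entries in salary_signature.items():
        if token["salary"] > const:                     # indexable part
            for trig_id, rest, next_node in entries:
                if rest is None or rest(token):         # remaining clauses
                    pin_trigger(trig_id)
                    fired.append((trig_id, next_node))  # token goes to this node
    return fired

print(process_token({"salary": 60000.0, "dept": "toys"}))
# [('T2', 'alpha_T2')]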
[Info99].
6. Concurrent Token Processing and
The concurrent processing architecture, as illustrated in
Action Execution Figure 1, will make use of N driver processes. We define
An important way to get better scalability is to use NUM_CPUS to be the number of real CPUs in the
concurrent processing. On an SMP platform, concurrent system, and TMAN_CONCURRECY_LEVEL to be the
tasks can execute in parallel. Even on a single processor, fraction of CPUs to devote to concurrent processing in
use of concurrency can give better throughput and TriggerMan, which can be in the range (0%,100%]. The
response time by making scarce CPU and I/O resources TriggerMan administrator can set the
available to multiple tasks so any eligible task can use TMAN_CONCURRENCY_LEVEL parameter. Its
them. There are a number of different kinds of default value is 100%. N is defined as follows:
concurrency that a trigger system can exploit for improved N = NUM_CPUS*TMAN_CONCURRENCY_LEVEL
scalability: Each driver process will call TriggerMan's TmanTest()
1. Token-level concurrency: multiple tokens can be function every T time units. Each driver will also call back
processed in parallel through the selection predicate immediately after one execution of TmanTest() if work is
index and the join condition-testing network. still left to do. We propose a default value of T equal to
250 milliseconds; determining the best value of T is left
for future work. TmanTest will do the following:

while(total execution time of this invocation of equal size. Multiple subsets would be processed in
TmanTest < THRESHOLD and work is left in the parallel to achieve a speedup.
task queue)
{ 7. Trigger Application Design
Get a task from the task queue and execute it. The trigger system proposed in this paper is designed
Yield the processor so other Informix tasks can use it to be highly scalable. However, just because programmers
(call the Informix mi_yield routine [Info99]). can create a large number of triggers does not mean that is
} always the best approach. If triggers have extremely
if task queue is empty regular structure, it may be best to create a single trigger
return TASK_QUEUE_EMPTY and a table of data referenced in the trigger’s from clause
return TASKS_REMAINING to customize the trigger’s behavior. This is discussed in
more detail in a longer version of this paper [Hans98b].
The driver program will wait for T time units if the last
call to TmanTest() returns TASK_QUEUE_EMPTY. 8. Related Work
Otherwise, the driver program will immediately call There has been a large body of work on active database
TmanTest() again. The default value of THRESHOLD systems, but little of it has focussed on predicate indexing
will be 250 milliseconds also, to keep the task switch or scalability. Representative works include HiPAC,
overhead between the driver programs and the Informix Ariel, the POSTGRES rule system, the Starburst Rule
processes reasonably low, yet avoid a long user-defined System, A-RDL, Chimera, RPL, DIPS and Ode
routine (UDR) execution. A long execution inside [Hans96,McCa89,Ston90,Wido96]. Most active database
TriggerMan should be avoided since it could result in systems follow the event-condition-action (ECA) model
higher probability of faults such as deadlock or running proposed for HiPAC in a straightforward way, testing the
out of memory. Keeping the execution time inside condition of every applicable trigger whenever an update
TriggerMan reasonably short also avoids the problem of event occurs. The cost of this is always at least linear in
excessive lost work if a rollback occurs during trigger the number of triggers associated with the relevant event
processing. since no predicate indexing is normally used. Moreover,
Tasks can be one of the following: the cost per trigger can be high since checking the
1. process one token to see which rules it matches condition can involve running an expensive query.
2. run one rule action Work by Hanson and Johnson focuses on indexing of
3. process a token against a set of conditions range predicates using the interval skip-list data structure
4. process a token to run a set of rule actions triggered [Hans96b], but this approach does not scale to very large
by that token numbers of rules since it may use a large amount of main
Task types 1 and 2 are self-explanatory. Tasks of type memory. Work on the Rete [Forg82] and TREAT
3 and 4 can be generated if conditions and potential [Mira87] algorithms for efficient implementation of AI
actions (triggerID structures containing the “rest of the production systems is related to the work presented here,
condition”) in the predicate index are partitioned in but the implicit assumption in AI rule system architectures
advance so that multiple predicates can be processed in is that the number of rules is small enough to fit in main
parallel. An example of when it may be beneficial to memory. Additional work has been done in the AI
partition predicates in advance is when there are many community on parallel processing of production rule
rules with the same condition but different actions. For systems [Acha92], but this does not fully address the issue
example, suppose there are M rules of the form: of scaling to large numbers of rules. Issues related to
create trigger T_K high-performance parallel rule processing in production
from R systems are surveyed by Gupta et al. [Gupt89]. They cite
when R.company = "IBM" several types of parallelism that can be exploited,
do raise event notify_user("user K", R.company, including node, intranode, action, and data parallelism.
R.sharePrice) These overlap with the types of concurrency we outlined
in section 6. Work by Hellerstein on performing
for K=1..M. If M is a large number, a speedup can be selections after joins in query processing [Hell98] is
obtained by partitioning this set of triggers into N sets of related to the issue of performing expensive selections
equal size. This would result in a predicate index after joins in Gator networks and A-TREAT networks
substructure like that illustrated in Figure 5. [Kand98]. Proper placement of selection predicates in
Here, the triggerID set would contain references to Gator networks can improve trigger system performance,
triggers T_1 … T_M. These references would be and thus scalability.
partitioned round robin into N subsets of approximately
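Partitioning a triggerID set round robin into N subsets, as in Figure 5, is simple to sketch (a toy example; the subset count and trigger names are illustrative):

def round_robin(trigger_ids, n):
    # Distribute trigger IDs over n subsets so that condition tests and
    # rule actions for different subsets can be processed concurrently.
    subsets = [[] for _ in range(n)]
    for i, tid in enumerate(trigger_ids):
        subsets[i % n].append(tid)
    return subsets

for subset in round_robin(["T_%d" % k for k in range(1, 11)], 3):
    print(subset)
# ['T_1', 'T_4', 'T_7', 'T_10']
# ['T_2', 'T_5', 'T_8']
# ['T_3', 'T_6', 'T_9']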

The developers of POSTGRES proposed a marking- kept in a database [Bran93], and is thus related to rule
based predicate indexing scheme, where data and index system scalability. A contribution of the DATEX system
records are tagged with physical markers to indicate that a was an improved way to represent information normally
rule might apply to them [Ston87,Ston90]. Predicates that kept in alpha-memory nodes in TREAT networks.
can’t be solved using an index result in placement of a However, DATEX was focussed on large-scale
table-level marker. This scheme has the advantage that production systems, whereas the work presented in this
the system can determine which rules apply primarily by paper is oriented to handling large numbers of triggers
detecting markers on tables, data records, and index that operate in conjunction with databases and database
records. Query and update processing algorithms must be applications, so our work is not directly comparable to
extended in minor ways to accomplish this. DATEX. In summary, what sets our work apart from
A disadvantage of this scheme is that it complicates prior research efforts on database trigger systems and
implementation of storage and index structures. database-oriented expert systems tools is our focus on
Moreover, when new records are inserted or existing scalability from multiple dimensions. These include the
records are updated, a large number of table-level markers capacity to accommodate large numbers of triggers,
may be disturbed. The predicate corresponding to every handle high rates of data update, and efficiently fire large
one of these disturbed markers must be tested against the numbers of triggers simultaneously. We achieve
records, which may be quite time-consuming scalability through careful selection predicate index
[Ston87,Ston90]. This phenomenon will occur even for design, and support for four types of concurrency (token-
simple predicates of the form attribute=constant if there is level, condition-level, rule-action-level, and data-level).
no index on the attribute.
Research on the RPL system [Delc88a,Delc88b] 9. Conclusion
addressed the issue of execution of production-rule-style This paper describes an architecture that can be used to
triggers in a relational DBMS, but its developers did not build a truly scalable trigger system. As of the date of this
use a discrimination network structure. They instead used writing, this architecture is being implemented as an
an approach that runs database queries to test rule Informix DataBlade along with a console program, a
conditions as updates occur. This type of approach has driver program, and data source programs. The
limited scalability due to the potentially large number of architecture presented is a significant advance over what
queries that could be generated if there are many rules. is currently available in database products. It also
Work on consistent processing of constraints and triggers generalizes earlier research results on predicate indexing
in SQL relational databases [Coch96] has helped lead to and improves upon their limited scalability
recent enhancements to the SQL3 standard. However, the [Forg82,Mira87,Ston87,Hans90,Hans96]. This
focus of this work is on trigger and constraint semantics. architecture could be implemented in any object-relational
An implicit assumption in it is that constraints and triggers DBMS that supports the ability to execute SQL statements
will be processed using a query-based approach, which inside user-defined routines (SQL callbacks). A variation
will not scale up to a large number of triggers and of this architecture could also be made to work as an
constraints. We speculate that it may be possible to work external application, communicating with the database via
around this assumption. A predicate index like the one a standard interface (ODBC [Geig95]).
proposed in this paper potentially could be used. One topic for future research includes developing ways
The DIPS system [Sell88] uses a set of special to handle temporal trigger processing [Hans97,AlFa98] in
relations called COND relations for each condition a scalable way, so that large numbers of triggers with
element (tuple variable) in a rule. These COND relations temporal conditions can be processed efficiently. Another
are queried and updated to perform testing of both potential future research topic involves ways to support
selection and join conditions of rules. Embedding all scalable trigger processing for trigger conditions involving
selection predicate testing into a process that must query aggregates. Finally, a third potential research topic is to
database tables is not particularly efficient – it will not develop a technique to make the implementation of the
compare favorably to using some sort of main-memory main-memory and disk-based structures used to organize
predicate index. A main-memory predicate index should the constant sets illustrated in Figure 4 extensible, so they
be used to get the best performance for a small-to-medium will work effectively with new operators and data types.
number of predicates, which is the common case. In the end, the results of this paper and the additional
However, DIPS was capable of utilizing parallelism via research outlined here can make highly efficient, scalable,
the database query processor to test rule conditions, a and extensible trigger processing a reality.
feature in common with the system described in this
paper. The DATEX system addresses the issue of
executing large expert systems when working memory is

[Hans98] Hanson, Eric N., I.C. Chen, R. Dastur, K. Engel, V.


References Ramaswamy, W. Tan, C. Xu, “A Flexible and Recoverable
[Acha92] Acharya, A., M. Tambe, and A. Gupta, Client/Server Database Event Notification System,” VLDB
“Implementation of Production Systems on Message-Passing Journal, vol. 7, 1998, pp. 12-24.
Computers,” IEEE Transactions on Knowledge and Data [Hans98b] Hanson, Eric N. et al., “Scalable Trigger Processing
Engineering, 3(4), July 1992. in TriggerMan,” TR-98-008, U. Florida CISE Dept., July 1998.
[AlFa98] Al-Fayoumi, Nabeel, Temporal Trigger Processing in http://www.cise.ufl.edu
the TriggerMan Active DBMS, Ph.D. dissertation, Univ. of [Hell98] Hellerstein, J., “Optimization Techniques for Queries
Florida, August, 1998. with Expensive Methods,” to appear, ACM Transactions on
[Bran93] Brant, David A. and Daniel P. Miranker, “Index Database Systems (TODS). Available at
Support for Rule Activation,” Proceedings of the ACM www.cs.berkeley.edu/~jmh.
SIGMOD Conference, May, 1993, pp. 42-48. [Info99] “Informix Dynamic Server, Universal Data Option,”
[Coch96] Cochrane, Roberta, Hamid Pirahesh and Nelson http://www.informix.com.
Mattos, “Integrating Triggers and Declarative Constraints in [Kand98] Kandil, Mohktar, Predicate Placement in Active
SQL Database Systems,” Proceedings of the 22nd VLDB Database Discrimination Networks, Ph.D. Dissertation, CISE
Conference, pp. 567-578, Bombay, India, 1996. Department, Univ. of Florida, Gainesville, August 1998.
[Date93] Date, C. J. And Hugh Darwen, A Guide to the SQL [Kony98] Konyala, Mohan K., Predicate Indexing in
Standard, 3rd Edition, Addison Wesley, 1993. TriggerMan, MS thesis, CISE Department, Univ. of Florida,
[Delc88a] Delcambre, Lois and James Etheredge, “The Gainesville, Dec. 1998.
Relational Production Language: A Production Language for [McCa89] McCarthy, Dennis R. and Umeshwar Dayal, "The
Relational Databases,” Proceedings of the Second International Architecture of an Active Data Base Management System,”
Conference on Expert Database Systems, pp. 153-162, April Proceedings of the. ACM SIGMOD Conference on Management
1988. of Data, Portland, OR, June, 1989, pp. 215-224.
[Delc88b] Delcambre, Lois and James Etheredge, "A Self- [Mira87] Miranker, Daniel P., "TREAT: A Better Match
Controlling Interpreter for the Relational Production Language,” Algorithm for AI Production Systems,” Proceedings of the AAAI
Proceedings of the ACM-SIGMOD Conference on Management Conference, August 1987, pp. 42-47.
of Data, pp. 396-403, June 1988. [Sell88] Sellis, T., C.C. Lin and L. Raschid, “Implementing
[Forg82] Forgy, C. L., Rete: “A Fast Algorithm for the Many Large Production Systems in a DBMS Environment: Concepts
Pattern/Many Object Pattern Match Problem,” Artificial and Algorithms,” Proceedings of the 1988 ACM SIGMOD
Intelligence, vol. 19, pp. 17-37, 1982. Conference.
[Geig95] Geiger, Kyle, Inside ODBC, Microsoft Press, 1995. [Ston87] Stonebraker, M., T. Sellis and E. Hanson, “An
[Gupt89] Gupta, Anoop, Charles Forgy and Allen Newell, Analysis of Rule Indexing Implementations in Database
“High Speed Implementations of Rule-Based Systems,” ACM Systems,” Expert Database Systems: Proceedings from the First
Transactions on Computer Systems, vol. 7, no. 2, pp. 119-146, International Workshop, Benjamin Cummings, 1987, pp. 465-
May, 1989. 476.
[Hans90] Hanson, Eric N., M. Chaabouni, C. Kim and Y. [Ston90] Stonebraker, Michael, Larry Rowe and Michael
Wang, “A Predicate Matching Algorithm for Database Rule Hirohama, “The Implementation of POSTGRES,” IEEE
Systems,” Proceedings of the ACM-SIGMOD Conference on Transactions on Knowledge and Data Engineering, vol. 2, no.
Management of Data, pp. 271-280, Atlantic City, NJ, June 7, March, 1990, pp. 125-142.
1990. [Wido96] Widom, J. And S. Ceri, Active Database Systems,
[Hans96] Hanson, Eric N., “The Design and Implementation of Morgan Kaufmann, 1996.
the Ariel Active Database Rule System," IEEE Transactions on
Knowledge and Data Engineering, vol. 8, no. 1, pp. 157-172,
February 1996.
[Hans96b] Hanson, Eric N. and Theodore Johnson, “Selection
Predicate Indexing for Active Databases Using Interval Skip
Lists,” Information Systems, vol. 21, no. 3, pp. 269-298, 1996.
[Hans97] Hanson, Eric N., N. Al-Fayoumi, C. Carnes, M.
Kandil, H. Liu, M. Lu, J.B. Park, A. Vernon, “TriggerMan: An
Asynchronous Trigger Processor as an Extension to an Object-
Relational DBMS,” University of Florida CISE Dept. Tech.
Report 97-024, December 1997. http://www.cise.ufl.edu.
[Hans97b] Hanson, Eric N., Sreenath Bodagala, and Ullas
Chadaga, “Optimized Trigger Condition Testing in Ariel Using
Gator Networks,” University of Florida CISE Dept. Tech.
Report 97-021, November 1997. http://www.cise.ufl.edu.
The Design and Implementation of a Sequence Database System
Praveen Seshadri, Miron Livny, and Raghu Ramakrishnan
Eddies: Continuously Adaptive Query Processing
Ron Avnur Joseph M. Hellerstein
University of California, Berkeley
avnur@cohera.com, jmh@cs.berkeley.edu

Abstract

In large federated and shared-nothing databases, resources can


exhibit widely fluctuating characteristics. Assumptions made
at the time a query is submitted will rarely hold throughout
the duration of query processing. As a result, traditional static
query optimization and execution techniques are ineffective in
these environments.
In this paper we introduce a query processing mechanism
called an eddy, which continuously reorders operators in a
query plan as it runs. We characterize the moments of sym-
metry during which pipelined joins can be easily reordered,
and the synchronization barriers that require inputs from dif-
ferent sources to be coordinated. By combining eddies with
appropriate join algorithms, we merge the optimization and
execution phases of query processing, allowing each tuple to
have a flexible ordering of the query operators. This flexibility
is controlled by a combination of fluid dynamics and a simple
learning algorithm. Our initial implementation demonstrates promising results, with eddies performing nearly as well as a static optimizer/executor in static scenarios, and providing dramatic improvements in dynamic execution environments.

Figure 1: An eddy in a pipeline. Data flows into the eddy from the input relations. The eddy routes tuples to operators; the operators run as independent threads, returning tuples to the eddy. The eddy sends a tuple to the output only when it has been handled by all the operators. The eddy adaptively chooses an order to route each tuple through the operators.

1 Introduction
There is increasing interest in query engines that run at un-
precedented scale, both for widely-distributed information re- meric data sets is fairly well understood, and there has been
sources, and for massively parallel database systems. We are initial work on estimating statistical properties of static sets of
building a system called Telegraph, which is intended to run data with complex types [Aok99] and methods [BO99]. But
queries over all the data available on line. A key requirement federated data often comes without any statistical summaries,
of a large-scale system like Telegraph is that it function ro- and complex non-alphanumeric data types are now widely in
bustly in an unpredictable and constantly fluctuating environ- use both in object-relational databases and on the web. In these
ment. This unpredictability is endemic in large-scale systems, scenarios – and even in traditional static relational databases –
because of increased complexity in a number of dimensions: selectivity estimates are often quite inaccurate.
Hardware and Workload Complexity: In wide-area envi- User Interface Complexity: In large-scale systems, many
ronments, variabilities are commonly observable in the bursty queries can run for a very long time. As a result, there is in-
performance of servers and networks [UFA98]. These systems terest in Online Aggregation and other techniques that allow
often serve large communities of users whose aggregate be- users to “Control” properties of queries while they execute,
havior can be hard to predict, and the hardware mix in the wide based on refining approximate results [HAC 99].

area is quite heterogeneous. Large clusters of computers can For all of these reasons, we expect query processing param-
exhibit similar performance variations, due to a mix of user eters to change significantly over time in Telegraph, typically
requests and heterogeneous hardware evolution. Even in to- many times during a single query. As a result, it is not appro-
tally homogeneous environments, hardware performance can priate to use the traditional architecture of optimizing a query
be unpredictable: for example, the outer tracks of a disk can and then executing a static query plan: this approach does
exhibit almost twice the bandwidth of inner tracks [Met97]. not adapt to intra-query fluctuations. Instead, for these en-
Data Complexity: Selectivity estimation for static alphanu- vironments we want query execution plans to be reoptimized
regularly during the course of query processing, allowing the
system to adapt dynamically to fluctuations in computing re-
sources, data characteristics, and user preferences.
In this paper we present a query processing operator called
an eddy, which continuously reorders the application of pipe-

lined operators in a query plan, on a tuple-by-tuple basis. An this paper we narrow our focus somewhat to concentrate on
eddy is an n-ary tuple router interposed between data sources the initial, already difficult problem of run-time operator re-
and a set of query processing operators; the eddy encapsulates ordering in a single-site query executor; that is, changing the
the ordering of the operators by routing tuples through them effective order or “shape” of a pipelined query plan tree in the
dynamically (Figure 1). Because the eddy observes tuples en- face of changes in performance.
tering and exiting the pipelined operators, it can adaptively In our discussion we will assume that some initial query
change its routing to effect different operator orderings. In this plan tree will be constructed during parsing by a naive pre-
paper we present initial experimental results demonstrating the optimizer. This optimizer need not exercise much judgement
viability of eddies: they can indeed reorder effectively in the since we will be reordering the plan tree on the fly. However
face of changing selectivities and costs, and provide benefits by constructing a query plan it must choose a spanning tree of
in the case of delayed data sources as well. the query graph (i.e. a set of table-pairs to join) [KBZ86], and
Reoptimizing a query execution pipeline on the fly requires algorithms for each of the joins. We will return to the choice of
significant care in maintaining query execution state. We high- join algorithms in Section 2, and defer to Section 6 the discus-
light query processing stages called moments of symmetry, dur- sion of changing the spanning tree and join algorithms during
ing which operators can be easily reordered. We also describe processing.
synchronization barriers in certain join algorithms that can re- We study a standard single-node object-relational query pro-
strict performance to the rate of the slower input. Join algo- cessing system, with the added capability of opening scans and
rithms with frequent moments of symmetry and adaptive or indexes from external data sets. This is becoming a very com-
non-existent barriers are thus especially attractive in the Tele- mon base architecture, available in many of the commercial
graph environment. We observe that the Ripple Join family object-relational systems (e.g., IBM DB2 UDB [RPK 99], 

[HH99] provides efficiency, frequent moments of symmetry, Informix Dynamic Server UDO [SBH98]) and in federated
and adaptive or nonexistent barriers for equijoins and non- database systems (e.g., Cohera [HSC99]). We will refer to
equijoins alike. these non-resident tables as external tables. We make no as-
The eddy architecture is quite simple, obviating the need for sumptions limiting the scale of external sources, which may be
traditional cost and selectivity estimation, and simplifying the arbitrarily large. External tables present many of the dynamic
logic of plan enumeration. Eddies represent our first step in a challenges described above: they can reside over a wide-area
larger attempt to do away with traditional optimizers entirely, network, face bursty utilization, and offer very minimal infor-
in the hope of providing both run-time adaptivity and a reduc- mation on costs and statistical properties.
tion in code complexity. In this paper we focus on continuous

1.3 Overview

operator reordering in a single-site query processor; we leave


other optimization issues to our discussion of future work. Before introducing eddies, in Section 2 we discuss the prop-
1.1 Run-Time Fluctuations


erties of query processing algorithms that allow (or disallow)
them to be frequently reordered. We then present the eddy ar-
Three properties can vary during query processing: the costs chitecture, and describe how it allows for extreme flexibility
of operators, their selectivities, and the rates at which tuples in operator ordering (Section 3). Section 4 discusses policies
arrive from the inputs. The first and third issues commonly for controlling tuple flow in an eddy. A variety of experiments
occur in wide area environments, as discussed in the literature in Section 4 illustrate the robustness of eddies in both static
[AFTU96, UFA98, IFF 99]. These issues may become more 
and dynamic environments, and raise some questions for fu-
common in cluster (shared-nothing) systems as they “scale ture work. We survey related work in Section 5, and in Sec-
out” to thousands of nodes or more [Bar99]. tion 6 lay out a research program to carry this work forward.
Run-time variations in selectivity have not been widely dis-

2 Reorderability of Plans


cussed before, but occur quite naturally. They commonly arise


due to correlations between predicates and the order of tuple A basic challenge of run-time reoptimization is to reorder pipe-
delivery. For example, consider an employee table clustered lined query processing operators while they are in flight. To
by ascending age, and a selection salary > 100000; age change a query plan on the fly, a great deal of state in the var-
and salary are often strongly correlated. Initially the selection ious operators has to be considered, and arbitrary changes can
will filter out most tuples delivered, but that selectivity rate require significant processing and code complexity to guaran-
will change as ever-older employees are scanned. Selectivity tee correct results. For example, the state maintained by an
over time can also depend on performance fluctuations: e.g., in operator like hybrid hash join [DKO 84] can grow as large as
a parallel DBMS clustered relations are often horizontally par-


the size of an input relation, and require modification or re-


titioned across disks, and the rate of production from various computation if the plan is reordered while the state is being
partitions may change over time depending on performance constructed.
characteristics and utilization of the different disks. Finally, By constraining the scenarios in which we reorder opera-
Online Aggregation systems explicitly allow users to control tors, we can keep this work to a minimum. Before describing
the order in which tuples are delivered based on data prefer- eddies, we study the state management of various join algo-
ences [RRH99], resulting in similar effects. rithms; this discussion motivates the eddy design, and forms
! 1

 3   *    .    ( :   


the basis of our approach for reoptimizing cheaply and con-


tinuously. As a philosophy, we favor adaptivity over best-case
Telegraph is intended to efficiently and flexibly provide both performance. In a highly variable environment, the best-case
distributed query processing across sites in the wide area, and scenario rarely exists for a significant length of time. So we
parallel query processing in a large shared-nothing cluster. In

will sacrifice marginal improvements in idealized query pro-


cessing algorithms when they prevent frequent, efficient reop-
timization.
2.1 Synchronization Barriers
Binary operators like joins often capture significant state. A


particular form of state used in such operators relates to the
interleaving of requests for tuples from different inputs.
As an example, consider the case of a merge join on two
sorted, duplicate-free inputs. During processing, the next tu-
ple is always consumed from the relation whose last tuple
had the lower value. This significantly constrains the order in
which tuples can be consumed: as an extreme example, con- Figure 2: Tuples generated by a nested-loops join, reordered at
sider the case of a slowly-delivered external relation slowlow two moments of symmetry. Each axis represents the tuples of
with many low values in its join column, and a high-bandwidth the corresponding relation, in the order they are delivered by
but large local relation fasthi with only high values in its join an access method. The dots represent tuples generated by the
column – the processing of fasthi is postponed for a long time join, some of which may be eliminated by the join predicate.
while consuming many tuples from slowlow. Using terminol- The numbers correspond to the barriers reached, in order. Q R

ogy from parallel programming, we describe this phenomenon and are the cursor positions maintained by the correspond-
Q S

as a synchronization barrier: one table-scan waits until the ing inputs at the time of the reorderings.
other table-scan produces a value larger than any seen before.
In general, barriers limit concurrency – and hence perfor- the iterator producing notes its current cursor position .  Q R

mance – when two tasks take different amounts of time to com- In that case, the new “outer” loop on begins rescanning by 

plete (i.e., to “arrive” at the barrier). Recall that concurrency fetching the first tuple of , and is scanned from to the   Q R

arises even in single-site query engines, which can simultane- end. This can be repeated indefinitely, joining tuples with 

ously carry out network I/O, disk I/O, and computation. Thus all tuples in from position to the end. Alternatively, at
 Q R

it is desirable to minimize the overhead of synchronization the end of some loop over (i.e. at a moment of symmetry), 

barriers in a dynamic (or even static but heterogeneous) per- the order of inputs can be swapped again by remembering the
formance environment. Two issues affect the overhead of bar- current position of , and repeatedly joining the next tuple in 

riers in a plan: the frequency of barriers, and the gap between  (starting at ) with tuples from between and the end.
Q
R
 Q
S

arrival times of the two inputs at the barrier. We will see in up- Figure 2 depicts this scenario, with two changes of ordering.
coming discussion that barriers can often be avoided or tuned Some operators like the pipelined hash join of [WA91] have no
by using appropriate join algorithms. barriers whatsoever. These operators are in constant symme-
2.2 Moments of Symmetry
try, since the processing of the two inputs is totally decoupled.
Moments of symmetry allow reordering of the inputs to a
Note that the synchronization barrier in merge join is stated single binary operator. But we can generalize this, by noting
in an order-independent manner: it does not distinguish be- that since joins commute, a tree of binary joins can be U W

tween the inputs based on any property other than the data viewed as a single -ary join. One could easily implement a
they deliver. Thus merge join is often described as a symmet- doubly-nested-loops join operator over relations , and ,   

ric operator, since its two inputs are treated uniformly1 . This is and it would have moments of complete symmetry at the end
not the case for many other join algorithms. Consider the tra- of each loop of . At that point, all three inputs could be re- 

ditional nested-loops join, for example. The “outer” relation ordered (say to then then ) with a straightforward exten-   

in a nested-loops join is synchronized with the “inner” rela- sion to the discussion above: a cursor would be recorded for
tion, but not vice versa: after each tuple (or block of tuples) each input, and each loop would go from the recorded cursor
is consumed from the outer relation, a barrier is set until a full position to the end of the input.
scan of the inner is completed. For asymmetric operators like The same effect can be obtained in a binary implementa-
nested-loops join, performance benefits can often be obtained tion with two operators, by swapping the positions of binary
by reordering the inputs. operators: effectively the plan tree transformation would go
When a join algorithm reaches a barrier, it has declared the in steps, from to and Y  Z \ ^  ` Z \ c  Y  Z \ c  ` Z \ ^ 

end of a scheduling dependency between its two input rela- then to . This approach treats an operator
Y  Z \ c  ` Z \ ^ 

tions. In such cases, the order of the inputs to the join can of- and its right-hand input as a unit (e.g., the unit ), and k Z \
c
 l

ten be changed without modifying any state in the join; when swaps units; the idea has been used previously in static query
this is true, we refer to the barrier as a moment of symmetry. optimization schemes [IK84, KBZ86, Hel98]. Viewing the sit-
Let us return to the example of a nested-loops join, with outer uation in this manner, we can naturally consider reordering
relation and inner relation . At a barrier, the join has com-
 
multiple joins and their inputs, even if the join algorithms are
pleted a full inner loop, having joined each tuple in a subset different. In our query , we need and Y  Z \
^
 ` Z \
c
 k Z \
^
 l

of with every tuple in . Reordering the inputs at this point


 
k Z to be mutually commutative, but do not require them
\ c  l

can be done without affecting the join algorithm, as long as to be the same join algorithm. We discuss the commutativity
P
of join algorithms further in Section 2.2.2.
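To make the mechanics concrete, the following Python sketch (our own illustration, not code from the paper; the function and cursor names are invented) shows a nested-loops-style join over two in-memory relations that may swap its outer and inner roles at every moment of symmetry, i.e., after each completed inner loop. The cursors cR and cS play the role of the recorded positions discussed above, and any swap policy yields the same result set.

# Toy illustration of "moments of symmetry" in a nested-loops join over two
# in-memory relations R and S.  After each completed inner loop the join may
# swap which input plays the "outer" role; the cursors cR and cS record how
# far each input has advanced as an outer, so no pair is produced twice.

def symmetric_nested_loops(R, S, predicate, swap_policy=lambda step: step % 2 == 1):
    """Yield all pairs (r, s) with predicate(r, s) true.

    swap_policy(step) is consulted at each moment of symmetry and may flip
    the outer/inner roles; only the enumeration order changes.
    """
    cR = cS = 0          # tuples of R / S already consumed as an outer
    outer_is_R = True
    step = 0
    while cR < len(R) or cS < len(S):
        if outer_is_R:
            if cR == len(R):          # R exhausted as outer; finish with S
                outer_is_R = False
                continue
            r = R[cR]
            for s in S[cS:]:          # inner loop over the unseen part of S
                if predicate(r, s):
                    yield (r, s)
            cR += 1
        else:
            if cS == len(S):
                outer_is_R = True
                continue
            s = S[cS]
            for r in R[cR:]:          # inner loop over the unseen part of R
                if predicate(r, s):
                    yield (r, s)
            cS += 1
        step += 1
        if swap_policy(step):         # moment of symmetry: reordering is free
            outer_is_R = not outer_is_R

# Example: an equijoin on the first field, swapping inputs every iteration.
if __name__ == "__main__":
    R = [(1, "r1"), (2, "r2"), (3, "r3")]
    S = [(2, "s1"), (3, "s2"), (3, "s3")]
    out = sorted(symmetric_nested_loops(R, S, lambda r, s: r[0] == s[0]))
    assert out == [((2, "r2"), (2, "s1")), ((3, "r3"), (3, "s2")), ((3, "r3"), (3, "s3"))]
    print(out)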
If there are duplicates in a merge join, the duplicates are handled by an Note that the combination of commutativity and moments
asymmetric but usually small nested loop. For purposes of exposition, we can
ignore this detail here.
of symmetry allows for very aggressive reordering of a plan

tree. A single -ary operator representing a reorderable plan have infrequent moments of symmetry and imbalanced barri-
tree is therefore an attractive abstraction, since it encapsulates ers, making them undesirable as well.
any ordering that may be subject to change. We will exploit The other algorithms we consider are based on frequent-
this abstraction directly, by interposing an -ary tuple router ly-symmetric versions of traditional iteration, hashing and in-
(an “eddy”) between the input tables and the join operators. dexing schemes, i.e., the Ripple Joins [HH99]. Note that the
2.2.1 Joins and Indexes
original pipelined hash join of [WA91] is a constrained ver-
sion of the hash ripple join. The external hashing extensions
Nested-loops joins can take advantage of indexes on the in- of [UF99, IFF 99] are directly applicable to the hash rip- 

ner relation, resulting in a fairly efficient pipelining join algo- ple join, and [HH99] treats index joins as a special case as
rithm. An index nested-loops join (henceforth an “index join”) well. For non-equijoins, the block ripple join algorithm is ef-
is inherently asymmetric, since one input relation has been fective, having frequent moments of symmetry, particularly
pre-indexed. Even when indexes exist on both inputs, chang- at the beginning of processing [HH99]. Figure 3 illustrates
ing the choice of inner and outer relation “on the fly” is prob- block, index and hash ripple joins; the reader is referred to
lematic2 . Hence for the purposes of reordering, it is simpler [HH99, IFF 99, UF99] for detailed discussions of these al- 

to think of an index join as a kind of unary selection operator gorithms and their variants. These algorithms are adaptive
on the unindexed input (as in the join of and in Figure 1).  t

without sacrificing much performance: [UF99] and [IFF 99] 

The only distinction between an index join and a selection is demonstrate scalable versions of hash ripple join that perform
that – with respect to the unindexed relation – the selectivity competitively with hybrid hash join in the static case; [HH99]
of the join node may be greater than 1. Although one cannot shows that while block ripple join can be less efficient than
swap the inputs to a single index join, one can reorder an index nested-loops join, it arrives at moments of symmetry much
join and its indexed relation as a unit among other operators in more frequently than nested-loops joins, especially in early
a plan tree. Note that the logic for indexes can be applied to stages of processing. In [AH99] we discuss the memory over-
external tables that require bindings to be passed; such tables heads of these adaptive algorithms, which can be larger than
may be gateways to, e.g., web pages with forms, GIS index standard join algorithms.
systems, LDAP servers and so on [HKWY97, FMLS99]. Ripple joins have moments of symmetry at each “corner”
2.2.2 Physical Properties, Predicates and Commutativity
of a rectangular ripple in Figure 3, i.e., whenever a prefix of


the input stream has been joined with all tuples in a prefix
Clearly, a pre-optimizer’s choice of an index join algorithm


of input stream and vice versa. For hash ripple joins and in-
constrains the possible join orderings. In the -ary join view,


dex joins, this scenario occurs between each consecutive tuple


an ordering constraint must be imposed so that the unindexed consumed from a scanned input. Thus ripple joins offer very
join input is ordered before (but not necessarily directly be- frequent moments of symmetry.
fore) the indexed input. This constraint arises because of a Ripple joins are attractive with respect to barriers as well.
physical property of an input relation: indexes can be probed Ripple joins were designed to allow changing rates for each
but not scanned, and hence cannot appear before their cor- input; this was originally used to proactively expend more pro-
responding probing tables. Similar but more complex con- cessing on the input relation with more statistical influence on
straints can arise in preserving the ordered inputs to a merge intermediate results. However, the same mechanism allows re-
join (i.e., preserving “interesting orders”). active adaptivity in the wide-area scenario: a barrier is reached
The applicability of certain join algorithms raises additional at each corner, and the next corner can adaptively reflect the
constraints. Many join algorithms work only for equijoins, and relative rates of the two inputs. For the block ripple join, the
will not work on other joins like Cartesian products. Such al- next corner is chosen upon reaching the previous corner; this
gorithms constrain reorderings on the plan tree as well, since can be done adaptively to reflect the relative rates of the two
they always require all relations mentioned in their equijoin inputs over time.
predicates to be handled before them. In this paper, we con- The ripple join family offers attractive adaptivity features
sider ordering constraints to be an inviolable aspect of a plan at a modest overhead in performance and memory footprint.
tree, and we ensure that they always hold. In Section 6 we Hence they fit well with our philosophy of sacrificing marginal
sketch initial ideas on relaxing this requirement, by consider- speed for adaptability, and we focus on these algorithms in
ing multiple join algorithms and query graph spanning trees. Telegraph.
2.2.3 Join Algorithms and Reordering
3 Rivers and Eddies
In order for an eddy to be most effective, we favor join algo-


rithms with frequent moments of symmetry, adaptive or non- The above discussion allows us to consider easily reordering
existent barriers, and minimal ordering constraints: these al- query plans at moments of symmetry. In this section we pro-
gorithms offer the most opportunities for reoptimization. In ceed to describe the eddy mechanism for implementing re-
[AH99] we summarize the salient properties of a variety of ordering in a natural manner during query processing. The
join algorithms. Our desire to avoid blocking rules out the use techniques we describe can be used with any operators, but al-
of hybrid hash join, and our desire to minimize ordering con- gorithms with frequent moments of symmetry allow for more
straints and barriers excludes merge joins. Nested loops joins frequent reoptimization. Before discussing eddies, we first in-
troduce our basic query processing environment.
² In unclustered indexes, the index ordering is not the same as the scan order-
ing. Thus after a reordering of the inputs it is difficult to ensure that – using the
terminology of Section 2.2 – lookups on the index of the new “inner” relation
produce only tuples between the remembered cursor position and the end of the relation.
3.1 River
We implemented eddies in the context of River [AAT 99], a 

shared-nothing parallel query processing framework that dy-



Figure 3: Tuples generated by block, index, and hash ripple join. In block ripple, all tuples are generated by the join, but some may
be eliminated by the join predicate. The arrows for index and hash ripple join represent the logical portion of the cross-product
space checked so far; these joins only expend work on tuples satisfying the join predicate (black dots). In the hash ripple diagram,
one relation arrives 3 faster than the other. €

namically adapts to fluctuations in performance and workload. that they are relatively low selectivity), followed by as many
River has been used to robustly produce near-record perfor- arbitrary non-equijoin edges as required to complete a span-
mance on I/O-intensive benchmarks like parallel sorting and ning tree.
hash joins, despite heterogeneities and dynamic variability in Given a spanning tree of the query graph, the pre-optimizer
hardware and workloads across machines in a cluster. For needs to choose join algorithms for each edge. Along each
more details on River’s adaptivity and parallelism features, the equijoin edge it can use either an index join if an index is avail-
interested reader is referred to the original paper on the topic able, or a hash ripple join. Along each non-equijoin edge it can
[AAT 99]. In Telegraph, we intend to leverage the adaptabil-

use a block ripple join.
ity of River to allow for dynamic shifting of load (both query These are simple heuristics that we use to allow us to focus
processing and data delivery) in a shared-nothing parallel en- on our initial eddy design; in Section 6 we present initial ideas
vironment. But in this paper we restrict ourselves to basic on making spanning tree and algorithm decisions adaptively.
(single-site) features of eddies; discussions of eddies in par- = ! 1 
   B 

 3 *
#

?
* 

allel rivers are deferred to Section 6.


Since we do not discuss parallelism here, a very simple An eddy is implemented via a module in a river containing
overview of the River framework suffices. River is a dataflow an arbitrary number of input relations, a number of partici-
query engine, analogous in many ways to Gamma [DGS 90], 

pating unary and binary modules, and a single output relation


Volcano [Gra90] and commercial parallel database engines, (Figure 1)3 . An eddy encapsulates the scheduling of its par-
in which “iterator”-style modules (query operators) commu- ticipating operators; tuples entering the eddy can flow through
nicate via a fixed dataflow graph (a query plan). Each mod- its operators in a variety of orders.
ule runs as an independent thread, and the edges in the graph In essence, an eddy explicitly merges multiple unary and
correspond to finite message queues. When a producer and binary operators into a single -ary operator within a query
consumer run at differing rates, the faster thread may block plan, based on the intuition from Section 2.2 that symmetries
on the queue waiting for the slower thread to catch up. As can be easily captured in an -ary operator. An eddy module
in [UFA98], River is multi-threaded and can exploit barrier- maintains a fixed-sized buffer of tuples that are to be processed
free algorithms by reading from various inputs at indepen- by one or more operators. Each operator participating in the
dent rates. The River implementation we used derives from eddy has one or two inputs that are fed tuples by the eddy, and
the work on Now-Sort [AAC 97], and features efficient I/O 

an output stream that returns tuples to the eddy. Eddies are so


mechanisms including pre-fetching scans, avoidance of oper- named because of this circular data flow within a river.
ating system buffering, and high-performance user-level net- A tuple entering an eddy is associated with a tuple descrip-
working. tor containing a vector of Ready bits and Done bits, which
= ! !

F  *
% >

:   (  L   


indicate respectively those operators that are eligible to pro-


cess the tuple, and those that have already processed the tuple.
Although we will use eddies to reorder tables among joins, The eddy module ships a tuple only to operators for which the
a heuristic pre-optimizer must choose how to initially pair off corresponding Ready bit turned on. After processing the tuple,
relations into joins, with the constraint that each relation par- the operator returns it to the eddy, and the corresponding Done
ticipates in only one join. This corresponds to choosing a span- bit is turned on. If all the Done bits are on, the tuple is sent
ning tree of a query graph, in which nodes represent relations to the eddy’s output; otherwise it is sent to another eligible
and edges represent binary joins [KBZ86]. One reasonable operator for continued processing.
heuristic for picking a spanning tree forms a chain of cartesian 

products across any tables known to be very small (to handle Nothing prevents the use of -ary operators with ‚ in an eddy, but ‚ ƒ „

since implementations of these are atypical in database query processing we do


“star schemas” when base-table cardinality statistics are avail- not discuss them here.
able); it then picks arbitrary equijoin edges (on the assumption
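The Python sketch below (our own paraphrase, with an invented edge representation, cardinality threshold, and function names) captures the spirit of the pre-optimization heuristics described in this subsection: build a spanning tree of the query graph, chaining very small tables and favoring equijoin edges, then tag each chosen edge with an index join, hash ripple join, or block ripple join.

# Sketch of the pre-optimizer heuristics: choose a spanning tree of the query
# graph -- first chaining any very small tables, then equijoin edges, then
# remaining non-equijoin edges -- and pick a join algorithm for each edge.

def pre_optimize(relations, edges, cardinalities, tiny=100):
    """edges: {(r1, r2): {"equijoin": bool, "indexed": bool}}.
    Returns [(r1, r2, algorithm), ...] forming a spanning tree."""
    parent = {r: r for r in relations}

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return False
        parent[ra] = rb
        return True

    def algorithm(info):
        if info.get("equijoin"):
            return "index join" if info.get("indexed") else "hash ripple join"
        return "block ripple join"

    tree = []

    def add(a, b, info):
        if union(a, b):                      # only keep tree (acyclic) edges
            tree.append((a, b, algorithm(info)))

    # 1. chain cartesian products across tables known to be very small
    #    (handles "star schemas" when cardinality statistics are available)
    small = [r for r in relations if cardinalities.get(r, tiny + 1) <= tiny]
    for a, b in zip(small, small[1:]):
        add(a, b, edges.get((a, b)) or edges.get((b, a)) or {"equijoin": False})

    # 2. then arbitrary equijoin edges, then non-equijoin edges
    for want_equijoin in (True, False):
        for (a, b), info in edges.items():
            if info.get("equijoin", False) == want_equijoin:
                add(a, b, info)
    return tree

if __name__ == "__main__":
    rels = ["R", "S", "T"]
    edges = {("R", "S"): {"equijoin": True, "indexed": False},
             ("S", "T"): {"equijoin": True, "indexed": True},
             ("R", "T"): {"equijoin": False}}
    print(pre_optimize(rels, edges, {"R": 10000, "S": 80000, "T": 10000}))
    # -> [('R', 'S', 'hash ripple join'), ('S', 'T', 'index join')]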

When an eddy receives a tuple from one of its inputs, it ze- Table Cardinality values in column a

roes the Done bits, and sets the Ready bits appropriately. In R 10,000 500 - 5500
the simple case, the eddy sets all Ready bits on, signifying S 80,000 0 - 5000
that any ordering of the operators is acceptable. When there T 10,000 N/A
are ordering constraints on the operators, the eddy turns on U 50,000 N/A
only the Ready bits corresponding to operators that can be ex-
ecuted initially. When an operator returns a tuple to the eddy, Table 1: Cardinalities of tables; values are uniformly dis-
the eddy turns on the Ready bit of any operator eligible to pro- tributed.
cess the tuple. Binary operators generate output tuples that
correspond to combinations of input tuples; in these cases, the
Done bits and Ready bits of the two input tuples are ORed. In
this manner an eddy preserves the ordering constraints while
maximizing opportunities for tuples to follow different possi-
ble orderings of the operators.
Two properties of eddies merit comment. First, note that ed-
dies represent the full class of bushy trees corresponding to the
set of join nodes – it is possible, for instance, that two pairs of
tuples are combined independently by two different join mod-
ules, and then routed to a third join to perform the 4-way con-
catenation of the two binary records. Second, note that eddies
do not constrain reordering to moments of symmetry across
the eddy as a whole. A given operator must carefully refrain
from fetching tuples from certain inputs until its next moment
of symmetry – e.g., a nested-loops join would not fetch a new
tuple from the current outer relation until it finished rescan-
ning the inner. But there is no requirement that all operators in
the eddy be at a moment of symmetry when this occurs; just
Figure 4: Performance of two 50% selections; s2 has cost 5, the cost of s1 varies across runs. (Axes: cost of s1 vs. completion time in seconds; curves: s1 before s2, s2 before s1, Naive, Lottery.)
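The following Python sketch (our own single-threaded simplification, not Telegraph code; the class and field names are invented) shows the Ready/Done bookkeeping described above for unary operators: fresh tuples enter at low priority, tuples returned by operators are drained first, and a tuple reaches the output only when every Done bit is set.

# Single-threaded sketch of an eddy routing tuples through unary operators
# using Ready/Done bitmasks and the low/high priority rule described above.

from collections import deque
from dataclasses import dataclass

@dataclass
class Descriptor:
    tuple_: object
    ready: int      # bitmask of operators eligible to process this tuple
    done: int = 0   # bitmask of operators that already processed it

class Eddy:
    def __init__(self, operators):
        # operators: list of (predicate, prerequisite_mask); the mask names
        # the operators that must run before this one becomes Ready.
        self.ops = operators
        self.all_done = (1 << len(operators)) - 1

    def _ready_mask(self, done):
        return sum(1 << i for i, (_, pre) in enumerate(self.ops)
                   if pre & ~done == 0)

    def run(self, source):
        returned = deque()          # high priority: tuples already in flight
        source = iter(source)
        output = []
        while True:
            if returned:
                d = returned.popleft()
            else:                   # low priority: admit a fresh input tuple
                t = next(source, None)
                if t is None:
                    return output
                d = Descriptor(t, ready=self._ready_mask(0))
            # pick an eligible operator (Ready but not Done); a routing policy
            # such as back-pressure or lottery tickets would make this choice.
            i = (d.ready & ~d.done).bit_length() - 1
            predicate, _ = self.ops[i]
            d.done |= 1 << i
            if not predicate(d.tuple_):
                continue            # predicate failed: the tuple is discarded
            if d.done == self.all_done:
                output.append(d.tuple_)       # every operator has seen it
            else:
                d.ready = self._ready_mask(d.done)
                returned.append(d)

# Example: two unconstrained selections.  Setting the second operator's
# prerequisite mask to 0b01 would force a fixed ordering, which is also how
# a static plan can be simulated by setting Ready bits in a fixed order.
if __name__ == "__main__":
    ops = [(lambda t: t % 2 == 0, 0), (lambda t: t > 10, 0)]
    print(Eddy(ops).run(range(20)))           # -> [12, 14, 16, 18]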
 W

the operator that is fetching a new tuple. Thus eddies are quite
as spin loops corresponding to their relative costs, followed
flexible both in the shapes of trees they can generate, and in
by a randomized selection decision with the appropriate selec-
the scenarios in which they can logically reorder operators.
tivity. We describe the relative costs of selections in terms of
†
#
   

y
&
 : . *  

    * 

abstract “delay units”; for studying optimization, the absolute


number of cycles through a spin loop are irrelevant. We imple-
An eddy module directs the flow of tuples from the inputs mented the simplest version of hash ripple join, identical to the
through the various operators to the output, providing the flex- original pipelining hash join [WA91]; our implementation here
ibility to allow each tuple to be routed individually through does not exert any statistically-motivated control over disk re-
the operators. The routing policy used in the eddy determines source consumption (as in [HH99]). We simulated index joins
the efficiency of the system. In this section we study some by doing random I/Os within a file, returning on average the
promising initial policies; we believe that this is a rich area for number of matches corresponding to a pre-programmed selec-
future work. We outline some of the remaining questions in tivity. The filesystem cache was allowed to absorb some of the
Section 6. index I/Os after warming up. In order to fairly compare eddies
An eddy’s tuple buffer is implemented as a priority queue to static plans, we simulate static plans via eddies that enforce
with a flexible prioritization scheme. An operator is always a static ordering on tuples (setting Ready bits in the correct
given the highest-priority tuple in the buffer that has the corre- order).
sponding Ready bit set. For simplicity, we start by considering †
! 1 ‘

?
*    B ’ , .    “ B

(  


>
: *     w    

a very simple priority scheme: tuples enter the eddy with low
priority, and when they are returned to the eddy from an oper- To illustrate how an eddy works, we consider a very simple
ator they are given high priority. This simple priority scheme single-table query with two expensive selection predicates, un-
ensures that tuples flow completely through the eddy before der the traditional assumption that no performance or selec-
new tuples are consumed from the inputs, ensuring that the tivity properties change during execution. Our SQL query is
eddy does not become “clogged” with new tuples. simply the following:
†
!
 r : *   ( *

 .
J
*   :

SELECT *
FROM U
In order to illustrate how eddies work, we present some initial WHERE AND ;  W Y `   Y `

experiments in this section; we pause briefly here to describe In our first experiment, we wish to see how well a “naive” eddy
our experimental setup. All our experiments were run on a can account for differences in costs among operators. We run
single-processor Sun Ultra-1 workstation running Solaris 2.6, the query multiple times, always setting the cost of to 5  

with 160 MB of RAM. We used the Euphrates implementation delay units, and the selectivities of both selections to 50%. In
of River [AAT 99]. We synthetically generated relations as in 

each run we use a different cost for , varying it between  W

Table 1, with 100 byte tuples in each relation. 1 and 9 delay units across runs. We compare a naive eddy
To allow us to experiment with costs and selectivities of se- of the two selections against both possible static orderings of
lections, our selection modules are (artificially) implemented

Figure 5: Performance of two selections of cost 5; s2 has 50% selectivity, the selectivity of s1 varies across runs. (Axes: selectivity of s1 vs. completion time in seconds; curves: s1 before s2, s2 before s1, Naive, Lottery.)
Figure 6: Tuple flow with the lottery scheme for the variable-selectivity experiment of Figure 5. (Axes: selectivity of s1 vs. cumulative % of tuples routed to s1 first; curves: Naive, Lottery.)

the two selections (and against a “lottery”-based eddy, about does not capture differing selectivities.
which we will say more in Section 4.3.) One might imagine To track both consumption and production over time, we
that the flexible routing in the naive eddy would deliver tuples enhance our priority scheme with a simple learning algorithm
to the two selections equally: half the tuples would flow to implemented via Lottery Scheduling [WW94]. Each time the
before , and half to
 W before , resulting in middling
     W
eddy gives a tuple to an operator, it credits the operator one
performance over all. Figure 4 shows that this is not the case: “ticket”. Each time the operator returns a tuple to the eddy,
the naive eddy nearly matches the better of the two orderings in one ticket is debited from the eddy’s running count for that op-
all cases, without any explicit information about the operators’ erator. When an eddy is ready to send a tuple to be processed,
relative costs. it “holds a lottery” among the operators eligible for receiving
The naive eddy’s effectiveness in this scenario is due to the tuple. (The interested reader is referred to [WW94] for
simple fluid dynamics, arising from the different rates of con- a simple and efficient implementation of lottery scheduling.)
sumption by and . Recall that edges in a River dataflow  W  
An operator’s chance of “winning the lottery” and receiving
graph correspond to fixed-size queues. This limitation has the the tuple corresponds to the count of tickets for that operator,
same effect as back-pressure in a fluid flow: production along which in turn tracks the relative efficiency of the operator at
the input to any edge is limited by the rate of consumption at draining tuples from the system. By routing tuples using this
the output. The lower-cost selection (e.g., at the left of Fig-  W
lottery scheme, the eddy tracks (“learns”) an ordering of the
ure 4) can consume tuples more quickly, since it spends less operators that gives good overall efficiency.
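As a rough illustration of the ticket bookkeeping just described (our own toy code, not the Telegraph implementation; [WW94] describes an efficient lottery-scheduling implementation), the sketch below credits an operator on each dispatch, debits it on each return, and draws lottery winners in proportion to their ticket counts.

# Sketch of lottery-based routing state: an operator is credited a ticket when
# it receives a tuple and debited one when it returns a tuple, so operators
# that drain tuples quickly and return few of them accumulate tickets.

import random

class LotteryRouter:
    def __init__(self, num_ops, floor=1):
        self.tickets = [floor] * num_ops   # keep a small floor so every
        self.floor = floor                 # operator retains some chance

    def choose(self, eligible):
        """Hold a lottery among the eligible operator indices."""
        weights = [max(self.tickets[i], self.floor) for i in eligible]
        return random.choices(eligible, weights=weights, k=1)[0]

    def on_dispatch(self, op):      # eddy hands a tuple to op: credit
        self.tickets[op] += 1

    def on_return(self, op):        # op hands a tuple back: debit
        self.tickets[op] -= 1

if __name__ == "__main__":
    # two selections of equal cost; operator 0 passes 10% of its input,
    # operator 1 passes 90%.  The router learns to favor operator 0 first.
    router, sels = LotteryRouter(2), [0.1, 0.9]
    first_choices = [0, 0]
    for _ in range(10000):
        op = router.choose([0, 1])
        first_choices[op] += 1
        router.on_dispatch(op)
        if random.random() < sels[op]:   # tuple survives and is returned
            router.on_return(op)
    print(first_choices)   # most tuples are routed to operator 0 first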
time per tuple; as a result the lower-cost operator exerts less The “lottery” curve in Figures 4 and 5 show the more in-
back-pressure on the input table. At the same time, the high- telligent lottery-based routing scheme compared to the naive
cost operator produces tuples relatively slowly, so the low-cost back-pressure scheme and the two static orderings. The lottery
operator will rarely be required to consume a high-priority, scheme handles both scenarios effectively, slightly improv-
previously-seen tuple. Thus most tuples are routed to the low- ing the eddy in the changing-cost experiment, and performing
cost operator first, even though the costs are not explicitly ex- much better than naive in the changing-selectivity experiment.
posed or tracked in any way. To explain this a bit further, in Figure 6 we display the per-
†
! =

,      B ’ ™ * 





y
J

* . *  
?

   * 
cent of tuples that followed the order (as opposed to  W   

 ) in the two eddy schemes; this roughly represents the


   W

The naive eddy works well for handling operators with differ- average ratio of lottery tickets possessed by and over  W  

ent costs but equal selectivity. But we have not yet considered time. Note that the naive back-pressure policy is barely sen-
differences in selectivity. In our second experiment we keep sitive to changes in selectivity, and in fact drifts slightly in
the costs of the operators constant and equal (5 units), keep the wrong direction as the selectivity of is increased. By  W

the selectivity of fixed at 50%, and vary the selectivity of  


contrast, the lottery-based scheme adapts quite nicely as the
across runs. The results in Figure 5 are less encouraging,
 W
selectivity is varied.
showing the naive eddy performing as we originally expected, In both graphs one can see that when the costs and selec-
about half-way between the best and worst plans. Clearly our tivities are close to equal ( ), the percent-  W Ÿ   Ÿ £ ¥ ¦

naive priority scheme and the resulting back-pressure are in- age of tuples following the cheaper order is close to 50%.
sufficient to capture differences in selectivity. This observation is intuitive, but quite significant. The lottery-
To resolve this dilemma, we would like our priority scheme based eddy approaches the cost of an optimal ordering, but
to favor operators based on both their consumption and pro- does not concern itself about strictly observing the optimal or-
duction rate. Note that the consumption (input) rate of an oper- dering. Contrast this to earlier work on runtime reoptimiza-
ator is determined by cost alone, while the production (output) tion [KD98, UFA98, IFF 99], where a traditional query op- 

rate is determined by a product of cost and selectivity. Since timizer runs during processing to determine the optimal plan
an operator’s back-pressure on its input depends largely on its remnant. By focusing on overall cost rather than on finding
consumption rate, it is not surprising that our naive scheme

the optimal plan, the lottery scheme probabilistically provides
nearly optimal performance with much less effort, allowing
re-optimization to be done with an extremely lightweight tech-
nique that can be executed multiple times for every tuple.
A related observation is that the lottery algorithm gets closer
to perfect routing (≈ 0%) on the right of Figure 6 than it
does (≈ 100%) on the left. Yet in the corresponding perfor-
mance graph (Figure 5), the differences between the lottery-
Á

Index First
Â

based eddy and the optimal static ordering do not change much
in the two settings. This phenomenon is explained by exam-
ining the “jeopardy” of making ordering errors in either case.
Consider the left side of the graph, where the selectivity of s1
is 10%, s2 is 50%, and the costs of each are c = 5 delay units.
Let p be the rate at which tuples are routed erroneously (to s2
before s1 in this case). Then the expected cost of the query
is (1 − p) · 1.1c + p · 1.5c = 0.4pc + 1.1c. By contrast, in
the second case where the selectivity of s1 is changed to 90%,
the expected cost is (1 − p) · 1.5c + p · 1.9c = 0.4pc + 1.5c.
Figure 7: Performance of two joins: a selective Index Join and a Hash Join. (Y-axis: execution time of plan in seconds; legend: Hash First, Lottery, Naive, Index First.)

Since the jeopardy is higher at 90% selectivity than at 10%, the 150

lottery more aggressively favors the optimal ordering at 90%


selectivity than at 10%.

4.4 Joins
We have discussed selections up to this point for ease of ex-
position, but of course joins are the more common expensive
operator in query processing. In this section we study how
eddies interact with the pipelining ripple join algorithms. For
180%, SR before ST
the moment, we continue to study a static performance envi- 50
ronment, validating the ability of eddies to do well even in
scenarios where static techniques are most effective.
We begin with a simple 3-table query:
SELECT *
FROM R, S, T
WHERE R.a = S.a
AND S.b = T.b
Figure 8: Performance of hash joins R ⋈ S and S ⋈ T; R ⋈ S
has selectivity 100% w.r.t. S, the selectivity of S ⋈ T
In our experiment, we constructed a preoptimized plan with a

hash ripple join between and , and an index join between  


w.r.t. varies between 20% and 180% in the two runs. 

and . Since our data is uniformly distributed, Table 1 in-


quite close in the case of the experiment where the hash join
 

dicates that the selectivity of the join is ; its


should precede the index join. In this case, the relative cost
  W ® ¸ € W ¥ ¹ º

selectivity with respect to is 180% – i.e., each tuple enter-


of index join is so high that the jeopardy of choosing it first
 

ing the join finds 1.8 matching tuples on average [Hel98].


drives the hash join to nearly always win the lottery.


We artificially set the selectivity of the index join w.r.t. to be 

(overall selectivity
W ¥ ¦ ). Figure 7 shows the relative W € W ¥
¹ ¼
†
! Æ #
*  : 

 

y   “ B

(  , .      



performance of our two eddy schemes and the two static join
orderings. The results echo our results for selections, show- Eddies should adaptively react over time to the changes in
ing the lottery-based eddy performing nearly optimally, and performance and data characteristics described in Section 1.1.
the naive eddy performing in between the best and worst static The routing schemes described up to this point have not con-
plans. sidered how to achieve this. In particular, our lottery scheme
As noted in Section 2.2.1, index joins are very analogous to weighs all experiences equally: observations from the distant
selections. Hash joins have more complicated and symmetric past affect the lottery as much as recent observations. As a re-
behavior, and hence merit additional study. Figure 8 presents sult, an operator that earns many tickets early in a query may
performance of two hash-ripple-only versions of this query. become so wealthy that it will take a great deal of time for it
Our in-memory pipelined hash joins all have the same cost. to lose ground to the top achievers in recent history.
We change the data in and so that the selectivity of the    
To avoid this, we need to modify our point scheme to for-
join w.r.t. is 20% in one version, and 180% in the other. get history to some extent. One simple way to do this is to
use a window scheme, in which time is partitioned into win-
  

In all runs, the selectivity of the join predicate w.r.t. is


dows, and the eddy keeps track of two counts for each op-
  

fixed at 100%. As the figure shows, the lottery-based eddy


continues to perform nearly optimally. erator: a number of banked tickets, and a number of escrow
Figure 9 shows the percent of tuples in the eddy that follow tickets. Banked tickets are used when running a lottery. Es-
one order or the other in all four join experiments. While the crow tickets are used to measure efficiency during the win-
eddy is not strict about following the optimal ordering, it is dow. At the beginning of the window, the value of the es-

100

cumulative % of tuples routed to Index #1 first.


cumulative % of tuples routed to the correct join first 100

80
80

60
60
index beats hash
Ç
hash beats index
hash/hash 20%
hash/hash 180% 40
40

20
20

0 0
0 Í 20 40 60 80 100
Í

% of tuples seen. Î

Figure 9: Percent of tuples routed in the optimal order in all of Figure 11: Adapting to changing join costs: tuple movement.
the join experiments.
plan ( before ), the initial join begins fast, processing Ê Ë Ì Ê Ì Ë

5000 about 29,000 tuples, and passing about 290 of those to the sec-
ond (slower) join. After 30 seconds, the second join becomes
fast and handles the remainder of the 290 tuples quickly, while
execution time of plan (secs)

4000 the first join slowly processes the remaining 1,000 tuples at 5
seconds per tuple. The eddy outdoes both static plans: in the
first phase it behaves identically to the second static plan, con-
3000
I_sf first suming 29,000 tuples and queueing 290 for the eddy to pass
Eddy to . Just after phase 2 begins, the eddy adapts its ordering
È

Ì Ë
Ê

I_fs first
2000 and passes tuples to – the new fast join – first. As a result, Ê Ì Ë

the eddy spends 30 seconds in phase one, and in phase two


it has less then 290 tuples queued at (now fast), and only Ê Ì Ë

1000 1,000 tuples to process, only about 10 of which are passed to


Ê (now slow). Ë Ì

A similar, more controlled experiment illustrates the eddy’s


0 adaptability more clearly. Again, we run a three-table join,
with two external indexes that return a match 10% of the time.
Figure 10: Adapting to changing join costs: performance. We read 4,000 tuples from the scanned table, and toggle costs
between 1 and 100 cost units every 1000 tuples – i.e., three
crow account replaces the value of the banked account (i.e., times during the experiment. Figure 11 shows that the eddy
banked = escrow), and the escrow account is reset (es- adapts correctly, switching orders when the operator costs
crow = 0). This scheme ensures that operators “re-prove switch. Since the cost differential is less dramatic here, the
themselves” each window. jeopardy is lower and the eddy takes a bit longer to adapt. De-
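A minimal sketch of this window scheme, under our own naming and an arbitrary window length, might look as follows: banked tickets drive the lottery, while escrow tickets accumulate during the window and replace the banked count when the window closes.

# Sketch of the banked/escrow window scheme: credits and debits accrue to the
# escrow account, lotteries consult the banked account, and at each window
# boundary banked = escrow and escrow is reset to 0.

class WindowedTickets:
    def __init__(self, num_ops, window=1000):
        self.banked = [1] * num_ops    # used when holding a lottery
        self.escrow = [0] * num_ops    # measures efficiency in this window
        self.window = window
        self.events = 0

    def credit(self, op):              # tuple dispatched to op
        self.escrow[op] += 1
        self._tick()

    def debit(self, op):               # tuple returned by op
        self.escrow[op] -= 1
        self._tick()

    def lottery_weights(self):
        return [max(b, 1) for b in self.banked]

    def _tick(self):
        self.events += 1
        if self.events % self.window == 0:
            self.banked = list(self.escrow)        # banked = escrow
            self.escrow = [0] * len(self.escrow)   # escrow = 0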
We consider a scenario of a 3-table equijoin query, where spite the learning time, the trends are clear – the eddy sends
two of the tables are external and used as “inner” relations most of the first 1000 tuples to index #1 first, which starts off
by index joins. Our third relation has 30,000 tuples. Since cheap. It sends most of the second 1000 tuples to index #2
we assume that the index servers are remote, we implement first, causing the overall percentage of tuples to reach about
the “cost” in our index module as a time delay (i.e., while 50%, as reflected by the near-linear drift toward 50% in the
(gettimeofday() < x);) rather than a spin loop; this

second quarter of the graph. This pattern repeats in the third


better models the behavior of waiting on an external event like and fourth quarters, with the eddy eventually displaying an
a network response. We have two phases in the experiment: even use of the two orderings over time – always favoring the
initially, one index (call it ) is fast (no time delay) and the Ê Ë Ì

best ordering.
other ( ) is slow (5 seconds per lookup). After 30 seconds Ê
Ì Ë

For brevity, we omit here a similar experiment in which


we begin the second phase, in which the two indexes swap we fixed costs and modified selectivity over time. The re-
speeds: the index becomes slow, and becomes fast. Ê Ë Ì Ê Ì Ë

sults were similar, except that changing only the selectivity of


Both indexes return a single matching tuple 1% of the time. two operators results in less dramatic benefits for an adaptive
Figure 10 shows the performance of both possible static scheme. This can be seen analytically, for two operators of
plans, compared with an eddy using a lottery with a window cost whose selectivites are swapped from low to hi in a man- Q

scheme. As we would hope, the eddy is much faster than ei- ner analogous to our previous experiment. To lower-bound the
ther static plan. In the first static plan ( before ), the Ê Ì Ë Ê Ë Ì

performance of either static ordering, selectivities should be


initial index join in the plan is slow in the first phase, process- toggled to their extremes (100% and 0%) for equal amounts of
ing only 6 tuples and discarding all of them. In the remainder time – so that half the tuples go through both operators. Ei-
of the run, the plan quickly discards 99% of the tuples, passing ther static plan thus takes time, whereas an optimal Q ° W Ï  Q

300 to the (now) expensive second join. In the second static


execution time of plan (secs)

150

begins producing tuples (at 43.5 on the x axis of Figure 13),


RS First
Ñ

100 the values bottled up in the


Eddy
Ò

 join burst forth, and the  

eddy quickly throttles the


ST First join, allowing the join to
Ó

   

process most tuples first. This scenario indicates two prob-


lems with our implementation. First, our ticket scheme does
50 not capture the growing selectivity inherent in a join with a
delayed input. Second, storing tuples inside the hash tables of
a single join unnecessarily prevents other joins from process-
ing them; it might be conceivable to hash input tuples within
0 multiple joins, if care were taken to prevent duplicate results
from being generated. A solution to the second problem might
Figure 12: Adapting to an initial delay on  : performance obviate the need to solve the first; we intend to explore these
100
issues further in future work.
For brevity, we omit here a variation of this experiment, in
which we delayed the delivery of by 10 seconds instead of
cumulative % of tuples routed to ST first

80 . In this case, the delay of affects both joins identically,


 

and simply slows down the completion time of all plans by


about 10 seconds.
60 Æ #

* .  *  ×   Ù

40 To our knowledge, this paper represents the first general query


processing scheme for reordering in-flight operators within a
pipeline, though [NWMN99] considers the special case of
20 unary operators. Our characterization of barriers and moments
of symmetry also appears to be new, arising as it does from our
interest in reoptimizing general pipelines.
0
0 Ô 20 40 60 80 100
Ô
Recent papers consider reoptimizing queries at the ends of
% of S tuples seen. pipelines [UFA98, KD98, IFF 99], reordering operators only 

after temporary results are materialized. [IFF 99] observantly 

notes that this approach dates back to the original INGRES


query decomposition scheme [SWK76]. These inter-pipeline
techniques are not adaptive in the sense used in traditional con-
Figure 13: Adapting to an initial delay on : tuple movement. trol theory (e.g., [Son98]) or machine learning (e.g., [Mit97]);
they make decisions without any ongoing feedback from the


operations they are to optimize, instead performing static op-


dynamic plan takes time, a ratio of only 3/2. With more op- Q

timizations at coarse-grained intervals in the query plan. One


erators, adaptivity to changes in selectivity can become more can view these efforts as complementary to our work: eddies
significant, however. can be used to do tuple scheduling within pipelines, and tech-
niques like those of [UFA98, KD98, IFF 99] can be used to
! Æ ! ?
† “ * . B *  “ * .  *  B

As a final experiment, we study the case where an input rela- reoptimize across pipelines. Of course such a marriage sac-
tion suffers from an initial delay, as in [AFTU96, UFA98]. We rifices the simplicity of eddies, requiring both the traditional
return to the 3-table query shown in the left of Figure 8, with complexity of cost estimation and plan enumeration along with
the selectivity at 100%, and the selectivity at 20%. We the ideas of this paper. There are also significant questions on
how best to combine these techniques – e.g., how many mate-
   

delay the delivery of by 10 seconds; the results are shown


rialization operators to put in a plan, which operators to put in


in Figure 12. Unfortunately, we see here that our eddy – even


with a lottery and a window-based forgetting scheme – does which eddy pipelines, etc.
not adapt to initial delays of as well as it could. Figure 13 DEC Rdb (subsequently Oracle Rdb) used competition to
choose among different access methods [AZ96]. Rdb briefly


tells some of the story: in the early part of processing, the


eddy incorrectly favors the join, even though no tuples observed the performance of alternative access methods at run-
time, and then fixed a “winner” for the remainder of query
  

are streaming in, and even though the join should appear
execution. This bears a resemblance to sampling for cost esti-
 

second in a normal execution (Figure 8). The eddy does this


because it observes that the join does not produce any out- mation (see [BDF 97] for a survey). More distantly related is


the work on “parameterized” or “dynamic” query plans, which


 

put tuples when given tuples. So the eddy awards most


postpone some optimization decisions until the beginning of
 

tuples to the join initially, which places them in an internal


query execution [INSS97, GC94].
 

hash table to be subsequently joined with tuples when they


The initial work on Query Scrambling [AFTU96] studied


arrive. The join is left to fetch and hash tuples. This


network unpredictabilities in processing queries over wide-
  

wastes resources that could have been spent joining tuples


area sources. This work materialized remote data while pro-


with tuples during the delay, and “primes” the join to


cessing was blocked waiting for other sources, an idea that
  

produce a large number of tuples once the s begin appearing.


can be used in concert with eddies. Note that local material-


Note that the eddy does far better than pessimally: when
ization ameliorates but does not remove barriers: work to be


done locally after a barrier can still be quite significant. Later of output from a plan, and the rate of refinement for online ag-
work focused on rescheduling runnable sub-plans during ini- gregation estimators. We have also begun studying schemes to
tial delays in delivery [UFA98], but did not attempt to reorder allow eddies to effectively order dependent predicates, based
in-flight operators as we do here. on reinforcement learning [SB98]. In a related vein, we would
Two out-of-core versions of the pipelined hash join have like to automatically tune the aggressiveness with which we
been proposed recently [IFF 99, UF99]. The X-Join [UF99] 
forget past observations, so that we avoid introducing a tun-
enhances the pipelined hash join not only by handling the out- ing knob to adjust window-length or some analogous constant
of-core case, but also by exploiting delay time to aggressively (e.g., a hysteresis factor).
match previously-received (and spilled) tuples. We intend to Another main goal is to attack the remaining static aspects
experiment with X-Joins and eddies in future work. of our scheme: the “pre-optimization” choices of spanning
The Control project [HAC 99] studies interactive analysis 
tree, join algorithms, and access methods. Following [AZ96],
of massive data sets, using techniques like online aggregation, we believe that competition is key here: one can run multi-
online reordering and ripple joins. There is a natural syn- ple redundant joins, join algorithms, and access methods, and
ergy between interactive and adaptive query processing; online track their behavior in an eddy, adaptively choosing among
techniques to pipeline best-effort answers are naturally adap- them over time. The implementation challenge in that sce-
tive to changing performance scenarios. The need for opti- nario relates to preventing duplicates from being generated,
mizing pipelines in the Control project initially motivated our while the efficiency challenge comes in not wasting too many
work on eddies. The Control project [HAC 99] is not ex- 
computing resources on unpromising alternatives.
plicitly related to the field of control theory [Son98], though A third major challenge is to harness the parallelism and
eddies appears to link the two in some regards. adaptivity available to us in rivers. Massively parallel systems
The River project [AAT 99] was another main inspiration 
are reaching their limit of manageability, even as data sizes
of this work. River allows modules to work as fast as they continue to grow very quickly. Adaptive techniques like ed-
can, naturally balancing flow to whichever modules are faster. dies and rivers can significantly aid in the manageability of a
We carried the River philosophy into the intial back-pressure new generation of massively parallel query processors. Rivers
design of eddies, and intend to return to the parallel load- have been shown to adapt gracefully to performance changes
balancing aspects of the optimization problem in future work. in large clusters, spreading query processing load across nodes
In addition to commercial projects like those in Section 1.2, and spreading data delivery across data sources. Eddies face
there have been numerous research systems for heterogeneous additional challenges to meet the promise of rivers: in particu-
data integration, e.g. [GMPQ 97, HKWY97, IFF 99], etc.  
lar, reoptimizing queries with intra-operator parallelism entails
Ú
w 

.    



 ,     * ×   Ù
repartitioning data, which adds an expense to reordering that
was not present in our single-site eddies. An additional com-
Query optimization has traditionally been viewed as a coarse- plication arises when trying to adaptively adjust the degree of
grained, static problem. Eddies are a query processing mech- partitioning for each operator in a plan. On a similar note, we
anism that allow fine-grained, adaptive, online optimization. would like to explore enhancing eddies and rivers to tolerate
Eddies are particularly beneficial in the unpredictable query failures of sources or of participants in parallel execution.
processing environments prevalent in massive-scale systems, Finally, we are exploring the application of eddies and rivers
and in interactive online query processing. They fit naturally to the generic space of dataflow programming, including appli-
with algorithms from the Ripple Join family, which have fre- cations such as multimedia analysis and transcoding, and the
quent moments of symmetry and adaptive or non-existent syn- composition of scalable, reliable internet services [GWBC99].
chronization barriers. Eddies can be used as the sole optimiza- Our intent is for rivers to serve as a generic parallel dataflow
tion mechanism in a query processing system, obviating the engine, and for eddies to be the main scheduling mechanism
need for much of the complex code required in a traditional in that environment.
query optimizer. Alternatively, eddies can be used in con- Ù

 A . *  y ( *

 

cert with traditional optimizers to improve adaptability within


pipelines. Our initial results indicate that eddies perform well Vijayshankar Raman provided much assistance in the course
under a variety of circumstances, though some questions re- of this work. Remzi Arpaci-Dusseau, Eric Anderson and Noah
main in improving reaction time and in adaptively choosing Treuhaft implemented Euphrates, and helped implement ed-
join orders with delayed sources. We are sufficiently encour- dies. Mike Franklin asked hard questions and suggested direc-
aged by these early results that we are using eddies and rivers tions for future work. Stuart Russell, Christos Papadimitriou,
as the basis for query processing in the Telegraph system. Alistair Sinclair, Kris Hildrum and Lakshminarayanan Subra-
In order to focus our energies in this initial work, we have manian all helped us focus on formal issues. Thanks to Navin
explicitly postponed a number of questions in understanding, Kabra and Mitch Cherniack for initial discussions on run-time
tuning, and extending these results. One main challenge is reoptimization, and to the database group at Berkeley for feed-
to develop eddy “ticket” policies that can be formally proved back. Stuart Russell suggested the term “eddy”.
to converge quickly to a near-optimal execution in static sce- This work was done while both authors were at UC Berke-
narios, and that adaptively converge when conditions change. ley, supported by a grant from IBM Corporation, NSF grant
This challenge is complicated by considering both selections IIS-9802051, and a Sloan Foundation Fellowship. Computing
and joins, including hash joins that “absorb” tuples into their and network resources for this research were provided through
hash tables as in Section 4.5.1. We intend to focus on multiple NSF RI grant CDA-9401156.
performance metrics, including time to completion, the rate

# 

* E *  * * 

[HH99] P. J. Haas and J. M. Hellerstein. Ripple Joins for Online Ag-


gregation. In Proc. ACM-SIGMOD International Conference on
[AAC 97] Ü A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, D. E. Culler, J. M. Management of Data, pages 287–298, Philadelphia, 1999.
Hellerstein, and D. A. Patterson. High-Performance Sorting on
Networks of Workstations. In Proc. ACM-SIGMOD Interna- [HKWY97] L. Haas, D. Kossmann, E. Wimmers, and J. Yang. Optimizing
tional Conference on Management of Data, Tucson, May 1997. Queries Across Diverse Data Sources. In Proc. 23rd Interna-
tional Conference on Very Large Data Bases (VLDB), Athens,
Retrospective on Aurora
Hari Balakrishnan³, Magdalena Balazinska³, Don Carney², Uğur Çetintemel², Mitch Cherniack¹, Christian Convey², Eddie Galvez¹, Jon Salz³, Michael Stonebraker³, Nesime Tatbul², Richard Tibbetts³, Stan Zdonik²

¹ Department of Computer Science, Brandeis University, Waltham, MA 02454, USA (e-mail: {mfc, eddie}@cs.brandeis.edu)
² Department of Computer Science, Brown University, Providence, RI 02912, USA (e-mail: {dpc, ugur, cjc, tatbul, sbz}@cs.brown.edu)
³ Department of EECS and Laboratory of Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA (e-mail: {hari, mbalazin, jsalz, stonebraker, tibbetts}@lcs.mit.edu)

Abstract. This experience paper summarizes the key lessons we learned throughout the design and implementation of the Aurora stream-processing engine. For the past 2 years, we have built five stream-based applications using Aurora. We first describe in detail these applications and their implementation in Aurora. We then reflect on the design of Aurora based on this experience. Finally, we discuss our initial ideas on a follow-on project, called Borealis, whose goal is to eliminate the limitations of Aurora as well as to address new key challenges and applications in the stream-processing domain.

Keywords: Data stream management – Stream-processing engines – Monitoring applications – Distributed stream processing – Quality-of-service

1 Introduction and history

Over the last several years, a great deal of progress has been made in the area of stream-processing engines (SPEs) [7,9,15]. Three basic tenets distinguish SPEs from current data-processing engines. First, they must support primitives for streaming applications. Unlike OLTP, which processes messages in isolation, streaming applications entail time series operations on streams of messages. Although a time series “blade” was added to the Illustra Object-Relational DBMS, generally speaking, time series operations are not well supported by current DBMSs. Second, streaming applications entail a real-time component. If one is content to see an answer later, then one can store incoming messages in a data warehouse and run a historical query on the warehouse to find information of interest. This tactic does not work if the answer must be constructed in real time. Real time also dictates a fundamentally different storage architecture. DBMSs universally store and index data records before making them available for query activity. Such outbound processing, where data are stored before being processed, cannot deliver real-time latency, as required by SPEs. To meet more stringent latency requirements, SPEs must adopt an alternate model, inbound processing, where query processing is performed directly on incoming messages before (or instead of) storing them. Lastly, an SPE must have capabilities to gracefully deal with spikes in message load. Fundamentally, incoming traffic is bursty, and it is desirable to selectively degrade the performance of the applications running on an SPE.

The Aurora stream-processing engine, motivated by these three tenets, is currently operational. It consists of some 100K lines of C++ and Java and runs on both Unix- and Linux-based platforms. It was constructed with the cooperation of students and faculty at Brown, Brandeis, and MIT. The fundamental design of the engine has been well documented elsewhere: the architecture of the engine is described in [7], while the scheduling algorithms are presented in [8]. Load-shedding algorithms are presented in [18], and our approach to high availability in a multisite Aurora installation is covered in [10,13]. Lastly, we have been involved in a collective effort to define a benchmark that described the sort of monitoring applications that we have in mind. The result of this effort is called Linear Road and is described in [4].

Recently, we have used Aurora to build five different application systems. Throughout the process, we have learned a great deal about the key requirements of streaming applications. In this paper, we reflect on the design of Aurora based on this experience.

The first application is an Aurora implementation of Linear Road, mentioned above. In addition to Linear Road, we have implemented a pilot application that detects late arrival of messages in a financial-services feed-processing environment. Furthermore, one of our collaborators, a military medical research laboratory [20], asked us to build a system to monitor the levels of hazardous materials in fish. We have also worked with a major defense contractor on a pilot application that deals with battlefield monitoring in a hostile environment. Lastly, we have used Aurora to build Medusa, a distributed version of Aurora that is intended to be used by multiple enterprises that operate in different administrative domains. Medusa uses an innovative agoric model to deal with cross-system resource allocation and is described in more detail in [5].

We start with a short review of the Aurora design in Sect. 2. Following this, we discuss the five case studies mentioned above in detail in Sect. 3 so the reader can understand the context for the retrospection that follows.

Fig. 1. Aurora graphical user interface

In Sect. 4, we present the lessons we have learned on the design of SPEs. These include the necessity of supporting stored tables, the requirement of synchronization primitives to maintain consistency of stored data in a streaming environment, the need for supporting primitives for late-arriving or missing messages, the requirement for a myriad of adaptors for other feed formats, and the need for globally accessible catalogs and a programming notation to specify Aurora networks (in addition to the “boxes and arrows” GUI). Since stream-processing applications are usually time critical, we also discuss the importance of lightweight scheduling and quantify the performance of the current Aurora prototype using a microbenchmark on basic stream operators. Aurora performance on the Linear Road benchmark is documented elsewhere [4].

The current Aurora prototype is being transferred to the commercial domain, with venture capital backing. As such, the academic project is hard at work on a complete redesign of Aurora, which we call Borealis. The intent of Borealis is to overcome some of the shortcomings of Aurora as well as make a major leap forward in several areas. Hence, in Sect. 5, we discuss the ideas we have for Borealis in several new areas, including mechanisms for dynamic modification of query specification and query results and a distributed optimization framework that operates across server and sensor networks.

2 Aurora architecture

Aurora is based on a dataflow-style “boxes and arrows” paradigm. Unlike other stream-processing systems that use SQL-style declarative query interfaces (e.g., STREAM [15]), this approach was chosen because it allows query activity to be interspersed with message processing (e.g., cleaning, correlation, etc.). Systems that only perform the query piece must ping-pong back and forth to an application for the rest of the work, thereby adding to system overhead and latency. An Aurora network can be spread across any number of machines to achieve high scalability and availability characteristics.

In Aurora, a developer uses the GUI to wire together a network of boxes and arcs that will process streams in a manner that produces the outputs necessary for his or her application. A screen shot of the GUI used to create Aurora networks is shown in Fig. 1. The black boxes indicate input and output streams that connect Aurora with the stream sources and applications, respectively. The other boxes are Aurora operators, and the arcs represent dataflow among the operators. Users can drag and drop operators from the palette on the left and connect them by simply drawing arrows between them. It should be noted that a developer can name a collection of boxes and replace it with a “superbox”. This “macrodefinition” mechanism drastically eases the development of big networks.

The Aurora operators are presented in detail in [3] and summarized in Fig. 2. Aurora’s operator choices were influenced by numerous systems. The basic operators Filter, Map, and Union are modeled after the Select, Project, and Union operations of the relational algebra. Join’s use of a distance metric to relate joinable elements on opposing streams is reminiscent of the relational band join [12]. Aggregate’s sliding-window semantics is a generalized version of the sliding-window constructs of SEQ [17] and SQL-99 (with generalizations including allowance for disorder (SLACK), timeouts, value-based windows, etc.). The ASSUME ORDER clause (used in Aggregate and Join), which defines a result in terms of an order that may or may not be manifested, is borrowed from AQuery [14].

Each input must obey a particular schema (a fixed number of fixed- or variable-length fields of the standard data types). Every output is similarly constrained. An Aurora network accepts inputs, performs message filtering, computation, aggregation, and correlation, and then delivers output messages to applications. Moreover, every output is optionally tagged with a Quality-of-Service (QoS) specification. This specification indicates how much latency the connected application can tolerate as well as what to do if adequate responsiveness cannot be assured under overload situations.

Fig. 2. Aurora operators

Note that the Aurora notion of QoS is different from the traditional QoS notion that typically implies hard performance guarantees, resource reservations, and strict admission control.

On various arcs in an Aurora network, the developer can note that Aurora should remember historical messages. The amount of history to be kept by such “connection points” can be specified by a time range or a message count. The historical storage is achieved by extending the basic message-queue management mechanism. New boxes can be added to an Aurora network at connection points at any time. History is replayed through the added boxes, and then conventional Aurora processing continues. This processing continues until the extra boxes are deleted.

The Aurora optimizer can rearrange a network by performing box swapping when it thinks the result will be favorable. Such box swapping cannot occur across a connection point; hence connection points are arcs that restrict the behavior of the optimizer as well as remember history.

When a developer is satisfied with an Aurora network, he or she can compile it into an intermediate form, which is stored in an embedded database. At run time this data structure is read into virtual memory and drives a real-time scheduler. The scheduler makes decisions based on the form of the network, the QoS specifications present, and the length of the various queues. When queues overflow the buffer pool in virtual memory, they are spooled to the embedded database. More detailed information on these various topics can be obtained from the referenced papers [3,7,8,18].
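To make the boxes-and-arrows paradigm concrete, the following toy Python sketch models a network as operators wired by arcs, with an optional latency tag standing in for a QoS specification on an output. It is purely illustrative: the class names, the push-based calling convention, and the qos_latency_ms parameter are inventions for this sketch, not Aurora's (C++) interfaces.

class Box:
    """A toy Aurora-style operator: apply fn to each tuple and forward results downstream."""
    def __init__(self, fn):
        self.fn = fn
        self.outputs = []
    def connect(self, downstream):
        self.outputs.append(downstream)
        return downstream
    def push(self, tup):
        for result in self.fn(tup):
            for out in self.outputs:
                out.push(result)

class Output(Box):
    """A toy output stream, optionally tagged with a latency target (stand-in for QoS)."""
    def __init__(self, qos_latency_ms=None):
        super().__init__(lambda t: [t])
        self.qos_latency_ms = qos_latency_ms
        self.received = []
    def push(self, tup):
        self.received.append(tup)

# A two-box network: a Filter-like box feeding a QoS-tagged output arc.
filter_box = Box(lambda t: [t] if t["price"] > 0 else [])
out = filter_box.connect(Output(qos_latency_ms=100))
filter_box.push({"ticker": "MSFT", "price": 27.5})
print(out.received)   # [{'ticker': 'MSFT', 'price': 27.5}]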

Fig. 3. Aurora query network for the alarm correlation application

3 Aurora case studies

In this section, we present five case studies of applications built using the Aurora engine and tools.

3.1 Financial services application

Financial service organizations purchase stock ticker feeds from multiple providers and need to switch in real time between these feeds if they experience too many problems. We worked with a major financial services company on developing an Aurora application that detects feed problems and triggers the switch in real time. In this section, we summarize the application (as specified by the financial services company) and its implementation in Aurora.

An unexpected delay in the reporting of new prices is an example of a feed problem. Each security has an expected reporting interval, and the application needs to raise an alarm if a reporting interval exceeds its expected value. Furthermore, if more than some number of alarms are recorded, a more serious alarm is raised that could indicate that it is time to switch feeds. The delay can be caused by the underlying exchange (e.g., NYSE, NASDAQ) or by the feed provider (e.g., Comstock, Reuters). If it is the former, switching to another provider will not help, so the application must be able to rapidly distinguish between these two cases.

Ticker information is provided as a real-time data feed from one or more providers, and a feed typically reports more than one exchange. As an example, let us assume that there are 500 securities within a feed that update at least once every 5 s and they are called “fast updates”. Let us also assume that there are 4000 securities that update at least once every 60 s and they are called “slow updates”.

If a ticker update is not seen within its update interval, the monitoring system should raise a low alarm. For example, if MSFT is expected to update within 5 s, and 5 s or more elapse since the last update, a low alarm is raised.

Since the source of the problem could be in the feed or the exchange, the monitoring application must count the number of low alarms found in each exchange and the number of low alarms found in each feed. If the number for each of these categories exceeds a threshold (100 in the following example), a high alarm is raised. The particular high alarm will indicate what action should be taken. When a high alarm is raised, the low alarm count is reset and the counting of low alarms begins again. In this way, the system produces a high alarm for every 100 low alarms of a particular type.

Furthermore, the posting of a high alarm is a serious condition, and low alarms are suppressed when the threshold is reached to avoid distracting the operator with a large number of low alarms.

Figure 3 presents our solution realized with an Aurora query network. We assume for simplicity that the securities within each feed are already separated into the 500 fast updating tickers and the 4000 slowly updating tickers. If this is not the case, then the separation can be easily achieved with a lookup. The query network in Fig. 3 actually represents six different queries (one for each output). Notice that much of the processing is shared.

The core of this application is in the detection of late tickers. Boxes 1, 2, 3, and 4 are all Aggregate boxes that perform the bulk of this computation. An Aggregate box groups input tuples by common value of one or more of their attributes, thus effectively creating a substream for each possible combination of these attribute values. In this case, the aggregates are grouping the input on common value of ticker symbol. For each grouping or substream, a window is defined that demarcates interesting runs of consecutive tuples called windows. For each of the tuples in one of these windows, some memory is allocated and an aggregating function (e.g., Average) is applied. In this example, the window is defined to be every consecutive pair (e.g., tuples 1 and 2, tuples 2 and 3, etc.) and the aggregating function generates one output tuple per window with a boolean flag called Alarm, which is a 1 when the second tuple in the pair is delayed (call this an Alarm tuple) and a 0 when it is on time.

Aurora’s operators have been designed to react to imperfections such as delayed tuples. Thus, the triggering of an Alarm tuple is accomplished directly using this built-in mechanism. The window defined on each pair of tuples will timeout if the second tuple does not arrive within the given threshold (5 s in this case). In other words, the operator will produce one alarm each time a new tuple fails to arrive within 5 s, as the corresponding window will automatically timeout and close. The high-level specification of Aggregate boxes 1 through 4 is:

Aggregate(Group by ticker,
          Order on arrival,
          Window (Size = 2 tuples,
                  Step = 1 tuple,
                  Timeout = 5 sec))
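To illustrate what these Aggregate boxes compute, here is a minimal Python sketch of the per-ticker check. It is a simplification of the operator semantics just described: the real Aurora window times out and emits the Alarm tuple without waiting for the late update, whereas this sketch only evaluates the deadline when the next update finally arrives. The tuple fields and the 5-s threshold are taken from the example.

THRESHOLD = 5.0          # expected reporting interval, in seconds
last_seen = {}           # ticker symbol -> timestamp of the previous update

def pair_window(update):
    """Group by ticker, slide a window of 2 tuples (step 1), and emit one
    output tuple per pair with Alarm = 1 when the second tuple is late."""
    ticker, now = update["ticker"], update["time"]
    out = None
    if ticker in last_seen:
        late = (now - last_seen[ticker]) > THRESHOLD
        out = {"ticker": ticker, "time": now, "alarm": 1 if late else 0}
    last_seen[ticker] = now
    return out

print(pair_window({"ticker": "MSFT", "time": 100.0}))   # None (first tuple of the pair)
print(pair_window({"ticker": "MSFT", "time": 107.0}))   # alarm = 1 (7 s > 5 s)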

Boxes 5 through 8 are Filters that eliminate the normal outputs, thereby letting only the Alarm tuples through. Box 9 is a Union operator that merges all Reuters alarms onto a single stream. Box 10 performs the same operation for Comstock. The rest of the network determines when a large number of Alarms is occurring and what the cause of the problem might be.

Boxes 11 and 15 count Reuters alarms and raise a high alarm when a threshold (100) is reached. Until that time, they simply pass through the normal (low) alarms. Boxes 14 and 18 do the same for Comstock. Note that the boxes labeled Count 100 are actually Map boxes. Map takes a user-defined function as a parameter and applies it to each input tuple. That is, for each tuple t in the input stream, a Map box parameterized by a function f produces the tuple f(t). In this example, Count 100 simply applies the following user-supplied function (written in pseudocode) to each tuple that passes through:

F(x: tuple) =
    cnt++
    if (cnt % 100 != 0)
        if !suppress
            emit lo-alarm
        else
            emit drop-alarm
    else
        emit hi-alarm
        set suppress = true

Boxes 12, 13, 16, and 17 separate the alarms from both Reuters and Comstock into alarms from NYSE and alarms from NASDAQ. This is achieved by using Filters to take NYSE alarms from both feed sources (Boxes 12 and 13) and merging them using a Union (Box 16). A similar path exists for NASDAQ alarms. The results of each of these streams are counted and filtered as explained above.

In summary, this example illustrates the ability to share computation among queries, the ability to extend functionality through user-defined Aggregate and Map functions, and the need to detect and exploit stream imperfections.

3.2 The Linear Road benchmark

Linear Road is a benchmark for stream-processing engines [2,4]. This benchmark simulates an urban highway system that uses “variable tolling” (also known as “congestion pricing”) [11,1,16], where tolls are determined according to such dynamic factors as congestion, accident proximity, and travel frequency. As a benchmark, Linear Road specifies input data schemas and workloads, a suite of continuous and historical queries that must be supported, and performance (query and transaction response time) requirements.

Variable tolling is becoming increasingly prevalent in urban settings because it is effective at reducing traffic congestion and because recent advances in microsensor technology make it feasible. Traffic congestion in major metropolitan areas is an increasing problem as expressways cannot be built fast enough to keep traffic flowing freely at peak periods. The idea behind variable tolling is to issue tolls that vary according to time-dependent factors such as congestion levels and accident proximity, with the motivation of charging higher tolls during peak traffic periods to discourage vehicles from using the roads and contributing to the congestion. Illinois, California, and Finland are among the highway systems that have pilot programs utilizing this concept.

The benchmark itself assumes a fictional metropolitan area (called “Linear City”) that consists of 10 expressways of 100 one-mile-long segments each and 1,000,000 vehicles that report their positions via GPS-based sensors every 30 s. Tolls must be issued on a per-segment basis automatically, based on statistics gathered over the previous 5 min concerning average speed and number of reporting cars. A segment’s tolls are overridden when accidents are detected in the vicinity (an accident is detected when multiple cars report close positions at the same time), and vehicles that use a particular expressway often are issued “frequent traveler” discounts.

The Linear Road benchmark demands support for five queries: two continuous and three historical. The first continuous query calculates and reports a segment toll every time a vehicle enters a segment. This toll must then be charged to the vehicle’s account when the vehicle exits that segment without exiting the expressway. Again, tolls are based on current congestion conditions on the segment, recent accidents in the vicinity, and frequency of use of the expressway for the given vehicle. The second continuous query involves detecting and reporting accidents and adjusting tolls accordingly. The historical queries involve requesting an account balance or a day’s total expenditure for a given vehicle on a given expressway and a prediction of travel time between two segments on the basis of average speeds on the segments recorded previously. Each of the queries must be answered with a specified accuracy and within a specified response time. The degree of success for this benchmark is measured in terms of the number of expressways the system can support, assuming 1000 position reports issued per second per expressway, while answering each of the five queries within the specified latency bounds.

An early Aurora implementation of this benchmark supporting one expressway was demonstrated at SIGMOD 2003 [2].

3.3 Battalion monitoring

We have worked closely with a major defense contractor on a battlefield monitoring application. In this application, an advanced aircraft gathers reconnaissance data and sends them to monitoring stations on the ground. These data include positions and images of friendly and enemy units. At some point, the enemy units cross a given demarcation line and move toward the friendly units, thereby signaling an attack.

Commanders in the ground stations monitor these data for analysis and tactical decision making. Each ground station is interested in particular subsets of the data, each with differing priorities. In the real application, the limiting resource is the bandwidth between the aircraft and the ground. When an attack is initiated, the priorities for the data classes change. More data become critical, and the bandwidth likely saturates. In this case, selective dropping of data is allowed in order to service the more important classes.

For our purposes, we built a simplified version of this application to test our load-shedding techniques. Instead of modeling bandwidth, we assume that the limited resource is the CPU. We introduce load shedding as a way to save cycles. Aurora supports two kinds of load shedding. The first technique inserts random drop boxes into the network. These boxes discard a fraction of their input tuples chosen randomly. The second technique inserts semantic, predicate-based drop filters into the network. Based on QoS functions, system statistics (like operator cost and selectivity), and input rates, our algorithms choose the best drop locations and the drop amount as indicated by a drop rate (random drop) or a predicate (semantic drop). Drop insertion plans are constructed and stored in a table in advance. As load levels change, drops are automatically inserted and removed from the query networks based on these plans [18].
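The per-tuple behavior of the two kinds of drop boxes can be sketched in a few lines of Python. The sketch shows only the boxes themselves; the QoS-driven planning that decides where to insert them and at what rate, described above, is not modeled, and the example predicate is an assumption for illustration.

import random

def random_drop(drop_rate):
    """A random drop box: discard a randomly chosen fraction of input tuples."""
    def box(tup):
        return None if random.random() < drop_rate else tup
    return box

def semantic_drop(keep_predicate):
    """A semantic drop box: discard tuples that the QoS functions deem low utility."""
    def box(tup):
        return tup if keep_predicate(tup) else None
    return box

# Example: shed 25% of tuples at random, or shed tuples for objects that have
# not crossed a demarcation line at x = 350 (a value borrowed from the
# battlefield scenario below, used here only for illustration).
shed_random = random_drop(0.25)
shed_semantic = semantic_drop(lambda t: t["x"] > 350)
print(shed_semantic({"id": 7, "x": 420}))   # kept
print(shed_semantic({"id": 8, "x": 120}))   # None (dropped)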

Fig. 4. Aurora query network for battlefield monitoring application

Fig. 5. Comparison of various load-shedding approaches (% excess load vs. % value utility loss)

One of the query networks that we used in this study is shown in Fig. 4. There are four queries in this network. The Analysis query merges all tuples about positions of all units for analysis and archiving. The next two queries, labeled Enemy Tanks and Enemy Aircraft, select enemy tank and enemy aircraft tuples using predicates on their IDs. The last query, Across The Line, selects all the objects that have crossed the demarcation line toward the friendly side.

Each query has a value-based QoS function attached to its output. A value-based QoS function maps the tuple values observed at an output to utility values that express the importance of a given result tuple. In this example, the functions are defined on the x-coordinate attribute of the output tuple, which indicates where an object is positioned horizontally. The functions take values in the range [0, 500], of which 350 corresponds to the position of the vertical demarcation line. Initially all friendly units are on the [0, 350] side of this line whereas enemy units are on the [350, 500] side. The QoS functions are specified by an application administrator and reflect the basic fact that tuples for enemy objects that have crossed the demarcation line are more important than others.

We ran this query network with tuples generated by the Aurora workload generator based on a battle scenario that we got from the defense contractor. We fed the input tuples at different rates to create specific levels of overload in the network; then we let the load-shedding algorithm remove the excess load by inserting drops into the network. Figure 5 shows the result. We compare the performance of three different load-shedding algorithms in terms of their value utility loss (i.e., average degradation in the QoS provided by the system) across all outputs at increasing levels of load.

We make the following important observations. First, our semantic load-shedding algorithm, which drops tuples based on attribute values, achieves the least value utility loss at all load levels. Second, our random load-shedding algorithm inserts drops of the same amounts at the same network locations as the semantic load shedder. Since tuples are dropped randomly, however, loss in value utility is higher compared to the semantic load shedder. As excess load increases, the performance of the two algorithms becomes similar. The reason is that at high load levels, our semantic load shedder also drops tuples from the high utility value ranges. Lastly, we compare both of our algorithms against a simple admission control algorithm, which sheds random tuples at the network inputs. Both our algorithms achieve lower utility loss compared to this algorithm. Our load-shedding algorithms may sometimes decide to insert drops on inner arcs of the query network. On networks with box sharing among queries (e.g., the union box is shared among all four queries, Fig. 4), inner arcs may be preferable to avoid utility loss at multiple query outputs. On the other hand, at very high load levels, since drops at inner arcs become insufficient to save the needed CPU cycles, our algorithms also insert drops close to the network inputs. Hence, all algorithms tend to converge to the same utility loss levels at very high loads.

3.4 Environmental monitoring

We have also worked with a military medical research laboratory on an application that involves monitoring toxins in the water. This application is fed streams of data indicating fish behavior (e.g., breathing rate) and water quality (e.g., temperature, pH, oxygenation, and conductivity). When the fish behave abnormally, an alarm is sounded.

Input data streams were supplied by the army laboratory as a text file. The single data file interleaved fish observations with water quality observations. The alarm message emitted by Aurora contains fields describing the fish behavior and two different water quality reports: the water quality at the time the alarm occurred and the water quality from the last time the fish behaved normally. The water quality reports contain not only the simple measurements but also the 1-/2-/4-hour sliding-window deltas for those values.

The application’s Aurora processing network is shown in Fig. 6 (snapshot taken from the Aurora GUI): The input port (1) shows where tuples enter Aurora from the outside data source. In this case, it is the application’s C++ program that reads in the sensor log file. A Union box (2) serves merely to split the stream into two identical streams. A Map box (3) eliminates all tuple fields except those related to water quality. Each superbox (4) calculates the sliding-window statistics for one of the water quality attributes.

Fig. 6. Aurora query network for the environmental contamination detection applications (GUI snapshot)

The parallel paths (5) form a binary join network that brings the results of (4)’s subnetworks back into a single stream. The top branch in (6) has all the tuples where the fish act oddly, and the bottom branch has the tuples where the fish act normally. For each of the tuples sent into (1) describing abnormal fish behavior, (6) emits an alarm message tuple. This output tuple has the sliding-window water quality statistics for both the moment the fish acted oddly and for the most recent previous moment that the fish acted normally. Finally, the output port (7) shows where result tuples are made available to the C++-based monitoring application.

Overall, the entire application ended up consisting of 3400 lines of C++ code (primarily for file parsing and a simple monitoring GUI) and a 53-operator Aurora query network.

During the development of the application, we observed that Aurora’s stream model proved very convenient for describing the required sliding-window calculations. For example, a single instance of the aggregate operator computed the 4-h sliding-window deltas of water temperature.

Aurora’s GUI for designing query networks also proved invaluable. As the query network grew large in the number of operators used, there was great potential for overwhelming complexity. The ability to manually place the operators and arcs on a workspace, however, permitted a visual representation of “subroutine” boundaries that let us comprehend the entire query network as we refined it.

We found that small changes in the operator language design would have greatly reduced our processing network complexity. For example, Aggregate boxes apply some window function [such as DELTA(water-pH)] to the tuples in a sliding window. Had an Aggregate box been capable of evaluating multiple functions at the same time on a single window [such as DELTA(water-pH) and DELTA(watertemp)], we could have used significantly fewer boxes. Many of these changes have since been made to Aurora’s operator language.

The ease with which the processing flow could be experimentally reconfigured during development, while remaining comprehensible, was surprising. It appears that this was only possible by having both a well-suited operator set and a GUI tool that let us visualize the processing. It seems likely that this application was developed at least as quickly in Aurora as it would have been with standard procedural programming.

We note that, for this particular application, real-time response was not required. The main value Aurora added in this case was the ease of developing stream-oriented applications.

3.5 Medusa: distributed stream processing

Medusa is a distributed stream-processing system built using Aurora as the single-site query-processing engine. Medusa takes Aurora queries and distributes them across multiple nodes. These nodes can all be under the control of one entity or be organized as a loosely coupled federation under the control of different autonomous participants.

A distributed stream-processing system such as Medusa offers several benefits:

1. It allows stream processing to be incrementally scaled over multiple nodes.
2. It enables high availability because the processing nodes can monitor and take over for each other when failures occur.
3. It allows the composition of stream feeds from different participants to produce end-to-end services and to take advantage of the distribution inherent in many stream-processing applications (e.g., climate monitoring, financial analysis, etc.).
4. It allows participants to cope with load spikes without individually having to maintain and administer the computing, network, and storage resources required for peak operation. When organized as a loosely coupled federated system, load movements between participants based on predefined contracts can significantly improve performance.

Table 1. Overview of a subset of the Aurora API


start and shutdown: Respectively starts processing and shuts down a complete query network.
modifyNetwork: At runtime, adds or removes schemas, streams, and operator boxes from a query network processed by a single Aurora engine.
typecheck: Validates (part of) a query network. Computes properties of intermediate and output streams.
enqueue and dequeue: Push and pull tuples on named streams.
listEntities and describe(Entity): Provide information on entities in the current query network.
getPerfStats: Provides performance and load information.

Fig. 7. Medusa software architecture

Figure 7 shows the software structure of a Medusa node. There are two components in addition to the Aurora query processor. The Lookup component is a client of an internode distributed catalog that holds information on streams, schemas, and queries running in the system. The Brain handles query setup operations and monitors local load using information about the queues (IOQueues) feeding Aurora and statistics on box load. The Brain uses this information as input to a bounded-price distributed load management mechanism that converges efficiently to good load allocations [5].

The development of Medusa prompted two important changes to the Aurora processing engine. First, it became apparent that it would be useful to offer Aurora not only as a stand-alone system but also as a library that could easily be integrated within a larger system. Second, we felt the need for an Aurora API, summarized in Table 1. This API is composed of three types of methods: (1) methods to set up queries and push or pull tuples from Aurora, (2) methods to modify query networks at runtime (operator additions and removals), and (3) methods giving access to performance information.

Load movement. To move operators with a relatively low effort and overhead compared to full-blown process migration, Medusa participants use remote definitions. A remote definition maps an operator defined at one node onto an operator defined at another node. At runtime, when a path of operators in the boxes-and-arrows diagram needs to be moved to another node, all that is required is for the corresponding operators to be instantiated remotely and for the incoming streams to be diverted to the appropriately named inputs on the new node. For some operators, the internal operator state may need to be moved when a task moves between machines, unless some “amnesia” is acceptable to the application. Our current prototype restarts operator processing after a move from a fresh state and the most recent position of the input streams. To support the movement of operator state, we are adding two new functions to the Aurora API and modifying the Aurora engine. The first method freezes a query network and removes an operator with its state by performing the following sequence of actions atomically: stop all processing, remove a box from a query network, extract the operator’s internal state, subscribe an outside client to what used to be the operator’s input streams, and restart processing. The second method performs the converse actions atomically. It stops processing, adds a box to a query network, initializes the box’s state, and restarts processing. To minimize the amount of state moved, we are exploring freezing operators around the windows of tuples on which they operate rather than at random instants. When Medusa moves an operator or a group of operators, it handles the forwarding of tuples to their new locations.

Medusa employs an agoric system model to create incentives for autonomous participants to handle each other’s load. Clients outside the system pay Medusa participants for processing their queries and Medusa participants pay each other to handle load. Payments and load movements are based on pairwise contracts negotiated offline between participants. These contracts set tightly bounded prices for migrating each unit of load and specify the set of tasks that each participant is willing to execute on behalf of its partner. Contracts can also be customized with availability, performance, and other clauses. Our mechanism, called the bounded-price mechanism, thus allows participants to manage their excess load through private and customized service agreements. The mechanism also achieves a low runtime overhead by bounding prices through offline negotiations.

Figure 8 shows the simulation results of a 995-node Medusa system running the bounded-price load management mechanism. Figure 8a shows that convergence from an unbalanced load assignment to an almost optimal distribution is fast with our approach. Figure 8b shows the excess load remaining at various nodes for increasing numbers of contracts. A minimum of just seven contracts per node in a network of 995 nodes ensures that all nodes operate within capacity when capacity exists in the system. The key advantages of our approach over previous distributed load management schemes are (1) lower runtime overhead, (2) possibility of service customization and price discrimination, and (3) relatively invariant prices that one participant pays another for processing a unit of load.

High availability. We are also currently exploring the runtime overhead and recovery time tradeoffs among different approaches to achieve high availability (HA) in distributed stream processing, in the context of Medusa and Aurora* [4]. These approaches range from classical Tandem-style process-pairs [6] to using upstream nodes in the processing flow as backup for their downstream neighbors. Different approaches also provide different recovery semantics where: (1) some tuples are lost, (2) some tuples are reprocessed, or (3) operations take over precisely where the failure happened. We discuss these algorithms in more detail in [13]. An important HA goal for the future is handling network partitions in addition to individual node failures.

Fig. 8a,b. Performance of Medusa load management protocol. a Convergence speed with a minimum of 7 contracts/node. b Final allocation for increasing number of contracts

4 Lessons learned

4.1 Support for historical data

From our work on a variety of streaming applications, it became apparent that each application required maintaining and accessing a collection of historical data. For example, the Linear Road benchmark, which represents a realistic application, required maintaining 10 weeks of toll history for each driver, as well as the current positions of every vehicle and the locations of accidents tying up traffic. Historical data might be used to support historical queries (e.g., tell me how much driver X has spent on tolls on expressway Y over the past 10 weeks) or serve as inputs to hybrid queries involving both streaming and historical data [e.g., report the current toll for vehicle X based on its current position (streamed data) and the presence of any accidents in its vicinity (historical data)].

In the applications we have looked at, historical data take three different forms. These forms differ by their update patterns – the means by which incoming stream data are used to update the contents of a historical collection. These forms are summarized below.

1. Open windows (connection points): Linear Road requires maintaining the last 10 weeks’ worth of toll data for each driver to support both historical queries and integrated queries. This form of historical data resembles a window in its FIFO-based update pattern but must be shared by multiple queries and therefore be openly accessible.

2. Aggregate summaries (latches): Linear Road requires maintaining such aggregated historical data as: the current toll balance for every vehicle (SUM(Toll)), the last reported position of every vehicle (MAX(Time)), and the average speed on a given segment over the past 5 min (AVG(Speed)). In all cases, the update patterns involve maintaining data by key value (e.g., vehicle or segment ID) and using incoming tuples to update the aggregate value that has the appropriate key. As with open windows, aggregate summaries must be shared by multiple queries and therefore must be openly accessible.

3. Tables: Linear Road requires maintaining tables of historical data whose update patterns are arbitrary and determined by the values of streaming data. For example, a table must be maintained that holds every accident that has yet to be cleared (such that an accident is detected when multiple vehicles report the same position at the same time). This table is used to determine tolls for segments in the vicinity of the accident and to alert drivers approaching the scene of the accident. The update pattern for this table resembles neither an open window nor an aggregate summary. Rather, accidents must be deleted from the table when an incoming tuple reports that the accident has been cleared. This requires the declaration of an arbitrary update pattern.

Whereas open windows and aggregate summaries have fixed update patterns, tables require update patterns to be explicitly specified. Therefore, the Aurora query algebra (SQuAl) includes an Update box that permits an update pattern to be specified in SQL. This box has the form

UPDATE (Assume O, SQL U, Report t)

such that U is an SQL update issued with every incoming tuple and includes variables that get instantiated with the values contained in that tuple. O specifies the assumed ordering of input tuples, and t specifies a tuple to output whenever an update takes place. Further, because all three forms of historical collections require random access, SQuAl also includes a Read box that initiates a query over stored data (also specified in SQL) and returns the result as a stream. This box has the form

READ (Assume O, SQL Q)

such that Q is an SQL query issued with every incoming tuple and includes variables that get instantiated with the values contained in that tuple.

In short, the streaming applications we have looked at share the need for maintaining and randomly accessing collections of historical data. These collections, used for both historical and hybrid queries, are of three forms differing by their update patterns. To support historical data in Aurora, we include an update operation (to update tables with user-specified update patterns) and a read operation (to read any of the forms of historical data).
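The following Python sketch mimics what an Update box and a Read box do per incoming tuple, using SQLite as a stand-in for the stored table. The table name, column names, and parameter binding are illustrative assumptions, not SQuAl syntax; the ordering specification (Assume O) and the Report t output are omitted.

import sqlite3

# Per-tuple Update/Read behavior: the SQL carries placeholders that are
# instantiated from fields of each incoming tuple.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accident (segment INTEGER, time INTEGER)")

def update_box(tup):
    # e.g., record a newly detected accident for this segment
    db.execute("INSERT INTO accident VALUES (?, ?)",
               (tup["segment"], tup["time"]))

def clear_box(tup):
    # arbitrary update pattern: delete the accident once it is cleared
    db.execute("DELETE FROM accident WHERE segment = ?", (tup["segment"],))

def read_box(tup):
    # issue a query per tuple and stream the result rows back out
    rows = db.execute("SELECT * FROM accident WHERE segment = ?",
                      (tup["segment"],)).fetchall()
    for row in rows:
        yield row

update_box({"segment": 17, "time": 300})
print(list(read_box({"segment": 17})))   # [(17, 300)]
clear_box({"segment": 17})
print(list(read_box({"segment": 17})))   # []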

4.2 Synchronization

As continuous queries, stream applications inherently rely on shared data and computation. Shared data might be contained in a table that one query updates and another query reads. For example, the Linear Road application requires that vehicle position data be used to update statistics on highway usage, which in turn are read to determine tolls for each segment on the highway. Alternatively, box output can be shared by multiple queries to exploit common subexpressions or even by a single query as a way of merging intermediate computations after parallelization.

Transactions are required in traditional databases because data sharing can lead to data inconsistencies. An equivalent synchronization mechanism is required in streaming settings, as data sharing in this setting can also lead to inconsistencies. For example, if a toll charge can expire, then a toll assessment to a given vehicle should be delayed until a new toll charge is determined. The need for synchronization with data sharing is achieved in SQuAl via the WaitFor box whose syntax is shown below:

WaitFor (P: Predicate, T: Timeout).

This binary operator buffers each tuple t on one input stream until a tuple arrives on the second input stream that, together with t, satisfies P (or until the timeout expires, in which case t is discarded). If a Read operation must follow a given Update operation, then a WaitFor can buffer the Read request (tuple) until a tuple output by the Update box (and input to the second input of WaitFor) indicates that the Read operation can proceed.

In short, the inherent sharing possible in streaming environments makes it sometimes necessary to synchronize operations to ensure data consistency. We currently implement synchronization in SQuAl with a dedicated operator.

4.3 Resilience to unpredictable stream behavior

Streams are by their nature unpredictable. Monitoring applications require the system to continue operation even when the unpredictable happens. Sometimes, the only way to do this is to produce approximate answers. Obviously, in these cases, the system should try to minimize errors.

We have seen examples of streams that do not behave as expected. The financial services application that we described earlier requires the ability to detect a problem in the arrival rate of a stream. The military application must fundamentally adjust its processing to fit the available resources during times of stress. In both of these cases, Aurora primitives for unpredictable stream behavior were brought to bear on the problem.

Aurora makes no assumptions that a data stream arrives in any particular order or with any temporal regularity. Tuples can be late or out of order due to the nature of the data sources, the network that carries the streams, or the behavior of the operators themselves. Accordingly, our operator set includes user-specified parameters that allow handling such “damaged” streams gracefully.

For many of the operators, an input stream can be specified to obey an expected order. If out-of-order data are known to the network designer not to be of relevance, the operator will simply drop such data tuples immediately. Nonetheless, Aurora understands that this may at times be too drastic a constraint and provides an optional slack parameter to allow for some tolerance in the number of data tuples that may arrive out of order. A tuple that arrives out of order within the slack bounds will be processed as if it had arrived in order.

With respect to possible irregularity in the arrival rate of data streams, the Aurora operator set offers all windowed operators an optional timeout parameter. The timeout parameter tells the operator how long to wait for the next data tuple to arrive. This has two benefits: it prevents blocking (i.e., no output) when one stream is stalled, and it offers another way for the network designer to characterize the value of data that arrive later than they should, as in the financial services application in which the timeout parameter was used to determine when a particular data packet arrived late.

4.4 XML and other feed formats adaptor required

Aurora provides a network protocol that may be used to enqueue and dequeue tuples via Unix or TCP sockets. The protocol is intentionally very low-level: to eliminate copies and improve throughput, the tuple format is closely tied to Aurora’s internal queue format. For instance, the protocol requires that each packet contain a fixed amount of padding reserved for bookkeeping and that integer and floating-point fields in the packet match the architecture’s native format.

While we anticipate that performance-critical applications will use our low-level protocol, we also recognize that the formats of Aurora’s input streams may be outside the immediate control of the Aurora user or administrator, for example, stock quote data arriving in XML format from a third-party information source. Also, even if the streams are being generated or consumed by an application within an organization’s control, in some cases protocol stability and portability (e.g., not requiring the client to be aware of the endian-ness of the server architecture) are important enough to justify a minor performance loss.

One approach to addressing these concerns is to simply require the user to build a proxy application that accepts tuples in the appropriate format, converts them to Aurora’s internal format, and pipes them into the Aurora process. This approach, while simple, conflicts with one of Aurora’s key design goals – to minimize the number of boundary crossings in the system – since the proxy application would be external to Aurora and hence live in its own address space.

We resolve this problem by allowing the user to provide plug-ins called converter boxes. Converter boxes are shared libraries that are dynamically linked into the Aurora process space; hence their use incurs no boundary crossings. A user-defined input converter box provides a hook that is invoked when data arrive over the network. The implementation may examine the data and inject tuples into the appropriate streams in the Aurora network. This may be as simple as consuming fixed-length packets and enforcing the correct byte order on fields or as complex as transforming fully formed XML documents into tuples. An output converter box performs the inverse function: it accepts tuples from streams in Aurora’s internal format and converts them into a byte stream to be consumed by an external application.

Input and output converter boxes are powerful connectivity Second to the message bus, the scheduler is the core el-
mechanisms: they provide a high level of flexibility in dealing ement of an SPE. The scheduler is responsible for allocat-
with external feeds and sinks without incurring a performance hit. This combination of flexibility and high performance is essential in a streaming database that must assimilate data from a wide variety of sources.

4.5 Programmatic interfaces and globally accessible catalogs are a good idea

Initially, Aurora networks were created using the GUI, and all Aurora metadata (i.e., catalogs) were stored in an internal representation. Our experience with the Medusa system quickly made us realize that, in order for Aurora to be easily integrated within a larger system, a higher-level, programmatic interface was needed to script Aurora networks, and metadata needed to be globally accessible and updatable.

Although we initially assumed that only Aurora itself (i.e., the runtime and the GUI) would need direct access to the catalog representation, we encountered several situations where this assumption did not hold. For instance, in order to manage distribution operations across multiple Aurora nodes, Medusa required knowledge of the contents of node catalogs and the ability to selectively move parts of catalogs from node to node. Medusa needed to be able to create catalog objects (schema, streams, and boxes) without direct access to the Aurora catalog database, which would have violated abstraction. In other words, relying on the Aurora runtime and GUI as the sole software components able to examine and modify catalog structures turned out to be an unworkable solution when we tried to build sophisticated applications on the Aurora platform. We concluded that we needed a simple and transparent catalog representation that is easily readable and writable by external applications. This would make it much easier to write higher-level systems that use Aurora (such as Medusa) and alternative authoring tools for catalogs.

To this end, Aurora currently incorporates appropriate interfaces and mechanisms (Sect. 3.5) to make it easy to develop external applications to inspect and modify Aurora query networks. A universally readable and writable catalog representation is crucial in an environment where multiple applications may operate on Aurora catalogs.

4.6 Performance critical

During the development of Aurora, our primary tool for keeping performance in mind was a series of “microbenchmarks”. Each of these benchmarks measured the performance of a small part of our system, such as a single operator, or the raw performance of the message bus. These benchmarks allowed us to measure the merits of changes to our implementation quickly and easily.

Fundamental to an SPE is a high-performance “message bus”. This is the system that moves tuples from one operator to the next, storing them temporarily, as well as into and out of the query network. Since every tuple is passed on the bus a number of times, this is definitely a performance bottleneck. Even such trivial optimizations as choosing the right memcpy() implementation gave substantial improvements to the whole system.

The scheduler is responsible for allocating processor time to operators. It is tempting to decorate the scheduler with all sorts of high-level optimizations such as intelligent allocation of processor time or real-time profiling of query plans. But it is important to remember that scheduler overhead can be substantial in networks where there are many operators and that the scheduler makes no contribution to the actual processing. Any addition of scheduler functionality must be greeted with skepticism and should be aggressively profiled.

Once the core of the engine has been aggressively optimized, the remaining hot spots for performance are to be found in the implementation of the operators. In our implementation, each operator has a “tight loop” that processes batches of input tuples. This loop is a prime target for optimization. We make sure nothing other than necessary processing occurs in the loop. In particular, housekeeping of data structures such as memory allocation and deallocation needs to be done outside of this loop so that its cost can be amortized across many tuples.

Data structures are another opportunity for operator optimization. Many of our operators are stateful; they retain information or even copies of previous input. Because these operators are asked to process and store large numbers of tuples, the efficiency of these data structures is important. Ideally, processing of each input tuple is accomplished in constant time. In our experience, processing that is linear in the amount of state stored is unacceptable.

In addition to the operators themselves, any parts of the system that are used by those operators in the tight loops must be carefully examined. For example, we have a small language used to specify expressions for Map operators. Because these expressions are evaluated in such tight loops, optimizing them was important. The addition of an expensive compilation step may even be appropriate.

To assess the relative performance of various parts of the Aurora system, we developed a simple series of microbenchmarks. Each microbenchmark follows this pattern:

1. Initialize Aurora using a query network q.
2. Create d dequeuers receiving data from the output of the query network. (If d is 0, then there are no dequeuers, i.e., tuples are discarded as soon as they are output.)
3. Begin a timer.
4. Enqueue n tuples into the network in batches of b tuples at a time. Each tuple is 64 bytes long.
5. Wait until the network is drained, i.e., every box is done processing every input tuple and every dequeuer has received every output tuple. Stop the timer. Let t be the amount of time required to process each input tuple, i.e., the total amount of time passed divided by n.

For the purposes of this benchmark, we fixed n at 2,000,000 tuples. We used several different catalogs. Note that these networks are functionally identical: every input tuple is output to the dequeuers, and the only difference is the type and amount of processing done to each tuple. This is necessary to isolate the impact of each stage of tuple processing; if some networks returned a different number of tuples, any performance differential might be attributed simply to there being less or more work to do because of the different number of tuples to enqueue or dequeue.
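As a rough illustration of this measurement pattern, the sketch below times a run of n tuples enqueued in batches of b and divides the elapsed time by n. It is a minimal sketch, not Aurora's actual harness: the QueryNetwork type, its enqueue_batch and drain methods, and the pass-through behavior are assumptions standing in for the real message-bus interfaces.

```cpp
// Minimal sketch of the microbenchmark pattern described above.
// QueryNetwork and its methods are hypothetical stand-ins for Aurora's
// real interfaces; only the timing arithmetic mirrors the text.
#include <chrono>
#include <cstdio>
#include <ratio>
#include <vector>

struct Tuple { char payload[64]; };               // each tuple is 64 bytes

struct QueryNetwork {                             // stand-in for an Aurora query network
    long delivered = 0;
    void enqueue_batch(const std::vector<Tuple>& batch) {
        delivered += static_cast<long>(batch.size());  // NULL catalog: tuples pass straight through
    }
    void drain() const {}                         // wait until every box has finished (no-op here)
};

int main() {
    const long n = 2000000;                       // total tuples, as in the text
    const long b = 1000;                          // batch size
    QueryNetwork net;
    std::vector<Tuple> batch(b);

    auto start = std::chrono::steady_clock::now();
    for (long sent = 0; sent < n; sent += b)      // step 4: enqueue in batches of b tuples
        net.enqueue_batch(batch);
    net.drain();                                  // step 5: wait for the network to drain
    auto stop = std::chrono::steady_clock::now();

    double t = std::chrono::duration<double, std::nano>(stop - start).count() / n;
    std::printf("average latency per input tuple: %.0f ns\n", t);
}
```

Dequeuers (the d parameter) are omitted here for brevity; varying the catalog and the batch size in such a skeleton is what produces the differences reported in Table 2.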
• NULL: A catalog with no boxes, i.e., input values are passed directly to dequeuers.
• FILTER: A catalog with a filter box whose condition is true for each tuple.
• UNION: A union box that combines the input stream with an empty stream.
• UNION-CHAIN: A chain of five union boxes, each of which combines the input stream with an empty stream.

Table 2 shows the performance of the benchmark with various settings of q, d, and b.

Table 2. Microbenchmark results

     Query (q)      # Dequeuers (d)   Batch size (b)   Average latency
A    NULL           0                 1                1211 ns
B    NULL           0                 10               176 ns
C    NULL           0                 100              70 ns
D    NULL           0                 1000             60 ns
E    NULL           1                 10               321 ns
F    NULL           1                 100              204 ns
G    NULL           1                 1000             191 ns
H    NULL           5                 1000             764 ns
I    NULL           10                1000             1748 ns
J    FILTER         1                 1000             484 ns
K    UNION          1                 1000             322 ns
L    UNION-CHAIN    1                 1000             858 ns

We observe that the overhead to enqueue a tuple in Aurora is highly dependent on the batch size but for large batch sizes settles to 60 ns. Dequeuers add a somewhat higher overhead (between 130 ns (G–D) and 200 ns ((I–H)/5) each) because currently one copy of each tuple is made per dequeuer. Comparing cases G and K, or cases G and L, we see that adding a box on a tuple path incurs a delay of approximately 130 ns per tuple; evaluating a simple comparison predicate on a tuple adds about 160 ns (J–K).

These microbenchmarks measure the overhead involved in passing tuples into and out of Aurora boxes and networks; they do not measure the time spent in boxes performing nontrivial operations such as joining and aggregation. Message-passing overhead, however, can be a significant time sink in streaming databases (as it was in earlier versions of Aurora). Microbenchmarking was very useful in eliminating performance bottlenecks in Aurora's message-passing infrastructure. This infrastructure is now fast enough in Aurora that nontrivial box operations are the only noticeable bottleneck, i.e., CPU time is overwhelmingly devoted to useful work and not simply to shuffling around tuples.

5 Future plans: Borealis

The Aurora team has secured venture capital backing to commercialize the current code line. Some of the group is moving on to pursue this venture. Because of this, there is no reason for the Aurora team to improve the current system. This section presents the initial ideas that we plan to explore in a follow-on system, called Borealis, which is a distributed stream-processing system. Borealis inherits core stream-processing functionality from Aurora and distribution functionality from Medusa. Borealis modifies and extends both systems in nontrivial and critical ways to provide advanced capabilities that are commonly required by newly emerging stream-processing applications.

The Borealis design is driven by our experience in using Aurora and Medusa, in developing several streaming applications (including the Linear Road benchmark), and by several commercial opportunities. Borealis will address the following requirements of newly emerging streaming applications.

5.1 Dynamic revision of query results

In many real-world streams, corrections or updates to previously processed data are available only after the fact. For instance, many popular data streams, such as the Reuters stock market feed, often include messages that allow the feed originator to correct errors in previously reported data. Furthermore, stream sources (such as sensors), as well as their connectivity, can be highly volatile and unpredictable. As a result, data may arrive late and miss their processing window or be ignored temporarily due to an overload situation. In all these cases, applications are forced to live with imperfect results, unless the system has means to correct its processing and results to take into account newly available data or updates.

The Borealis data model will extend that of Aurora by supporting such corrections by way of revision records. The goal is to process revisions intelligently, correcting query results that have already been emitted in a manner that is consistent with the corrected data. Processing of a revision message must replay a portion of the past with a new or modified value. Thus, to process revision messages correctly, we must make a query diagram “replayable”. In theory, we could process each revision message by replaying processing from the point of the revision to the present. In most cases, however, revisions on the input affect only a limited subset of output tuples, and regenerating unaffected output is wasteful and unnecessary. To minimize runtime overhead and message proliferation, we assume a closed model for replay that generates revision messages when processing revision messages. In other words, our model processes and generates “deltas” showing only the effects of revisions rather than regenerating the entire result. The primary challenge here is to develop efficient revision-processing techniques that can work with bounded history.
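To make the closed “revisions in, revisions out” model concrete, the sketch below shows a per-window SUM operator that keeps a bounded history of the values it has already emitted; when an input revision arrives, it patches only the affected window and emits a delta for its earlier output. This is a hypothetical illustration under assumed types (Revision, OutputRevision, the history policy), not Borealis code.

```cpp
// Hypothetical sketch of revision processing for a per-window SUM aggregate.
// Not Borealis code: it only illustrates turning an input revision into a
// revision of previously emitted output, using a bounded history.
#include <cstddef>
#include <cstdio>
#include <deque>
#include <optional>
#include <utility>

struct Revision       { long window_id; double old_value, new_value; };  // input correction
struct OutputRevision { long window_id; double old_sum,  new_sum;  };    // emitted delta

class RevisableSum {
    std::deque<std::pair<long, double>> history_;   // (window_id, emitted sum), bounded
    const std::size_t max_history_;
public:
    explicit RevisableSum(std::size_t max_history) : max_history_(max_history) {}

    // Normal processing: close a window and remember what was emitted.
    void emit_window(long window_id, double sum) {
        history_.emplace_back(window_id, sum);
        if (history_.size() > max_history_) history_.pop_front();  // bound the replayable past
    }

    // Revision processing: patch only the affected window and emit a delta,
    // instead of replaying the whole stream from the revised point onward.
    std::optional<OutputRevision> process(const Revision& r) {
        for (auto& [id, sum] : history_) {
            if (id == r.window_id) {
                OutputRevision out{id, sum, sum - r.old_value + r.new_value};
                sum = out.new_sum;               // keep history consistent for later revisions
                return out;
            }
        }
        return std::nullopt;                     // window fell outside the bounded history
    }
};

int main() {
    RevisableSum op(/*max_history=*/100);
    op.emit_window(1, 10.0);
    op.emit_window(2, 7.5);
    if (auto rev = op.process({2, 3.0, 4.5}))    // an input value 3.0 is corrected to 4.5
        std::printf("revise window %ld: %.1f -> %.1f\n",
                    rev->window_id, rev->old_sum, rev->new_sum);
}
```

How large to make the history bound, and what to do when a revision falls outside it, is exactly the bounded-history challenge identified above.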
5.2 Dynamic query modification

In many stream-processing applications, it is desirable to change certain attributes of the query at runtime. For example, in the financial services domain, traders typically wish to be alerted of interesting events, where the definition of “interesting” (i.e., the corresponding filter predicate) varies based on current context and results. In network monitoring, the system may want to obtain more precise results on a specific subnetwork if there are signs of a potential denial-of-service attack. Finally, in a military stream application that MITRE [19] explained to us, they wish to switch to a “cheaper” query when the system is overloaded. For the first two applications, it is sufficient to simply alter the operator parameters (e.g., window size, filter predicate), whereas the last one calls for altering the operators that compose the running query. Another motivating application comes again from the financial services community. Universally, people working on trading engines wish to test out new trading strategies as well as debug their applications on historical data before they go live. As such, they wish to perform “time travel” on input streams. Although this last example can be supported in most current SPE prototypes (i.e., by attaching the engine to previously stored data), a more user-friendly and efficient solution would obviously be desirable.

Two important features that will facilitate online modification of continuous queries in Borealis are control lines and time travel. Control lines extend Aurora's basic query model with the ability to change operator parameters as well as operators themselves on the fly. Control lines carry messages with revised box parameters and new box functions. For example, a control message to a Filter box can contain a reference to a boolean-valued function to replace its predicate. Similarly, a control message to an Aggregate box may contain a revised window size parameter. Additionally, each control message must indicate when the change in box semantics should take effect. Change is triggered when a monotonically increasing attribute received on the data line attains a certain value. Hence, control messages specify an <attribute, value> pair for this purpose (a sketch of this triggering rule appears after Sect. 5.3). For windowed operators like Aggregate, control messages must also contain a flag to indicate whether open windows at the time of change must be prematurely closed for a clean start.

Time travel allows multiple queries (different queries or versions of the same query) to be easily defined and executed concurrently, starting from different points in the past or “future” (typically by running a simulation of some sort). To support these capabilities, we leverage three advanced mechanisms in Borealis: enhanced connection points, connection point versions, and revision messages. To facilitate time travel, we define two new operations on connection points. The replay operation replays messages stored at a connection point from an arbitrary message in the past. The offset operation is used to set the connection point offset in time. When offset into the past, a connection point delays current messages before pushing them downstream. When offset into the future, the connection point predicts future data. When producing future data, various prediction algorithms can be used based on the application. A connection point version is a distinctly named logical copy of a connection point. Each named version can be manipulated independently. It is possible to shift a connection point version backward and forward in time without affecting other versions.

To replay history from a previous point in time t, we use revision messages. When a connection point receives a replay command, it first generates a set of revision messages that delete all the messages and revisions that have occurred since t. To avoid the overhead of transmitting one revision per deleted message, we use a macro message that summarizes all deletions. Once all messages are deleted, the connection point produces a series of revisions that insert the messages and possibly their following revisions back into the stream. During replay, all messages and revisions received by the connection point are buffered and processed only after the replay terminates, thus ensuring that simultaneous replays on any path in the query diagram are processed in sequence and do not conflict. When offset into the future, time-offset operators predict future values. As new data become available, these predictors can (but do not have to) produce more accurate revisions to their past predictions. Additionally, when a predictor receives revision messages, possibly due to time travel into the past, it can also revise its previous predictions.

5.3 Distributed optimization

Currently, commercial stream-processing applications are popular in industrial process control (e.g., monitoring oil refineries and cereal plants), financial services (e.g., feed processing, trading engine support and compliance), and network monitoring (e.g., intrusion detection, fraud detection). Here we see a server-heavy optimization problem – the key challenge is to process high-volume data streams on a collection of resource-rich “beefy” servers. Over the horizon, we see a very large number of applications of wireless sensor technology (e.g., RFID in retail applications, cell phone services). Here we see a sensor-heavy optimization problem – the key challenges revolve around extracting and processing sensor data from a network of resource-constrained “tiny” devices. Further over the horizon, we expect sensor networks to become faster and increase in processing power. In this case the optimization problem becomes more balanced, becoming sensor-heavy/server-heavy. To date, systems have exclusively focused on either a server-heavy environment or a sensor-heavy environment. Off into the future, there will be a need for a more flexible optimization structure that can deal with a very large number of devices and perform cross-network sensor-heavy/server-heavy resource management and optimization.

The purpose of the Borealis optimizer is threefold. First, it is intended to optimize processing across a combined sensor and server network. To the best of our knowledge, no previous work has studied such a cross-network optimization problem. Second, QoS is a metric that is important in stream-based applications, and optimization must deal with this issue. Third, scalability, both in size and in geographic extent, is becoming a significant design consideration with the proliferation of stream-based applications that deal with large volumes of data generated by multiple distributed sensor networks. As a result, Borealis faces a unique, multiresource, multimetric optimization challenge that is significantly different from the optimization problems explored in the past. Our current thinking is that Borealis will rely on a hierarchical, distributed optimizer that runs at different time granularities.

Another part of the Borealis vision involves addressing recovery and high-availability issues. High availability demands that node failures be masked by seamless handoff of processing to an alternate node. This is complicated by the fact that the optimizer will dynamically redistribute processing, making it more difficult to keep backup nodes synchronized. Furthermore, wide-area Borealis applications are not only vulnerable to node failures but also to network failures and, more importantly, to network partitions. We have preliminary research in this area that leverages Borealis mechanisms including connection point versions, revision tuples, and time travel.
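Returning to the control lines of Sect. 5.2, the following sketch shows one way a Filter box could apply a predicate change when a monotonically increasing attribute on its data line (here, a timestamp) reaches the trigger value carried by a control message. The Tuple layout, the ControlMessage fields, and the single-threaded processing loop are assumptions made for illustration; this is not Borealis code.

```cpp
// Hypothetical sketch of a control line: a Filter box swaps in a new
// predicate once a monotonically increasing attribute on the data line
// reaches the trigger value from the control message.
#include <cstdio>
#include <functional>
#include <optional>
#include <vector>

struct Tuple { long timestamp; double price; };

struct ControlMessage {
    std::function<bool(const Tuple&)> new_predicate;  // replacement predicate
    long trigger_value;                                // <attribute, value>: apply when timestamp >= value
};

class FilterBox {
    std::function<bool(const Tuple&)> predicate_;
    std::optional<ControlMessage> pending_;            // change waiting for its trigger
public:
    explicit FilterBox(std::function<bool(const Tuple&)> p) : predicate_(std::move(p)) {}

    void on_control(ControlMessage m) { pending_ = std::move(m); }

    // Returns true if the tuple passes the (possibly just-updated) predicate.
    bool on_data(const Tuple& t) {
        if (pending_ && t.timestamp >= pending_->trigger_value) {
            predicate_ = std::move(pending_->new_predicate);  // change takes effect here
            pending_.reset();
        }
        return predicate_(t);
    }
};

int main() {
    FilterBox box([](const Tuple& t) { return t.price > 10.0; });
    box.on_control({[](const Tuple& t) { return t.price > 20.0; }, /*trigger_value=*/100});

    std::vector<Tuple> stream = {{50, 15.0}, {99, 15.0}, {100, 15.0}, {120, 25.0}};
    for (const auto& t : stream)
        std::printf("ts=%ld price=%.1f -> %s\n", t.timestamp, t.price,
                    box.on_data(t) ? "pass" : "drop");
}
```

For an Aggregate box, the same pattern would carry a revised window size and the clean-start flag instead of a predicate.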
5.4 Implementation plans

We have started building Borealis. As Borealis inherits much of its core stream-processing functionality from Aurora, we can effectively borrow many of the Aurora modules, including the GUI, the XML representation for query diagrams, portions of the runtime system, and much of the logic for boxes. Similarly, we are borrowing some networking and distribution logic from Medusa. With this starting point, we hope to have a working prototype within a year.

Acknowledgements. This work was supported in part by the National Science Foundation under Grants IIS-0086057, IIS-0325525, IIS-0325703, IIS-0325838, and IIS-0205445 and by Army contract DAMD17-02-2-0048. We would like to thank all members of the Aurora and the Medusa projects at Brandeis University, Brown University, and MIT. We are also grateful to the anonymous reviewers for their invaluable comments.

References

1. A guide for hot lane development: U.S. Department of Transportation Federal Highway Administration. http://www.itsdocs.fhwa.dot.gov/JPODOCS/REPTS_TE/13668.html
2. Abadi D, Carney D, Çetintemel U, Cherniack M, Convey C, Erwin C, Galvez E, Hatoun M, Hwang J, Maskey A, Rasin A, Singer A, Stonebraker M, Tatbul N, Xing Y, Yan R, Zdonik S (2003) Aurora: A data stream management system (demo description). In: ACM SIGMOD
3. Abadi D, Carney D, Çetintemel U, Cherniack M, Convey C, Lee S, Stonebraker M, Tatbul N, Zdonik S (2003) Aurora: A new model and architecture for data stream management. VLDB J 12(2):120–139
4. Arasu A, Cherniack M, Galvez E, Maier D, Maskey A, Ryvkina E, Stonebraker M, Tibbetts R (2004) Linear Road: A benchmark for stream data management systems. In: VLDB conference, Toronto (in press)
5. Balazinska M, Balakrishnan H, Stonebraker M (2004) Contract-based load management in federated distributed systems. In: NSDI symposium
6. Bartlett J, Gray J, Horst B (1986) Fault tolerance in Tandem computer systems. Technical Report TR-86.2, Tandem Computers
7. Carney D, Çetintemel U, Cherniack M, Convey C, Lee S, Seidman G, Stonebraker M, Tatbul N, Zdonik S (2002) Monitoring streams – a new class of data management applications. In: VLDB conference, Hong Kong
8. Carney D, Çetintemel U, Rasin A, Zdonik S, Cherniack M, Stonebraker M (2003) Operator scheduling in a data stream manager. In: VLDB conference, Berlin, Germany
9. Chandrasekaran S, Deshpande A, Franklin M, Hellerstein J, Hong W, Krishnamurthy S, Madden S, Raman V, Reiss F, Shah M (2003) TelegraphCQ: Continuous dataflow processing for an uncertain world. In: CIDR conference
10. Cherniack M, Balakrishnan H, Balazinska M, Carney D, Çetintemel U, Xing Y, Zdonik S (2003) Scalable distributed stream processing. In: CIDR conference, Asilomar, CA
11. Congestion pricing: a report from intelligent transportation systems (ITS). http://www.path.berkeley.edu
12. DeWitt D, Naughton J, Schneider D (1991) An evaluation of non-equijoin algorithms. In: VLDB conference, Barcelona, Catalonia, Spain
13. Hwang J, Balazinska M, Rasin A, Çetintemel U, Stonebraker M, Zdonik S (2003) A comparison of stream-oriented high-availability algorithms. Technical Report CS-03-17, Department of Computer Science, Brown University, Providence, RI
14. Lerner A, Shasha D (2003) AQuery: Query language for ordered data, optimization techniques, and experiments. In: VLDB conference, Berlin, Germany
15. Motwani R, Widom J, Arasu A, Babcock B, Babu S, Datar M, Manku G, Olston C, Rosenstein J, Varma R (2003) Query processing, approximation, and resource management in a data stream management system. In: CIDR conference
16. Poole RW (2002) Hot lanes prompted by federal program. http://www.rppi.org/federalhotlanes.html
17. Seshadri P, Livny M, Ramakrishnan R (1995) SEQ: A model for sequence databases. In: IEEE ICDE conference, Taipei, Taiwan
18. Tatbul N, Çetintemel U, Zdonik S, Cherniack M, Stonebraker M (2003) Load shedding in a data stream manager. In: VLDB conference, Berlin, Germany
19. The MITRE Corporation. http://www.mitre.org/
20. US Army Medical Research and Materiel Command. https://mrmc-www.army.mil/
Sources

Chapter 1: Data Models and DBMS Architecture

Stonebraker, Michael and Joseph M. Hellerstein, “What Goes Around Comes Around.” Not previously
published.
Hellerstein, Joseph M. and Michael Stonebraker, “Anatomy of a Database System.” Not previously pub-
lished.

Chapter 2: Query Processing

Selinger, P. G., M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price, “Access Path Selection
in a Relational Database Management System.” In Proceedings of the 1979 ACM SIGMOD International
Conference on Management of Data. New York: ACM Press, 1979.
Shapiro, Leonard. “Join Processing in Database Systems with Large Main Memories.” ACM
Transactions on Database Systems 11:3 (1986).
DeWitt, David J., and Jim Gray. “Parallel Database Systems: The Future of High Performance Database
Systems.” Communications of the ACM 35:6 (1992): 85–98.
Graefe, Goetz. “Encapsulation of Parallelism in the Volcano Query Processing System.” In Proceedings
of the 1990 ACM SIGMOD International Conference on Management of Data. New York: ACM Press,
1990.
Nyberg, C., T. Barclay, Z. Cvetanovic, J. Gray, and D. Lomet. “AlphaSort: A RISC Machine Sort.” In
Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data. New York:
ACM Press, 1994.
Mackert, Lothar F., and Guy M. Lohman. “R* Optimizer Validation and Performance Evaluation for
Distributed Queries.” In Proceedings of the Twelfth International Conference on Very Large Data Bases,
edited by Wesley W. Chu et al. San Francisco: Morgan Kaufmann, 1986.
Stonebraker, Michael, Paul M. Aoki, Witold Litwin, Avi Pfeffer, Adam Sah, Jeff Sidell, Carl Staelin, and
Andrew Yu. “Mariposa: A Wide-Area Distributed Database System.” The Very Large Data Base Journal
5 (1996): 48-63.

Chapter 3: Data Storage and Access Methods

Beckmann, Norbert, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. “The R*-tree: An Efficient
and Robust Access Method for Points and Rectangles.” In Proceedings of the 1990 ACM SIGMOD
International Conference on Management of Data. New York: ACM Press, 1990.
Stonebraker, Michael. “Operating System Support for Database Management.” Communications of the
ACM 24:7 (1981): 412–418.
Gray, Jim, and Goetz Graefe. “The Five-Minute Rule Ten Years Later, and Other Computer Storage
Rules of Thumb.” ACM SIGMOD Record 26:4 (1997): 63–68.
Patterson, David A., Garth A. Gibson, and Randy H. Katz. “A Case for Redundant Arrays of Inexpensive
Disks (RAID).” In Proceedings of the 1988 ACM SIGMOD International Conference on Management of
Data. New York: ACM Press, 1988.

Chapter 4: Transaction Management

Gray, Jim, Raymond A. Lorie, Gianfranco R. Putzolu, and Irving L. Traiger. “Granularity of Locks and
Degrees of Consistency in a Shared Data Base.” In Modelling in Data Base Management Systems:
Proceedings of the IFIP Working Conference on Modelling in Data Base Management Systems, edited by
G.M. Nijssen. North-Holland, 1976.
Kung, H. T., and John T. Robinson. “On Optimistic Methods for Concurrency Control.” ACM
Transactions on Database Systems 6:2 (1981): 213–226.
Agrawal, Rakesh, Michael J. Carey, and Miron Livny. “Concurrency Control Performance Modeling:
Alternatives and Implications.” ACM Transactions on Database Systems 12:4 (1987): 609–654.
Lehman, Philip L., and S. Bing Yao. “Efficient Locking for Concurrent Operations on B-trees.” ACM
Transactions on Database Systems 6:4 (1981): 650–670.
Mohan, C., Don Haderle, Bruce Lindsay, Hamid Pirahesh, and Peter Schwarz. “ARIES: A Transaction
Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead
Logging.” ACM Transactions on Database Systems 17:1 (1992): 94–162.
Mohan, C., Bruce Lindsay, and R. Obermarck. “Transaction Management in the R* Distributed
Database Management System.” ACM Transactions on Database Systems 11:4 (1986): 378–396.
Gray, Jim, Pat Helland, Patrick O’Neil, and Dennis Shasha. “The Dangers of Replication and a
Solution.” In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data.
New York: ACM Press, 1996.

Chapter 5: Extensibility

Stonebraker, Michael. “Inclusion Of New Types In Relational Data Base Systems.” In Proceedings of
the Second International Conference on Data Engineering. Washington, D.C.: IEEE Computer Society,
1986.
Hellerstein, Joseph M., Jeffrey F. Naughton, and Avi Pfeffer. “Generalized Search Trees for Database
Systems.” In Proceedings of the Twenty-First International Conference on Very Large Data Bases, edited
by Umeshwar Dayal. San Francisco: Morgan Kaufmann, 1995.
Lohman, Guy M. “Grammar-like Functional Rules for Representing Query Optimization Alternatives.”
In Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data. New
York: ACM Press, 1988.

Chapter 6: Database Evolution

Chaudhuri, Surajit, and Vivek Narasayya. “AutoAdmin ‘What-if’ Index Analysis Utility.” In Proceedings
of the 1998 ACM SIGMOD International Conference on Management of Data. New York: ACM Press,
1998.
Bernstein, Philip A. “Applying Model Management to Classical Meta Data Problems,” In Proceedings of
the First Biennial Conference on Innovative Data Systems Research. New York: ACM Press, 2003.
Mohan, C., and Inderpal Narang. “Algorithms for Creating Indexes for Very Large Tables Without
Quiescing Updates.” In Proceedings of the 1992 ACM SIGMOD International Conference on
Management of Data. New York: ACM Press, 1992.

Chapter 7: Data Warehousing

Chaudhuri, Surajit, and Umeshwar Dayal. “An Overview of Data Warehousing and OLAP Technology.”
SIGMOD Record 26:1 (1997): 65–74.
O’Neil, Patrick, and Dallan Quass. “Improved Query Performance with Variant Indexes.” In Proceedings
of the 1997 ACM SIGMOD International Conference on Management of Data. New York: ACM Press,
1997.
Gray, Jim, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don Reichart, and Murali Venkatrao.
“DataCube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals.”
Data Mining and Knowledge Discovery 1 (1997): 29–53.
Zhao, Yihong, Prasad Deshpande, and Jeffrey F. Naughton. “An Array-Based Algorithm for Simultaneous
Multidimensional Aggregates.” In Proceedings of the 1997 ACM SIGMOD International Conference on
Management of Data. New York: ACM Press, 1997.
Ceri, Stefano, and Jennifer Widom. “Deriving Production Rules for Incremental View Maintenance.”
Proceedings of the Seventeenth International Conference on Very Large Data Bases, edited by Guy M.
Lohman et al. San Francisco: Morgan Kaufmann, 1991.
Hellerstein, Joseph M., Ron Avnur, and Vijayshankar Raman. “Informix under CONTROL: Online
Query Processing.” Data Mining and Knowledge Discovery 4 (2000): 281–314.
Kotidis, Yannis, and Nick Roussopoulos. “DynaMat: A Dynamic View Management System for Data
Warehouses.” In Proceedings of the 1999 ACM SIGMOD International Conference on Management of
Data. New York: ACM Press, 1999.

Chapter 8: Data Mining

Zhang, Tian, Raghu Ramakrishnan, and Miron Livny. “BIRCH: An Efficient Data Clustering Method for
Very Large Databases.” In Proceedings of the 1996 ACM SIGMOD International Conference on
Management of Data. New York: ACM Press, 1996.
Shafer, John, Rakesh Agrawal, and Manish Mehta. “SPRINT: A Scalable Parallel Classifier for Data
Mining.” In Proceedings of the Twenty-Second International Conference on Very Large Data Bases,
edited by T. M. Vijayaraman, et al. San Francisco: Morgan Kaufmann, 1996.
Agrawal, Rakesh and Ramakrishnan Srikant. “Fast Algorithms for Mining Association Rules.” In
Proceedings of the Twentieth International Conference on Very Large Data Bases, edited by Jorge B.
Bocca, et al. San Francisco: Morgan Kaufmann, 1994.
Chaudhuri, Surajit, Vivek Narasayya, and Sunita Sarawagi. “Efficient Evaluation of Queries with Mining
Predicates.” Proceedings of the Eighteenth International Conference on Data Engineering. Washington,
D.C.: IEEE Computer Society, 2002.
Chapter 9: Web Services and Data Bases

Brewer, Eric A. “Combining Systems and Databases: A Search Engine Retrospective.” Not previously
published.
Brin, Sergey, and Lawrence Page. “The Anatomy of a Large-Scale Hypertextual Web Search Engine.” In
Proceedings of the Seventh International World Wide Web Conference (WWW7), Computer Networks
30:1–7 (1998): 107–117.
Sizov, Sergej, Michael Biwer, Jens Graupmann, Stefan Siersdorfer, Martin Theobald, Gerhard Weikum,
and Patrick Zimmer. “The BINGO! System for Information Portal Generation and Expert Web Search.”
In Proceedings of the First Biennial Conference on Innovative Data Systems Research. New York: ACM
Press, 2003.
Jacobs, Dean. “Data Management in Application Servers.” Datenbank-Spektrum 8 (2004): 5–11.
Abiteboul, Serge. “Querying Semi-Structured Data.” In Proceedings of the Sixth International
Conference on Database Theory, edited by Foto N. Afrati, et al. Springer-Verlag, 1997.
Goldman, Roy, and Jennifer Widom. “DataGuides: Enabling Query Formulation and Optimization in
Semistructured Databases.” In Proceedings of the Twenty-Third International Conference on Very Large
Data Bases, edited by Matthias Jarke, et al. San Francisco: Morgan Kaufmann, 1997.
Chen, Jianjun, David DeWitt, Feng Tian, and Yuan Wang. “NiagaraCQ: A Scalable Continuous Query
System for Internet Databases.” In Proceedings of the 2000 ACM SIGMOD International Conference
on Management of Data. New York: ACM Press, 2000.

Chapter 10: Stream-Based Data Management

Hanson, Eric N., Chris Carnes, Lan Huang, Mohan Konyala, Lloyd Noronha, Sashi Parthasarathy, J. B.
Park, and Albert Vernon. “Scalable Trigger Processing.” In Proceedings of the Fifteenth International
Conference on Data Engineering. Washington, D.C.: IEEE Computer Society, 1999.
Seshadri, Praveen, Miron Livny, and Raghu Ramakrishnan. “The Design and Implementation of a
Sequence Database System.” In Proceedings of the Twenty-Second International Conference on Very
Large Data Bases, edited by T. M. Vijayaraman, et al. San Francisco: Morgan Kaufmann, 1996.
Avnur, Ron, and Joseph M. Hellerstein. “Eddies: Continuously Adaptive Query Processing.” In
Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. New York:
ACM Press, 2000.
Balakrishnan, Hari, Magdalena Balazinska, Don Carney, Uğur Çetintemel, Mitch Cherniack, Christian
Convey, Eddie Galvez, Jon Salz, Michael Stonebraker, Nesime Tatbul, Richard Tibbetts, and Stanley
Zdonik. “Retrospective on Aurora.” VLDB Journal (2004).
